Identifying and collecting data

Identifying data

Data is essential to your research for the following reasons:

  • Without appropriate data, it will be impossible to answer set research questions;
  • Data needs to be chosen for its capacity to answer the research questions; and
  • If you have more than one question, data needs to provide information relevant to each of your research questions.

Before starting on a project, it is important to make sure the data you will need is available, or can be collected. Existing or secondary data sources are widely available and commonly used for research. Large publicly funded organisations (Australian Bureau of Statistics, OECD, World Bank) tend to make their data available to the public. Another source of data is the Australian Data Archive (ADA), which makes digital research data available for secondary analysis by academic researchers.

If you are working with secondary data, it is best to choose data appropriate to the question, rather than modify the question to fit the available data. However, this may not always be possible and you may need to compromise.

Quantitative data collection

Quantitative data is generated using a method that produces numeric data, or data that can be translated into numbers, enabling quantitative analysis.

Methods include:

  • Interviews – constructed closed ended questionnaires
  • Standardised observation
  • Surveys
  • Rating scales, e.g. 5 point Likert scale
  • Experiment – test the impact of an intervention
  • Quasi experiment
  • Random assignment of subjects to treatment conditions

Note that you do not have to develop new data collection tools. You may want to use existing surveys or scales. The advantage of this is that they will have been tested in existing research, but this is not always possible. Each research project is unique, so you may need to develop your own tool, or adapt an existing one.

Note also that in some cases you may need several sets of data, for example:

  • An evaluation may require before and after data
  • If there is a control group you need a second set of data
  • You could also have a mix of primary and secondary data, for example to compare results from national surveys with results from your own survey
  • When doing exploratory data analysis (which aims at elucidating patterns and trends) followed by confirmatory data analysis (which tests the patterns found), you need to use different data sets (or split the data you have) for the two analyses, because if you have enough data, trends may show up just by coincidence.
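
Splitting one data set into an exploratory part and a confirmatory part can be sketched in Python as follows. This is only a minimal illustration: the list of responses, the 50/50 split and the fixed seed are assumptions made for the example, not a prescribed procedure.

```python
import random

def split_for_analysis(records, exploratory_share=0.5, seed=42):
    """Randomly split records into an exploratory and a confirmatory set."""
    rng = random.Random(seed)      # fixed seed so the split can be reproduced
    shuffled = list(records)       # copy, so the original order is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * exploratory_share)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data: 200 survey responses identified by number
responses = list(range(1, 201))
exploratory, confirmatory = split_for_analysis(responses)
print(len(exploratory), len(confirmatory))
```

Patterns would be looked for in the exploratory half only, and then tested on the confirmatory half, which played no part in finding them.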

Resources for accessing quantitative data

Qualitative data collection

Qualitative data can include words, texts, images and sounds and are generally collected in a natural setting. Multiple data sources can be used. You just have to be clear what data you use for what purpose and how they will complement each other.

Just as fishermen cast their lines into likely fishing holes, rather than randomly selecting places to fish, so qualitative researchers deliberately select participants for their studies (Richards and Morse, 2007).

Qualitative data collection methods

  • Interviews (individual or group interviews)
  • Focus groups
  • Stories
  • Observation
  • Documents
  • Text and images (photography, drawing) (existing or elicited)
  • Videos (existing or elicited)
  • Group work (e.g. world café, Delphi Technique)

Resources for interviewing

Gibbs, Graham R. How to do a research interview

Gill, P., Stewart, K., Treasure, E. & Chadwick, B. (2008). ‘Methods of data collection in qualitative research: interviews and focus groups’, British Dental Journal 204, 291-295. doi:10.1038/bdj.2008.192. A paper on interviews and focus groups

Using Interviews in Research: a handout on doing interviews.

University of Illinois, Key informant interviewing 

Working with digital data

MANTRA, a free online course for those who manage digital data as part of their research project.

Data sampling

Sampling is the key to good research. Your research entirely rests on the sample you have used. Any conclusions you make depend on the nature of the sample and how it was collected. Sampling decisions therefore need to be made carefully, be justified and explained in the methodology.

Read: Research knowledge data base – sampling.

Basic terms

  • Sample: the segment of population that is selected for investigation
  • Population (theoretical): the universe of units you want to generalise your research to
  • Population (practical): the universe of units that is accessible
  • Sampling frame: list of accessible units from which your sample is selected
  • Representative sample: a sample that reflects the population accurately – this is crucial for research that aims to generalise findings
  • Probability sample: sample selected using random selection
  • Non-probability sample: sample selected not using random selection
  • Sampling error: difference between sample and population
  • Sample bias: distortion in the representativeness of the sample
  • Saturation point: point when you have enough data – determined by when data stops telling you new things (a term mainly used in qualitative research).
  • Non-sampling error: findings of research affected by e.g. non response, poor questions.
  • Non-response: when members of sample are unable or refuse to take part
  • Census: data collected from an entire population
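
The idea of sampling error can be illustrated with a short Python sketch. The population of ages below is simulated purely for illustration; it is an assumption, not data from any study, and the sample size of 200 is arbitrary.

```python
import random
import statistics

rng = random.Random(1)  # arbitrary seed, only to make the example repeatable

# Hypothetical population: ages of 10,000 people (simulated, not real data)
population = [rng.randint(18, 90) for _ in range(10_000)]
population_mean = statistics.mean(population)

# Probability sample of 200 units drawn without replacement
sample = rng.sample(population, 200)
sample_mean = statistics.mean(sample)

# Sampling error: how far the sample estimate is from the population value
sampling_error = sample_mean - population_mean
print(f"population mean = {population_mean:.2f}")
print(f"sample mean     = {sample_mean:.2f}")
print(f"sampling error  = {sampling_error:+.2f}")
```

In real research the population value is usually unknown, which is why the sampling error has to be estimated rather than computed directly as it is here.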

Sampling and generalisation

Quantitative research tends to aim to generalise from a sample to a corresponding population.
Generalisation is often not sought in qualitative research. Qualitative research may in fact focus on worst, best or typical cases, rather than average cases and experiences. This makes generalisation impossible. Reference can be made to transferability rather than generalisability.

Unlike in quantitative research, in qualitative research sampling may happen at different stages and sampling decisions are not always made in advance.

Sample sizes

Ideal sample size depends on:

  • Project
  • Methodology
  • Availability

Quantitative research tends to work with large samples. In qualitative research ‘a small sample is not a limitation if the sample allows you to address the research questions, and if the assumptions underlying the sampling approach are not violated by the claims you make about the data you have gathered’ (Goodrick, 2011, p 21).

Sampling techniques

Sampling techniques can be divided into two main categories:

  • Non-probability sampling or non random sampling
  • Random sampling (probability sampling)

Non-probability sampling or non random sampling


  • Purposive sampling
  • Quota sampling
  • Snowball sampling
  • Convenience sampling

A good participant is someone who ‘has the knowledge and experience the researcher requires, has the ability to reflect, is articulate, has the time to be interviewed, and is willing to participate’ (Morse, 1994, p 228).

Purposive sampling (non random)

In purposive sampling participants/units/cases/documents are selected because they can best provide the information required to answer a question. The selection is therefore based on who (or what) you as a researcher believe will be the most useful in terms of providing information relevant to the question. Samples tend to be small subsets of a population.

Sampling methods can involve:

  • extreme or deviant sampling: selecting extreme cases (e.g. addicts in research about drug use)
  • intensity sampling: e.g. people with a lot of experience on something
  • maximum variety sampling: many different people/places
  • homogeneous sampling: focus on one characteristic group
  • typical case sampling
  • criterion sampling: all cases meet one criterion

Quota sampling (non random)

Quota sampling is sometimes considered a type of purposive sampling. Purposive and quota sampling are similar in that they both seek to identify participants based on specific criteria. Quota sampling is more specific with respect to sizes and proportions of subsamples, as subgroups are chosen to reflect corresponding proportions in a population. The total sample has to have the same distribution of characteristics assumed to exist in the population studied. For example, research on health in migrant groups may seek a balance of age and ethnic groups in a given place, possibly reflecting their ratios in the general population or in the local population.

Numbers and characteristics (quotas) are decided when designing the study. Characteristics are chosen to enable a sample to be proportionately representative of a population’s social categories. Characteristics might include age, place of residence, gender, class, profession, marital status, health history, criminal history and migration. Criteria have to enable focussing on people most likely to have the experience and insight needed.
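
As a rough illustration, quotas could be derived from population proportions and then filled as participants are recruited. The sketch below is written in Python; the age groups, their shares, the sample size of 20 and the list of recruits are all hypothetical.

```python
def quota_targets(population_shares, sample_size):
    """Turn population proportions into the number of participants per group."""
    return {group: round(share * sample_size)
            for group, share in population_shares.items()}

def fill_quotas(candidates, targets):
    """Accept candidates in order of recruitment until each quota is full."""
    counts = {group: 0 for group in targets}
    selected = []
    for name, group in candidates:
        if group in counts and counts[group] < targets[group]:
            selected.append(name)
            counts[group] += 1
    return selected, counts

# Hypothetical age-group shares in the population studied
shares = {"18-34": 0.30, "35-54": 0.45, "55+": 0.25}
targets = quota_targets(shares, sample_size=20)

# Hypothetical recruits: (identifier, age group), in order of recruitment
recruits = [(f"P{i}", ["18-34", "35-54", "55+"][i % 3]) for i in range(60)]
selected, counts = fill_quotas(recruits, targets)
print(targets)
print(counts)
```

Note that selection within each group is not random: whoever is recruited first fills the quota, which is why quota sampling counts as a non-probability method. With other shares, rounding may make the quotas add up to slightly more or less than the intended sample size.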

Quota sampling requires using recruitment strategies (appropriate to the location, culture, and study population) to find the people needed to meet the desired quotas. Terms used in this context are:

  • Primary selection: all participants required could be selected using a pre-planned method
  • Secondary selection: not enough participants could be selected using a pre-planned method, so other selection methods had to be used to find enough participants with required criteria.

Snowball sampling (non random)

Snowballing is also known as chain referral sampling. Here participants with whom contact has already been made are used to find other people who could potentially participate in the study. Snowball sampling is useful to find and recruit ‘hidden populations’ or groups not easily accessible through other sampling strategies.
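
The chain referral process can be pictured as following links outward from the first contacts. The Python sketch below is only an illustration under stated assumptions: the names, the referral network and the cap of five participants are all made up.

```python
from collections import deque

def snowball_sample(seeds, referrals, max_size):
    """Follow referral chains from the initial contacts until max_size is reached."""
    sample = []
    seen = set()
    queue = deque(seeds)
    while queue and len(sample) < max_size:
        person = queue.popleft()
        if person in seen:
            continue               # already recruited through another referral
        seen.add(person)
        sample.append(person)
        # Each participant suggests further people who might take part
        queue.extend(referrals.get(person, []))
    return sample

# Hypothetical referral network: who each participant could refer
referrals = {
    "Ana": ["Ben", "Chloe"],
    "Ben": ["Dev"],
    "Chloe": ["Dev", "Ella"],
    "Dev": ["Farid"],
}
print(snowball_sample(["Ana"], referrals, max_size=5))
```

Everyone in the resulting sample is connected, directly or indirectly, to the initial contact, which is one reason snowball samples cannot be assumed to be representative.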

Convenience sampling (non random)

Only participants who happen to be available are included in the sample. Participants are therefore not representative of a larger group, and convenience sampling does not allow findings to be generalised.

Theoretical sampling

Developed by Glaser and Strauss (1967), theoretical sampling is used in grounded theory to extend a sample once you start understanding what is emerging. The selection is directed by the emerging analysis and used to work inductively (to discover) or deductively (to verify). In this case sampling is done to refine theory, not to increase sample size.

Random sampling (probability sampling)

Random sampling is generally used for large numbers, such as large scale surveys. The method relies on random selection, which gives each participant/unit an equal chance of being selected. The sample must also contain the same variations that exist in the population. Bias may arise if those selected are neither typical nor representative of the larger population.

Probability theory, which covers the relative frequency with which certain events occur, is at the basis of random sampling and will direct how many participants/units have to be sampled.
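
As an illustration of how probability theory can guide sample size, the Python sketch below uses a widely taught textbook approximation for estimating a proportion. The formula is not taken from this guide, and the defaults are assumptions for the example: z = 1.96 (about 95% confidence), an expected proportion of 0.5 (the most cautious choice) and a 5% margin of error.

```python
import math

def sample_size_for_proportion(margin_of_error, population_size=None,
                               z=1.96, expected_proportion=0.5):
    """Approximate sample size needed to estimate a proportion.

    Uses n0 = z^2 * p * (1 - p) / e^2 and, when the population size N is
    known, the finite population correction n = n0 / (1 + (n0 - 1) / N).
    """
    p = expected_proportion
    n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    if population_size is not None:
        n0 = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n0)   # round up to whole participants

print(sample_size_for_proportion(0.05))                        # very large population
print(sample_size_for_proportion(0.05, population_size=9000))  # population of 9000
```

Figures like these are a starting point only; expected non-response and the planned analysis usually call for a larger sample than the formula alone suggests.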

Probability, as the systematic study of uncertainty, is important as it allows the researcher to use information in a sample to make inferences about the population from which the sample was obtained. Findings can then be generalised from the sample to the population.

Video: Probability


Advantages of probability sampling:

  • Representativeness
  • Avoids bias

Types of probability samples:

  • Simple random sample
  • Systematic random sample
  • Stratified random sample
  • Cluster sample

Simple random sampling

A type of probability sampling in which the units composing a population are assigned numbers.
A set of random numbers is generated and the units having those numbers are included in the sample.
Not necessarily the most accurate sampling method.

Simple random sample example (from Bryman & Bell, 2011):
Population: all full time employees = 9000
Sample frame: excludes part-timers, those working outside
Sample size = 450
All employees are given a number from 1 to 9000
Select randomly 450 different numbers (e.g. using software)
These 450 constitute the sample
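
The selection step in this example could be carried out with a few lines of Python. This is a minimal sketch using the standard library; the seed value is arbitrary and only there so the draw can be repeated.

```python
import random

POPULATION_SIZE = 9000   # all full-time employees in the sampling frame
SAMPLE_SIZE = 450

rng = random.Random(2011)  # arbitrary seed, only to make the draw repeatable

# Employees are numbered 1 to 9000; draw 450 different numbers at random
employee_numbers = range(1, POPULATION_SIZE + 1)
sample = rng.sample(employee_numbers, SAMPLE_SIZE)

print(sorted(sample)[:10])  # first few selected employee numbers
```

Because the draw is made without replacement, no employee number can be selected twice.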

Systematic random sample

Units are selected from the sample frame in a systematic way. For example, to choose people from a phone book, make a random start and then pick every 20th person.
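
A random start followed by a fixed interval can be sketched in Python as below. The phone book of 1,000 entries is hypothetical and the seed is arbitrary.

```python
import random

def systematic_sample(frame, interval, seed=None):
    """Pick a random start within the first interval, then every interval-th unit."""
    rng = random.Random(seed)
    start = rng.randrange(interval)   # random start between 0 and interval - 1
    return frame[start::interval]

# Hypothetical sampling frame: 1,000 phone book entries
phone_book = [f"Entry {i}" for i in range(1, 1001)]
sample = systematic_sample(phone_book, interval=20, seed=7)
print(len(sample), sample[:3])
```

If the list itself has a regular pattern that happens to match the interval, the sample can be distorted, so the ordering of the frame is worth checking before using this method.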

Stratified sample

Stratification is a modification to simple random and systematic sample methods. Stratification refers to the grouping of the units composing a population into homogeneous groups (strata) before sampling. The population is stratified by a criterion (e.g. gender, department, position level).
In terms of representativeness, this is a slightly more accurate method compared to simple random sampling, but it requires knowing the relevant characteristics of the population.
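
A stratified draw might look like the Python sketch below. It assumes proportionate allocation (the same fraction taken from every stratum), which is one common choice rather than the only one; the departments, their sizes and the 10% fraction are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(units, stratum_of, sampling_fraction, seed=None):
    """Draw the same fraction at random from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)   # group units by stratum
    sample = []
    for members in strata.values():
        n = round(len(members) * sampling_fraction)
        sample.extend(rng.sample(members, n))   # random draw within the stratum
    return sample

# Hypothetical staff list: (employee number, department)
departments = ["Sales"] * 500 + ["Production"] * 300 + ["Finance"] * 200
staff = list(enumerate(departments, start=1))

sample = stratified_sample(staff, stratum_of=lambda unit: unit[1],
                           sampling_fraction=0.1, seed=3)
print(Counter(department for _, department in sample))
```

Each department appears in the sample in the same proportion as in the staff list, which a simple random sample would only achieve approximately.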

Cluster sample or multi-stage cluster sample

The sample is made up of groupings of units of the population to be sampled. This is done in stages: natural groups are sampled first, and members of each selected group are then sub-sampled.

Cluster sampling is used when it is not practical or possible to create a list of all elements that compose the target population, for example with widely dispersed populations. It is very efficient, but less precise.

Cluster sampling – example (from Bryman & Bell):
A sample of 5000 employees working for the 100 largest companies in the country is needed. Random or systematic sampling would require a lot of travel. This is avoided by cluster sampling: companies are sampled first. In this case 10 companies are chosen (using probability sampling) to generate 10 clusters. In each cluster 500 employees are then sampled, also using probability sampling.
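
The two stages of this example can be sketched in Python as follows. The company names and the assumption that every company has 2,000 employees are invented for the illustration; simple random selection is used at both stages as one possible form of probability sampling.

```python
import random

rng = random.Random(5)  # arbitrary seed, only to make the example repeatable

# Hypothetical frame: 100 companies, each with 2,000 numbered employees
companies = {f"Company {c}": [f"C{c}-E{e}" for e in range(1, 2001)]
             for c in range(1, 101)}

# Stage 1: sample 10 companies (the clusters) at random
chosen_companies = rng.sample(sorted(companies), 10)

# Stage 2: sample 500 employees at random within each chosen company
sample = []
for company in chosen_companies:
    sample.extend(rng.sample(companies[company], 500))

print(len(chosen_companies), len(sample))
```

Only the 10 chosen companies need to be visited, which is where the saving in travel comes from.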

Equivalence sampling

Example from Bryman & Bell:
The research aims to validate a measurement tool used in the US and in Belgium. The required sample does not need to be representative of the whole population, but only of the target population. In the US, 20,000 invitations to participate were sent. In Belgium, 11,500 invitations were sent. The samples are not proportionate, but this does not matter in this case.

Sample bias

Sample bias will negatively affect any claim you can make in your research. Watch out for the following types of bias: selection bias, measurement or response bias, and non-response bias.

Selection bias

Units selected are not representative of the population researched. E.g. a sample constructed from a phone book excludes people who have no phone.

Measurement or response bias

When observation produces values that are different from the real value. E.g. a satisfaction survey asking respondents to indicate whether they are satisfied, dissatisfied, or very dissatisfied (loaded towards dissatisfaction), or when people do not respond truthfully because they want to be liked, or because what they are doing is illegal.

Non-response bias

When responses are not obtained from all individuals; the effect is greatest if the response rate is low. Non-response rates vary, but overall mail surveys have a high non-response rate. Some sampling bias is unavoidable. Probability sampling helps minimise sampling bias (compared with non-probability sampling).

Data mining

Video: Text and data mining in the humanities and social sciences, a free online webinar presented by the Center for Research Libraries Global Resources Network. The accompanying presentation is available from