Knowing that a population sample (non-random) is biased in terms of its demographics, what are the best practices to correct for this issue?
That is, let's say that I can attach an array of demographics to the sample, and that I wish to transform this sample so that it resembles the population from which it was drawn. Later on, this adjusted sample will be used for mathematical modeling.
As I see it, it is quite straightforward to correct for a single variable. If males are under-represented by 50%, all males are assigned a weight of 2. But what if one wants to take several variables into account at the same time? Is building an n-dimensional array the way to go? Are there better solutions?
Are there readily available methods for this? An R package?
Best Answer
As Tim pointed out, you should use survey weighting.
In your case, more specifically, if all the auxiliary variables (the demographic variables you want to use to make your sample match your population) are qualitative, you can use:
- Post-stratification: If you have the full joint distribution of these variables on the population
- Raking: If you only have the marginal distributions of these variables on the population
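To make the post-stratification case concrete, here is a minimal sketch in plain Python (not the R implementation). It assumes the full joint population counts are known for every cell, and that every cell has at least one respondent. The function name, the cell labels, and all the counts are made up for illustration.

```python
from collections import Counter

def post_stratify_weights(sample_cells, population_counts):
    """Return one weight per respondent: N_cell / n_cell.

    sample_cells: the cell (e.g. a (sex, age group) tuple) of each respondent.
    population_counts: known population count for every cell.
    """
    sample_counts = Counter(sample_cells)
    missing = set(population_counts) - set(sample_counts)
    if missing:
        # An empty cell cannot be weighted up; cells would need to be merged.
        raise ValueError(f"no respondents in cells: {sorted(missing)}")
    return [population_counts[c] / sample_counts[c] for c in sample_cells]

# Hypothetical sample of six respondents and a hypothetical population of 1000.
sample_cells = [
    ("M", "young"), ("F", "young"), ("F", "young"),
    ("F", "old"), ("M", "old"), ("F", "old"),
]
population_counts = {
    ("M", "young"): 250, ("F", "young"): 250,
    ("M", "old"): 200, ("F", "old"): 300,
}

weights = post_stratify_weights(sample_cells, population_counts)
print(weights)
```

This is the "n-dimensional array" idea from the question: each cell of the cross-classification gets its own weight. It stops being practical when cells become sparse or when only marginal totals are available, which is where raking comes in.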
More generally, if you have qualitative and quantitative auxiliary variables, you can use a Calibration approach.
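As a rough illustration of what calibration does, the following plain Python sketch implements linear calibration for the simplest possible case: an intercept plus one quantitative auxiliary variable. It assumes the population size and the population total of that variable are known. The function name, the ages, the starting weights, and the population figures are all hypothetical, and real calibration routines handle many variables and other distance functions.

```python
def linear_calibration_weights(d, x, pop_size, pop_total_x):
    """Linear calibration on an intercept and one quantitative variable x.

    Finds weights w_i = d_i * (1 + a + b * x_i) such that
    sum(w_i) == pop_size and sum(w_i * x_i) == pop_total_x.
    d holds the starting weights (all equal if nothing better is known).
    """
    s0 = sum(d)
    s1 = sum(di * xi for di, xi in zip(d, x))
    s2 = sum(di * xi * xi for di, xi in zip(d, x))
    det = s0 * s2 - s1 * s1
    if det == 0:
        raise ValueError("x has no variation; the system is singular")
    # Gaps between the known population totals and the current weighted totals.
    r0 = pop_size - s0
    r1 = pop_total_x - s1
    # Solve the 2x2 system [[s0, s1], [s1, s2]] @ [a, b] = [r0, r1].
    a = (s2 * r0 - s1 * r1) / det
    b = (s0 * r1 - s1 * r0) / det
    return [di * (1 + a + b * xi) for di, xi in zip(d, x)]

# Hypothetical data: five respondents, population of 1000 with mean age 45.
ages = [23, 35, 41, 52, 67]
start = [200.0] * 5
calibrated = linear_calibration_weights(start, ages, pop_size=1000, pop_total_x=45000)
print(calibrated)
```

One caveat with the linear version is that it can produce negative weights when the sample is far from the population totals, which is why bounded or raking-type distance functions are often used instead.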
Tim also pointed out the `survey` package in R. There you can find three functions that implement these methods:
- Post-stratification: `postStratify`
- Raking: `rake`
- Calibration: `calibrate`
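For intuition about raking, here is a minimal plain Python sketch of iterative proportional fitting. It is not the `survey` package's code; it assumes only the marginal population totals are known and that every level appears in the sample. The function name, the variable names, and the totals are hypothetical.

```python
def rake_weights(rows, margins, weights=None, max_iter=100, tol=1e-9):
    """Adjust weights until each variable's weighted totals match its margin.

    rows: one dict per respondent, e.g. {"sex": "M", "age": "young"}.
    margins: {variable: {level: known population total}}.
    """
    w = list(weights) if weights is not None else [1.0] * len(rows)
    for _ in range(max_iter):
        max_change = 0.0
        for var, targets in margins.items():
            # Current weighted total for each level of this variable.
            totals = {}
            for row, wi in zip(rows, w):
                totals[row[var]] = totals.get(row[var], 0.0) + wi
            # Scale every respondent so this variable's margin is matched.
            for i, row in enumerate(rows):
                factor = targets[row[var]] / totals[row[var]]
                max_change = max(max_change, abs(factor - 1.0))
                w[i] *= factor
        if max_change < tol:
            break  # all margins already match to within tol
    return w

# Hypothetical sample of seven respondents and hypothetical margins (N = 1000).
rows = [
    {"sex": "M", "age": "young"}, {"sex": "F", "age": "young"},
    {"sex": "F", "age": "young"}, {"sex": "F", "age": "old"},
    {"sex": "M", "age": "old"}, {"sex": "F", "age": "old"},
    {"sex": "M", "age": "young"},
]
margins = {
    "sex": {"M": 450, "F": 550},
    "age": {"young": 400, "old": 600},
}

raked = rake_weights(rows, margins)
print(raked)
```

Each pass is a one-variable post-stratification, repeated over the variables in turn until the weights stop changing. The margins must be mutually consistent (here both sum to 1000) for the procedure to settle.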
The `sampling` package in R also contains a function for weighting:
- Calibration: `calib`
It is important to note though that these weighting methods were originally developed under a probability sampling framework, which does not appear to be your case (you referred to your sample as "non-random"). These methods might mitigate some potential bias in your estimates, as long as the auxiliary variables used in the weighting adjustments are related to your outcome variables and to the selection mechanism of your sample. See this paper by Little and Vartivarian for a similar discussion in survey nonresponse.