Solved – Techniques to sample skewed distributions

I am trying to understand the techniques that are available for sampling data from skewed distributions. For the purpose of the question, let's assume we have data collected from a city with the attributes age, gender, and salary.

Let's say the age of the population has a multimodal distribution, salary has a long tail, and gender has a left-skewed distribution.

What techniques can be used to obtain a balanced sample from this population? Is stratified sampling the only technique? Are there other techniques for sampling from skewed data (recent developments or research)?

Assuming you're doing probability sampling and employing a design-based philosophy, then the properties of the distribution that generates the population units aren't particularly important. Horvitz-Thompson Estimation doesn't make assumptions about the underlying distribution of the population.
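
To make the Horvitz-Thompson idea concrete, here is a minimal sketch. The estimator of a population total is the sum of each sampled value divided by that unit's inclusion probability. The lognormal "salary" population, the sample size and the helper name `horvitz_thompson_total` are assumptions chosen only for illustration.

```python
import numpy as np

def horvitz_thompson_total(y_sample, inclusion_probs):
    """Estimate a population total as sum(y_i / pi_i) over the sampled units."""
    y_sample = np.asarray(y_sample, dtype=float)
    inclusion_probs = np.asarray(inclusion_probs, dtype=float)
    if np.any(inclusion_probs <= 0) or np.any(inclusion_probs > 1):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    return float(np.sum(y_sample / inclusion_probs))

rng = np.random.default_rng(0)
# Hypothetical long-tailed "salary" population (lognormal is an assumption).
salary = rng.lognormal(mean=10.0, sigma=1.0, size=5_000)
N, n = salary.size, 200

# Simple random sampling without replacement: every unit has pi_i = n / N.
estimates = []
for _ in range(2_000):
    idx = rng.choice(N, size=n, replace=False)
    estimates.append(horvitz_thompson_total(salary[idx], np.full(n, n / N)))

true_total = float(salary.sum())
mean_estimate = float(np.mean(estimates))
print(f"true total: {true_total:.3e}, mean of HT estimates: {mean_estimate:.3e}")
```

Averaged over repeated samples, the estimates should sit close to the true total even though the population is heavily skewed. This is the sense in which the design-based approach does not depend on the shape of the distribution.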

You mention that you want to ensure that your estimator is approximately normal. This is ensured by the finite-population central limit theorem (proven by Erdős and Rényi (1959) and Hájek (1960); yes, that Erdős).
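
As a rough illustration of that approximate normality (a simulation, not a proof), the sketch below draws repeated simple random samples without replacement from a skewed population and compares the skewness of the sample mean at two sample sizes. The population, the seed and the helper names `skewness` and `sample_means` are assumptions made for this example.

```python
import numpy as np

def skewness(x):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.mean(centered**3) / np.mean(centered**2) ** 1.5)

def sample_means(population, n, replicates, rng):
    """Means of repeated simple random samples drawn without replacement."""
    return np.array([
        rng.choice(population, size=n, replace=False).mean()
        for _ in range(replicates)
    ])

rng = np.random.default_rng(42)
# Hypothetical long-tailed population (lognormal is an assumption).
population = rng.lognormal(mean=10.0, sigma=1.0, size=5_000)

skew_population = skewness(population)
skew_small = skewness(sample_means(population, n=20, replicates=2_000, rng=rng))
skew_large = skewness(sample_means(population, n=500, replicates=2_000, rng=rng))

print(f"skewness of population:         {skew_population:.2f}")
print(f"skewness of sample mean, n=20:  {skew_small:.2f}")
print(f"skewness of sample mean, n=500: {skew_large:.2f}")
```

The skewness of the estimator should shrink toward zero as the sample size grows, which is what the theorem leads you to expect.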

If you're generally trying to ensure that your sample is in some sense "representative" (that is, is likely to have similar properties to the whole population – see Basu's Elephants for an example of a sampling scheme that doesn't ensure this), then you have a range of strategies:

  1. You mention stratification already. If you can split your population into pieces and survey them separately, this is usually a very good strategy for dealing with units whose responses differ in magnitude. In surveys with highly skewed population responses (e.g. business surveys), the businesses that are likely to have large responses are often completely enumerated, because the largest units have the largest effect on the estimate.

  2. Systematic sampling can also help select more representative samples. If you rank your units by expected response size and then step through this list at even intervals, you are more likely to select a range of different units, which protects you from selecting too many small units or too many large units. Done badly, this can introduce a systematic bias into your samples, though, so be careful.

  3. Probability proportional to size sampling, where larger units are more likely to be selected, is another way to control how representative your sample is. If larger units have a higher probability of selection, they have a smaller weight, which means they contribute less to the variance of the estimator.

  4. If your population is skewed because very few of the units in your population are likely to give informative answers (e.g. if you're asking a sample from the general population what they purchase with their pension), then you can design a multi-phase survey where many units are originally approached, but only those who identify as pensioners are actually followed up for more clarification.
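
The first three strategies above can be sketched on a simulated population. Everything here is an assumption made for illustration: the business-like population, the auxiliary `size` variable that stands in for "expected response size", the sample sizes, and the use of Poisson sampling as one simple way to get selection probabilities proportional to size.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical business population: `size` is a known auxiliary measure and
# `y` is the survey response, assumed roughly proportional to size.
N = 2_000
size = rng.lognormal(mean=3.0, sigma=1.2, size=N)
y = size * rng.uniform(0.8, 1.2, size=N)
true_total = float(y.sum())

def srs(n):
    """Baseline: simple random sample without replacement."""
    idx = rng.choice(N, size=n, replace=False)
    return y[idx].sum() * N / n

def stratified_take_all(n_take_all, n_rest):
    """Completely enumerate the largest units, simple-random-sample the rest."""
    order = np.argsort(size)[::-1]
    big, rest = order[:n_take_all], order[n_take_all:]
    sampled = rng.choice(rest, size=n_rest, replace=False)
    # Take-all units get weight 1; sampled units get weight len(rest) / n_rest.
    return y[big].sum() + y[sampled].sum() * len(rest) / n_rest

def systematic(n):
    """Sort by size, then take every k-th unit from a random start."""
    order = np.argsort(size)
    k = N // n                      # assumes N is a multiple of n
    start = rng.integers(k)
    return y[order[start::k]].sum() * k

def pps_poisson(n_expected):
    """Poisson sampling with inclusion probability proportional to size."""
    # Capping at 1 makes the expected sample size slightly below n_expected.
    pi = np.minimum(1.0, n_expected * size / size.sum())
    selected = rng.random(N) < pi
    return np.sum(y[selected] / pi[selected])

designs = {
    "simple random": lambda: srs(100),
    "stratified (take-all top 20)": lambda: stratified_take_all(20, 80),
    "systematic by size": lambda: systematic(100),
    "PPS (Poisson)": lambda: pps_poisson(100),
}

results = {}
for name, draw in designs.items():
    est = np.array([draw() for _ in range(1_000)])
    results[name] = est
    rel_bias = (est.mean() - true_total) / true_total
    rel_rmse = np.sqrt(np.mean((est - true_total) ** 2)) / true_total
    print(f"{name:30s} rel. bias {rel_bias:+.3f}   rel. RMSE {rel_rmse:.3f}")
```

All four estimators are weighted totals with weights equal to the inverse inclusion probabilities, so each should be close to unbiased. How much the relative RMSE improves over simple random sampling depends on the simulated population and on how closely the response tracks the size measure.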
