I have been told that mean imputation of missing values is inappropriate when the variable's underlying distribution is non-normal. My variable is continuous (but bounded above at 100) and most observations are either 98 or 99 (with the odd few in the low 90s), hence the distribution is highly skewed. How would I best impute the missing values?
The typical modern approach in this sort of situation is to use some form of multiple imputation.
The general idea is that instead of filling in a single 'best' value for each missing entry, you repeat the imputation process many times, drawing each imputed value from the distribution of the data you do observe (including considering whether the variable with missing data is correlated with other variables that are not missing). This generates multiple distinct completed data sets. You then run your analysis on each completed copy of the data, and finally pool the results of those analyses.
The `mice` package (Multivariate Imputation by Chained Equations) is a popular implementation in R.
Other major statistical packages also support multiple imputation; Stata, for example, provides it through its `mi` commands.
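To make the impute, analyze, pool cycle concrete, here is a minimal Python sketch. It is not what `mice` does internally: it swaps in a simple approximate Bayesian bootstrap (a hot-deck method that only ever imputes values that were actually observed, so the bound at 100 and the skew are preserved), and it assumes the data are missing completely at random with no other variables to condition on. The data, the function names `abb_impute` and `pool_means`, and the choice of 20 imputations are all made up for illustration. The pooling step uses Rubin's rules for the mean.

```python
import numpy as np

def abb_impute(values, rng):
    """Fill NaNs with one approximate Bayesian bootstrap draw (assumes MCAR)."""
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    observed = values[~missing]
    # Step 1: resample the observed values, so that uncertainty about
    # their distribution carries over into the imputations.
    donor_pool = rng.choice(observed, size=observed.size, replace=True)
    # Step 2: draw each imputed value from the resampled donor pool.
    completed = values.copy()
    completed[missing] = rng.choice(donor_pool, size=missing.sum(), replace=True)
    return completed

def pool_means(completed_sets):
    """Pool the sample mean over imputed data sets with Rubin's rules."""
    m = len(completed_sets)
    estimates = np.array([d.mean() for d in completed_sets])
    # Within-imputation variance: average squared standard error of the mean.
    within = np.mean([d.var(ddof=1) / d.size for d in completed_sets])
    # Between-imputation variance: spread of the estimates across imputations.
    between = estimates.var(ddof=1)
    total_var = within + (1 + 1 / m) * between
    return estimates.mean(), np.sqrt(total_var)

rng = np.random.default_rng(0)

# Hypothetical data resembling the question: bounded at 100, mostly 98 or 99.
scores = rng.choice([99, 98, 97, 95, 92], size=200,
                    p=[0.45, 0.40, 0.08, 0.05, 0.02]).astype(float)
scores[rng.random(200) < 0.2] = np.nan  # knock out roughly 20% at random

# Impute 20 times, analyze each completed copy, then pool.
completed_sets = [abb_impute(scores, rng) for _ in range(20)]
pooled_mean, pooled_se = pool_means(completed_sets)
print(f"pooled mean = {pooled_mean:.3f}, pooled SE = {pooled_se:.3f}")
```

If the missingness depends on other variables in your data, this simple version is not enough; you would want an imputation model that conditions on those variables, which is what the chained-equations packages are built for.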