I have a dataset (4,898 × 17,000) that follows 4,898 mothers, fathers, and their children over a period of 15 years. Interviews were conducted at baseline (when the child was born), year-1, year-3, year-5, year-9, and year-15. I want to predict GPA (at year-15) using a random forest from a set of features: gender, household income, household education, parental involvement in studies, parental expectations, family structure, and cognitive and non-cognitive variables. All of these variables except the cognitive variables were captured at year-15; the cognitive variables were captured at year-9. Year-9 has about 1,300 missing values and year-15 has about 1,454, due to non-response in those two years. I am quite new to imputation and am not sure how to use multiple imputation here (especially when more than 1,000 rows have missing values for all the columns). Any help in this regard would be appreciated.
Best Answer
If you decide to "delete" the missing data prior to analysis, that is called a "complete-case analysis" (i.e., you are only using data points that have complete information). That is quite a simple and common method of analysis, but it has some risks. In particular, if the variables under analysis are statistically related to the "missingness" then ignoring the missing data will induce bias in your inferences.
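To make the bias risk concrete, here is a minimal simulation sketch using only the Python standard library. Everything in it is an assumption made for illustration: the scores are synthetic draws loosely resembling GPAs, and the response probabilities (0.5 for scores below 3.0, 0.9 otherwise) are hypothetical, not estimates from the dataset in the question.

```python
import random
import statistics


def simulate_complete_case_bias(n=10_000, seed=0):
    """Return (mean over all rows, mean over responding rows only)."""
    rng = random.Random(seed)
    # Synthetic scores; the mean and spread are arbitrary illustrative choices.
    scores = [rng.gauss(3.0, 0.5) for _ in range(n)]
    # Hypothetical response mechanism: whether a score is observed
    # depends on the score itself, so missingness is related to the variable.
    observed = [s for s in scores if rng.random() < (0.5 if s < 3.0 else 0.9)]
    return statistics.mean(scores), statistics.mean(observed)


full_mean, complete_case_mean = simulate_complete_case_bias()
print(f"all rows: {full_mean:.3f}, complete cases only: {complete_case_mean:.3f}")
```

Because low scores are dropped more often than high ones in this setup, the complete-case mean sits above the mean of the full sample, which is the kind of bias described above.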
Imputation methods attempt to model, at least approximately, the statistical dependence between the missing values and the "missingness" in the data. In cases where entire classes of data points are missing, it may be the case that there is no information available to support imputation, in which case you may have to fall back on complete-case analysis, with appropriate caveats and caution in your conclusions. In any case, missing data methods require quite a bit of learning to implement correctly, but imputation methods can perform better than complete-case analysis in a wide variety of problems where there is sufficient information to estimate relationships between missing data values and the "missingness" indicators.
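Before choosing between these options, it can help to count how many rows fall into each missingness pattern. The sketch below is one possible way to do that in plain Python; the function name `split_by_missingness`, the column names, and the toy records are all hypothetical, and `None` is assumed to mark a missing value.

```python
def split_by_missingness(rows, columns):
    """Group row indices into 'complete', 'partial', and 'all_missing'."""
    groups = {"complete": [], "partial": [], "all_missing": []}
    for i, row in enumerate(rows):
        # Count how many of the chosen columns are missing in this row.
        n_missing = sum(row.get(col) is None for col in columns)
        if n_missing == 0:
            groups["complete"].append(i)
        elif n_missing == len(columns):
            groups["all_missing"].append(i)
        else:
            groups["partial"].append(i)
    return groups


# Toy records with hypothetical column names; None marks a missing value.
rows = [
    {"gpa": 3.4, "income": 52000, "cognitive_y9": 101},
    {"gpa": None, "income": 48000, "cognitive_y9": 95},
    {"gpa": None, "income": None, "cognitive_y9": None},
    {"gpa": 2.9, "income": None, "cognitive_y9": 88},
]
groups = split_by_missingness(rows, ["gpa", "income", "cognitive_y9"])
print(groups)
```

Rows in the `partial` group still have observed values that an imputation model can draw on, whereas rows in the `all_missing` group contribute nothing within the chosen columns.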
If you would like to learn more about missing data methods, you can find a simple educational introduction in Pigott (2001) and a more detailed exposition in Little and Rubin (2002).