Solved – Missing value imputation and Outlier treatment

Should missing value imputation and outlier treatment be done before splitting the data into training and validation sets? Suppose I have split my data into training and validation sets. In the training set, I imputed missing values with the median and capped the data at the 1st and 99th percentiles. When imputing missing values and treating outliers in the validation set, should I use the same median and capping values that were calculated on the training data? Or would it be fine to calculate the median and percentiles from the validation set itself? And will the same process apply in the future when we score a new data set?

The imputation strategy and the method for handling outliers should be developed using the training dataset and then applied to the validation dataset. It wouldn't make sense to develop the imputation procedure using all of your data, then build your model with only the training dataset, and then apply it to the validation dataset. The procedure should be built only with the training data and then applied separately to the training dataset and to the validation dataset. In practice, that means reusing the median and the percentile caps computed on the training data rather than recomputing them on the validation data, and doing the same for any new data you score later.
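As a rough illustration of the idea, here is a minimal sketch in Python with NumPy. The function names `fit_treatment` and `apply_treatment` and the sample arrays are made up for this example, and it assumes a single numeric feature with missing values stored as NaN. The median and the 1st/99th percentile caps are computed once from the training data and then reused unchanged on the validation data.

```python
import numpy as np


def fit_treatment(train_col):
    """Learn the imputation and capping values from the training column only."""
    return {
        "median": np.nanmedian(train_col),
        "lower": np.nanpercentile(train_col, 1),
        "upper": np.nanpercentile(train_col, 99),
    }


def apply_treatment(col, params):
    """Impute missing values and cap outliers using previously fitted values."""
    filled = np.where(np.isnan(col), params["median"], col)
    return np.clip(filled, params["lower"], params["upper"])


# Hypothetical data: one numeric feature with missing values and an extreme value.
train = np.array([1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
valid = np.array([np.nan, 3.0, 1000.0])

params = fit_treatment(train)                 # statistics come from training data only
train_clean = apply_treatment(train, params)
valid_clean = apply_treatment(valid, params)  # reuses the training statistics
```

The same `params` would be saved alongside the model and passed to `apply_treatment` when scoring a new data set, so nothing is ever re-estimated from validation or scoring data.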
