I have a set of x, y data that I'm using to build a random forest. The x data is a vector of values that includes some NAs, so I use `rfImpute` to handle the missing data and grow the forest. Now I have a new, unseen observation x (with an NA) and I want to predict y. How do I impute the missing value so that I can use the random forest I have already grown? The `rfImpute` function seems to require both x and y, but I only have x at prediction time.
My question is similar (but different) to this question, and I can use the same iris dataset as an example. If I've correctly interpreted the code in the answer to the question I reference, the expression `iris.na[148, , drop=FALSE]` in the statement `iris.na2 = rbind(iris.imputed, iris.na[148, , drop=FALSE])` represents the new data, and it includes the `Species` (the y value). In my problem I would not know the `Species`; I want to use the random forest to predict it. I would have the 4 independent variables, but some might be `NA` for a given row. To continue the analogy, imagine I have 3 of the 4 variables (one is missing). I want to impute that value, and then predict the species, which I do not know.
In response to gung's comment that I should add an illustration, let me put it in terms of the iris data set. Imagine I have the following data on a flower: I know its `Sepal.Length`, `Sepal.Width`, and `Petal.Length`, but not its `Petal.Width`. I'd like to impute the `Petal.Width` and then use those 4 values in an RF model to predict the `Species`.
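To make that concrete, here is a minimal sketch of the situation (the measurement values are invented purely for illustration):

```r
# one new observation; Petal.Width is unknown
# (the numbers here are made up for illustration)
new.flower = data.frame(Sepal.Length = 6.1, Sepal.Width = 2.8,
                        Petal.Length = 4.7, Petal.Width = NA)
# goal: impute Petal.Width, then predict(rf, new.flower) to get the
# Species -- but rfImpute wants a response column, which I don't have
```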
Best Answer
I think you need an unsupervised imputation method, that is, one that does not use the target values for imputation. If you only have a few prediction feature vectors, it may be difficult to uncover any data structure from them alone. Instead, you can mix your prediction rows with the already-imputed training feature vectors and use that combined structure to impute once more. Notice that this procedure may violate independence assumptions, so wrap the entire procedure in an outer cross-validation to check for serious overfitting.
I just learned about missForest from a comment on this question, and it seems to do the trick. I simulated your problem on the iris data (without the outer cross-validation):
```r
rm(list = ls())
data("iris")
set.seed(1234)

n.train     = 100
train.index = sample(nrow(iris), n.train)
feature.train = as.matrix(iris[ train.index, 1:4])
feature.test  = as.matrix(iris[-train.index, 1:4])

# simulate 40 NAs in train
n.NAs    = 40
NA.index = sample(length(feature.train), n.NAs)
NA.feature.train = feature.train
NA.feature.train[NA.index] = NA

# impute the 40 NAs unsupervised
library(missForest)
imp.feature.train = missForest(NA.feature.train)$ximp

# check how well the imputation went; seems promising for this data set
plot(feature.train[NA.index], imp.feature.train[NA.index],
     xlab = "true value", ylab = "imputed value")

# simulate random NAs in feature.test
feature.test[sample(length(feature.test), 20)] = NA

# mix feature.test with imp.feature.train and impute again
nrow.test   = nrow(feature.test)
mix.feature = rbind(feature.test, imp.feature.train)
imp.feature.test = missForest(mix.feature)$ximp[1:nrow.test, ]

# train RF and predict
library(randomForest)
rf = randomForest(imp.feature.train, iris$Species[train.index])
pred.test = predict(rf, imp.feature.test)
table(pred.test, iris$Species[-train.index])
```

which prints:

```
pred.test    setosa versicolor virginica
  setosa         12          0         0
  versicolor      0         20         2
  virginica       0          1        15
```
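The example above omits the outer cross-validation mentioned earlier. A rough sketch of what that wrapper could look like (the fold layout, NA counts, and accuracy check below are my own illustration, not part of the original answer):

```r
# sketch of an outer cross-validation around the impute-then-predict procedure
# (fold construction and NA counts are illustrative assumptions)
library(missForest)
library(randomForest)
data("iris")
set.seed(1234)

k     = 5
folds = sample(rep(1:k, length.out = nrow(iris)))
acc   = numeric(k)

for (i in 1:k) {
  train = iris[folds != i, ]
  test  = iris[folds == i, ]
  x.train = as.matrix(train[, 1:4])
  x.test  = as.matrix(test[, 1:4])

  # simulate missingness in both partitions
  x.train[sample(length(x.train), 30)] = NA
  x.test[ sample(length(x.test),  10)] = NA

  # impute train alone, then impute test mixed with the imputed train rows
  imp.train = missForest(x.train)$ximp
  imp.test  = missForest(rbind(x.test, imp.train))$ximp[1:nrow(x.test), ]

  rf     = randomForest(imp.train, train$Species)
  acc[i] = mean(predict(rf, imp.test) == test$Species)
}

acc  # per-fold accuracy; a large spread or drop hints at overfitting
```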