I am running ridge regression on the Boston housing dataset. There are many write-ups online on how to do ridge regression. I will write up two of the methods and then pose my questions.
Initialize with the dataset
library('mlbench')
data(BostonHousing)
First method: According to the Stanford Open course on statistics
library('glmnet')
library('dplyr')

# initialize data matrix for glmnet
z <- colnames(BostonHousing)
z <- z[-14]  # drop the response column, medv (column 14)
x <- BostonHousing %>% select(one_of(z)) %>% data.matrix()

# fit ridge regression over a whole sequence of lambdas
fit <- glmnet(x, BostonHousing$medv, alpha = 0)

# use 10-fold cross validation to choose the lambda with the lowest MSE
cv_fit <- cv.glmnet(x, BostonHousing$medv, alpha = 0)

# keep the ridge regression model with the best lambda
fit <- cv_fit$glmnet.fit

# calculate training MSE for the best ridge regression model
min(cv_fit$cvm)
Second method: According to this tutorial
library('caret')

# take a random sample of half of the data
split <- createDataPartition(y = BostonHousing$medv, p = 0.5, list = FALSE)

# create training and test sets
train <- BostonHousing[split, ]
test <- BostonHousing[-split, ]

# fit ridge regression on the training set with a fixed lambda
ridge <- train(medv ~ ., data = train, method = 'ridge',
               lambda = 4, preProcess = c('scale', 'center'))

# use the model to predict values of the test set
ridge.pred <- predict(ridge, test)

# MSE for the test error (square the residuals before averaging)
mean((ridge.pred - test$medv)^2)

# select lambda by 10-fold cross validation
fitControl <- trainControl(method = "cv", number = 10)
lambdaGrid <- expand.grid(lambda = 10^seq(10, -2, length = 100))

# do ridge regression with the best lambda
ridge <- train(medv ~ ., data = train, method = 'ridge',
               trControl = fitControl,
               # tuneGrid = lambdaGrid,
               preProcess = c('center', 'scale'))

# predict the test set using the model from the training set
ridge.pred <- predict(ridge, test)

# calculate test RMSE
sqrt(mean((ridge.pred - test$medv)^2))
I have a few questions, I hope that's alright.
1- Assuming I use the first method, can I estimate the test error of the ridge model with k-fold cross validation?
It only gives me the training error and I'd like to approximate test error.
2- The second approach uses a validation set. Is that desirable in situations with small sample sizes?
The BostonHousing data is 506 rows by 14 variables.
3- Here is the output from the second method:
ridge
Ridge Regression

254 samples
 10 predictor

Pre-processing: centered (10), scaled (10)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 230, 229, 228, 229, 229, 229, ...
Resampling results across tuning parameters:

  lambda  RMSE       Rsquared   MAE
  0e+00   0.5963179  0.6835195  0.4131819
  1e-04   0.5963073  0.6835296  0.4131761
  1e-01   0.5920124  0.6891727  0.4120725

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was lambda = 0.1.
Why is ridge regression using resampling here, and how did it arrive at a lambda of 0.1 when the first method chose a lambda of 0.0501?
Best Answer
First question: If you tuned your ridge regression with cross-validation, you have used all of your training data to find the best lambda, so the error will be biased downward if you evaluate the model on the training data.
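To get a less optimistic estimate, you can hold rows out before tuning and compute the error on them afterwards. A minimal sketch with the same glmnet workflow; the 20% holdout fraction and the seed are arbitrary assumptions:

```r
library('mlbench')
library('glmnet')
data(BostonHousing)

set.seed(1)  # arbitrary seed, for reproducibility only
x <- data.matrix(BostonHousing[, setdiff(names(BostonHousing), 'medv')])
y <- BostonHousing$medv

# hold out ~20% of the rows as a test set before any tuning
test_idx <- sample(nrow(x), size = round(0.2 * nrow(x)))

# tune lambda by 10-fold CV on the training rows only
cv_fit <- cv.glmnet(x[-test_idx, ], y[-test_idx], alpha = 0)

# test MSE at the CV-chosen lambda, on rows never seen during tuning
pred <- predict(cv_fit, newx = x[test_idx, ], s = 'lambda.min')
mean((pred - y[test_idx])^2)
```

Because the held-out rows played no part in choosing lambda, this MSE is an honest estimate of test error, unlike `min(cv_fit$cvm)` computed on the data used for tuning.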
Second question: Yes, you can still do that. Try the same procedure with different random seeds, that is, perform repeated cross-validation: split the data into training and test sets, tune lambda with CV on the training set, measure performance on the test set, then repeat with a different split.
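That repeated procedure might look like the following sketch, reusing the caret workflow from the question; the number of repeats (5) and the seed values are assumptions:

```r
library('mlbench')
library('caret')
data(BostonHousing)

# repeat the split/tune/evaluate cycle with different random seeds
test_rmse <- sapply(1:5, function(seed) {
  set.seed(seed)
  split <- createDataPartition(BostonHousing$medv, p = 0.5, list = FALSE)
  train_set <- BostonHousing[split, ]
  test_set  <- BostonHousing[-split, ]

  # tune lambda by 10-fold CV on the training half only
  fit <- train(medv ~ ., data = train_set, method = 'ridge',
               trControl = trainControl(method = 'cv', number = 10),
               preProcess = c('center', 'scale'))

  # evaluate on the untouched half
  pred <- predict(fit, test_set)
  sqrt(mean((pred - test_set$medv)^2))
})

mean(test_rmse)  # average test RMSE across the repeated splits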
Third question: Did you scale your data in the first method? There may also be different ways to determine lambda, because there is no exact method, resulting in different outcomes.
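Concretely, `glmnet` standardizes predictors internally by default (`standardize = TRUE`) and searches its own lambda sequence, while caret's `ridge` method in the question only tried its three default values (0, 1e-4, 0.1), the ones visible in the output. The two packages also parameterize the penalty somewhat differently, so their lambda values are not directly comparable. A sketch contrasting the two selection paths; the denser grid below is an assumption, not the question's grid:

```r
library('mlbench')
library('glmnet')
library('caret')
data(BostonHousing)

x <- data.matrix(BostonHousing[, setdiff(names(BostonHousing), 'medv')])
y <- BostonHousing$medv

# glmnet: standardizes internally by default (made explicit here)
# and picks lambda from its own automatically generated sequence
cv_fit <- cv.glmnet(x, y, alpha = 0, standardize = TRUE)
cv_fit$lambda.min

# caret: without tuneGrid, method = 'ridge' only tries 3 default lambdas;
# a denser grid covers far more candidates
grid <- expand.grid(lambda = 10^seq(2, -4, length = 50))
fit <- train(medv ~ ., data = BostonHousing, method = 'ridge',
             tuneGrid = grid,
             trControl = trainControl(method = 'cv', number = 10),
             preProcess = c('center', 'scale'))
fit$bestTune$lambda
```

With different grids, different internal scalings, and different random CV folds, there is no reason to expect the two methods to land on the same lambda.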