Solved – R: What does train() do when it calculates ridge regression?

I am running ridge regression on the Boston dataset. There are many write-ups online for how to do ridge regression.

I will write up the two methods and then pose my questions.

Initialize with the dataset:

    library('mlbench')
    data(BostonHousing)

First method: According to the Stanford Open course on statistics

    library('glmnet')
    library('dplyr')

    # initialize the predictor matrix for glmnet
    z <- colnames(BostonHousing)
    z <- z[z != "medv"]   # drop the response column
    x <- BostonHousing %>% select(one_of(z)) %>% data.matrix()

    # create ridge regression fits along the whole lambda path
    fit <- glmnet(x, BostonHousing$medv, alpha = 0)

    # use 10-fold cross-validation to choose the lambda with the lowest MSE
    cv_fit <- cv.glmnet(x, BostonHousing$medv, alpha = 0)

    # keep the ridge regression model with the best lambda
    fit <- cv_fit$glmnet.fit

    # calculate training MSE for the best ridge regression model
    min(cv_fit$cvm)

Second method: According to this tutorial

    library('caret')

    # take a random sample of half of the data
    split <- createDataPartition(y = BostonHousing$medv, p = 0.5, list = FALSE)

    # create training and test sets
    train <- BostonHousing[split, ]
    test  <- BostonHousing[-split, ]

    # fit ridge regression on the training set with a fixed lambda
    ridge <- train(medv ~ ., data = train, method = 'ridge',
                   lambda = 4, preProcess = c('scale', 'center'))

    # use the model to predict values of the test set
    ridge.pred <- predict(ridge, test)

    # MSE for the test error
    mean((ridge.pred - test$medv)^2)

    # select lambda by 10-fold cross-validation
    fitControl <- trainControl(method = "cv", number = 10)
    lambdaGrid <- expand.grid(lambda = 10^seq(10, -2, length = 100))

    # do ridge regression with the best lambda
    ridge <- train(medv ~ ., data = train, method = 'ridge',
                   trControl = fitControl,
                   # tuneGrid = lambdaGrid,
                   preProcess = c('center', 'scale'))

    # predict the test set using the model from the training set
    ridge.pred <- predict(ridge, test)

    # calculate test RMSE
    sqrt(mean((ridge.pred - test$medv)^2))

I have a few questions; I hope that's alright.

1- Assuming I use the first method, can I estimate the test error of the ridge model with k-fold cross-validation?

It only gives me the training error, and I'd like to approximate the test error.

2- The second approach uses a validation set. Is that desirable in situations with small sample sizes?

The BostonHousing data is 506 rows by 14 variables.

3- Here is the output of the second method:

    ridge
    Ridge Regression

    254 samples
     10 predictor

    Pre-processing: centered (10), scaled (10)
    Resampling: Cross-Validated (10 fold)
    Summary of sample sizes: 230, 229, 228, 229, 229, 229, ...
    Resampling results across tuning parameters:

      lambda  RMSE       Rsquared   MAE
      0e+00   0.5963179  0.6835195  0.4131819
      1e-04   0.5963073  0.6835296  0.4131761
      1e-01   0.5920124  0.6891727  0.4120725

    RMSE was used to select the optimal model using the smallest value.
    The final value used for the model was lambda = 0.1.

Why does train() use resampling for ridge regression? How did it arrive at a lambda of 0.1 when the first method gave a lambda of 0.0501?

First question: If you tune your ridge regression with cross-validation, you have used all of your training data to find the best lambda, so the error estimate will be biased downwards if you evaluate the model on that same training data.
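For instance, here is a minimal sketch of doing this with the first method's x matrix: hold out a test set before any tuning, run cv.glmnet on the training rows only, and score the chosen lambda on the held-out rows. The seed and the 70/30 split are arbitrary choices for illustration, not from the original post:

    set.seed(1)                                # arbitrary seed, for reproducibility
    n   <- nrow(BostonHousing)
    idx <- sample(n, size = floor(0.7 * n))    # 70/30 split is an assumption

    # tune lambda by cross-validation on the training rows only
    cv_fit <- cv.glmnet(x[idx, ], BostonHousing$medv[idx], alpha = 0)

    # evaluate the chosen lambda on rows the tuning never saw
    pred <- predict(cv_fit, newx = x[-idx, ], s = "lambda.min")
    mean((pred - BostonHousing$medv[-idx])^2)  # approximately unbiased test MSE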

Second question: Yes, you can still do that. Try the same procedure with different random seeds, that is, perform repeated cross-validation: split the data into training and test sets, tune lambda with CV on the training set, determine performance on the test set, and repeat with a different split.
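A rough sketch of that loop, reusing the x matrix from the first method (the 20 repeats and the 50/50 split are assumptions for illustration):

    # split, tune lambda on the training half, score on the test half, repeat
    test_mse <- replicate(20, {                # 20 repeats is an arbitrary choice
      idx  <- sample(nrow(x), size = floor(0.5 * nrow(x)))
      cv   <- cv.glmnet(x[idx, ], BostonHousing$medv[idx], alpha = 0)
      pred <- predict(cv, newx = x[-idx, ], s = "lambda.min")
      mean((pred - BostonHousing$medv[-idx])^2)
    })
    mean(test_mse)   # average test MSE over the repeated splits
    sd(test_mse)     # the spread shows how much the estimate depends on the split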

Third question: Did you scale your data in the first method? There may also be different ways to determine lambda, since there is no exact method, resulting in different outcomes.
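Two things are worth checking here. glmnet standardizes the predictors internally by default (standardize = TRUE) and reports coefficients on the original scale, whereas caret's preProcess = c('center', 'scale') rescales the data before fitting; in addition, caret's 'ridge' method is backed by the elasticnet package, whose penalty is parameterized differently from glmnet's lambda, and its default grid only tried the three values shown in the output above, so the two selected lambdas are not directly comparable. A quick sketch of the scaling effect, assuming the x matrix from the first method:

    # glmnet standardizes internally by default before applying the penalty
    cv_std   <- cv.glmnet(x, BostonHousing$medv, alpha = 0)

    # turning standardization off changes what a given lambda means
    cv_nostd <- cv.glmnet(x, BostonHousing$medv, alpha = 0, standardize = FALSE)

    cv_std$lambda.min     # e.g. the 0.0501 reported above
    cv_nostd$lambda.min   # typically a different value

    # cv.glmnet also searches a fine, data-driven lambda path,
    # not the three-point grid caret used above
    length(cv_std$lambda)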
