I want to find the most important predictors for a binomial dependent variable out of a set of more than 43,000 independent variables (these form the columns of my input dataset). The number of observations is more than 45,000 (these form the rows of my input dataset). Most of the independent variables are unigrams, bigrams, and trigrams of words, so there is a high degree of collinearity among them. There is a lot of sparsity in my dataset as well. I am using logistic regression from the glmnet package, which works for the kind of dataset I have. Here is some code:

```r
library(glmnet)
data <- read.csv('datafile.csv', header = TRUE)
mat <- as.matrix(data)
X <- mat[, 1:(ncol(mat) - 1)]  # parentheses needed: 1:ncol(mat)-1 parses as (1:ncol(mat)) - 1
y <- mat[, ncol(mat)]
fit <- cv.glmnet(X, y, family = "binomial", type.measure = "class")
betacoeff <- as.matrix(fit$glmnet.fit$beta[, ncol(fit$glmnet.fit$beta)])
```

`betacoeff` returns the betas for all the independent variables. I am thinking of showing the predictors corresponding to the top 50 betas as the most important predictors.

My questions are:

`glmnet` picks one good predictor out of a bunch of highly correlated good predictors, so I am not sure how much I can rely on the betas returned by the model above. Should I manually sample the data (say, 10 times), run the model on each sample, get the list of predictors with the top betas, and then find those which are present in all 10 repetitions? Is there any standard way of doing this? What is the standard way of sampling in this case?
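The repeated-sampling idea can be sketched as follows. This is only an illustration, not a standard procedure: the synthetic `X` and `y` below stand in for the document-term matrix from the question, and the 10 repetitions and 80% subsample fraction are arbitrary choices. Each repetition refits `cv.glmnet` on a random subsample and counts how often each predictor keeps a nonzero coefficient at `lambda.min`.

```r
library(glmnet)

# Synthetic placeholder data; replace with your own X and y
set.seed(42)
X <- matrix(rnorm(500 * 50), nrow = 500,
            dimnames = list(NULL, paste0("term", 1:50)))
y <- rbinom(500, 1, plogis(2 * X[, 1] - 2 * X[, 2]))

n_reps <- 10
selection_counts <- setNames(integer(ncol(X)), colnames(X))

for (i in seq_len(n_reps)) {
  # Refit on a random 80% subsample of the rows
  idx <- sample(nrow(X), size = floor(0.8 * nrow(X)))
  fit_i <- cv.glmnet(X[idx, ], y[idx], family = "binomial",
                     type.measure = "class")
  # Coefficients at lambda.min, dropping the intercept row
  beta_i <- as.matrix(coef(fit_i, s = "lambda.min"))[-1, 1]
  selection_counts <- selection_counts + (beta_i != 0)
}

# Predictors that survive with a nonzero coefficient in every repetition
stable_predictors <- names(selection_counts)[selection_counts == n_reps]
```

Counting selection frequency across subsamples is the intuition behind stability selection; predictors that appear in every repetition are less likely to be artifacts of which correlated variable the lasso happened to pick.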

My other question is about `cvm` (the cross-validation error) returned by the model above. Since I use `type.measure = "class"`, `cvm` gives the misclassification error for different values of lambda. How do I report the misclassification error for the entire model? Is it the `cvm` corresponding to `lambda.min`?
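For reference, the `cvm` entry at `lambda.min` can be pulled out of the fitted `cv.glmnet` object directly. `fit` is assumed to be the object from the question's code; `lambda.1se` (the most regularized model within one standard error of the minimum) is shown as the common conservative alternative.

```r
# Cross-validated misclassification error at the two standard choices of lambda
err_min <- fit$cvm[fit$lambda == fit$lambda.min]
err_1se <- fit$cvm[fit$lambda == fit$lambda.1se]
```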


#### Best Answer

- Set `alpha = 0` in `cv.glmnet()` to use ridge instead of lasso.

  "It is known that the ridge penalty shrinks the coefficients of correlated predictors towards each other while the lasso tends to pick one of them and discard the others." (glmnet manual)
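Concretely, that suggestion would change the question's call only in the `alpha` argument (the default, `alpha = 1`, is the lasso). `X` and `y` are assumed to be the matrix and response from the question's code.

```r
library(glmnet)

# Ridge-penalized logistic regression: alpha = 0 keeps all predictors,
# shrinking correlated ones towards each other rather than dropping them
ridge_fit <- cv.glmnet(X, y, family = "binomial", alpha = 0,
                       type.measure = "class")
```

Note the trade-off: ridge gives no automatic variable selection, so with 43,000 predictors you would still need to rank coefficients yourself rather than rely on sparsity.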

- You are already sampling the data by using `cv.glmnet()` (as opposed to simply using `glmnet()`).
- It is my understanding that for each lambda you have a model, so `lambda.min` is the lambda value for the model with the lowest cross-validation error.
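On that point, a small correction to the question's code may help: taking the last column of `fit$glmnet.fit$beta` gives the coefficients of the *least* regularized model on the path, not the best cross-validated one. `coef()` with `s = "lambda.min"` extracts the coefficients at the lambda with the lowest error (`fit` is the `cv.glmnet` object from the question):

```r
# Coefficients at lambda.min; [-1, 1] drops the intercept row
best_betas <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]

# Predictors with the 50 largest coefficients in absolute value
top50 <- sort(abs(best_betas), decreasing = TRUE)[1:50]
names(top50)
```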

User *Jason* has example code posted in another question that I believe will help: https://stats.stackexchange.com/a/92167

### Similar Posts:

- Solved – How to report most important predictors using glmnet
- Solved – Interpreting glmnet cox coefficients
- Solved – How to interpret glmnet
- Solved – Why does lambda.min value in glmnet tuning cross-validation change, when repeating test