I want to find the most important predictors for a binomial dependent variable out of a set of more than 43,000 independent variables (these form the columns of my input dataset). The number of observations is more than 45,000 (these form the rows). Most of the independent variables are unigrams, bigrams and trigrams of words, so there is a high degree of collinearity among them. The dataset is also very sparse. I am using logistic regression from the glmnet package, which handles this kind of dataset. Here is some code:
    library('glmnet')
    data <- read.csv('datafile.csv', header = TRUE)
    mat  <- as.matrix(data)
    X <- mat[, 1:(ncol(mat) - 1)]   # all columns except the last are predictors
    y <- mat[, ncol(mat)]           # the last column is the binary response
    fit <- cv.glmnet(X, y, family = "binomial", type.measure = "class")
    # coefficients at the last (smallest) lambda on the fitted path
    betacoeff <- as.matrix(fit$glmnet.fit$beta[, ncol(fit$glmnet.fit$beta)])
betacoeff then contains the betas for all the independent variables. I am thinking of showing the predictors corresponding to the top 50 betas as the most important predictors.
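One way to pull such a ranked list (a sketch added for illustration, not from the original post): instead of taking the last column of fit$glmnet.fit$beta, which corresponds to the smallest lambda on the path, the coefficients at the cross-validated lambda.min can be extracted with coef() and sorted by absolute size. Note that glmnet reports coefficients on the original scale of the predictors, so ranking by magnitude is only meaningful if the n-gram columns are on comparable scales (e.g. binary indicators).

    # Sketch: coefficients at the cross-validated lambda.min, ranked by |beta|
    cf  <- coef(fit, s = "lambda.min")           # sparse column vector incl. intercept
    cf  <- cf[-1, , drop = FALSE]                # drop the intercept row
    nz  <- which(cf[, 1] != 0)                   # predictors kept by the lasso
    top <- nz[order(abs(cf[nz, 1]), decreasing = TRUE)]
    head(rownames(cf)[top], 50)                  # names of the top-50 predictors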
My questions are:
glmnet picks one good predictor out of a bunch of highly correlated good predictors, so I am not sure how much I can rely on the betas returned by the above model run. Should I manually sample the data (say 10 times), run the above model each time, get the list of predictors with the top betas, and then keep those which are present in all 10 repetitions? Is there any standard way of doing this? What is the standard way of sampling in this case?
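One way to implement the repeated-sampling idea just described is sketched below: refit cv.glmnet on random subsets of the rows and count how often each predictor gets a nonzero coefficient at lambda.min (an informal version of what the stability-selection literature formalizes). The number of repetitions (10) and the subsample fraction (0.8) are arbitrary illustration choices.

    # Sketch: selection frequency over repeated subsamples of the rows
    set.seed(1)
    n_rep  <- 10
    counts <- setNames(numeric(ncol(X)), colnames(X))   # selection tally per predictor
    for (r in seq_len(n_rep)) {
      idx <- sample(nrow(X), size = floor(0.8 * nrow(X)))    # random 80% of the rows
      f   <- cv.glmnet(X[idx, ], y[idx], family = "binomial",
                       type.measure = "class")
      b   <- coef(f, s = "lambda.min")[-1, 1]                # drop the intercept
      counts <- counts + (b != 0)                            # tally nonzero betas
    }
    sort(counts, decreasing = TRUE)[1:50]                    # most stably selected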
My other question is about cvm (the cross-validation error) returned by the above model. Since I use type.measure = "class", cvm gives the misclassification error for different values of lambda. How do I report the misclassification error for the entire model? Is it the cvm corresponding to lambda.min?
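(For reference, the quantities involved are directly available on the cv.glmnet object; a minimal lookup, assuming fit is the fitted object from the code above.)

    err_min <- fit$cvm[fit$lambda == fit$lambda.min]   # CV misclassification error at lambda.min
    err_1se <- fit$cvm[fit$lambda == fit$lambda.1se]   # error at the more conservative lambda.1se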
Best Answer
- Set alpha = 0 in cv.glmnet() to use ridge instead of lasso (a minimal sketch follows this list). "It is known that the ridge penalty shrinks the coefficients of correlated predictors towards each other while the lasso tends to pick one of them and discard the others." (glmnet manual)
- You are already sampling the data by using cv.glmnet() (as opposed to simply using glmnet()).
- It is my understanding that for each lambda you have a model, so lambda.min is the lambda value for the model with the lowest cross-validation error.
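A minimal sketch of the alpha = 0 suggestion from the first bullet, mirroring the call in the question; with the ridge penalty no coefficient is set exactly to zero, so "importance" would come from the magnitude of the shrunken betas rather than from variable selection.

    # Sketch: ridge fit (alpha = 0) shrinks correlated n-grams together
    ridge_fit <- cv.glmnet(X, y, family = "binomial",
                           type.measure = "class", alpha = 0)
    ridge_cf  <- coef(ridge_fit, s = "lambda.min")[-1, 1]    # drop the intercept
    head(sort(abs(ridge_cf), decreasing = TRUE), 50)         # top 50 by |beta|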
User Jason has example code posted in another question that I believe will help: https://stats.stackexchange.com/a/92167
Similar Posts:
- Solved – How to report most important predictors using glmnet
- Solved – Interpreting glmnet cox coefficients
- Solved – How to interpret glmnet
- Solved – Why does lambda.min value in glmnet tuning cross-validation change, when repeating test