I am new to bioinformatics and machine learning. I am trying to predict a disease outcome, using cv.glmnet to choose the best lambda for the prediction. My problem is that the outcome groups are uneven (30 samples for outcome 0 and 14 samples for outcome 1). Therefore, in a 10-fold CV (or even a 5-fold one), there is a high probability of getting folds that contain only one outcome.
Does cv.glmnet take this difference in group sizes into account (since the outcome vector is specified) and always randomly pick samples from both groups? If not, what is the best way to perform CV with uneven groups?
Thank you all.
Best Answer
glmnet does not take this into account when assigning cross-validation folds. If a fold has too few samples from one class, it will still fit the model, but it will throw a warning. In the example below we train a binomial classifier with 97 examples from class A and 3 from class B, making it impossible for most folds to contain an example of class B.
    library(glmnet)
    cv.glmnet(matrix(rnorm(200), ncol = 2), c(rep("A", 97), rep("B", 3)), family = "binomial", nfolds = 10)

The model is built, but it gives the following warning, repeated 11 times:

    In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, ... : one multinomial or binomial class has fewer than 8 observations; dangerous ground
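To get a feel for how often this matters with the class sizes from the question, here is a rough base R simulation. It assumes folds are drawn by shuffling a repeated sequence of fold labels, which is how cv.glmnet appears to assign them when foldid is not supplied; the helper fold_missing_minority is a hypothetical name used only for this sketch.

```r
# Estimate how often a purely random fold assignment leaves at least one
# fold with no sample from the minority class (base R only).
set.seed(1)                                  # arbitrary seed, for reproducibility
y <- c(rep(0, 30), rep(1, 14))               # class sizes from the question

# Hypothetical helper: TRUE if some fold contains no outcome-1 sample.
# Assumption: folds are assigned as a random permutation of 1..nfolds repeated.
fold_missing_minority <- function(y, nfolds) {
  foldid <- sample(rep(seq_len(nfolds), length.out = length(y)))
  minority_per_fold <- tabulate(foldid[y == 1], nbins = nfolds)
  any(minority_per_fold == 0)
}

# Fraction of 2000 random assignments with at least one such fold
p5  <- mean(replicate(2000, fold_missing_minority(y, 5)))
p10 <- mean(replicate(2000, fold_missing_minority(y, 10)))
cat("5-fold: ", p5, "\n10-fold:", p10, "\n")
```

The estimates depend on the seed and the number of replicates, so treat them as a rough indication rather than exact probabilities.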
If you want to protect against this, you can manually assign your cross-validation folds using the foldid parameter. For example, with a setup like the one you described, we could do the following:
    # make our example data
    x <- matrix(rnorm(88), ncol = 2)
    y <- c(rep(0, 30), rep(1, 14))
    nfold <- 5

    # assign folds evenly within each class using the modulus operator
    fold0 <- sample.int(sum(y == 0)) %% nfold
    fold1 <- sample.int(sum(y == 1)) %% nfold
    foldid <- numeric(length(y))
    foldid[y == 0] <- fold0
    foldid[y == 1] <- fold1
    foldid <- foldid + 1

    # perform cross-validation
    cv.glmnet(x, y, foldid = foldid, family = "binomial")
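Before passing foldid to cv.glmnet, it can be worth checking that every fold really contains both outcomes. The sketch below wraps the same modulus trick in a small function (stratified_foldid is a hypothetical name, not part of glmnet) and tabulates folds against outcomes using base R only.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility
y <- c(rep(0, 30), rep(1, 14))               # class sizes from the question
nfold <- 5

# Hypothetical helper: assign fold ids separately within each class,
# so every class is spread as evenly as possible over the folds.
stratified_foldid <- function(y, nfold) {
  foldid <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    foldid[idx] <- sample.int(length(idx)) %% nfold + 1
  }
  foldid
}

foldid <- stratified_foldid(y, nfold)

# Rows are folds, columns are outcomes; no cell should be zero
fold_table <- table(fold = foldid, outcome = y)
print(fold_table)
```

With 30 and 14 samples over 5 folds, each fold ends up with 6 samples of outcome 0 and either 2 or 3 samples of outcome 1.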
One other parameter you may want to consider is the weights parameter. It supplies per-observation weights, so it can be used to help balance the classifier by increasing the impact of a misclassification on the class with lower representation. To balance the misclassification error between the two classes, set the weight of each instance to 1 - (fraction of total) of its class.
    # calculate what fraction of the total each class has
    fraction <- table(y) / length(y)

    # assign 1 - that value to a "weights" vector
    weights <- 1 - fraction[as.character(y)]

    # make the model
    cv.glmnet(x, y, foldid = foldid, family = "binomial", weights = weights)
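As a quick sanity check of the weighting rule, the base R sketch below sums the weights within each class for the 30/14 example. The as.numeric call is an addition here, only to turn the table into a plain numeric vector, and class_weight is an illustrative name.

```r
y <- c(rep(0, 30), rep(1, 14))               # class sizes from the question

# Fraction of the total for each class, then 1 - fraction per observation
fraction <- table(y) / length(y)
weights <- as.numeric(1 - fraction[as.character(y)])

# Total weight carried by each class; the two totals should match
class_weight <- tapply(weights, y, sum)
print(class_weight)
```

Both classes carry the same total weight (30 * 14/44 and 14 * 30/44). Note that this exact balance relies on there being only two classes; with more classes, 1 - fraction would not generally equalize the totals.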