Solved – Do average coefficients in k fold cross validation resemble coefficients when trained on entire set

If you perform, say, 10-fold CV with logistic regression and then average the coefficient vectors from the folds, does that average roughly equal the coefficient vector you would get by fitting a logistic regression to the whole dataset?

Here is an example using an sklearn logistic regression that performs a random search over C on a very small dataset (1000 observations) with 20% positive cases. There are 5 features. The y axis is the coefficient value. One set of red lines shows the coefficients fitted on the full set; the others show the coefficients over the folds (it should be clear which are the medians).


In general, no: they don't need to be the same, although it is often desirable. Models for which all surrogate models and the model fitted on the whole data set have (roughly) equal parameter vectors earn the label stable.

Long explanation: one way of defining stability is to look at the variance of the model parameters when the training data is slightly changed (perturbed). The differences between the training sets of the surrogate models in cross validation are one example of such a perturbation.
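A minimal sketch of this kind of check, assuming scikit-learn and a synthetic stand-in for the data in the question (`make_classification` with 1000 observations, 5 features and roughly 20% positives; `fit_coefs` is a hypothetical helper, and a fixed `C` replaces the random search). It fits one surrogate model per fold and compares the mean and spread of the fold coefficients with the coefficients of the fit on all data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the question's data (assumption, not the original data).
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.8, 0.2], random_state=0,
)

def fit_coefs(X_train, y_train, C=1.0):
    """Hypothetical helper: fit a logistic regression, return its coefficients."""
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    return model.coef_.ravel()

# Model trained on the whole data set.
full_coefs = fit_coefs(X, y)

# One surrogate model per fold, each trained on 9/10 of the data.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_coefs = np.array(
    [fit_coefs(X[train_idx], y[train_idx]) for train_idx, _ in cv.split(X, y)]
)

mean_coefs = fold_coefs.mean(axis=0)        # average coefficient vector
std_coefs = fold_coefs.std(axis=0, ddof=1)  # spread across surrogate models

print("full fit :", np.round(full_coefs, 3))
print("fold mean:", np.round(mean_coefs, 3))
print("fold std :", np.round(std_coefs, 3))
```

A small spread across folds, together with a fold mean close to the full fit, is what the stability notion above would look for; how small counts as small depends on the application.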

Now, for cross validation, typically two assumptions are used:

  1. The surrogate models are assumed to be equal (or equivalent) to the model trained on the whole data set.
    This assumption, however, is often noticeably violated, leading e.g. to the well-known pessimistic bias of cross validation.
  2. In that case, the second, weaker assumption is that the surrogate models are at least equal (or equivalent) to each other, so it is permitted to take the average of the figure of merit observed for all surrogate models (and use this as an approximation to that figure of merit measured on unknown cases for the model fit on all data).

Frequently used figures of merit like accuracy or mean squared error just need stable predictions, which may still be achieved with certain changes in the model parameters – so this equivalence of the surrogate models (= equal predictions, no restriction on the parameters) is a weaker condition than asking for models that have equal parameters.
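To look at the stability of predictions rather than parameters, one option is to let every surrogate model predict the same fixed set of cases and measure how much those predictions vary. The sketch below assumes scikit-learn and synthetic data; the held-out "probe" set and all variable names are illustrative choices, not part of the original question:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.8, 0.2], random_state=0,
)

# Fixed probe set that no surrogate model sees during training.
X_cv, X_probe, y_cv, y_probe = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
coefs, probs = [], []
for train_idx, _ in cv.split(X_cv, y_cv):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_cv[train_idx], y_cv[train_idx])
    coefs.append(model.coef_.ravel())
    probs.append(model.predict_proba(X_probe)[:, 1])

coefs = np.array(coefs)  # shape (n_folds, n_features)
probs = np.array(probs)  # shape (n_folds, n_probe_cases)

# Parameter stability: spread of each coefficient across surrogate models.
coef_spread = coefs.std(axis=0, ddof=1)
# Prediction stability: spread of each probe case's predicted probability.
prob_spread = probs.std(axis=0, ddof=1)
# Fraction of probe cases on which all surrogate models predict the same label.
labels = probs >= 0.5
label_agreement = np.mean(np.all(labels == labels[0], axis=0))

print("coefficient std per feature:", np.round(coef_spread, 3))
print("median std of predicted probability:", np.round(np.median(prob_spread), 4))
print("fraction of probe cases with identical labels:", label_agreement)
```

Accuracy only depends on the predicted labels, so a high label agreement can coexist with visibly different coefficient vectors.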

Also, stability can be discussed in terms of parameters or in terms of predictions. Looking at the stability of the predictions is usually sensible (as long as you are looking at predictive models). Stability of the parameters is more difficult: you need to take into account the characteristics of your data (in particular, correlated variates) and the characteristics of your model. For example, loadings can flip in PCR or PLS, and you need to decide whether a rotation counts as unstable or not; LDA is invariant to rotation, translation and flipping, etc. However, logistic regression should be fine in this respect.
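The point about correlated variates can be explored with a small simulation. This is only a sketch under stated assumptions: the data are simulated with NumPy, `x2` is constructed as a near copy of `x1`, and a large `C` is chosen so that the penalty is weak. It compares the spread of the two individual coefficients across folds with the spread of their sum:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n = 1000

# Simulated data (assumption): x2 is almost a copy of x1, x3 is independent.
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Binary outcome drawn from a logistic model with a negative intercept.
logits = -1.5 + 1.0 * x1 + 1.0 * x2 + 0.5 * x3
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_coefs = []
for train_idx, _ in cv.split(X, y):
    # Large C means weak regularisation, so collinearity is not smoothed away.
    model = LogisticRegression(C=1e4, max_iter=5000)
    model.fit(X[train_idx], y[train_idx])
    fold_coefs.append(model.coef_.ravel())
fold_coefs = np.array(fold_coefs)  # shape (n_folds, 3)

# Spread of the two correlated coefficients individually ...
std_individual = fold_coefs[:, :2].std(axis=0, ddof=1)
# ... and of their sum, which is what the predictions mostly depend on here.
std_sum = fold_coefs[:, :2].sum(axis=1).std(ddof=1)

print("std of coef(x1), coef(x2):", np.round(std_individual, 3))
print("std of coef(x1) + coef(x2):", np.round(std_sum, 3))
```

If the individual coefficients vary much more than their sum, the parameters look unstable even though the combined effect, and hence the predictions, change little.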
