I have a Hidden Markov Model for binary classification and two datasets:
- positive instances
- negative instances (way more data than the positive ones)
In order to evaluate the performance of the model I did the following:
1. Do leave-one-out cross-validation over the positive instances: remove one instance from the positive set, train on the rest, then evaluate the removed instance and save the result. Repeat for each instance.
2. Train on all the positive instances and then evaluate each negative instance. Save the results.
3. Plot the ROC curve with the data from 1 and 2.
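To make the procedure above concrete, here is a minimal sketch in plain Python. The `train` and `score` callables are hypothetical placeholders for whatever fits the HMM and returns a score for one instance (for example a log-likelihood); they are not part of any particular library. The ROC points are computed by sweeping a threshold over the pooled scores.

```python
def loo_scores(positives, negatives, train, score):
    """Collect scores as described: leave-one-out over the positives,
    then one model trained on all positives to score the negatives.
    `train` and `score` are hypothetical user-supplied callables."""
    pos_scores = []
    for i, held_out in enumerate(positives):
        model = train(positives[:i] + positives[i + 1:])  # N trainings
        pos_scores.append(score(model, held_out))
    full_model = train(positives)  # the (N+1)-th training
    neg_scores = [score(full_model, x) for x in negatives]
    return pos_scores, neg_scores


def roc_points(pos_scores, neg_scores):
    """Return (FPR, TPR) pairs, assuming a higher score means 'more positive'."""
    scored = [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores]
    scored.sort(key=lambda t: t[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(scored):
        threshold = scored[i][0]
        # Consume all instances tied at this threshold before emitting a point.
        while i < len(scored) and scored[i][0] == threshold:
            if scored[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / len(neg_scores), tp / len(pos_scores)))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

Note that the positive scores come from N slightly different models while the negative scores come from a single model, so pooling them on one curve assumes the scores are comparable across those models.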
This approach is pretty time intensive as I have to train my model N+1 times where N equals the number of positive instances.
Someone suggested that I combine both data sets and then divide them:
- 2/3 training set
- 1/3 evaluation set
and maintain in both sets the same percentage of positive/negative instances.
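As an illustration of the suggested split, the sketch below divides the combined data into 2/3 training and 1/3 evaluation while keeping the positive/negative ratio the same in both parts. The function name, the `seed` parameter, and the index-based return value are assumptions made for this example.

```python
import random


def stratified_split(labels, test_fraction=1 / 3, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Take the same fraction of every class for the evaluation set.
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)
```

With 30 positives and 300 negatives this yields 10 positives and 100 negatives in the evaluation set, so the model is trained only once, at the cost of fewer points on the ROC curve.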
Maybe I understood something wrong but I am a bit confused as to how this helps exactly when I have negative instances in the training data!?
Wouldn't that negatively bias my classifier when evaluating over the remaining 1/3 of the instances? Moreover, wouldn't I also get fewer data points for the ROC curve?
Can anybody help clarify the approach or suggest a better one?
Classifiers usually try to find the best fit over all the data. When the data are imbalanced and there are many more negative than positive samples, the classifier pays more attention to the negative class in order to keep the overall error small. Imbalance can be intrinsic or extrinsic: intrinsic imbalance is a direct result of the nature of the data space (e.g. rare diseases), whereas extrinsic imbalance results from practical limitations (time, space, money, etc.) on data that are not imbalanced in reality. It can also happen that only the training set or only the test set is imbalanced. Personally, I would start with stratified cross-validation, which ensures that the ratio between the positive and negative classes is the same in every fold and matches the ratio in the overall data set.
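A minimal sketch of stratified fold assignment, assuming plain Python and a hypothetical helper name `stratified_folds`: the indices of each class are shuffled and dealt round-robin into k folds, so every fold gets roughly the same class ratio as the full data set.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Return k lists of indices, each preserving the overall class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal this class's indices out one fold at a time.
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds
```

Each fold then serves once as the evaluation set while the other k-1 folds are used for training, which means k trainings instead of N+1.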
There are several methods that address the imbalance itself. A simple one is to increase the weight of samples from the positive class relative to the negative class, which makes the classifier cost-sensitive. An introduction to all the available methods can be found in
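One way to picture the class-weighting idea is the sketch below. It assumes inverse-frequency weights, w_c = n / (n_classes * n_c), which is a common heuristic and not necessarily what the answer had in mind; the function names are made up for this example.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (n_classes * n_c), so rare classes weigh more."""
    counts = Counter(labels)
    n = len(labels)
    return {c: n / (len(counts) * n_c) for c, n_c in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Misclassification rate where each sample counts with its class weight."""
    total = sum(weights[y] for y in y_true)
    wrong = sum(weights[y] for y, p in zip(y_true, y_pred) if y != p)
    return wrong / total
```

With 10 positives and 90 negatives, a classifier that always predicts the negative class has a plain error of 0.1 but a weighted error of about 0.5, so ignoring the positive class is no longer cheap.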