I'm working on a random forest classifier and using 10-fold CV to estimate the hyperparameter 'mtry' (chosen by maximizing AUROC). I pre-split the training set into 8 equally sized subsets (about 100 observations each).
I then ran the training process 8 times, adding one subset to the current training set at each iteration. So the first model was fitted on a training set of 100 observations, the second on 200, and the last one on the entire original training set (about 800 observations).
At each iteration I measured the misclassification error (FP + FN, as a percentage of all predictions) on both the training and the test set.
Finally, I plotted the learning curve: the x axis shows the training set size, the y axis the misclassification error in % (black for the training set, red for the test set).
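For reference, the procedure described above can be sketched in Python with scikit-learn (an assumption on my part; `max_features` is the scikit-learn analogue of R's `mtry`, and synthetic data stands in for the real ~800-observation training set):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumed shapes, for illustration only)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=200, random_state=0)  # 800 train / 200 test

sizes = range(100, 801, 100)  # 8 nested training sets: 100, 200, ..., 800
train_err, test_err = [], []
for n in sizes:
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_train[:n], y_train[:n])
    # misclassification error (FP + FN) as a percentage of all predictions
    train_err.append(100 * (1 - rf.score(X_train[:n], y_train[:n])))
    test_err.append(100 * (1 - rf.score(X_test, y_test)))
```

Plotting `train_err` (black) and `test_err` (red) against `sizes` reproduces the kind of learning curve described.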
The gap between training and test performance suggests to me that the model has high variance (overfitting), but I really cannot understand: why does the training error start so high, then suddenly drop, then rise again as the training set size increases?
If I have an overfitting problem, how can I address it (close the gap) if I cannot obtain more training examples?
I don't know whether you split your dataset randomly (i.e., each subset receives a random selection of observations) or not; I will assume the split is random.
why does the training error start so high, then suddenly drop, then start to rise again as training set size increases?
This is most likely just noise, caused by the small size of your training and test sets as well as the randomness inherent in the random forest model. With only 100 observations, a single misclassified case moves the error estimate by a full percentage point.
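To see how large that noise can be: a misclassification rate estimated from n observations is a binomial proportion, so its standard error is sqrt(p(1 - p)/n). A quick check (the 20% true error rate is an assumed figure, purely for illustration):

```python
import math

def error_se(p, n):
    """Standard error, in percentage points, of a misclassification
    rate p estimated from n observations (binomial approximation)."""
    return 100 * math.sqrt(p * (1 - p) / n)

se_100 = error_se(0.20, 100)  # first iteration, 100 observations: ~4.0 points
se_800 = error_se(0.20, 800)  # last iteration, 800 observations: ~1.4 points
```

So at the left end of your learning curve, swings of several percentage points are entirely consistent with chance, which is enough to produce the high-drop-rise pattern you see.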
As for the gap between training and test error: such a gap is common for a model as complex as a random forest fitted to only 800 training examples (how many explanatory variables do you have, by the way?).
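One way to narrow the gap without more data is to constrain the individual trees. A sketch in scikit-learn terms (my assumption for the tooling; `max_features` stands in for mtry, and the parameter values are illustrative, not recommendations): tune complexity-limiting parameters such as `min_samples_leaf` alongside mtry, still selecting by AUROC with 10-fold CV.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real ~800-observation training set
X, y = make_classification(n_samples=800, n_features=20, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={
        "max_features": [2, 4, 8],       # the mtry analogue
        "min_samples_leaf": [1, 5, 10],  # larger leaves = less variance
    },
    scoring="roc_auc",  # select by AUROC, as in the question
    cv=10,
)
grid.fit(X, y)
```

Larger leaves (or a `max_depth` cap) make each tree smoother, which typically raises training error slightly but shrinks the train/test gap.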