I have a small set of labeled data (diagnoses of individual subjects):
- ~50 "sick" observations
- ~100 "healthy" observations
In reality, only ~1% of observations are expected to be "sick".
I have 10-30 variables (I'm still working on them), some of which are related to each other, so I would prefer a classifier that can take non-linear functions of the variables into account, or that can handle a large number of variables (in which case I would simply define additional variables for the relations I think might be explanatory).
Unfortunately, it could be that the classification is independent of all of the variables, so I need to be careful about overfitting.
Since the training set is really small, running time is not important. I can also run several different methods and make a decision based on all of them.
A method that can identify uncertain cases is preferable to one that just classifies. In case of uncertainty, it should return some value indicating the level of certainty, or at least a warning flag.
Given these considerations and constraints, which classifier(s) would be the best choice?
What method can I use to combine the results of different algorithms into a final classification?
Update:
I'm looking for answers that suggest a specific method.
I'm using Matlab, so existing solutions implemented in Matlab are preferred, but that's not a must. I have some knowledge of Matlab, but almost no knowledge of statistics or machine learning. Please describe an actual method that uses several different algorithms/ensembles and decides on a final classification based on them.
Best Answer
Standard candidates that fit your requirements are ensembles, which are available in Matlab. Most likely you should try ensembles of decision trees (such as Random Forests) or of linear classifiers, using boosting or bagging together with Matlab's fitensemble function.
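For example, a bagged ensemble of decision trees (essentially a Random Forest) can be built with the TreeBagger class from the Statistics Toolbox. This is just a minimal sketch; X, Y and the number of trees are placeholders you would replace with your own data and settings:
% Minimal sketch of a bagged tree ensemble (Random Forest style).
% X: N-by-P matrix of predictors, Y: N-by-1 vector of labels (1 = sick, 0 = healthy).
nTrees = 200;                              % number of trees; vary as needed
forest = TreeBagger(nTrees, X, Y);
[predLabels, scores] = predict(forest, X); % predLabels is a cell array of label strings,
                                           % scores approximate per-class probabilities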
Furthermore, Support Vector Machines are quite often found among the top-performing classifiers.
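If your Matlab version is recent enough (fitcsvm requires R2014a or newer; older versions offer svmtrain instead), a non-linear SVM could be fitted roughly as follows. Again only a sketch, with X and Y as above:
% Minimal SVM sketch with a radial basis function kernel.
SVMModel = fitcsvm(X, Y, 'KernelFunction', 'rbf', 'Standardize', true);
CVSVM = crossval(SVMModel);     % 10-fold cross-validation by default
svmErr = kfoldLoss(CVSVM);      % cross-validated misclassification rate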
Finally, Naive Bayes is worth a shot: it is simple but sometimes yields surprisingly good results.
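A rough sketch for Naive Bayes, under the same assumptions about X and Y (fitcnb requires R2014b or newer; older versions provide NaiveBayes.fit instead):
% Minimal Naive Bayes sketch (Gaussian conditionals by default).
NBModel = fitcnb(X, Y);
CVNB = crossval(NBModel);       % 10-fold cross-validation by default
nbErr = kfoldLoss(CVNB);        % cross-validated misclassification rate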
Make sure you do not just look at accuracy when comparing the results, but also at the receiver operating characteristic (ROC) curve and its area under the curve (AUC).
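You can compute the ROC curve and AUC with perfcurve. As a sketch, assuming NBModel from above and that the "sick" class is labelled 1 (check NBModel.ClassNames to see which score column belongs to it):
% ROC curve and AUC from per-class scores.
[~, scores] = predict(NBModel, X);                    % second column: posterior for class 1
[rocX, rocY, ~, AUC] = perfcurve(Y, scores(:, 2), 1); % 1 = positive ("sick") class
plot(rocX, rocY), xlabel('False positive rate'), ylabel('True positive rate')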
All these techniques are available in Matlab, provided you have the right toolboxes (the Statistics Toolbox should suffice) and your version is not too old. Be prepared that others here will have even more suggestions for you, like logistic regression etc. 😉
Here is a paper that compares several classification algorithms, in case you are interested: http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf
Update:
Some more concrete suggestions for your next steps: run
Ensemble = fitensemble(X,Y,Method,NLearn,Learner)
Set Method to 'AdaBoostM1', NLearn to 20 (you can vary that number) and Learner to 'Tree' (or 'Discriminant'). X is your training data and Y is 1 for every data point belonging to a sick subject and 0 if the subject is healthy. You can extend this approach so that it also performs cross-validation for you:
Ensemble = fitensemble(X,Y,Method,NLearn,Learner,'CrossVal','On')
Use the same parameters as above. This gives you a cross-validated model (i.e. an estimate of how well your model actually generalises). To query the estimated error of the model, use the following method:
Loss = kfoldLoss(Ensemble,'lossfun','classiferror')
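If you also want the per-subject certainty you asked about, one option (again only a sketch, with the threshold left as a placeholder you have to choose yourself) is to look at the cross-validated scores rather than only the aggregate error. For AdaBoost, scores close to zero mean the ensemble is unsure, so such observations can be flagged:
% Cross-validated label and score for every observation.
[predLabels, scores] = kfoldPredict(Ensemble);  % column order follows Ensemble.ClassNames
% Flag observations whose score magnitude is small, i.e. near the decision boundary.
someThreshold = 0.1;                            % placeholder; choose based on your data
uncertain = abs(scores(:, 2)) < someThreshold;  % logical flag per observation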