I want to use ensemble classifiers to classify 300 samples (15 positive and 285 negative, i.e. binary classification). I extracted 18 features from these samples; all of them are numerical, and some of the features are correlated.

I am new to MATLAB. I tried using `fitensemble`, but I don't know which method to use: `AdaBoostM1`, `LogitBoost`, `GentleBoost`, `RobustBoost`, `Bag` or `Subspace`.

Since the number of features is 18, I don't know whether boosting algorithms can help me. I am also unsure about the number of learners: how many are suitable for this problem, and how can I get the optimal classification?

I would appreciate your help.


#### Best Answer

This MATLAB documentation page gives a pretty comprehensive answer.

The following is my conclusion, assuming that your samples are i.i.d. (!), that you have no label noise (samples with the wrong label), and that you will not have many more samples in the application:

Use `AdaBoostM1`.

```matlab
X = randn(300,18); Y = X*randn(18,1) > 4;   % synthetic placeholder; use your data instead

% Think seriously about these values: 100 is the cost of classifying a
% positive sample as negative, 10 is the cost of the opposite error.
cost = [0 100; 10 0];

% MinLeaf determines the individual tree size: the higher the value, the
% smaller the trees. length(Y)/2 means only one split.
MinLeaf = length(Y)/2;
nTrees  = 200;

ens = fitensemble(X, Y, 'AdaBoostM1', nTrees, ...
    ClassificationTree.template('MinLeaf', MinLeaf), ...
    'nprint', 1, 'crossval', 'on', 'kfold', 5, ...
    'cost', cost, 'classnames', [true, false]);

% Folds may train fewer than nTrees learners, so pad with NaN.
cumlossTrain = nan(nTrees, ens.KFold);
cumlossCV    = nan(nTrees, ens.KFold);

figure; hold on;
for i = 1:ens.KFold
    n     = ens.NTrainedPerFold(i);
    model = ens.Trainable{i};
    trIdx = training(ens.Partition, i);     % training rows of fold i
    teIdx = test(ens.Partition, i);         % held-out rows of fold i

    % Cumulative loss: entry k is the loss using the first k trees only.
    cumlossTrain(1:n,i) = loss(model, ens.X(trIdx,:), ens.Y(trIdx), 'mode', 'cumulative');
    lossTrain(1,i)      = loss(model, ens.X(trIdx,:), ens.Y(trIdx));
    cumlossCV(1:n,i)    = loss(model, ens.X(teIdx,:), ens.Y(teIdx), 'mode', 'cumulative');
    lossCV(1,i)         = loss(model, ens.X(teIdx,:), ens.Y(teIdx));

    scatter(1:n, cumlossTrain(1:n,i), 'r'); % red: training loss
    scatter(1:n, cumlossCV(1:n,i), 'b');    % blue: cross-validation loss
end
```

Use the graph to determine whether more trees would improve classification (look at the blue dots, which show the cross-validation loss as a function of the number of trees).

Now vary `MinLeaf` or `nTrees` to increase or decrease model complexity. If your model is too simple, the blue and red dots will overlap; if it is too complex, the training error will be zero. Find good values by plotting `lossCV` and `lossTrain` against `MinLeaf`. Note that you do not need to find the exact optimum; differences at that scale will be noise anyway.
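The sweep over `MinLeaf` could look roughly like the sketch below. It is not the only way to do it: the grid values, `nTrees = 100` and the fixed seed are hypothetical choices for illustration, the data is the same synthetic placeholder as above, and `templateTree` is assumed to be available (it is the newer equivalent of `ClassificationTree.template`).

```matlab
% Sketch: sweep MinLeaf and compare training loss with 5-fold CV loss.
rng(1);                                  % fixed seed so the run is repeatable
X = randn(300,18);                       % synthetic placeholder features
Y = X*randn(18,1) > 4;                   % synthetic placeholder labels
cost = [0 100; 10 0];                    % illustrative costs, as in the answer
nTrees = 100;                            % hypothetical ensemble size
minLeafGrid = [5 10 20 40 80];           % hypothetical grid of leaf sizes
cvp = cvpartition(Y, 'KFold', 5);        % same stratified folds for every grid value

lossTrain = zeros(size(minLeafGrid));
lossCV    = zeros(size(minLeafGrid));
for j = 1:numel(minLeafGrid)
    t   = templateTree('MinLeaf', minLeafGrid(j));
    ens = fitensemble(X, Y, 'AdaBoostM1', nTrees, t, ...
        'Cost', cost, 'ClassNames', [true false]);
    lossTrain(j) = resubLoss(ens);                 % loss on the training data
    cvens        = crossval(ens, 'CVPartition', cvp);
    lossCV(j)    = kfoldLoss(cvens);               % loss averaged over held-out folds
end

figure;
plot(minLeafGrid, lossTrain, 'r-o', minLeafGrid, lossCV, 'b-o');
xlabel('MinLeaf'); ylabel('classification loss');
legend('training', 'cross-validation');
```

Reusing one `cvpartition` for every grid value keeps the folds identical, so differences between grid points are not caused by a different random split.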

To improve classification further, find out whether more features or more samples would help more: remove some of the existing features or samples and observe the effect (again using plots), then add more of whichever matters most.
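One way to run the "remove samples and observe the effect" experiment is a simple learning curve, sketched below under assumptions: the subset sizes, `MinLeaf = 20` and `nTrees = 100` are hypothetical, the data is the synthetic placeholder from above, and `templateTree` is assumed to be available. The same loop works for features if you select columns (for example `X(:,1:k)`) instead of rows.

```matlab
% Sketch: 5-fold CV loss as a function of the number of samples used.
rng(1);                                  % fixed seed so the run is repeatable
X = randn(300,18);                       % synthetic placeholder features
Y = X*randn(18,1) > 4;                   % synthetic placeholder labels
cost = [0 100; 10 0];                    % illustrative costs, as in the answer
t = templateTree('MinLeaf', 20);         % hypothetical tree size
nTrees = 100;                            % hypothetical ensemble size
sizes = [100 150 200 250 300];           % hypothetical subset sizes
perm = randperm(size(X,1));              % one random order, so subsets are nested

lossBySize = zeros(size(sizes));
for j = 1:numel(sizes)
    idx = perm(1:sizes(j));              % keep only the first sizes(j) samples
    cvens = fitensemble(X(idx,:), Y(idx), 'AdaBoostM1', nTrees, t, ...
        'KFold', 5, 'Cost', cost, 'ClassNames', [true false]);
    lossBySize(j) = kfoldLoss(cvens);    % CV loss for this subset size
end

figure;
plot(sizes, lossBySize, 'b-o');
xlabel('number of samples'); ylabel('cross-validation loss');
```

If the curve is still falling at the full sample size, more samples are likely to help; if it has flattened out, better features are probably the more useful investment.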

To get a final classifier, stop doing CV for training and use all the data you have. Get a new, truly independent test set if you want or need to report error rates. This is the only way to ensure unbiased error estimates, because as soon as you use a test set twice it is contaminated.
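As a rough sketch of that last step: train once on all the data with the settings chosen by CV, then evaluate a single time on data the model has never seen. The values `bestMinLeaf` and `bestNTrees` are hypothetical stand-ins for whatever your plots suggest, and `Xtest`/`Ytest` here are freshly generated synthetic data standing in for a real independent test set.

```matlab
% Sketch: final model on all data, evaluated once on independent data.
rng(1);                                  % fixed seed so the run is repeatable
w = randn(18,1);                         % synthetic "true" weights (placeholder)
X = randn(300,18);  Y = X*w > 4;         % placeholder training data
cost = [0 100; 10 0];                    % illustrative costs, as in the answer

bestMinLeaf = 20;                        % hypothetical value chosen via CV
bestNTrees  = 100;                       % hypothetical value chosen via CV

t = templateTree('MinLeaf', bestMinLeaf);
finalEns = fitensemble(X, Y, 'AdaBoostM1', bestNTrees, t, ...
    'Cost', cost, 'ClassNames', [true false]);   % no cross-validation here

% Independent test set: used exactly once, only to report the error.
Xtest = randn(200,18);  Ytest = Xtest*w > 4;
testLoss = loss(finalEns, Xtest, Ytest);
[labels, scores] = predict(finalEns, Xtest);     % predicted classes and scores
fprintf('Test loss: %.3f\n', testLoss);
```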

You could also use `Bag`, use the out-of-bag (OOB) error instead of CV, and vary `MinLeaf`. In my experience the choice makes little difference.
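A minimal sketch of the bagging variant, again under assumptions: the grid and `nTrees = 200` are hypothetical, the data is the synthetic placeholder from above, and `templateTree` is assumed to be available. Note that `fitensemble` needs `'Type','classification'` when the method is `Bag`.

```matlab
% Sketch: bagged trees, using the out-of-bag error in place of cross-validation.
rng(1);                                  % fixed seed so the run is repeatable
X = randn(300,18);                       % synthetic placeholder features
Y = X*randn(18,1) > 4;                   % synthetic placeholder labels
cost = [0 100; 10 0];                    % illustrative costs, as in the answer
nTrees = 200;                            % hypothetical ensemble size
minLeafGrid = [1 5 10 20 40];            % hypothetical grid of leaf sizes

oobErr = zeros(size(minLeafGrid));
for j = 1:numel(minLeafGrid)
    t   = templateTree('MinLeaf', minLeafGrid(j));
    bag = fitensemble(X, Y, 'Bag', nTrees, t, 'Type', 'classification', ...
        'Cost', cost, 'ClassNames', [true false]);
    oobErr(j) = oobLoss(bag);            % error on samples each tree did not see
end

% OOB error versus number of trees, for the last grid value.
cumOob = oobLoss(bag, 'mode', 'cumulative');

figure;
subplot(1,2,1); plot(minLeafGrid, oobErr, 'b-o');
xlabel('MinLeaf'); ylabel('OOB loss');
subplot(1,2,2); plot(cumOob, 'b');
xlabel('number of trees'); ylabel('cumulative OOB loss');
```

The OOB error comes for free from the bootstrap samples, so no separate cross-validation loop is needed.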
