Is there any classifier that can natively support unbalanced datasets?
Or what best practices can you suggest for handling such datasets?
For example, I want to solve the task called "pedestrian detection". The classical approach uses a linear SVM, but it can't handle unbalanced datasets (lots of background examples, only a small number of positive examples, i.e. people). Maybe there is something better than SVM? (I already know about undersampling/oversampling and weighted SVM.)
It would be great if in answer you link to some scikit-learn classification algorithm.
Best Answer
Most classifiers in sklearn support unbalanced datasets through the sample_weight parameter of the clf.fit method. If you need to fit unbalanced data with a classifier that does not support this option, you can use sampling with replacement to enlarge the smaller class to match the larger one.
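The sampling-with-replacement fallback can be sketched with sklearn.utils.resample (the variable names and the toy data here are illustrative, not from the original answer):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X_majority = rng.randn(200, 2)        # 200 background examples (class 1)
X_minority = rng.randn(20, 2) + 4     # 20 positive examples (class 0)

# Upsample the minority class with replacement until it matches the majority
X_minority_up = resample(X_minority, replace=True,
                         n_samples=len(X_majority), random_state=0)

# Stack into a balanced training set
X_balanced = np.vstack([X_majority, X_minority_up])
y_balanced = np.array([1] * len(X_majority) + [0] * len(X_minority_up))
```

After this step, both classes contribute 200 examples, so an unweighted classifier trained on X_balanced no longer sees the class imbalance.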
Here is an adapted version of the sklearn SVM example demonstrating the sample_weight approach:
import numpy as np
from sklearn import svm

np.random.seed(0)

# 20 examples of class 0, 200 examples of class 1
X = np.r_[2 * np.random.randn(20, 2) - [2, 2],
          2 * np.random.randn(200, 2) + [2, 2]]
Y = [0] * 20 + [1] * 200

# weight each example inversely to its class size
wt = [1 / 20.] * 20 + [1 / 200.] * 200

# fit the model
clf = svm.SVC(kernel='linear')
clf.fit(X, Y, sample_weight=wt)
This question about unbalanced classification using RandomForestClassifier has some additional details.
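As a related option (not mentioned in the original answer), many sklearn estimators, including RandomForestClassifier, also accept a class_weight parameter, which reweights whole classes instead of individual samples. A minimal sketch on the same toy data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = np.r_[2 * rng.randn(20, 2) - [2, 2],
          2 * rng.randn(200, 2) + [2, 2]]
y = np.array([0] * 20 + [1] * 200)

# class_weight='balanced' sets weights inversely proportional to
# class frequencies, so the 20 positives count as much as the
# 200 negatives during training
clf = RandomForestClassifier(n_estimators=50,
                             class_weight='balanced',
                             random_state=0)
clf.fit(X, y)
```

This achieves the same effect as the hand-built sample_weight list above, without computing the weights yourself.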