# Solved – SVM heavily overfits the data (classifying highly unbalanced data)

I have a huge training set from which I am supposed to both classify and regress, i.e. one task is to classify whether an event will occur or not, and a second task is to regress the future intensity of the event.

The problem I am battling with is that there are very few positive instances in my training and test sets (2%, to be exact). As a result, whatever method I try, my precision and recall for the rarer class do not rise above 35% and 10% respectively. I have also tried using class weights or sample weights, to no avail. When I try an SVM using scikit-learn's `SVC`, it heavily overfits the data: it gives more than 90% accuracy for both classes on the training data but 0 precision and 0 recall on the test data. The regression problem suffers similarly: since there are a lot of 0's in the training set, the regressed values do not make any sense at all.

So my question is twofold: first, what could be the reason for the SVM to overfit the data? Second, what can I use to further increase the precision and recall on the rarer class? (I tried a random forest, which gives 62% precision and 55% recall; giving sample weights increases precision to 63% but drops recall.)

Even giving class 1 a weight of 100, i.e. `class_weight={1: 100}`, doesn't solve the problem.
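For reference, a minimal sketch of the setup described above, on a synthetic dataset with roughly 2% positives (the dataset and all parameter values here are illustrative, not the actual data from the question):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score

# Synthetic imbalanced problem: ~2% positive instances, as in the question.
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.98, 0.02], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Manual minority-class weight of 100, as attempted in the question.
clf = SVC(class_weight={1: 100}).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# zero_division=0 avoids a warning when the classifier never predicts class 1.
print("precision:", precision_score(y_te, pred, zero_division=0))
print("recall:   ", recall_score(y_te, pred, zero_division=0))
```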


The usual solution to imbalanced data is a class-weighted SVM, which has two misclassification penalties, $C_{pos}$ and $C_{neg}$, instead of one. You assign the higher misclassification penalty to the minority class. A common heuristic is to keep the ratio $$C_{pos} \times n_{pos} = C_{neg} \times n_{neg},$$ where $n_X$ is the number of instances in class $X$. In sklearn you can assign these penalties by scaling $C$ per class via the `class_weight` parameter of `SVC`.
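A short sketch of this heuristic on a synthetic imbalanced dataset (the data and parameter values are illustrative). Setting the minority weight to $n_{neg}/n_{pos}$ enforces the ratio above, and sklearn's `class_weight="balanced"` applies the same inverse-frequency rule automatically:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic imbalanced dataset with ~2% positives.
X, y = make_classification(n_samples=2000, weights=[0.98, 0.02], random_state=0)

n_neg, n_pos = np.bincount(y)

# Enforce C_pos * n_pos = C_neg * n_neg explicitly: with C_neg = 1,
# the minority penalty becomes C_pos = n_neg / n_pos.
manual_weights = {0: 1.0, 1: n_neg / n_pos}
clf_manual = SVC(C=1.0, class_weight=manual_weights).fit(X, y)

# Built-in shortcut: weights inversely proportional to class frequencies,
# which gives the same per-class penalty ratio as the heuristic above.
clf_auto = SVC(C=1.0, class_weight="balanced").fit(X, y)
```

With either variant, the minority class's effective $C$ grows as the imbalance grows, so the optimizer is no longer free to classify everything as the majority class.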