Say I have a dataset with this distribution

Class A: 10 Examples

Class B: 100 Examples

Class C: 1000 Examples

I am trying to build a classifier on it using a linear SVM. Setting aside all concerns about accuracy, I would like to understand the effect of weights on the SVM.

For example, I would give class A a weight of 1, class B a weight of 0.1, and class C a weight of 0.01.

How do these weights affect the accuracy of the results? In particular, how do they affect the output of the hinge loss?

Sorry if my question is badly phrased, as I am very new to machine learning.


#### Best Answer

First of all, just to be clear, specifying a class weight is the same as specifying that weight for each observation of that class. That is, the weight $w_A$ for class A is the same as $w_i=w_A$ for each observation $i$ such that $y_i=A$. In fact, that is what machine learning packages such as sklearn do: you give them class weights, and they assign that weight to each observation of the class.
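A minimal numpy sketch of that expansion, using the weights proposed in the question (the labels here are made-up placeholders):

```python
import numpy as np

# Hypothetical labels with the same kind of imbalance as the question
y = np.array(["A"] * 2 + ["B"] * 3 + ["C"] * 5)

# Class weights as proposed in the question
class_weight = {"A": 1.0, "B": 0.1, "C": 0.01}

# What the packages do internally: expand class weights into one weight per observation
sample_weight = np.array([class_weight[c] for c in y])
# -> one weight per row: 1.0 for the A rows, 0.1 for B, 0.01 for C
```

In sklearn, passing `class_weight` to `SVC`/`LinearSVC` should have the same effect as passing this `sample_weight` array to `fit`.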

Keeping this in mind helps make it clear what is going on.

In fact, adding weights to a class is the same as oversampling or undersampling every observation of that class (if weight > 1 or 0 < weight < 1, respectively). For instance, setting $w_A=2$ is the same as duplicating all observations of class A in your training set. The margin will therefore benefit that class.
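This equivalence can be checked numerically. A small sketch with made-up per-observation hinge losses (the numbers are arbitrary assumptions):

```python
import numpy as np

# Hypothetical per-observation hinge losses; the first two belong to class A
losses = np.array([0.5, 1.2, 0.0, 0.3, 0.8])
is_A = np.array([True, True, False, False, False])

# Weighted average loss with w_A = 2 (weight 1 for everyone else)
w = np.where(is_A, 2.0, 1.0)
weighted = np.sum(w * losses) / np.sum(w)

# Plain average loss after duplicating the class-A observations
dup = np.concatenate([losses, losses[is_A]])
duplicated = dup.mean()

print(np.isclose(weighted, duplicated))  # True
```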

This can be more clearly seen if you consider how the cost function is computed. The cost function in SVM is the average hinge loss over the observations, $L=\frac{1}{N}\sum_i l(i)$, where $l(i)=\max(0,\ 1-y_i(\mathbf{v}^T\mathbf{x}_i+b))$. When using weights, it is the weighted average hinge loss: $L=\frac{1}{\sum_i w_i}\sum_i w_i\,l(i)$.
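As a sketch, the weighted average hinge loss can be computed directly; the data below is a made-up toy example with labels in $\{-1,+1\}$:

```python
import numpy as np

def weighted_hinge_loss(v, b, X, y, w):
    """Weighted average hinge loss: (1 / sum w_i) * sum_i w_i * max(0, 1 - y_i (v.x_i + b))."""
    margins = y * (X @ v + b)
    per_obs = np.maximum(0.0, 1.0 - margins)
    return np.sum(w * per_obs) / np.sum(w)

# Tiny toy example
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
v = np.array([0.5, 0.5])
b = 0.0
w = np.array([2.0, 1.0, 1.0])  # e.g. the first point's class weighted 2x
print(weighted_hinge_loss(v, b, X, y, w))  # 0.125 = (2*0 + 1*0.5 + 1*0) / (2+1+1)
```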

As far as I know, there is no separate version of the hinge loss with weights because there is no need for one. As for the derivative, usually, $\frac{\partial L}{\partial \mathbf{v}}=\frac{1}{N}\sum_{i:\,l(i)>0} -y_i\mathbf{x}_i$ (the sum runs over the observations with nonzero hinge loss, since the loss is flat for the rest), and, when using weights, it becomes $\frac{\partial L}{\partial \mathbf{v}}=\frac{1}{\sum_i w_i}\sum_{i:\,l(i)>0} -w_iy_i\mathbf{x}_i$. I have implemented a linear SVM with weights using gradient descent with this derivative, with great success.
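The gradient-descent approach described above can be sketched as follows (the toy data and hyperparameters are my own assumptions, and the regularization term is omitted to keep the sketch minimal):

```python
import numpy as np

def fit_weighted_linear_svm(X, y, w, lr=0.1, epochs=200):
    """Minimize the weighted average hinge loss by (sub)gradient descent.
    Labels y must be in {-1, +1}; w holds the per-observation weights."""
    n, d = X.shape
    v = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ v + b)
        active = margins < 1.0                # observations with nonzero hinge loss
        coef = (w * active * -y) / np.sum(w)  # per-observation subgradient coefficients
        v -= lr * (coef @ X)                  # subgradient with respect to v
        b -= lr * np.sum(coef)                # subgradient with respect to b
    return v, b

# Separable toy data; the more heavily weighted class pulls the margin toward itself
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([2.0, 2.0, 1.0, 1.0])  # first class weighted 2x
v, b = fit_weighted_linear_svm(X, y, w)
print(np.sign(X @ v + b))  # should classify the toy points correctly
```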

(I have used $\mathbf{v}$ for the SVM's normal vector, so as not to confuse it with $w$, the sample weights.)