Consider the dual with no offset and no slack.

In the dual we have that for data points that are support vectors:

$$\alpha_t > 0$$

and that the constraint is satisfied with equality (since it is a support vector!):

$$y^{(t)}\sum^n_{t'=1} \alpha_{t'}y^{(t')}x^{(t')} \cdot x^{(t)} = 1$$

However, just because $\alpha_t > 0$, it's not obvious to me whether all the support vectors have the same value of $\alpha_t$. Do all of the support vectors have the same value of $\alpha_t$? I was looking for a mathematical justification or an intuition for my question.

My guess is that they would have the same $\alpha_t$ value in the absence of slack.

My other guess was that $\alpha_i$ intuitively indicates the importance of that point, right? (Maybe that's wrong.) If that's true, then not all support vectors have the same value of $\alpha_i$, because not all support vectors are essential support vectors. What I mean is that some support vectors lie on the margin boundary, but removing them does not necessarily change the margin (for example, consider the case where you have 3 support vectors in a line: you can remove the middle one and keep the same boundary). So do essential support vectors get larger values of $\alpha_t$?

Also, I was wondering what happens to the value of $\alpha_t$ in the case of slack and offset. I was guessing that points closer to the decision boundary get larger values of $\alpha$? However, are points inside the margin even considered support vectors?
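
To make the question concrete, here is a minimal numerical sketch of what I am asking (my own toy setup, assuming scikit-learn's `SVC`; note that, unlike the dual above, `SVC` always fits an offset $b$, and a finite $C$ gives a soft margin):

```python
# Toy check: are the alphas of the support vectors all equal, and are
# points inside the margin support vectors? (Assumed setup: sklearn SVC.)
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1, (30, 2)), rng.normal(1.5, 1, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)        # soft margin, C = 1
alphas = np.abs(clf.dual_coef_).ravel()            # dual_coef_ holds alpha_i * y_i
margins = y[clf.support_] * clf.decision_function(X[clf.support_])
for a, m in zip(alphas, margins):
    print(f"alpha = {a:.3f}   y_i f(x_i) = {m:.3f}")
# Support vectors with y_i f(x_i) < 1 lie inside the margin; their alpha is
# typically clipped at C. The printed alphas are generally not all equal.
```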


#### Best Answer

The cost function of the SVM is: $$\min_{\alpha,\xi,b}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i$$

where the $\xi_i$ are the slack variables ($0$ for hard margin), $\mathcal{S}$ is the set of support vectors, and $$\|\mathbf{w}\|^2 = \mathbf{w}^T\mathbf{w} = \sum_{i\in \mathcal{S}}\sum_{j\in \mathcal{S}} \alpha_i \alpha_j y_i y_j \kappa(\mathbf{x}_i,\mathbf{x}_j)$$
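
As a quick numerical sanity check of this identity, here is a minimal sketch (my own toy example, assuming scikit-learn's linear `SVC`, whose `dual_coef_` attribute stores $\alpha_i y_i$ over $\mathcal{S}$):

```python
# Verify ||w||^2 = sum_{i,j in S} alpha_i alpha_j y_i y_j k(x_i, x_j)
# for a linear kernel, where sklearn exposes w directly via coef_.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1, (30, 2)), rng.normal(1.5, 1, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
ay = clf.dual_coef_.ravel()                  # alpha_i * y_i over the support set
K = X[clf.support_] @ X[clf.support_].T      # linear-kernel Gram matrix on S
w = clf.coef_.ravel()                        # primal weights (linear kernel only)
print(w @ w, ay @ K @ ay)                    # both evaluate ||w||^2
```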

A couple of important properties can be derived from the Lagrangian (here with slack variables):

$$L_p = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n\xi_i - \sum_{i=1}^n\alpha_i\Big[y_i\big(\langle\mathbf{w},\varphi(\mathbf{x}_i)\rangle+b\big)-(1-\xi_i)\Big] - \sum_{i=1}^n\mu_i\xi_i.$$

By setting the partial derivative with respect to $\xi_i$ (the slack variables) to zero, we obtain the following equation in $\alpha$: $$\frac{\partial L_p}{\partial \xi_i}=0 \quad \rightarrow \quad \alpha_i=C-\mu_i, \quad \forall i$$ Since the multipliers $\mu_i$ are nonnegative, in the presence of slack variables all support values (the $\alpha$'s) are bounded between $0$ and $C$. They are *not* all equal.
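
To see these bounds numerically, a minimal sketch (again my own toy data with scikit-learn's `SVC`; overlapping classes force some slack):

```python
# With a soft margin, every support value satisfies 0 < alpha_i <= C, and
# margin violators sit exactly at the upper bound C.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)
alphas = np.abs(clf.dual_coef_).ravel()             # alpha_i of the support vectors
print(alphas.min() > 0, alphas.max() <= C + 1e-9)   # 0 < alpha_i <= C holds
print(np.isclose(alphas, C).sum(), "alphas sit exactly at the bound C")
```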

In the absence of slack variables (hard-margin SVM), the $\alpha$ values are not equal either, because $\alpha$ enters the objective function directly, as shown in the first two equations. In the hard-margin SVM, the cost function is a weighted quadratic function in $\alpha$, where the kernel evaluations between the corresponding points determine the weights. Since not all weights are equal, the $\alpha$ values aren't either.
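
This is easy to confirm empirically; a minimal sketch (my assumption: a large $C$ with well-separated classes approximates the hard-margin case):

```python
# Near-hard-margin fit: the surviving support vectors generally carry
# clearly different alpha values.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-3, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates hard margin
alphas = np.abs(clf.dual_coef_).ravel()
print(alphas)                                 # a few distinct, unequal values
```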

This is more apparent in the least-squares SVM (LS-SVM), where *all* training points become support vectors due to the problem formulation. One of the original methods proposed to obtain sparsity in LS-SVM involved pruning the support vectors with small $\alpha$ values.
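
For illustration, here is a minimal sketch of the LS-SVM dual, assuming the standard formulation as a linear system (the notation is mine, not from the original post):

```python
# LS-SVM classification dual: solve
#   [ 0      y^T            ] [ b     ]   [ 0 ]
#   [ y   Omega + I / gamma ] [ alpha ] = [ 1 ]
# with Omega_ij = y_i y_j k(x_i, x_j). Every training point gets an alpha,
# and in general none of them are exactly zero (hence no sparsity).
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (10, 2)), rng.normal(1, 1, (10, 2))])
y = np.array([-1.0] * 10 + [1.0] * 10)

gamma = 1.0
Omega = (y[:, None] * y[None, :]) * (X @ X.T)   # linear kernel
n = len(y)
A = np.block([[np.zeros((1, 1)), y[None, :]],
              [y[:, None], Omega + np.eye(n) / gamma]])
rhs = np.concatenate([[0.0], np.ones(n)])
sol = np.linalg.solve(A, rhs)
b, alpha = sol[0], sol[1:]
print(np.count_nonzero(np.abs(alpha) > 1e-9), "of", n, "alphas are nonzero")
```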
