I am currently following CS229 and I'm trying to be diligent, proving most of the things that are not immediately obvious. So I'm looking at the derivation of the SVM as the dual problem to the optimal margin classifier.
Let $x^{(i)}, y^{(i)}$ be the features and labels, $i = 1, \ldots, m$. The optimal margin classifier problem is then defined as:
$$
\min_{w,b} \quad \frac{1}{2} ||w||^2 \\
\text{s.t.} \quad y^{(i)}(w^Tx^{(i)} + b) \ge 1, \quad i = 1, \ldots, m
$$
Now we construct the generalized Lagrangian (that allows for inequality constraints) as:
$$
\mathcal{L}(w,b,\alpha) = \frac{1}{2} ||w||^2 - \sum_{i=1}^m \alpha_i \left[ y^{(i)}(w^Tx^{(i)} + b) - 1 \right]
$$
Using the KKT conditions, we compute the derivatives with respect to $w$ and $b$, set them to zero, and substitute the results back into the formula above (the stationarity conditions are written out after the dual), which gives this dual problem:
$$
\max_\alpha \quad \mathcal{L}(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m y^{(i)} y^{(j)} \alpha_i \alpha_j (x^{(i)})^T x^{(j)} \\
\text{s.t.} \quad \alpha_i \ge 0, \quad i=1,\ldots,m \\
\sum_{i=1}^m \alpha_i y^{(i)} = 0
$$
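For reference, the stationarity conditions used in that substitution (setting the derivatives of $\mathcal{L}$ with respect to $w$ and $b$ to zero) are:
$$
\frac{\partial \mathcal{L}}{\partial w} = w - \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)} = 0 \implies w = \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)}, \qquad
\frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^m \alpha_i y^{(i)} = 0
$$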
I can see how all of the KKT conditions are satisfied in the above problem except one, namely the complementary slackness condition:
$$
\alpha_i \left[ -y^{(i)}(w^Tx^{(i)} + b) + 1 \right] = 0, \quad i = 1,\ldots,m
$$
In the course material, and in what I found online, it is said that all five KKT conditions are satisfied at the solution of the dual problem stated above. But I don't see why this particular condition could not be violated. My guess is that violating it would imply the solution is not optimal, so the optimal solution always satisfies it, but I don't see a way to prove this. Could somebody please shed some light on this and tell me what I'm missing? Thanks a lot.
EDIT: course materials: https://see.stanford.edu/materials/aimlcs229/cs229-notes3.pdf page 7-11
Best Answer
I've got the answer, thanks to DanielTheRocketMan for providing half of it!
For an $x^{(i)}$ that is a support vector, the following equality holds:
$$ y^{(i)} (w^Tx^{(i)} + b) = 1 $$
This satisfies the constraint irrespective of $\alpha_i$.
For an $x^{(i)}$ that is not a support vector, the following inequality holds:
$$ y^{(i)} (w^Tx^{(i)} + b) > 1 $$
Then, in the Lagrangian below (which is being maximised in the dual formulation), it can be seen that for such an $x^{(i)}$, increasing the value of $\alpha_i$ would lower the value of the objective function we are trying to maximise. Therefore the smallest feasible value is chosen, namely $\alpha_i = 0$ (recall the non-negativity constraint on $\alpha_i$ in the question above). So the constraint is satisfied in this case too.
$$ \mathcal{L}(w,b,\alpha) = \frac{1}{2} ||w||^2 - \sum_{i=1}^m \alpha_i \left[ y^{(i)}(w^Tx^{(i)} + b) - 1 \right] $$
Therefore the constraint holds implicitly for all $i = 1,\ldots,m$ and need not be included in the problem definition.
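To make this concrete, here is a minimal numerical sketch (not from the course material; the toy dataset and the choice of scipy's SLSQP solver are my own assumptions) that solves the dual above on a small separable dataset and checks that $\alpha_i \left[ y^{(i)}(w^Tx^{(i)} + b) - 1 \right] \approx 0$ for every point, with nonzero $\alpha_i$ appearing only for support vectors:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: two Gaussian clusters labelled +1 / -1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 2.0], 0.5, size=(10, 2)),
               rng.normal([-2.0, -2.0], 0.5, size=(10, 2))])
y = np.hstack([np.ones(10), -np.ones(10)])
m = len(y)

# Q[i, j] = y_i y_j x_i^T x_j, so the dual objective is sum(alpha) - 0.5 * alpha^T Q alpha.
Q = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(alpha):
    # Negate the dual objective so that minimizing it maximizes the dual.
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(neg_dual, x0=np.zeros(m), method="SLSQP",
               bounds=[(0.0, None)] * m,                             # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alpha = res.x

# Recover w from the stationarity condition and b from any support vector
# (for a support vector y_i (w^T x_i + b) = 1, and 1 / y_i = y_i for labels in {-1, +1}).
w = (alpha * y) @ X
sv = alpha > 1e-5
b = np.mean(y[sv] - X[sv] @ w)

# Complementary slackness: alpha_i * (y_i (w^T x_i + b) - 1) should be ~0 for every i.
margins = y * (X @ w + b)
print("max |alpha_i * (margin_i - 1)| =", np.max(np.abs(alpha * (margins - 1.0))))
print("support vectors (alpha_i > 0) have margin ~1:", margins[sv])
```

The printed quantity should be close to zero for all points: points strictly outside the margin get $\alpha_i = 0$, and the points with $\alpha_i > 0$ sit on the margin.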