Solved: logistic regression gradient of weights

I am reading about logistic regression and am looking at the negative log-likelihood function. The text takes the gradient with respect to the weights and produces the result at the bottom of page 7. I calculated this myself and can't seem to get the solution they arrived at.

They set

$$NLL = -\sum_{i=1}^N\left[(1-y_i)\log\big(1-s(w^Tx_i)\big)+y_i\log\big(s(w^Tx_i)\big)\right]$$

where $s$ is the sigmoid function $s(x) = \frac{1}{e^{-x}+1}$.

When I take $\frac{\partial NLL}{\partial w}$, I get
$$-\sum_{i=1}^N \left(\frac{x_i(y_i-1)e^{w^Tx_i}}{e^{w^Tx_i}+1} + \frac{x_iy_i}{e^{w^Tx_i}+1}\right)$$

and not $$\sum_{i=1}^N \big(s(w^Tx_i)-y_i\big)x_i.$$


I must be making a mistake, since this is just a simple gradient calculation. Can anyone shed some light on how this was computed?

It is a simple calculation, but one can easily make a mistake. We have

$$\frac{\partial s(x)}{\partial x} = s(x)\big(1 - s(x)\big), \qquad \frac{\partial s(w^Tx_i)}{\partial w} = x_i\big(1 - s(w^Tx_i)\big)s(w^Tx_i), \qquad \frac{\partial \log(x)}{\partial x} = \frac{1}{x},$$
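(The first identity, which does all the work here, is quick to verify from the definition of $s$:

$$\frac{\partial s(x)}{\partial x} = \frac{e^{-x}}{(e^{-x}+1)^2} = \frac{1}{e^{-x}+1}\cdot\frac{e^{-x}}{e^{-x}+1} = s(x)\big(1-s(x)\big),$$

since $\frac{e^{-x}}{e^{-x}+1} = 1 - \frac{1}{e^{-x}+1} = 1-s(x)$.)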

so the derivative is

$$\frac{\partial NLL}{\partial w} = \sum_{i=1}^{N}\left[(1 - y_i)\frac{x_i\big(1 - s(w^Tx_i)\big)s(w^Tx_i)}{1 - s(w^Tx_i)} - y_i\frac{x_i\big(1 - s(w^Tx_i)\big)s(w^Tx_i)}{s(w^Tx_i)}\right]$$
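Cancelling the common factor in each fraction, every summand collapses because the $y_i\,s(w^Tx_i)\,x_i$ terms cancel:

$$(1-y_i)\,s(w^Tx_i)\,x_i - y_i\big(1-s(w^Tx_i)\big)x_i = \big(s(w^Tx_i) - y_i s(w^Tx_i) - y_i + y_i s(w^Tx_i)\big)x_i = \big(s(w^Tx_i)-y_i\big)x_i,$$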

so the gradient indeed simplifies to

$$\sum_{i=1}^{N} x_i\big(s(w^Tx_i) - y_i\big).$$

In fact, the expression in the question is the same quantity before simplification: substituting $s(w^Tx_i) = \frac{e^{w^Tx_i}}{e^{w^Tx_i}+1}$ and $1 - s(w^Tx_i) = \frac{1}{e^{w^Tx_i}+1}$ into this sum recovers it exactly, so no mistake was made there either.
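As a sanity check, here is a minimal numerical verification (a sketch in NumPy; the random data and tolerance are illustrative, not from the original text) comparing the analytic gradient $\sum_i x_i\big(s(w^Tx_i)-y_i\big)$ against a finite-difference approximation of the NLL:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (np.exp(-z) + 1.0)

def nll(w, X, y):
    # NLL = -sum_i [(1-y_i) log(1 - s(w^T x_i)) + y_i log(s(w^T x_i))]
    p = sigmoid(X @ w)
    return -np.sum((1 - y) * np.log(1 - p) + y * np.log(p))

def analytic_grad(w, X, y):
    # sum_i x_i (s(w^T x_i) - y_i), vectorized as X^T (p - y)
    return X.T @ (sigmoid(X @ w) - y)

def numeric_grad(w, X, y, eps=1e-6):
    # Central finite differences, one coordinate of w at a time
    g = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (nll(w + e, X, y) - nll(w - e, X, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))               # 50 illustrative examples, 3 features
y = (rng.random(50) < 0.5).astype(float)   # random 0/1 labels
w = rng.normal(size=3)

# The two gradients should agree to roughly finite-difference precision.
print(np.max(np.abs(analytic_grad(w, X, y) - numeric_grad(w, X, y))))
```

If the formula were off by a sign or a factor, the printed discrepancy would be large rather than near finite-difference precision.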
