I am reading about logistic regression (from https://piazza-resources.s3.amazonaws.com/h61o5linlbb1v0/h8exwp8dmm44ok/classificationV6.pdf?AWSAccessKeyId=AKIAIEDNRLJ4AZKBW6HA&Expires=1485650876&Signature=Rd4BqBgb4hPwWUjxAyxJNfPhklU%3D) and am looking at the negative log likelihood function. They take the gradient with respect to the weights and produce the result at the bottom of page 7. I calculated this myself and can't seem to get the solution that they arrived at.

They set

$$NLL = -\sum_{i=1}^N\left[(1-y_i)\log(1-s(w^Tx_i))+y_i\log(s(w^Tx_i))\right]$$

where $s$ is the sigmoid function $s(x) = \frac{1}{e^{-x}+1}$.

When I take $\frac{\partial NLL}{\partial w}$, I get

$$-\sum_{i=1}^N \left(\frac{x_i(y_i-1)e^{w^Tx_i}}{e^{w^Tx_i}+1} + \frac{x_iy_i}{e^{w^Tx_i}+1}\right)$$

and not

$$\sum_{i=1}^N (s(w^Tx_i)-y_i)x_i$$

I must be making a mistake, since this is just a simple gradient calculation. Can anyone shed some light on how this was computed?

#### Best Answer

It is a simple calculation but one can easily make a mistake. Since we have

$$\begin{aligned}
\frac{\partial s(x)}{\partial x} &= s(x)(1 - s(x)) \\
\frac{\partial s(w^Tx_i)}{\partial w} &= x_i(1 - s(w^Tx_i))s(w^Tx_i) \\
\frac{\partial \log(x)}{\partial x} &= \frac{1}{x}
\end{aligned}$$

so the derivative is

$$\frac{\partial NLL}{\partial w} = \sum_{i = 1}^{N} \left[(1 - y_i)\frac{x_i(1 - s(w^Tx_i))s(w^Tx_i)}{1 - s(w^Tx_i)} - y_i\frac{x_i(1 - s(w^Tx_i))s(w^Tx_i)}{s(w^Tx_i)}\right]$$

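and I checked that, after cancelling the common factor in each fraction, the summand becomes $x_i\left[(1 - y_i)\,s(w^Tx_i) - y_i\,(1 - s(w^Tx_i))\right]$, so it indeed simplifies to

$$\sum_{i=1}^{N} x_i(s(w^Tx_i) - y_i)$$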
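If you want to double-check a result like this without redoing the algebra, a finite-difference check is a quick way to do it. Below is a minimal sketch in plain Python. The data `X`, `y`, the weights `w`, and all function names are made up for illustration and are not from the linked notes. It compares the closed form above, the expression from the question, and a central-difference estimate of the gradient of the NLL.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def nll(w, X, y):
    # NLL = -sum_i [(1 - y_i) log(1 - s(w^T x_i)) + y_i log(s(w^T x_i))]
    total = 0.0
    for xi, yi in zip(X, y):
        s = sigmoid(dot(w, xi))
        total -= (1 - yi) * math.log(1 - s) + yi * math.log(s)
    return total

def grad_closed_form(w, X, y):
    # sum_i (s(w^T x_i) - y_i) x_i
    g = [0.0] * len(w)
    for xi, yi in zip(X, y):
        r = sigmoid(dot(w, xi)) - yi
        for j, xij in enumerate(xi):
            g[j] += r * xij
    return g

def grad_question_form(w, X, y):
    # The unsimplified expression written in the question.
    g = [0.0] * len(w)
    for xi, yi in zip(X, y):
        e = math.exp(dot(w, xi))
        c = (yi - 1) * e / (e + 1) + yi / (e + 1)
        for j, xij in enumerate(xi):
            g[j] -= c * xij
    return g

def grad_numeric(w, X, y, h=1e-6):
    # Central differences, one coordinate of w at a time.
    g = []
    for j in range(len(w)):
        wp = list(w)
        wm = list(w)
        wp[j] += h
        wm[j] -= h
        g.append((nll(wp, X, y) - nll(wm, X, y)) / (2 * h))
    return g

# Toy data (assumed for illustration): 4 points, 3 features, first one a bias term.
X = [[1.0, 0.5, -1.2], [1.0, -0.3, 0.8], [1.0, 1.5, 0.1], [1.0, -2.0, -0.4]]
y = [1, 0, 1, 0]
w = [0.2, -0.5, 0.7]

closed = grad_closed_form(w, X, y)
question = grad_question_form(w, X, y)
numeric = grad_numeric(w, X, y)

for c, q, n in zip(closed, question, numeric):
    print(f"closed={c:+.6f}  question={q:+.6f}  numeric={n:+.6f}")
```

On this toy data all three columns should agree up to finite-difference error. That suggests the expression in the question is the same gradient, just not simplified yet: substituting $s(w^Tx_i) = \frac{e^{w^Tx_i}}{e^{w^Tx_i}+1}$ and $1 - s(w^Tx_i) = \frac{1}{e^{w^Tx_i}+1}$ into it and collecting terms gives $\sum_{i=1}^N (s(w^Tx_i) - y_i)x_i$.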