The log likelihood function of logistic regression is as below:

\begin{align}
\ln(L(x, y; w))
&=\sum\limits_{i=1}^n \left[y_i\ln(p(x_i;w))+(1-y_i)\ln(1-p(x_i;w))\right]
\\&=\sum\limits_{i=1}^n \left[y_i\ln\left(\dfrac{1}{1+e^{-w'x_i}}\right)+(1-y_i)\ln\left(1-\dfrac{1}{1+e^{-w'x_i}}\right)\right]
\end{align}

The cost function, including the penalty and L1/L2 regularization, is given at the link.

I understand C and the L1/L2 norms, but I cannot derive the cost function. Can anyone help with the derivation?


#### Best Answer

Your log-likelihood is:
$$ \log L(x, y; w) = \sum_{i=1}^N \ell_i $$
where
\begin{align}
\ell_i &= y_i \log\left( \frac{1}{1 + \exp(- w^T x_i)} \right) + (1-y_i) \log\left( 1 - \frac{1}{1 + \exp(- w^T x_i)} \right)
\\&= y_i \log\left( \frac{1}{1 + \exp(- w^T x_i)} \right) + (1-y_i) \log\left( \frac{1 + \exp(- w^T x_i)}{1 + \exp(- w^T x_i)} - \frac{1}{1 + \exp(- w^T x_i)} \right)
\\&= y_i \log\left( \frac{1}{1 + \exp(- w^T x_i)} \right) + (1-y_i) \log\left( \frac{\exp(- w^T x_i)}{1 + \exp(- w^T x_i)} \right)
\\&= y_i \log\left( \frac{1}{1 + \exp(- w^T x_i)} \right) + (1-y_i) \log\left( \frac{\exp(- w^T x_i)}{1 + \exp(- w^T x_i)} \times \frac{\exp(w^T x_i)}{\exp(w^T x_i)} \right)
\\&= y_i \log\left( \frac{1}{1 + \exp(- w^T x_i)} \right) + (1-y_i) \log\left( \frac{1}{\exp(w^T x_i) + 1} \right)
\\&= \log\left( \frac{1}{1 + \exp\left( \begin{cases}- w^T x_i & y_i = 1 \\ w^T x_i & y_i = 0\end{cases} \right)} \right)
\\&= \log\left( \frac{1}{1 + \exp\left( - y_i' w^T x_i \right)} \right)
\\&= -\log\left( 1 + \exp\left( - y_i' w^T x_i \right) \right)
\end{align}
where $y_i \in \{0, 1\}$ but we defined $y_i' = 2 y_i - 1 \in \{-1, 1\}$. The step that introduces the cases uses the fact that exactly one of $y_i$ and $1 - y_i$ equals 1, so only one of the two logarithms survives for each sample.
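As a quick sanity check on the algebra above, here is a minimal sketch that evaluates both forms of $\ell_i$ on a few made-up points and compares them. The helper names (`loglik_term_01`, `loglik_term_pm1`), the weights, and the samples are all hypothetical and exist only for illustration; only the Python standard library is used.

```python
import math

def dot(w, x):
    """Inner product w^T x for two plain Python lists."""
    return sum(wj * xj for wj, xj in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loglik_term_01(w, x, y):
    """Per-sample log-likelihood with labels y in {0, 1}."""
    p = sigmoid(dot(w, x))
    return y * math.log(p) + (1 - y) * math.log(1.0 - p)

def loglik_term_pm1(w, x, y):
    """The same term written as -log(1 + exp(-y' w^T x)), with y' = 2y - 1."""
    y_signed = 2 * y - 1
    return -math.log1p(math.exp(-y_signed * dot(w, x)))

# Made-up weights and (x, y) pairs, purely for illustration.
w = [0.5, -1.2, 0.3]
samples = [
    ([1.0, 2.0, -0.5], 1),
    ([0.3, -0.7, 1.5], 0),
    ([-2.0, 0.1, 0.4], 1),
]

# Largest absolute difference between the two forms over the samples.
max_gap = max(
    abs(loglik_term_01(w, x, y) - loglik_term_pm1(w, x, y))
    for x, y in samples
)
print(max_gap)
```

The two forms should agree up to floating-point rounding, so `max_gap` is expected to be tiny rather than exactly zero.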

To get to the loss function in the image, first we need to add an intercept to the model, replacing $w^T x_i$ with $w^T x_i + c$. Then:
$$ \operatorname*{arg\,max} \log L(X, y; w, c) = \operatorname*{arg\,min} \, - \log L(X, y; w, c) ,$$
and then we add a regularizer $P(w, c)$:
$$ \operatorname*{arg\,min} \, \lambda P(w, c) - \log L(X, y; w, c) = \operatorname*{arg\,min} \, P(w, c) - \frac{1}{\lambda} \log L(X, y; w, c) ,$$
where we then set $C := \frac1\lambda$. The two problems have the same minimizer because dividing the objective by $\lambda > 0$ does not change where the minimum occurs. The $L_2$ penalty is
$$ P(w, c) = \frac12 w^T w = \frac12 \sum_{j=1}^d w_j^2 ;$$
the $\tfrac12$ is there only for mathematical convenience when we differentiate, and it doesn't really affect anything. The $L_1$ penalty has
$$ P(w, c) = \lVert w \rVert_1 = \sum_{j=1}^d \lvert w_j \rvert .$$
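The resulting objective, $P(w, c) + C \sum_i \log\left(1 + \exp\left(-y_i'(w^T x_i + c)\right)\right)$, can be sketched in a few lines. This is an illustrative sketch under the definitions above, not scikit-learn's actual implementation; the function names (`neg_log_likelihood`, `penalty`, `cost`) and the toy data are hypothetical. It also checks numerically that the $\lambda$-form and the $C$-form of the objective differ only by the constant factor $\lambda$.

```python
import math

def dot(w, x):
    """Inner product w^T x for two plain Python lists."""
    return sum(wj * xj for wj, xj in zip(w, x))

def neg_log_likelihood(w, c, X, y):
    """-log L(X, y; w, c) = sum_i log(1 + exp(-y'_i (w^T x_i + c))), y in {0, 1}."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        y_signed = 2 * y_i - 1  # map {0, 1} to {-1, +1}
        total += math.log1p(math.exp(-y_signed * (dot(w, x_i) + c)))
    return total

def penalty(w, kind="l2"):
    """Regularizer P(w, c); the intercept c is not penalized."""
    if kind == "l2":
        return 0.5 * sum(wj * wj for wj in w)
    if kind == "l1":
        return sum(abs(wj) for wj in w)
    raise ValueError("kind must be 'l1' or 'l2'")

def cost(w, c, X, y, C=1.0, kind="l2"):
    """Regularized cost P(w, c) + C * (-log L)."""
    return penalty(w, kind) + C * neg_log_likelihood(w, c, X, y)

# Toy data, purely for illustration.
X = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]
y = [1, 0, 1]
w, c = [0.4, -0.2], 0.1

# lambda * P - log L equals lambda * (P + C * (-log L)) when C = 1 / lambda,
# so both objectives are minimized by the same (w, c).
lam = 2.0
lambda_form = lam * penalty(w) + neg_log_likelihood(w, c, X, y)
c_form = lam * cost(w, c, X, y, C=1.0 / lam)
print(abs(lambda_form - c_form))
```

Passing `kind="l1"` to `cost` swaps in the $L_1$ penalty while leaving the likelihood term unchanged.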
