If I understand correctly, the **negative log likelihood cost function** goes hand-in-hand with the softMax output layer.

But why?

The simplest motivating logic I am aware of goes as follows: softMax outputs (which sum to 1) can be considered as probabilities. Whichever unit corresponds to the actual correct answer, we wish to maximise the probability (or output) for this unit, which is the same as *minimising* the negative log-probability.

So the probability of the network outputting the correct (Target) answer is:

$$P(T) = y_T = \frac{\exp z_T}{\sum_k \exp z_k}$$

So we wish to minimise $-\log(P(T))$, which can be rewritten as $\log\left(\sum_k \exp z_k\right) - z_T$.

Now if we calculate the gradient $\frac{\partial C}{\partial z_m}$ across the (SoftMax) activation function, it comes out as $y_T - 1$ when $m=T$ and $y_m$ otherwise, so, writing $\text{targ}_m$ for the one-hot target (1 when $m=T$, 0 otherwise), we can write:

$$\frac{\partial C}{\partial z_m} = y_m - \text{targ}_m$$
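As a sanity check on that gradient, here is a minimal Python sketch that compares $y_m - \text{targ}_m$ against central finite differences of $C = \log(\sum_k \exp z_k) - z_T$. The function names, the example logits and the step size `eps` are my own illustrative choices, not something from a particular library.

```python
import math

def softmax(z):
    # Subtract the max before exponentiating; this avoids overflow
    # and leaves the softmax outputs unchanged.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def nll(z, target):
    # C = log(sum_k exp z_k) - z_T, with the same max-shift for stability.
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z)) - z[target]

def numerical_grad(z, target, eps=1e-6):
    # Central finite difference of C with respect to each z_m.
    grad = []
    for m in range(len(z)):
        up = list(z)
        up[m] += eps
        down = list(z)
        down[m] -= eps
        grad.append((nll(up, target) - nll(down, target)) / (2 * eps))
    return grad

# Hypothetical logits and target index, chosen only for illustration.
z = [1.0, -0.5, 2.0, 0.3]
target = 2
targ = [1.0 if m == target else 0.0 for m in range(len(z))]

analytic = [y - t for y, t in zip(softmax(z), targ)]
numeric = numerical_grad(z, target)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_gap)  # should be tiny if the analytic gradient is right
```

The two gradients should agree to within finite-difference error, which at least confirms the formula even if it does not explain it.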

Which looks like magic.

My question is: *How to go about removing the magic while maintaining clarity?*

Michael Nielsen's page here points out that one can derive the *cross entropy* cost function for *sigmoid* neurons from the requirement that $\frac{\partial C}{\partial z_k} = y_k - \text{targ}_k$.

That is, the cost function has to exactly counterbalance the gradient across the (sigmoid) activation function.
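For reference, the sigmoid argument can be sketched as follows. This is my own reconstruction in the notation used here (a single output $y = \sigma(z)$ with target $\text{targ}$), so treat it as a sketch rather than a quotation. Using $\sigma'(z) = y(1-y)$ and the chain rule:

$$\frac{\partial C}{\partial z} = \frac{\partial C}{\partial y}\, y(1-y) = y - \text{targ} \quad\Longrightarrow\quad \frac{\partial C}{\partial y} = \frac{y - \text{targ}}{y(1-y)} = -\frac{\text{targ}}{y} + \frac{1 - \text{targ}}{1 - y}$$

Integrating with respect to $y$ gives

$$C = -\left[\text{targ}\,\log y + (1 - \text{targ})\,\log(1 - y)\right] + \text{const},$$

which is the cross-entropy cost for one sigmoid neuron.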

I've been trying to do the same thing for the negative log likelihood cost function for SoftMax neurons:

SoftMax looks like a sensible formula; $\exp$ will accentuate the winner, and maps $\mathbb{R} \to \mathbb{R}^+$, and the denominator is just a normalisation term to make sure outputs sum to 1. So maybe no further justification is required for choosing $\exp$… but still this seems a little arbitrary/weak.

Now supposing the choice has been made, we have:

$$y_m = \frac{\exp z_m}{\sum_k \exp z_k}$$

So maybe we can start off by requiring some cost function that satisfies:

$$\frac{\partial C}{\partial z_m} = y_m - \text{targ}_m$$

… so it should be possible to derive $C(y_m)$ using integration.

… but I can't see how to do that last step. Can anyone see it through?

PS Peter's Notes contains a tricky derivation for the negative log likelihood cost function.

#### Best Answer

I haven't touched integration since leaving college, but for this specific problem it seems straightforward to get the cost function. Here is a sketch, just for your reference.

$$ C = \int \frac{\partial C}{\partial z_m}\, dz_m = \int (y_m - 1_{m=T})\, dz_m $$

With $y_m = \frac{\exp z_m}{\sum_k \exp z_k}$,

$$ \int (y_m - 1_{m=T})\, dz_m = \int \left(\frac{\exp z_m}{\sum_k \exp z_k} - 1_{m=T}\right) dz_m $$

Since we have

$$ \int \frac{\exp z_m}{\sum_k \exp z_k}\, dz_m = \log\left(\sum_k \exp z_k\right) + f $$

and

$$ \int 1_{m=T}\, dz_m = z_T + g, $$

where $f$ and $g$ do not depend on the variable $z_m$ being integrated. Because the same cost function has to satisfy the gradient condition for every index $m$, the leftover terms cannot depend on any of the $z_m$, so it is safe to require $f$ and $g$ to be constants, and for minimization purposes we can ignore the constants. The equations above then give the cost function:

$$ \log\left(\sum_k \exp z_k\right) - z_T = -\log\frac{\exp z_T}{\sum_k \exp z_k} = -\log(y_T) $$
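One way to check the integration numerically, without worrying about which "constants" depend on which variable, is to integrate the required gradient along a straight path in $z$-space and compare with the change in $-\log(y_T)$. The sketch below does this with a midpoint rule; the path endpoints, the number of steps and the function names are illustrative assumptions of mine.

```python
import math

def softmax(z):
    # Max-shifted softmax for numerical stability.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def nll(z, target):
    # Candidate cost: C = log(sum_k exp z_k) - z_T = -log(y_T).
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z)) - z[target]

def line_integral(z_start, z_end, target, steps=2000):
    # Midpoint-rule approximation of the integral of
    # sum_m (y_m - 1_{m=T}) dz_m along the straight line z_start -> z_end.
    direction = [b - a for a, b in zip(z_start, z_end)]
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) / steps
        z = [a + t * d for a, d in zip(z_start, direction)]
        y = softmax(z)
        grad = [y_m - (1.0 if m == target else 0.0) for m, y_m in enumerate(y)]
        total += sum(g * d for g, d in zip(grad, direction)) / steps
    return total

# Hypothetical endpoints and target index, for illustration only.
z_start = [0.0, 0.0, 0.0]
z_end = [1.5, -0.7, 0.4]
target = 0

integral = line_integral(z_start, z_end, target)
difference = nll(z_end, target) - nll(z_start, target)
print(integral, difference)  # the two numbers should nearly coincide
```

If the two values agree, the integrated gradient reproduces $-\log(y_T)$ up to the additive constant $C(z_\text{start})$, which is the constant being dropped above.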