I am following the derivation of backpropagation presented in Bishop's book *Pattern Recognition and Machine Learning*, and I am confused by part of the derivation in section 5.3.1.

In that chapter they apply the chain rule for partial derivatives to the definition of $\delta_j$ and get equation (5.55):

$$ \delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j} $$

where the sum runs over all units $k$ to which unit $j$ sends connections.

My question is how they get from equation (5.55) to equation (5.56):

$$ \delta_j = h'(a_j) \sum_k w_{kj} \delta_k $$

The book does try to explain how that equation comes about, in the following paragraph:

If we now substitute the definition of $\delta$ given by equation (5.51), $\delta_j \equiv \frac{\partial E_n}{\partial a_j}$, into equation (5.55), $\delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j}$, and make use of (5.48), $a_j = \sum_i w_{ji} z_i$, and (5.49), $z_j = h(a_j)$, we obtain the following backpropagation formula (5.56): $\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$

Basically, it's not 100% clear how they used all those steps to get $\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$ from $\delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j}$.

I've tried applying those steps; here is what I have so far.

First I substituted the definition of $\delta$ into the multivariable chain rule, going from $\delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j}$ to:

$$ \delta_j = \sum_k \delta_k \frac{\partial a_k}{\partial a_j} $$

Then I guessed that they somehow used the chain rule again on $\frac{\partial a_k}{\partial a_j}$, involving $\frac{\partial h(a_j)}{\partial a_j} = h'(a_j)$, and substituted the result back. How exactly that was done is not clear to me. Does anyone have an idea?


#### Best Answer

The first three steps are just the substitutions given in the explanation.

The fourth step deserves a little explanation. (5.55) expands the chain rule over "all units $k$ to which unit $j$ sends connections." Then (5.48) expands $a_k$ in terms of its feed-forward inputs, which lie in the same layer as $a_j$. For example, in a three-layer neural network, let $a_j$ be one of the hidden-layer units. Then the $a_k$ are the output-layer units to which $a_j$ sends connections, and each $a_k$ is computed from the hidden activations $z_i = h(a_i)$. Since $a_i$ and $a_j$ both index the hidden layer, $\frac{\partial a_i}{\partial a_j}$ is zero except when $i = j$, so only one term of the inner sum survives.

The last step follows because $h'(a_j)$ does not depend on $k$, so it can be pulled out of the sum.

$$ \begin{align} \delta_j \equiv \frac{\partial E_n}{\partial a_j} &= \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j} \quad (5.55)\\ &= \sum_k \delta_k \frac{\partial a_k}{\partial a_j} \quad (5.51, \text{definition of } \delta_k)\\ &= \sum_k \delta_k \frac{\partial}{\partial a_j}\Big(\sum_i w_{ki} z_i\Big) \quad (5.48)\\ &= \sum_k \delta_k \frac{\partial}{\partial a_j}\Big(\sum_i w_{ki} h(a_i)\Big) \quad (5.49)\\ &= \sum_k \delta_k w_{kj} h'(a_j)\\ &= h'(a_j) \sum_k \delta_k w_{kj} \end{align} $$
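The derivation above can be sanity-checked numerically. The sketch below is an illustrative assumption, not Bishop's code: a tiny network with tanh hidden units, linear outputs, and sum-of-squares error, where the hidden-layer deltas computed via (5.56) are compared against central finite differences of $E_n$ with respect to each $a_j$.

```python
import numpy as np

# Minimal numerical check of (5.56): delta_j = h'(a_j) * sum_k w_kj delta_k.
# Network sizes, weights, and the tanh / linear-output choice are illustrative
# assumptions made for this sketch.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # input -> hidden weights w_ji
W2 = rng.normal(size=(2, 4))   # hidden -> output weights w_kj
x = rng.normal(size=3)         # one input pattern
t = rng.normal(size=2)         # its target
h = np.tanh

def error_from_hidden_activations(a1):
    """E_n as a function of the hidden pre-activations a_j."""
    y = W2 @ h(a1)             # linear outputs: a_k = sum_j w_kj z_j
    return 0.5 * np.sum((y - t) ** 2)

# Forward pass
a1 = W1 @ x                    # a_j = sum_i w_ji x_i   (5.48)
y = W2 @ h(a1)

# Output-layer deltas for sum-of-squares error with linear outputs
delta_k = y - t

# Hidden-layer deltas via (5.56), with h'(a) = 1 - tanh(a)^2
delta_j = (1 - np.tanh(a1) ** 2) * (W2.T @ delta_k)

# Central finite differences of E_n w.r.t. each hidden a_j
eps = 1e-6
fd = np.array([
    (error_from_hidden_activations(a1 + eps * e) -
     error_from_hidden_activations(a1 - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
print(np.allclose(delta_j, fd, atol=1e-6))  # True
```

If the two vectors agree, the backpropagated deltas really are $\frac{\partial E_n}{\partial a_j}$, which is exactly what (5.56) claims.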
