I am trying to really internalize how backpropagation works. I made up networks of increasing complexity and wrote out the formulas for each.

However, I have some difficulties with the matrix notation. I hope someone can help me.

My network has 2 input, 3 hidden and 2 output neurons.

The loss is MSE: $L = \frac{1}{2} \sum (\hat{y} - y_{\text{true}})^2$

The derivative of the loss with respect to the weight matrix $W^{(3)}$ should have the same dimensions as $W^{(3)}$, so that each entry can be updated with (stochastic) gradient descent.

$\frac{\partial L}{\partial W^{(3)}} = \frac{\partial L}{\partial a^{(3)}} \frac{\partial a^{(3)}}{\partial z^{(3)}} \frac{\partial z^{(3)}}{\partial W^{(3)}} = (a^{(3)} - y) \odot a^{(3)} \odot (1 - a^{(3)}) \, a^{(2)T}$

**First question**: is it correct to transpose $a^{(2)}$, since otherwise the dimensions would not work out?
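To sanity-check the shapes, here is a minimal NumPy sketch of the 2–3–2 network. I'm assuming a sigmoid output layer (matching the $a^{(3)} \odot (1 - a^{(3)})$ term) and a ReLU hidden layer (matching the $(1,1,1)^T$ derivative below); all variable names are mine.

```python
import numpy as np

# Sketch of the question's 2-3-2 network with column vectors.
# Assumptions: ReLU hidden layer, sigmoid output layer, MSE loss.
rng = np.random.default_rng(0)

a1 = rng.standard_normal((2, 1))   # input, 2x1
W2 = rng.standard_normal((3, 2))   # hidden weights, 3x2
W3 = rng.standard_normal((2, 3))   # output weights, 2x3

z2 = W2 @ a1                       # 3x1
a2 = np.maximum(z2, 0)             # ReLU hidden activations, 3x1
z3 = W3 @ a2                       # 2x1
a3 = 1 / (1 + np.exp(-z3))         # sigmoid outputs, 2x1

y = rng.standard_normal((2, 1))    # arbitrary target, 2x1

# delta3 = dL/dz3 = (a3 - y) * a3 * (1 - a3), elementwise -> 2x1
delta3 = (a3 - y) * a3 * (1 - a3)

# dL/dW3 = delta3 @ a2.T -> (2x1)(1x3) = 2x3, same shape as W3
dW3 = delta3 @ a2.T
print(dW3.shape)   # (2, 3)
```

The outer product `delta3 @ a2.T` is exactly the transposed-$a^{(2)}$ term: without the transpose, the shapes $(2\times 1)(3\times 1)$ would not multiply at all.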

Now for the second weight matrix, where **I cannot figure out what is wrong with the dimensions**:

$\frac{\partial L}{\partial W^{(2)}} = \frac{\partial L}{\partial a^{(3)}} \frac{\partial a^{(3)}}{\partial z^{(3)}} \frac{\partial z^{(3)}}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial W^{(2)}} = (a^{(3)} - y) \odot a^{(3)} \odot (1 - a^{(3)}) \, W^{(3)} (1,1,1)^T a^{(1)T}$

I get **2×1 2×3 3×1 1×2**…

I wrote just $(1,1,1)$, assuming that the $z = (z_1, z_2, z_3)$ are greater than 0.

#### Best Answer

For your first question: yes, transposing $a^{(2)}$ will do the job, since for each entry $w_{ij}^{(3)}$ of the matrix, the derivative includes the multiplier $a_{i}^{(2)}$. So $a_{1}^{(2)}$ will be in the first column, $a_{2}^{(2)}$ in the second column, and so on. This is directly achieved by multiplying $\delta$ (the first three terms) by $a^{(2)T}$. Note that in your indexing, $w_{ij}$ denotes the $j$-th row and $i$-th column.

The second one is a bit tricky. First of all, you're using denominator layout: the derivative of a vector of size $m$ with respect to a vector of size $n$ has size $n\times m$. Typically, numerator layout is more common.

### It's all about Layouts

Let's say we have a scalar loss $L$ and two vectors $a, z$ of dimensions $m, n$ respectively. In $\frac{\partial L}{\partial z}=\frac{\partial L}{\partial a}\frac{\partial a}{\partial z}$, according to denominator layout, the first factor produces an $m\times 1$ vector and the second an $n\times m$ matrix, so the dimensions mismatch. If it were numerator layout, we'd have $1\times m$ times $m\times n$ and get a $1\times n$ gradient, still consistent with the layout definition.

This is why you should append **to the left** as you move forward in denominator layout (because changing the layout transposes a matrix product: $(AB)^T=B^TA^T$): $$\underbrace{\frac{\partial L}{\partial z}}_{n\times 1}=\underbrace{\frac{\partial a}{\partial z}}_{n\times m}\underbrace{\frac{\partial L}{\partial a}}_{m\times 1}$$
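The left-append rule can be sketched in NumPy on the same 2–3–2 network. In denominator layout, $\frac{\partial z^{(3)}}{\partial a^{(2)}} = W^{(3)T}$ ($3\times 2$) and $\frac{\partial a^{(2)}}{\partial z^{(2)}}$ is the diagonal ReLU Jacobian ($3\times 3$); both get prepended to the running gradient. The concrete numbers and names here are my own illustration:

```python
import numpy as np

# Sketch of the left-append rule in denominator layout (2-3-2 network,
# ReLU hidden layer, sigmoid output, MSE loss -- my assumptions).
rng = np.random.default_rng(1)
a1 = rng.standard_normal((2, 1))
W2 = rng.standard_normal((3, 2))
W3 = rng.standard_normal((2, 3))

z2 = W2 @ a1
a2 = np.maximum(z2, 0)
z3 = W3 @ a2
a3 = 1 / (1 + np.exp(-z3))
y = rng.standard_normal((2, 1))

delta3 = (a3 - y) * a3 * (1 - a3)            # dL/dz3, 2x1

# Denominator layout: dz3/da2 = W3.T (3x2), da2/dz2 = diag(relu'(z2)) (3x3).
# Each new factor is appended on the LEFT of the running gradient:
D = np.diag((z2 > 0).astype(float).ravel())  # 3x3 diagonal ReLU Jacobian
delta2 = D @ (W3.T @ delta3)                 # (3x3)(3x2)(2x1) = 3x1
print(delta2.shape)   # (3, 1)
```

Appending on the right instead, as in the question, gives the $2\times 1$, $2\times 3$, $3\times 1$ mismatch.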

So, $$\frac{\partial L}{\partial W_{ij}^{(2)}} = \underbrace{\frac{\partial z^{(2)}}{\partial W_{ij}^{(2)}}}_{1\times 3} \underbrace{\frac{\partial a^{(2)}}{\partial z^{(2)}}}_{3\times 3} \underbrace{\frac{\partial z^{(3)}}{\partial a^{(2)}}}_{3\times 2} \underbrace{\frac{\partial L}{\partial z^{(3)}}}_{2\times 1}$$

And everything matches. I've changed two more things:

- Some of these calculations can be merged and optimised using element-wise multiplications, e.g. the term $\frac{\partial a^{(2)}}{\partial z^{(2)}}$ produces a 3×3 output, but it's a diagonal matrix. This is actually what you've done while calculating gradients in the last layer.
- I've used $W_{ij}$ because it's easier to reason about. $\frac{\partial z}{\partial W}$ is a 3D tensor, since the numerator is a vector and the denominator is a matrix. After finding the expressions for each $W_{ij}$ and placing them into the gradient matrix one by one according to the **denominator** layout, you can take out common multiplications and write the final formula.
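Taking out the common multiplications gives the usual compact form $\frac{\partial L}{\partial W^{(2)}} = \delta^{(2)} a^{(1)T}$ with $\delta^{(2)} = \mathrm{relu}'(z^{(2)}) \odot (W^{(3)T}\delta^{(3)})$. A sketch, under the same network assumptions as above, checked against a finite-difference gradient:

```python
import numpy as np

# Sketch: merged element-wise form of dL/dW2, verified numerically.
# Network assumptions (mine): 2-3-2, ReLU hidden, sigmoid output, MSE loss.
rng = np.random.default_rng(2)
a1 = rng.standard_normal((2, 1))
W2 = rng.standard_normal((3, 2))
W3 = rng.standard_normal((2, 3))
y = rng.standard_normal((2, 1))

def loss(W2_):
    z2 = W2_ @ a1
    a2 = np.maximum(z2, 0)
    a3 = 1 / (1 + np.exp(-(W3 @ a2)))
    return 0.5 * np.sum((a3 - y) ** 2)

# Analytic gradient with the diagonal Jacobian folded into an
# element-wise product (the "common multiplications taken out").
z2 = W2 @ a1
a2 = np.maximum(z2, 0)
a3 = 1 / (1 + np.exp(-(W3 @ a2)))
delta3 = (a3 - y) * a3 * (1 - a3)
delta2 = (z2 > 0).astype(float) * (W3.T @ delta3)  # relu'(z2) ⊙ (W3^T δ3)
dW2 = delta2 @ a1.T                                # 3x2, same shape as W2

# Central finite-difference check, entry by entry.
num = np.zeros_like(W2)
eps = 1e-6
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        E = np.zeros_like(W2)
        E[i, j] = eps
        num[i, j] = (loss(W2 + E) - loss(W2 - E)) / (2 * eps)

print(np.allclose(dW2, num, atol=1e-6))   # True
```

This is the same quantity as the per-entry formula above; stacking the scalars $\frac{\partial L}{\partial W_{ij}^{(2)}}$ into a matrix in denominator layout reproduces `delta2 @ a1.T`.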