# Solved – Deriving the Backpropagation Matrix formulas for a Neural Network – Matrix dimensions don’t work out

I am trying to really internalize how backpropagation works. I made up several networks of increasing complexity and wrote out the formulas for each.
However, I have some difficulties with the matrix notation. I hope someone can help me.

The loss is MSE: $$L = \frac{1}{2} \sum (y_{hat} - y_{true})^2$$

The derivative of the loss with respect to the weight matrix $$W^{(3)}$$ should have the same dimensions as $$W^{(3)}$$, so that each entry can be updated with (stochastic) gradient descent.

$$\frac{\partial L}{\partial W^{(3)}} = \frac{\partial L}{\partial a^{(3)}} \frac{\partial a^{(3)}}{\partial z^{(3)}} \frac{\partial z^{(3)}}{\partial W^{(3)}} = (a^{(3)} - y) \odot a^{(3)} \odot (1 - a^{(3)}) \, a^{(2)T}$$

First question: is it correct to transpose $$a^{(2)}$$, since otherwise the dimensions would not work out?

Now for the second weight matrix, where I cannot figure out what is wrong with the dimensions:

$$\frac{\partial L}{\partial W^{(2)}} = \frac{\partial L}{\partial a^{(3)}} \frac{\partial a^{(3)}}{\partial z^{(3)}} \frac{\partial z^{(3)}}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial W^{(2)}} = (a^{(3)} - y) \odot a^{(3)} \odot (1 - a^{(3)}) \, W^{(3)} \, (1,1,1)^T \, a^{(1)T}$$

For the four factors on the right-hand side I get the dimensions 2×1, 2×3, 3×1, 1×2.

I wrote just $$(1,1,1)$$, assuming that all of $$z = (z_1, z_2, z_3)$$ are greater than 0.


For your first question: yes, transposing $$a^{(2)}$$ does the job, since for each entry $$w_{ij}^{(3)}$$ of the matrix, the derivative includes the factor $$a_{i}^{(2)}$$. So $$a_{1}^{(2)}$$ will be in the first column, $$a_{2}^{(2)}$$ in the second column, and so on. This is achieved directly by multiplying $$\delta$$ (the first three terms) by $${a^{(2)}}^T$$. Note that in your indexing, $$w_{ij}$$ denotes the $$j$$-th row and $$i$$-th column.
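To make the shapes concrete, here is a minimal sketch of that outer product in plain Python. The numeric values are made up for illustration, and the layer sizes (3 hidden units, 2 outputs, sigmoid output, no bias) are assumptions read off the dimensions quoted in the question.

```python
import math

def sigmoid(v):
    """Element-wise logistic function."""
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def outer(u, v):
    """Outer product u v^T: one row per entry of u, one column per entry of v."""
    return [[ui * vj for vj in v] for ui in u]

# Hypothetical values: 3 hidden activations, 2 outputs.
a2 = [0.5, 1.0, 2.0]                  # hidden activations, 3x1
W3 = [[0.1, -0.2, 0.3],
      [0.4, 0.5, -0.6]]               # output weights, 2x3
y = [1.0, 0.0]                        # target, 2x1

z3 = [sum(w * a for w, a in zip(row, a2)) for row in W3]
a3 = sigmoid(z3)

# delta = (a3 - y) * a3 * (1 - a3), element-wise, 2x1
delta = [(a - t) * a * (1.0 - a) for a, t in zip(a3, y)]

# delta (2x1) times a2^T (1x3) gives a 2x3 matrix, the same shape as W3.
grad_W3 = outer(delta, a2)
print(len(grad_W3), len(grad_W3[0]))  # 2 3
```

Column $$i$$ of `grad_W3` is `delta` scaled by $$a_i^{(2)}$$, which is the column structure described above.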

The second one is a bit tricky. First of all, you're using denominator layout, so the derivative of a vector of size $$m$$ with respect to a vector of size $$n$$ has size $$n \times m$$. Typically, numerator layout is more common.

Let's say we have a scalar loss $$L$$ and two vectors $$a, z$$ with dimensions $$m, n$$ respectively. In $$\frac{\partial L}{\partial z}=\frac{\partial L}{\partial a}\frac{\partial a}{\partial z}$$, according to denominator layout, the first factor is an $$m \times 1$$ vector and the second is an $$n \times m$$ matrix, so the dimensions do not match. With numerator layout, we'd have $$1 \times m$$ times $$m \times n$$ and get a $$1 \times n$$ gradient, which is still consistent with the layout definition.

This is why, in denominator layout, each new factor of the chain rule is attached on the left (changing the layout transposes the matrix product, and $$(AB)^T=B^TA^T$$): $$\underbrace{\frac{\partial L}{\partial z}}_{n \times 1}=\underbrace{\frac{\partial a}{\partial z}}_{n \times m}\underbrace{\frac{\partial L}{\partial a}}_{m \times 1}$$
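The ordering can be checked numerically. The sketch below uses a made-up linear map $$a = Mz$$ with $$m = 2$$, $$n = 3$$ and the loss $$L = \frac{1}{2}\sum_i a_i^2$$; these are illustrative choices, not part of the original network. In denominator layout $$\frac{\partial a}{\partial z} = M^T$$, so the product is $$(n \times m)(m \times 1)$$.

```python
def matvec(A, v):
    """Matrix-vector product A v."""
    return [sum(x * y for x, y in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def loss(z, M):
    a = matvec(M, z)
    return 0.5 * sum(x * x for x in a)

# Hypothetical example: a = M z maps n = 3 inputs to m = 2 outputs.
M = [[1.0, 2.0, 0.0],
     [0.0, -1.0, 3.0]]
z = [0.5, -1.0, 2.0]

a = matvec(M, z)
dL_da = a                       # m x 1, because L = 0.5 * sum(a_i^2)
da_dz = transpose(M)            # n x m in denominator layout
dL_dz = matvec(da_dz, dL_da)    # (n x m)(m x 1) -> n x 1

# Central finite differences as an independent check.
eps = 1e-6
numeric = []
for i in range(len(z)):
    zp = list(z)
    zm = list(z)
    zp[i] += eps
    zm[i] -= eps
    numeric.append((loss(zp, M) - loss(zm, M)) / (2 * eps))

max_err = max(abs(g - h) for g, h in zip(dL_dz, numeric))
print(len(dL_dz), max_err)
```

Writing the factors in the other order, `dL_da` times `da_dz`, would pair an $$m \times 1$$ vector with an $$n \times m$$ matrix, which is the mismatch described above.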

So, $$\frac{\partial L}{\partial W_{ij}^{(2)}} = \underbrace{\frac{\partial z^{(2)}}{\partial W_{ij}^{(2)}}}_{1 \times 3} \underbrace{\frac{\partial a^{(2)}}{\partial z^{(2)}}}_{3 \times 3} \underbrace{\frac{\partial z^{(3)}}{\partial a^{(2)}}}_{3 \times 2} \underbrace{\frac{\partial L}{\partial z^{(3)}}}_{2 \times 1}$$

And everything matches. I've changed two more things:

• Some of these calculations can be merged and optimised using element-wise multiplications, e.g. the term $$\frac{\partial a^{(2)}}{\partial z^{(2)}}$$ produces a 3×3 output, but it's a diagonal matrix. This is actually what you've done while calculating gradients in the last layer.
• I've used $$W_{ij}$$ because it's easier to reason about. $$\frac{\partial z}{\partial W}$$ is a 3D tensor, since the numerator is a vector and the denominator is a matrix. After finding the expressions for each $$W_{ij}$$ and placing them into the gradient matrix one by one according to denominator layout, you can factor out the common multiplications and write the final formula.
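Carrying out both steps gives, under the assumptions below, $$\frac{\partial L}{\partial W^{(2)}} = \left( W^{(3)T} \delta^{(3)} \odot f'(z^{(2)}) \right) a^{(1)T}$$ with $$\delta^{(3)} = (a^{(3)} - y) \odot a^{(3)} \odot (1 - a^{(3)})$$, which is 3×2 like $$W^{(2)}$$. The following is a minimal sketch that compares this formula with finite differences. It assumes a 2-3-2 network without biases, ReLU in the hidden layer (suggested by the $$(1,1,1)$$ remark in the question) and a sigmoid output; all numeric values are made up.

```python
import math

def matvec(A, v):
    return [sum(x * y for x, y in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def forward(W2, W3, a1):
    z2 = matvec(W2, a1)
    a2 = [max(0.0, x) for x in z2]                   # ReLU (assumed)
    z3 = matvec(W3, a2)
    a3 = [1.0 / (1.0 + math.exp(-x)) for x in z3]    # sigmoid
    return z2, a2, a3

def loss(W2, W3, a1, y):
    _, _, a3 = forward(W2, W3, a1)
    return 0.5 * sum((a - t) ** 2 for a, t in zip(a3, y))

def grad_W2(W2, W3, a1, y):
    z2, _, a3 = forward(W2, W3, a1)
    delta3 = [(a - t) * a * (1.0 - a) for a, t in zip(a3, y)]   # 2x1
    back = matvec(transpose(W3), delta3)                        # (3x2)(2x1) -> 3x1
    relu_grad = [1.0 if x > 0 else 0.0 for x in z2]             # diagonal of da2/dz2
    delta2 = [b * g for b, g in zip(back, relu_grad)]           # element-wise, 3x1
    return outer(delta2, a1)                                    # (3x1)(1x2) -> 3x2

# Hypothetical values; the third hidden unit has z < 0, so its mask entry is 0.
a1 = [1.0, 2.0]
y = [1.0, 0.0]
W2 = [[0.1, 0.2], [-0.3, 0.4], [0.5, -0.6]]
W3 = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]

analytic = grad_W2(W2, W3, a1, y)

# Central finite differences, one weight at a time.
eps = 1e-6
max_err = 0.0
for r in range(3):
    for c in range(2):
        Wp = [row[:] for row in W2]
        Wm = [row[:] for row in W2]
        Wp[r][c] += eps
        Wm[r][c] -= eps
        numeric = (loss(Wp, W3, a1, y) - loss(Wm, W3, a1, y)) / (2 * eps)
        max_err = max(max_err, abs(numeric - analytic[r][c]))

print(len(analytic), len(analytic[0]), max_err)
```

The element-wise product with `relu_grad` replaces the 3×3 diagonal matrix, and it reduces to the $$(1,1,1)$$ vector only when every entry of $$z^{(2)}$$ is positive.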
