I read this part in the paper, but I didn't fully understand it.
"we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way"
1- What is the meaning of "convolutional property" and "normalized in the same way"?
2- Why do $\gamma$ and $\beta$ have dimension $C$ (the depth) and not shape $[C, H, W]$, where $H$ and $W$ are the height and width?
Best Answer
"Convolution property" is that the very same values of weights are used against all locations of picture of big size. Convolution network doesn't distinguish parts of image, it looks with the "same eye" at each location of it.
This relates with what did they say in paragraph above:
BN transform is applied independently to each dimension of $x = Wu$, with a separate pair of learned parameters $\gamma^{(k)}$, $\beta^{(k)}$ per dimension
So, for a dense layer with $K$ output features, you should have $K$ distributions, modelled by BN.
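As a minimal NumPy sketch of this per-dimension case (my own illustration, not code from the paper; training-time statistics only, no running averages):

```python
import numpy as np

def batch_norm_dense(x, gamma, beta, eps=1e-5):
    """BN for a dense layer.

    x:     (N, K) batch of activations, N examples, K features
    gamma: (K,)   one learned scale per feature
    beta:  (K,)   one learned shift per feature
    """
    mean = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                       # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalize each dimension independently
    return gamma * x_hat + beta               # K separate (gamma, beta) pairs
```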
For a convolutional layer, the number of output features is $W \times H \times C$, where $W$ is the picture width, $H$ is the picture height, and $C$ is the number of filters (the picture depth, or number of channels). In convolutional models like VGG, $C$ can be much greater than 3 in the later layers.
So one might think that BN should compute that large a number of $\gamma^{(k)}$, $\beta^{(k)}$ pairs. Sometimes images really are flattened and processed with dense layers, and then they do have that many features.
But to maintain the "convolutional property", there should be only $C$ distribution models, because every region of the picture should be modelled in the same way.
In other words, for convolutional layers BN should not model the distribution of activations in each region of the picture separately; it should keep only one model per channel.
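Here is a minimal NumPy sketch of this "spatial" variant (my own illustration, assuming NCHW layout): the statistics are pooled over the batch *and* both spatial axes, so $\gamma$ and $\beta$ only need shape $(C,)$:

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """Spatial BN for a conv layer, NCHW layout.

    x:     (N, C, H, W) batch of feature maps
    gamma: (C,) one learned scale per channel
    beta:  (C,) one learned shift per channel
    """
    # Pool statistics over the batch AND the spatial axes, so every
    # location of a given channel is normalized in the same way.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Broadcast the C-sized gamma/beta across H and W.
    return gamma[None, :, None, None] * x_hat + beta[None, :, None, None]
```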
As far as I understood, TensorFlow's batch_normalization maintains this by design, because the recommendation is to set axis to the position of the channels dimension.
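For example, with the Keras layer (a quick sketch; for NHWC inputs the channels axis is the last one, which is also the default):

```python
import tensorflow as tf

# NHWC input: channels are the last axis, so axis=-1 gives one
# gamma/beta pair per channel, i.e. C parameters each.
bn = tf.keras.layers.BatchNormalization(axis=-1)
x = tf.random.normal((8, 32, 32, 64))   # (N, H, W, C)
y = bn(x, training=True)
print(bn.gamma.shape, bn.beta.shape)    # (64,) (64,) -> per-channel only
```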
In Lua Torch they have a special SpatialBatchNormalization version, but I think this is because they carefully model "spatial", "volumetric", and even "temporal" dimensions.