Solved – Any explanation for the spatial batch normalization

I read this part in the paper but I didn't fully understand it.
"we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way"

1- What is the meaning of "convolutional property" and "normalized in the same way"?
2- Why do $\gamma$ and $\beta$ have dimension $C$ (the depth) rather than shape $[C,H,W]$, where $H$ and $W$ are the height and width?

The "convolutional property" is that the very same weight values are applied at every location of a large picture. A convolutional network doesn't distinguish between parts of the image; it looks at each location with the "same eye".

This relates to what they said in the paragraph above:

BN transform is applied independently to each dimension of $x = Wu$, with a separate pair of learned parameters $γ^{(k)}$, $β^{(k)}$ per dimension

So, for a dense layer with $K$ output features, you should have $K$ distributions modelled by BN.
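To make the dense case concrete, here is a minimal NumPy sketch of the training-time BN transform for a dense layer. The function name `batch_norm_dense`, the sizes, and the `eps` value are my own choices for illustration and do not come from the paper; running averages for inference are left out.

```python
import numpy as np

def batch_norm_dense(x, gamma, beta, eps=1e-5):
    """Training-time BN for a dense layer.

    x:     activations of shape (N, K), N samples and K features
    gamma: learned scale, shape (K,)
    beta:  learned shift, shape (K,)
    """
    mean = x.mean(axis=0)                     # one mean per feature, shape (K,)
    var = x.var(axis=0)                       # one variance per feature, shape (K,)
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalize each feature independently
    return gamma * x_hat + beta               # per-feature scale and shift

rng = np.random.default_rng(0)
N, K = 32, 4                                  # illustrative sizes (assumption)
x = rng.normal(loc=3.0, scale=2.0, size=(N, K))
gamma = np.ones(K)                            # K scale parameters
beta = np.zeros(K)                            # K shift parameters
y = batch_norm_dense(x, gamma, beta)
print(gamma.shape, beta.shape, y.shape)
```

Each of the $K$ columns gets its own mean, variance, $γ^{(k)}$ and $β^{(k)}$, which is the "separate pair of learned parameters per dimension" from the quote.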

For a convolutional layer, the number of output features is $W \times H \times C$, where $W$ is the picture width, $H$ is the picture height and $C$ is the number of filters (picture depth or number of channels).

In convolutional models like VGG, $C$ can be much greater than 3 in the later layers.

So one might think that BN should learn that many pairs $γ^{(k)}$ and $β^{(k)}$. Sometimes images are indeed processed with dense layers; they are flattened first and then really do have that many features.

But to maintain the "convolutional property", there should be only $C$ distribution models, because every region of the picture should be modelled in the same way.

In other words, for convolutional layers BN should not model the distribution of activations at each location of the picture separately; it should fit only one model per channel, with statistics pooled over the mini-batch and all spatial locations.
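A rough NumPy sketch of this per-channel normalization follows. It assumes a channels-first layout $(N, C, H, W)$; the function name `spatial_batch_norm` and the sizes are hypothetical, and only the training-time transform is shown.

```python
import numpy as np

def spatial_batch_norm(x, gamma, beta, eps=1e-5):
    """Training-time BN for a convolutional layer.

    x:     activations of shape (N, C, H, W) (channels-first, an assumption)
    gamma: learned scale, shape (C,)
    beta:  learned shift, shape (C,)
    """
    # Pool statistics over the batch and both spatial axes: one pair per channel.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)     # shape (1, C, 1, 1)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # The same gamma and beta are broadcast to every location of a channel.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
N, C, H, W = 8, 3, 5, 5                            # illustrative sizes (assumption)
x = rng.normal(loc=1.0, scale=4.0, size=(N, C, H, W))
gamma = np.ones(C)
beta = np.zeros(C)
y = spatial_batch_norm(x, gamma, beta)

per_channel_params = 2 * C            # what spatial BN learns
per_location_params = 2 * C * H * W   # what treating every location as a feature would need
print(per_channel_params, per_location_params)
```

Every pixel of a given feature map is shifted and scaled by the same numbers, which is what "normalized in the same way" means, and it is also why $\gamma$ and $\beta$ have shape $C$ rather than $[C,H,W]$.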

As far as I understand, TensorFlow's batch_normalization maintains this by design, because its documentation recommends setting axis to the position of the channel dimension.

Lua Torch has a dedicated SpatialBatchNormalization module, but I think this is because Torch carefully distinguishes "spatial", "volumetric" and even "temporal" dimensions.
