Solved – Translational variance in convolutional neural networks

Convolutional networks have been proven to work very well detecting a shape independently of where it is in the image, which is referred as translational invariance.

In the case where the position of an object in an image contains information for a classification problem, are convolutional networks still a good method? For example, if we have images of this kind:

Cat habits

The position of the cat in the images is related with the time of the day, and because of that, we can infer if the cat is having lunch or dinner.

Is convolution still a good candidate to solve this problem or are there more appropriate methods? In case convolution fits this purpose, might a fully connected layer be a better approach to solve the spatial variability?

Edit: This is an oversimplification of the problem where images contain entangled and complex patterns which cannot be easily isolated.

Yes, convolutional networks are still a good candidate. As you know, the way they work is detecting more and more complex features, starting from edges and colors and detecting textures and other concepts in deeper layers. To detect a feature (let's say a cat head), it is not enough to know that relevant lower level features are present in the region, but their relative position also plays a role (e.g. the eyes should be lower than ears). Convolutional networks do capture this information, and if the relative position of an object plays a role for classification, they should discover it.

As you already mentioned, fully-connected layers encode this positional information even better, and they are used after convolution layers in networks for classification. Possibly even better would be using a "locally-connected layer", which is a combination of a conv-layer and a fully-connected one: it performs a convolution operation, but at every position it learns a separate convolution kernel.

Finally, if you know that position of some objects is crucial for the classification, consider splitting the problem in two parts. Firstly, train a network that can detect these relevant objects (for example a fully convolutional network with low output resolution), and secondly add an extra layer on top of this (locally connected for example) to do the classification. It has been demonstrated by (Gulcehre and Bengio, 2013) that such subproblem splitting can make the difference between easy and impossible: Knowledge Matters: Importance of Prior Information for Optimization.

Finally, I would correct that the phenomenon you are referring to is not called translation invariance but equivariance. A great explanation of the difference is here:

Invariance to a transformation means if you take the input and transform it then the representation you get is the same as the representation of the original, i.e. represent(x) = represent(transform(x))

Equivariance to a transformation means if you take the input and transform it then the representation you get is a transformation of the representation of the original, i.e. transform(represent(x)) = represent(transform(x)).

Similar Posts:

Rate this post

Leave a Comment