I have recently started implementing my own convolutional neural network, and I have a few questions. I will refer to an example so that we all stay on the same page. Suppose:

- input: `64X64X1`, that is, gray channel only. Output: 64X64X1
- C1: `5X5X6`, that is, 6 `conv_maps`, each of size 5X5. Output: 60X60X6
- P1: max-pooling, non-overlapping, `size = 2X2`. Output: 30X30X6
- C2: `9X9X8`, that is, 8 `conv_maps`, each of size 9X9. Output: 22X22X48 `//Subject_To_Change`
- P2: max-pooling, non-overlapping, `size = 2X2`. Output: 11X11X48 `//Subject_To_Change`

OK, now here are the questions:

**ReLU**

As I understand it, ReLU is applied to every neuron. That is, in C1, the first time the `5X5` patch is moved over the `input`, the sum of the convolution has to pass through the `transform_function`. And there is no `transform_function` at the pooling layer. Am I correct in understanding it?

Which function should I use as the `transfer_function`? Softplus? The noisy one? The leaky one?

Also, the same transfer function should be used for the `FeedForward` part, right? Or can I change to `sigmoid` there?

**Convolution-Feature_Map Connections**

How do I carry out the next convolution? The `P1` layer has 6 maps of `30X30`. There are going to be 8 convolutional kernels, each of size 9X9. But I have NEVER seen this producing `6*8` maps. Specifically, `LeNet` has an output of 16 maps. How to produce those maps is given in this paper on page 8. After reading it again and again, I DO NOT get how to generate the next feature maps. Are they doing it like this? [figure]

Also, isn't the method mentioned in the paper specific to OCR? I am very confused about how to write a program for this in a user-friendly way. For example, if I want to see the output of a different architecture, how do I define these rules of connection programmatically?

I definitely did not understand the *"It forces a break of symmetry .."* part of the above-mentioned paper. Please elaborate if you can. I am not able to visualize the problem of symmetry here.

**About Bias**

Initially I thought of `bias` as a window of kernel size, but now I think it's just a number between 0 and 1. But how do I add a bias? If I treat the kernel as a matrix, say 5X5, how can I possibly add a single number to a matrix? We get the sum after the convolution; I think I am supposed to add the bias to this sum and then apply the transform function. Right?

#### Best Answer

Each kernel is convolved with all of the input maps, and the results are summed into a single output map. In the input layer this is trivial, since there is only one input map. After the first convolution, however, each later convolution is a sum of kernel operations over all input feature maps (so each C2 kernel effectively has one 9X9 slice per input map). Hence C2 should produce 8 feature maps, not 48. This link explains the network and its back-prop in a clear way.
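
To make the "sum over all input maps" concrete, here is a minimal NumPy sketch of the C2 layer from the question. It assumes full connectivity (every output map sees every input map), stride 1, and no padding; the function name `conv_layer` and the random data are placeholders for illustration, not part of any library.

```python
import numpy as np

def conv_layer(inputs, kernels):
    """Valid cross-correlation, summed over all input maps.

    inputs:  (C_in, H, W) stack of input feature maps
    kernels: (C_out, C_in, k, k), one k x k slice per input map
    returns: (C_out, H - k + 1, W - k + 1)
    """
    c_out, c_in, k, _ = kernels.shape
    _, h, w = inputs.shape
    out_h, out_w = h - k + 1, w - k + 1
    out = np.zeros((c_out, out_h, out_w))
    for o in range(c_out):
        for i in range(out_h):
            for j in range(out_w):
                patch = inputs[:, i:i + k, j:j + k]   # shape (C_in, k, k)
                # Multiply elementwise and sum over channels AND positions:
                # this is the summation across input maps.
                out[o, i, j] = np.sum(patch * kernels[o])
    return out

rng = np.random.default_rng(0)
p1 = rng.standard_normal((6, 30, 30))                # 6 pooled maps from P1
c2_kernels = rng.standard_normal((8, 6, 9, 9)) * 0.1  # 8 kernels, each 6 x 9 x 9
c2 = conv_layer(p1, c2_kernels)
print(c2.shape)  # (8, 22, 22)
```

The loops are written for readability rather than speed; the point is only that 6 input maps and 8 kernels give 8 output maps.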

Use $f(x) = \max(0, x)$ as the activation (transform) function. Once your implementation works, you can try the others too. You should use the same function for both 'feedforward' and 'back-prop'.
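
As a small illustration, ReLU and the derivative that back-prop needs can be sketched as below. The helper names `relu` and `relu_grad` are made up for this example, and taking the derivative at exactly 0 to be 0 is a common convention rather than a requirement.

```python
import numpy as np

def relu(x):
    """Forward pass: f(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU w.r.t. its input: 1 where x > 0, else 0."""
    return (x > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 1.5])  # pre-activation values
a = relu(z)                           # values passed to the next layer
upstream = np.ones_like(z)            # pretend gradient arriving from above
dz = upstream * relu_grad(z)          # gradient passed back through ReLU
print(a)   # [0.  0.  0.  1.5]
print(dz)  # [0. 0. 0. 1.]
```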

I haven't read the paper, but breaking symmetry is about drawing the initial weights from a random distribution. If the weights are the same across feature maps, the back-propagated error will be the same too. As a result, the network learns identical filters, which is not desirable.
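
The following toy sketch illustrates the symmetry argument under simplifying assumptions: a single 8X8 image, two 3X3 kernels followed by ReLU, and a squared-error loss on the sum of the two maps. All names (`correlate_valid`, `kernel_gradients`) and the loss are hypothetical choices for this illustration, not taken from the paper.

```python
import numpy as np

def correlate_valid(image, kernel):
    """Plain 2-D valid cross-correlation."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def kernel_gradients(image, kernels, target):
    """Gradient of 0.5 * sum((sum of ReLU maps - target)^2) w.r.t. each kernel."""
    maps = [np.maximum(0.0, correlate_valid(image, k)) for k in kernels]
    err = sum(maps) - target                 # dLoss / d(sum of maps)
    grads = []
    for fmap in maps:
        delta = err * (fmap > 0)             # back through this map's ReLU
        grads.append(correlate_valid(image, delta))  # dLoss / dKernel
    return grads

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
target = np.zeros((6, 6))

shared = rng.standard_normal((3, 3))
same_grads = kernel_gradients(image, [shared, shared.copy()], target)
identical = np.allclose(same_grads[0], same_grads[1])
print(identical)  # True: identical kernels get identical updates

random_kernels = [rng.standard_normal((3, 3)) for _ in range(2)]
random_grads = kernel_gradients(image, random_kernels, target)
print(np.allclose(random_grads[0], random_grads[1]))
```

With identical starting kernels the two gradients match exactly, so the kernels stay identical after every update; random initialization is what lets them diverge.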

The rules of connection are already defined as mathematical expressions. The number of kernels, the number of layers, the kernel size, etc. should be defined symbolically and assigned in the main section of the code.
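
One possible way to define the architecture symbolically is a plain list of layer descriptions from which the output shapes are derived. This is only a sketch: the `Conv` and `Pool` classes and `output_shapes` are hypothetical names, and it assumes stride-1 valid convolutions, non-overlapping pooling, and full connectivity between maps.

```python
from dataclasses import dataclass

@dataclass
class Conv:
    num_maps: int      # number of output feature maps
    kernel_size: int   # square kernel, stride 1, no padding

@dataclass
class Pool:
    size: int          # non-overlapping pooling window

def output_shapes(input_shape, layers):
    """Return the (height, width, maps) shape after each layer."""
    h, w, c = input_shape
    shapes = []
    for layer in layers:
        if isinstance(layer, Conv):
            h = h - layer.kernel_size + 1
            w = w - layer.kernel_size + 1
            c = layer.num_maps
        elif isinstance(layer, Pool):
            h, w = h // layer.size, w // layer.size
        else:
            raise TypeError(f"unknown layer type: {layer!r}")
        shapes.append((h, w, c))
    return shapes

# The network from the question; change this list to try another design.
ARCHITECTURE = [Conv(6, 5), Pool(2), Conv(8, 9), Pool(2)]
shapes = output_shapes((64, 64, 1), ARCHITECTURE)
print(shapes)  # [(60, 60, 6), (30, 30, 6), (22, 22, 8), (11, 11, 8)]
```

Trying a different architecture then only means editing `ARCHITECTURE`, not the layer code.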

You should add the bias before applying the activation function. Most commonly, a single bias is added per feature map. Adding a scalar to a matrix simply means adding the scalar to every element of the matrix.
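
In NumPy, adding one scalar bias per feature map can be sketched with broadcasting. The arrays below are random placeholders standing in for the convolution sums of C2, and the bias values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
conv_sums = rng.standard_normal((8, 22, 22))  # convolution sums, 8 maps
biases = np.linspace(-0.5, 0.5, 8)            # one scalar bias per map

# Reshape biases to (8, 1, 1) so each scalar is added to every element
# of its own 22 x 22 map.
pre_activation = conv_sums + biases[:, None, None]

# The activation is applied after the bias has been added.
activated = np.maximum(0.0, pre_activation)
print(pre_activation.shape)  # (8, 22, 22)
```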

If you haven't written code for a plain NN before, it would be better to start with that.
