As far as I can see, ResNet-152 (paper, visualization, Caffe model) expects inputs of dimensions 224×224×3, and its first layer applies 64 convolution filters, each 7×7, with padding 3 and stride 2.
Since $\frac{\text{input} + 2\times\text{padding} - \text{filter}}{\text{stride}} + 1 = \frac{224 + 2\times 3 - 7}{2} + 1 = 112.5$, its output's dimensions should be 112.5×112.5×64 (right?). This must be converted to an integer; it looks like Caffe "truncates toward zero", so the output's dimensions are actually 112×112×64 (here's the code; I loaded the model with Caffe and verified it).
This seems to be different from Caffe's strategy for pooling layers (code), which ceils the result instead of truncating it.
My questions:
- What's the conventional way of treating kernel/padding/stride settings that result in "fractional" dimensions (e.g. 112.5 as above)?
- Are the common frameworks (Caffe, TensorFlow, Torch…) consistent about it?
- Is the inconsistency between the behaviour of Caffe's convolutional layers and pooling layers "by design"? If so, why?
Best Answer
The fractional part comes from the stride. Without stride, the output size is output_no_stride = input + 2*pad - filter + 1 = 224. With stride, the conventional formula is output_with_stride = floor((input + 2*pad - filter) / stride) + 1 = floor(223 / 2) + 1 = 112.
In many programming languages, the default behaviour of integer division is "round toward zero", so the floor operation can be omitted when the numerator and denominator are positive integers. (Ref: Caffe's convolution implementation, cuDNN docs.)
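For concreteness, here is a minimal sketch (my own, not Caffe's actual code) of the conventional formula, with C++'s truncating integer division standing in for the floor:

```cpp
#include <iostream>

// Conventional output-size formula; with positive operands, C++ integer
// division truncates toward zero, which coincides with floor() here.
int conv_out_size(int input, int pad, int filter, int stride) {
  return (input + 2 * pad - filter) / stride + 1;
}

int main() {
  // ResNet-152's first layer: 224x224 input, 7x7 filter, pad 3, stride 2.
  std::cout << conv_out_size(224, 3, 7, 2) << std::endl;  // prints 112
}
```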
Comparing the output dimension with and without stride:

output_with_stride = floor((input + 2*pad - filter) / stride) + 1
                   = floor((output_no_stride - 1) / stride) + 1
                   = ceil(output_no_stride / stride)
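The last step uses the identity floor((n − 1)/s) + 1 = ceil(n/s), which holds for positive integers n and s. A quick brute-force check of it (my own sketch, not part of the original answer):

```cpp
#include <cassert>

int main() {
  // Verify floor((n - 1) / s) + 1 == ceil(n / s) for positive integers,
  // using truncating integer division for floor and (n + s - 1) / s for ceil.
  for (int n = 1; n <= 1000; ++n) {
    for (int s = 1; s <= 16; ++s) {
      assert((n - 1) / s + 1 == (n + s - 1) / s);
    }
  }
  return 0;  // all assertions passed
}
```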
Caffe's pooling is a bit more complicated: it first replaces the floor with a ceiling, then decreases the size by one if the last pooling window does not start strictly inside the image (i.e. it starts in the padding), as shown in the code:
```cpp
pooled_height_ = static_cast<int>(ceil(static_cast<float>(
    height_ + 2 * pad_h_ - kernel_h_) / stride_h_)) + 1;
pooled_width_ = static_cast<int>(ceil(static_cast<float>(
    width_ + 2 * pad_w_ - kernel_w_) / stride_w_)) + 1;
if (pad_h_ || pad_w_) {
  // If we have padding, ensure that the last pooling starts strictly
  // inside the image (instead of at the padding); otherwise clip the last.
  if ((pooled_height_ - 1) * stride_h_ >= height_ + pad_h_) {
    --pooled_height_;
  }
  if ((pooled_width_ - 1) * stride_w_ >= width_ + pad_w_) {
    --pooled_width_;
  }
  CHECK_LT((pooled_height_ - 1) * stride_h_, height_ + pad_h_);
  CHECK_LT((pooled_width_ - 1) * stride_w_, width_ + pad_w_);
}
```
I think the result is mostly aligned with the conventional formula, except when the last pooling window falls entirely outside the original input.
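To see the difference in practice, here is a small sketch (my own, simplified one-dimensional version of the Caffe snippet above) comparing the two rules; the parameter values assume ResNet-152's pool1 (3×3, stride 2, no padding) applied to the 112×112 conv1 output:

```cpp
#include <cmath>
#include <cstdio>

// Caffe's pooling rule: ceiling, then clip if the last window would
// start in the padding rather than strictly inside the image.
int pooled_size(int input, int pad, int kernel, int stride) {
  int out = static_cast<int>(
      std::ceil(static_cast<float>(input + 2 * pad - kernel) / stride)) + 1;
  if (pad && (out - 1) * stride >= input + pad) --out;
  return out;
}

// Caffe's convolution rule: floor via truncating integer division.
int conv_size(int input, int pad, int kernel, int stride) {
  return (input + 2 * pad - kernel) / stride + 1;
}

int main() {
  // 112x112 input, 3x3 kernel, no padding, stride 2:
  std::printf("pooling rule: %d\n", pooled_size(112, 0, 3, 2));  // 56
  std::printf("conv rule:    %d\n", conv_size(112, 0, 3, 2));    // 55
}
```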