CNN Networks

Milestone CNN Networks

| CNN Network | Year | First Author | Max Depth | Max Params | Originality |
|---|---|---|---|---|---|
| LeNet | 1998 | Yann LeCun | 7 | 11880+ | |
| AlexNet | 2012 | Alex Krizhevsky | 8 | 60M | Dropout |
| GoogLeNet | 2014 | Christian Szegedy | 22 | | Inception module (Fig. 1); intermediate classifiers |
| VGG | 2014 | Karen Simonyan | 19 | 144M | 3\(\times\)3 kernels |
| ResNet | 2015 | Kaiming He | 152 | | Residual learning; shortcut connections |
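
The "residual learning; shortcut connections" entry is easiest to see in code. Below is a minimal sketch of a basic residual block in PyTorch, not the authors' implementation; the fixed channel count and the identity (non-projection) shortcut are simplifying assumptions.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """y = F(x) + x, where F is two 3x3 conv layers and the shortcut is the identity."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # shortcut: add the input back before the final ReLU

x = torch.randn(1, 64, 56, 56)
print(BasicResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```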

Other Networks

| CNN Network | Year | First Author | Originality |
|---|---|---|---|
| NIN | 2013 | Min Lin | Mlpconv (Fig. 2); global average pooling |
| Maxout Network | 2013 | Ian J. Goodfellow | Max pooling over affine feature maps |
| MatchNet | 2015 | Xufeng Han | Produces a similarity score for a patch pair (Fig. 3) |
| Spatial Transformer Network | 2015 | Max Jaderberg | Spatial transformer module (Fig. 4) |
| Fast R-CNN | 2015 | Ross Girshick | Shares computation across proposals compared to R-CNN (Fig. 5) |
| RPN | 2015 | Shaoqing Ren | Region Proposal Network (Fig. 6) |
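
As a concrete example of the NIN row, an mlpconv layer can be written as an ordinary conv followed by 1\(\times\)1 convs, and global average pooling replaces the fully-connected classifier head. A rough sketch, with channel counts chosen only for illustration:

```python
import torch
import torch.nn as nn

def mlpconv(in_ch, out_ch, kernel_size, stride=1, padding=0):
    """A conv layer followed by two 1x1 convs -- the 'micro network' idea in NIN."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 1), nn.ReLU(inplace=True),
    )

# Global average pooling: the last mlpconv emits one feature map per class,
# and each map is averaged into a single class score (no fc classifier).
net = nn.Sequential(
    mlpconv(3, 192, 5, padding=2),
    mlpconv(192, 10, 3, padding=1),          # 10 maps for a 10-class problem
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # global average pooling -> (N, 10)
)
print(net(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```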

Deep Inside CNN

Reflections

Appendix

| Layer | Shape | Output Blob Shape |
|---|---|---|
| input | 3x227x227 | |
| conv1 | 96x11x11, stride=4 | 1x96x55x55 |
| pool1 | 3x3, stride=2 | 1x96x27x27 |
| conv2 | 256x5x5, stride=1 | 1x256x27x27 |
| pool2 | 3x3, stride=2 | 1x256x13x13 |
| conv3 | 384x3x3, stride=1 | 1x384x13x13 |
| conv4 | 384x3x3, stride=1 | 1x384x13x13 |
| conv5 | 256x3x3, stride=1 | 1x256x13x13 |
| pool5 | 3x3, stride=2 | 1x256x6x6 |
| fc6 | | 4096 |
| fc7 | | 4096 |
| fc8 | | 1000 |

Table 1. AlexNet
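
The output blob shapes in Table 1 can be reproduced with a few lines of PyTorch. The padding values below (0, 2, 1, 1, 1) are my assumption, chosen so the stack matches the listed shapes; ReLU and LRN layers are omitted since they do not change shapes.

```python
import torch
import torch.nn as nn

# Conv/pool stack from Table 1; paddings chosen to match the listed output shapes.
features = nn.Sequential(
    nn.Conv2d(3, 96, 11, stride=4),              # conv1 -> 96x55x55
    nn.MaxPool2d(3, stride=2),                   # pool1 -> 96x27x27
    nn.Conv2d(96, 256, 5, stride=1, padding=2),  # conv2 -> 256x27x27
    nn.MaxPool2d(3, stride=2),                   # pool2 -> 256x13x13
    nn.Conv2d(256, 384, 3, stride=1, padding=1), # conv3 -> 384x13x13
    nn.Conv2d(384, 384, 3, stride=1, padding=1), # conv4 -> 384x13x13
    nn.Conv2d(384, 256, 3, stride=1, padding=1), # conv5 -> 256x13x13
    nn.MaxPool2d(3, stride=2),                   # pool5 -> 256x6x6
)
x = torch.randn(1, 3, 227, 227)
print(features(x).shape)  # torch.Size([1, 256, 6, 6]) -> flattened into fc6 (4096)
```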

| Layer | Shape | Output Blob Shape |
|---|---|---|
| input | 1x64x64 | |
| conv1 | 24x7x7, stride=1 | 1x24x64x64 |
| pool1 | 3x3, stride=2 | 1x24x32x32 |
| conv2 | 64x5x5, stride=1 | 1x64x32x32 |
| pool2 | 3x3, stride=2 | 1x64x16x16 |
| conv3 | 96x3x3, stride=1 | 1x96x16x16 |
| conv4 | 96x3x3, stride=1 | 1x96x16x16 |
| conv5 | 64x3x3, stride=1 | 1x64x16x16 |
| pool5 | 3x3, stride=2 | 1x64x8x8 |

Table 2. Feature tower of MatchNet
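
MatchNet runs this feature tower on both patches and feeds the concatenated outputs to a small fully-connected metric network that produces the similarity score. A rough sketch: the padded 3\(\times\)3 pooling is an assumption made to match the shapes in Table 2, and the metric network's hidden sizes are illustrative only.

```python
import torch
import torch.nn as nn

tower = nn.Sequential(                                    # feature tower from Table 2
    nn.Conv2d(1, 24, 7, padding=3), nn.ReLU(inplace=True),
    nn.MaxPool2d(3, stride=2, padding=1),                 # 24x32x32
    nn.Conv2d(24, 64, 5, padding=2), nn.ReLU(inplace=True),
    nn.MaxPool2d(3, stride=2, padding=1),                 # 64x16x16
    nn.Conv2d(64, 96, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(96, 96, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(96, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(3, stride=2, padding=1), nn.Flatten(),   # 64x8x8 -> 4096
)
metric = nn.Sequential(                                   # hidden sizes are illustrative
    nn.Linear(2 * 64 * 8 * 8, 1024), nn.ReLU(inplace=True),
    nn.Linear(1024, 1024), nn.ReLU(inplace=True),
    nn.Linear(1024, 2),                                   # two-way match / non-match score
)
a, b = torch.randn(1, 1, 64, 64), torch.randn(1, 1, 64, 64)
score = metric(torch.cat([tower(a), tower(b)], dim=1)).softmax(dim=1)
print(score.shape)  # torch.Size([1, 2])
```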

| Layer | Shape | Output Blob Shape |
|---|---|---|
| input | 3x224x224 | |
| conv1_1 | 64x3x3, stride=1 | 1x64x224x224 |
| conv1_2 | 64x3x3, stride=1 | 1x64x224x224 |
| pool1 | 2x2, stride=2 | 1x64x112x112 |
| conv2_1 | 128x3x3, stride=1 | 1x128x112x112 |
| conv2_2 | 128x3x3, stride=1 | 1x128x112x112 |
| pool2 | 2x2, stride=2 | 1x128x56x56 |
| conv3_1 | 256x3x3, stride=1 | 1x256x56x56 |
| conv3_2 | 256x3x3, stride=1 | 1x256x56x56 |
| conv3_3 | 256x3x3, stride=1 | 1x256x56x56 |
| pool3 | 2x2, stride=2 | 1x256x28x28 |
| conv4_1 | 512x3x3, stride=1 | 1x512x28x28 |
| conv4_2 | 512x3x3, stride=1 | 1x512x28x28 |
| conv4_3 | 512x3x3, stride=1 | 1x512x28x28 |
| pool4 | 2x2, stride=2 | 1x512x14x14 |
| conv5_1 | 512x3x3, stride=1 | 1x512x14x14 |
| conv5_2 | 512x3x3, stride=1 | 1x512x14x14 |
| conv5_3 | 512x3x3, stride=1 | 1x512x14x14 |
| pool5 | 2x2, stride=2 | 1x512x7x7 |
| fc6 | | 4096 |
| fc7 | | 4096 |
| fc8 | | 1000 |

Table 3. VGG16

Figure 1. Inception module in GoogLeNet
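
A minimal sketch of such a module: four parallel branches (1\(\times\)1; 1\(\times\)1 then 3\(\times\)3; 1\(\times\)1 then 5\(\times\)5; 3\(\times\)3 pooling then 1\(\times\)1) whose outputs are concatenated along the channel axis. The branch widths in the usage line are just one plausible setting, not a claim about GoogLeNet's exact configuration.

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    """Four parallel branches whose outputs are concatenated along the channel axis."""
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, cp):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, 1)                          # 1x1 branch
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c3r, 1), nn.ReLU(True),
                                nn.Conv2d(c3r, c3, 3, padding=1))  # 1x1 reduce -> 3x3
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, c5r, 1), nn.ReLU(True),
                                nn.Conv2d(c5r, c5, 5, padding=2))  # 1x1 reduce -> 5x5
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, cp, 1))           # pool -> 1x1 projection

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(Inception(192, 64, 96, 128, 16, 32, 32)(x).shape)  # torch.Size([1, 256, 28, 28])
```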

Figure 2. Mlpconv layer in NIN

Figure 3. The MatchNet architecture

Figure 4. Spatial transformer
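
A minimal sketch of a spatial transformer restricted to an affine transform: a small localization network (assumed here, not the paper's exact architecture) regresses the 2\(\times\)3 parameters, which define a sampling grid that is then applied to the input feature map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Localization net -> affine grid -> differentiable sampling of the input."""
    def __init__(self, in_ch):
        super().__init__()
        self.loc_features = nn.Sequential(            # small localization network (assumed)
            nn.Conv2d(in_ch, 8, 7), nn.MaxPool2d(2), nn.ReLU(True),
            nn.Conv2d(8, 10, 5), nn.MaxPool2d(2), nn.ReLU(True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc_theta = nn.Linear(10, 6)              # regresses the 2x3 affine parameters
        # Start from the identity transform so the module initially does nothing.
        self.fc_theta.weight.data.zero_()
        self.fc_theta.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.fc_theta(self.loc_features(x)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

print(SpatialTransformer(1)(torch.randn(2, 1, 28, 28)).shape)  # torch.Size([2, 1, 28, 28])
```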

Figure 5. Fast R-CNN architecture. The inputs are a whole image and a set of object proposals. The network first processes the image to produce a conv feature map, then extracts a fixed-length feature vector from that map for each proposal. Two sibling branches then produce a classification probability and a refined bounding box for each proposal.
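
The shared computation comes from RoI pooling: the conv feature map is computed once per image, and every proposal is pooled from that same map into a fixed-size grid. A small sketch using torchvision's roi_pool; the feature map, stride, and box coordinates are placeholders.

```python
import torch
from torchvision.ops import roi_pool

# One shared conv feature map for the whole image (placeholder backbone output),
# assumed to be downsampled 16x relative to the input image.
feature_map = torch.randn(1, 256, 50, 50)

# Object proposals in image coordinates: (batch_index, x1, y1, x2, y2).
proposals = torch.tensor([[0, 0.0, 0.0, 320.0, 320.0],
                          [0, 160.0, 80.0, 640.0, 400.0]])

# Each proposal is pooled from the *same* feature map to a fixed 7x7 grid, so the
# expensive convolutions run once per image rather than once per proposal.
rois = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1 / 16)
print(rois.shape)  # torch.Size([2, 256, 7, 7]) -> flattened into the fc layers
```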

Figure 6. Region Proposal Network. The sliding n\(\times\)n window is mapped to a lower-dimensional vector by an n\(\times\)n conv layer, followed by two sibling 1\(\times\)1 conv layers (acting as fully-connected layers at each position) for box regression and box classification, respectively.
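
A minimal sketch of the RPN head: one 3\(\times\)3 conv over the shared feature map, then two sibling 1\(\times\)1 convs that output, for each of k anchors at every position, 2 objectness scores and 4 regression offsets. The channel widths and k = 9 follow common settings and should be read as assumptions.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """3x3 conv over the shared feature map + two sibling 1x1 convs."""
    def __init__(self, in_ch=512, mid_ch=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, mid_ch, 3, padding=1)   # sliding n x n window (n = 3)
        self.cls = nn.Conv2d(mid_ch, num_anchors * 2, 1)     # box classification (object / not)
        self.reg = nn.Conv2d(mid_ch, num_anchors * 4, 1)     # box regression offsets

    def forward(self, feature_map):
        h = torch.relu(self.conv(feature_map))
        return self.cls(h), self.reg(h)

scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))
print(scores.shape, deltas.shape)  # torch.Size([1, 18, 38, 50]) torch.Size([1, 36, 38, 50])
```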