
EBU7240 Computer Vision
Changjae Oh
Multi-layer Perceptron (MLP)
Semester 1, 2021


Neural networks
(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W_2 max(0, W_1 x)
3-layer Neural Network: f = W_3 max(0, W_2 max(0, W_1 x))
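For concreteness, a minimal NumPy sketch of these score functions (a sketch only: the layer sizes are illustrative assumptions, not from the slides):

```python
import numpy as np

# 2-layer score function: f = W2 max(0, W1 x).
# Sizes are illustrative: a flattened 32x32x3 image (3072), 100 hidden units, 10 classes.
x = np.random.randn(3072)
W1 = 0.01 * np.random.randn(100, 3072)
W2 = 0.01 * np.random.randn(10, 100)

h = np.maximum(0, W1 @ x)   # element-wise max(0, .) non-linearity
f = W2 @ h                  # 10 class scores

# 3-layer version: f = W3 max(0, W2 max(0, W1 x))
W3 = 0.01 * np.random.randn(10, 10)
f3 = W3 @ np.maximum(0, W2 @ np.maximum(0, W1 @ x))
```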

Activation functions
• Adding non-linearities into neural networks, allowing the neural networks to learn powerful operations.
• A crucial component of deep learning
̶ If the activation functions were removed from a feedforward neural network, the entire network could be re-factored to a simple linear operation or matrix transformation on its input.
̶ It would no longer be capable of performing complex tasks such as image recognition.

Activation functions
• Sigmoid: σ(x) = 1 / (1 + e^(−x))
• tanh: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
• ReLU: max(0, x)
• Leaky ReLU: max(0.1x, x)
• Maxout: max(w_1^T x + b_1, w_2^T x + b_2)
• ELU: f(x) = x if x ≥ 0, α(e^x − 1) if x < 0

Neural networks: Architectures
• “2-layer Neural Net”, or “1-hidden-layer Neural Net”
• “3-layer Neural Net”, or “2-hidden-layer Neural Net”
• The layers are “fully-connected”: each neuron is connected to all outputs of the previous layer.

Derivative of Neural Net using Chain Rules
• Examples
̶ 1-layer Neural Net (L2 regression loss)
̶ 2-layer Neural Net (L2 regression loss)
̶ 1-layer Neural Net (Softmax classifier)
̶ 2-layer Neural Net (Softmax classifier)

1-layer Neural Net (L2 regression loss)
Forward pass, with input x (d×1), scores s (n×1), prediction p (n×1), and ground truth z (n×1):
1. Linear score: s = Wx + b, i.e. s_j = w_j^T x + b_j = Σ_k w_jk x_k + b_j,
   where W (n×d) stacks the row vectors w_j^T = (w_j1, w_j2, …, w_jd) and b = (b_1, …, b_n)^T.
2. Activation function (sigmoid): p = σ(s) = 1 / (1 + e^(−s)), applied element-wise: p_j = 1 / (1 + e^(−s_j)).
3. Output layer (L2 regression loss): L = (z − p)^2 = Σ_j (z_j − p_j)^2.

We need to compute gradients of W, b, s, p with respect to the loss function L:
• ∂L/∂p = −2(z − p)
• ∂L/∂s = (∂p/∂s)(∂L/∂p) = diag((1 − σ(s_j))σ(s_j)) ∂L/∂p = (1 − σ(s)) ⊗ σ(s) ⊗ ∂L/∂p,
  where ⊗ is element-wise multiplication; the jth element is −2(1 − σ(s_j))σ(s_j)(z_j − p_j).
• ∂L/∂W = [∂L/∂w_1 ∂L/∂w_2 … ∂L/∂w_n]^T = (∂L/∂s) x^T, since ∂s/∂w_j = [0 0 x … 0] (x in the jth column).
• ∂L/∂b = (∂s/∂b)(∂L/∂s) = ∂L/∂s.
Note that ∂L/∂x = (∂s/∂x)(∂L/∂s) = W^T (∂L/∂s) can also be computed, but x is input data that is fixed during training, so its gradient is not needed.

2-layer Neural Net (L2 regression loss)
Forward pass, with x (d×1), hidden layer s_1, p_1 (n×1), output layer s_2, p_2 (m×1), and ground truth z (m×1):
s_1 = W_1 x + b_1, p_1 = σ(s_1), s_2 = W_2 p_1 + b_2, p_2 = σ(s_2), L = (z − p_2)^2.
Backward pass (chain rule, from the output back to the input):
• ∂L/∂p_2 = −2(z − p_2)
• ∂L/∂s_2 = diag((1 − σ(s_2,j))σ(s_2,j)) ∂L/∂p_2
• ∂L/∂W_2 = (∂L/∂s_2) p_1^T, ∂L/∂b_2 = ∂L/∂s_2
• ∂L/∂p_1 = W_2^T (∂L/∂s_2)
• ∂L/∂s_1 = diag((1 − σ(s_1,j))σ(s_1,j)) ∂L/∂p_1
• ∂L/∂W_1 = (∂L/∂s_1) x^T, ∂L/∂b_1 = ∂L/∂s_1

1-layer Neural Net (Softmax classifier)
Forward pass: s = Wx + b, p = softmax(s), i.e. p_j = e^(s_j) / Σ_k e^(s_k), and the log-likelihood loss L = −log p_y,
where y is the ground-truth class and z (n×1) is its one-hot vector (1 in the yth row, 0 elsewhere).
Backward pass:
• ∂L/∂p = (0, …, −1/p_y, …, 0)^T (−1/p_y in the yth row)
• ∂L/∂s = (∂p/∂s)(∂L/∂p) = D (∂L/∂p) = p − z, where D_ab = p_a(δ_ab − p_b) and δ_ab = 1 if a = b, 0 otherwise
• ∂L/∂W = (∂L/∂s) x^T, ∂L/∂b = ∂L/∂s
Note that ∂L/∂x = W^T (∂L/∂s) can also be computed here, but x is fixed during training, so it is not needed.
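Pulling the L2-regression derivations together, here is a minimal NumPy sketch of the forward and backward passes with a gradient-descent update (a sketch only: the sizes d, n, m, the learning rate, and the random data are illustrative assumptions, not from the slides):

```python
import numpy as np

d, n, m = 8, 16, 4                       # input, hidden, output sizes (assumed)
x = np.random.randn(d, 1)                # fixed input (d x 1)
z = np.random.rand(m, 1)                 # ground truth (m x 1)
W1, b1 = 0.1 * np.random.randn(n, d), np.zeros((n, 1))
W2, b2 = 0.1 * np.random.randn(m, n), np.zeros((m, 1))

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

lr = 0.5                                 # learning rate (assumed)
for step in range(100):
    # Forward pass
    s1 = W1 @ x + b1; p1 = sigmoid(s1)   # s1 = W1 x + b1, p1 = sigma(s1)
    s2 = W2 @ p1 + b2; p2 = sigmoid(s2)  # s2 = W2 p1 + b2, p2 = sigma(s2)
    L = np.sum((z - p2) ** 2)            # L2 regression loss

    # Backward pass (chain rule, exactly as derived above)
    dp2 = -2.0 * (z - p2)                # dL/dp2
    ds2 = (1 - p2) * p2 * dp2            # dL/ds2 = (1 - sigma) (x) sigma (x) dL/dp2
    dW2, db2 = ds2 @ p1.T, ds2           # dL/dW2 = ds2 p1^T, dL/db2 = ds2
    dp1 = W2.T @ ds2                     # dL/dp1 = W2^T ds2
    ds1 = (1 - p1) * p1 * dp1            # dL/ds1
    dW1, db1 = ds1 @ x.T, ds1            # dL/dW1 = ds1 x^T, dL/db1 = ds1

    # Gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```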
2-layer Neural Net (Softmax classifier)
Forward pass: s_1 = W_1 x + b_1, p_1 = σ(s_1) (n×1); s_2 = W_2 p_1 + b_2, p_2 = softmax(s_2) (m×1);
log-likelihood loss L = −log p_2,y, with one-hot ground truth z (m×1, 1 in the yth row).
Backward pass:
• ∂L/∂p_2 = (0, …, −1/p_2,y, …, 0)^T
• ∂L/∂s_2 = D (∂L/∂p_2) = p_2 − z, where D_ab = p_a(δ_ab − p_b) and δ_ab = 1 if a = b, 0 otherwise
• ∂L/∂W_2 = (∂L/∂s_2) p_1^T, ∂L/∂b_2 = ∂L/∂s_2
• ∂L/∂p_1 = W_2^T (∂L/∂s_2)
• ∂L/∂s_1 = diag((1 − σ(s_1,j))σ(s_1,j)) ∂L/∂p_1
• ∂L/∂W_1 = (∂L/∂s_1) x^T, ∂L/∂b_1 = ∂L/∂s_1

Full implementation of training a 2-layer Neural Network (the NumPy sketch above shows the L2-regression variant), with:
N: batch size
D_in: input feature size
H: input feature size of the second layer
D_out: output feature size
and ground truth y.

Neural networks: Pros and cons
• Pros
̶ Flexible and general function approximation framework
̶ Can build extremely powerful models by adding more layers
• Cons
̶ Hard to analyze theoretically (e.g., training is prone to local optima)
̶ A huge amount of training data and computing power may be required to get good performance
̶ The space of implementation choices is huge (network architectures, parameters)
• We arrange neurons into fully-connected layers
• This layer structure allows us to use efficient vectorized code (e.g. matrix multiplication)
• Training is done using back-propagation

EBU7240 Computer Vision
Convolutional Neural Networks
Semester 1, 2021
Changjae Oh

CNN Introduction
• Image Recognition
̶ Recognizing the object class in the image
• A ConvNet is a sequence of layers
̶ Every layer of a ConvNet transforms one volume of activations to another through a differentiable function
̶ Convolutional Layer: computes the output of neurons that are connected to local regions in the input
̶ ReLU (nonlinear) layer: activates relevant responses
̶ Pooling Layer: performs a downsampling operation along the spatial dimensions
̶ Fully-Connected Layer: each neuron in this layer is connected to all the numbers in the previous volume

Typical architectures of ConvNets:
[(CONV-RELU)*N - POOL]*M - (FC-RELU)*K - SOFTMAX
where N is usually up to ~5, M is large, and 0 <= K <= 2,
but some advances such as ResNet/GoogLeNet challenge this paradigm.

Convolutional Layer
To preserve spatial structure, work on the original 2D image, e.g. a 32x32x3 image with a 5x5x3 filter:
• Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”
• Filters always extend the full depth of the input volume
• Each response is the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product + bias)
• Convolving (sliding) over all spatial locations produces an activation map
• A second 5x5x3 filter, convolved over all spatial locations, produces a second activation map
• If we had six 5x5x3 filters, we’ll get six separate activation maps
We call the layer convolutional because it is related to convolution of two signals: elementwise multiplication and sum of a filter and the signal (image).

Convolutional Layer
• The number of parameters in a convolutional layer
̶ Input: H_1 × W_1 × C_1; weights: C_2 filters of F_h × F_v × C_1; output: H_1 × W_1 × C_2
̶ The number of weights: C_2 × (F_h × F_v × C_1)
̶ The number of biases: C_2
• A ConvNet is a sequence of Convolutional Layers, interspersed with activation functions, e.g.:
32x32x3 -> CONV + ReLU (6 filters of 5x5x3) -> 28x28x6 -> CONV + ReLU (10 filters of 5x5x6) -> 24x24x10 -> …
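As a small check of the parameter-count formula above, a sketch (the helper name conv_params is my own):

```python
# Number of parameters in a conv layer: C2 x (Fh x Fv x C1) weights + C2 biases.
def conv_params(C1, Fh, Fv, C2):
    return C2 * (Fh * Fv * C1) + C2

print(conv_params(C1=3, Fh=5, Fv=5, C2=6))    # first CONV above:  6*(5*5*3) + 6  = 456
print(conv_params(C1=6, Fh=5, Fv=5, C2=10))   # second CONV above: 10*(5*5*6) + 10 = 1510
```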
Receptive field
• The region of the input space that affects a particular unit of the network
• From the convolution property: y = x ∗ h_1, z = y ∗ h_2 → z = x ∗ (h_1 ∗ h_2) = x ∗ h, with h = h_1 ∗ h_2
• Stacking two 5x5x1 filters gives an effective receptive field of 9 x 9 (= 5 + 5 − 1)

Convolutional Filter Size
Three 3x3 Conv layers (3x3xC each) vs. a single 7x7 Conv layer (7x7xC), assuming zero-padding is applied to preserve the spatial resolution:
• The receptive fields are equal: for three 3x3 Conv layers, 3 + 3 − 1 + 3 − 1 = 7.
• # of Conv parameters: 3 × C × (3 × 3 × C) = 27C^2 vs. C × (7 × 7 × C) = 49C^2
• The stack of three CONV layers produces more expressive activation maps

Spatial Dimension at Convolution Layer
7x7 input (spatially)
assume 3x3 filter applied with stride 1
-> 5x5 output
assume 3×3 filter applied with stride 2 -> 3×3 output
assume 3×3 filter applied with stride 3
-> doesn’t fit!
cannot apply 3×3 filter on 7×7 input with stride 3.
*Stride: the number of pixels by which the filter shifts over the input matrix at each step.

Spatial Dimension at Convolution Layer
Output size:
(N – F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 − 3)/1 + 1 = 5
stride 2 => (7 − 3)/2 + 1 = 3
stride 3 => (7 − 3)/3 + 1 = 2.33 (doesn’t fit!)
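A quick helper implementing this formula (a sketch; the name out_size is my own):

```python
def out_size(N, F, stride):
    """Spatial output size of a conv layer: (N - F) / stride + 1."""
    if (N - F) % stride != 0:
        raise ValueError(f"(N - F)/stride + 1 = {(N - F) / stride + 1}: doesn't fit")
    return (N - F) // stride + 1

print(out_size(7, 3, 1))   # 5
print(out_size(7, 3, 2))   # 3
try:
    out_size(7, 3, 3)
except ValueError as e:
    print(e)               # 2.33...: doesn't fit
```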

Spatial Dimension at Convolution Layer
In practice: Common to zero pad the border
e.g. input 7×7
3×3 filter applied with stride 1, pad with 1 pixel border
-> 7×7 output!
In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F−1)/2, which preserves the size spatially.
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3

Spatial Dimension at Convolution Layer
Input volume: 32x32x3
10 5x5x3 filters with stride 1, pad 2
Output volume size: (32+2*2-5)/1+1 = 32 spatially, so 32x32x10
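Extending the earlier out_size sketch with zero padding P (names my own): output size = (N + 2P − F)/stride + 1.

```python
def out_size_padded(N, F, stride, P):
    return (N + 2 * P - F) // stride + 1

print(out_size_padded(N=32, F=5, stride=1, P=2))  # 32: size is preserved spatially
print(out_size_padded(N=7, F=3, stride=1, P=1))   # 7: the earlier 7x7, pad-1 example
```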

Spatial Dimension at Convolution Layer
Input volume: 32x32x3
10 5x5x3 filters with stride 1, pad 2
Number of parameters in this layer?
each filter has 5*5*3 + 1 = 76 params (+1 for bias) -> 76*10 = 760


1×1 Convolution
1×1 convolution layers make perfect sense: e.g., on a 56x56x64 input, a 1x1 CONV with 32 filters
(each filter has size 1x1x64, and performs a 64-dimensional dot product at every position)
produces a 56x56x32 output.

Convolutional Layer
Implementation and Backpropagation

Implementation as Matrix Multiplication
• Convolution: dot products between the filters and local regions of the input
• Conv layer: the forward pass of a convolutional layer as one big matrix multiply
• Example of feed-forward process
1. Convert the input into X_col by taking a block of 11x11x3 (=363) pixels from the input at each of the 55x55 (=3025) output locations.
   X_col: [363x3025]
2. Reshape the conv filters into W_row: [96x363].
   Reshape the conv bias (a 96x1 vector) into b_col: [96x3025] by stacking it 3025 times.
3. Perform the matrix multiplication O = W_row * X_col + b_col.
4. Reshape O: [96x3025] into [55x55x96].
Input: [227x227x3]
Conv filter: 96 filters of [11x11x3]
Conv bias: 96x1 vector
Stride: 4
Padding: 0
Output: (227-11)/4+1 = 55 -> [55x55x96]
O = W_row * X_col + b_col
[96x3025] = [96x363] * [363x3025] + [96x3025]
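A minimal NumPy sketch of these four steps (function and variable names are my own; no padding, matching the AlexNet CONV1 setting above):

```python
import numpy as np

def conv_forward_im2col(x, W, b, stride):
    # x: (H, W_in, C) input; W: (K, F, F, C) filters; b: (K,) biases
    H, Wi, C = x.shape
    K, F, _, _ = W.shape
    out = (H - F) // stride + 1
    # 1. Convert the input into X_col: one (F*F*C)-pixel block per output location.
    cols = []
    for i in range(out):
        for j in range(out):
            patch = x[i*stride:i*stride+F, j*stride:j*stride+F, :]
            cols.append(patch.reshape(-1))
    X_col = np.stack(cols, axis=1)      # [363 x 3025] for AlexNet CONV1
    # 2. Reshape the filters into rows: W_row [96 x 363].
    W_row = W.reshape(K, -1)
    # 3. One big matrix multiply; the bias broadcasts over the 3025 columns.
    O = W_row @ X_col + b[:, None]      # [96 x 3025]
    # 4. Reshape O into the output volume [55 x 55 x 96].
    return O.reshape(K, out, out).transpose(1, 2, 0)

x = np.random.randn(227, 227, 3)
W = np.random.randn(96, 11, 11, 3)
b = np.random.randn(96)
print(conv_forward_im2col(x, W, b, stride=4).shape)  # (55, 55, 96)
```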

Backpropagation of Convolution Layer
Input: [227x227x3]
Conv filter: 96 filters of [11x11x3]
Stride: 4
Padding: 0
Output: (227-11)/4+1 = 55 -> [55x55x96]
The convolution layer shares W for all neurons of the current activation map. For each neuron (input patch) p:
∂L/∂x_p = W^T (∂L/∂s_p)
∂L/∂W = Σ_p (∂L/∂s_p) x_p^T
∂L/∂b = Σ_p (∂L/∂s_p)
x_p: (11x11x3)x1 = 363x1, p = 1, …, 3025 (= 55x55)
Backpropagation of s = Wx + b:
∂L/∂x = (∂s/∂x)(∂L/∂s) = W^T (∂L/∂s)
∂L/∂W = (∂L/∂s) x^T
∂L/∂b = ∂L/∂s

Backpropagation of Convolution Layer
Input: [227x227x3]
Conv filter: 96 filters of [11x11x3]
Stride: 4
Padding: 0
Output: (227-11)/4+1 = 55 -> [55x55x96]
The convolution layer shares W for all neurons of the current activation map. For all neurons at once:
∂L/∂x = W^T (∂L/∂s)
∂L/∂W = (∂L/∂s) x^T
∂L/∂b = (∂L/∂s) 1^T, where 1 is a 1x3025 vector of ones
x: (11x11x3)x3025 = 363x3025, s: 96x3025

Backpropagation of Convolution Layer
For ∂L/∂x = W^T (∂L/∂s):
1. Perform ∂L/∂x = W^T (∂L/∂s).
2. Reshape ∂L/∂x (363x3025) into 3025 gradients of size 11x11x3.
3. Overlay the reshaped gradients onto a 3D matrix [227x227x3], accumulating the gradients where patches overlap.
For ∂L/∂W = (∂L/∂s) x^T and ∂L/∂b = (∂L/∂s) 1^T:
1. Convert the input into x by taking a block of 11x11x3 (=363) pixels from the input at each of the 55x55 (=3025) locations: x: [363x3025].
2. Perform ∂L/∂W = (∂L/∂s) x^T and ∂L/∂b = (∂L/∂s) 1^T.
Shapes: x: (11x11x3)x3025 = 363x3025, s: 96x3025, W: 96x363.
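A matching NumPy sketch of the backward pass (names my own; dS stands for ∂L/∂s arriving from the next layer, X_col and W come from the forward pass):

```python
import numpy as np

def conv_backward_im2col(dS, X_col, W, x_shape, F, stride):
    # dS: [96 x 3025], X_col: [363 x 3025], W: [96 x 363]
    dW = dS @ X_col.T                    # dL/dW = (dL/ds) x^T  -> [96 x 363]
    db = dS @ np.ones(dS.shape[1])       # dL/db = (dL/ds) 1^T  -> (96,)
    dX_col = W.T @ dS                    # dL/dx = W^T (dL/ds)  -> [363 x 3025]
    # Overlay the 3025 patch gradients back into [227 x 227 x 3],
    # accumulating where patches overlap (col2im).
    H, Wi, C = x_shape
    out = (H - F) // stride + 1
    dx = np.zeros(x_shape)
    for p in range(dX_col.shape[1]):
        i, j = divmod(p, out)
        dx[i*stride:i*stride+F, j*stride:j*stride+F, :] += \
            dX_col[:, p].reshape(F, F, C)
    return dW, db, dx
```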

Fully Connected Layer

Fully Connected Layer
32x32x3 image -> stretch to 3072 x 1
10 x 3072 weight matrix W
activation: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product)
Each neuron looks at the full input volume.
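A minimal NumPy sketch of this layer (sizes from the slide; names my own):

```python
import numpy as np

# Fully connected layer: stretch 32x32x3 to a 3072-vector, then each of the
# 10 rows of W takes a 3072-dimensional dot product with it.
img = np.random.randn(32, 32, 3)
x = img.reshape(3072)          # stretch to 3072 x 1
W = np.random.randn(10, 3072)  # 10 x 3072 weight matrix
b = np.random.randn(10)
activation = W @ x + b         # 10 activations; each neuron sees the full input volume
```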

Pooling Layer

Pooling layer
• makes the representations smaller and more manageable
• operates over each activation map independently
Pooling layer
̶ Max pooling
̶ Average pooling (rarely used)
̶ L2-norm pooling (rarely used)

MAX POOLING
Single depth slice
max pool with 2×2 filters and stride 2
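A minimal NumPy sketch of 2x2 max pooling with stride 2 on a single depth slice (the reshape trick assumes even spatial dimensions; the example input is illustrative):

```python
import numpy as np

def max_pool_2x2(a):
    # Group the slice into non-overlapping 2x2 blocks, then take each block's max.
    H, W = a.shape
    return a.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

a = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool_2x2(a))  # [[6 8]
                        #  [3 4]]
```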

EBU7240 Computer Vision
CNN Architectures
Semester 1, 2021
Changjae Oh

CNN Architectures
• Case Studies
̶ AlexNet
̶ VGGNet
̶ GoogLeNet
̶ ResNet

Case Study: AlexNet
[Krizhevsky et al. 2012]
Architecture: CONV1 - MAX POOL1 - NORM1 - CONV2 - MAX POOL2 - NORM2 - CONV3 - CONV4 - CONV5 - MAX POOL3 - FC6 - FC7 - FC8

Case Study: AlexNet
[Krizhevsky et al. 2012]
Input: 227x227x3 images
First layer (CONV1): 96 11×11 filters applied at stride 4 =>
The output volume size: (227-11)/4+1 = 55
Output volume [55x55x96]
Total number of parameters in this layer
Parameters: (11*11*3)*96 = 35K weights, plus 96 biases

Case Study: AlexNet
[Krizhevsky et al. 2012]
Input: 227x227x3 images
After CONV1: 55x55x96
Second layer (POOL1): 3x3 filters applied at stride 2
The output volume size: (55-3)/2+1 = 27
Output volume: 27x27x96
The number of parameters in this layer
Parameters: 0!

Case Study: AlexNet
[Krizhevsky et al. 2012]
Input: 227x227x3 images
After CONV1: 55x55x96
After POOL1: 27x27x96
…

Case Study: AlexNet
[Krizhevsky et al. 2012]
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
Details/Retrospectives:
– first use of ReLU
– used Norm layers (not common anymore)
– heavy data augmentation
– dropout 0.5
– batch size 128
– SGD Momentum 0.9
– Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
– L2 weight decay 5e-4
– 7 CNN ensemble: 18.2% -> 15.4%

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
• AlexNet (2012): first CNN-based winner
• ZFNet (2013): improved hyperparameters over AlexNet

Case Study: VGGNet
[Simonyan and Zisserman, 2014]
Small filters, Deeper networks
Only 3×3 CONV stride 1, pad 1 and 2×2 MAX POOL stride 2
8 layers (AlexNet)
-> 16 – 19 layers (VGGNet)
11.7% top 5 error in ILSVRC’13 (ZFNet)
-> 7.3% top 5 error in ILSVRC’14

Case Study: VGGNet
[Simonyan and Zisserman, 2014]
Q: Why use smaller filters? (3×3 conv)
A stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer,
but is deeper, with more non-linearity,
and has fewer parameters: 3 × (3^2 C^2) = 27C^2 vs. 7^2 C^2 = 49C^2 for C channels per layer.
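A quick check of this parameter comparison (C = 256 is an illustrative choice):

```python
# Parameter comparison for C channels per layer.
C = 256
three_3x3 = 3 * (3 * 3 * C * C)  # 27 C^2 = 1,769,472
one_7x7 = 7 * 7 * C * C          # 49 C^2 = 3,211,264
print(three_3x3, one_7x7)
```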

(not counting biases)
INPUT: [224x224x3] memory: 224*224*3=150K params: 0
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000
TOTAL memory: 15.2M * 4 bytes ~= 61MB / image (for a forward pass)
TOTAL params: 138M parameters

