EBU7240 Computer Vision
Changjae Oh
Multi-layer Perceptron (MLP)
Semester 1, 2021
Neural networks
(Before) Linear score function: $f = \mathbf{W}\mathbf{x}$
(Now) 2-layer Neural Network: $f = \mathbf{W}_2 \max(0, \mathbf{W}_1\mathbf{x})$
3-layer Neural Network: $f = \mathbf{W}_3 \max(0, \mathbf{W}_2 \max(0, \mathbf{W}_1\mathbf{x}))$
Activation functions
• Adding non-linearities into neural networks, allowing them to learn powerful operations
• A crucial component of deep learning
̶ If the activation functions were removed from a feedforward neural network, the entire network could be re-factored into a single linear operation (a matrix transformation) on its input
̶ It would no longer be capable of performing complex tasks such as image recognition
Activation functions
Sigmoid: $\sigma(x) = \frac{1}{1+e^{-x}}$
tanh: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
ReLU: $\max(0, x)$
Leaky ReLU: $\max(0.1x, x)$
Maxout: $\max(w_1^T x + b_1,\ w_2^T x + b_2)$
ELU: $f(x) = \begin{cases} x & x \ge 0 \\ \alpha(e^x - 1) & x < 0 \end{cases}$
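A minimal NumPy sketch of these activations for reference (the function names and the 0.1/1.0 defaults are illustrative assumptions, not from the slides):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                      # squashes to (-1, 1)

def relu(x):
    return np.maximum(0.0, x)              # max(0, x)

def leaky_relu(x, slope=0.1):
    return np.maximum(slope * x, x)        # max(0.1x, x)

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))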
Neural networks: Architectures
“2-layer Neural Net” or “1-hidden-layer Neural Net”; “3-layer Neural Net” or “2-hidden-layer Neural Net”
“Fully-connected” layers
Derivative of Neural Net using Chain Rule
• Examples:
̶ 1-layer Neural Net (L2 regression loss), 2-layer Neural Net (L2 regression loss)
̶ 1-layer Neural Net (Softmax classifier), 2-layer Neural Net (Softmax classifier)
1-layer Neural Net (L2 regression loss)

Linear score: $\mathbf{s} = \mathbf{W}\mathbf{x} + \mathbf{b}$, i.e. $s_j = \mathbf{w}_j^T\mathbf{x} + b_j$
Activation function: $\mathbf{p} = \sigma(\mathbf{s}) = \frac{1}{1+e^{-\mathbf{s}}}$
Output layer: $L = (\mathbf{z} - \mathbf{p})^2$

where $\mathbf{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}$, $\mathbf{W} = \begin{bmatrix} \mathbf{w}_1^T \\ \mathbf{w}_2^T \\ \vdots \\ \mathbf{w}_n^T \end{bmatrix} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1d} \\ w_{21} & w_{22} & \cdots & w_{2d} \\ \vdots & & & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nd} \end{bmatrix}$, $\mathbf{b} = \begin{bmatrix} b_1 \\ \vdots \\ b_n \end{bmatrix}$
1-layer Neural Net (L2 regression loss)

Element-wise view, with ground truth $\mathbf{z}$:

1. Linear score: $s_j = \mathbf{w}_j^T\mathbf{x} + b_j = \sum_k w_{jk}x_k + b_j$
2. Activation function: $p_j = \sigma(s_j) = \frac{1}{1+e^{-s_j}}$
3. Output layer: $L = (\mathbf{z} - \mathbf{p})^2$, where the $j$th output unit contributes $(z_j - p_j)^2$

For example, $s_1 = \mathbf{w}_1^T\mathbf{x} + b_1$ and $p_1 = \frac{1}{1+e^{-s_1}}$ contribute $(z_1 - p_1)^2$; $s_n = \mathbf{w}_n^T\mathbf{x} + b_n$ and $p_n = \frac{1}{1+e^{-s_n}}$ contribute $(z_n - p_n)^2$.
1-layer Neural Net (L2 regression loss)

In a vector form ($\mathbf{x}$: $d\times1$; $\mathbf{s}, \mathbf{p}$: $n\times1$; ground truth $\mathbf{z}$: $n\times1$):

Linear score: $\mathbf{s} = \mathbf{W}\mathbf{x} + \mathbf{b}$, i.e. $s_j = \mathbf{w}_j^T\mathbf{x} + b_j$
Activation function: $\mathbf{p} = \sigma(\mathbf{s}) = \frac{1}{1+e^{-\mathbf{s}}}$
Output layer: $L = (\mathbf{z} - \mathbf{p})^2$

We need to compute gradients of $\mathbf{W}$, $\mathbf{b}$, $\mathbf{s}$, $\mathbf{p}$ with respect to the loss function $L$:

$\frac{\partial L}{\partial \mathbf{p}} = -2(\mathbf{z} - \mathbf{p})$

$\frac{\partial L}{\partial \mathbf{s}} = \frac{\partial \mathbf{p}}{\partial \mathbf{s}}\frac{\partial L}{\partial \mathbf{p}} = \mathrm{diag}\big((1-\sigma(s_j))\sigma(s_j)\big)\frac{\partial L}{\partial \mathbf{p}}$
1-layer Neural Net (L2 regression loss)

In a vector form (ground truth $\mathbf{z}$: $n\times1$), with $\otimes$ denoting element-wise multiplication and $(\mathbf{a})_j$ the $j$th element of vector $\mathbf{a}$:

$\frac{\partial L}{\partial \mathbf{s}} = (1-\sigma(\mathbf{s}))\otimes\sigma(\mathbf{s})\otimes\frac{\partial L}{\partial \mathbf{p}} = -2\begin{bmatrix} (1-\sigma(s_1))\sigma(s_1)(z_1-p_1) \\ (1-\sigma(s_2))\sigma(s_2)(z_2-p_2) \\ \vdots \\ (1-\sigma(s_n))\sigma(s_n)(z_n-p_n) \end{bmatrix}$

$\frac{\partial L}{\partial \mathbf{w}_j} = \frac{\partial \mathbf{s}}{\partial \mathbf{w}_j}\frac{\partial L}{\partial \mathbf{s}} = [\mathbf{0}\ \cdots\ \mathbf{x}\ \cdots\ \mathbf{0}]\,\frac{\partial L}{\partial \mathbf{s}} = \left(\frac{\partial L}{\partial \mathbf{s}}\right)_j \mathbf{x}$  ($\mathbf{x}$ in the $j$th column)

$\frac{\partial L}{\partial \mathbf{W}} = \begin{bmatrix} \frac{\partial L}{\partial \mathbf{w}_1} & \frac{\partial L}{\partial \mathbf{w}_2} & \cdots & \frac{\partial L}{\partial \mathbf{w}_n} \end{bmatrix}^T = \frac{\partial L}{\partial \mathbf{s}}\,\mathbf{x}^T$

$\frac{\partial L}{\partial \mathbf{b}} = \frac{\partial \mathbf{s}}{\partial \mathbf{b}}\frac{\partial L}{\partial \mathbf{s}} = \frac{\partial L}{\partial \mathbf{s}}$
1-layer Neural Net (L2 regression loss)

In a vector form ($\mathbf{x}$: $d\times1$; $\mathbf{s}, \mathbf{p}$: $n\times1$; ground truth $\mathbf{z}$: $n\times1$):

$\frac{\partial L}{\partial \mathbf{p}} = -2(\mathbf{z} - \mathbf{p})$

$\frac{\partial L}{\partial \mathbf{s}} = \frac{\partial \mathbf{p}}{\partial \mathbf{s}}\frac{\partial L}{\partial \mathbf{p}} = (1-\sigma(\mathbf{s}))\otimes\sigma(\mathbf{s})\otimes\frac{\partial L}{\partial \mathbf{p}}$

$\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{s}}\mathbf{x}^T$,  $\frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{s}}$

Note that the following derivative can also be computed, but here $\mathbf{x}$ is input data that is fixed during training, so it is not necessary to compute its derivative:

$\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial \mathbf{s}}{\partial \mathbf{x}}\frac{\partial L}{\partial \mathbf{s}} = \mathbf{W}^T\frac{\partial L}{\partial \mathbf{s}}$
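These four gradients translate directly into code. A minimal NumPy sketch under the shapes above (x is d×1, W is n×d, b, s, p, z are n×1; the function name is an illustrative assumption):

import numpy as np

def forward_backward(W, b, x, z):
    s = W @ x + b                     # linear score
    p = 1.0 / (1.0 + np.exp(-s))      # sigmoid activation
    L = np.sum((z - p) ** 2)          # L2 loss
    dL_dp = -2.0 * (z - p)            # dL/dp
    dL_ds = (1.0 - p) * p * dL_dp     # (1 - sigma(s)) ⊗ sigma(s) ⊗ dL/dp
    dL_dW = dL_ds @ x.T               # dL/dW = (dL/ds) x^T
    dL_db = dL_ds                     # dL/db = dL/ds
    return L, dL_dW, dL_db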
2-layer Neural Net (L2 regression loss)

In a vector form (ground truth $\mathbf{z}$: $m\times1$), with $\mathbf{s}_1 = \mathbf{W}_1\mathbf{x} + \mathbf{b}_1$, $\mathbf{p}_1 = \sigma(\mathbf{s}_1)$, $\mathbf{s}_2 = \mathbf{W}_2\mathbf{p}_1 + \mathbf{b}_2$, $\mathbf{p}_2 = \sigma(\mathbf{s}_2)$, and $L = (\mathbf{z} - \mathbf{p}_2)^2$:

$\frac{\partial L}{\partial \mathbf{p}_2} = -2(\mathbf{z} - \mathbf{p}_2)$

$\frac{\partial L}{\partial \mathbf{s}_2} = \frac{\partial \mathbf{p}_2}{\partial \mathbf{s}_2}\frac{\partial L}{\partial \mathbf{p}_2} = \mathrm{diag}\big((1-\sigma(s_{2,j}))\sigma(s_{2,j})\big)\frac{\partial L}{\partial \mathbf{p}_2}$

$\frac{\partial L}{\partial \mathbf{W}_2} = \frac{\partial L}{\partial \mathbf{s}_2}\mathbf{p}_1^T$,  $\frac{\partial L}{\partial \mathbf{b}_2} = \frac{\partial L}{\partial \mathbf{s}_2}$

$\frac{\partial L}{\partial \mathbf{p}_1} = \frac{\partial \mathbf{s}_2}{\partial \mathbf{p}_1}\frac{\partial L}{\partial \mathbf{s}_2} = \mathbf{W}_2^T\frac{\partial L}{\partial \mathbf{s}_2}$

$\frac{\partial L}{\partial \mathbf{s}_1} = \frac{\partial \mathbf{p}_1}{\partial \mathbf{s}_1}\frac{\partial L}{\partial \mathbf{p}_1} = \mathrm{diag}\big((1-\sigma(s_{1,j}))\sigma(s_{1,j})\big)\frac{\partial L}{\partial \mathbf{p}_1}$

$\frac{\partial L}{\partial \mathbf{W}_1} = \frac{\partial L}{\partial \mathbf{s}_1}\mathbf{x}^T$,  $\frac{\partial L}{\partial \mathbf{b}_1} = \frac{\partial L}{\partial \mathbf{s}_1}$
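The same pattern repeats per layer; the gradient reaches the first layer through dL/dp1. A minimal NumPy sketch under the shapes above (function names are illustrative):

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def two_layer_backward(W1, b1, W2, b2, x, z):
    s1 = W1 @ x + b1                  # layer 1 linear score
    p1 = sigmoid(s1)
    s2 = W2 @ p1 + b2                 # layer 2 linear score
    p2 = sigmoid(s2)
    dL_dp2 = -2.0 * (z - p2)
    dL_ds2 = (1.0 - p2) * p2 * dL_dp2
    dL_dW2, dL_db2 = dL_ds2 @ p1.T, dL_ds2
    dL_dp1 = W2.T @ dL_ds2            # backprop through W2
    dL_ds1 = (1.0 - p1) * p1 * dL_dp1
    dL_dW1, dL_db1 = dL_ds1 @ x.T, dL_ds1
    return dL_dW1, dL_db1, dL_dW2, dL_db2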
1-layer Neural Net (Softmax classifier)

In a vector form ($\mathbf{x}$: $d\times1$; $\mathbf{s}, \mathbf{p}$: $n\times1$; ground truth $\mathbf{z}$: $n\times1$, one-hot with a 1 in the $y$th row), with log-likelihood loss $L = -\log p_y$:

$\frac{\partial L}{\partial \mathbf{p}} = \begin{bmatrix} 0 & \cdots & -1/p_y & \cdots & 0 \end{bmatrix}^T$  ($-1/p_y$ in the $y$th row)

$\frac{\partial L}{\partial \mathbf{s}} = \frac{\partial \mathbf{p}}{\partial \mathbf{s}}\frac{\partial L}{\partial \mathbf{p}} = \mathbf{D}\,\frac{\partial L}{\partial \mathbf{p}} = \mathbf{p} - \mathbf{z}$, where $D_{ab} = p_a(\delta_{ab} - p_b)$ and $\delta_{ab} = \begin{cases} 1 & a = b \\ 0 & \text{otherwise} \end{cases}$

$\frac{\partial L}{\partial \mathbf{w}_j} = \frac{\partial \mathbf{s}}{\partial \mathbf{w}_j}\frac{\partial L}{\partial \mathbf{s}} = [\mathbf{0}\ \cdots\ \mathbf{x}\ \cdots\ \mathbf{0}]\,\frac{\partial L}{\partial \mathbf{s}} = \left(\frac{\partial L}{\partial \mathbf{s}}\right)_j \mathbf{x}$  ($\mathbf{x}$ in the $j$th column; $(\mathbf{a})_j$: $j$th element of vector $\mathbf{a}$)

$\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{s}}\mathbf{x}^T$,  $\frac{\partial L}{\partial \mathbf{b}} = \frac{\partial \mathbf{s}}{\partial \mathbf{b}}\frac{\partial L}{\partial \mathbf{s}} = \frac{\partial L}{\partial \mathbf{s}}$
1-layer Neural Net (Softmax classifier)

In a vector form ($\mathbf{x}$: $d\times1$; $\mathbf{s}, \mathbf{p}$: $n\times1$; ground truth $\mathbf{z}$: $n\times1$, one-hot at the $y$th row), with log-likelihood loss $L = -\log p_y$:

$\frac{\partial L}{\partial \mathbf{s}} = \frac{\partial \mathbf{p}}{\partial \mathbf{s}}\frac{\partial L}{\partial \mathbf{p}} = \mathbf{D}\,\frac{\partial L}{\partial \mathbf{p}} = \mathbf{p} - \mathbf{z}$

$\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{s}}\mathbf{x}^T$,  $\frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{s}}$

Note that the following derivative can also be computed, but here $\mathbf{x}$ is input data that is fixed during training, so it is not necessary to compute its derivative:

$\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial \mathbf{s}}{\partial \mathbf{x}}\frac{\partial L}{\partial \mathbf{s}} = \mathbf{W}^T\frac{\partial L}{\partial \mathbf{s}}$
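In code, the p − z form makes the softmax backward pass a single subtraction. A minimal NumPy sketch (z is one-hot, n×1; subtracting the max before exponentiating is a standard numerical-stability detail, not from the slides):

import numpy as np

def softmax_forward_backward(W, b, x, z):
    s = W @ x + b
    e = np.exp(s - np.max(s))         # numerically stable softmax
    p = e / np.sum(e)
    y = int(np.argmax(z))             # index of the correct class
    L = -np.log(p[y, 0])              # log-likelihood loss
    dL_ds = p - z                     # dL/ds = p - z
    dL_dW = dL_ds @ x.T
    dL_db = dL_ds
    return L, dL_dW, dL_db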
2-layer Neural Net (Softmax classifier)

In a vector form (ground truth $\mathbf{z}$: $m\times1$, one-hot at the $y$th row; $\mathbf{s}_1, \mathbf{p}_1$: $n\times1$; $\mathbf{s}_2, \mathbf{p}_2$: $m\times1$), with log-likelihood loss $L = -\log p_{2,y}$:

$\frac{\partial L}{\partial \mathbf{s}_2} = \frac{\partial \mathbf{p}_2}{\partial \mathbf{s}_2}\frac{\partial L}{\partial \mathbf{p}_2} = \mathbf{D}\,\frac{\partial L}{\partial \mathbf{p}_2} = \mathbf{p}_2 - \mathbf{z}$, where $D_{ab} = p_{2,a}(\delta_{ab} - p_{2,b})$ and $\delta_{ab} = \begin{cases} 1 & a = b \\ 0 & \text{otherwise} \end{cases}$

$\frac{\partial L}{\partial \mathbf{W}_2} = \frac{\partial L}{\partial \mathbf{s}_2}\mathbf{p}_1^T$,  $\frac{\partial L}{\partial \mathbf{b}_2} = \frac{\partial L}{\partial \mathbf{s}_2}$

$\frac{\partial L}{\partial \mathbf{p}_1} = \frac{\partial \mathbf{s}_2}{\partial \mathbf{p}_1}\frac{\partial L}{\partial \mathbf{s}_2} = \mathbf{W}_2^T\frac{\partial L}{\partial \mathbf{s}_2}$

$\frac{\partial L}{\partial \mathbf{s}_1} = \frac{\partial \mathbf{p}_1}{\partial \mathbf{s}_1}\frac{\partial L}{\partial \mathbf{p}_1} = \mathrm{diag}\big((1-\sigma(s_{1,j}))\sigma(s_{1,j})\big)\frac{\partial L}{\partial \mathbf{p}_1}$

$\frac{\partial L}{\partial \mathbf{W}_1} = \frac{\partial L}{\partial \mathbf{s}_1}\mathbf{x}^T$,  $\frac{\partial L}{\partial \mathbf{b}_1} = \frac{\partial L}{\partial \mathbf{s}_1}$
Full implementation of training a 2-layer Neural Network

N: batch size
D_in: input feature size
H: hidden layer size (the input feature size of the second layer)
D_out: output feature size
Ground truth: 𝒚
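The slide's code listing did not survive extraction; below is a minimal NumPy sketch of the standard two-layer training loop it describes (ReLU hidden layer, L2 loss, vanilla gradient descent; the specific sizes, iteration count, and learning rate are illustrative assumptions):

import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10
x = np.random.randn(N, D_in)           # input batch
y = np.random.randn(N, D_out)          # ground truth
w1 = np.random.randn(D_in, H)          # first-layer weights
w2 = np.random.randn(H, D_out)         # second-layer weights
learning_rate = 1e-6

for t in range(500):
    h = x.dot(w1)                      # linear score, layer 1
    h_relu = np.maximum(h, 0)          # ReLU activation
    y_pred = h_relu.dot(w2)            # linear score, layer 2
    loss = np.square(y_pred - y).sum() # L2 loss

    grad_y_pred = 2.0 * (y_pred - y)   # backprop: dL/dy_pred
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu * (h > 0)     # ReLU gradient mask
    grad_w1 = x.T.dot(grad_h)

    w1 -= learning_rate * grad_w1      # gradient descent update
    w2 -= learning_rate * grad_w2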
Neural networks: Pros and cons

• Pros
̶ Flexible and general function approximation framework
̶ Can build extremely powerful models by adding more layers
• Cons
̶ Hard to analyze theoretically (e.g., training is prone to local optima)
̶ Huge amounts of training data and computing power may be required to get good performance
̶ The space of implementation choices is huge (network architectures, parameters)

• We arrange neurons into fully-connected layers
• The layer abstraction allows us to use efficient vectorized code (e.g. matrix multiplication)
• Training is done using back-propagation
EBU7240 Computer Vision
Convolutional Neural Networks
Semester 1, 2021
Changjae Oh
CNN Introduction
• Image Recognition
̶ Recognizing the object class in the image
– A ConvNet is a sequence of layers
– Every layer of a ConvNet transforms one volume of activations to another through a differentiable function.
– Convolutional Layer: computes the output of neurons that are connected to local regions in the input
– ReLU (nonlinear) layer: activates relevant responses
– Pooling Layer: performs a downsampling operation along the spatial dimensions
– Fully-Connected Layer: each neuron in this layer will be connected to all the numbers in the previous volume
Typical architectures of CNNs

[(Conv-ReLU)*N - POOL]*M - (FC-ReLU)*K - Softmax

N is usually up to ~5, M is large, 0 <= K <= 2,
but some advances such as ResNet/GoogLeNet challenge this paradigm
Convolutional Layer
Convolutional Layer
To preserve spatial structure, operate directly on the original 2D image
32x32x3 image
5x5x3 filter
Convolve the filter with the image i.e. “slide over the image spatially, computing dot products”
Convolutional Layer
To preserve spatial structure, operate directly on the original 2D image
32x32x3 image
Filters always extend the full depth of the input volume
5x5x3 filter
Convolve the filter with the image i.e. “slide over the image spatially, computing dot products”
Convolutional Layer
32x32x3 image
5x5x3 filter
the result of taking a dot product between the filter and a small 5x5x3 chunk of the image
(i.e. 5*5*3 = 75-dimensional dot product + bias)
Convolutional Layer
32x32x3 image
5x5x3 filter
Activation map
convolve (slide) over all spatial locations
Convolutional Layer
32x32x3 image
5x5x3 filter
convolve (slide) over all spatial locations
consider a second 5x5x3 (orange) filter
Convolutional Layer
If we have six 5x5x3 filters, we get six separate activation maps:
We call the layer convolutional because it is related to convolution of two signals:
elementwise multiplication and sum of a filter and the signal (image)
Convolutional Layer
• The number of parameters in a convolutional layer

Input: $H_1 \times W_1 \times C_1$    Weights: $C_2$ filters of $F_h \times F_v \times C_1$    Output: $H_2 \times W_2 \times C_2$

The number of weights: $C_2 \times (F_h \times F_v \times C_1)$
The number of biases: $C_2$
• A sequence of Convolutional Layers, interspersed with activation functions:
32x32x3 input -> CONV, ReLU (6 filters of 5x5x3) -> 28x28x6 -> CONV, ReLU (10 filters of 5x5x6) -> 24x24x10 -> CONV, ReLU -> .....
Receptive field
The region of the input space that affects a particular unit of the network
From the convolution property,
$y = x * h_1$, $z = y * h_2$ $\Rightarrow$ $z = (x * h_1) * h_2 = x * h$, where $h = h_1 * h_2$
e.g. stacking two 5×5×1 filters gives an effective receptive field of 9×9 (= 5 + 5 − 1)
Convolutional Filter Size
Three 3x3 Conv layers (3x3xC, 3x3xC, 3x3xC) vs. a single 7x7 Conv layer (7x7xC)
Assume that zero-padding is applied to preserve the spatial resolution.
• Receptive fields are equal: for three 3x3 Conv layers, 3+3−1+3−1 = 7.
• # of Conv parameters: $3 \times C \times (3 \times 3 \times C) = 27C^2$ vs. $C \times (7 \times 7 \times C) = 49C^2$
• The stack of three CONV layers produces more expressive activation maps
Spatial Dimension at Convolution Layer
7x7 input (spatially)
assume 3x3 filter applied with stride 1 -> 5×5 output
assume 3×3 filter applied with stride 2 -> 3×3 output
assume 3×3 filter applied with stride 3
-> doesn’t fit!
cannot apply 3×3 filter on 7×7 input with stride 3.
*Stride: the number of pixels the filter shifts over the input matrix
Spatial Dimension at Convolution Layer
Output size:
(N – F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 – 3)/1 + 1 = 5
stride 2 => (7 – 3)/2 + 1 = 3
stride 3 => (7 – 3)/3 + 1 = 2.33 (doesn't fit!)
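The formula is easy to wrap in a small helper that also rejects strides that don't fit (an illustrative sketch; the padding argument anticipates the next slides):

def conv_output_size(n, f, stride, pad=0):
    size = (n + 2 * pad - f) / stride + 1
    assert size == int(size), "filter does not fit this input/stride combination"
    return int(size)

conv_output_size(7, 3, 1)          # 5
conv_output_size(7, 3, 2)          # 3
conv_output_size(32, 5, 1, pad=2)  # 32 (the example on a later slide)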
Spatial Dimension at Convolution Layer
In practice: Common to zero pad the border
e.g. input 7×7
3×3 filter, applied with stride 1 pad with 1 pixel border
-> 7×7 output!
in general, common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2. (will preserve size spatially)
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
Spatial Dimension at Convolution Layer
Input volume: 32x32x3
10 5x5x3 filters with stride 1, pad 2
Output volume size: (32+2*2-5)/1+1 = 32 spatially, so 32x32x10
Spatial Dimension at Convolution Layer
Input volume: 32x32x3
10 5x5x3 filters with stride 1, pad 2
Number of parameters in this layer?
each filter has 5*5*3 + 1 = 76 params (+1 for bias) -> 76*10 = 760
1×1 Convolution

1×1 convolution layers make perfect sense:
e.g. 56×56×64 input -> 1×1 CONV with 32 filters -> 56×56×32 output
(each filter has size 1x1x64, and performs a 64-dimensional dot product)
Convolutional Layer
Implementation and Backpropagation
Implementation as Matrix Multiplication
• Convolution: dot products between the filters and local regions of the input
• Conv layer: the forward pass of a convolutional layer as one big matrix multiply
• Example of feed-forward process

Input: [227x227x3]
Conv filter: 96 filters of [11x11x3]
Conv bias: 96×1 vector
Stride: 4, Padding: 0
Output: (227-11)/4+1 = 55 -> [55x55x96]

1. Convert the input into X_col by taking a block of 11x11x3 (=363) pixels in the input for 55×55 (=3025) times: X_col: [363×3025]
2. Reshape the conv filter into W_row: [96×363]; reshape the conv bias (96×1 vector) into b_col: [96×3025] by stacking it 3025 times
3. Perform matrix multiplication O = W_row * X_col + b_col
4. Reshape O: [96×3025] into [55x55x96]

O = W_row * X_col + b_col
[96×3025] = [96×363] * [363×3025] + [96×3025]
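The four steps can be reproduced in a few lines of NumPy (a naive, loop-based im2col for clarity; a production implementation would vectorize the patch extraction):

import numpy as np

def im2col(x, f, stride):
    # x: HxWxC input; returns [f*f*C x P], one flattened patch per column
    H, W, C = x.shape
    out = (H - f) // stride + 1
    cols = [x[i*stride:i*stride+f, j*stride:j*stride+f, :].reshape(-1)
            for i in range(out) for j in range(out)]
    return np.stack(cols, axis=1)

x = np.random.randn(227, 227, 3)
W_row = np.random.randn(96, 11 * 11 * 3)        # filters reshaped to rows
b_col = np.random.randn(96, 1)                  # broadcasts over all 3025 columns
X_col = im2col(x, f=11, stride=4)               # [363 x 3025]
O = W_row @ X_col + b_col                       # [96 x 3025]
O = O.reshape(96, 55, 55).transpose(1, 2, 0)    # -> [55 x 55 x 96]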
Backpropagation of Convolution Layer

Input: [227x227x3]
Conv filter: 96 filters of [11x11x3]
Stride: 4, Padding: 0
Output: (227-11)/4+1 = 55 -> [55x55x96]

The convolution layer shares $\mathbf{W}$ across all neurons of the current activation map. For each neuron $p$, with $\mathbf{x}_p$: (11x11x3)x1 = 363×1 and $p = 1, \ldots, 3025\,(= 55 \times 55)$:

$\frac{\partial L}{\partial \mathbf{x}_p} = \mathbf{W}^T\frac{\partial L}{\partial \mathbf{s}_p}$,  $\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{s}_p}\mathbf{x}_p^T$,  $\frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{s}_p}$

Recall the backpropagation of $\mathbf{s} = \mathbf{W}\mathbf{x} + \mathbf{b}$:
$\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial \mathbf{s}}{\partial \mathbf{x}}\frac{\partial L}{\partial \mathbf{s}} = \mathbf{W}^T\frac{\partial L}{\partial \mathbf{s}}$,  $\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{s}}\mathbf{x}^T$,  $\frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{s}}$
Backpropagation of Convolution Layer

Input: [227x227x3]
Conv filter: 96 filters of [11x11x3]
Stride: 4, Padding: 0
Output: (227-11)/4+1 = 55 -> [55x55x96]

The convolution layer shares $\mathbf{W}$ across all neurons of the current activation map. For all neurons at once, with $\mathbf{x}$: (11x11x3)x3025 = 363×3025 and $\mathbf{s}$: 96×3025:

$\frac{\partial L}{\partial \mathbf{x}} = \mathbf{W}^T\frac{\partial L}{\partial \mathbf{s}}$,  $\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{s}}\mathbf{x}^T$,  $\frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{s}}\mathbf{1}^T$  (where $\mathbf{1}$ is a 1×3025 vector of ones)
Backpropagation of Convolution Layer

With $\mathbf{x}$: (11x11x3)x3025 = 363×3025, $\mathbf{s}$: 96×3025, $\mathbf{W}$: 96×363:

For $\frac{\partial L}{\partial \mathbf{x}} = \mathbf{W}^T\frac{\partial L}{\partial \mathbf{s}}$:
1. Perform $\frac{\partial L}{\partial \mathbf{x}} = \mathbf{W}^T\frac{\partial L}{\partial \mathbf{s}}$
2. Reshape $\frac{\partial L}{\partial \mathbf{x}}$ (363×3025) into 3025 gradients of 11x11x3
3. Overlay the reshaped gradients onto a 3D matrix [227x227x3], in which overlapped gradients are accumulated

For $\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{s}}\mathbf{x}^T$ and $\frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{s}}\mathbf{1}^T$:
1. Convert the input into $\mathbf{x}$ by taking a block of 11x11x3 (=363) pixels in the input for 55×55 (=3025) times: $\mathbf{x}$: [363×3025]
2. Perform $\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{s}}\mathbf{x}^T$ and $\frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{s}}\mathbf{1}^T$
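Step 3, the overlap-and-accumulate, is the only non-obvious part; it is often called col2im. A naive sketch (the helper name and loop structure are illustrative assumptions):

import numpy as np

def col2im(dX_col, in_shape, f, stride):
    # dX_col: [f*f*C x P] per-patch gradients; scatter-add back to the input shape
    H, W, C = in_shape
    dx = np.zeros(in_shape)
    out = (H - f) // stride + 1
    p = 0
    for i in range(out):
        for j in range(out):
            patch = dX_col[:, p].reshape(f, f, C)
            dx[i*stride:i*stride+f, j*stride:j*stride+f, :] += patch  # overlaps accumulate
            p += 1
    return dx

# e.g. dX_col = W.T @ dL_ds (363x3025), then dx = col2im(dX_col, (227, 227, 3), 11, 4)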
Fully Connected Layer
Fully Connected Layer
32x32x3 image -> stretch to 3072 x 1
Each neuron looks at the full input volume
10 x 3072 weight matrix -> 10 activations
Each activation is the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product)
Pooling Layer
Pooling layer
• makes the representations smaller and more manageable
• operates over each activation map independently
Pooling layer
– Max pooling
– Average pooling (rarely used)
– L2-norm pooling (rarely used)
MAX POOLING
Single depth slice
max pool with 2×2 filters and stride 2
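The 2×2/stride-2 case reduces to a reshape and a max in NumPy (a sketch for a single depth slice with even height and width; the example values are illustrative):

import numpy as np

def max_pool_2x2(x):
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))  # max over each 2x2 block

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool_2x2(x))  # [[6 8]
                        #  [3 4]]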
EBU7240 Computer Vision
CNN Architectures
Semester 1, 2021
Changjae Oh
CNN Architectures
• Case Studies
̶ AlexNet
̶ VGGNet
̶ GoogLeNet
̶ ResNet
Case Study: AlexNet [Krizhevsky et al. 2012]

Architecture: CONV1 - POOL1 - NORM1 - CONV2 - POOL2 - NORM2 - CONV3 - CONV4 - CONV5 - POOL3 - FC6 - FC7 - FC8
Case Study: AlexNet [Krizhevsky et al. 2012]
Input: 227x227x3 images
First layer (CONV1): 96 11×11 filters applied at stride 4 =>
The output volume size: (227-11)/4+1 = 55
Output volume [55x55x96]
Total number of parameters in this layer
Parameters: (11*11*3)*96 = 35K
Bias: 96
Case Study: AlexNet [Krizhevsky et al. 2012]
Input: 227x227x3 images
After CONV1: 55x55x96
Second layer (POOL1): 3×3 filters applied at stride 2
The output volume size: (55-3)/2+1 = 27
Output volume: 27x27x96
The number of parameters in this layer
Parameters: 0!
Case Study: AlexNet [Krizhevsky et al. 2012]
Input: 227x227x3 images
After CONV1: 55x55x96
After POOL1: 27x27x96
…
Case Study: AlexNet [Krizhevsky et al. 2012]
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11×11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3×3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5×5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3×3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3×3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3×3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3×3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3×3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
Details/Retrospectives:
– first use of ReLU
– used Norm layers (not common anymore)
– heavy data augmentation
– dropout 0.5
– batch size 128
– SGD Momentum 0.9
– Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
– L2 weight decay 5e-4
– 7 CNN ensemble: 18.2% -> 15.4%
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
First CNN-based winner
ZFNet: Improved hyperparameters over AlexNet
Case Study: VGGNet [Simonyan and Zisserman, 2014]
Small filters, Deeper networks
Only 3×3 CONV stride 1, pad 1 and 2×2 MAX POOL stride 2
8 layers (AlexNet)
-> 16 – 19 layers (VGGNet)
11.7% top 5 error in ILSVRC’13 (ZFNet)
-> 7.3% top 5 error in ILSVRC’14
Case Study: VGGNet [Simonyan and Zisserman, 2014]
Q: Why use smaller filters? (3×3 conv)
Stack of three 3×3 conv (stride 1) layers has same effective receptive field as one 7×7 conv layer
But deeper, more non-linearity
And fewer parameters: $3 \times (3^2 C^2) = 27C^2$ vs. $7^2 C^2 = 49C^2$ for C channels per layer
(not counting biases)
INPUT: [224x224x3] memory: 224*224*3=150K params: 0
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000
TOTAL memory: 15.2M * 4 bytes ~= 61MB / image (for a forward pass)
TOTAL params: 138M parameters