

Changjae Oh


Computer Vision
– Multi-layer Perceptron (MLP) –

Semester 1, 22/23

Neural networks

(Before) Linear score function: 𝒇 = 𝐖𝒙

(Now) 2-layer Neural Network: 𝒇 = 𝐖𝟐max(𝟎,𝐖𝟏𝒙)

3-layer Neural Network: 𝒇 = 𝐖𝟑max(𝟎,𝐖𝟐max(𝟎,𝐖𝟏𝒙))
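A minimal NumPy sketch of these score functions (not from the slides; all sizes are illustrative placeholders and biases are omitted):

```python
import numpy as np

# Illustrative sizes: a 3072-dim input (e.g. a flattened 32x32x3 image),
# 100 hidden units, 10 class scores.
x  = np.random.randn(3072)
W1 = np.random.randn(100, 3072) * 0.01
W2 = np.random.randn(10, 100) * 0.01
W3 = np.random.randn(10, 10) * 0.01

f_linear = (W2 @ W1) @ x                                    # linear: collapses to one matrix
f_2layer = W2 @ np.maximum(0, W1 @ x)                       # f = W2 max(0, W1 x)
f_3layer = W3 @ np.maximum(0, W2 @ np.maximum(0, W1 @ x))   # f = W3 max(0, W2 max(0, W1 x))
```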

Activation functions

• Adding non-linearities into neural networks, allowing the networks to learn powerful operations.

• A crucial component of deep learning

̶ If the activation functions were to be removed from a feedforward neural network, the entire network could be re-factored to a simple linear operation or matrix transformation on its input.

̶ It would no longer be capable of performing complex tasks such as image recognition.
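As an informal illustration (not part of the slides), the activation functions listed on the next slide can be written in a few lines of NumPy; names and default parameters are illustrative:

```python
import numpy as np

def sigmoid(x):                 # squashes values to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):                    # max(0, x)
    return np.maximum(0, x)

def leaky_relu(x, a=0.1):       # max(a*x, x)
    return np.maximum(a * x, x)

def elu(x, alpha=1.0):          # x if x >= 0, alpha*(exp(x) - 1) otherwise
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))
```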

Activation functions

Leaky ReLU: max(0.1𝑥, 𝑥)

Maxout: max(𝒘1ᵀ𝒙 + 𝑏1, 𝒘2ᵀ𝒙 + 𝑏2)

ELU: 𝑥 for 𝑥 ≥ 0, 𝛼(𝑒^𝑥 − 1) for 𝑥 < 0

Neural networks: Architectures

“2-layer Neural Net”, or “1-hidden-layer Neural Net”
“3-layer Neural Net”, or “2-hidden-layer Neural Net”
“Fully-connected” layers

Derivative of Neural Net using Chain Rule

̶ 1-layer Neural Net (L2 regression loss)
̶ 2-layer Neural Net (L2 regression loss)
̶ 1-layer Neural Net (Softmax classifier)
̶ 2-layer Neural Net (Softmax classifier)

1. 1-layer Neural Net (L2 regression loss)

Output layer: 1. Linear score 2. Activation function
𝒔 = 𝐖𝒙 + 𝒃, i.e. 𝑠𝑗 = 𝒘𝑗ᵀ𝒙 + 𝑏𝑗 = Σ𝑘 𝑤𝑗𝑘𝑥𝑘 + 𝑏𝑗, and 𝒑 = 𝜎(𝒔)
Loss against the ground truth 𝒛: 𝐿 = (𝒛 − 𝒑)²

In a vector form: 𝒙 → 𝐖𝒙 + 𝒃 → sigmoid → 𝒑, where 𝒔, 𝒑 and the ground truth 𝒛 are all 𝑛 × 1.
We need to compute gradients of 𝐖, 𝒃, 𝒔, 𝒑 with respect to the loss function 𝐿:
𝜕𝐿/𝜕𝒑 = −2(𝒛 − 𝒑)
𝜕𝒑/𝜕𝒔 = diag((1 − 𝜎(𝑠𝑗))𝜎(𝑠𝑗)), so 𝜕𝐿/𝜕𝒔 = (1 − 𝜎(𝒔)) ⊗ 𝜎(𝒔) ⊗ 𝜕𝐿/𝜕𝒑 (⊗: element-wise multiplication)
𝜕𝐿/𝜕𝐖 = (𝜕𝐿/𝜕𝒔)𝒙ᵀ, 𝜕𝐿/𝜕𝒃 = 𝜕𝐿/𝜕𝒔
Note that 𝜕𝐿/𝜕𝒙 can also be computed, but 𝒙 is input data that is fixed during training, so it is not necessary to compute its derivative.

2. 2-layer Neural Net (L2 regression loss)

In a vector form: 𝒙 → 𝐖1𝒙 + 𝒃1 → sigmoid → 𝒑1 → 𝐖2𝒑1 + 𝒃2 → sigmoid → 𝒑2, with 𝐿 = (𝒛 − 𝒑2)², where 𝒑1 is 𝑚 × 1 and 𝒑2, 𝒛 are 𝑛 × 1.
𝜕𝐿/𝜕𝒑2 = −2(𝒛 − 𝒑2), 𝜕𝒑2/𝜕𝒔2 = diag((1 − 𝜎(𝑠2,𝑗))𝜎(𝑠2,𝑗)), 𝜕𝒑1/𝜕𝒔1 = diag((1 − 𝜎(𝑠1,𝑗))𝜎(𝑠1,𝑗)); the gradients for 𝐖2, 𝒃2, 𝐖1, 𝒃1 follow by chaining these terms back through the network.

3. 1-layer Neural Net (Softmax classifier)

In a vector form: 𝒙 → 𝐖𝒙 + 𝒃 → softmax → 𝒑 (likelihood), with ground truth 𝒛 (all 𝑛 × 1).
The softmax Jacobian is 𝐷𝑎𝑏 = 𝜕𝑝𝑎/𝜕𝑠𝑏 = 𝑝𝑎(𝛿𝑎𝑏 − 𝑝𝑏), where 𝛿𝑎𝑏 = 1 if 𝑎 = 𝑏 and 0 otherwise.
As before, 𝜕𝐿/𝜕𝒙 can also be computed, but 𝒙 is fixed during training, so it is not needed.

4. 2-layer Neural Net (Softmax classifier)

In a vector form: 𝒙 → 𝐖1𝒙 + 𝒃1 → sigmoid → 𝒑1 → 𝐖2𝒑1 + 𝒃2 → softmax → 𝒑2 (likelihood), where 𝒑1 is 𝑚 × 1 and 𝒑2, 𝒛 are 𝑛 × 1.
The chain rule combines the sigmoid derivative diag((1 − 𝜎(𝑠1,𝑗))𝜎(𝑠1,𝑗)) with the softmax Jacobian 𝐷𝑎𝑏 = 𝑝𝑎(𝛿𝑎𝑏 − 𝑝𝑏).

Full implementation of training a 2-layer Neural Network

N: batch size
D_in: input feature size
H: input feature size of the second layer
D_out: output feature size
(a sketch of this training loop is given below)
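As referenced above, a minimal NumPy sketch of such a training loop, assuming sigmoid activations and an L2 regression loss as in the derivations; biases are omitted and the data are random placeholders:

```python
import numpy as np

# Sizes from the slide: N = batch size, D_in = input features,
# H = hidden size, D_out = output size. Random data stands in for a dataset.
N, D_in, H, D_out = 64, 1000, 100, 10
x = np.random.randn(N, D_in)
z = np.random.rand(N, D_out)            # ground truth (placeholder)

w1 = np.random.randn(D_in, H) * 0.01
w2 = np.random.randn(H, D_out) * 0.01
lr = 1e-2

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

for t in range(500):
    # forward pass: x -> W1 -> sigmoid -> W2 -> sigmoid -> p2
    s1 = x @ w1
    p1 = sigmoid(s1)
    s2 = p1 @ w2
    p2 = sigmoid(s2)
    loss = np.sum((z - p2) ** 2)        # L2 regression loss

    # backward pass (chain rule, as derived above)
    dL_dp2 = -2.0 * (z - p2)
    dL_ds2 = dL_dp2 * p2 * (1 - p2)     # sigmoid derivative
    dL_dw2 = p1.T @ dL_ds2
    dL_dp1 = dL_ds2 @ w2.T
    dL_ds1 = dL_dp1 * p1 * (1 - p1)
    dL_dw1 = x.T @ dL_ds1

    # gradient descent update
    w1 -= lr * dL_dw1
    w2 -= lr * dL_dw2
```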
Neural networks: Pros and cons

Pros:
̶ Flexible and general function approximation framework
̶ Can build extremely powerful models by adding more layers

Cons:
̶ Hard to analyze theoretically (e.g., training is prone to local optima)
̶ Huge amount of training data and computing power may be required to get good performance
̶ The space of implementation choices is huge (network architectures, parameters)

• We arrange neurons into fully-connected layers
• The layer abstraction allows us to use efficient vectorized code (e.g. matrix multiplication)
̶ Training uses back-propagation

Changjae Oh

Computer Vision
– Convolutional Neural Networks –

Semester 1, 22/23

CNN Introduction

• Image Recognition
̶ Recognizing the object class in the image

CNN (= ConvNet)
– is a sequence of layers
– Every layer of a ConvNet transforms one volume of activations to another through a differentiable function.
– Convolutional Layer: computes the output of neurons that are connected to local regions in the input
– ReLU (nonlinear) layer: activates relevant responses
– Pooling Layer: performs a downsampling operation along the spatial dimensions
– Fully-Connected Layer: each neuron in this layer will be connected to all the numbers in the previous volume

Typical architectures of ConvNet

[(Conv-ReLU)*N - POOL]*M - (FC-ReLU)*K
N is usually up to ~5, M is large, 0 <= K <= 2,
but some advances such as ResNet/GoogLeNet challenge this paradigm

Convolutional Layer

32x32x3 image, 5x5x3 filter
Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”
To preserve spatial structure, use the original 2D image
Filters always extend the full depth of the input volume
Each output value is the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)
Convolving (sliding) over all spatial locations produces an activation map
A second 5x5x3 filter produces a second activation map; if we had six 5x5x3 filters, we’d get six separate activation maps
We call the layer convolutional because it is related to convolution of two signals: elementwise multiplication and sum of a filter and the signal (image)

• The number of parameters in a convolutional layer
Input 𝐻1 × 𝑊1 × 𝐶1, 𝐶2 filters of 𝐹ℎ × 𝐹𝑣 × 𝐶1, output 𝐻1 × 𝑊1 × 𝐶2
The number of weights: 𝐶2 × (𝐹ℎ × 𝐹𝑣 × 𝐶1)
The number of biases: 𝐶2

• A ConvNet is a sequence of convolutional layers, interspersed with activation functions (e.g. 6 filters, then 10 filters)

• Receptive field
̶ The region of the input space that affects a particular unit of the network
̶ Stacking two 5 × 5 × 1 filters gives an effective receptive field of 9 × 9 (= 5 + 5 − 1)
̶ From the convolution property: 𝑦 = 𝑥 ∗ ℎ1, 𝑧 = 𝑦 ∗ ℎ2 → 𝑧 = 𝑥 ∗ ℎ1 ∗ ℎ2 = 𝑥 ∗ ℎ, where ℎ = ℎ1 ∗ ℎ2

Convolutional Filter Size

Three 3x3 Conv layers vs. a single 7x7 Conv layer (assume zero-padding is applied to preserve the spatial resolution)
• Receptive fields are equal: for three 3x3 Conv layers, 3+3-1+3-1 = 7
• # of Conv parameters: 3 × (𝐶 × 3 × 3 × 𝐶) = 27𝐶² vs. 𝐶 × 7 × 7 × 𝐶 = 49𝐶²
• The three stacked CONV layers produce more expressive activation maps

Spatial Dimension at Convolution Layer

7x7 input (spatially)
assume 3x3 filter applied with stride 1
-> 5×5 output

assume 3×3 filter applied with stride 2
-> 3×3 output

assume 3×3 filter applied with stride 3
-> doesn’t fit!
cannot apply 3×3 filter on 7×7 input with stride 3.

*Stride: the number of pixels by which the filter shifts over the input matrix

Spatial Dimension at Convolution Layer

Output size:
(N – F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 – 3)/1 + 1 = 5
stride 2 => (7 – 3)/2 + 1 = 3
stride 3 => (7 – 3)/3 + 1 = 2.33
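A small plain-Python helper that applies this formula (a sketch; the function name is made up):

```python
def conv_output_size(n, f, stride):
    """Output size of an N x N input under an F x F filter at the given stride."""
    if (n - f) % stride != 0:
        raise ValueError("filter does not fit: (N - F) is not divisible by the stride")
    return (n - f) // stride + 1

print(conv_output_size(7, 3, 1))   # 5
print(conv_output_size(7, 3, 2))   # 3
# conv_output_size(7, 3, 3) raises an error: 3x3 at stride 3 does not fit a 7x7 input
```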

Spatial Dimension at Convolution Layer

In practice: Common to zero pad the border

e.g. input 7×7
3×3 filter, applied with stride 1
pad with 1 pixel border
-> 7×7 output!

in general, common to see CONV layers with
stride 1, filters of size FxF, and zero-padding with
(F-1)/2. (will preserve size spatially)
e.g. F = 3 => zero pad with 1

F = 5 => zero pad with 2
F = 7 => zero pad with 3
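Extending the earlier helper to the padded formula (N + 2P − F)/stride + 1 (again a sketch with an invented name):

```python
def conv_output_size_padded(n, f, stride, pad):
    """(N + 2P - F) / stride + 1."""
    return (n + 2 * pad - f) // stride + 1

# Stride-1 "same" padding: pad = (F - 1) // 2 preserves the spatial size
for f in (3, 5, 7):
    print(f, conv_output_size_padded(7, f, stride=1, pad=(f - 1) // 2))   # 7 each time
```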

Spatial Dimension at Convolution Layer

Input volume: 32x32x3
10 5x5x3 filters with stride 1, pad 2

Output volume size:
(32+2*2-5)/1+1 = 32 spatially, so the output volume is 32x32x10

Spatial Dimension at Convolution Layer

Input volume: 32x32x3
10 5x5x3 filters with stride 1, pad 2

Number of parameters in this layer?
each filter has 5*5*3 + 1 = 76 params (+1 for bias)
-> 76*10 = 760
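These numbers can be checked with a deep-learning framework; a quick sketch assuming PyTorch is available (not part of the slides):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5, stride=1, padding=2)
x = torch.randn(1, 3, 32, 32)                       # one 32x32x3 image (NCHW layout)
print(conv(x).shape)                                # torch.Size([1, 10, 32, 32])
print(sum(p.numel() for p in conv.parameters()))    # 10*(5*5*3 + 1) = 760
```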

Spatial Dimension at Convolution Layer

1×1 Convolution

• 1×1 convolution layers make perfect sense, e.g. a 1×1 convolution with 32 filters applied to an input volume of depth 64
(each filter has size 1x1x64, and performs a 64-dimensional dot product)

Convolutional Layer
Implementation and Backpropagation

Implementation as Matrix Multiplication

• Convolution: dot products between the filters and local regions of the input

• Conv layer: the forward pass of a convolutional layer as one big matrix multiply

• Example of feed-forward process

1. Convert the input into X_col by taking a block of 11x11x3 (=363) pixels in
the input for 55×55 (=3025) times

X_col: [363×3025]

2. Reshape the conv filter into W_row: [96×363]
Reshape the conv bias (96×1 vector) into b_col: [96×3025] by stacking it for 3025 times

3. Perform matrix multiplication O = W_row * X_col + b_col

4. Reshape O: [96×3025] into [55x55x96]

Input: [227x227x3]
Conv filter: 96 filters of [11x11x3]
Conv bias: 96×1 vector
Stride: 4
Padding: 0

Output: (227-11)/4+1 = 55
-> [55x55x96]

O = W_row * X_col + b_col

[96 × 3025] = [96 × 363] [363 × 3025] + [96 × 3025]
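A minimal NumPy sketch of this feed-forward process; im2col here is a naive loop written for clarity, and the reshaping/axis order is illustrative rather than the exact layout used in the slides:

```python
import numpy as np

def im2col(x, f, stride):
    """Gather every f x f x C block of x (H x W x C) into one column.

    Returns a matrix of shape (f*f*C, number_of_blocks), matching X_col above.
    """
    H, W, C = x.shape
    out_h = (H - f) // stride + 1
    out_w = (W - f) // stride + 1
    cols = np.empty((f * f * C, out_h * out_w))
    idx = 0
    for i in range(0, H - f + 1, stride):
        for j in range(0, W - f + 1, stride):
            cols[:, idx] = x[i:i + f, j:j + f, :].reshape(-1)
            idx += 1
    return cols

# Shapes from the slide: 227x227x3 input, 96 filters of 11x11x3, stride 4, pad 0
x = np.random.randn(227, 227, 3)
W_row = np.random.randn(96, 11 * 11 * 3)     # reshaped filters: [96 x 363]
b_col = np.random.randn(96, 1)               # broadcasting replaces explicit stacking
X_col = im2col(x, f=11, stride=4)            # [363 x 3025]
O = W_row @ X_col + b_col                    # [96 x 3025]
out = O.reshape(96, 55, 55)                  # -> [55x55x96] up to axis ordering
print(X_col.shape, O.shape, out.shape)
```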

Backpropagation of Convolution Layer

Convolution layer shares 𝐖 for all neurons of current activation map.
For each neuron,

𝒙𝑝: (11x11x3)x1 = 363×1

Input: [227x227x3]
Conv filter: 96 filters of [11x11x3]
Padding: 0

Output: (227-11)/4+1 = 55
-> [55x55x96]

Backpropagation of 𝑠 = W𝑥 + 𝑏

𝑝 = 1,… , 3025(= 55 × 55)

Backpropagation of Convolution Layer

Convolution layer shares 𝐖 for all neurons of current activation map.
For all neurons,

𝒙: (11x11x3)x3025
= 363×3025

𝒔: 96×3025

Input: [227x227x3]
Conv filter: 96 filters of [11x11x3]
Padding: 0

Output: (227-11)/4+1 = 55
-> [55x55x96]

Backpropagation of 𝑠 = W𝑥 + 𝑏

(each row of 𝜕𝐿/𝜕𝒔 is a 1×3025 vector: one gradient per spatial location)

Backpropagation of Convolution Layer

𝒙: (11x11x3)x3025 = 363×3025
𝒔: 96×3025

Forward pass (reminder):
1. Convert the input into 𝒙 by taking a block of 11x11x3 (=363) pixels in the input for 55×55 (=3025) times. 𝒙: [363×3025]
2. Perform the matrix multiplication 𝒔 = 𝐖𝒙 + 𝒃

Backward pass (gradient with respect to the input):
1. Perform 𝜕𝐿/𝜕𝒙 = 𝐖ᵀ(𝜕𝐿/𝜕𝒔)
2. Reshape 𝜕𝐿/𝜕𝒙 (363×3025) into 3025 gradients of 11x11x3
3. Overlay the reshaped gradients into a 3D matrix [227x227x3] in which overlapped gradients are accumulated.
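Continuing the im2col sketch above, a minimal NumPy version of this backward pass; dL_dO, X_col and W_row are random placeholders standing in for the upstream gradient and the tensors saved from the forward pass:

```python
import numpy as np

def col2im(d_cols, shape, f, stride):
    """Scatter column gradients back onto the input, accumulating overlaps."""
    H, W, C = shape
    dx = np.zeros(shape)
    idx = 0
    for i in range(0, H - f + 1, stride):
        for j in range(0, W - f + 1, stride):
            dx[i:i + f, j:j + f, :] += d_cols[:, idx].reshape(f, f, C)
            idx += 1
    return dx

dL_dO = np.random.randn(96, 3025)              # upstream gradient, same shape as O
X_col = np.random.randn(363, 3025)             # stands in for the saved forward X_col
W_row = np.random.randn(96, 363)               # stands in for the reshaped filters

dL_dW_row = dL_dO @ X_col.T                    # [96 x 363], W is shared across locations
dL_db = dL_dO.sum(axis=1, keepdims=True)       # [96 x 1]
dL_dX_col = W_row.T @ dL_dO                    # [363 x 3025]
dL_dx = col2im(dL_dX_col, (227, 227, 3), f=11, stride=4)   # accumulate overlaps
```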

Fully Connected Layer

Fully Connected Layer

32x32x3 image -> stretch to 3072 x 1

activation = weight matrix 𝐖 × input 𝒙: each element is the result of taking a dot product between a row of 𝐖 and the input (a 3072-dimensional dot product)

Each neuron looks at
the full input volume
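A one-layer NumPy sketch of this (the output size of 10 neurons is illustrative):

```python
import numpy as np

image = np.random.randn(32, 32, 3)
x = image.reshape(-1)                   # stretch to a 3072-dim vector
W = np.random.randn(10, 3072) * 0.01    # weight matrix: one row per output neuron
b = np.zeros(10)
activation = W @ x + b                  # each entry is a 3072-dimensional dot product
```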

Pooling Layer

Pooling layer

• makes the representations smaller and more manageable

• operates over each activation map independently

Pooling layer
– Max pooling
– Average pooling (rarely used)
– L2 norm pooling (rarely used)

MAX POOLING

Single depth slice

max pool with 2×2 filters
and stride 2
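A naive NumPy sketch of max pooling over each depth slice (loop-based for clarity; sizes are illustrative):

```python
import numpy as np

def max_pool(x, f=2, stride=2):
    """Max pooling applied to each depth slice independently (x: H x W x C)."""
    H, W, C = x.shape
    out_h, out_w = (H - f) // stride + 1, (W - f) // stride + 1
    out = np.empty((out_h, out_w, C))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + f, j * stride:j * stride + f, :]
            out[i, j, :] = patch.max(axis=(0, 1))
    return out

x = np.random.randn(224, 224, 64)
print(max_pool(x).shape)    # (112, 112, 64): spatial size halved, depth unchanged
```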

Changjae Oh

Computer Vision
– CNN Architectures –

Semester 1, 22/23

CNN Architectures

• Case Studies

̶ AlexNet
̶ VGGNet
̶ GoogLeNet

Case Study: AlexNet

Architecture:

[Krizhevsky et al. 2012]

Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images
First layer (CONV1): 96 11×11 filters applied at stride 4
The output volume size: (227-11)/4+1 = 55
Output volume [55x55x96]

Total number of parameters in this layer
Parameters: (11*11*3)*96 = 35K

Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images
After CONV1: 55x55x96

Second layer (POOL1): 3×3 filters applied at stride 2
The output volume size: (55-3)/2+1 = 27
Output volume: 27x27x96

The number of parameters in this layer
Parameters: 0!

Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images
After CONV1: 55x55x96
After POOL1: 27x27x96
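A quick sanity check of these sizes and parameter counts in plain Python (using the output-size formula from earlier; not part of the slides):

```python
def out_size(n, f, stride, pad=0):
    return (n + 2 * pad - f) // stride + 1

conv1 = out_size(227, 11, stride=4)        # 55  -> 55x55x96 after CONV1
pool1 = out_size(conv1, 3, stride=2)       # 27  -> 27x27x96 after POOL1
print(conv1, pool1)

conv1_params = 96 * (11 * 11 * 3)          # 34,848 ~ 35K weights (biases not counted)
pool1_params = 0                           # pooling has no parameters
print(conv1_params, pool1_params)
```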

Case Study: AlexNet
[Krizhevsky et al. 2012]

Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11×11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3×3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5×5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3×3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3×3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3×3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3×3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3×3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)

Details/Retrospectives:
– first use of ReLU
– used Norm layers (not common anymore)
– heavy data augmentation
– dropout 0.5
– batch size 128
– SGD Momentum 0.9
– Learning rate 1e-2, reduced by 10
manually when val accuracy plateaus
– L2 weight decay 5e-4
– 7 CNN ensemble: 18.2% -> 15.4%

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners

First CNN-based winner: AlexNet

ZFNet: Improved hyperparameters over AlexNet

Case Study: VGGNet

Small filters, Deeper networks

Only 3×3 CONV stride 1, pad 1
and 2×2 MAX POOL stride 2

8 layers (AlexNet)
-> 16 – 19 layers (VGGNet)

11.7% top 5 error in ILSVRC’13
-> 7.3% top 5 error in ILSVRC’14

AlexNet VGG16 VGG19

[Simonyan and Zisserman, 2014]

Case Study: VGGNet

Q: Why use smaller filters? (3×3 conv)

Stack of three 3×3 conv (stride 1) layers
has same effective receptive field as
one 7×7 conv layer

But deeper, more non-linearity

And fewer parameters: 3 × (3²C²) = 27C² vs.
7²C² = 49C² for C channels per layer

AlexNet VGG16 VGG19

[Simonyan and Zisserman, 2014]
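A small plain-Python check of this parameter comparison (the channel count C is illustrative):

```python
# Three stacked 3x3 conv layers vs. one 7x7 conv layer, assuming C channels
# in and out for every layer (biases ignored).
def stacked_3x3(C):
    return 3 * (3 * 3 * C * C)    # 27 C^2

def single_7x7(C):
    return 7 * 7 * C * C          # 49 C^2

C = 256                                  # illustrative channel count
print(stacked_3x3(C), single_7x7(C))     # 1,769,472 vs 3,211,264
```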

INPUT: [224x224x3] memory: 224*224*3=150K params: 0
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000

TOTAL memory: 15.2M * 4 bytes ~= 61MB / image (for a forward pass)
TOTAL params: 138M parameters

(not counting biases)

