

Changjae Oh


Computer Vision
– Multi-layer Perceptron (MLP) –

Semester 1, 22/23

Neural networks

(Before) Linear score function: 𝒇 = 𝐖𝒙

(Now) 2-layer Neural Network: 𝒇 = 𝐖𝟐max(𝟎,𝐖𝟏𝒙)

3-layer Neural Network: 𝒇 = 𝐖𝟑max(𝟎,𝐖𝟐max(𝟎,𝐖𝟏𝒙))
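A minimal NumPy sketch of these score functions (not from the slides; all sizes are illustrative placeholders and biases are omitted):

```python
import numpy as np

# Illustrative sizes: a 3072-dim input (e.g. a flattened 32x32x3 image),
# 100 hidden units, 10 class scores.
x  = np.random.randn(3072)
W1 = np.random.randn(100, 3072) * 0.01
W2 = np.random.randn(10, 100) * 0.01
W3 = np.random.randn(10, 10) * 0.01

f_linear = (W2 @ W1) @ x                                    # linear: collapses to one matrix
f_2layer = W2 @ np.maximum(0, W1 @ x)                       # f = W2 max(0, W1 x)
f_3layer = W3 @ np.maximum(0, W2 @ np.maximum(0, W1 @ x))   # f = W3 max(0, W2 max(0, W1 x))
```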

Activation functions

• Adding non-linearities into neural networks, allowing the networks to learn powerful operations.

• A crucial component of deep learning

̶ If the activation functions were to be removed from a feedforward neural network, the entire network could be re-factored to a simple linear operation or matrix transformation on its input.

̶ It would no longer be capable of performing complex tasks such as image recognition.
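As an informal illustration (not part of the slides), the activation functions listed on the next slide can be written in a few lines of NumPy; names and default parameters are illustrative:

```python
import numpy as np

def sigmoid(x):                 # squashes values to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):                    # max(0, x)
    return np.maximum(0, x)

def leaky_relu(x, a=0.1):       # max(a*x, x)
    return np.maximum(a * x, x)

def elu(x, alpha=1.0):          # x if x >= 0, alpha*(exp(x) - 1) otherwise
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))
```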

Activation functions

Leaky ReLU: max(0.1𝑥, 𝑥)

Maxout: max(𝒘1ᵀ𝒙 + 𝑏1, 𝒘2ᵀ𝒙 + 𝑏2)

ELU: 𝑥 for 𝑥 ≥ 0, 𝛼(𝑒^𝑥 − 1) for 𝑥 < 0

Neural networks: Architectures

“2-layer Neural Net”, or “1-hidden-layer Neural Net”
“3-layer Neural Net”, or “2-hidden-layer Neural Net”
“Fully-connected” layers

Derivative of Neural Net using Chain Rule

̶ 1-layer Neural Net (L2 regression loss)
̶ 2-layer Neural Net (L2 regression loss)
̶ 1-layer Neural Net (Softmax classifier)
̶ 2-layer Neural Net (Softmax classifier)

1. 1-layer Neural Net (L2 regression loss)

Output layer: 1. Linear score 2. Activation function
𝒔 = 𝐖𝒙 + 𝒃, i.e. 𝑠𝑗 = 𝒘𝑗ᵀ𝒙 + 𝑏𝑗 = Σ𝑘 𝑤𝑗𝑘𝑥𝑘 + 𝑏𝑗, and 𝒑 = 𝜎(𝒔)
Loss against the ground truth 𝒛: 𝐿 = (𝒛 − 𝒑)²

In a vector form: 𝒙 → 𝐖𝒙 + 𝒃 → sigmoid → 𝒑, where 𝒔, 𝒑 and the ground truth 𝒛 are all 𝑛 × 1.
We need to compute gradients of 𝐖, 𝒃, 𝒔, 𝒑 with respect to the loss function 𝐿:
𝜕𝐿/𝜕𝒑 = −2(𝒛 − 𝒑)
𝜕𝒑/𝜕𝒔 = diag((1 − 𝜎(𝑠𝑗))𝜎(𝑠𝑗)), so 𝜕𝐿/𝜕𝒔 = (1 − 𝜎(𝒔)) ⊗ 𝜎(𝒔) ⊗ 𝜕𝐿/𝜕𝒑 (⊗: element-wise multiplication)
𝜕𝐿/𝜕𝐖 = (𝜕𝐿/𝜕𝒔)𝒙ᵀ, 𝜕𝐿/𝜕𝒃 = 𝜕𝐿/𝜕𝒔
Note that 𝜕𝐿/𝜕𝒙 can also be computed, but 𝒙 is input data that is fixed during training, so it is not necessary to compute its derivative.

2. 2-layer Neural Net (L2 regression loss)

In a vector form: 𝒙 → 𝐖1𝒙 + 𝒃1 → sigmoid → 𝒑1 → 𝐖2𝒑1 + 𝒃2 → sigmoid → 𝒑2, with 𝐿 = (𝒛 − 𝒑2)², where 𝒑1 is 𝑚 × 1 and 𝒑2, 𝒛 are 𝑛 × 1.
𝜕𝐿/𝜕𝒑2 = −2(𝒛 − 𝒑2), 𝜕𝒑2/𝜕𝒔2 = diag((1 − 𝜎(𝑠2,𝑗))𝜎(𝑠2,𝑗)), 𝜕𝒑1/𝜕𝒔1 = diag((1 − 𝜎(𝑠1,𝑗))𝜎(𝑠1,𝑗)); the gradients for 𝐖2, 𝒃2, 𝐖1, 𝒃1 follow by chaining these terms back through the network.

3. 1-layer Neural Net (Softmax classifier)

In a vector form: 𝒙 → 𝐖𝒙 + 𝒃 → softmax → 𝒑 (likelihood), with ground truth 𝒛 (all 𝑛 × 1).
The softmax Jacobian is 𝐷𝑎𝑏 = 𝜕𝑝𝑎/𝜕𝑠𝑏 = 𝑝𝑎(𝛿𝑎𝑏 − 𝑝𝑏), where 𝛿𝑎𝑏 = 1 if 𝑎 = 𝑏 and 0 otherwise.
As before, 𝜕𝐿/𝜕𝒙 can also be computed, but 𝒙 is fixed during training, so it is not needed.

4. 2-layer Neural Net (Softmax classifier)

In a vector form: 𝒙 → 𝐖1𝒙 + 𝒃1 → sigmoid → 𝒑1 → 𝐖2𝒑1 + 𝒃2 → softmax → 𝒑2 (likelihood), where 𝒑1 is 𝑚 × 1 and 𝒑2, 𝒛 are 𝑛 × 1.
The chain rule combines the sigmoid derivative diag((1 − 𝜎(𝑠1,𝑗))𝜎(𝑠1,𝑗)) with the softmax Jacobian 𝐷𝑎𝑏 = 𝑝𝑎(𝛿𝑎𝑏 − 𝑝𝑏).

Full implementation of training a 2-layer Neural Network

N: batch size
D_in: input feature size
H: input feature size of the second layer
D_out: output feature size
(a sketch of this training loop is given below)
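As referenced above, a minimal NumPy sketch of such a training loop, assuming sigmoid activations and an L2 regression loss as in the derivations; biases are omitted and the data are random placeholders:

```python
import numpy as np

# Sizes from the slide: N = batch size, D_in = input features,
# H = hidden size, D_out = output size. Random data stands in for a dataset.
N, D_in, H, D_out = 64, 1000, 100, 10
x = np.random.randn(N, D_in)
z = np.random.rand(N, D_out)            # ground truth (placeholder)

w1 = np.random.randn(D_in, H) * 0.01
w2 = np.random.randn(H, D_out) * 0.01
lr = 1e-2

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

for t in range(500):
    # forward pass: x -> W1 -> sigmoid -> W2 -> sigmoid -> p2
    s1 = x @ w1
    p1 = sigmoid(s1)
    s2 = p1 @ w2
    p2 = sigmoid(s2)
    loss = np.sum((z - p2) ** 2)        # L2 regression loss

    # backward pass (chain rule, as derived above)
    dL_dp2 = -2.0 * (z - p2)
    dL_ds2 = dL_dp2 * p2 * (1 - p2)     # sigmoid derivative
    dL_dw2 = p1.T @ dL_ds2
    dL_dp1 = dL_ds2 @ w2.T
    dL_ds1 = dL_dp1 * p1 * (1 - p1)
    dL_dw1 = x.T @ dL_ds1

    # gradient descent update
    w1 -= lr * dL_dw1
    w2 -= lr * dL_dw2
```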
Neural networks: Pros and cons

Pros:
̶ Flexible and general function approximation framework
̶ Can build extremely powerful models by adding more layers

Cons:
̶ Hard to analyze theoretically (e.g., training is prone to local optima)
̶ Huge amount of training data and computing power may be required to get good performance
̶ The space of implementation choices is huge (network architectures, parameters)

• We arrange neurons into fully-connected layers
• The layer abstraction allows us to use efficient vectorized code (e.g. matrix multiplication)
̶ Training uses back-propagation

Changjae Oh

Computer Vision
– Convolutional Neural Networks –

Semester 1, 22/23

CNN Introduction

• Image Recognition
̶ Recognizing the object class in the image

CNN (= ConvNet)
– is a sequence of layers
– Every layer of a ConvNet transforms one volume of activations to another through a differentiable function.
– Convolutional Layer: computes the output of neurons that are connected to local regions in the input
– ReLU (nonlinear) layer: activates relevant responses
– Pooling Layer: performs a downsampling operation along the spatial dimensions
– Fully-Connected Layer: each neuron in this layer will be connected to all the numbers in the previous volume

Typical architectures of ConvNet

[(Conv-ReLU)*N - POOL]*M - (FC-ReLU)*K
N is usually up to ~5, M is large, 0 <= K <= 2,
but some advances such as ResNet/GoogLeNet challenge this paradigm

Convolutional Layer

32x32x3 image, 5x5x3 filter
Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”
To preserve spatial structure, use the original 2D image
Filters always extend the full depth of the input volume
Each output value is the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)
Convolving (sliding) over all spatial locations produces an activation map
A second 5x5x3 filter produces a second activation map; if we had six 5x5x3 filters, we’d get six separate activation maps
We call the layer convolutional because it is related to convolution of two signals: elementwise multiplication and sum of a filter and the signal (image)

• The number of parameters in a convolutional layer
Input 𝐻1 × 𝑊1 × 𝐶1, 𝐶2 filters of 𝐹ℎ × 𝐹𝑣 × 𝐶1, output 𝐻1 × 𝑊1 × 𝐶2
The number of weights: 𝐶2 × (𝐹ℎ × 𝐹𝑣 × 𝐶1)
The number of biases: 𝐶2

• A ConvNet is a sequence of convolutional layers, interspersed with activation functions (e.g. 6 filters, then 10 filters)

• Receptive field
̶ The region of the input space that affects a particular unit of the network
̶ Stacking two 5 × 5 × 1 filters gives an effective receptive field of 9 × 9 (= 5 + 5 − 1)
̶ From the convolution property: 𝑦 = 𝑥 ∗ ℎ1, 𝑧 = 𝑦 ∗ ℎ2 → 𝑧 = 𝑥 ∗ ℎ1 ∗ ℎ2 = 𝑥 ∗ ℎ, where ℎ = ℎ1 ∗ ℎ2

Convolutional Filter Size

Three 3x3 Conv layers vs. a single 7x7 Conv layer (assume zero-padding is applied to preserve the spatial resolution)
• Receptive fields are equal: for three 3x3 Conv layers, 3+3-1+3-1 = 7
• # of Conv parameters: 3 × (𝐶 × 3 × 3 × 𝐶) = 27𝐶² vs. 𝐶 × 7 × 7 × 𝐶 = 49𝐶²
• The three stacked CONV layers produce more expressive activation maps

Spatial Dimension at Convolution Layer

7x7 input (spatially)
assume 3x3 filter applied with stride 1
-> 5×5 output

assume 3×3 filter applied with stride 2
-> 3×3 output

assume 3×3 filter applied with stride 3
-> doesn’t fit!
cannot apply 3×3 filter on 7×7 input with stride 3.

*Stride: the number of pixels by which the filter shifts over the input matrix

Spatial Dimension at Convolution Layer

Output size:
(N – F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 – 3)/1 + 1 = 5
stride 2 => (7 – 3)/2 + 1 = 3
stride 3 => (7 – 3)/3 + 1 = 2.33
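A small plain-Python helper that applies this formula (a sketch; the function name is made up):

```python
def conv_output_size(n, f, stride):
    """Output size of an N x N input under an F x F filter at the given stride."""
    if (n - f) % stride != 0:
        raise ValueError("filter does not fit: (N - F) is not divisible by the stride")
    return (n - f) // stride + 1

print(conv_output_size(7, 3, 1))   # 5
print(conv_output_size(7, 3, 2))   # 3
# conv_output_size(7, 3, 3) raises an error: 3x3 at stride 3 does not fit a 7x7 input
```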

Spatial Dimension at Convolution Layer

In practice: Common to zero pad the border

e.g. input 7×7
3×3 filter, applied with stride 1
pad with 1 pixel border
-> 7×7 output!

in general, common to see CONV layers with
stride 1, filters of size FxF, and zero-padding with
(F-1)/2. (will preserve size spatially)
e.g. F = 3 => zero pad with 1

F = 5 => zero pad with 2
F = 7 => zero pad with 3
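Extending the earlier helper to the padded formula (N + 2P − F)/stride + 1 (again a sketch with an invented name):

```python
def conv_output_size_padded(n, f, stride, pad):
    """(N + 2P - F) / stride + 1."""
    return (n + 2 * pad - f) // stride + 1

# Stride-1 "same" padding: pad = (F - 1) // 2 preserves the spatial size
for f in (3, 5, 7):
    print(f, conv_output_size_padded(7, f, stride=1, pad=(f - 1) // 2))   # 7 each time
```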

Spatial Dimension at Convolution Layer

Input volume: 32x32x3
10 5x5x3 filters with stride 1, pad 2

Output volume size:
(32+2*2-5)/1+1 = 32 spatially, so the output volume is 32x32x10

Spatial Dimension at Convolution Layer

Input volume: 32x32x3
10 5x5x3 filters with stride 1, pad 2

Number of parameters in this layer?
each filter has 5*5*3 + 1 = 76 params (+1 for bias)
-> 76*10 = 760
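These numbers can be checked with a deep-learning framework; a quick sketch assuming PyTorch is available (not part of the slides):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5, stride=1, padding=2)
x = torch.randn(1, 3, 32, 32)                       # one 32x32x3 image (NCHW layout)
print(conv(x).shape)                                # torch.Size([1, 10, 32, 32])
print(sum(p.numel() for p in conv.parameters()))    # 10*(5*5*3 + 1) = 760
```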

Spatial Dimension at Convolution Layer

1×1 Convolution

• 1×1 convolution layers make perfect sense, e.g. a 1×1 convolution with 32 filters applied to an input volume of depth 64
(each filter has size 1x1x64, and performs a 64-dimensional dot product)

Convolutional Layer
Implementation and Backpropagation

Implementation as Matrix Multiplication

• Convolution: dot products between the filters and local regions of the input

• Conv layer: the forward pass of a convolutional layer as one big matrix multiply

• Example of feed-forward process

1. Convert the input into X_col by taking a block of 11x11x3 (=363) pixels in
the input for 55×55 (=3025) times

X_col: [363×3025]

2. Reshape the conv filter into W_row: [96×363]
Reshape the conv bias (96×1 vector) into b_col: [96×3025] by stacking it for 3025 times

3. Perform matrix multiplication O = W_row * X_col + b_col

4. Reshape O: [96×3025] into [55x55x96]

Input: [227x227x3]
Conv filter: 96 filters of [11x11x3]
Conv bias: 96×1 vector
Stride: 4
Padding: 0

Output: (227-11)/4+1 = 55
-> [55x55x96]

O = W_row * X_col + b_col

[96 × 3025] = [96 × 363] [363 × 3025] + [96 × 3025]
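A minimal NumPy sketch of this feed-forward process; im2col here is a naive loop written for clarity, and the reshaping/axis order is illustrative rather than the exact layout used in the slides:

```python
import numpy as np

def im2col(x, f, stride):
    """Gather every f x f x C block of x (H x W x C) into one column.

    Returns a matrix of shape (f*f*C, number_of_blocks), matching X_col above.
    """
    H, W, C = x.shape
    out_h = (H - f) // stride + 1
    out_w = (W - f) // stride + 1
    cols = np.empty((f * f * C, out_h * out_w))
    idx = 0
    for i in range(0, H - f + 1, stride):
        for j in range(0, W - f + 1, stride):
            cols[:, idx] = x[i:i + f, j:j + f, :].reshape(-1)
            idx += 1
    return cols

# Shapes from the slide: 227x227x3 input, 96 filters of 11x11x3, stride 4, pad 0
x = np.random.randn(227, 227, 3)
W_row = np.random.randn(96, 11 * 11 * 3)     # reshaped filters: [96 x 363]
b_col = np.random.randn(96, 1)               # broadcasting replaces explicit stacking
X_col = im2col(x, f=11, stride=4)            # [363 x 3025]
O = W_row @ X_col + b_col                    # [96 x 3025]
out = O.reshape(96, 55, 55)                  # -> [55x55x96] up to axis ordering
print(X_col.shape, O.shape, out.shape)
```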

Backpropagation of Convolution Layer

Convolution layer shares 𝐖 for all neurons of current activation map.
For each neuron,

𝒙𝑝: (11x11x3)x1 = 363×1

Input: [227x227x3]
Conv filter: 96 filters of [11x11x3]
Padding: 0

Output: (227-11)/4+1 = 55
-> [55x55x96]

Backpropagation of 𝑠 = W𝑥 + 𝑏

𝑝 = 1,… , 3025(= 55 × 55)

Backpropagation of Convolution Layer

Convolution layer shares 𝐖 for all neurons of current activation map.
For all neurons,

𝒙: (11x11x3)x3025
= 363×3025

𝒔: 96×3025

Input: [227x227x3]
Conv filter: 96 filters of [11x11x3]
Padding: 0

Output: (227-11)/4+1 = 55
-> [55x55x96]

Backpropagation of 𝑠 = W𝑥 + 𝑏

(each row of 𝜕𝐿/𝜕𝒔 is a 1×3025 vector: one gradient per spatial location)

Backpropagation of Convolution Layer

𝒙: (11x11x3)x3025 = 363×3025
𝒔: 96×3025

Forward pass (reminder):
1. Convert the input into 𝒙 by taking a block of 11x11x3 (=363) pixels in the input for 55×55 (=3025) times. 𝒙: [363×3025]
2. Perform the matrix multiplication 𝒔 = 𝐖𝒙 + 𝒃

Backward pass (gradient with respect to the input):
1. Perform 𝜕𝐿/𝜕𝒙 = 𝐖ᵀ(𝜕𝐿/𝜕𝒔)
2. Reshape 𝜕𝐿/𝜕𝒙 (363×3025) into 3025 gradients of 11x11x3
3. Overlay the reshaped gradients into a 3D matrix [227x227x3] in which overlapped gradients are accumulated.
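Continuing the im2col sketch above, a minimal NumPy version of this backward pass; dL_dO, X_col and W_row are random placeholders standing in for the upstream gradient and the tensors saved from the forward pass:

```python
import numpy as np

def col2im(d_cols, shape, f, stride):
    """Scatter column gradients back onto the input, accumulating overlaps."""
    H, W, C = shape
    dx = np.zeros(shape)
    idx = 0
    for i in range(0, H - f + 1, stride):
        for j in range(0, W - f + 1, stride):
            dx[i:i + f, j:j + f, :] += d_cols[:, idx].reshape(f, f, C)
            idx += 1
    return dx

dL_dO = np.random.randn(96, 3025)              # upstream gradient, same shape as O
X_col = np.random.randn(363, 3025)             # stands in for the saved forward X_col
W_row = np.random.randn(96, 363)               # stands in for the reshaped filters

dL_dW_row = dL_dO @ X_col.T                    # [96 x 363], W is shared across locations
dL_db = dL_dO.sum(axis=1, keepdims=True)       # [96 x 1]
dL_dX_col = W_row.T @ dL_dO                    # [363 x 3025]
dL_dx = col2im(dL_dX_col, (227, 227, 3), f=11, stride=4)   # accumulate overlaps
```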

Fully Connected Layer

Fully Connected Layer

32x32x3 image -> stretch to 3072 x 1

activation = weight matrix 𝐖 × input 𝒙: each element is the result of taking a dot product between a row of 𝐖 and the input (a 3072-dimensional dot product)

Each neuron looks at
the full input volume
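A one-layer NumPy sketch of this (the output size of 10 neurons is illustrative):

```python
import numpy as np

image = np.random.randn(32, 32, 3)
x = image.reshape(-1)                   # stretch to a 3072-dim vector
W = np.random.randn(10, 3072) * 0.01    # weight matrix: one row per output neuron
b = np.zeros(10)
activation = W @ x + b                  # each entry is a 3072-dimensional dot product
```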

Pooling Layer

Pooling layer

• makes the representations smaller and more manageable

• operates over each activation map independently

Pooling layer
– Max pooling
– Average pooling (rarely used)
– L2 norm pooling (rarely used)

MAX POOLING

Single depth slice

max pool with 2×2 filters
and stride 2
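A naive NumPy sketch of max pooling over each depth slice (loop-based for clarity; sizes are illustrative):

```python
import numpy as np

def max_pool(x, f=2, stride=2):
    """Max pooling applied to each depth slice independently (x: H x W x C)."""
    H, W, C = x.shape
    out_h, out_w = (H - f) // stride + 1, (W - f) // stride + 1
    out = np.empty((out_h, out_w, C))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + f, j * stride:j * stride + f, :]
            out[i, j, :] = patch.max(axis=(0, 1))
    return out

x = np.random.randn(224, 224, 64)
print(max_pool(x).shape)    # (112, 112, 64): spatial size halved, depth unchanged
```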

Changjae Oh

Computer Vision
– CNN Architectures –

Semester 1, 22/23

CNN Architectures

• Case Studies

̶ AlexNet
̶ VGGNet
̶ GoogLeNet

Case Study: AlexNet

Architecture:

[Krizhevsky et al. 2012]

Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images
First layer (CONV1): 96 11×11 filters applied at stride 4
The output volume size: (227-11)/4+1 = 55
Output volume [55x55x96]

Total number of parameters in this layer
Parameters: (11*11*3)*96 = 35K

Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images
After CONV1: 55x55x96

Second layer (POOL1): 3×3 filters applied at stride 2
The output volume size: (55-3)/2+1 = 27
Output volume: 27x27x96

The number of parameters in this layer
Parameters: 0!

Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images
After CONV1: 55x55x96
After POOL1: 27x27x96
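A quick sanity check of these sizes and parameter counts in plain Python (using the output-size formula from earlier; not part of the slides):

```python
def out_size(n, f, stride, pad=0):
    return (n + 2 * pad - f) // stride + 1

conv1 = out_size(227, 11, stride=4)        # 55  -> 55x55x96 after CONV1
pool1 = out_size(conv1, 3, stride=2)       # 27  -> 27x27x96 after POOL1
print(conv1, pool1)

conv1_params = 96 * (11 * 11 * 3)          # 34,848 ~ 35K weights (biases not counted)
pool1_params = 0                           # pooling has no parameters
print(conv1_params, pool1_params)
```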

Case Study: AlexNet
[Krizhevsky et al. 2012]

Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11×11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3×3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5×5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3×3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3×3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3×3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3×3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3×3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)

Details/Retrospectives:
– first use of ReLU
– used Norm layers (not common anymore)
– heavy data augmentation
– dropout 0.5
– batch size 128
– SGD Momentum 0.9
– Learning rate 1e-2, reduced by 10
manually when val accuracy plateaus
– L2 weight decay 5e-4
– 7 CNN ensemble: 18.2% -> 15.4%

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners

First CNN-based winner: AlexNet

ZFNet: Improved hyperparameters over AlexNet

Case Study: VGGNet

Small filters, Deeper networks

Only 3×3 CONV stride 1, pad 1
and 2×2 MAX POOL stride 2

8 layers (AlexNet)
-> 16 – 19 layers (VGGNet)

11.7% top 5 error in ILSVRC’13
-> 7.3% top 5 error in ILSVRC’14

AlexNet VGG16 VGG19

[Simonyan and Zisserman, 2014]

Case Study: VGGNet

Q: Why use smaller filters? (3×3 conv)

Stack of three 3×3 conv (stride 1) layers
has same effective receptive field as
one 7×7 conv layer

But deeper, more non-linearity

And fewer parameters: 3 × (3²C²) = 27C² vs.
7²C² = 49C² for C channels per layer

AlexNet VGG16 VGG19

[Simonyan and Zisserman, 2014]
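A small plain-Python check of this parameter comparison (the channel count C is illustrative):

```python
# Three stacked 3x3 conv layers vs. one 7x7 conv layer, assuming C channels
# in and out for every layer (biases ignored).
def stacked_3x3(C):
    return 3 * (3 * 3 * C * C)    # 27 C^2

def single_7x7(C):
    return 7 * 7 * C * C          # 49 C^2

C = 256                                  # illustrative channel count
print(stacked_3x3(C), single_7x7(C))     # 1,769,472 vs 3,211,264
```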

INPUT: [224x224x3] memory: 224*224*3=150K params: 0
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000

TOTAL memory: 15.2M * 4 bytes ~= 61MB / image (for a forward pass)
TOTAL params: 138M parameters

(not counting biases)

