
CMPUT 366 F20: Supervised Learning VI
James Wright & Vadim Bulitko
November 24, 2020

Lecture Outline
Convolutional Networks
GBC 9.0-9.4

Recap: Neural Networks
• Neural networks are compositions of units
• Each unit’s inputs are outputs from the previous layer’s units
• Single unit h: inputs x, weights w, bias b, activation g:
  h(x; w, b) = g(w^T x + b)
• Weights and biases are parameters of the model, set during training
• Neural networks are universal approximators
(Figure: a small network with inputs x1, x2, hidden units h1, h2, and output y)
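As a concrete illustration (not part of the original slides), a single unit is a few lines of NumPy; the sigmoid used here is just one possible choice of activation g:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def unit(x, w, b, g=sigmoid):
        # h(x; w, b) = g(w^T x + b)
        return g(np.dot(w, x) + b)

    # a unit with two inputs
    x = np.array([0.5, -1.0])
    w = np.array([0.8, 0.3])
    print(unit(x, w, b=0.1))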

Neural Networks Parameters
• A neural network is just a supervised model: y = f(x; θ)
• It is a function that takes inputs x and computes an output y based on parameters θ
• Question: What is θ in a feedforward neural network?
  A: the weights and biases
(Figure: the same small network with inputs x1, x2, hidden units h1, h2, and output y)

Training Neural Networks
• Specify a loss ℓ and a set of training examples E = {(x^(1), y^(1)), …, (x^(n), y^(n))}
• Training by gradient descent:
  1. Compute loss on training data: L(W, b) = Σ_i ℓ(f(x^(i); W, b), y^(i))
  2. Compute gradient of loss: ∇L(W, b)
  3. Update parameters to make loss smaller:
     [W_new, b_new] = [W_old, b_old] − η ∇L(W_old, b_old)
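A minimal sketch of this loop in Python, assuming a hypothetical grad_loss(W, b) that returns the gradients of the training loss with respect to W and b (in practice these are computed by backpropagation):

    def gradient_descent(W, b, grad_loss, eta=0.1, steps=1000):
        """Step the parameters against the gradient of the training loss."""
        for _ in range(steps):
            dW, db = grad_loss(W, b)   # 2. gradient of the loss on the training data
            W = W - eta * dW           # 3. update parameters to make the loss smaller
            b = b - eta * db
        return W, b

The loss value itself (step 1) is usually also evaluated each iteration to monitor progress, but only its gradient is needed for the update.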

Image Classification
• Problem: Recognize the handwritten digit from an image (e.g., an image of a five)
• What are the inputs? A: pixels!
• What are the outputs? A: a one-hot vector
• What is the loss? A: many possibilities: log-loss, cross-entropy, 0/1 (i.e., accuracy)
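For concreteness (an illustration, not from the slides): a 32×32 greyscale image can be flattened into a vector of 1024 pixel intensities, and the target digit can be encoded as a one-hot vector of length 10:

    import numpy as np

    image = np.random.rand(32, 32)   # stand-in for a greyscale digit image
    x = image.reshape(-1)            # input: 1024 pixel intensities

    label = 5                        # the digit "five"
    y = np.zeros(10)
    y[label] = 1.0                   # target: one-hot vector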

Image Classification with Neural Networks
• How can we use a neural network to solve this problem?
• How do we represent the inputs? How do we represent the outputs? What are the parameters? What is the loss?
(Figure: a fully-connected network mapping pixel inputs x1,1 … x32,32 to a first hidden layer h1,1 … h1,512, a second hidden layer h2,1 … h2,256, and output units for the digits zero through nine)

Image Recognition Issues
• For a large image, the number of parameters will be very large
• For a 32×32 greyscale image, a first hidden layer of 512 units, and a second hidden layer of 256 units:
  1024×512 + 512×256 + 256×10 = 657,920 weights (and 778 biases)
• Needs lots of data to train
• Want to generalize over transformations of the input
(Figure: the same fully-connected network, x1,1 … x32,32 → h1,1 … h1,512 → h2,1 … h2,256 → zero … nine)
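The counts above can be verified with a short calculation (layer sizes as on this slide):

    layers = [32 * 32, 512, 256, 10]   # input, hidden 1, hidden 2, output

    weights = sum(a * b for a, b in zip(layers, layers[1:]))
    biases = sum(layers[1:])           # one bias per non-input unit

    print(weights, biases)             # 657920 778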

Convolutional Neural Networks
• Convolutional neural networks: a specialized architecture for image recognition
• Introduce two new operations:
  1. Convolutions
  2. Pooling
• Efficient learning via:
  1. Sparse interactions
  2. Parameter sharing
  3. Equivariant representations

Sparse Interactions
(Figure 9.2, Goodfellow 2016: Sparse connectivity, viewed from below. One input unit, x3, is highlighted, along with the output units in s that are affected by it. Top: when s is formed by convolution with a kernel of width 3, the connections are sparse because the kernel is small, and only three outputs are affected by x3. Bottom: when s is formed by matrix multiplication, connectivity is no longer sparse, so all of the outputs are affected by x3.)

Sparse Connectivity
(Figure 9.3, Goodfellow 2016: Sparse connectivity, viewed from above. One output unit, s3, is highlighted, along with the input units in x that affect it; these units are known as the receptive field of s3. Top: when s is formed by convolution with a kernel of width 3, only three inputs affect s3. Bottom: when s is formed by matrix multiplication, connectivity is no longer sparse, so all of the inputs affect s3.)

Parameter Sharing
• Traditional neural nets (matrix multiplication) learn a unique parameter for each connection; no parameters are shared
• Convolutional neural nets constrain multiple parameters to be equal: convolution shares the same parameters across all spatial locations
(Figure 9.5, Goodfellow 2016: Parameter sharing. Black arrows indicate the connections that use a particular parameter in two different models. Top: the black arrows indicate uses of the central element of a 3-element kernel in a convolutional model; due to parameter sharing, this single parameter is used at all input locations. Bottom: the single black arrow indicates the use of the central element of the weight matrix in a fully connected model; this model has no parameter sharing, so the parameter is used only once. Because the kernel size k is usually several orders of magnitude smaller than the input size, convolution is dramatically more efficient than dense matrix multiplication in terms of memory requirements.)

Equivariant Representations
• We want to be able to recognize transformed versions of inputs we have seen before:
  • Translation (moved)
  • Rotation
• Without having been trained on all transformed versions
(Figure 9.9, Goodfellow 2016: Example of learned invariances. A pooling unit that pools over multiple features learned with separate parameters can learn to be invariant to transformations of the input; a large response in any one detector unit produces a large response in the pooling unit.)

Operation: Matrix Product
• The matrix product of matrices A and B is a third matrix C, written C = AB
• For the product to be defined, A must have the same number of columns as B has rows: if A is of shape m×n and B is of shape n×p, then C is of shape m×p
• The product operation is defined by C_ij = Σ_k A_ik B_kj
• We can think of the matrix product C = AB as computing the dot product between row i of A and column j of B
• Note that the standard product of two matrices is not just a matrix containing the products of the individual elements; that operation is the element-wise (Hadamard) product, denoted A ⊙ B. The dot product between two vectors x and y of the same dimensionality is the matrix product x^T y
• Recall that we can represent the activations in a neural network by matrix products:
  h1 = g_h(W^(1) x + b^(1))
  h2 = g_h(W^(2) h1 + b^(2))
  y  = g_y(W^(3) h2 + b^(3))
(Figure: the fully-connected digit classifier from earlier, x1,1 … x32,32 → h1,1 … h1,512 → h2,1 … h2,256 → zero … nine, with the note that inner dimensions must match. Text and images: Goodfellow 2016)
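A sketch of that forward pass written as matrix products (shapes match the network in the figure; the ReLU and softmax activations are illustrative choices, not prescribed by the slide):

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(512, 1024)), np.zeros(512)
    W2, b2 = rng.normal(size=(256, 512)), np.zeros(256)
    W3, b3 = rng.normal(size=(10, 256)), np.zeros(10)

    x = rng.random(1024)              # a flattened 32x32 image
    h1 = relu(W1 @ x + b1)            # h1 = g_h(W^(1) x + b^(1))
    h2 = relu(W2 @ h1 + b2)           # h2 = g_h(W^(2) h1 + b^(2))
    y = softmax(W3 @ h2 + b3)         # y  = g_y(W^(3) h2 + b^(3))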

Operation: 2D Convolution

Input (3×4):
  a b c d
  e f g h
  i j k l

Kernel (2×2):
  w x
  y z

Output (2×3):
  aw + bx + ey + fz    bw + cx + fy + gz    cw + dx + gy + hz
  ew + fx + iy + jz    fw + gx + jy + kz    gw + hx + ky + lz

• Convolution scans a small block of weights (called the kernel) over the elements of the input, taking weighted sums
• Note that the input and output dimensions need not match
• The same weights are used for very many combinations of inputs
(Figure 9.1, Goodfellow 2016: An example of 2-D convolution without kernel flipping. The output is restricted to positions where the kernel lies entirely within the image, called “valid” convolution in some contexts. Boxes with arrows indicate how the upper-left element of the output tensor is formed by applying the kernel to the corresponding upper-left region of the input tensor.)
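A minimal NumPy sketch of this “valid” convolution (like the figure, it does not flip the kernel, so strictly speaking it computes a cross-correlation):

    import numpy as np

    def conv2d_valid(inp, kernel):
        """Apply the kernel at every position where it fits entirely inside inp."""
        ih, iw = inp.shape
        kh, kw = kernel.shape
        out = np.empty((ih - kh + 1, iw - kw + 1))
        for r in range(out.shape[0]):
            for c in range(out.shape[1]):
                out[r, c] = np.sum(inp[r:r + kh, c:c + kw] * kernel)
        return out

    inp = np.arange(12.0).reshape(3, 4)            # plays the role of a, b, ..., l
    kernel = np.array([[1.0, 2.0], [3.0, 4.0]])    # plays the role of w, x, y, z
    print(conv2d_valid(inp, kernel))               # 2x3 output, as in the figure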

Replace Matrix Multiplication by Convolution
• Main idea: Replace matrix multiplications with convolutions
• Sparsity: Inputs are only combined with their neighbours
• Parameter sharing: The same kernel is used for the entire input

Example: Edge Detection
• Input image → two-element kernel [−1, 1] → output image
(Figure 9.6, Goodfellow 2016: Efficiency of edge detection. The image on the right was formed by taking each pixel in the original image and subtracting the value of its neighboring pixel on the left. This shows the strength of all of the vertically oriented edges in the input image, which can be a useful operation for object detection. Both images are 280 pixels tall. The input image is 320 pixels wide while the output image is 319 pixels wide. This transformation can be described by a convolution kernel containing two elements, and requires 319 × 280 × 3 = 267,960 floating point operations (two multiplications and one addition per output pixel) to compute using convolution. To describe the same transformation with a matrix multiplication would take 320 × 280 × 319 × 280, or over eight billion, entries in the matrix, making convolution four billion times more efficient for representing this transformation. The straightforward matrix multiplication algorithm performs over sixteen billion floating point operations, making convolution roughly 60,000 times more efficient computationally. Of course, most of the entries of the matrix would be zero.)
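The same transformation is easy to spot-check in NumPy: subtracting each pixel’s left neighbour is exactly convolution of each row with the two-element kernel [-1, 1] (array sizes as in the figure):

    import numpy as np

    img = np.random.rand(280, 320)        # 280 pixels tall, 320 wide
    edges = img[:, 1:] - img[:, :-1]      # kernel [-1, 1] applied along each row
    print(edges.shape)                    # (280, 319): one pixel narrower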

Efficiency of Convolution
• Input size: 320 by 280
• Kernel size: 2 by 1
• Output size: 319 by 280

                      Dense matrix               Sparse matrix               Convolution
Stored floats         319×280×320×280 > 8e9      2×319×280 = 178,640         2
Float muls or adds    > 16e9                     Same as convolution         319×280×3 = 267,960
                                                 (267,960)
(Goodfellow 2016)
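The table entries follow directly from these sizes; a quick check (the dense count treats every input-output pair as one stored weight):

    in_h, in_w = 280, 320
    out_h, out_w = 280, 319

    dense_floats = (in_h * in_w) * (out_h * out_w)   # > 8e9 stored weights
    sparse_floats = 2 * out_h * out_w                # 178,640 non-zero weights
    conv_floats = 2                                  # just the kernel
    conv_ops = 3 * out_h * out_w                     # 267,960 multiplies and adds

    print(dense_floats, sparse_floats, conv_floats, conv_ops)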

Operation: Pooling
• Pooling summarizes its inputs into a single value, e.g.,
  • max
  • average
• Max-pooling is parameter-free (no bias or edge weights)
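A minimal sketch of non-overlapping 2×2 max pooling in NumPy (the pool size is an illustrative choice):

    import numpy as np

    def max_pool2d(inp, size=2):
        """Replace each non-overlapping size-by-size block with its maximum."""
        h, w = inp.shape
        h, w = h - h % size, w - w % size                 # drop any ragged edge
        blocks = inp[:h, :w].reshape(h // size, size, w // size, size)
        return blocks.max(axis=(1, 3))

    a = np.array([[1., 5., 2., 0.],
                  [3., 2., 8., 1.],
                  [0., 4., 6., 7.],
                  [9., 2., 3., 5.]])
    print(max_pool2d(a))   # [[5. 8.]
                           #  [9. 7.]]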

Example: Max Pooling and Invariance to Translation

Original input:
  Detector stage: … 0.1  1.   0.2  0.1 …
  Pooling stage:  … 1.   1.   1.   0.2 …

Input shifted right by one pixel:
  Detector stage: … 0.3  0.1  1.   0.2 …
  Pooling stage:  … 0.3  1.   1.   1.  …

(Figure 9.8, Goodfellow 2016: Max pooling introduces invariance. A view of the middle of the output of a convolutional layer; the bottom row of each panel shows the outputs of the nonlinearity (the detector stage), and the top row shows the outputs of max pooling. When the input shifts, every detector value changes, but most of the pooled values do not.)

Typical Architecture
Input Image → Convolution → Pooling → Convolution → Pooling → Fully-connected → Outputs
• Often convolution-then-pooling is collectively referred to as a “convolution layer”
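One way to write this pipeline down concretely is with PyTorch (an illustrative sketch only; the channel counts and kernel sizes below are arbitrary choices for a 32×32 greyscale input, not values given in the lecture):

    import torch
    from torch import nn

    model = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, padding=1),    # convolution
        nn.ReLU(),
        nn.MaxPool2d(2),                               # pooling: 32x32 -> 16x16
        nn.Conv2d(16, 32, kernel_size=3, padding=1),   # convolution
        nn.ReLU(),
        nn.MaxPool2d(2),                               # pooling: 16x16 -> 8x8
        nn.Flatten(),
        nn.Linear(32 * 8 * 8, 10),                     # fully-connected -> 10 digits
    )

    x = torch.randn(1, 1, 32, 32)    # a batch of one 32x32 greyscale image
    print(model(x).shape)            # torch.Size([1, 10])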



Summary
• Classifying images with a standard feedforward network requires vast quantities of parameters (and hence data)
• Convolutional networks add pooling and convolution:
  • Sparse connectivity
  • Parameter sharing
  • Translation equivariance
• Fewer parameters means the network is far more efficient to train