Machine learning lecture slides
COMS 4771 Fall 2020
Optimization II: Neural networks
Outline
I Architecture of (layered) feedforward neural networks
I Universal approximation
I Backpropagation
I Practical issues
Parametric featurizations
I So far: data features (x or ϕ(x)) are fixed during training
I Consider a (small) collection of feature transformations ϕ
I Select ϕ via cross-validation – outside of normal training
I “Deep learning” approach:
I Use ϕ with many tunable parameters
I Optimize parameters of ϕ during normal training process
I Neural network: parameterization for function f : Rd → R
I f(x) = ϕ(x)Tw
I Parameters include both w and parameters of ϕ
I Varying parameters of ϕ allows f to be essentially any function!
I Major challenge: optimization
Figure 1: Neural networks as feature maps
Feedforward neural network
I Architecture of a feedforward neural network
I Directed acyclic graph G = (V,E)
I One source node (vertex) per input to the function (x1, . . . , xd)
I One sink node per output of the function
I Internal nodes are called hidden units
I Each edge (u, v) ∈ E has a weight parameter wu,v ∈ R
I Value hv of node v given values of its parents πG(v) = {u ∈ V : (u, v) ∈ E} is
hv := σv(zv),   zv := ∑_{u ∈ πG(v)} wu,v · hu.
I σv : R→ R is the activation function (a.k.a. link function)
I E.g., sigmoid function σv(z) = 1/(1 + e−z)
I Inspired by neurons in the brain
Figure 2: Computation DAG of a feedforward neural network (showing an edge (u, v) with weight wu,v)
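As a concrete sketch of this computation (not from the slides; the node names, the dictionary-of-edges weight representation, and the sigmoid activations are illustrative assumptions), forward propagation just visits the non-input nodes in topological order:

import math

# Illustrative DAG: two inputs (x1, x2), two hidden units (u1, u2), one output (v).
# weights[(u, v)] is the parameter w_{u,v} on edge (u, v).
weights = {
    ("x1", "u1"): 0.5, ("x2", "u1"): -0.3,
    ("x1", "u2"): 0.8, ("x2", "u2"): 0.1,
    ("u1", "v"): 1.0, ("u2", "v"): -2.0,
}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, order=("u1", "u2", "v"), activation=sigmoid):
    """Compute h_v = sigma(z_v), z_v = sum over parents u of w_{u,v} * h_u,
    visiting non-input nodes in topological order."""
    h = dict(x)  # input nodes: h_v is just the input value
    for v in order:
        z_v = sum(w * h[u] for (u, vv), w in weights.items() if vv == v)
        h[v] = activation(z_v)
    return h

print(forward({"x1": 1.0, "x2": -1.0}, weights))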
Standard layered architectures
I Standard feedforward architecture arranges nodes into layers
I Initial layer (layer zero): source nodes (input)
I Final layer (layer L): sink nodes (output)
I (Layer counting is confusing; usually don’t count input)
I Edges only go from one layer to the next
I (Non-standard feedforward architectures (e.g., ResNets) break
this rule)
I Can write function using matrices of weight parameters
f(x) = σL(WLσL−1(· · ·σ1(W1x) · · · ))
I Layer ℓ has dℓ nodes
I Wℓ ∈ Rdℓ×dℓ−1 are the weight parameters for layer ℓ
I Scalar-valued activation function σℓ : R→ R (e.g., sigmoid) is
applied coordinate-wise to input
I Often also include “bias” parameters bℓ ∈ Rdℓ
f(x) = σL(bL +WLσL−1(· · ·σ1(b1 +W1x) · · · ))
I The tunable parameters: θ = (W1, b1, . . . ,WL, bL)
Figure 3: Layered feedforward neural network (input layer, hidden units, output layer; weight matrices W1, W2, W3)
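A minimal sketch of the layered forward pass f(x) = σL(bL + WL σL−1(· · ·σ1(b1 + W1x) · · · )), assuming NumPy, sigmoid activations at every layer, and illustrative layer widths:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, params, activations):
    """params = [(W1, b1), ..., (WL, bL)]; activations = [sigma_1, ..., sigma_L]."""
    h = x
    for (W, b), sigma in zip(params, activations):
        h = sigma(W @ h + b)   # z_l = b_l + W_l h_{l-1},  h_l = sigma_l(z_l)
    return h

rng = np.random.default_rng(0)
d0, d1, d2 = 4, 8, 1                      # illustrative layer widths
params = [(rng.normal(size=(d1, d0)), np.zeros(d1)),
          (rng.normal(size=(d2, d1)), np.zeros(d2))]
print(mlp_forward(rng.normal(size=d0), params, [sigmoid, sigmoid]))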
Well-known activation functions
I Heaviside: σ(z) = 1{z≥0}
I Popular in the 1940s; also called step function
I Sigmoid (from logistic regression): σ(z) = 1/(1 + e−z)
I Popular since 1970s
I Hyperbolic tangent: σ(z) = tanh(z)
I Similar to sigmoid, but range is (−1, 1) rather than (0, 1)
I Rectified Linear Unit (ReLU): σ(z) = max{0, z}
I Popular since 2012
I Identity: σ(z) = z
I Popular with luddites
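For reference, the activations above written out in NumPy (a small illustrative sketch, not part of the original slides):

import numpy as np

def heaviside(z): return (z >= 0).astype(float)     # step function, 1{z >= 0}
def sigmoid(z):   return 1.0 / (1.0 + np.exp(-z))   # range (0, 1)
def tanh(z):      return np.tanh(z)                 # range (-1, 1)
def relu(z):      return np.maximum(0.0, z)         # max{0, z}
def identity(z):  return z

z = np.linspace(-3, 3, 7)
for f in (heaviside, sigmoid, tanh, relu, identity):
    print(f.__name__, np.round(f(z), 3))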
Power of non-linear activations
I What happens if every activation function is linear/affine?
I Overall function is affine
I An unusual way to parameterize an affine function
I Therefore, use non-linear/non-affine activation functions
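A quick numerical check of this collapse, assuming NumPy (shapes are illustrative): with identity activations, W2(W1x + b1) + b2 equals (W2W1)x + (W2b1 + b2) for every x.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
x = rng.normal(size=3)

two_layer = W2 @ (W1 @ x + b1) + b2          # "network" with identity activations
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)   # a single affine map
print(np.allclose(two_layer, collapsed))     # True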
Necessity of multiple layers (1)
I Suppose we only have input and output layers, so the function f is
f(x) = σ(b+ wTx)
where b ∈ R and w ∈ Rd (so wT ∈ R1×d)
I If σ is monotone (e.g., Heaviside, sigmoid, hyperbolic tangent,
ReLU, identity), then f has same limitations as a linear/affine
classifier
Figure 4: XOR data set
Necessity of multiple layers (2)
I XOR problem
I Let x(1) = (+1,+1), x(2) = (+1,−1), x(3) = (−1,+1),
x(4) = (−1,−1).
I y(i) = +1 iff coordinates of x(i) are the same. (XNOR)
I Suppose (w, b) ∈ R2 × R correctly labels x(1), x(2), x(3), i.e.,
−b − wTx(1) < 0,
b + wTx(2) < 0,
b + wTx(3) < 0.
I Add up the inequalities:
b+ wT(x(2) + x(3) − x(1)) < 0.
I But x(2) + x(3) − x(1) = x(4), so
b+ wTx(4) < 0.
In other words, cannot correctly label x(4).
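By contrast, one hidden layer suffices for this data set. The sketch below hand-picks weights (my own illustrative choice, not taken from the slides) for a network with two ReLU hidden units that produces the XNOR labels above:

import numpy as np

W1 = np.array([[ 1.0,  1.0],     # h1 = relu(x1 + x2 - 1): fires only on (+1, +1)
               [-1.0, -1.0]])    # h2 = relu(-x1 - x2 - 1): fires only on (-1, -1)
b1 = np.array([-1.0, -1.0])
w2 = np.array([1.0, 1.0])        # output layer
b2 = -0.5

def f(x):
    h = np.maximum(0.0, W1 @ x + b1)   # ReLU hidden layer
    return np.sign(w2 @ h + b2)

for x, y in [((+1, +1), +1), ((+1, -1), -1), ((-1, +1), -1), ((-1, -1), +1)]:
    assert f(np.array(x, dtype=float)) == y
print("two-layer ReLU network labels all four XOR points correctly")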
Neural network approximation theorems
I Theorem (Cybenko, 1989; Hornik, Stinchcombe, & White, 1989): Let σ1 be any
continuous non-linear activation function from above. For any continuous function
f : Rd → R and any ε > 0, there is a two-layer neural network (with parameters
θ = (W1, b1, w2)) s.t.
max_{x ∈ [0,1]d} |f(x) − w2Tσ1(b1 + W1x)| < ε.
I This property of such families of neural networks is called universal approximation.
I Many caveats
I “Width” (number of hidden units) may need to be very large
I Does not tell us how to find the network
I Does not justify deeper networks
Stone-Weierstrass theorem (polynomial version)
Theorem (Weierstrass, 1885): For any continuous function f : [a, b] → R, and any
ε > 0, there exists a polynomial p such that
sup_{x ∈ [a,b]} |f(x) − p(x)| < ε.
Stone-Weierstrass theorem (general version)
Theorem (Stone, 1937): Let K ⊂ Rd be any closed and bounded set. Let A
be a set of continuous functions on K such that the following hold.
(1) A is an algebra (i.e., A is closed under addition, multiplication,
and scalar multiplication).
(2) A does not vanish on K (i.e., for all x ∈ K, there exists
h ∈ A such that h(x) ≠ 0).
(3) A separates points in K (i.e., for all distinct x, y ∈ K, there
exists h ∈ A such that h(x) ≠ h(y)).
For any continuous function f : K → R, and any ε > 0, there exists
h ∈ A such that
sup_{x ∈ K} |f(x) − h(x)| < ε.
Two-layer neural networks with cosine activation functions
Let K = [0, 1]d, and let
A = { x ↦ ∑_{k=1}^{m} ak cos(xTwk + bk) : m ∈ N, ak, bk ∈ R, wk ∈ Rd for k = 1, . . . , m }.
(Check that A satisfies properties of Stone-Weierstrass theorem.)
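One way to verify the algebra property (the only non-obvious part is closure under multiplication) is the product-to-sum identity; a sketch of the check, written in LaTeX:

\cos(x^\top w + b)\,\cos(x^\top w' + b')
  = \tfrac{1}{2}\cos\big(x^\top (w + w') + (b + b')\big)
  + \tfrac{1}{2}\cos\big(x^\top (w - w') + (b - b')\big)

so a product of two members of A is again a finite cosine sum, hence in A. Taking m = 1, w1 = 0, b1 = 0 gives the constant function 1, so A does not vanish; and for distinct x, y ∈ K, the member x′ ↦ cos(t(x − y)Tx′ + π/2) = −sin(t(x − y)Tx′) takes different values at x and y for small enough t > 0, so A separates points.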
Two-layer neural networks with exp activation functions
Let K = [0, 1]d, and let
A = { x ↦ ∑_{k=1}^{m} ak exp(xTwk + bk) : m ∈ N, ak, bk ∈ R, wk ∈ Rd for k = 1, . . . , m }.
(Check that A satisfies properties of Stone-Weierstrass theorem.)
Fitting neural networks to data
I Training data (x1, y1), . . . , (xn, yn) ∈ Rd × R
I Fix architecture: G = (V,E) and activation functions
I ERM: find parameters θ of neural network fθ to minimize
empirical risk (possibly with a surrogate loss)
I Regression (yi ∈ R):
R̂(θ) = (1/n) ∑_{i=1}^{n} (fθ(xi) − yi)²
I Binary classification (yi ∈ {−1,+1}):
R̂(θ) = (1/n) ∑_{i=1}^{n} ln(1 + exp(−yi fθ(xi)))
(Could use other surrogate loss functions . . . )
I Can also add regularization terms, but also common to use
algorithmic regularization
I Typically objective is not convex in parameters θ
I Nevertheless, local search (e.g., gradient descent, SGD) often
works well!
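A minimal sketch of these two empirical risk computations, assuming NumPy, arrays X and y of training examples, and a generic predict function standing in for fθ (all of these names are illustrative):

import numpy as np

def empirical_risk_squared(predict, X, y):
    """Mean squared loss (regression): (1/n) sum_i (f_theta(x_i) - y_i)^2."""
    preds = np.array([predict(x) for x in X])
    return np.mean((preds - y) ** 2)

def empirical_risk_logistic(predict, X, y):
    """Mean logistic loss (binary labels y_i in {-1, +1})."""
    preds = np.array([predict(x) for x in X])
    return np.mean(np.log1p(np.exp(-y * preds)))

# Illustrative usage with a linear "network" f_theta(x) = w . x
rng = np.random.default_rng(0)
X, y = rng.normal(size=(10, 3)), rng.choice([-1.0, 1.0], size=10)
w = rng.normal(size=3)
print(empirical_risk_logistic(lambda x: w @ x, X, y))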
Backpropagation
I Backpropagation (backprop): Algorithm for computing partial
derivatives wrt weights in a feedforward neural network
I Clever organization of partial derivative computations with
chain rule
I Use in combination with gradient descent, SGD, etc.
I Consider loss on a single example (x, y), written as
J := `(y, fθ(x))
I Goal: compute ∂J/∂wu,v for every edge (u, v) ∈ E
I Initial step of backprop: forward propagation
I Compute zv’s and hv’s for every node v ∈ V
I Running time: linear in size of network
I We’ll see that rest of backprop also just requires time linear in
size of network
Derivative of loss with respect to weights
I Let ŷ1, ŷ2, . . . denote the values at the output nodes.
I Then by chain rule,
∂J/∂wu,v = ∑_i (∂J/∂ŷi) · (∂ŷi/∂wu,v).
I ∂J/∂ŷi is just determined by the loss function (e.g., squared loss)
I So just have to focus on ∂ŷi/∂wu,v
I Assume for simplicity there is just a single output, ŷ
Derivative of output with respect to weights
I Chain rule, again:
∂ŷ/∂wu,v = (∂ŷ/∂hv) · (∂hv/∂wu,v)
I First term: trickier; we’ll handle later
I Second term:
I hv = σv(zv)
I zv = wu,v · hu + (terms not involving wu,v)
I Therefore
∂hv/∂wu,v = (∂hv/∂zv) · (∂zv/∂wu,v) = σ′(zv) · hu.
I zv and hu were computed during forward propagation
Figure 5: Derivative of a node’s output with respect to an incoming weight
Derivative of output with respect to hidden units
I Key trick: compute ∂ŷ/∂hv for all vertices in decreasing order of
layer number
I If v is not the output node, then by chain rule (yet again),
∂ŷ/∂hv = ∑_{v′ : (v,v′) ∈ E} (∂ŷ/∂hv′) · (∂hv′/∂hv)
I ∂ŷ/∂hv′ was already computed since v′ is in higher layer than v
I hv′ = σv′(zv′)
I zv′ = wv,v′ · hv + (terms not involving hv)
I Therefore
∂hv′/∂hv = (∂hv′/∂zv′) · (∂zv′/∂hv) = σ′(zv′) · wv,v′.
I zv′ ’s were computed during forward propagation
I wv,v′ ’s are the values of the weight parameters at which we
want to compute the gradient
Figure 6: Derivative of the network output with respect to hidden unit values
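Putting the last few slides together, a minimal sketch of backprop on a DAG with a single output ŷ, reusing the dictionary-of-edges representation from the earlier forward-propagation sketch (node names and sigmoid activations are illustrative assumptions):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, weights, order, output):
    """Forward pass storing z_v, h_v, then reverse pass computing
    d yhat / d h_v and d yhat / d w_{u,v} for every edge (u, v)."""
    h, z = dict(x), {}
    for v in order:                                    # forward propagation
        z[v] = sum(w * h[u] for (u, vv), w in weights.items() if vv == v)
        h[v] = sigmoid(z[v])
    dy_dh = {output: 1.0}
    grads = {}
    for v in reversed(order):                          # decreasing layer order
        if v != output:                                # sum over children v'
            dy_dh[v] = sum(dy_dh[vp] * sigmoid_prime(z[vp]) * w
                           for (vv, vp), w in weights.items() if vv == v)
        for (u, vv), w in weights.items():             # d yhat / d w_{u,v}
            if vv == v:
                grads[(u, v)] = dy_dh[v] * sigmoid_prime(z[v]) * h[u]
    return grads

weights = {("x1", "u1"): 0.5, ("x2", "u1"): -0.3,
           ("x1", "u2"): 0.8, ("x2", "u2"): 0.1,
           ("u1", "v"): 1.0, ("u2", "v"): -2.0}
print(backprop({"x1": 1.0, "x2": -1.0}, weights, ["u1", "u2", "v"], "v"))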
Example: chain graph (1)
I Function fθ : R→ R
I Architecture
I DAG: 0 −→ 1 −→ 2 −→ · · · −→ L
I Same activation σ in every layer
I Parameters θ = (w0,1, w1,2, . . . , wL−1,L)
I Input is at vertex 0, and output is at vertex L
I Fix input value x ∈ R; what is ∂hL/∂wi−1,i for i = 1, . . . , L?
I Forward propagation:
I h0 := x
I For i = 1, 2, . . . , L:
zi := wi−1,ihi−1
hi := σ(zi)
Example: chain graph (2)
I Backprop:
I For i = L,L− 1, . . . , 1:
∂hL/∂hi := 1 if i = L
∂hL/∂hi := (∂hL/∂hi+1) · σ′(zi+1) · wi,i+1 if i < L
∂hL/∂wi−1,i := (∂hL/∂hi) · σ′(zi) · hi−1
Figure 7: Neural network with a chain computation graph (input at node 0, output at node L; weights w0,1, w1,2, . . . , wL−1,L)
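A minimal sketch of this chain-graph recursion, with a finite-difference check of one partial derivative (the sigmoid activation and the specific weights are illustrative assumptions):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chain_forward(x, w):
    """w[i] is the weight w_{i-1,i} on the edge from node i-1 to node i."""
    h, z = [x], [None]
    for i in range(1, len(w) + 1):
        z.append(w[i - 1] * h[i - 1])
        h.append(sigmoid(z[i]))
    return h, z

def chain_backprop(x, w):
    h, z = chain_forward(x, w)
    L = len(w)
    dhL_dh = [0.0] * (L + 1)
    dhL_dh[L] = 1.0
    grads = [0.0] * L                      # grads[i-1] = d h_L / d w_{i-1,i}
    for i in range(L, 0, -1):
        if i < L:
            s = sigmoid(z[i + 1])
            dhL_dh[i] = dhL_dh[i + 1] * s * (1.0 - s) * w[i]
        s = sigmoid(z[i])
        grads[i - 1] = dhL_dh[i] * s * (1.0 - s) * h[i - 1]
    return grads

w, x, eps = [0.3, -1.2, 0.7], 2.0, 1e-6
g = chain_backprop(x, w)
# finite-difference check of d h_L / d w_{0,1}
w_plus = [w[0] + eps] + w[1:]
fd = (chain_forward(x, w_plus)[0][-1] - chain_forward(x, w)[0][-1]) / eps
print(g[0], fd)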
Practical issues I: Initialization
I Ensure inputs are standardized: every feature has zero mean
and unit variance (wrt training data)
I Even better: different features have zero covariance (again, on
training data)
I But this can be expensive
I Initialize weights randomly for gradient descent / SGD
I Standard normal random variables (or similar)
I What should variance be?
I Heuristic: ensure hv have similar statistics as inputs
I E.g., with the tanh activation, if v has in-degree k, use variance
1/k for all weights wu,v
I Many initialization schemes for other activations (e.g., ReLU),
dealing with bias parameters, . . .
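A minimal sketch of the 1/k-variance heuristic for a tanh network (the layer widths and use of NumPy's default generator are illustrative):

import numpy as np

def init_tanh_layers(layer_widths, seed=0):
    """For each layer, draw w_{u,v} ~ N(0, 1/k) where k is the in-degree
    (the width of the previous layer); biases start at zero."""
    rng = np.random.default_rng(seed)
    params = []
    for d_in, d_out in zip(layer_widths[:-1], layer_widths[1:]):
        W = rng.normal(scale=np.sqrt(1.0 / d_in), size=(d_out, d_in))
        params.append((W, np.zeros(d_out)))
    return params

params = init_tanh_layers([64, 128, 128, 1])
print([W.shape for W, b in params])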
Practical issues II: Architecture choice
I Architecture can be regarded as a “hyperparameter”
I Could use cross-validation to select, but . . .
I Many “good” architectures are known for popular problems
(e.g., image classification)
I Unclear what to do for completely new problems
I Optimization-inspired architecture choice
I With wide enough network, can get zero training error
I Use the smallest network that lets you get zero training error
I Then add regularization term to objective (e.g., sum of squares
of weights), and optimize the regularized ERM objective
I Entire research communities are trying to figure out good
architectures for their problems
Multi-class
Vector-valued activation: σ : Rdℓ → Rdℓ
I Softmax activation: σ(v)i = exp(vi) / ∑_j exp(vj)
I Common to use this in final layer
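A minimal NumPy sketch of the softmax activation; subtracting the maximum before exponentiating is a standard numerical-stability trick (my addition, not stated on the slide) and does not change the output:

import numpy as np

def softmax(v):
    """sigma(v)_i = exp(v_i) / sum_j exp(v_j), computed stably."""
    shifted = v - np.max(v)          # does not change the result
    e = np.exp(shifted)
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))   # entries sum to 1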
Convolutional nets
I Neural networks with convolutional layers
I Useful when inputs have locality structure
I Sequential structure (e.g., speech waveform)
I 2D grid structure (e.g., image)
I . . .
I Weight matrix Wℓ is highly structured
I Determined by a small filter
I Time to compute Wℓhℓ−1 is typically ≪ dℓ × dℓ−1 (e.g., closer
to max{dℓ, dℓ−1})
Convolutions I
I Convolution of two continuous functions: h := f ∗ g
h(x) = ∫_{−∞}^{+∞} f(y) g(x − y) dy
I If f(x) = 0 for x ∉ [−w,+w], then
h(x) = ∫_{−w}^{+w} f(y) g(x − y) dy
I Replaces g(x) with weighted combination of g at nearby points
Convolutions II
I For functions on discrete domain, replace integral with sum
h(i) = ∑_{j=−∞}^{∞} f(j) g(i − j)
E.g., suppose only f(0), f(1), f(2) are non-zero. Then:
[ h(1) ]   [ f(0)   0     0     0     0   ]
[ h(2) ]   [ f(1)  f(0)   0     0     0   ]   [ g(1) ]
[ h(3) ]   [ f(2)  f(1)  f(0)   0     0   ]   [ g(2) ]
[ h(4) ] = [  0    f(2)  f(1)  f(0)   0   ] · [ g(3) ]
[ h(5) ]   [  0     0    f(2)  f(1)  f(0) ]   [ g(4) ]
[ h(6) ]   [  0     0     0    f(2)  f(1) ]   [ g(5) ]
[ h(7) ]   [  0     0     0     0    f(2) ]
(Here, we pretend g(i) = 0 for i < 1 and i > 5.)
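A small sketch checking that the structured (Toeplitz) matrix above reproduces the discrete convolution sum; np.convolve computes the same thing (the particular f and g values are illustrative):

import numpy as np

f = np.array([1.0, 2.0, 3.0])              # filter: f(0), f(1), f(2)
g = np.array([4.0, 5.0, 6.0, 7.0, 8.0])    # input: g(1), ..., g(5)

# Build the 7 x 5 matrix whose (i, j) entry is f(i - j) (zero outside {0, 1, 2}).
M = np.zeros((7, 5))
for i in range(7):
    for j in range(5):
        if 0 <= i - j < len(f):
            M[i, j] = f[i - j]

h_matrix = M @ g
h_conv = np.convolve(f, g)                 # full convolution, length 3 + 5 - 1 = 7
print(np.allclose(h_matrix, h_conv))       # True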
Figure 8: Convolutional layer
Convolutions III
I Similar for 2D inputs (e.g., images), except now sum over two
indices
I g is the input image
I f is the filter
I Lots of variations (e.g., padding, strides, multiple “channels”)
I Use additional layers/activations to down-sample after
convolution
I E.g., max-pooling
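A minimal sketch of a convolutional layer in its naive form (single channel, no padding or stride), followed by 2×2 max-pooling; this is illustrative, not the particular layer shown in the figures. As in most deep learning libraries, the “convolution” here is cross-correlation (the filter is not flipped):

import numpy as np

def conv2d(image, filt):
    """Valid-mode 2D cross-correlation of a single-channel image with a small filter."""
    H, W = image.shape
    k, l = filt.shape
    out = np.zeros((H - k + 1, W - l + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + l] * filt)
    return out

def max_pool(x, p=2):
    """Non-overlapping p x p max-pooling (trailing rows/columns are dropped)."""
    H, W = x.shape
    return x[:H - H % p, :W - W % p].reshape(H // p, p, W // p, p).max(axis=(1, 3))

rng = np.random.default_rng(0)
image, filt = rng.normal(size=(8, 8)), rng.normal(size=(3, 3))
print(max_pool(conv2d(image, filt)).shape)   # (3, 3)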
Figure 9: 2D convolution
Figure 10: 2D convolution
Figure 11: 2D convolution
Figure 12: 2D convolution