Machine learning lecture slides
COMS 4771 Fall 2020
Optimization II: Neural networks
Outline
Architecture of (layered) feedforward neural networks
Universal approximation
Backpropagation
Practical issues
Parametric featurizations
So far: data features (x or φ(x)) are fixed during training
Consider a (small) collection of feature transformations φ
Select φ via cross-validation – outside of normal training
“Deep learning” approach:
Use φ with many tunable parameters
Optimize parameters of φ during normal training process
Neural network: parameterization for function f : R^d → R
f(x) = φ(x)^T w
Parameters include both w and parameters of φ
Varying parameters of φ allows f to be essentially any function!
Major challenge: optimization
Figure 1: Neural networks as feature maps
Feedforward neural network
Architecture of a feedforward neural network
Directed acyclic graph G = (V, E)
One source node (vertex) per input to the function (x1, . . . , xd)
One sink node per output of the function
Internal nodes are called hidden units
Each edge (u, v) ∈ E has a weight parameter w_{u,v} ∈ R
Value h_v of node v, given values of its parents π_G(v) = {u ∈ V : (u, v) ∈ E}, is
h_v := σ_v(z_v),   z_v := Σ_{u ∈ π_G(v)} w_{u,v} · h_u
σ_v : R → R is the activation function (a.k.a. link function)
E.g., sigmoid function σ_v(z) = 1/(1 + e^(−z))
Inspired by neurons in the brain
Figure 2: Computation DAG of a feedforward neural network
Standard layered architectures
Standard feedforward architecture arranges nodes into layers
Initial layer (layer zero): source nodes (input)
Final layer (layer L): sink nodes (output)
(Layer counting is confusing; usually don’t count input)
Edges only go from one layer to the next
(Non-standard feedforward architectures (e.g., ResNets) break this rule)
Can write function using matrices of weight parameters:
f(x) = σ_L(W_L σ_{L−1}(··· σ_1(W_1 x) ···))
Layer l has d_l nodes
W_l ∈ R^{d_l × d_{l−1}} are the weight parameters for layer l
Scalar-valued activation function σ_l : R → R (e.g., sigmoid) is applied coordinate-wise to its input
Often also include “bias” parameters b_l ∈ R^{d_l}:
f(x) = σ_L(b_L + W_L σ_{L−1}(··· σ_1(b_1 + W_1 x) ···))
The tunable parameters: θ = (W_1, b_1, . . . , W_L, b_L)
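To make the layered form concrete, here is a minimal NumPy sketch of the forward computation f(x) = σ_L(b_L + W_L σ_{L−1}(··· σ_1(b_1 + W_1 x) ···)); the layer sizes and the tanh hidden activation below are illustrative assumptions, not part of the slides.

```python
import numpy as np

def forward(x, params, activations):
    """Evaluate f(x) = sigma_L(b_L + W_L sigma_{L-1}(... sigma_1(b_1 + W_1 x) ...)).

    params      : list of (W_l, b_l) pairs for l = 1, ..., L
    activations : list of coordinate-wise activations sigma_1, ..., sigma_L
    """
    h = x
    for (W, b), sigma in zip(params, activations):
        h = sigma(b + W @ h)   # z_l = b_l + W_l h_{l-1}, then h_l = sigma_l(z_l)
    return h

# Example: d = 3 inputs, one hidden layer of width 4, a single output.
rng = np.random.default_rng(0)
params = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(1, 4)), np.zeros(1))]
activations = [np.tanh, lambda z: z]   # tanh hidden layer, identity output
print(forward(rng.normal(size=3), params, activations))
```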
Figure 3: Layered feedforward neural network
Well-known activation functions
Heaviside: σ(z) = 1{z≥0}
Popular in the 1940s; also called step function
Sigmoid (from logistic regression): σ(z) = 1/(1 + e^(−z))
Popular since the 1970s
Hyperbolic tangent: σ(z) = tanh(z)
Similar to sigmoid, but range is (−1, 1) rather than (0, 1)
Rectified Linear Unit (ReLU): σ(z) = max{0, z}
Popular since 2012
Identity: σ(z) = z
Popular with luddites
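A quick NumPy sketch of these activation functions, applied coordinate-wise to a small grid of inputs (the grid itself is just for illustration):

```python
import numpy as np

z = np.linspace(-3, 3, 7)

heaviside = (z >= 0).astype(float)     # step function 1{z >= 0}
sigmoid   = 1.0 / (1.0 + np.exp(-z))   # logistic sigmoid, range (0, 1)
tanh      = np.tanh(z)                 # hyperbolic tangent, range (-1, 1)
relu      = np.maximum(0.0, z)         # rectified linear unit
identity  = z                          # linear "activation"

for name, vals in [("Heaviside", heaviside), ("sigmoid", sigmoid),
                   ("tanh", tanh), ("ReLU", relu), ("identity", identity)]:
    print(name, np.round(vals, 3))
```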
Power of non-linear activations
What happens if every activation function is linear/affine?
Overall function is affine
An unusual way to parameterize an affine function
Therefore, use non-linear/non-affine activation functions
Necessity of multiple layers (1)
Suppose only have input and output layers, so function f is
f(x) = σ(b + w^T x)
where b ∈ R and w ∈ R^d (so w^T ∈ R^(1×d))
If σ is monotone (e.g., Heaviside, sigmoid, hyperbolic tangent, ReLU, identity), then f has same limitations as a linear/affine classifier
Figure 4: XOR data set
Necessity of multiple layers (2)
XOR problem
Let x(1) = (+1, +1), x(2) = (+1, −1), x(3) = (−1, +1), x(4) = (−1, −1).
y(i) = +1 iff coordinates of x(i) are the same. (XNOR)
Suppose (w, b) ∈ R^2 × R satisfies
−b − w^T x(1) < 0,   b + w^T x(2) < 0,   b + w^T x(3) < 0.
Add up the inequalities:
b + w^T (x(2) + x(3) − x(1)) < 0.
But x(2) + x(3) − x(1) = x(4), so b + w^T x(4) < 0.
In other words, cannot correctly label x(4).
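By contrast, one hidden layer suffices for this data set. The weights below are a hand-picked solution (an assumption for illustration, not from the slides): one ReLU hidden unit fires only on (+1, +1), the other only on (−1, −1), and the output unit fires when either is active.

```python
import numpy as np

# XNOR data: label +1 iff both coordinates have the same sign.
X = np.array([[+1, +1], [+1, -1], [-1, +1], [-1, -1]], dtype=float)
y = np.array([+1, -1, -1, +1])

# Hand-picked two-layer network (hypothetical weights, chosen for illustration).
W1 = np.array([[+1.0, +1.0],
               [-1.0, -1.0]])
b1 = np.array([-1.0, -1.0])
w2 = np.array([2.0, 2.0])
b2 = -1.0

H = np.maximum(0.0, X @ W1.T + b1)       # ReLU hidden layer
y_hat = np.sign(H @ w2 + b2)             # thresholded output
print(y_hat, bool((y_hat == y).all()))   # [ 1. -1. -1.  1.] True
```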
Neural network approximation theorems
Theorem (Cybenko, 1989; Hornik, Stinchcombe, & White, 1989): Let σ_1 be any continuous non-linear activation function from above. For any continuous function f : R^d → R and any ε > 0, there is a two-layer neural network (with parameters θ = (W_1, b_1, w_2)) s.t.
max_{x ∈ [0,1]^d} |f(x) − w_2^T σ_1(b_1 + W_1 x)| < ε.
This property of such families of neural networks is called universal approximation.
Many caveats
“Width” (number of hidden units) may need to be very large
Does not tell us how to find the network
Does not justify deeper networks
Stone-Weierstrass theorem (polynomial version)
Theorem (Weierstrass, 1885): For any continuous function
f : [a, b] → R, and any ε > 0, there exists a polynomial p such that
sup_{x ∈ [a,b]} |f(x) − p(x)| < ε.
Stone-Weierstrass theorem (general version)
Theorem (Stone, 1937): Let K ⊂ R^d be any bounded set. Let A be a set of continuous functions on K such that the following hold.
(1) A is an algebra (i.e., A is closed under addition, multiplication, and scalar multiplication).
(2) A does not vanish on K (i.e., for all x ∈ K, there exists h ∈ A such that h(x) ̸= 0).
(3) A separates points in K (i.e., for all distinct x, y ∈ K, there exists h ∈ A such that h(x) ̸= h(y)).
For any continuous function f : K → R, and any ε > 0, there exists h ∈ A such that
sup_{x ∈ K} |f(x) − h(x)| < ε.
Two-layer neural networks with cosine activation functions
Let K = [0,1]^d, and let
A = { x → Σ_{k=1}^m a_k cos(x^T w_k + b_k) : m ∈ N, a_k, b_k ∈ R, w_k ∈ R^d for k = 1, . . . , m }.
(Check that A satisfies properties of Stone-Weierstrass theorem.)
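A small numerical illustration (not the proof): fit a member of A to a continuous target on [0, 1] by least squares over the coefficients a_k, with the frequencies w_k and phases b_k drawn at random. The target function, m, and the sampling distributions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda x: np.exp(x) * np.sin(5 * x)   # a continuous target on K = [0, 1]
x = np.linspace(0.0, 1.0, 200)

m = 50
w = rng.normal(scale=10.0, size=m)        # random frequencies w_k
b = rng.uniform(0.0, 2 * np.pi, size=m)   # random phases b_k
Phi = np.cos(np.outer(x, w) + b)          # feature matrix: Phi[i, k] = cos(x_i w_k + b_k)

a, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)   # least-squares fit of the a_k
print("max |f(x) - h(x)| on grid:", np.max(np.abs(f(x) - Phi @ a)))
```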
Two-layer neural networks with exp activation functions
Let K = [0,1]^d, and let
A = { x → Σ_{k=1}^m a_k exp(x^T w_k + b_k) : m ∈ N, a_k, b_k ∈ R, w_k ∈ R^d for k = 1, . . . , m }.
(Check that A satisfies properties of Stone-Weierstrass theorem.)
Fitting neural networks to data
Training data (x_1, y_1), . . . , (x_n, y_n) ∈ R^d × R
Fix architecture: G = (V, E) and activation functions
ERM: find parameters θ of neural network f_θ to minimize empirical risk (possibly with a surrogate loss)
Regression (y_i ∈ R): R(θ) = (1/n) Σ_{i=1}^n (f_θ(x_i) − y_i)^2
Binary classification (y_i ∈ {−1, +1}): R(θ) = (1/n) Σ_{i=1}^n ln(1 + exp(−y_i f_θ(x_i)))
(Could use other surrogate loss functions . . . )
Can also add regularization terms, but also common to use algorithmic regularization
Typically objective is not convex in parameters θ
Nevertheless, local search (e.g., gradient descent, SGD) often works well!
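A minimal sketch of the two empirical-risk objectives above, treating f_θ as a black box (the linear "network" and the data at the bottom are placeholders used only to exercise the code):

```python
import numpy as np

def squared_loss_risk(f, theta, X, y):
    """R(theta) = (1/n) sum_i (f_theta(x_i) - y_i)^2 for regression."""
    preds = np.array([f(theta, x) for x in X])
    return np.mean((preds - y) ** 2)

def logistic_loss_risk(f, theta, X, y):
    """R(theta) = (1/n) sum_i ln(1 + exp(-y_i f_theta(x_i))) for labels in {-1, +1}."""
    preds = np.array([f(theta, x) for x in X])
    return np.mean(np.log1p(np.exp(-y * preds)))

# Placeholder f_theta(x) = theta^T x; a neural network would be plugged in here instead.
f = lambda theta, x: theta @ x
X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]])
theta = np.array([0.1, -0.2])
print(squared_loss_risk(f, theta, X, np.array([1.0, 0.0, -1.0])))
print(logistic_loss_risk(f, theta, X, np.array([+1, -1, +1])))
```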
Backpropagation
Backpropagation (backprop): Algorithm for computing partial derivatives wrt weights in a feedforward neural network
Clever organization of partial derivative computations with chain rule
Use in combination with gradient descent, SGD, etc.
Consider loss on a single example (x, y), written as
J := ℓ(y, f_θ(x))
Goal: compute ∂J/∂w_{u,v} for every edge (u, v) ∈ E
Initial step of backprop: forward propagation
Compute z_v's and h_v's for every node v ∈ V
Running time: linear in size of network
We’ll see that rest of backprop also just requires time linear in size of network
Derivative of loss with respect to weights
Let ŷ_1, ŷ_2, . . . denote the values at the output nodes. Then by the chain rule,
∂J/∂w_{u,v} = Σ_i (∂J/∂ŷ_i) · (∂ŷ_i/∂w_{u,v}).
∂J/∂ŷ_i is just determined by the loss function (e.g., squared loss)
So just have to focus on ∂ŷ_i/∂w_{u,v}
Assume for simplicity there is just a single output, ŷ
Derivative of output with respect to weights
Chain rule, again:
∂ŷ/∂w_{u,v} = (∂ŷ/∂h_v) · (∂h_v/∂w_{u,v})
First term: trickier; we'll handle later
Second term:
h_v = σ_v(z_v)
z_v = w_{u,v} · h_u + (terms not involving w_{u,v})
Therefore
∂h_v/∂w_{u,v} = (∂h_v/∂z_v) · (∂z_v/∂w_{u,v}) = σ′(z_v) · h_u.
z_v and h_u were computed during forward propagation
Figure 5: Derivative of a node’s output with respect to an incoming weight
Derivative of output with respect to hidden units
Key trick: compute ∂ŷ/∂h_v for all vertices in decreasing order of layer number
If v is not the output node, then by chain rule (yet again),
∂ŷ/∂h_v = Σ_{v′ : (v,v′) ∈ E} (∂ŷ/∂h_{v′}) · (∂h_{v′}/∂h_v)
∂ŷ/∂h_{v′} was already computed since v′ is in higher layer than v
h_{v′} = σ_{v′}(z_{v′})
z_{v′} = w_{v,v′} · h_v + (terms not involving h_v)
Therefore
∂h_{v′}/∂h_v = (∂h_{v′}/∂z_{v′}) · (∂z_{v′}/∂h_v) = σ′(z_{v′}) · w_{v,v′}.
z_{v′}'s were computed during forward propagation
w_{v,v′}'s are the values of the weight parameters at which we want to compute the gradient
Figure 6: Derivative of the network output with respect to hidden unit values
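Putting the pieces together for the layered architecture gives the usual vectorized backward pass. The sketch below assumes a single output, squared loss J = (ŷ − y)^2, tanh hidden activations, and an identity output activation (all assumptions for concreteness); it returns ∂J/∂W_l and ∂J/∂b_l for every layer.

```python
import numpy as np

def forward_backward(x, y, Ws, bs):
    """One forward + backward pass with tanh hidden layers, identity output,
    and squared loss J = (y_hat - y)^2."""
    L = len(Ws)
    hs, zs = [x], []
    for l in range(L):                              # forward propagation
        z = Ws[l] @ hs[-1] + bs[l]
        zs.append(z)
        hs.append(np.tanh(z) if l < L - 1 else z)   # identity activation at the output

    y_hat = hs[-1]
    grad_Ws, grad_bs = [None] * L, [None] * L
    delta = 2.0 * (y_hat - y)                       # dJ/dz_L for the identity output
    for l in reversed(range(L)):                    # decreasing layer order
        grad_Ws[l] = np.outer(delta, hs[l])         # dJ/dW_l = delta h_{l-1}^T
        grad_bs[l] = delta                          # dJ/db_l = delta
        if l > 0:                                   # dJ/dh_{l-1}, then through sigma'(z_{l-1})
            delta = (Ws[l].T @ delta) * (1.0 - np.tanh(zs[l - 1]) ** 2)
    return y_hat, grad_Ws, grad_bs

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
bs = [np.zeros(4), np.zeros(1)]
_, grad_Ws, _ = forward_backward(rng.normal(size=3), np.array([0.5]), Ws, bs)
print([g.shape for g in grad_Ws])   # [(4, 3), (1, 4)], matching the weight shapes
```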
Example: chain graph (1)
Function f_θ : R → R
Architecture:
DAG: 0 → 1 → 2 → ··· → L
Same activation σ in every layer
Parameters θ = (w_{0,1}, w_{1,2}, . . . , w_{L−1,L})
Input is at vertex 0, and output is at vertex L
Fix input value x ∈ R; what is ∂h_L/∂w_{i−1,i} for i = 1, . . . , L?
Forward propagation:
h_0 := x
For i = 1, 2, . . . , L:
z_i := w_{i−1,i} · h_{i−1},   h_i := σ(z_i)
Example: chain graph (2)
Backprop:
For i = L, L−1, . . . , 1:
∂h_L/∂h_i := 1 if i = L;   ∂h_L/∂h_i := (∂h_L/∂h_{i+1}) · σ′(z_{i+1}) · w_{i,i+1} if i < L
∂h_L/∂w_{i−1,i} := (∂h_L/∂h_i) · σ′(z_i) · h_{i−1}
Figure 7: Neural network with a chain computation graph
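A direct transcription of this recursion into code might look as follows; the specific activation (σ = tanh) and weights are placeholders.

```python
import numpy as np

def chain_backprop(x, w, sigma=np.tanh, dsigma=lambda z: 1.0 - np.tanh(z) ** 2):
    """Chain graph 0 -> 1 -> ... -> L: return dh_L / dw_{i-1,i} for i = 1, ..., L."""
    L = len(w)                             # w[i-1] stores w_{i-1,i}
    h, z = [x], [None]
    for i in range(1, L + 1):              # forward propagation
        z.append(w[i - 1] * h[i - 1])      # z_i = w_{i-1,i} h_{i-1}
        h.append(sigma(z[i]))              # h_i = sigma(z_i)

    dh = [0.0] * (L + 1)                   # dh[i] holds dh_L / dh_i
    dw = [0.0] * L                         # dw[i-1] holds dh_L / dw_{i-1,i}
    dh[L] = 1.0
    for i in range(L, 0, -1):              # backprop: i = L, L-1, ..., 1
        if i < L:
            dh[i] = dh[i + 1] * dsigma(z[i + 1]) * w[i]
        dw[i - 1] = dh[i] * dsigma(z[i]) * h[i - 1]
    return dw

print(chain_backprop(x=0.7, w=[0.5, -1.2, 0.8]))
```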
Practical issues I: Initialization
Ensure inputs are standardized: every feature has zero mean and unit variance (wrt training data)
Even better: different features have zero covariance (again, on training data)
But this can be expensive
Initialize weights randomly for gradient descent / SGD
Standard normal random variables (or similar)
What should variance be?
Heuristic: ensure h_v have similar statistics as inputs
E.g., using tanh activation, if v has in-degree k, use variance 1/k for all weights w_{u,v}
Many initialization schemes for other activations (e.g., ReLU), dealing with bias parameters, . . .
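A sketch of input standardization plus the 1/k fan-in heuristic for tanh networks; the layer widths below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def standardize(X):
    """Give every feature zero mean and unit variance (statistics from training data)."""
    mu, sd = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / sd

def init_layer(fan_in, fan_out):
    """Gaussian weights with variance 1/fan_in (tanh heuristic), zero biases."""
    W = rng.normal(scale=np.sqrt(1.0 / fan_in), size=(fan_out, fan_in))
    b = np.zeros(fan_out)
    return W, b

X_train = standardize(rng.normal(loc=5.0, scale=3.0, size=(100, 10)))
params = [init_layer(10, 32), init_layer(32, 32), init_layer(32, 1)]
print(np.round(X_train.mean(axis=0)[:3], 6), [W.shape for W, _ in params])
```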
Practical issues II: Architecture choice
Architecture can be regarded as a “hyperparameter”
Could use cross-validation to select, but . . .
Many “good” architectures are known for popular problems
(e.g., image classification)
Unclear what to do for completely new problems
Optimization-inspired architecture choice
With wide enough network, can get zero training error
Use the smallest network that lets you get zero training error
Then add regularization term to objective (e.g., sum of squares of weights), and optimize the regularized ERM objective
Entire research communities are trying to figure out good architectures for their problems
Multi-class
Vector-valued activation: σ : R^{d_l} → R^{d_l}
Softmax activation: σ(v)_i = exp(v_i) / Σ_j exp(v_j)
Common to use this in final layer
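A standard numerically stable implementation (subtracting the maximum does not change the output, since softmax is invariant to adding a constant to every coordinate):

```python
import numpy as np

def softmax(v):
    """sigma(v)_i = exp(v_i) / sum_j exp(v_j), computed without overflow."""
    shifted = v - np.max(v)   # shift for numerical stability; result is unchanged
    e = np.exp(shifted)
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))   # entries sum to 1
```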
Convolutional nets
Neural networks with convolutional layers
Useful when inputs have locality structure
Sequential structure (e.g., speech waveform)
2D grid structure (e.g., image)
...
Weight matrix W_l is highly structured
Determined by a small filter
Time to compute W_l h_{l−1} is typically ≪ d_l × d_{l−1} (e.g., closer to max{d_l, d_{l−1}})
Convolutions I
Convolution of two continuous functions: h := f ∗ g
h(x) = ∫_{−∞}^{+∞} f(y) g(x − y) dy
If f(x) = 0 for x ∉ [−w, +w], then
h(x) = ∫_{−w}^{+w} f(y) g(x − y) dy
Replaces g(x) with weighted combination of g at nearby points
Convolutions II
For functions on discrete domain, replace integral with sum
h(i) = Σ_{j=−∞}^{∞} f(j) g(i − j)
E.g., suppose only f(0), f(1), f(2) are non-zero. Then:
[h(1)]   [f(0)   0     0     0     0  ]
[h(2)]   [f(1)  f(0)   0     0     0  ]  [g(1)]
[h(3)]   [f(2)  f(1)  f(0)   0     0  ]  [g(2)]
[h(4)] = [ 0    f(2)  f(1)  f(0)   0  ]  [g(3)]
[h(5)]   [ 0     0    f(2)  f(1)  f(0)]  [g(4)]
[h(6)]   [ 0     0     0    f(2)  f(1)]  [g(5)]
[h(7)]   [ 0     0     0     0    f(2)]
(Here, we pretend g(i) = 0 for i < 1 and i > 5.)
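A sketch of the discrete case in NumPy: np.convolve computes exactly this sum (up to the indexing convention), and the explicit banded matrix reproduces the picture above.

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0])            # f(0), f(1), f(2): the filter
g = np.array([4.0, 5.0, 6.0, 7.0, 8.0])  # g(1), ..., g(5): the input

h = np.convolve(f, g)                    # h(i) = sum_j f(j) g(i - j); length 7: h(1), ..., h(7)

# Same result via multiplication by the banded (Toeplitz) matrix from the slide.
M = np.zeros((7, 5))
for i in range(7):
    for j in range(5):
        if 0 <= i - j <= 2:
            M[i, j] = f[i - j]
print(np.allclose(h, M @ g))             # True
```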
Figure 8: Convolutional layer
Convolutions III
Similar for 2D inputs (e.g., images), except now sum over two indices
g is the input image
f is the filter
Lots of variations (e.g., padding, strides, multiple “channels”)
Use additional layers/activations to down-sample after convolution
E.g., max-pooling
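A minimal single-channel sketch of a 2D convolution followed by 2×2 max-pooling (no padding, stride 1); real convolutional layers add channels, padding, and strides.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution of a single-channel image with a small filter."""
    H, W = image.shape
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]            # true convolution flips the filter
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

def max_pool(x, size=2):
    """Non-overlapping size-by-size max-pooling (down-sampling)."""
    H, W = x.shape
    H, W = H - H % size, W - W % size       # drop any ragged border
    blocks = x[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
print(max_pool(conv2d(image, kernel)).shape)   # (3, 3)
```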
Figure 9: 2D convolution
Figure 10: 2D convolution
Figure 11: 2D convolution
Figure 12: 2D convolution