Machine learning lecture slides

COMS 4771 Fall 2020

0 / 36

Optimization II: Neural networks

Outline

I Architecture of (layered) feedforward neural networks
I Universal approximation
I Backpropagation
I Practical issues

1 / 36

Parametric featurizations

I So far: data features (x or ϕ(x)) are fixed during training
I Consider a (small) collection of feature transformations ϕ
I Select ϕ via cross-validation – outside of normal training

I “Deep learning” approach:
I Use ϕ with many tunable parameters
I Optimize parameters of ϕ during normal training process

I Neural network: parameterization for function f : Rd → R
I f(x) = ϕ(x)Tw
I Parameters include both w and parameters of ϕ
I Varying parameters of ϕ allows f to be essentially any function!
I Major challenge: optimization

Figure 1: Neural networks as feature maps

2 / 36

Feedforward neural network

I Architecture of a feedforward neural network
I Directed acyclic graph G = (V,E)
I One source node (vertex) per input to the function (x1, . . . , xd)
I One sink node per output of the function
I Internal nodes are called hidden units
I Each edge (u, v) ∈ E has a weight parameter wu,v ∈ R
I Value hv of node v, given the values of its parents
πG(v) = {u ∈ V : (u, v) ∈ E}, is

hv := σv(zv),   zv := ∑_{u∈πG(v)} wu,v · hu.

I σv : R→ R is the activation function (a.k.a. link function)
I E.g., sigmoid function σv(z) = 1/(1 + e−z)

I Inspired by neurons in the brain

3 / 36

Figure 2: Computation DAG of a feedforward neural network

4 / 36
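As a supplement (not from the slides), here is a minimal sketch of the node-value computation above; the tiny DAG, its weights, and the sigmoid activation are placeholder choices.

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))    # activation at every non-input node (assumed: sigmoid)

# A tiny DAG: source nodes x1, x2; hidden units u1, u2; one sink node "out".
# edges[(u, v)] holds the weight w_{u,v}; the values are arbitrary placeholders.
edges = {("x1", "u1"): 0.5, ("x2", "u1"): -1.0,
         ("x1", "u2"): 2.0, ("x2", "u2"): 0.3,
         ("u1", "out"): 1.5, ("u2", "out"): -0.7}
order = ["u1", "u2", "out"]                   # non-input nodes in topological order

def forward(x1, x2):
    h = {"x1": x1, "x2": x2}                  # source nodes take the input values
    for v in order:
        z_v = sum(w * h[u] for (u, vv), w in edges.items() if vv == v)  # z_v = sum over parents of w_{u,v} h_u
        h[v] = sigma(z_v)                     # h_v = sigma_v(z_v)
    return h["out"]

print(forward(1.0, -2.0))
```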

Standard layered architectures

I Standard feedforward architecture arranges nodes into layers
I Initial layer (layer zero): source nodes (input)
I Final layer (layer L): sink nodes (output)
I (Layer counting is confusing; usually don’t count input)

I Edges only go from one layer to the next
I (Non-standard feedforward architectures (e.g., ResNets) break this rule)
I Can write function using matrices of weight parameters

f(x) = σL(WLσL−1(· · ·σ1(W1x) · · · ))
I Layer ℓ has dℓ nodes
I Wℓ ∈ Rdℓ×dℓ−1 are the weight parameters for layer ℓ
I Scalar-valued activation function σℓ : R → R (e.g., sigmoid) is applied coordinate-wise to its input
I Often also include “bias” parameters bℓ ∈ Rdℓ

f(x) = σL(bL +WLσL−1(· · ·σ1(b1 +W1x) · · · ))
I The tunable parameters: θ = (W1, b1, . . . ,WL, bL)

5 / 36

Figure 3: Layered feedforward neural network

6 / 36
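A minimal numpy sketch of the layered formula f(x) = σL(bL + WLσL−1(· · · σ1(b1 + W1x) · · · )), added as a supplement; the layer sizes, random weights, and shared sigmoid activation are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))          # same activation in every layer (assumed: sigmoid)

d = [4, 8, 8, 1]                                    # d_0 (input dim), d_1, d_2, d_3 = d_L (output dim)
Ws = [rng.normal(size=(d[ell], d[ell - 1])) for ell in range(1, len(d))]  # W_ell in R^{d_ell x d_{ell-1}}
bs = [np.zeros(d[ell]) for ell in range(1, len(d))]                       # bias b_ell in R^{d_ell}

def f(x):
    h = x
    for W, b in zip(Ws, bs):
        h = sigma(b + W @ h)                        # h_ell = sigma_ell(b_ell + W_ell h_{ell-1})
    return h

print(f(rng.normal(size=d[0])))
```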

Well-known activation functions

I Heaviside: σ(z) = 1{z≥0}
I Popular in the 1940s; also called step function

I Sigmoid (from logistic regression): σ(z) = 1/(1 + e−z)
I Popular since 1970s

I Hyperbolic tangent: σ(z) = tanh(z)
I Similar to sigmoid, but range is (−1, 1) rather than (0, 1)

I Rectified Linear Unit (ReLU): σ(z) = max{0, z}
I Popular since 2012

I Identity: σ(z) = z
I Popular with luddites

7 / 36
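The same activations written as numpy one-liners, added here as a supplementary reference (not part of the slides).

```python
import numpy as np

heaviside = lambda z: (z >= 0).astype(float)     # step function 1{z >= 0}
sigmoid   = lambda z: 1.0 / (1.0 + np.exp(-z))   # range (0, 1)
tanh      = np.tanh                              # range (-1, 1)
relu      = lambda z: np.maximum(z, 0.0)         # max{0, z}
identity  = lambda z: z

z = np.linspace(-3, 3, 7)
for name, s in [("heaviside", heaviside), ("sigmoid", sigmoid),
                ("tanh", tanh), ("relu", relu), ("identity", identity)]:
    print(name, np.round(s(z), 3))
```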

Power of non-linear activations

I What happens if every activation function is linear/affine?
I Overall function is affine
I An unusual way to parameterize an affine function

I Therefore, use non-linear/non-affine activation functions

8 / 36
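A quick numerical check of this point (a supplement; the sizes and random matrices are arbitrary): with identity activations, stacking two affine layers collapses to a single affine map.

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1, d2 = 5, 7, 3
W1, b1 = rng.normal(size=(d1, d0)), rng.normal(size=d1)
W2, b2 = rng.normal(size=(d2, d1)), rng.normal(size=d2)

x = rng.normal(size=d0)
two_affine_layers = W2 @ (W1 @ x + b1) + b2
single_affine_map = (W2 @ W1) @ x + (W2 @ b1 + b2)   # one weight matrix and one bias, independent of x
print(np.allclose(two_affine_layers, single_affine_map))   # True
```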

Necessity of multiple layers (1)

I Suppose we have only input and output layers, so the function f is
f(x) = σ(b+ wTx)

where b ∈ R and w ∈ Rd (so wT ∈ R1×d)
I If σ is monotone (e.g., Heaviside, sigmoid, hyperbolic tangent, ReLU, identity), then f has the same limitations as a linear/affine classifier

Figure 4: XOR data set

9 / 36

Necessity of multiple layers (2)

I XOR problem
I Let x(1) = (+1,+1), x(2) = (+1,−1), x(3) = (−1,+1), x(4) = (−1,−1).
I y(i) = +1 iff coordinates of x(i) are the same. (XNOR)
I Suppose (w, b) ∈ R2 × R satisfies

−b − wTx(1) < 0,   b + wTx(2) < 0,   b + wTx(3) < 0

(i.e., the linear classifier x ↦ sign(b + wTx) correctly labels x(1), x(2), and x(3)).
I Add up the inequalities: b + wT(x(2) + x(3) − x(1)) < 0.
I But x(2) + x(3) − x(1) = x(4), so b + wTx(4) < 0. In other words, cannot correctly label x(4).

10 / 36
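By contrast, a network with one hidden layer can represent this labeling. Below is a supplementary sketch with hand-picked ReLU weights (one of many possible choices), not taken from the slides.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# f(x) = sign(w2 . relu(W1 x) + b2); the weights below are one hand-picked choice.
W1 = np.array([[ 1.0,  1.0],
               [-1.0, -1.0]])
w2 = np.array([1.0, 1.0])
b2 = -1.0

def f(x):
    return np.sign(w2 @ relu(W1 @ x) + b2)

for x, y in [((+1, +1), +1), ((+1, -1), -1), ((-1, +1), -1), ((-1, -1), +1)]:
    print(x, y, f(np.array(x, dtype=float)))   # predicted label equals y in every case
```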
Neural network approximation theorems

I Theorem (Cybenko, 1989; Hornik, Stinchcombe, & White, 1989): Let σ1 be any continuous non-linear activation function from above. For any continuous function f : Rd → R and any ε > 0, there is a two-layer neural network (with parameters θ = (W1, b1, w2)) s.t.

max_{x∈[0,1]d} |f(x) − w2ᵀσ1(b1 + W1x)| < ε.

I This property of such families of neural networks is called universal approximation.
I Many caveats
I “Width” (number of hidden units) may need to be very large
I Does not tell us how to find the network
I Does not justify deeper networks

11 / 36
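A constructive supplementary illustration (the target function, width, and steepness below are arbitrary choices): a wide two-layer sigmoid network built by hand as a "staircase" already achieves small sup error, with no training involved.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-np.clip(z, -50, 50)))
f = lambda x: np.sin(2 * np.pi * x)            # target continuous function (an arbitrary choice)

m = 500                                        # "width": number of hidden units
c = np.linspace(0.0, 1.0, m)                   # step locations
a = np.diff(np.concatenate(([0.0], f(c))))     # output weights: a_k = f(c_k) - f(c_{k-1})
scale = 50.0 * m                               # steep sigmoids approximate step functions

def g(x):
    # g(x) = sum_k a_k sigmoid(scale (x - c_k)) ~= f(largest c_k <= x); for this f it has
    # the theorem's form w2' sigma1(b1 + W1 x) with W1 = scale * ones and b1 = -scale * c.
    x = np.atleast_1d(x)[:, None]
    return sigmoid(scale * (x - c)) @ a

xs = np.linspace(0.0, 1.0, 2000)
print("sup error on grid:", np.abs(f(xs) - g(xs)).max())
```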

Stone-Weierstrass theorem (polynomial version)

Theorem (Weierstrass, 1885): For any continuous function f : [a, b] → R, and any ε > 0, there exists a polynomial p such that

sup_{x∈[a,b]} |f(x) − p(x)| < ε.

12 / 36
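A quick supplementary illustration with an arbitrary target function: least-squares polynomial fits of increasing degree drive the sup error on a dense grid toward zero, consistent with the theorem.

```python
import numpy as np

f = lambda x: np.exp(np.sin(3 * x))            # an arbitrary continuous target on [a, b]
a, b = 0.0, 1.0
xs = np.linspace(a, b, 2000)

for deg in [2, 5, 10, 15]:
    p = np.polynomial.Chebyshev.fit(xs, f(xs), deg)   # least-squares fit in a well-conditioned basis
    print(deg, np.abs(f(xs) - p(xs)).max())           # sup error on the grid shrinks as deg grows
```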
Stone-Weierstrass theorem (general version)

Theorem (Stone, 1937): Let K ⊂ Rd be any bounded set. Let A be a set of continuous functions on K such that the following hold.
(1) A is an algebra (i.e., A is closed under addition, multiplication, and scalar multiplication).
(2) A does not vanish on K (i.e., for all x ∈ K, there exists h ∈ A such that h(x) ≠ 0).
(3) A separates points in K (i.e., for all distinct x, y ∈ K, there exists h ∈ A such that h(x) ≠ h(y)).
For any continuous function f : K → R, and any ε > 0, there exists h ∈ A such that

sup_{x∈K} |f(x) − h(x)| < ε.

13 / 36

Two-layer neural networks with cosine activation functions

Let K = [0, 1]d, and let

A = { x ↦ ∑_{k=1}^{m} ak cos(xTwk + bk) : m ∈ N, ak, bk ∈ R, wk ∈ Rd for k = 1, . . . , m }.

(Check that A satisfies properties of Stone-Weierstrass theorem.)

14 / 36
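For the parenthetical exercise above, here is a sketch of the check (a supplement, not from the slides), written as a LaTeX fragment.

```latex
% Property (1): A is closed under addition and scalar multiplication by construction;
% for multiplication, the product-to-sum identity keeps products inside A:
\[
\cos(x^\top w_j + b_j)\cos(x^\top w_k + b_k)
  = \frac{1}{2}\cos\bigl(x^\top (w_j + w_k) + (b_j + b_k)\bigr)
  + \frac{1}{2}\cos\bigl(x^\top (w_j - w_k) + (b_j - b_k)\bigr).
\]
% Property (2): w = 0, b = 0 gives the constant function cos(0) = 1, so A does not vanish on K.
% Property (3): for distinct x, y, take w = t(x - y) with t > 0 small and b chosen so that both
% arguments lie in (0, \pi), where cos is injective; then the two function values differ.
```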
Two-layer neural networks with exp activation functions

Let K = [0, 1]d, and let

A = { x ↦ ∑_{k=1}^{m} ak exp(xTwk + bk) : m ∈ N, ak, bk ∈ R, wk ∈ Rd for k = 1, . . . , m }.

(Check that A satisfies properties of Stone-Weierstrass theorem.)

15 / 36

Fitting neural networks to data

I Training data (x1, y1), . . . , (xn, yn) ∈ Rd × R
I Fix architecture: G = (V, E) and activation functions
I ERM: find parameters θ of neural network fθ to minimize empirical risk (possibly with a surrogate loss)
I Regression yi ∈ R:

R̂(θ) = (1/n) ∑_{i=1}^{n} (fθ(xi) − yi)²

I Binary classification yi ∈ {−1, +1}:

R̂(θ) = (1/n) ∑_{i=1}^{n} ln(1 + exp(−yi fθ(xi)))

(Could use other surrogate loss functions . . . )
I Can also add regularization terms, but also common to use algorithmic regularization
I Typically objective is not convex in parameters θ
I Nevertheless, local search (e.g., gradient descent, SGD) often works well!

16 / 36
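As a supplement (not part of the slides), a minimal sketch of SGD on the squared-loss ERM objective for a small two-layer tanh network; the data, width, step size, and iteration count are arbitrary placeholder choices, and the gradients are worked out by hand for this specific architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (placeholder): y is a nonlinear function of a random projection of x.
n, d, m = 200, 5, 32                       # m = number of hidden units (assumed)
X = rng.normal(size=(n, d))
y = np.sin(X @ rng.normal(size=d))

# Two-layer network f_theta(x) = w2 . tanh(W1 x + b1) + b2
W1 = rng.normal(size=(m, d)) / np.sqrt(d)  # 1/sqrt(in-degree) initialization heuristic
b1 = np.zeros(m)
w2 = rng.normal(size=m) / np.sqrt(m)
b2 = 0.0

eta = 0.05                                 # step size (assumed)
for t in range(10000):
    i = rng.integers(n)                    # pick one training example (SGD)
    z = W1 @ X[i] + b1                     # pre-activations
    h = np.tanh(z)                         # hidden unit values
    f = w2 @ h + b2                        # network output
    g = 2.0 * (f - y[i])                   # d(loss)/d(output) for squared loss
    # Chain rule, worked out by hand for this small architecture:
    grad_w2, grad_b2 = g * h, g
    delta = g * w2 * (1.0 - h**2)          # d(loss)/dz via tanh'(z) = 1 - tanh(z)^2
    grad_W1, grad_b1 = np.outer(delta, X[i]), delta
    W1 -= eta * grad_W1; b1 -= eta * grad_b1
    w2 -= eta * grad_w2; b2 -= eta * grad_b2

preds = np.tanh(X @ W1.T + b1) @ w2 + b2
print("training MSE:", np.mean((preds - y)**2))
```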
Backpropagation

I Backpropagation (backprop): Algorithm for computing partial derivatives wrt weights in a feedforward neural network
I Clever organization of partial derivative computations with chain rule
I Use in combination with gradient descent, SGD, etc.
I Consider loss on a single example (x, y), written as J := ℓ(y, fθ(x))
I Goal: compute ∂J/∂wu,v for every edge (u, v) ∈ E
I Initial step of backprop: forward propagation
I Compute zv’s and hv’s for every node v ∈ V
I Running time: linear in size of network
I We’ll see that rest of backprop also just requires time linear in size of network

17 / 36

Derivative of loss with respect to weights

I Let ŷ1, ŷ2, . . . denote the values at the output nodes.
I Then by chain rule,

∂J/∂wu,v = ∑_i (∂J/∂ŷi) · (∂ŷi/∂wu,v).

I ∂J/∂ŷi is just determined by the loss function (e.g., squared loss)
I So just have to focus on ∂ŷi/∂wu,v
I Assume for simplicity there is just a single output, ŷ

18 / 36

Derivative of output with respect to weights

I Chain rule, again:

∂ŷ/∂wu,v = (∂ŷ/∂hv) · (∂hv/∂wu,v)

I First term: trickier; we’ll handle later
I Second term:
I hv = σv(zv)
I zv = wu,v · hu + (terms not involving wu,v)
I Therefore

∂hv/∂wu,v = (∂hv/∂zv) · (∂zv/∂wu,v) = σ′(zv) · hu.

I zv and hu were computed during forward propagation

19 / 36

Figure 5: Derivative of a node’s output with respect to an incoming weight

20 / 36

Derivative of output with respect to hidden units

I Key trick: compute ∂ŷ/∂hv for all vertices in decreasing order of layer number
I If v is not the output node, then by chain rule (yet again),

∂ŷ/∂hv = ∑_{v′:(v,v′)∈E} (∂ŷ/∂hv′) · (∂hv′/∂hv)

I ∂ŷ/∂hv′ was already computed since v′ is in higher layer than v
I hv′ = σv′(zv′)
I zv′ = wv,v′ · hv + (terms not involving hv)
I Therefore

∂hv′/∂hv = (∂hv′/∂zv′) · (∂zv′/∂hv) = σ′(zv′) · wv,v′.

I zv′’s were computed during forward propagation
I wv,v′’s are the values of the weight parameters at which we want to compute the gradient

21 / 36

Figure 6: Derivative of the network output with respect to hidden unit values

22 / 36

Example: chain graph (1)

I Function fθ : R → R
I Architecture
I DAG: 0 −→ 1 −→ 2 −→ · · · −→ L
I Same activation σ in every layer
I Parameters θ = (w0,1, w1,2, . . . , wL−1,L)
I Input is at vertex 0, and output is at vertex L
I Fix input value x ∈ R; what is ∂hL/∂wi−1,i for i = 1, . . . , L?
I Forward propagation:
I h0 := x
I For i = 1, 2, . . . , L:

zi := wi−1,i · hi−1
hi := σ(zi)

23 / 36

Example: chain graph (2)

I Backprop:
I For i = L, L−1, . . . , 1:

∂hL/∂hi := 1 if i = L
∂hL/∂hi := (∂hL/∂hi+1) · σ′(zi+1) · wi,i+1 if i < L

∂hL/∂wi−1,i := (∂hL/∂hi) · σ′(zi) · hi−1

Figure 7: Neural network with a chain computation graph

24 / 36
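A minimal numpy sketch of the chain-graph recursions above, added as a supplement (σ = tanh, the depth, and the weight values are placeholder choices), with a finite-difference check of one derivative.

```python
import numpy as np

sigma, dsigma = np.tanh, lambda z: 1.0 - np.tanh(z)**2   # activation and its derivative (assumed: tanh)

L = 4
w = np.array([0.5, -1.2, 0.8, 2.0])    # w[i] plays the role of w_{i-1,i}, i = 1..L (placeholder values)
x = 0.7

def forward(w, x):
    h, z = [x], [None]                  # h[0] = x; z[0] unused
    for i in range(1, L + 1):
        z.append(w[i - 1] * h[i - 1])   # z_i = w_{i-1,i} * h_{i-1}
        h.append(sigma(z[i]))           # h_i = sigma(z_i)
    return z, h

z, h = forward(w, x)

# Backprop: dh_L/dh_i in decreasing order of i, then dh_L/dw_{i-1,i}.
dhL_dh = [0.0] * (L + 1)
dhL_dh[L] = 1.0
for i in range(L - 1, 0, -1):
    dhL_dh[i] = dhL_dh[i + 1] * dsigma(z[i + 1]) * w[i]      # w[i] here is w_{i,i+1}
grad = [dhL_dh[i] * dsigma(z[i]) * h[i - 1] for i in range(1, L + 1)]

# Numerical check of dh_L / dw_{0,1}:
eps = 1e-6
wp = w.copy(); wp[0] += eps
print(grad[0], (forward(wp, x)[1][L] - h[L]) / eps)   # the two numbers should be close
```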
Practical issues I: Initialization

I Ensure inputs are standardized: every feature has zero mean and unit variance (wrt training data)
I Even better: different features have zero covariance (again, on training data)
I But this can be expensive
I Initialize weights randomly for gradient descent / SGD
I Standard normal random variables (or similar)
I What should variance be?
I Heuristic: ensure hv have similar statistics as inputs
I E.g., using tanh-activation, if v has in-degree k, use variance 1/k for all weights wu,v
I Many initialization schemes for other activations (e.g., ReLU), dealing with bias parameters, . . .

25 / 36

Practical issues II: Architecture choice

I Architecture can be regarded as a “hyperparameter”
I Could use cross-validation to select, but . . .
I Many “good” architectures are known for popular problems (e.g., image classification)
I Unclear what to do for completely new problems
I Optimization-inspired architecture choice
I With wide enough network, can get zero training error
I Use the smallest network that lets you get zero training error
I Then add regularization term to objective (e.g., sum of squares of weights), and optimize the regularized ERM objective
I Entire research communities are trying to figure out good architectures for their problems

26 / 36

Multi-class

I Vector-valued activation: σ : Rdℓ → Rdℓ
I Softmax activation: σ(v)i = exp(vi) / ∑_j exp(vj)
I Common to use this in final layer

27 / 36

Convolutional nets

I Neural networks with convolutional layers
I Useful when inputs have locality structure
I Sequential structure (e.g., speech waveform)
I 2D grid structure (e.g., image)
I . . .
I Weight matrix Wℓ is highly-structured
I Determined by a small filter
I Time to compute Wℓhℓ−1 is typically ≪ dℓ × dℓ−1 (e.g., closer to max{dℓ, dℓ−1})

28 / 36

Convolutions I

I Convolution of two continuous functions: h := f ∗ g

h(x) = ∫_{−∞}^{+∞} f(y) g(x − y) dy

I If f(x) = 0 for x ∉ [−w, +w], then

h(x) = ∫_{−w}^{+w} f(y) g(x − y) dy

I Replaces g(x) with weighted combination of g at nearby points

29 / 36

Convolutions II

I For functions on discrete domain, replace integral with sum

h(i) = ∑_{j=−∞}^{+∞} f(j) g(i − j)

E.g., suppose only f(0), f(1), f(2) are non-zero. Then:

[ h(1) ]   [ f(0)   0     0     0     0   ]
[ h(2) ]   [ f(1)  f(0)   0     0     0   ]   [ g(1) ]
[ h(3) ]   [ f(2)  f(1)  f(0)   0     0   ]   [ g(2) ]
[ h(4) ] = [  0    f(2)  f(1)  f(0)   0   ] · [ g(3) ]
[ h(5) ]   [  0     0    f(2)  f(1)  f(0) ]   [ g(4) ]
[ h(6) ]   [  0     0     0    f(2)  f(1) ]   [ g(5) ]
[ h(7) ]   [  0     0     0     0    f(2) ]

(Here, we pretend g(i) = 0 for i < 1 and i > 5.)

30 / 36
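To connect the matrix picture with the sum, here is a small supplementary check against numpy's convolution (the filter and signal values are arbitrary placeholders).

```python
import numpy as np

f = np.array([2.0, -1.0, 3.0])              # filter values f(0), f(1), f(2)
g = np.array([1.0, 4.0, 0.5, -2.0, 3.0])    # signal values g(1), ..., g(5)

# Banded matrix from the slide: M[i, k] = f(i - k), zero outside the filter's support.
M = np.zeros((len(f) + len(g) - 1, len(g)))
for i in range(M.shape[0]):
    for k in range(M.shape[1]):
        if 0 <= i - k < len(f):
            M[i, k] = f[i - k]

print(M @ g)              # h(1), ..., h(7)
print(np.convolve(f, g))  # the same values, computed as a "full" discrete convolution
```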

Figure 8: Convolutional layer

31 / 36

Convolutions III

I Similar for 2D inputs (e.g., images), except now sum over two indices
I g is the input image
I f is the filter
I Lots of variations (e.g., padding, strides, multiple “channels”)

I Use additional layers/activations to down-sample after convolution
I E.g., max-pooling

32 / 36
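A minimal supplementary sketch of a 2-D convolution (no padding, stride 1, single channel assumed) followed by non-overlapping max-pooling; the helper names conv2d and max_pool are just illustrative.

```python
import numpy as np

def conv2d(image, filt):
    # "Valid" 2-D convolution (no padding, stride 1); the filter is flipped to match the
    # definition above (many deep learning libraries skip the flip and call it convolution anyway).
    filt = filt[::-1, ::-1]
    H, W = image.shape
    h, w = filt.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+h, j:j+w] * filt)
    return out

def max_pool(x, k=2):
    # Non-overlapping k x k max-pooling (down-sampling); trims edges that don't divide by k.
    H, W = x.shape
    return x[:H - H % k, :W - W % k].reshape(H // k, k, W // k, k).max(axis=(1, 3))

img = np.arange(36.0).reshape(6, 6)                 # placeholder "image"
k = np.array([[1.0, 0.0], [0.0, -1.0]])             # placeholder filter
print(max_pool(conv2d(img, k)))
```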

Figure 9: 2D convolution

33 / 36

Figure 10: 2D convolution

34 / 36

Figure 11: 2D convolution

35 / 36

Figure 12: 2D convolution

36 / 36
