
Machine learning lecture slides
COMS 4771 Fall 2020

Optimization II: Neural networks

Outline
- Architecture of (layered) feedforward neural networks
- Universal approximation
- Backpropagation
- Practical issues

Parametric featurizations
- So far: data features (x or φ(x)) are fixed during training
  - Consider a (small) collection of feature transformations φ
  - Select φ via cross-validation (outside of normal training)
- "Deep learning" approach:
  - Use φ with many tunable parameters
  - Optimize parameters of φ during normal training process
- Neural network: parameterization for function f : R^d → R
  - f(x) = φ(x)^T w
  - Parameters include both w and parameters of φ
  - Varying parameters of φ allows f to be essentially any function!
- Major challenge: optimization

Figure 1: Neural networks as feature maps

Feedforward neural network
- Architecture of a feedforward neural network:
  - Directed acyclic graph G = (V, E)
  - One source node (vertex) per input to the function (x_1, . . . , x_d)
  - One sink node per output of the function
  - Internal nodes are called hidden units
- Each edge (u, v) ∈ E has a weight parameter w_{u,v} ∈ R
- Value h_v of node v, given values of its parents π_G(v) = {u ∈ V : (u, v) ∈ E}, is
  h_v := σ_v(z_v),   z_v := Σ_{u ∈ π_G(v)} w_{u,v} · h_u.
- σ_v : R → R is the activation function (a.k.a. link function)
  - E.g., sigmoid function σ_v(z) = 1/(1 + e^{−z})
  - Inspired by neurons in the brain
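For concreteness, a minimal sketch of this node-by-node computation, assuming sigmoid activations at every node; the tiny graph, node names, and weights below are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# parents[v] maps node v to {u: w_{u,v}}; keys are listed in topological order,
# so every parent value is computed before it is needed.
parents = {
    "h1": {"x1": 0.5, "x2": -1.0},
    "h2": {"x1": 2.0, "x2": 0.3},
    "y":  {"h1": 1.0, "h2": -2.0},
}

def forward(inputs, parents, sigma=sigmoid):
    h = dict(inputs)                                           # source nodes
    for v, weights in parents.items():
        z_v = sum(w_uv * h[u] for u, w_uv in weights.items())  # z_v
        h[v] = sigma(z_v)                                      # h_v := sigma_v(z_v)
    return h

print(forward({"x1": 1.0, "x2": -1.0}, parents))
```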

Figure 2: Computation DAG of a feedforward neural network (an edge (u, v) carries weight w_{u,v})

Standard layered architectures
- Standard feedforward architecture arranges nodes into layers
  - Initial layer (layer zero): source nodes (input)
  - Final layer (layer L): sink nodes (output)
  - (Layer counting is confusing; usually don't count the input layer)
- Edges only go from one layer to the next
  - (Non-standard feedforward architectures (e.g., ResNets) break this rule)
- Can write the function using matrices of weight parameters:
  f(x) = σ_L(W_L σ_{L−1}(· · · σ_1(W_1 x) · · · ))
  - Layer l has d_l nodes
  - W_l ∈ R^{d_l × d_{l−1}} are the weight parameters for layer l
  - Scalar-valued activation function σ_l : R → R (e.g., sigmoid) is applied coordinate-wise to its input
- Often also include "bias" parameters b_l ∈ R^{d_l}:
  f(x) = σ_L(b_L + W_L σ_{L−1}(· · · σ_1(b_1 + W_1 x) · · · ))
- The tunable parameters: θ = (W_1, b_1, . . . , W_L, b_L)
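A minimal sketch of evaluating this layered form, assuming sigmoid activations at every layer; the layer sizes and random weights are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, Ws, bs, sigmas):
    h = x
    for W, b, sigma in zip(Ws, bs, sigmas):
        h = sigma(b + W @ h)          # h_l = sigma_l(b_l + W_l h_{l-1})
    return h

d0, d1, d2 = 3, 4, 1                  # d_0 (input), d_1 (hidden), d_2 (output)
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(d1, d0)), rng.normal(size=(d2, d1))]
bs = [np.zeros(d1), np.zeros(d2)]
print(feedforward(rng.normal(size=d0), Ws, bs, [sigmoid, sigmoid]))
```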

Figure 3: Layered feedforward neural network (input → hidden units → output, with weight matrices W_1, W_2, W_3)

Well-known activation functions
- Heaviside: σ(z) = 1{z ≥ 0}
  - Popular in the 1940s; also called the step function
- Sigmoid (from logistic regression): σ(z) = 1/(1 + e^{−z})
  - Popular since the 1970s
- Hyperbolic tangent: σ(z) = tanh(z)
  - Similar to sigmoid, but range is (−1, 1) rather than (0, 1)
- Rectified Linear Unit (ReLU): σ(z) = max{0, z}
  - Popular since 2012
- Identity: σ(z) = z
  - Popular with luddites
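Written out in code (standard formulas, nothing course-specific):

```python
import numpy as np

heaviside = lambda z: (z >= 0).astype(float)     # step function, values in {0, 1}
sigmoid   = lambda z: 1.0 / (1.0 + np.exp(-z))   # range (0, 1)
tanh      = np.tanh                              # range (-1, 1)
relu      = lambda z: np.maximum(0.0, z)         # max{0, z}
identity  = lambda z: z

z = np.linspace(-3, 3, 7)
for name, sigma in [("heaviside", heaviside), ("sigmoid", sigmoid),
                    ("tanh", tanh), ("relu", relu), ("identity", identity)]:
    print(name, np.round(sigma(z), 2))
```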

Power of non-linear activations
- What happens if every activation function is linear/affine?
  - The overall function is affine
  - An unusual way to parameterize an affine function
- Therefore, use non-linear/non-affine activation functions
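A quick numeric check of this claim (sizes and values are arbitrary): two stacked affine layers with identity activations collapse to a single affine map.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2
collapsed  = (W2 @ W1) @ x + (W2 @ b1 + b2)   # one affine map
print(np.allclose(two_layers, collapsed))     # True
```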

Necessity of multiple layers (1)
- Suppose we only have input and output layers, so the function f is
  f(x) = σ(b + w^T x)
  where b ∈ R and w ∈ R^d (so w^T ∈ R^{1×d})
- If σ is monotone (e.g., Heaviside, sigmoid, hyperbolic tangent, ReLU, identity), then f has the same limitations as a linear/affine classifier

Figure 4: XOR data set

Necessity of multiple layers (2)
- XOR problem:
  - Let x^(1) = (+1, +1), x^(2) = (+1, −1), x^(3) = (−1, +1), x^(4) = (−1, −1).
  - y^(i) = +1 iff the coordinates of x^(i) are the same. (XNOR)
- Suppose (w, b) ∈ R^2 × R satisfies
  −b − w^T x^(1) < 0,   b + w^T x^(2) < 0,   b + w^T x^(3) < 0.
- Add up the inequalities: b + w^T(x^(2) + x^(3) − x^(1)) < 0.
- But x^(2) + x^(3) − x^(1) = x^(4), so b + w^T x^(4) < 0. In other words, we cannot correctly label x^(4).

Neural network approximation theorems
- Theorem (Cybenko, 1989; Hornik, Stinchcombe, & White, 1989): Let σ_1 be any continuous non-linear activation function from above. For any continuous function f : R^d → R and any ε > 0, there is a two-layer neural network (with parameters θ = (W_1, b_1, w_2)) s.t.
  max_{x ∈ [0,1]^d} |f(x) − w_2^T σ_1(b_1 + W_1 x)| < ε.
- This property of such families of neural networks is called universal approximation.
- Many caveats:
  - "Width" (number of hidden units) may need to be very large
  - Does not tell us how to find the network
  - Does not justify deeper networks
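A constructive numerical sketch of the flavor of this result (not the construction used in the cited proofs): approximate a continuous function on [0, 1] by a width-m network w_2^T σ_1(b_1 + W_1 x) whose hidden units are steep sigmoids acting as small steps. The target function f, width m, and steepness s below are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))  # clip avoids overflow

f = lambda x: np.sin(2 * np.pi * x)   # target continuous function
m, s = 200, 2000.0                    # width and sigmoid steepness
knots = np.arange(1, m + 1) / m
W1 = np.full(m, s)                    # hidden unit k computes sigmoid(s * (x - k/m))
b1 = -s * knots
w2 = f(knots) - f(knots - 1.0 / m)    # step heights

def net(x):
    return f(0.0) + sigmoid(np.outer(x, W1) + b1) @ w2

xs = np.linspace(0, 1, 2001)
print("max |f - net| on [0,1]:", np.max(np.abs(f(xs) - net(xs))))
# The error shrinks as the width m grows (with the steepness s scaled up too),
# illustrating the "width may need to be very large" caveat.
```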
Stone-Weierstrass theorem (polynomial version)
Theorem (Weierstrass, 1885): For any continuous function f : [a, b] → R and any ε > 0, there exists a polynomial p such that
  sup_{x ∈ [a,b]} |f(x) − p(x)| < ε.

Stone-Weierstrass theorem (general version)
Theorem (Stone, 1937): Let K ⊂ R^d be any bounded set. Let A be a set of continuous functions on K such that the following hold.
  (1) A is an algebra (i.e., A is closed under addition, multiplication, and scalar multiplication).
  (2) A does not vanish on K (i.e., for all x ∈ K, there exists h ∈ A such that h(x) ≠ 0).
  (3) A separates points in K (i.e., for all distinct x, y ∈ K, there exists h ∈ A such that h(x) ≠ h(y)).
For any continuous function f : K → R and any ε > 0, there exists h ∈ A such that
  sup_{x ∈ K} |f(x) − h(x)| < ε.

Two-layer neural networks with cosine activation functions
Let K = [0, 1]^d, and let
  A = { x ↦ Σ_{k=1}^m a_k cos(x^T w_k + b_k) : m ∈ N, a_k, b_k ∈ R, w_k ∈ R^d for k = 1, . . . , m }.
(Check that A satisfies the properties of the Stone-Weierstrass theorem.)

Two-layer neural networks with exp activation functions
Let K = [0, 1]^d, and let
  A = { x ↦ Σ_{k=1}^m a_k exp(x^T w_k + b_k) : m ∈ N, a_k, b_k ∈ R, w_k ∈ R^d for k = 1, . . . , m }.
(Check that A satisfies the properties of the Stone-Weierstrass theorem.)
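For both families, the main property to verify is closure under multiplication; one way to check it, sketched for single terms (products of sums then follow by distributivity):

  cos(x^T w_1 + b_1) · cos(x^T w_2 + b_2)
      = ½ cos(x^T (w_1 + w_2) + (b_1 + b_2)) + ½ cos(x^T (w_1 − w_2) + (b_1 − b_2)),
  exp(x^T w_1 + b_1) · exp(x^T w_2 + b_2)
      = exp(x^T (w_1 + w_2) + (b_1 + b_2)).

Non-vanishing holds since the constant function 1 (take w = 0, b = 0) lies in both families and exp never equals zero; separation of points can be checked along similar lines (pick w with x^T w ≠ y^T w, scaled so the two arguments land where the activation is injective).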
Fitting neural networks to data
- Training data (x_1, y_1), . . . , (x_n, y_n) ∈ R^d × R
- Fix an architecture: G = (V, E) and activation functions
- ERM: find parameters θ of the neural network f_θ to minimize empirical risk (possibly with a surrogate loss)
  - Regression (y_i ∈ R):
    R̂(θ) = (1/n) Σ_{i=1}^n (f_θ(x_i) − y_i)^2
  - Binary classification (y_i ∈ {−1, +1}):
    R̂(θ) = (1/n) Σ_{i=1}^n ln(1 + exp(−y_i f_θ(x_i)))
    (Could use other surrogate loss functions . . . )
- Can also add regularization terms, but it is also common to rely on algorithmic regularization
- Typically the objective is not convex in the parameters θ
- Nevertheless, local search (e.g., gradient descent, SGD) often works well!

Backpropagation
- Backpropagation (backprop): algorithm for computing partial derivatives wrt the weights in a feedforward neural network
  - Clever organization of the partial derivative computations with the chain rule
  - Used in combination with gradient descent, SGD, etc.
- Consider the loss on a single example (x, y), written as J := ℓ(y, f_θ(x))
- Goal: compute ∂J/∂w_{u,v} for every edge (u, v) ∈ E
- Initial step of backprop: forward propagation
  - Compute the z_v's and h_v's for every node v ∈ V
  - Running time: linear in the size of the network
- We'll see that the rest of backprop also just requires time linear in the size of the network

Derivative of loss with respect to weights
- Let ŷ_1, ŷ_2, . . . denote the values at the output nodes.
- Then by the chain rule,
  ∂J/∂w_{u,v} = Σ_i (∂J/∂ŷ_i) · (∂ŷ_i/∂w_{u,v}).
- ∂J/∂ŷ_i is just determined by the loss function (e.g., squared loss)
- So we just have to focus on ∂ŷ_i/∂w_{u,v}
- Assume for simplicity there is just a single output, ŷ

Derivative of output with respect to weights
- Chain rule, again:
  ∂ŷ/∂w_{u,v} = (∂ŷ/∂h_v) · (∂h_v/∂w_{u,v})
- First term: trickier; we'll handle it later
- Second term:
  - h_v = σ_v(z_v)
  - z_v = w_{u,v} · h_u + (terms not involving w_{u,v})
  - Therefore
    ∂h_v/∂w_{u,v} = (∂h_v/∂z_v) · (∂z_v/∂w_{u,v}) = σ′(z_v) · h_u.
- z_v and h_u were computed during forward propagation

Figure 5: Derivative of a node's output with respect to an incoming weight w_{u,v}

Derivative of output with respect to hidden units
- Key trick: compute ∂ŷ/∂h_v for all vertices in decreasing order of layer number
- If v is not the output node, then by the chain rule (yet again),
  ∂ŷ/∂h_v = Σ_{v′ : (v,v′) ∈ E} (∂ŷ/∂h_{v′}) · (∂h_{v′}/∂h_v)
- ∂ŷ/∂h_{v′} was already computed, since v′ is in a higher layer than v
- h_{v′} = σ_{v′}(z_{v′}) and z_{v′} = w_{v,v′} · h_v + (terms not involving h_v)
- Therefore
  ∂h_{v′}/∂h_v = (∂h_{v′}/∂z_{v′}) · (∂z_{v′}/∂h_v) = σ′(z_{v′}) · w_{v,v′}.
- The z_{v′}'s were computed during forward propagation
- The w_{v,v′}'s are the values of the weight parameters at which we want to compute the gradient

Figure 6: Derivative of the network output with respect to hidden unit values (children v′_1, . . . , v′_k of v, reached via weights w_{v,v′_1}, . . . , w_{v,v′_k})

Example: chain graph (1)
- Function f_θ : R → R
- Architecture:
  - DAG: 0 → 1 → 2 → · · · → L
  - Same activation σ in every layer
  - Parameters θ = (w_{0,1}, w_{1,2}, . . . , w_{L−1,L})
  - Input is at vertex 0, and output is at vertex L
- Fix an input value x ∈ R; what is ∂h_L/∂w_{i−1,i} for i = 1, . . . , L?
- Forward propagation:
  - h_0 := x
  - For i = 1, 2, . . . , L: z_i := w_{i−1,i} h_{i−1}, h_i := σ(z_i)

Example: chain graph (2)
- Backprop: for i = L, L−1, . . . , 1:
  ∂h_L/∂h_i := 1 if i = L, and ∂h_L/∂h_i := (∂h_L/∂h_{i+1}) · σ′(z_{i+1}) · w_{i,i+1} if i < L;
  ∂h_L/∂w_{i−1,i} := (∂h_L/∂h_i) · σ′(z_i) · h_{i−1}
  (A code sketch of this recursion follows below.)

Figure 7: Neural network with a chain computation graph (input at vertex 0, output at vertex L, weights w_{0,1}, w_{1,2}, . . . , w_{L−1,L})
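A minimal code sketch of the chain-graph recursion above, taking σ = tanh (so σ′(z) = 1 − tanh(z)²) and made-up weights; the finite-difference comparison at the end is only a sanity check, not part of backprop.

```python
import numpy as np

def forward(ws, x):
    z, h = [None], [x]                 # h[0] := x; z[i], h[i] for i = 1..L
    for w in ws:
        z.append(w * h[-1])            # z_i := w_{i-1,i} * h_{i-1}
        h.append(np.tanh(z[-1]))       # h_i := sigma(z_i)
    return z, h

def backprop(ws, x):
    z, h = forward(ws, x)
    L = len(ws)
    dh = [0.0] * (L + 1)               # dh[i] = d h_L / d h_i
    dw = [0.0] * L                     # dw[i-1] = d h_L / d w_{i-1,i}
    dh[L] = 1.0
    for i in range(L, 0, -1):
        sp = 1.0 - np.tanh(z[i]) ** 2  # sigma'(z_i) for sigma = tanh
        dw[i - 1] = dh[i] * sp * h[i - 1]
        dh[i - 1] = dh[i] * sp * ws[i - 1]
    return dw

ws, x, eps = [0.5, -1.2, 0.8], 0.3, 1e-6
grads = backprop(ws, x)
for i in range(len(ws)):               # compare to central finite differences
    wp, wm = list(ws), list(ws)
    wp[i] += eps; wm[i] -= eps
    num = (forward(wp, x)[1][-1] - forward(wm, x)[1][-1]) / (2 * eps)
    print(f"dh_L/dw_{i},{i+1}: backprop={grads[i]:.6f}  numeric={num:.6f}")
```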
Practical issues I: Initialization
- Ensure inputs are standardized: every feature has zero mean and unit variance (wrt training data)
  - Even better: different features have zero covariance (again, on training data)
  - But this can be expensive
- Initialize weights randomly for gradient descent / SGD
  - Standard normal random variables (or similar)
  - What should the variance be?
  - Heuristic: ensure the h_v have similar statistics as the inputs
  - E.g., using tanh activation, if v has in-degree k, use variance 1/k for all weights w_{u,v}
- Many initialization schemes for other activations (e.g., ReLU), dealing with bias parameters, . . .
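A small sketch of the 1/k heuristic above, with arbitrary layer sizes: each incoming weight is drawn from N(0, 1/k) for a node of in-degree k, and the printed pre-activation scales neither blow up nor collapse toward zero across layers.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [100, 64, 32, 1]                       # d_0, d_1, ..., d_L (illustrative)
Ws = [rng.normal(0.0, 1.0 / np.sqrt(k), size=(d_out, k))
      for k, d_out in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(d_out) for d_out in sizes[1:]]  # biases commonly start at zero

h = rng.normal(size=(sizes[0], 1000))          # 1000 standardized inputs
for W, b in zip(Ws, bs):
    z = W @ h + b[:, None]
    print("pre-activation std:", round(float(z.std()), 2))
    h = np.tanh(z)
```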
Practical issues II: Architecture choice
- Architecture can be regarded as a "hyperparameter"
  - Could use cross-validation to select, but . . .
- Many "good" architectures are known for popular problems (e.g., image classification)
  - Unclear what to do for completely new problems
- Optimization-inspired architecture choice:
  - With a wide enough network, can get zero training error
  - Use the smallest network that lets you get zero training error
  - Then add a regularization term to the objective (e.g., sum of squares of weights), and optimize the regularized ERM objective
- Entire research communities are trying to figure out good architectures for their problems

Multi-class
- Vector-valued activation: σ : R^{d_l} → R^{d_l}
- Softmax activation: σ(v)_i = exp(v_i) / Σ_j exp(v_j)
  - Common to use this in the final layer

Convolutional nets
- Neural networks with convolutional layers
- Useful when inputs have locality structure
  - Sequential structure (e.g., speech waveform)
  - 2D grid structure (e.g., image)
  - . . .
- Weight matrix W_l is highly structured
  - Determined by a small filter
  - Time to compute W_l h_{l−1} is typically ≪ d_l × d_{l−1} (e.g., closer to max{d_l, d_{l−1}})

Convolutions I
- Convolution of two continuous functions: h := f ∗ g,
  h(x) = ∫_{−∞}^{+∞} f(y) g(x − y) dy
- If f(x) = 0 for x ∉ [−w, +w], then
  h(x) = ∫_{−w}^{+w} f(y) g(x − y) dy
- Replaces g(x) with a weighted combination of g at nearby points

Convolutions II
- For functions on a discrete domain, replace the integral with a sum:
  h(i) = Σ_{j=−∞}^{∞} f(j) g(i − j)
- E.g., suppose only f(0), f(1), f(2) are non-zero. Then:

  [h(1)]   [f(0)   0     0     0     0  ]
  [h(2)]   [f(1)  f(0)   0     0     0  ]   [g(1)]
  [h(3)]   [f(2)  f(1)  f(0)   0     0  ]   [g(2)]
  [h(4)] = [ 0    f(2)  f(1)  f(0)   0  ] · [g(3)]
  [h(5)]   [ 0     0    f(2)  f(1)  f(0)]   [g(4)]
  [h(6)]   [ 0     0     0    f(2)  f(1)]   [g(5)]
  [h(7)]   [ 0     0     0     0    f(2)]

  (Here, we pretend g(i) = 0 for i < 1 and i > 5.)
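The matrix view above in code: the convolution is multiplication of g by a banded (Toeplitz) matrix whose columns are shifted copies of the filter f. Indices start at 0 here rather than 1, and the filter/signal values are arbitrary.

```python
import numpy as np

f = np.array([1.0, -2.0, 0.5])           # f(0), f(1), f(2); zero elsewhere
g = np.array([3.0, 1.0, 4.0, 1.0, 5.0])  # five nonzero values of g

n_out = len(f) + len(g) - 1              # h has 7 nonzero entries, as above
F = np.zeros((n_out, len(g)))
for j in range(len(g)):                  # column j is a copy of f shifted down by j
    F[j:j + len(f), j] = f

h_matrix = F @ g
h_direct = np.convolve(f, g)             # h(i) = sum_j f(j) g(i - j)
print(np.allclose(h_matrix, h_direct))   # True
```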

Figure 8: Convolutional layer

Convolutions III
- Similar for 2D inputs (e.g., images), except now sum over two indices
  - g is the input image
  - f is the filter
- Lots of variations (e.g., padding, strides, multiple "channels")
- Use additional layers/activations to down-sample after convolution
  - E.g., max-pooling
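A minimal 2D sketch (single channel, no padding, stride 1), with arbitrary image and filter values. As in most deep learning libraries, the filter is applied without flipping (strictly speaking, a cross-correlation); a 2×2 max-pool shows the down-sampling step.

```python
import numpy as np

def conv2d(image, filt):
    H, W = image.shape
    kh, kw = filt.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * filt)
    return out

def max_pool(x, k=2):
    H, W = x.shape
    x = x[:H - H % k, :W - W % k]        # crop so k-by-k blocks tile evenly
    return x.reshape(H // k, k, W // k, k).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)
filt = np.array([[1.0, 0.0], [0.0, -1.0]])
print(max_pool(conv2d(image, filt)))
```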

Figures 9–12: 2D convolution