Lecture 28: Neural networks (2) CS 189 (CDSS offering)
2022/04/06
Today’s lecture
Last time, we saw the basic structure of a neural network
• Successive nonlinear transformations of the input x that hopefully produce features with which the final linear model (layer) can succeed
How do we make the learned features actually good for the linear model?
• Usually, we utilize end-to-end learning: training the whole network on the overall objective (e.g., the negative log likelihood loss)
We will return to our old friend, gradient-based optimization, and see how gradients
can be computed in neural networks via the backpropagation algorithm
Recall: gradient descent
The gradient tells us how the loss value changes for small parameter changes
• We decrease the loss if we move (with a small enough step size $\alpha$) along the direction of the negative gradient (basically, go “opposite the slope” in each dimension)
• Repeatedly performing $\theta \leftarrow \theta - \alpha \nabla_\theta \frac{1}{N} \sum_{i=1}^{N} \ell(\theta; x_i, y_i)$ is gradient descent
• Oftentimes, we will use stochastic gradient updates instead
We saw how to compute the (stochastic) gradient for logistic regression, but what
about for neural networks?
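As a minimal sketch of what one (stochastic) gradient step looks like in code, here is an update that averages per-example gradients over a mini-batch. The loss_grad function and the toy least-squares example are placeholders, not something from the lecture:

```python
import numpy as np

def sgd_step(theta, X_batch, y_batch, loss_grad, alpha=0.1):
    """One stochastic gradient step:
    theta <- theta - alpha * (average gradient over the mini-batch)."""
    grads = [loss_grad(theta, x, y) for x, y in zip(X_batch, y_batch)]
    return theta - alpha * np.mean(grads, axis=0)

# toy usage with the loss 0.5 * (theta @ x - y)^2, whose gradient is (theta @ x - y) * x
lsq_grad = lambda theta, x, y: (theta @ x - y) * x
theta = np.zeros(2)
X, y = np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([1.0, 2.0])
theta = sgd_step(theta, X, y, lsq_grad)
```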
Visualizing losses and optimization
• Optimization is hard to visualize for any more than two parameters
• But neural networks have thousands, millions, billions of parameters…
• For visualization purposes, we will pretend they have two
• Some works have explored interesting ways to visualize loss “landscapes”
Li et al., NIPS '18; Garipov et al., NIPS '18
Visualizing gradient descent in 2D
What gradients do we need?
• We want to update our parameters as $\theta \leftarrow \theta - \alpha \nabla_\theta \frac{1}{N} \sum_{i=1}^{N} \ell(\theta; x_i, y_i)$
• $\theta$ represents all our parameters, e.g., $[W^{(1)}, b^{(1)}, \ldots, W^{(L)}, b^{(L)}, W_{\text{final}}, b_{\text{final}}]$
• So we need $[\nabla_{W^{(1)}} \ell, \nabla_{b^{(1)}} \ell, \ldots, \nabla_{W^{(L)}} \ell, \nabla_{b^{(L)}} \ell, \nabla_{W_{\text{final}}} \ell, \nabla_{b_{\text{final}}} \ell]$
• How do we compute these gradients? Let’s talk about two different approaches:
• numerical (finite differences) vs. analytical (backpropagation)
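For concreteness, here is one hypothetical way to store all the parameters (and, with matching structure, their gradients) for a network with two hidden layers; the names, sizes, and initialization scale are made up, and the later sketches reuse this dictionary layout:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, k = 10, 32, 3  # input dimension, hidden width, number of classes (all made up)

params = {
    "W1": rng.normal(size=(h, d)) * 0.1, "b1": np.zeros(h),
    "W2": rng.normal(size=(h, h)) * 0.1, "b2": np.zeros(h),
    "Wf": rng.normal(size=(k, h)) * 0.1, "bf": np.zeros(k),
}
# backpropagation must produce one gradient array per entry, with matching shapes:
# grads = {"W1": ..., "b1": ..., "W2": ..., "b2": ..., "Wf": ..., "bf": ...}
```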
Finite differences
• The method of finite differences says that, for any sufficiently smooth function f which operates on a vector x, the partial derivative $\frac{\partial f}{\partial x_i}$ is approximated by
$\frac{\partial f}{\partial x_i} \approx \frac{f(x + \delta e_i) - f(x - \delta e_i)}{2\delta}$, where $e_i$ denotes a “one hot” vector
• This is the definition of (partial) derivatives in the limit $\delta \to 0$
• Think about how slow this would be to do for all our network parameters…
Nevertheless, it can be useful as a method for checking gradients
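Here is a sketch of such a central-difference gradient check for a scalar-valued function of a flat parameter vector (the function name and choice of delta are illustrative). Note that it needs two evaluations of f per parameter, which is why it is only practical as a sanity check:

```python
import numpy as np

def finite_difference_grad(f, x, delta=1e-5):
    """Approximate the gradient of f at x with central differences."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e_i = np.zeros_like(x)
        e_i[i] = 1.0  # "one hot" perturbation direction
        grad[i] = (f(x + delta * e_i) - f(x - delta * e_i)) / (2 * delta)
    return grad

# example: check against the known gradient of f(x) = ||x||^2, which is 2x
x = np.array([1.0, -2.0, 3.0])
print(finite_difference_grad(lambda v: np.sum(v ** 2), x))  # approximately [2, -4, 6]
```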
Computing gradients via backpropagation
• The backpropagation algorithm is a much faster and more efficient method for computing gradients for neural network parameters
• It made training large neural networks feasible and practical
• Backpropagation works “backward” through the network, which allows for:
• reusing gradient values that have already been computed
• computing matrix-vector products rather than matrix-matrix products, since the loss is a scalar!
• It’s pretty confusing the first (or second, or third, …) time you see it
Backpropagation: the math
first, let’s do the “forward pass” through our network, from input to prediction
let’s work with two hidden layers, for concreteness
[Diagram: x → nonlinear layer → a^{(1)} → nonlinear layer → a^{(2)} → linear layer → z]
Backpropagation: the math
$z = W_{\text{final}} a^{(2)} + b_{\text{final}}$ represents our logits
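A sketch of this forward pass, reusing the hypothetical parameter dictionary from before and a sigmoid as a stand-in for whatever nonlinearity σ the network uses; the intermediate z and a values are cached because the backward pass will need them:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    """Forward pass: two nonlinear layers, then a final linear layer producing logits."""
    z1 = params["W1"] @ x + params["b1"]
    a1 = sigmoid(z1)
    z2 = params["W2"] @ a1 + params["b2"]
    a2 = sigmoid(z2)
    z = params["Wf"] @ a2 + params["bf"]  # logits
    cache = {"x": x, "z1": z1, "a1": a1, "z2": z2, "a2": a2, "z": z}
    return z, cache
```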
Backpropagation: the math
first, let’s look at $\nabla_{W_{\text{final}}} \ell$ and $\nabla_{b_{\text{final}}} \ell$
remember: $\ell = \log \sum_j \exp z_j - z_{y_i}$, and also $z = W_{\text{final}} a^{(2)} + b_{\text{final}}$
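Working those two gradients out by hand, a sketch that assumes the softmax cross-entropy loss above and the cache from the earlier forward-pass sketch: the gradient of the loss with respect to the logits is softmax(z) minus a one-hot vector for the true class, and the chain rule then gives an outer product for $W_{\text{final}}$ and the logit gradient itself for $b_{\text{final}}$:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

def final_layer_grads(params, cache, y):
    """Gradients of loss = logsumexp(z) - z[y] w.r.t. the final linear layer."""
    dz = softmax(cache["z"])
    dz[y] -= 1.0                     # d loss / d z = softmax(z) - one_hot(y)
    dWf = np.outer(dz, cache["a2"])  # d loss / d Wfinal, shape (k, h)
    dbf = dz                         # d loss / d bfinal, shape (k,)
    return dz, dWf, dbf
```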
Backpropagation: the math
now let’s look at $\nabla_{W^{(2)}} \ell$ and $\nabla_{b^{(2)}} \ell$
remember: $a^{(2)} = \sigma(z^{(2)})$, and also $z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}$
a pattern emerges… do you see it?
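The pattern, sketched one layer back under the same sigmoid assumption: propagate the already-computed logit gradient dz through $W_{\text{final}}$ to get the gradient with respect to $a^{(2)}$, multiply elementwise by the nonlinearity's derivative to get the gradient with respect to $z^{(2)}$, and then the $W^{(2)}$, $b^{(2)}$ gradients have the same outer-product form as before:

```python
import numpy as np

def hidden_layer_grads(params, cache, dz):
    """Gradients w.r.t. W2, b2, reusing dz = d loss / d z from the final layer."""
    da2 = params["Wf"].T @ dz                       # a matrix-vector product, not matrix-matrix
    dz2 = da2 * cache["a2"] * (1.0 - cache["a2"])   # sigmoid'(z2) = a2 * (1 - a2)
    dW2 = np.outer(dz2, cache["a1"])                # d loss / d W2
    db2 = dz2                                       # d loss / d b2
    return dz2, dW2, db2
```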
Backpropagation: the summary
• First, we perform a forward pass and cache all the intermediate z(l), a(l)
• Then, we work our way backwards to compute all the $\nabla_{W^{(l)}} \ell$, $\nabla_{b^{(l)}} \ell$
• Going backwards allows us to reuse gradients that have already been computed
• It also results in matrix-vector product computations, which are far more efficient than matrix-matrix product computations
• After all the gradients have been computed, we are ready to take a gradient step
• And neural network optimization basically just repeats this over and over
• Backpropagation can be tricky and unintuitive
• What can help is trying to work out the math on your own to see the patterns
• Implementing it for HW5 should also help solidify the concept
• But, most importantly: we don’t have to do it ourselves these days!
• Deep learning libraries do it for us via automatic differentiation, and you will feel the benefits of this in HW6 when you don’t have to implement it yourself
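As a taste of what automatic differentiation buys you, here is a minimal sketch (not the HW6 setup) where JAX derives every parameter gradient from just the forward pass and the loss, and a gradient step is one dictionary update:

```python
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    """Same forward pass and loss as before: logsumexp(z) - z[y]."""
    a1 = jax.nn.sigmoid(params["W1"] @ x + params["b1"])
    a2 = jax.nn.sigmoid(params["W2"] @ a1 + params["b2"])
    z = params["Wf"] @ a2 + params["bf"]
    return jax.nn.logsumexp(z) - z[y]

grad_fn = jax.grad(loss_fn)  # backpropagation through loss_fn, for free

def sgd_step(params, x, y, alpha=0.1):
    grads = grad_fn(params, x, y)  # dict with the same structure as params
    return {k: params[k] - alpha * grads[k] for k in params}
```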