
CS 189 (CDSS offering)
Lecture 28: Neural networks (2)
2022/04/06

Today’s lecture


Last time, we saw the basic structure of a neural network
• Successive nonlinear transformations of the input $x$ that (hopefully) produce features with which the final linear model (layer) will be successful
How do we make the learned features actually good for the linear model?
• Usually, we utilize end-to-end learning: training the whole network on the overall objective (e.g., the negative log likelihood loss)
We will return to our old friend, gradient-based optimization, and see how gradients
can be computed in neural networks via the backpropagation algorithm

Recall: gradient descent
The gradient tells us how the loss value changes for small parameter changes
• We decrease the loss if we move (with a small enough step size $\alpha$) along the direction of the negative gradient (basically, go “opposite the slope” in each dimension)
• Repeatedly performing $\theta \leftarrow \theta - \alpha \nabla_\theta \frac{1}{N}\sum_{i=1}^N \ell(\theta; x_i, y_i)$ is gradient descent (see the sketch below)
• Oftentimes, we will use stochastic gradient updates instead
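As a concrete illustration of the update itself, here is a minimal numpy sketch of one gradient step; the parameter and gradient names are made up for the example and are not tied to any particular model.

```python
import numpy as np

# A minimal sketch of one gradient descent step: every parameter moves
# opposite its gradient, scaled by the step size alpha. Names are hypothetical.
def gradient_step(params, grads, alpha=0.1):
    return {name: params[name] - alpha * grads[name] for name in params}

params = {"w": np.array([1.0, -2.0]), "b": np.array([0.0])}
grads  = {"w": np.array([0.5,  0.5]), "b": np.array([-1.0])}  # pretend these came from the loss
params = gradient_step(params, grads)   # w -> [0.95, -2.05], b -> [0.1]
```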
We saw how to compute the (stochastic) gradient for logistic regression, but what
about for neural networks?

Visualizing losses and optimization
• Optimization is hard to visualize for any more than two parameters
• But neural networks have thousands, millions, billions of parameters…
• For visualization purposes, we will pretend they have two
• Some works have explored interesting ways to visualize loss “landscapes”
[Figure: loss landscape visualizations from Li et al., NIPS '18 and Garipov et al., NIPS '18]

Visualizing gradient descent in 2D

What gradients do we need?
• We want to update our parameters as $\theta \leftarrow \theta - \alpha \nabla_\theta \frac{1}{N}\sum_{i=1}^N \ell(\theta; x_i, y_i)$
• $\theta$ represents all our parameters, e.g., $[W^{(1)}, b^{(1)}, \ldots, W^{(L)}, b^{(L)}, W_{\text{final}}, b_{\text{final}}]$
• So we need $[\nabla_{W^{(1)}}\ell, \nabla_{b^{(1)}}\ell, \ldots, \nabla_{W^{(L)}}\ell, \nabla_{b^{(L)}}\ell, \nabla_{W_{\text{final}}}\ell, \nabla_{b_{\text{final}}}\ell]$
• How do we compute these gradients? Let’s talk about two different approaches:
• numerical (finite differences) vs. analytical (backpropagation)

Finite differences
• The method of finite differences says that, for any sufficiently smooth function $f$ which operates on a vector $x$, the partial derivative $\frac{\partial f}{\partial x_i}$ is approximated by
$$\frac{\partial f}{\partial x_i} \approx \frac{f(x + \varepsilon e_i) - f(x - \varepsilon e_i)}{2\varepsilon},$$ where $e_i$ denotes a “one hot” vector
• This is the definition of (partial) derivatives as $\varepsilon \to 0$
• Think about how slow this would be to do for all our network parameters…
Nevertheless, it can be useful as a method for checking gradients
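A gradient check along these lines might look like the following numpy sketch (the test function and values are hypothetical). Note that it needs two evaluations of $f$ per parameter, which is exactly why it is too slow to use as the main way of computing gradients.

```python
import numpy as np

# Central finite differences: approximate df/dx_i for each coordinate of x.
# `f` is any scalar-valued function of a vector; `eps` plays the role of epsilon above.
def finite_difference_grad(f, x, eps=1e-5):
    grad = np.zeros_like(x)
    for i in range(x.size):
        e_i = np.zeros_like(x)
        e_i[i] = 1.0                      # the "one hot" vector from the slide
        grad[i] = (f(x + eps * e_i) - f(x - eps * e_i)) / (2 * eps)
    return grad

# Sanity check on f(x) = ||x||^2, whose true gradient is 2x:
x = np.array([1.0, -3.0, 2.0])
print(finite_difference_grad(lambda v: np.sum(v ** 2), x))  # approx [2., -6., 4.]
```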

Computing gradients via backpropagation
• The backpropagation algorithm is a much faster, more efficient method for computing the gradients of neural network parameters
• It made training large neural networks feasible and practical
• Backpropagation works “backward” through the network, which allows for:
• reusing gradient values that have already been computed
• computing matrix-vector products rather than matrix-matrix products, since the loss is a scalar!
• It’s pretty confusing the first (or second, or third, …) time you see it

Backpropagation: the math
First, let's do the “forward pass” through our network, from input to prediction. Let's work with two hidden layers, for concreteness.

[Diagram: x → nonlinear layer → a(1) → nonlinear layer → a(2) → linear layer → z]

Backpropagation: the math
$z = W_{\text{final}}\, a^{(2)} + b_{\text{final}}$ represents our logits
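To make the forward pass concrete, here is a minimal numpy sketch of the two-hidden-layer network pictured above. The sigmoid nonlinearity, layer sizes, and random weights are assumptions made purely for illustration; the slides don't commit to a particular choice.

```python
import numpy as np

def sigmoid(z):                          # one possible choice of nonlinearity
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, h, K = 4, 3, 5                        # input dim, hidden width, number of classes
x = rng.normal(size=d)

W1, b1 = rng.normal(size=(h, d)), np.zeros(h)
W2, b2 = rng.normal(size=(h, h)), np.zeros(h)
W_final, b_final = rng.normal(size=(K, h)), np.zeros(K)

z1 = W1 @ x + b1;  a1 = sigmoid(z1)      # first nonlinear layer
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)      # second nonlinear layer
z = W_final @ a2 + b_final               # final linear layer: the logits
```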

Backpropagation: the math
first let's look at $\nabla_{W_{\text{final}}}\ell$ and $\nabla_{b_{\text{final}}}\ell$
remember: $\ell = \log \sum_j \exp z_j \;-\; z_{y_i}$, and also $z = W_{\text{final}}\, a^{(2)} + b_{\text{final}}$
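Spelling out the algebra this slide works through (a standard derivation for the softmax cross-entropy loss, written in the slide's notation): differentiating $\ell$ with respect to the logits and then applying the chain rule through $z = W_{\text{final}} a^{(2)} + b_{\text{final}}$ gives

$$\frac{\partial \ell}{\partial z_j} = \frac{\exp z_j}{\sum_k \exp z_k} - \mathbb{1}[j = y_i] = \mathrm{softmax}(z)_j - \mathbb{1}[j = y_i]$$

so, writing $\delta = \mathrm{softmax}(z) - e_{y_i}$,

$$\nabla_{W_{\text{final}}}\ell = \delta\, (a^{(2)})^\top, \qquad \nabla_{b_{\text{final}}}\ell = \delta.$$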

Backpropagation: the math

now let's look at $\nabla_{W^{(2)}}\ell$ and $\nabla_{b^{(2)}}\ell$
remember: $a^{(2)} = \sigma(z^{(2)})$, and also $z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}$
a pattern emerges… do you see it?
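One way to spell out the pattern (again just the chain rule, using the $\delta$ defined above): the error signal is pushed backward through the final linear layer and the nonlinearity, then multiplied by the layer's input activations:

$$\nabla_{z^{(2)}}\ell = \big(W_{\text{final}}^\top\, \delta\big) \odot \sigma'(z^{(2)}) \quad (\text{call this } \delta^{(2)}), \qquad \nabla_{W^{(2)}}\ell = \delta^{(2)}\, (a^{(1)})^\top, \qquad \nabla_{b^{(2)}}\ell = \delta^{(2)}.$$

Each layer's weight gradient is “(error signal at that layer) times (that layer's input activations)$^\top$”, and each layer's error signal comes from the next layer's via a matrix-vector product with $W^\top$ and an elementwise multiply with the nonlinearity's derivative.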

Backpropagation: the summary
• First, we perform a forward pass and cache all the intermediate z(l), a(l)
• Then, we work our way backwards to compute all the $\nabla_{W^{(l)}}\ell$, $\nabla_{b^{(l)}}\ell$
• Going backwards allows us to reuse gradients that have already been computed
• It also results in matrix-vector product computations, which are far more efficient than matrix-matrix product computations
• After all the gradients have been computed, we are ready to take a gradient step
• And neural network optimization basically just repeats this over and over
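Putting the summary together, here is a self-contained numpy sketch of one full iteration (forward pass, backward pass, gradient step) for the two-hidden-layer network from the earlier slides. The sigmoid nonlinearity, layer sizes, and single training example are assumptions made for illustration, not the course's reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, h, K = 4, 3, 5                        # input dim, hidden width, number of classes
x, y = rng.normal(size=d), 2             # one (x_i, y_i) pair; label chosen arbitrarily
alpha = 0.1                              # step size

W1, b1 = 0.1 * rng.normal(size=(h, d)), np.zeros(h)
W2, b2 = 0.1 * rng.normal(size=(h, h)), np.zeros(h)
Wf, bf = 0.1 * rng.normal(size=(K, h)), np.zeros(K)

# forward pass: cache all the intermediate z's and a's
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
z  = Wf @ a2 + bf                        # logits
loss = np.log(np.sum(np.exp(z))) - z[y]

# backward pass: work from the loss back toward the input, reusing error signals
delta = np.exp(z) / np.sum(np.exp(z))    # softmax(z)
delta[y] -= 1.0                          # gradient of the loss w.r.t. the logits
gWf, gbf = np.outer(delta, a2), delta

delta2 = (Wf.T @ delta) * a2 * (1 - a2)  # chain rule through sigmoid: sigma' = a(1-a)
gW2, gb2 = np.outer(delta2, a1), delta2

delta1 = (W2.T @ delta2) * a1 * (1 - a1)
gW1, gb1 = np.outer(delta1, x), delta1

# gradient step
W1 -= alpha * gW1; b1 -= alpha * gb1
W2 -= alpha * gW2; b2 -= alpha * gb2
Wf -= alpha * gWf; bf -= alpha * gbf
```

Note that every backward step is a matrix-vector product (e.g., `Wf.T @ delta`) followed by an outer product with cached activations, which is exactly the reuse and efficiency the summary describes.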

• Backpropagation can be tricky and unintuitive
• What can help is trying to work out the math on your own to see the patterns
• Implementing it for HW5 should also help solidify the concept
• But, most importantly: we don’t have to do it ourselves these days!
• Deep learning libraries do it for us via automatic differentiation, and you will feel the benefits of this in HW6 when you don’t have to implement it yourself
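As a generic illustration of what an automatic differentiation library does for you (this uses JAX purely as an example and is not necessarily the library HW6 is built on): you write only the forward pass and the loss, and the library produces every parameter gradient.

```python
import jax
import jax.numpy as jnp

# Loss of a small feedforward network as a function of its parameters.
# Layer sizes and the sigmoid nonlinearity are hypothetical choices.
def loss_fn(params, x, y):
    a = x
    for W, b in params[:-1]:
        a = jax.nn.sigmoid(W @ a + b)        # hidden layers
    W_final, b_final = params[-1]
    z = W_final @ a + b_final                 # logits
    return jax.nn.logsumexp(z) - z[y]         # negative log likelihood

key = jax.random.PRNGKey(0)
d, h, K = 4, 3, 5
k1, k2, k3 = jax.random.split(key, 3)
params = [
    (0.1 * jax.random.normal(k1, (h, d)), jnp.zeros(h)),
    (0.1 * jax.random.normal(k2, (h, h)), jnp.zeros(h)),
    (0.1 * jax.random.normal(k3, (K, h)), jnp.zeros(K)),
]

x, y = jnp.ones(d), 2
grads = jax.grad(loss_fn)(params, x, y)       # same structure as params: all gradients at once
```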
