Lecture 11:
Learning linear classifiers CS 189 (CDSS offering)
2022/02/11
Today’s lecture
• We have seen a few examples now of problems with no analytical solution
• Least absolute deviations, LASSO, and logistic regression
• As mentioned, these problems are typically solved by iterative optimization
• We won’t cover any of the more advanced algorithms — take EECS 127/227
• In this class, we will mainly stick to gradient descent
• For logistic regression (today), and for neural networks (later in the course)
• Today, we will also discuss linear classifiers and the perceptron learning algorithm
Recap from last lecture
• The logistic regression model represents the predicted probability of class 1 as: p_θ(y = 1 | x) = sigmoid(f_θ(x)) = exp{θᵀx} / (exp{θᵀx} + 1)
• And p_θ(y = 0 | x) = 1 / (exp{θᵀx} + 1)
• We worked out that the gradient of the MLE optimization problem, in matrix-vector form, is Xᵀ(y − s_θ) where (s_θ)_i = sigmoid(θᵀx_i)
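A minimal numpy sketch of this gradient computation (an illustration, not from the lecture; X, y, and theta stand for the design matrix, the 0/1 labels, and the parameter vector):

    import numpy as np

    def sigmoid(z):
        # the logistic function from the model above
        return 1.0 / (1.0 + np.exp(-z))

    def log_likelihood_grad(theta, X, y):
        # X: (N, d) design matrix, y: (N,) labels in {0, 1}, theta: (d,) parameters
        s = sigmoid(X @ theta)   # (s_theta)_i = sigmoid(theta^T x_i)
        return X.T @ (y - s)     # X^T (y - s_theta)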
Gradient-based optimization
• For the easy problems we saw previously, we would set the gradient equal to zero and just “read off” the optimal parameters
• This won’t work here — we can’t set Xᵀ(y − s_θ) = 0 and solve for θ
• Instead, we are going to try and “follow the gradient”
• Basically, use Xᵀ(y − s_θ) to update θ iteratively, to push Xᵀ(y − s_θ) to zero
• For maximization problems like MLE, this is called gradient ascent
• For minimizations (i.e., loss functions), this is gradient descent
Gradient descent
What does it mean to “follow the gradient”?
• Consider finding parameters θ that minimize (1/N) Σ_{i=1}^N ℓ(θ; x_i, y_i)
• E.g., the negative log likelihood loss (divided by N) which corresponds to MLE
• Gradient descent moves the parameters in the direction of the negative gradient of the average loss: θ ← θ − α ∇_θ (1/N) Σ_{i=1}^N ℓ(θ; x_i, y_i)
• α is some small number and is called the learning rate or step size
• For well-conditioned problems and small enough α, we will eventually reach some θ for which ∇_θ (1/N) Σ_{i=1}^N ℓ(θ; x_i, y_i) = 0 — this is not an obvious statement!
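Here is a minimal sketch of such a loop (an illustration, assuming grad_fn(theta, X, y) returns the gradient of the average loss at theta):

    import numpy as np

    def gradient_descent(grad_fn, theta0, X, y, lr=0.1, num_steps=1000):
        # repeatedly apply theta <- theta - alpha * gradient of the average loss
        theta = np.array(theta0, dtype=float)
        for _ in range(num_steps):
            theta = theta - lr * grad_fn(theta, X, y)
        return theta

In practice the loop is stopped when the gradient norm is small or after a fixed budget of steps; num_steps above is just a placeholder for such a stopping rule.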
Some intuition for gradient descent in 1D
[Figure: a 1D loss curve annotated with the update rule. Where the derivative is negative, move θ in the positive direction; where the derivative is positive, move θ in the negative direction; at a point with zero derivative, θ stays put.]
Visualizing gradient descent in 2D
Gradient descent for logistic regression
• For logistic regression, start from some θ and repeatedly update: θ ← θ − α Xᵀ(s_θ − y) (a code sketch of this update follows this list)
θ is “moving towards” x_i for which y_i = 1 and “moving away from” x_j for which y_j = 0
• Exercise: why does this make sense?
• The more wrong (s_θ)_i is, the bigger the move towards or away from x_i
• There’s actually a problem — we’re not really moving towards or away, we are repeatedly adding to our θ, so its magnitude may grow unboundedly
• What’s the fix here? Make sure you know (and review last lecture if you’re not sure)
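Putting the pieces together, a sketch of the full-batch gradient descent update for logistic regression (an illustration; the regularization term alluded to above is deliberately left out):

    import numpy as np

    def fit_logistic_regression(X, y, lr=0.1, num_steps=1000):
        # X: (N, d) design matrix, y: (N,) labels in {0, 1}
        theta = np.zeros(X.shape[1])
        for _ in range(num_steps):
            s = 1.0 / (1.0 + np.exp(-(X @ theta)))   # s_theta
            theta = theta - lr * X.T @ (s - y)       # theta <- theta - alpha * X^T (s_theta - y)
        return theta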
Stochastic optimization
Or “stochastic gradient descent (SGD)”, colloquially
• Computing ∇_θ (1/N) Σ_{i=1}^N ℓ(θ; x_i, y_i) every iteration for large N (think 1 million) is a bad idea
• Instead, we pick a batch size (or mini-batch size) B ≪ N, we randomly sample {(x_1, y_1), …, (x_B, y_B)} from the training data, and we compute ∇_θ (1/B) Σ_{i=1}^B ℓ(θ; x_i, y_i)
• For a problem like logistic regression, we can consider B as small as 1!
• I.e., randomly pick a (x_i, y_i), and update θ ← θ − α (sigmoid(θᵀx_i) − y_i) x_i
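A sketch of the mini-batch version for logistic regression (an illustration; the batch size, step count, and seed are arbitrary choices):

    import numpy as np

    def sgd_logistic_regression(X, y, lr=0.1, batch_size=32, num_steps=1000, seed=0):
        # X: (N, d), y: (N,) labels in {0, 1}
        rng = np.random.default_rng(seed)
        N, d = X.shape
        theta = np.zeros(d)
        for _ in range(num_steps):
            idx = rng.choice(N, size=min(batch_size, N), replace=False)   # sample a mini-batch
            s = 1.0 / (1.0 + np.exp(-(X[idx] @ theta)))
            theta = theta - lr * X[idx].T @ (s - y[idx]) / len(idx)       # (1/B) average over the batch
        return theta

With batch_size=1, each step reduces to exactly the single-example update written above.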
Classifiers
• Pedantically, a classifier is a combination of a model and a decision rule
• E.g., the model may be probabilistic, like logistic regression
• And the rule can be: predict the class that has the highest predicted probability
• For logistic regression, this turns out to have a nice simplification
• We predict class 1 if p_θ(y = 1 | x) = sigmoid(θᵀx) > 0.5
• This corresponds to θᵀx > 0! So we can just look at the sign of the model output in order to perform classification
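A minimal sketch of this decision rule in code (an illustration):

    import numpy as np

    def predict_class(theta, X):
        # predict 1 wherever theta^T x > 0, i.e. sigmoid(theta^T x) > 0.5, else 0
        return (X @ theta > 0).astype(int)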
Linear classifiers
• For binary classification, linear classifiers can all be written as making decisions according to the sign of θᵀx
• θᵀx = 0 is referred to as the separating hyperplane or decision boundary
• Logistic regression is a probabilistic view of this problem, which allowed us to use
tools like MLE, MAP estimation, and gradient descent
• Next week, we will see a geometric view of this problem, which leads to a model
and algorithm known as the support-vector machine
• Let’s conclude by looking at another very simple algorithm — the perceptron
The perceptron algorithm
The simplest version
• The perceptron is best understood as repeating the following update to θ (taking labels y_i ∈ {−1, +1} here)
• Pick a (x_i, y_i) — if sign(θᵀx_i) = y_i, move on; otherwise, θ ← θ + y_i x_i (sketched in code at the end of this slide)
• We repeat this until sign(θᵀx_i) = y_i for all i
• For linearly separable datasets, this is guaranteed to eventually happen!
• If the data are not linearly separable, this algorithm will never terminate
• This is one of the earliest examples of a learning algorithm, and it has some interesting similarities to the (stochastic) gradient update for logistic regression
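A minimal sketch of the perceptron loop (an illustration), taking labels in {-1, +1} and adding a max_epochs cap so it terminates even on data that are not linearly separable:

    import numpy as np

    def perceptron(X, y, max_epochs=1000):
        # X: (N, d), y: (N,) labels in {-1, +1}
        theta = np.zeros(X.shape[1])
        for _ in range(max_epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if np.sign(theta @ xi) != yi:   # misclassified (sign 0 also counts as a mistake)
                    theta = theta + yi * xi     # theta <- theta + y_i x_i
                    mistakes += 1
            if mistakes == 0:                   # every point satisfies sign(theta^T x_i) = y_i
                return theta
        return theta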