Lecture 10: Logistic regression
CS 189 (CDSS offering)
2022/02/09
Today’s lecture
• Yesterday, we started talking about classification and discriminative vs. generative probabilistic models
• We talked about linear discriminant analysis as a generative example and logistic regression as a discriminative example
• Today, we focus on logistic regression and derive it in detail
• This also starts our discussion of iterative optimization, as we will see that logistic regression has no analytical (set the gradient equal to zero) solution
Recap from last lecture
• First let’s focus on binary classification, where all of the labels yi ∈ {0, 1}
• We have a linear model f_θ(x) = θᵀx, but this outputs an unconstrained number
• So we transform this output into a number between 0 and 1, which represents the predicted probability of class 1: p_θ(y = 1 | x) = sigmoid(f_θ(x)) = exp{θᵀx} / (exp{θᵀx} + 1)
• And p_θ(y = 0 | x) = 1 - p_θ(y = 1 | x) = 1 / (exp{θᵀx} + 1)
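As a quick illustration, here is a minimal NumPy sketch of this binary model; the feature matrix X, parameter vector theta, and helper names below are made up for the example, not from the lecture.

```python
import numpy as np

def sigmoid(z):
    # sigmoid(z) = exp(z) / (exp(z) + 1) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    # p_theta(y = 1 | x) for each row of X, where f_theta(x) = theta^T x
    return sigmoid(X @ theta)

# toy example: 3 points in 2 dimensions
X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]])
theta = np.array([0.4, -0.7])
p1 = predict_proba(theta, X)   # p_theta(y = 1 | x_i)
p0 = 1.0 - p1                  # p_theta(y = 0 | x_i)
```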
The MLE of θ
given the data {(x1, y1), …, (xN, yN)}, where each yi ∈ {0, 1}
argmax_θ Σ_i [ yi log p_θ(y = 1 | xi) + (1 - yi) log p_θ(y = 0 | xi) ]
= argmax_θ Σ_i [ yi (θᵀxi - log(exp{θᵀxi} + 1)) - (1 - yi) log(exp{θᵀxi} + 1) ]
= argmax_θ Σ_i [ yi θᵀxi - log(exp{θᵀxi} + 1) ]
The MLE of θ
so we have arg max_θ Σ_{i=1}^N [ yi θᵀxi - log(exp{θᵀxi} + 1) ]
taking the gradient: ∇_θ = Σ_{i=1}^N [ yi xi - sigmoid(θᵀxi) xi ] = Σ_{i=1}^N (yi - sigmoid(θᵀxi)) xi
writing the gradient in matrix-vector form: ∇_θ = Xᵀ(y - s_θ), where the rows of X are the xi and s_θ is the vector with entries sigmoid(θᵀxi)
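As a sketch of these formulas in code (the function names log_likelihood and grad_log_likelihood are mine, not the lecture's), the log-likelihood and its gradient in matrix-vector form might look like:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # sum_i [ y_i theta^T x_i - log(exp(theta^T x_i) + 1) ]
    z = X @ theta
    return np.sum(y * z - np.logaddexp(0.0, z))  # logaddexp(0, z) = log(1 + exp(z))

def grad_log_likelihood(theta, X, y):
    # X^T (y - s_theta), where s_theta[i] = sigmoid(theta^T x_i)
    s = sigmoid(X @ theta)
    return X.T @ (y - s)
```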
A closed form solution?
• How would we find an analytical solution for logistic regression?
• We would set Xᵀ(y - s_θ) = 0 and solve for θ
• If you manage to do this, let me know and we can write a research paper
• We instead use iterative optimization, specifically, gradient based optimization
• Before that, let’s analyze what the desired solution is for this setup
Characterizing the logistic regression solution
• The MLE of θ for logistic regression satisfies Xᵀ(y - s_θ) = 0
• Assuming X is full rank, this only happens when s_θ = y, i.e., for all i, p_θ(yi = 1 | xi) = yi — this classifier assigns all probability to the correct classes
• But when does p_θ(yi = 1 | xi) equal 1 or 0?
• When θᵀx = ∞ or -∞… So we would need ‖θ‖₂ → ∞…
• Think of the slope between 0 and 1 getting infinitely “steep”
• We avoid this issue by adding regularization, i.e., from MLE to MAP estimation!
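For example, with a Gaussian prior on θ (equivalently an L2 penalty), the MAP objective subtracts λ‖θ‖²/2 from the log-likelihood and the gradient picks up a -λθ term, which keeps ‖θ‖ finite. A minimal sketch, assuming a hypothetical penalty weight lam:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_objective(theta, X, y, lam):
    # log-likelihood minus an L2 penalty (Gaussian prior on theta)
    z = X @ theta
    return np.sum(y * z - np.logaddexp(0.0, z)) - 0.5 * lam * np.sum(theta ** 2)

def map_gradient(theta, X, y, lam):
    # X^T (y - s_theta) - lam * theta; the penalty term prevents ||theta|| from blowing up
    return X.T @ (y - sigmoid(X @ theta)) - lam * theta
```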
Introducing iterative optimization
For problems where we cannot find an analytical solution, we rely on iterative optimization to find good parameters
Starting from an initial “guess”, continually refine that guess until we are satisfied with our final answer
By far the most commonly used set of iterative optimization techniques is (first order) gradient based optimization and variants thereof
Basically, move the parameters in the direction of the negative gradient of the average loss: θ ← θ - α ∇_θ (1/N) Σ_{i=1}^N ℓ(θ; xi, yi)
Gradient descent
• The gradient tells us how the loss value changes for small parameter changes
• We decrease the loss if we move (with a small enough step size α) along the direction of the negative gradient (basically, go “opposite the slope” in each dimension)
• Repeatedly performing θ ← θ - α ∇_θ (1/N) Σ_{i=1}^N ℓ(θ; xi, yi) is gradient descent
• For strictly convex problems (like regularized logistic regression), and with the right step size α, we can show that we will find parameters that minimize the loss
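A minimal gradient descent loop for regularized logistic regression might look like the sketch below; the step size, iteration count, and penalty weight are arbitrary illustrative choices, not values from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lam=0.1, step_size=0.1, num_iters=1000):
    """Gradient descent on the average negative log-likelihood plus an L2 penalty."""
    N, d = X.shape
    theta = np.zeros(d)                              # initial "guess"
    for _ in range(num_iters):
        s = sigmoid(X @ theta)                       # s_theta
        grad = -X.T @ (y - s) / N + lam * theta      # gradient of the regularized average loss
        theta = theta - step_size * grad             # move opposite the gradient
    return theta

# toy usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = (rng.uniform(size=100) < sigmoid(X @ true_theta)).astype(float)
theta_hat = fit_logistic_regression(X, y)
```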
Multiclass classification
• Many classification problems are not binary — instead, there will be K possible classes, and each yi ∈ {0, …, K - 1}
• For example, for digit recognition, K = 10 for the digits 0 through 9
Multiclass classification with logistic regression
• How do we rework our logistic regression model to handle K > 2 classes?
• We will start by making θ a d × K matrix, where d is the dimensionality of x
• Then, θᵀx is a K-dimensional vector, with one entry for each class!
• We now have K unconstrained real numbers, but we want probabilities
• We will accomplish this yet again by transforming the outputs, this time into K numbers between 0 and 1 that sum to 1
• Then, each number represents p_θ(y = k | x) for a particular k
How do we output multiclass probabilities?
• How do we make our model output numbers between 0 and 1 that sum to 1?
• First, our model outputs unconstrained real numbers
• Then, we make all the numbers positive and normalize (divide by the sum)
• There are many ways to make a number z positive
• In this context, the most commonly used choice is exp(z), which is bijective
• There are many other choices, but basically everyone uses exp(z)
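A minimal sketch of this exp-then-normalize transformation (subtracting the max is a common numerical-stability trick, not part of the math itself):

```python
import numpy as np

def softmax(z):
    # z: K unconstrained real scores, one per class
    z = z - np.max(z)      # does not change the result, but avoids overflow in exp
    e = np.exp(z)          # make every entry positive
    return e / np.sum(e)   # normalize so the entries sum to 1

scores = np.array([2.0, -1.0, 0.5])
probs = softmax(scores)    # K numbers in (0, 1) that sum to 1
```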
A probabilistic model for multiclass classification
if there are K possible labels, then f_θ(x) = θᵀx is a vector of length K
we represent the final probabilities using the softmax function:
p_θ(y = k | x) = softmax(f_θ(x))_k = exp{(θᵀx)_k} / Σ_j exp{(θᵀx)_j}
everything else is as before: MLE or MAP estimation, gradient based optimization
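Putting the pieces together, here is a sketch of the multiclass model's predicted probabilities, with θ stored as a d × K matrix; the function and variable names are mine, not the lecture's.

```python
import numpy as np

def softmax_rows(Z):
    # apply softmax to each row of Z (one row of K scores per example)
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def predict_proba_multiclass(Theta, X):
    # Theta: d x K, X: N x d; returns an N x K matrix whose (i, k) entry is p_theta(y = k | x_i)
    return softmax_rows(X @ Theta)

# toy usage: 5 examples, 4 features, 3 classes
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))
Theta = rng.normal(size=(4, 3))
P = predict_proba_multiclass(Theta, X)   # each row sums to 1
y_hat = P.argmax(axis=1)                 # most probable class per example
```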