Gradient Descent, Stochastic Gradient Descent and Loss Functions
Feb 1, 2022
DS-GA 1003 Feb 1, 2022 1 / 47
Copyright By PowCoder代写 加微信 powcoder
Review: ERM
DS-GA 1003 Feb 1, 2022 2 / 47
Our Setup from Statistical Learning Theory
The Spaces
X: input space Y: outcome space A: action space Prediction Function (or “decision function”)
A prediction function (or decision function) gets input x ∈ X and produces an action a ∈ A : f:X→A
Loss Function
A loss function evaluates an action in the context of the outcome y. l:A×Y→ R
(a,y) → l(a,y) DS-GA 1003
Feb 1, 2022
Risk and the Bayes Prediction Function
Definition
The risk of a prediction function f : X → A is
R(f ) = El(f (x),y).
In words, it’s the expected loss of f on a new example (x,y) drawn randomly from PX×Y. Definition
A Bayes prediction function f ∗ : X → A is a function that achieves the minimal risk among all possible functions:
f ∗ ∈ argminR(f ), f
where the minimum is taken over all functions from X to A.
The risk of a Bayes prediction function is called the Bayes risk. DS-GA 1003
Feb 1, 2022
The Empirical Risk
Let Dn = ((x1,y1),…,(xn,yn)) be drawn i.i.d. from PX×Y. Definition
The empirical risk of f : X → A with respect to Dn is
The unconstrained empirical risk minimizer can overfit.
i.e. if we minimize Rˆn(f ) over all functions, we overfit.
l(f(xi),yi).
DS-GA 1003
Feb 1, 2022
Constrained Empirical Risk Minimization
Definition
A hypothesis space F is a set of functions mapping X → A.
This is the collection of prediction functions we are choosing from.
An empirical risk minimizer (ERM) in F is
fn ∈ argmin n l(f (xi),yi). f∈F i=1
From now on “ERM” always means “constrained ERM”.
So we should always specify the hypothesis space when we’re doing ERM.
DS-GA 1003
Feb 1, 2022
Example: Linear Least Squares Regression
Input space X = Rd
Output space Y = R
Action space Y = R Loss: l(yˆ,y)=(y−yˆ)2
Hypothesis space: F=f :Rd →R|f(x)=wTx,w ∈Rd Given a data set Dn = {(x1,y1),…,(xn,yn)},
Our goal is to find the ERM fˆ∈ F.
DS-GA 1003
Feb 1, 2022
Example: Linear Least Squares Regression
Objective Function: Empirical Risk
We want to find the function in F, parametrized by w ∈ Rd , that minimizes the empirical risk: ˆ1n 2
wTxi −yi How do we solve this optimization problem?
min Rˆn(w)
(For OLS there’s a closed form solution, but in general there isn’t.)
DS-GA 1003
Feb 1, 2022
Gradient Descent
DS-GA 1003 Feb 1, 2022 9 / 47
Unconstrained Optimization
We assume that the objective function f : Rd → R is differentiable. We want to find
x∗ = arg min f (x) x ∈Rd
DS-GA 1003
Feb 1, 2022
The Gradient
Let f :Rd →R be differentiable at x0 ∈Rd.
The gradient of f at the point x0, denoted ∇x f (x0), is the direction in which f (x)
increases fastest, if we start from x0.
Figure A.111 from Newtonian Dynamics, by .
DS-GA 1003 Feb 1, 2022 11 / 47
Gradient Descent
To reach a local minimum as fast as possible, we want to go in the opposite direction from the gradient.
Gradient Descent
Initialize x ← 0. Repeat:
x ← x − η∇f (x )
until the stopping criterion is satisfied.
The “step size” η is not the amount by which we update x!
DS-GA 1003
Feb 1, 2022
Gradient Descent Path
DS-GA 1003 Feb 1, 2022 13 / 47
Gradient Descent: Step Size
A fixed step size will work, eventually, as long as it’s small enough (roughly — details to come)
If η is too large, the optimization process might diverge
In practice, it often makes sense to try several fixed step sizes
Intuition on when to take big steps and when to take small steps?
DS-GA 1003 Feb 1, 2022 14 / 47
Convergence Theorem for Fixed Step Size
Suppose f : Rd → R is convex and differentiable, and ∇f is Lipschitz continuous with constant L > 0, i.e.
∥∇f(x)−∇f(x′)∥L∥x−x′∥
for any x , x ′ ∈ Rd . Then gradient descent with fixed step size η 1/L converges. In particular,
(k) ∗ ∥x(0) −x∗∥2 )−f(x ) 2ηk .
This says that gradient descent is guaranteed to converge and that it converges with rate O (1/k ).
DS-GA 1003 Feb 1, 2022 15 / 47
Gradient Descent: When to Stop?
Wait until ∥∇f (x)∥2 ε, for some ε of your choosing. (Recall ∇f (x) = 0 at a local minimum.)
Early stopping:
evalute loss on validation data after each iteration;
stop when the loss does not improve (or gets worse).
DS-GA 1003
Feb 1, 2022
Gradient Descent for Empirical Risk – Scaling Issues
DS-GA 1003 Feb 1, 2022 17 / 47
Quick recap: Gradient Descent for ERM
We have a hypothesis space of functions F=fw :X→A|w ∈Rd Parameterized by w ∈ Rd .
Finding an empirical risk minimizer entails finding a w that minimizes
Suppose l(fw(xi),yi) is differentiable as a function of w.
Then we can do gradient descent on Rˆn(w) DS-GA 1003
l(fw(xi),yi)
Feb 1, 2022
Gradient Descent: Scalability
At every iteration, we compute the gradient at the current w:
∇wl(fw(xi),yi)
How does this scale with n?
We have to iterate over all n training points to take a single step. [O(n)] Will not scale to “big data”!
Can we make progress without looking at all the data before updating w?
DS-GA 1003
Feb 1, 2022
Stochastic Gradient Descent
DS-GA 1003 Feb 1, 2022 20 / 47
“Noisy” Gradient Descent
Instead of using the gradient, we use a noisy estimate of the gradient.
Turns out this can work just fine!
Intuition:
Gradient descent is an iterative procedure anyway.
At every step, we have a chance to recover from previous missteps.
DS-GA 1003
Feb 1, 2022
Minibatch Gradient
The full gradient is
∇wl(fw(xi),yi)
It’s an average over the full batch of data Dn = {(x1,y1),…,(xn,yn)}.
Let’s take a random subsample of size N (called a minibatch): (xm1,ym1),…,(xmN ,ymN )
The minibatch gradient is
∇RN (w ) = N
∇w l(fw (xmi ), ymi ) DS-GA 1003
Feb 1, 2022
Batch vs Stochastic Methods
(Slide adapted from )
Rule of thumb for stochastic methods:
Stochastic methods work well far from the optimum
But struggle close the the optimum
DS-GA 1003 Feb 1, 2022 23 / 47
Minibatch Gradient Properties
The minibatch gradient is an unbiased estimator for the [full] batch gradient. What does
that mean?
E ∇RN(w) =∇Rn(w)
The bigger the minibatch, the better the estimate.
N1 V a r ∇ Rˆ 1 ( w ) = V a r ∇ Rˆ N ( w )
Tradeoffs of minibatch size:
Bigger N =⇒ Better estimate of gradient, but slower (more data to process)
Smaller N =⇒ Worse estimate of gradient, but can be quite fast
Because of vectorization, we can often get minibatches of certain sizes for free
DS-GA 1003 Feb 1, 2022 24 / 47
Convergence of SGD
For convergence guarantee, use diminishing step sizes, e.g. ηk = 1/k Theoretically, GD is much faster than SGD in terms of convergence rate:
much faster to add a digit of accuracy.
but most of that advantage comes into play once we’re already pretty close to the minimum.
However, in many ML problems we don’t care about optimizing to high accuracy
DS-GA 1003 Feb 1, 2022 25 / 47
Step Sizes in Minibatch Gradient Descent
Minibatch Gradient Descent (minibatch size N) initialize w = 0
randomly choose N points {(xi , yi )}Ni =1 ⊂ Dn
w ←w−η1 N ∇wl(fw(xi),yi) N i=1
For SGD, fixed step size can work well in practice.
Typical approach: Fixed step size reduced by constant factor whenever validation performance stops improving.
Other tricks: Bottou (2012), “Stochastic gradient descent tricks”
DS-GA 1003 Feb 1, 2022 26 / 47
Gradient descent or “full-batch” gradient descent
Use full data set of size n to determine step direction
Minibatch gradient descent
Use a random subset of size N to determine step direction
Stochastic gradient descent
Minibatch with N = 1.
Use a single randomly chosen point to determine step direction.
These days terminology isn’t used so consistently, so always clarify the [mini]batch size.
SGD is much more efficient in time and memory cost and has been quite successful in large-scale ML.
DS-GA 1003 Feb 1, 2022 27 / 47
Example: Logistic regression with l2 regularization
Batch methods converge faster :
(Example from )
DS-GA 1003 Feb 1, 2022 28 / 47
Example: Logistic regression with l2 regularization
Stochastic methods are computationally more efficient:
(Example from )
DS-GA 1003 Feb 1, 2022 29 / 47
Example: Logistic regression with l2 regularization
Batch methods are much faster close to the optimum:
(Example from )
DS-GA 1003 Feb 1, 2022 30 / 47
Loss Functions: Regression
DS-GA 1003 Feb 1, 2022 31 / 47
Regression Problems
Predicting the stock price given history prices
Predicting medical cost of given age, sex, region, BMI etc. Predicting the age of a person based on their photos
Input space X = Rd
Action space A = R Outcome space Y = R.
yˆ is the predicted value (the action)
y is the actual observed value (the outcome) DS-GA 1003
Feb 1, 2022
Loss Functions for Regression
A loss function in general:
(yˆ,y) → l(yˆ,y) ∈ R Regression losses usually only depend on the residual r = y − yˆ.
what you have to add to your prediction to get the correct answer. A loss l(yˆ,y) is called distance-based if:
1 It only depends on the residual:
l(yˆ,y) = ψ(y −yˆ) for some ψ:R → R
2 It is zero when the residual is 0:
ψ(0)=0 DS-GA 1003
Feb 1, 2022
Distance-Based Losses are Translation Invariant
Distance-based losses are translation-invariant. That is, l(yˆ+b,y +b) = l(yˆ,y) ∀b ∈ R.
When might you not want to use a translation-invariant loss?
Sometimes the relative error yˆ−y is a more natural loss (but not translation-invariant)
Often you can transform response y so it’s translation-invariant (e.g. log transform)
DS-GA 1003 Feb 1, 2022 34 / 47
Some Losses for Regression
Residual: r=y−yˆ
Square or l2 Loss: l(r) = r2
Absolute or Laplace or l1 Loss: l(r ) = |r |
| r | = | y − yˆ |
r 2 = ( y − yˆ ) 2
Outliers typically have large residuals. (What is an outlier?) Square loss much more affected by outliers than absolute loss.
DS-GA 1003
Feb 1, 2022
Loss Function Robustness
Robustness refers to how affected a learning algorithm is by outliers. Linear data with noise and outliers
−1 −2 −3 −4 −5 −6
least squares
KPM Figure 7.6
0 0.2 0.4 0.6 0.8 1
DS-GA 1003
Feb 1, 2022
Some Losses for Regression
Square or l2 Loss: l(r ) = r 2 (not robust)
Absolute or Laplace Loss: l(r ) = |r | (not differentiable)
gives median regression
: Quadratic for |r | δ and linear for |r | > δ (robust and differentiable)
Equal values and slopes at r = δ 5
4.5 4 3.5 3 2.5 2 1.5 1 0.5 0 −0.5
−3 −2 −1 0 1 2 3
KPM Figure 7.6
DS-GA 1003
Feb 1, 2022
Classification Loss Functions
DS-GA 1003 Feb 1, 2022 38 / 47
The Classification Problem
Predict whether the image contains a cat
Predict whether the email is SPAM Classification spaces:
Input space Rd
Outcome space Y = {−1, 1}
Action space A = R (easier to work with than A = {−1, 1})
Inference:
f(x)>0 =⇒Predict1 f(x)<0 =⇒Predict −1
DS-GA 1003
Feb 1, 2022
The Score Function
Action space A = R Output space Y = {−1, 1}
Real-valued prediction function f : X → R Definition
The value f (x ) is called the score for the input x .
In this context, f may be called a score function.
The magnitude of the score can be interpreted as our confidence of our prediction.
DS-GA 1003 Feb 1, 2022 40 / 47
The Margin
Definition
The margin (or functional margin) for a predicted score yˆ and the true class y ∈ {−1, 1} is y yˆ. The margin is often written as yf (x ), where f (x ) is our score function.
The margin is a measure of how correct we are:
If y and yˆ are the same sign, prediction is correct and margin is positive.
If y and yˆ have different sign, prediction is incorrect and margin is negative. We want to maximize the margin
Most classification losses depend only on the margin (they are margin-based losses).
DS-GA 1003 Feb 1, 2022 41 / 47
Classification Losses: 0−1 Loss
If f ̃ is the inference function (1 if f (x ) > 0 and −1 otherwise), then
The 0-1 loss for f : X → {−1,1}:
Empirical risk for 0−1 loss:
1(yif(xi)0)
Minimizing empirical 0−1 risk not computationally feasible
Rˆn(f ) is non-convex, not differentiable (in fact, discontinuous!). Optimization is NP-Hard.
l(f(x),y)=1(f ̃(x)̸=y)
DS-GA 1003
Feb 1, 2022
Classification Losses
Zero-One loss: l0-1 = 1(m 0)
x-axis is margin: m > 0 ⇐⇒ correct classification
Feb 1, 2022 43 / 47
Classification Losses
SVM/Hinge loss: lHinge = max (1 − m, 0)
Hinge is a convex, upper bound on 0 − 1 loss. Not differentiable at m = 1.
DS-GA 1003 Feb 1, 2022 44 / 47
Classification Losses
Logistic/Log loss: lLogistic = log (1 + e −m )
Logistic loss is differentiable. Logistic loss always rewards a larger margin (the loss is never 0).
DS-GA 1003 Feb 1, 2022 45 / 47
What About Square Loss for Classification?
Action space A = R Output space Y = {−1, 1}
Loss l(f (x),y) = (f (x)−y)2.
Turns out, can write this in terms of margin m = f (x )y :
l(f (x),y) = (f (x)−y)2 = (1−f (x)y)2 = (1−m)2 Prove using fact that y 2 = 1, since y ∈ {−1, 1}.
DS-GA 1003
Feb 1, 2022
What About Square Loss for Classification?
Heavily penalizes outliers (e.g. mislabeled examples).
May have higher sample complexity (i.e. needs more data) than hinge & logistic1. 1Rosasco et al’s “Are Loss Functions All the Same?” http://web.mit.edu/lrosasco/www/publications/loss.pdf
DS-GA 1003 Feb 1, 2022 47 / 47
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com