
Machine Learning
Lecture 6: Optimization
Prof. Dr. Günnemann
Data Analytics and Machine Learning Technical University of Munich


Winter term 2020/2021

Motivation
• Many machine learning tasks are optimization problems
• Examples we’ve already seen:
– Linear Regression: w* = arg min_w ½·(X·w − y)^T·(X·w − y)
– Logistic Regression: w* = arg min_w −ln p(y | w, X)
• Other examples:
– Support Vector Machines: find hyperplane that separates the classes
with a maximum margin
– k-means: find clusters and centroids such that the sum of squared distances is minimized
– Matrix Factorization: find matrices that minimize the reconstruction error
– Neural networks: find weights such that the loss is minimized
– And many more…
Optimization 2
Data Analytics and Machine Learning

General Task
• Let θ denote the variables/parameters of our problem we want to learn
– e.g. θ = w in Logistic Regression
• Let X denote the domain of θ; the set of valid instantiations
– constraints on the parameters!
– e.g. X = set of (positive) real numbers
• Let f(θ) denote the objective function – e.g. f is the negative log-likelihood
• Goal: Find solution θ* minimizing function f: θ* = arg min_{θ∈X} f(θ)
– find a global minimum of the function f!
– similarly, for some problems we are interested in finding the maximum
Optimization 3
Data Analytics and Machine Learning

Introductory Example
• Goal: Find minimum of function f(θ) = 0.6·θ⁴ − 5·θ³ + 13·θ² − 12·θ + 5
• Unconstrained optimization + differentiable function
• Necessary condition for minima
– Gradient = 0
– Sufficient?
• General challenge: multiple local minima possible
Optimization 4
Data Analytics and Machine Learning

Convexity: Sets
• X is a convex set iff
for all x, y ∈ X it follows that λ·x + (1 − λ)·y ∈ X for λ ∈ [0, 1]
Optimization 5
Data Analytics and Machine Learning

Convexity: Functions
• f(x) is a convex function on the convex set X iff
for all x, y ∈ X: λ·f(x) + (1 − λ)·f(y) ≥ f(λ·x + (1 − λ)·y) for λ ∈ [0, 1]
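As an added illustration (not part of the original slides), the definition can be checked numerically by sampling random point pairs and values of λ; such a test can only refute convexity, never prove it. A minimal Python sketch:

```python
import numpy as np

def looks_convex(f, dim, n_trials=10000, low=-5.0, high=5.0, seed=0):
    """Numerically test lam*f(x) + (1-lam)*f(y) >= f(lam*x + (1-lam)*y)
    on random points; can only refute convexity, never prove it."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        x = rng.uniform(low, high, dim)
        y = rng.uniform(low, high, dim)
        lam = rng.uniform(0.0, 1.0)
        lhs = lam * f(x) + (1 - lam) * f(y)
        rhs = f(lam * x + (1 - lam) * y)
        if lhs < rhs - 1e-9:          # inequality violated -> not convex
            return False
    return True

print(looks_convex(lambda x: np.sum(x**2), dim=2))      # True  (convex)
print(looks_convex(lambda x: np.sin(x).sum(), dim=1))   # False (not convex)
```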
Optimization 6
Data Analytics and Machine Learning

Convexity and minimization problems
• Region above a convex function is convex
f(λ·x + (1 − λ)·y) ≤ λ·f(x) + (1 − λ)·f(y), hence the point (λ·x + (1 − λ)·y, λ·f(x) + (1 − λ)·f(y)) lies in the region above f for x, y ∈ X
• Convex functions have no local minima which are not global minima
– Proof by contradiction – linear interpolation breaks the local minimum
• Each local minimum is a global minimum
– zero gradient implies (local) minimum for convex functions
– if f is a convex function and ∇f(θ*) = 0, then θ* is a global minimum
– minimization becomes "relatively easy"
Optimization 7
Data Analytics and Machine Learning


First order convexity conditions (I)
• Convexity imposes a rate of rise on the function
• f((1 − t)·x + t·y) ≤ (1 − t)·f(x) + t·f(y)
• f(y) − f(x) ≥ (f((1 − t)·x + t·y) − f(x)) / t
• Difference between f(y) and f(x) is bounded by function values between x and y
Optimization 8
Data Analytics and Machine Learning

First order convexity conditions (II)
• f(y) − f(x) ≥ (f((1 − t)·x + t·y) − f(x)) / t
• Let t → 0 and apply the definition of the derivative
• f(y) − f(x) ≥ (y − x)^T·∇f(x)
• Theorem:
Suppose f : X → R is a differentiable function and X is convex. Then f is convex iff for all x, y ∈ X
f(y) ≥ f(x) + (y − x)^T·∇f(x)
• Proof. See Boyd p.70
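As a small added illustration of the theorem (a sketch, not from the slides): for a convex quadratic, the first-order Taylor approximation at any point x is a global underestimator, which can be verified numerically on random pairs of points.

```python
import numpy as np

# f(x) = x^T A x with A positive definite -> convex, gradient = 2 A x
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
f = lambda x: x @ A @ x
grad = lambda x: 2 * A @ x

rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    # first-order condition: f(y) >= f(x) + (y - x)^T grad(x)
    assert f(y) >= f(x) + (y - x) @ grad(x) - 1e-9
print("first-order condition held on all sampled pairs")
```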
Optimization 9
Data Analytics and Machine Learning

Verifying convexity (I)
• Convexity makes optimization "easier"
• How to verify whether a function is convex?
• For example: is e^(x1 + 2·x2) + x1 − log(x2) convex on [1, ∞) × [1, ∞)?
1. Prove whether the definition of convexity holds (see slide 6)
2. Exploit special results
• First-order convexity condition (see slide 9)
• Example: A twice differentiable function of one variable is convex on an interval if and only if its second derivative is non-negative on this interval
• More generally: a twice differentiable function of several variables is convex (on a convex set) if and only if its Hessian matrix is positive semidefinite (on the set)
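For the example above, the Hessian criterion can be checked directly. The following added sketch assumes the function is e^(x1 + 2·x2) + x1 − log(x2), computes its analytic Hessian, and tests positive semidefiniteness on a grid inside [1, ∞) × [1, ∞):

```python
import numpy as np

def hessian(x1, x2):
    """Analytic Hessian of f(x1, x2) = exp(x1 + 2*x2) + x1 - log(x2)."""
    e = np.exp(x1 + 2 * x2)
    return np.array([[e,      2 * e],
                     [2 * e,  4 * e + 1 / x2**2]])

# check the smallest eigenvalue on a grid inside the domain [1, inf) x [1, inf)
for x1 in np.linspace(1.0, 3.0, 5):
    for x2 in np.linspace(1.0, 3.0, 5):
        min_eig = np.linalg.eigvalsh(hessian(x1, x2)).min()
        assert min_eig >= -1e-9, (x1, x2, min_eig)
print("Hessian is positive semidefinite on the sampled grid")
```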
Optimization 10
Data Analytics and Machine Learning

Verifying convexity (II)
3. Show that the function can be obtained from simple convex functions by operations that preserve convexity
a) Start with simple convex functions, e.g.
– f(x) = const and f(x) = x^T·b (there are also concave functions)
b) Apply "construction rules" (next slide)
Optimization 11
Data Analytics and Machine Learning

Convexity preserving operations
• Let f1 : R^d → R and f2 : R^d → R be convex functions, and g : R^d → R be a concave function, then
– h(x) = f1(x) + f2(x) is convex
– h(x) = max{f1(x), f2(x)} is convex
– h(x) = c·f1(x) is convex if c ≥ 0
– h(x) = c·g(x) is convex if c ≤ 0
– h(x) = f1(A·x + b) is convex (A matrix, b vector)
– h(x) = m(f1(x)) is convex if m : R → R is convex and nondecreasing
• Example: e^(x1 + 2·x2) + x1 − log(x2) is convex on, e.g., [1, ∞) × [1, ∞)
Optimization 12
Data Analytics and Machine Learning

Verifying convexity of sets
1. Prove definition
– often easier for sets than for functions
2. Apply intersection rule
– Let A and B be convex sets, then A ∩ B is a convex set
Optimization 13
Data Analytics and Machine Learning

An easy problem
Convex objective function f
• Objective function differentiable on its whole domain
– i.e. we are able to compute the gradient ∇f at every point
• We can solve ∇f(θ) = 0 for θ analytically
– i.e. a solution for θ where the gradient = 0 is available in closed form
• Unconstrained minimization
– i.e. the above computed solution for θ is valid
• We are done!
• Example: Ordinary Least Squares Regression
[Figure: plot of an example convex objective function]
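To make the "easy problem" concrete, here is an added sketch (not from the slides) that solves ∇f(w) = 0 for ordinary least squares in closed form via the normal equations XᵀX·w = Xᵀy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # design matrix
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

# gradient of 1/2 (Xw - y)^T (Xw - y) is X^T X w - X^T y; setting it to zero
# gives the normal equations, solved here directly
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)   # close to w_true
```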
Optimization 14
Data Analytics and Machine Learning

• Unfortunately, many problems are harder…
• No analytical solution for ∇f(θ) = 0
– e.g. Logistic Regression
– Solution: try numerical approaches, e.g. gradient descent
• Constraints on θ
– e.g. ∇f(θ) = 0 only holds for points outside the domain
– Solution: constrained optimization
• f not differentiable on the whole domain
– Potential solution: subgradients; or is it a discrete optimization problem?
• f not convex
– Potential solution: convex relaxations; convex in some variables?
Optimization 15
Data Analytics and Machine Learning

One-dimensional problems
• Key Idea
– For differentiable f search for θ with ∇f(θ) = 0
– Interval bisection (the derivative is monotonic)
• Can be extended to nondifferentiable problems
– exploit convexity in upper bound and keep 5 points
Require: a, b, precision ε
Set A = a, B = b
repeat
  if f′((A + B)/2) > 0 then
    B = (A + B)/2   // solution is on the left
  else
    A = (A + B)/2   // solution is on the right
  end if
until (B − A)·min(|f′(A)|, |f′(B)|) ≤ ε
Output: x = (A + B)/2
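A minimal Python sketch of the interval bisection above (an added illustration; it assumes f is convex and differentiable on [a, b] with the minimizer inside the interval, and f_prime denotes its derivative f′):

```python
def bisect_minimize(f_prime, a, b, eps=1e-8):
    """Interval bisection on the derivative of a convex 1-d function."""
    A, B = a, b
    while (B - A) * min(abs(f_prime(A)), abs(f_prime(B))) > eps:
        mid = 0.5 * (A + B)
        if f_prime(mid) > 0:
            B = mid          # derivative positive -> minimum lies to the left
        else:
            A = mid          # derivative non-positive -> minimum lies to the right
    return 0.5 * (A + B)

# example: f(theta) = (theta - 3)^2 + 1, so f'(theta) = 2*(theta - 3)
print(bisect_minimize(lambda t: 2 * (t - 3), a=-10.0, b=10.0))   # ~3.0
```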
Optimization 16
Data Analytics and Machine Learning

Gradient Descent
• Key Idea
– Gradient points into steepest ascent direction
– Locally, the gradient gives a good (linear) approximation of the objective function
• GD with Line Search
– Get descent direction, then unconstrained line search
– Turn a multidimensional problem into a one-dimensional problem that we already know how to solve
given a starting point θ ∈ dom(f)
repeat
1. Δθ := −∇f(θ)
2. Line search: t* = arg min_{t>0} f(θ + t·Δθ)
3. Update: θ := θ + t*·Δθ
until stopping criterion is satisfied.
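An added sketch of gradient descent with a line search (not from the slides; for simplicity it uses a backtracking line search that merely ensures sufficient decrease rather than the exact minimization over t shown above):

```python
import numpy as np

def gradient_descent(f, grad, theta0, max_iter=100, tol=1e-6):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        d = -grad(theta)                 # descent direction: negative gradient
        if np.linalg.norm(d) < tol:      # stopping criterion: gradient ~ 0
            break
        # backtracking line search: shrink t until the objective decreases enough
        t = 1.0
        while f(theta + t * d) > f(theta) - 0.5 * t * np.dot(d, d):
            t *= 0.5
        theta = theta + t * d
    return theta

# example: quadratic objective f(theta) = theta^T A theta - b^T theta
A = np.array([[3.0, 0.2], [0.2, 2.0]])
b = np.array([1.0, 1.0])
f = lambda th: th @ A @ th - b @ th
grad = lambda th: 2 * A @ th - b
print(gradient_descent(f, grad, theta0=[5.0, -5.0]))
```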
Optimization 17
Data Analytics and Machine Learning

Gradient Descent convergence
• Let p* be the optimal value, θ* be the minimizer – the point where the minimum is obtained – and θ^(0) be the starting point
• For strongly convex f (replace ≥ with > in the definition of convexity) the residual error ρ for the k-th iteration is:
ρ = f(θ^(k)) − p* ≤ c^k · (f(θ^(0)) − p*),  c < 1
f(θ^(k)) converges to p* as k → ∞
• We must have f(θ^(k)) − p* ≤ ε after at most log((f(θ^(0)) − p*)/ε) / log(1/c) iterations
• Linear convergence for strongly convex objective
– k ∼ log(ρ^−1)   // k = number of iterations, ρ = residual error
– i.e. linear when plotting on a log scale – old statistics terminology
Optimization 18
Data Analytics and Machine Learning

Distributed/Parallel implementation
• Often problems are of the form
– f(θ) = Σ_i L_i(θ) + g(θ)
– where i iterates over, e.g., each data instance
• Example OLS regression:   // with regularization
– L_i(w) = (x_i^T·w − y_i)²,  g(w) = λ·‖w‖²
• Gradient can simply be decomposed based on the sum rule
• Easy to parallelize/distribute
Optimization 19
Data Analytics and Machine Learning

Basic steps
given a starting point θ ∈ dom(f)
repeat
1. Δθ := −∇f(θ)
2. Line search: t* = arg min_{t>0} f(θ + t·Δθ)
3. Update: θ := θ + t*·Δθ
until stopping criterion is satisfied.
• Distribute data over several machines
• Compute partial gradients (on each machine in parallel)
• Aggregate the partial gradients to the final one
– easy parallel computation
• BUT: Line search is expensive
– for each tested step size: scan through all data points
– evaluating the function might be done multiple times: expensive!
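An added toy sketch of the partial-gradient aggregation above: the full OLS gradient is the sum of per-chunk partial gradients, each of which could be computed on a different machine; here the "machines" are simply array chunks processed in a loop.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5)
w = np.zeros(5)
lam = 0.1

def partial_grad(X_chunk, y_chunk, w):
    """Gradient of sum_i (x_i^T w - y_i)^2 over one data chunk."""
    return 2 * X_chunk.T @ (X_chunk @ w - y_chunk)

# split the data into 4 "machines", compute partial gradients, then aggregate
chunks = zip(np.array_split(X, 4), np.array_split(y, 4))
grad = sum(partial_grad(Xc, yc, w) for Xc, yc in chunks) + 2 * lam * w   # + regularizer g(w)
print(np.allclose(grad, 2 * X.T @ (X @ w - y) + 2 * lam * w))            # True
```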
Optimization 20
Data Analytics and Machine Learning

Scalability analysis
+ Linear time in number of instances
+ Linear memory consumption in problem size (not data)
+ Logarithmic time in accuracy
+ ’Perfect’ scalability
– Multiple passes through dataset for each iteration
Optimization 21
Data Analytics and Machine Learning

A faster algorithm
• Avoid the line search; simply pick update
θ_{t+1} ← θ_t − τ·∇f(θ_t)
– τ is often called the learning rate
• Only a single pass through data per iteration
• Logarithmic iteration bound (as before)
– if the learning rate is chosen "correctly"
• How to pick the learning rate?
– too small: slow convergence
– too high: algorithm might oscillate, no convergence
• Interactive tutorial on optimization
– http://www.benfrederickson.com/numerical-optimization/
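An added sketch illustrating the effect of the learning rate with fixed-step updates on f(θ) = θ²: a very small τ converges slowly, a moderate τ converges quickly, and a too large τ oscillates and diverges.

```python
grad = lambda theta: 2 * theta          # gradient of f(theta) = theta^2

for tau in (0.01, 0.4, 1.1):
    theta = 5.0
    for _ in range(30):
        theta = theta - tau * grad(theta)
    print(f"tau={tau}: theta after 30 steps = {theta:.4g}")
# tau=0.01 -> slow progress, tau=0.4 -> near 0, tau=1.1 -> diverges (|1 - 2*tau| > 1)
```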
Optimization 22
Data Analytics and Machine Learning

The value of τ
• A too small value for τ has two drawbacks
– We find the minimum more slowly
– We end up in local minima or saddle/flat points
Optimization 23
Data Analytics and Machine Learning

The value of τ
• A too large value for τ has one drawback
– You may never find a minimum; oscillations usually occur
• We only need 1 step to overshoot
Optimization 24
Data Analytics and Machine Learning

Learning rate adaptation
• Simple solution: let the learning rate be a decreasing function τ_t of the iteration number t
– so-called learning rate schedule
– first iterations cause large changes in the parameters; later ones do fine-tuning
– convergence easily guaranteed if lim_{t→∞} τ_t = 0
– example: τ_{t+1} ← α·τ_t for 0 < α < 1
Optimization 25
Data Analytics and Machine Learning

Learning rate adaptation
• Other solutions: Incorporate "history" of previous gradients
• Momentum:
– m_t ← τ·∇f(θ_t) + γ·m_{t−1}   // often γ = 0.5
– θ_{t+1} ← θ_t − m_t
– As long as gradients point in the same direction, the search accelerates
• AdaGrad:
– different learning rate per parameter
– learning rate depends inversely on the accumulated "strength" of all previously computed gradients
– large parameter updates ("large" gradients) lead to small learning rates
Optimization 26
Data Analytics and Machine Learning

Adaptive moment estimation (Adam)
• m_t = β1·m_{t−1} + (1 − β1)·∇f(θ_t)
– estimate of the first moment (mean) of the gradient
– exponentially decaying average of past gradients m_t (similar to momentum)
• v_t = β2·v_{t−1} + (1 − β2)·(∇f(θ_t))²
– estimate of the second moment (uncentered variance) of the gradient
– exponentially decaying average of past squared gradients v_t
• To avoid bias towards zero (due to initialization with 0's), use the bias-corrected versions instead:
– m̂_t = m_t / (1 − β1^t),  v̂_t = v_t / (1 − β2^t)
• Finally, the Adam update rule for parameters θ:
– θ_{t+1} = θ_t − τ/(√v̂_t + ε) · m̂_t
• Default values: β1 = 0.9, β2 = 0.999, ε = 10^−8
Optimization 27
Data Analytics and Machine Learning

Visualizing gradient descent variants
• AdaGrad and variants
– often have faster convergence
– might help to escape saddle points
• http://sebastianruder.com/optimizing-gradient-descent/
Optimization 28
Data Analytics and Machine Learning

Discussion
• Gradient descent and similar techniques are called first-order optimization techniques
– they only exploit information of the gradients (i.e. first-order derivatives)
• Higher-order techniques use higher-order derivatives
– e.g. second order = Hessian matrix
– Example: Newton method
Optimization 29
Data Analytics and Machine Learning

Newton method
• Convex objective function f
• Nonnegative second derivative: ∇²f(θ) ⪰ 0   // Hessian matrix
– ∇²f(θ) ⪰ 0 means that the Hessian is positive semidefinite
• Taylor expansion of f at point θ_t:
f(θ_t + δ) = f(θ_t) + δ^T·∇f(θ_t) + ½·δ^T·∇²f(θ_t)·δ + O(δ³)
• Minimize the approximation: leads to
θ_{t+1} ← θ_t − [∇²f(θ_t)]^−1·∇f(θ_t)
Repeat until convergence
Optimization 30
Data Analytics and Machine Learning

Parallel Newton method
+ Good rate of convergence
+ Few passes through data needed
+ Parallel aggregation of gradient and Hessian
+ Gradient requires O(d) data
- Hessian requires O(d²) data
- Update step is O(d³) & nontrivial to parallelize
• Use it only for low-dimensional problems!
Optimization 31
Data Analytics and Machine Learning

Large scale optimization
• Higher-order techniques have nice properties (e.g. convergence) but they are prohibitively expensive for high-dimensional problems
• For large scale data / high-dimensional problems use first-order techniques
– i.e. variants of gradient descent
• But for real-world large scale data even first-order methods are too slow
• Solution: Stochastic optimization!
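The momentum, AdaGrad, and Adam update rules from the preceding slides can be written compactly in code; this is an added sketch (with γ, β1, β2, ε named as in the text):

```python
import numpy as np

def momentum_step(theta, m, grad, tau=0.01, gamma=0.5):
    m = tau * grad(theta) + gamma * m         # accumulate a velocity term
    return theta - m, m

def adagrad_step(theta, G, grad, tau=0.1, eps=1e-8):
    g = grad(theta)
    G = G + g**2                              # accumulated squared gradients
    return theta - tau / (np.sqrt(G) + eps) * g, G

def adam_step(theta, m, v, t, grad, tau=0.01, b1=0.9, b2=0.999, eps=1e-8):
    g = grad(theta)
    m = b1 * m + (1 - b1) * g                 # first-moment estimate
    v = b2 * v + (1 - b2) * g**2              # second-moment estimate
    m_hat = m / (1 - b1**t)                   # bias correction
    v_hat = v / (1 - b2**t)
    return theta - tau / (np.sqrt(v_hat) + eps) * m_hat, m, v

# usage example on f(theta) = ||theta||^2 with gradient 2*theta
grad = lambda th: 2 * th
theta, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 1001):
    theta, m, v = adam_step(theta, m, v, t, grad)
print(theta)     # close to the minimum at [0, 0]
```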
Optimization 32
Data Analytics and Machine Learning

Motivation: Stochastic Gradient Descent
• Goal: minimize f(θ) = Σ_{i=1}^n L_i(θ) + potential constraints
• For very large data: even a single pass through the data is very costly
• Lots of time required to even compute the very first gradient
• Is it possible to update the parameters more frequently/faster?
Optimization 33
Data Analytics and Machine Learning

Stochastic Gradient Descent
• Consider the task as empirical risk minimization:
(1/n)·Σ_{i=1}^n L_i(θ) = E_{i∼{1,...,n}}[L_i(θ)]
• The (exact) expectation can be approximated by a smaller sample:
E_{i∼{1,...,n}}[L_i(θ)] ≈ (1/|S|)·Σ_{j∈S} L_j(θ)   // with S ⊆ {1,...,n}
or equivalently: Σ_{i=1}^n L_i(θ) ≈ (n/|S|)·Σ_{j∈S} L_j(θ)
Optimization 34
Data Analytics and Machine Learning

Stochastic Gradient Descent
• Intuition: Instead of using the "exact" gradient, compute only a noisy (but still unbiased) estimate based on a smaller sample
• Stochastic gradient descent:
1. randomly pick a (small) subset S of the points → so-called mini-batch
2. compute the gradient based on the mini-batch
3. update: θ_{t+1} ← θ_t − τ·(n/|S|)·Σ_{j∈S} ∇L_j(θ_t)
4. pick a new subset and repeat with 2
• "Original" SGD uses mini-batches of size 1
– larger mini-batches lead to more stable gradients (i.e. smaller variance in the estimated gradient)
• In many cases, the data is sampled so that we don't see any data point twice. Then, each full iteration over the complete data set is called one "epoch".
Optimization 35
Data Analytics and Machine Learning

Example: Perceptron
• Simple linear binary classifier: predict +1 if w^T·x + b > 0, and −1 else
• Learning task:
Given (x_1, y_1), ..., (x_n, y_n) with y_i ∈ {−1, +1}
Find min_{w,b} Σ_i L(y_i, w^T·x_i + b)
• L is the loss function, with ε > 0
– e.g. L(u, v) = max(0, ε − u·v) = ε − u·v if u·v < ε (incorrect prediction), and 0 else (correct prediction)
Optimization 36
Data Analytics and Machine Learning
Example: Perceptron
• Let's solve this problem via SGD
initialize w = 0 and b = 0
repeat
  pick a data point (x_i, y_i) at random
  if y_i·(w^T·x_i + b) < ε then
    w ← w + τ·n·y_i·x_i and b ← b + τ·n·y_i
until all points are classified correctly
• Note: Nothing happens if a point is classified correctly – the gradient is zero
• Does this remind you of the original learning rules for the perceptron?
Optimization 37
Data Analytics and Machine Learning

Convergence in expectation
• Subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex
– almost surely to a local minimum for non-convex functions
• The expectation of the residual error decreases with speed E[ρ] ∼ t^−1   // i.e. t ∼ E[ρ]^−1
• Note: Standard GD has speed t ∼ log(ρ^−1)
– faster convergence speed; but each iteration takes longer
Optimization 38
Data Analytics and Machine Learning

Optimizing Logistic Regression
• Recall we wanted to solve w* = arg min_w E(w)
• E(w) = −ln p(y | w, X) = −Σ_i [ y_i·ln σ(w^T·x_i) + (1 − y_i)·ln(1 − σ(w^T·x_i)) ]
• Closed form solution does not exist
• Solution:
– Compute the gradient ∇E(w)
– Find w* using gradient descent
• Is E(w) convex?
• Can you use SGD?
• How can you choose the learning rate?
• What changes if we add regularization, i.e. E_reg(w) = E(w) + λ·‖w‖²?
Optimization 39
Data Analytics and Machine Learning

Large-Scale Learning - Distributed Learning
• So far, we (mainly) assumed a single machine
• SGD achieves speed-up by only operating on a subset of the data
– Might still be too slow when operating with really large data and large models
• In practice: We often have multiple machines available → Distributed learning
– Distribute computation across multiple machines
– Core challenge: distribute work so that communication doesn't kill the speed-up
Optimization 40
Data Analytics and Machine Learning

Distributed Learning: Data vs. Model Parallelism
• Use multiple model replicas to process different examples at the same time
• all collaborate to update model state (parameters) in sh…
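To close the loop, an added sketch of the perceptron trained with SGD as on slide 37, on synthetic linearly separable data; ε is set to 0 here (giving the classic perceptron rule) and the factor n from the mini-batch rescaling is absorbed into the learning rate τ:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = np.where(X @ np.array([2.0, -1.0]) + 0.5 > 0, 1, -1)   # linearly separable labels

w, b = np.zeros(2), 0.0
tau, eps = 0.1, 0.0                       # learning rate and margin parameter

for epoch in range(100):
    errors = 0
    for i in rng.permutation(n):          # SGD: one randomly ordered point at a time
        if y[i] * (w @ X[i] + b) <= eps:  # misclassified (or exactly on the boundary)
            w = w + tau * y[i] * X[i]     # gradient step on the perceptron loss
            b = b + tau * y[i]
            errors += 1
    if errors == 0:                       # all points classified correctly
        break

print(np.mean(np.sign(X @ w + b) == y))   # 1.0 on the training data
```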