
Machine Learning and Data Mining in Business
Lecture 6: Training Machine Learning Models (Part 2)
Discipline of Business Analytics


Lecture 6: Training Machine Learning Models (Part 2)
Learning objectives
• Multivariate optimisation.
• Local descent methods.
• Stochastic gradient descent.

Lecture 6: Training Machine Learning Models (Part 2)
1. Multivariate optimisation
2. Application: Logistic Regression
3. Optimisation in machine learning
4. Local descent methods
5. Stochastic gradient descent

Multivariate optimisation

Multivariate optimisation
We now consider the problem
minimise f(x)
where f : X → R is a twice-differentiable objective function and X ⊆ R^p.

Background: partial derivatives
A partial derivative of a function of many variables is the derivative with respect to one of the variables, holding the others constant.
Example. Consider the function

J(β0, β1) = (y − β0 − β1x)².

The partial derivatives are

∂J/∂β0 = −2(y − β0 − β1x),
∂J/∂β1 = −2x(y − β0 − β1x).

Background: gradient
The gradient of a function of many variables is the vector of partial derivatives.
Example. The gradient of the function J(β0, β1) = (y − β0 − β1x)² is

∇J = (∂J/∂β0, ∂J/∂β1)
   = (−2(y − β0 − β1x), −2x(y − β0 − β1x)).
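As a sanity check (not part of the original slides), we can verify this analytical gradient numerically with central finite differences. A minimal sketch assuming NumPy is available; the data values y and x and the evaluation point are made up for illustration.

```python
import numpy as np

# Single data point (made-up values for illustration).
y, x = 2.0, 3.0

def J(b0, b1):
    return (y - b0 - b1 * x) ** 2

def grad_J(b0, b1):
    # Analytical gradient from the slide.
    r = y - b0 - b1 * x
    return np.array([-2.0 * r, -2.0 * x * r])

b0, b1, eps = 0.5, -1.0, 1e-6
num_grad = np.array([
    (J(b0 + eps, b1) - J(b0 - eps, b1)) / (2 * eps),
    (J(b0, b1 + eps) - J(b0, b1 - eps)) / (2 * eps),
])
print(grad_J(b0, b1))  # [ -9. -27.]
print(num_grad)        # matches to several decimal places
```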

Critical point
A critical point of a multivariate function is a point where the gradient is a vector of zeros:

∇f(x) = (∂f/∂x1, ∂f/∂x2, …, ∂f/∂xp) = (0, 0, …, 0) = 0.

Critical point
A critical point can be a local minimum, local maximum or a saddle point.
Figure from Algorithms for Optimisation by M.J. Kochenderfer and T.A. Wheeler.

Background: Hessian
The Hessian of a function of many variables is the square matrix of second-order partial derivatives,

∇²f(x) = [ ∂²f/∂x1²      ∂²f/∂x1∂x2    ⋯   ∂²f/∂x1∂xp ]
         [ ∂²f/∂x2∂x1    ∂²f/∂x2²      ⋯   ∂²f/∂x2∂xp ]
         [     ⋮              ⋮          ⋱       ⋮      ]
         [ ∂²f/∂xp∂x1    ∂²f/∂xp∂x2    ⋯   ∂²f/∂xp²    ]
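For the running example J(β0, β1) = (y − β0 − β1x)², a symbolic computation recovers the gradient and Hessian directly. A sketch assuming SymPy is available; the symbol names are illustrative.

```python
import sympy as sp

b0, b1, x, y = sp.symbols('b0 b1 x y')
J = (y - b0 - b1 * x) ** 2

grad = [sp.diff(J, v) for v in (b0, b1)]
hess = sp.hessian(J, (b0, b1))
print(grad)  # equivalent to -2*(y - b0 - b1*x) and -2*x*(y - b0 - b1*x)
print(hess)  # Matrix([[2, 2*x], [2*x, 2*x**2]])
```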

Necessary conditions to be a local minimum
A local minimum must satisfy these two conditions:
1. ∇f(x) = 0, the first-order necessary condition (FONC).
2. The Hessian matrix ∇2f(x) is positive semidefinite, the second-order necessary condition (SONC).
Note: the definition of a positive semidefinite matrix and the task of checking the SONC in the multivariate case are outside the scope of this unit.

Sufficient conditions to be a local minimum
A point x is guaranteed to be a strong local minimum if
1. ∇f(x) = 0.
2. The Hessian ∇2f(x) is positive definite.
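Although checking definiteness is beyond the scope of this unit, a common practical check in numerical work is whether all eigenvalues of the Hessian are positive. A minimal sketch assuming NumPy; the matrix below is purely illustrative.

```python
import numpy as np

H = np.array([[2.0, 1.0],
              [1.0, 3.0]])       # an illustrative Hessian

eigvals = np.linalg.eigvalsh(H)  # eigenvalues of a symmetric matrix
print(eigvals)                   # both positive here
print(np.all(eigvals > 0))       # True -> positive definite
```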

Convex optimisation
As in the univariate case:
• If the objective function is convex, then a local minimum is guaranteed to be a global minimum.
• If the objective function is strictly convex, then it has at most one critical point. Such a point, if it exists, is the unique global minimum.

Convex function
Lemma (second-derivative characterisation of convexity). Let
X ⊆ Rp be a convex set. A twice-differentiable function
f : X → R is convex if and only if ∇2f(x) is positive semidefinite for all x ∈ X .
If ∇2f(x) is positive definite for all x ∈ X, then f : X → R is strictly convex. However, the converse does not hold.

Convex and non-convex functions
Examples of convex (left) vs. non-convex (right) functions of two variables.
Image credit: O’Reilly media.

Application: Logistic Regression

Logistic regression
From last week, we want to solve the problem
From last week, we want to solve the problem

minimise Σ_{i=1}^n [ −yi log(πi) − (1 − yi) log(1 − πi) ],

where πi = σ(β0 + Σ_{j=1}^p βj xij).
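To make the objective concrete, here is a minimal sketch that evaluates this cross-entropy cost in NumPy. The toy data and the zero initial coefficients are made-up assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(beta, X, y):
    # beta[0] is the intercept beta_0; beta[1:] are the slopes.
    pi = sigmoid(beta[0] + X @ beta[1:])
    return np.sum(-y * np.log(pi) - (1 - y) * np.log(1 - pi))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))              # toy design matrix
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])  # toy labels
print(cost(np.zeros(3), X, y))           # = 5 * log(2), since each pi = 0.5
```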

Background: Leibniz’s notation
Let y = f(z) and z = g(x). We can express the chain rule as
dy/dx = (dy/dz) (dz/dx).

Logistic regression
Consider the following representation of the objective function using auxiliary variables,
J(β) = Σ_{i=1}^n Ji,
Ji = −yi log(πi) − (1 − yi) log(1 − πi),
πi = σ(zi),
zi = β0 + Σ_{j=1}^p βj xij.

Logistic regression
Using the sum rule and the chain rule,
∂J/∂βk = Σ_{i=1}^n ∂Ji/∂βk = Σ_{i=1}^n (∂Ji/∂zi)(∂zi/∂βk).

Logistic regression
We can then use the following result, which is identical to the one derived in last week's tutorial:

∂Ji/∂zi = σ(zi) − yi = πi − yi.

Because zi = β0 + Σ_{j=1}^p βj xij,

∂zi/∂βk = xik.

Logistic regression
Putting all the results together, we obtain a simple expression for the partial derivatives (with xi0 = 1 for the intercept):

∂J/∂βk = Σ_{i=1}^n (∂Ji/∂zi)(∂zi/∂βk)
       = Σ_{i=1}^n xik (πi − yi).
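This expression translates directly into vectorised code. A minimal NumPy sketch; the toy data and all variable names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(beta, X, y):
    # Prepend a column of ones so that x_i0 = 1 handles the intercept.
    Xb = np.column_stack([np.ones(len(y)), X])
    pi = sigmoid(Xb @ beta)
    return Xb.T @ (pi - y)   # k-th entry: sum_i x_ik (pi_i - y_i)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
print(grad(np.zeros(3), X, y))
```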

Backpropagation
Taking the tutorial derivation into account, we are repeatedly applying the chain rule:

∂Ji/∂βk = (∂Ji/∂zi)(∂zi/∂βk)
        = (∂Ji/∂πi)(∂πi/∂zi)(∂zi/∂βk).
This is a fundamental concept in deep learning.

Logistic regression
We conclude that the FONC is

Σ_{i=1}^n ( σ(β0 + Σ_{j=1}^p βj xij) − yi ) = 0,
Σ_{i=1}^n xi1 ( σ(β0 + Σ_{j=1}^p βj xij) − yi ) = 0,
⋮
Σ_{i=1}^n xip ( σ(β0 + Σ_{j=1}^p βj xij) − yi ) = 0.
This is a nonlinear system of equations, and it’s not possible to derive an analytical solution.
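Even though no analytical solution exists, a numerical optimiser finds the minimiser easily on a convex problem like this one. A hedged sketch using scipy.optimize.minimize (assuming SciPy is available; the simulated data and the BFGS choice are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(beta, X, y):
    pi = sigmoid(beta[0] + X @ beta[1:])
    pi = np.clip(pi, 1e-12, 1 - 1e-12)   # guard against log(0)
    return np.sum(-y * np.log(pi) - (1 - y) * np.log(1 - pi))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
true_beta = np.array([0.5, -1.0, 2.0])
y = (rng.uniform(size=100) < sigmoid(true_beta[0] + X @ true_beta[1:])).astype(float)

res = minimize(cost, x0=np.zeros(3), args=(X, y), method="BFGS")
print(res.x)   # roughly close to true_beta on this simulated data
```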

Optimisation in machine learning

Optimisation in machine learning
• Our goal in machine learning is generalisation, not optimisation.
• Therefore, there are some differences in how we use optimisation in machine learning compared to traditional optimisation.
• For example, the global minimum may overfit the data.

Convex optimisation
• Many learning algorithms, such as linear regression and logistic regression, lead to convex optimisation problems.
• We can usually solve convex problems in a reasonable amount of time by using algorithms that converge to a global minimum.
• A potential challenge in some cases, such as in the lasso, is the non-differentiability of the cost function at certain points.

Non-convex optimisation
• Many machine learning methods, including random forests, gradient boosting and deep learning, lead to non-convex optimisation.
• Unfortunately, finding a global minimum of a non-convex objective function is computationally intractable.
• Instead, we will simply try to find a point that makes the value of the cost function sufficiently low. We then use model selection methods to assess the quality of the solution.

Approximate minimisation
Figure from Deep Learning by Goodfellow, Bengio and Courville.

Local descent methods

Local descent methods
• We are generally not able to find analytical solutions to optimisation problems in machine learning, as in the logistic regression example.
• Therefore, we need to use optimisation algorithms to compute solutions.

Local descent methods
In local descent methods, we use local approximations of the objective function to incrementally improve the design point according to
θ(k+1) ← θ(k) + α(k)d(k),
where d(k) is the descent direction, α(k) is the step size, and k indexes the iteration.

Background: Taylor series approximation
The Taylor series of an infinitely differentiable function J at a point θ0 represents the function as the infinite power series
J(θ) = J(θ0) + (J′(θ0)/1!)(θ − θ0) + (J′′(θ0)/2!)(θ − θ0)² + ⋯ ,
provided that the series converges.
A Taylor series approximation of order d uses the first d terms of a Taylor series to approximate the function as a polynomial in the neighbourhood of θ0.
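A small sketch comparing first- and second-order Taylor approximations around θ0 = 0; the choice of cos as the example function and the evaluation point are illustrative assumptions.

```python
import numpy as np

theta0 = 0.0
f   = np.cos                   # example function
df  = lambda t: -np.sin(t)     # first derivative
d2f = lambda t: -np.cos(t)     # second derivative

def taylor1(t):
    # First-order approximation around theta0.
    return f(theta0) + df(theta0) * (t - theta0)

def taylor2(t):
    # Second-order approximation adds the curvature term.
    return taylor1(t) + 0.5 * d2f(theta0) * (t - theta0) ** 2

t = 0.5
print(f(t), taylor1(t), taylor2(t))  # 0.8776..., 1.0, 0.875
```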

Taylor approximation
First-order (left) and second-order (right) approximations around zero.

Newton’s method
In Newton’s method, we consider a second-order Taylor approximation of the cost function around the current point.
In the univariate case,
J(θ) ≈ J(θ(k)) + J′(θ(k))(θ − θ(k)) + ½ J′′(θ(k))(θ − θ(k))².

Newton’s method
J(θ) ≈ J(θ(k)) + J′(θ(k))(θ − θ(k)) + ½ J′′(θ(k))(θ − θ(k))².

Minimising the approximation with respect to θ gives

θ∗ = θ(k) − J′(θ(k)) / J′′(θ(k)).

Newton’s method
The update for Newton’s method is
θ(k+1) ← θ(k) − J′(θ(k)) / J′′(θ(k)).
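A univariate Newton iteration as a code sketch; the stopping rule matches the slides, while the example function and starting point are made-up assumptions.

```python
import numpy as np

def newton(df, d2f, theta0, tol=1e-8, max_iter=100):
    """Univariate Newton's method; df and d2f are the first and
    second derivatives of the cost function."""
    theta = theta0
    for _ in range(max_iter):
        step = df(theta) / d2f(theta)
        theta -= step
        if abs(step) < tol:   # |theta_(k+1) - theta_(k)| < tol
            break
    return theta

# Example: J(theta) = exp(theta) + exp(-theta), minimised at theta = 0.
df  = lambda t: np.exp(t) - np.exp(-t)
d2f = lambda t: np.exp(t) + np.exp(-t)
print(newton(df, d2f, theta0=1.0))   # approximately 0
```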

Newton’s method
Newton’s method:
θ(k+1) ← θ(k) − J′(θ(k)) / J′′(θ(k)).
• Starting from an initial guess θ(0), we repeat this update until |θ(k+1) − θ(k)| < ε for a small tolerance ε, in which case we say that the estimate has converged.
• Newton's method converges to a critical point.

Newton's method
Image credit: blog

Newton's method
In the multivariate case, the update for Newton's method becomes

θ(k+1) = θ(k) − (∇²J(θ(k)))⁻¹ ∇J(θ(k)),

where ∇J(θ(k)) and ∇²J(θ(k)) are the gradient and Hessian of J at the current design point.

Computational efficiency
We need to consider these two factors when discussing the computational efficiency of optimisation algorithms:
• The number of iterations required for convergence, and how it scales with n and p.
• The computational cost per iteration, and how it scales with n and p.

Newton's method
✓ Fast convergence in general.
✓ Scales well with the problem size in terms of the number of steps required for convergence.
✓ No tuning required.
✗ The computational cost per step does not scale well with the number of parameters.
✗ Can get stuck at a saddle point.

Gradient descent
Gradient descent is a first-order method. The update is

θ(k+1) = θ(k) − α(k)∇J(θ(k)),

where α(k) is called the learning rate.

Gradient descent
θ(k+1) = θ(k) − α(k)∇J(θ(k)).
• Remember that the partial derivatives measure how fast the cost function increases or decreases as we change each argument.
• Therefore, the function decreases the fastest in the direction of the negative gradient.

Gradient descent
Image credit: Towards Data Science

Gradient descent
The learning rate is a hyperparameter of the optimisation algorithm.
• A sufficiently small α(k) guarantees that the update will reduce the value of the cost function. However, an α(k) that is too small causes slow convergence.
• Increasing the learning rate can speed up the convergence. However, an α(k) that is too high can overshoot the solution, leading to instability and even divergence.

Gradient descent
Learning rate too small:
Figure from Hands-on Machine Learning with Scikit-Learn, Keras and Tensorflow by Aurélien Géron.

Gradient descent
Learning rate too large:
Figure from Hands-on Machine Learning with Scikit-Learn, Keras and Tensorflow by Aurélien Géron.

Computational efficiency
• Gradient descent converges more slowly than Newton's method, since it only uses first-order information.
• The number of steps required for convergence can be high, depending on the problem.
• Because we only need to compute the gradient, the computational cost per step is proportional to the number of parameters.

Gradient descent
✓ Simplicity.
✓ Each step only requires the calculation of the gradient.
✗ The convergence is inferior to Newton's method and can be very slow in some problems.
✗ Requires a choice of learning rate.
✗ The performance is sensitive to the choice of learning rate.
✗ The performance can be sensitive to the scaling of the predictors.

Local descent methods
Key takeaway: "Derivatives are useful in optimisation because they provide information about how to change the current design point to improve the objective function." – Wheeler and Kochenderfer (2019)

Stochastic gradient descent

Stochastic gradient descent
• Stochastic gradient descent (SGD) is a modification of gradient descent that improves computational efficiency and reduces the memory requirement.
• SGD currently plays a central role in machine learning practice as it's suitable for very large datasets. It's the optimisation approach of choice in deep learning.
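As a concrete baseline before the minibatch variants that follow, here is a minimal batch gradient descent sketch for the logistic regression cost derived earlier. The simulated data, learning rate and iteration count are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, lr=0.5, n_iter=2000):
    # Prepend a column of ones for the intercept.
    Xb = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        pi = sigmoid(Xb @ beta)
        grad = Xb.T @ (pi - y) / len(y)   # average full-batch gradient
        beta -= lr * grad                 # step along the negative gradient
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (rng.uniform(size=200) < sigmoid(0.5 + X @ np.array([-1.0, 2.0]))).astype(float)
print(gradient_descent(X, y))   # roughly [0.5, -1.0, 2.0]
```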
Stochastic gradient descent
For simplicity, consider a cost function of the form

J(θ) = Σ_{i=1}^n L(yi, f(xi; θ)).

Each step of batch gradient descent requires computing the full gradient

∇J(θ) = Σ_{i=1}^n ∇L(yi, f(xi; θ)).

The computational cost is proportional to the number of training cases. If n is very large, it can become prohibitive.

Minibatch stochastic gradient descent
Let B denote a set of nbatch indexes selected at random from {1, …, n}. The insight of SGD is that the sample average

∇J_B(θ) = (1/nbatch) Σ_{i∈B} ∇L(yi, f(xi; θ))

can be a good approximation to the full-sample average (1/n) Σ_{i=1}^n ∇L(yi, f(xi; θ)).

Minibatch stochastic gradient descent
In minibatch stochastic gradient descent, we perform the update

θ(k+1) = θ(k) − α(k) (1/nbatch) Σ_{i∈B(k)} ∇L(yi, f(xi; θ)).

The computational cost per iteration is much lower than in gradient descent.

Minibatch stochastic gradient descent
• We typically choose the batch size to be a small power of two, such as 16, 32, 64 or 128. Crucially, we hold it fixed as n grows.
• A higher batch size improves the gradient approximation but increases the computational cost per iteration.
• Standard SGD refers to the case where we use a single observation for each update.
• In practice, we select the batches such that the optimisation algorithm systematically cycles through all the observations.
• We call each complete scan through the data an epoch.

Minibatch gradient descent
• An SGD step is noisy: it may increase or decrease the cost function.
• Unlike batch gradient descent, SGD does not converge. After a sufficient number of steps, the algorithm wanders around a local minimum.

Minibatch stochastic gradient descent
Algorithm: Minibatch stochastic gradient descent
1: Select the batch size nbatch.
2: Set an initial guess θ.
3: repeat
4:   Randomly split the training data into M minibatches of size nbatch. Let Bm be the set of indexes for minibatch m.
5:   for m = 1, …, M do
6:     Update θ ← θ − α (1/nbatch) Σ_{i∈Bm} ∇L(yi, f(xi; θ)).
7:   end for
8: until the cost function reaches a sufficiently low value

Stochastic gradient descent
Image credit: Towards Data Science.
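Finally, here is a minimal NumPy sketch of minibatch SGD for the same logistic regression cost, following the algorithm above. The simulated data, learning rate, batch size and epoch count are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_sgd(X, y, lr=0.1, n_batch=32, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    Xb = np.column_stack([np.ones(len(y)), X])   # intercept column
    beta = np.zeros(Xb.shape[1])
    n = len(y)
    for _ in range(n_epochs):                    # one epoch = one full pass
        perm = rng.permutation(n)                # shuffle, then cycle through
        for start in range(0, n, n_batch):
            idx = perm[start:start + n_batch]    # minibatch B_m
            pi = sigmoid(Xb[idx] @ beta)
            grad = Xb[idx].T @ (pi - y[idx]) / len(idx)
            beta -= lr * grad                    # noisy descent step
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 2))
y = (rng.uniform(size=1000) < sigmoid(0.5 + X @ np.array([-1.0, 2.0]))).astype(float)
print(minibatch_sgd(X, y))   # noisy, but roughly [0.5, -1.0, 2.0]
```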