Machine Learning and Data Mining in Business
Lecture 5: Training Machine Learning Models (Part 1)
Discipline of Business Analytics
Copyright By PowCoder代写 加微信 powcoder
Lecture 5: Training Machine Learning Models (Part 1)
Learning objectives
• Regularised risk minimisation. • Maximum likelihood.
• Introduction to optimisation.
1. Regularised risk minimisation 2. Maximum likelihood
3. Basics of optimisation
4. Univariate optimisation
5. Example: Bernoulli model
Building blocks of a learning algorithm
• Learning rule.
• Computational algorithm.
Regularised risk minimisation
Empirical risk minimisation
In empirical risk minimisation (ERM), we fit the model by minimising the training error,
1n θ=argmin n L yi,f(xi;θ) .
Example: linear regression
For a linear regression under the squared error loss, ERM corresponds the OLS method for parameter estimation.
We learn the parameters as
n p 2
β=argmin yi−β0−βjxij ,
β where β = (β0,β1,…,βp).
Regularised risk minimisation
ERM typically leads to overfitting. In regularised risk minimisation, we add a complexity penalty to the objective function,
1n θ=argmin n L yi,f(xi;θ) +λC(θ) ,
where λ ≥ 0 is a hyperparameter and C(θ) measures the complexity of the model as function of the parameters.
Example: Ridge regression
2 n p p
βridge =argmin yi −β0 −βjxij +λβj2
β i=1 j =1 j =1
This is a special case of regularised risk minimisation with:
• LossfunctionL(y,f(x))=(y−f(x))2. • Modelf(xi;β)=β0+pj=1βjxij
• RegulariserC(β)=pj=1βj2.
Regularisation
1n θ=argmin n L yi,f(xi;θ) +λC(θ) ,
• When λ = 0, the method is equivalent to empirical risk minimisation.
• If λ → ∞, the learning algorithm fits the simplest possible model, such as a model that makes a constant prediction.
• For λ > 0, the higher λ, the higher the complexity term, and therefore the lower the complexity of the learned model.
Regularised risk minimisation
1n θ=argmin n L yi,f(xi;θ) +λC(θ)
The advantage of this method comes from the bias-variance trade-off:
• If we increase λ, we decrease the variance by inducing lower model complexity.
• On the other hand, decreasing model complexity leads to higher bias.
We use model selection methods to find the value of λ that optimally balances the two sources of error.
Types of regularisation
Three most common types of regularisation in machine learning are:
• l2 regularisation: C(θ) = ∥θ∥2
• l1 regularisation: C(θ) = ∥θ∥1 (induces sparsity)
• Elastic net regularisation: C(θ) = α∥θ∥2 + (1 − α)∥θ∥1
Learning rule
1n θ=argmin n L yi,f(xi;θ) +λC(θ)
This equation describes most supervised learning methods . Choose a model f, loss function L, and regulariser C, and you have a learning algorithm.
Maximum likelihood
Maximum likelihood
• We use the principle of maximum likelihood to train probabilistic models for machine learning, such as generalised linear models.
• For example, we use the method of maximum likelihood to estimate the logistic regression model and other models for classification.
Probabilistic model
Let Y1, Y2, . . . Yn be an IID sample from a distribution with probability mass function (pmf) or probability density function (pdf) denoted as pY (y; θ).
Our objective is to estimate the parameter vector θ.
Example: Bernoulli distribution
Let Y1, . . . , Yn ∼ Bernoulli(π). That is,
1 with probability π Yi =
0 with probability 1 − π,
for i = 1,…,n.
Our objective is to estimate π. The probability mass function is
pY (y; π) = πy(1 − π)(1−y). The formula for the pmf or pdf is always provided.
Likelihood function
Define the likelihood function as
L(θ) = pY (yi; θ)
In words, the likelihood is the product of the pmf or pdf evaluated for each data point. It’s a function of the parameters.
Example: Bernoulli distribution
The probability mass function is
The likelihood is
p(y; π) = πy(1 − π)(1−y)
L(π) = πyi (1 − π)(1−yi)
Log-likelihood
The log-likelihood is
l(θ) = log(L(θ))
Background: properties of logarithms
log(uv) = log(u) + log(v)
log(uv ) = v log(u) log(1/u) = − log(u)
Example: Bernoulli distribution
log(L(π)) n
= log πyi (1 − π)(1−yi)
= log πyi (1 − π)(1−yi) i=1
= yi log(π) + (1 − yi) log(1 − π)
= y log(π) + (n − y ) log(1 − π) ii
Maximum likelihood estimator
The maximum likelihood estimator (MLE) is the value of θ that maximises the log-likelihood l(θ),
θ = argmax l(θ) θ
Maximising the log-likelihood is the same as maximising the likelihood, but the log-likelihood is much easier to work with.
Example: Bernoulli distribution
Based on the log-likelihood that we derived earlier, the MLE for π is
π = argmax y log(π) + (n − y ) log(1 − π)
Probabilistic machine learning
• In the probalistic approach to supervised learning, we specify a model pY |X (y|x; θ) for the conditional distribution of the response given the predictors.
• We then learn the parameter vector θ by (regularised) maximum likelihood.
Example: logistic regression
Yi|Xi = xi ∼ Bernoulli π(xi), π(xi) = σf(xi),
p f(xi)=β0 +βjxij,
σ(z) = 1 . 1 + exp(−z)
Our objective is to estimate β.
Logistic regression
When talking about the Bernoulli distribution, the probability mass function evaluated for observation i is
p(yi; π) = πyi (1 − π)1−yi
In the context of the logistic regression model, the probability mass
function evaluated for observation i is
p (y |x ) = πyi (1 − π )1−yi ,
definingπ =σβ +p βx . i 0 j=1jij
Logistic regression
We can write the likelihood function as
L(β) = pY |X(yi|xi)
= πyi (1 − πi)1−yi i
Logistic regression
The log-likelihood is
l(β) = log
= yi log(πi) + (1 − yi) log(1 − πi)
n πyi (1 − πi)1−yi
=yilogσ β0+βjxij i=1 j=1
+(1−yi)log1−σ β0 +βjxij j=1
Negative log-likelihood loss
In machine learning, we work with minimisation problems by convention. Therefore, a general learning rule is
θ = argmin − l(θ)/n + λ C(θ). θ
We can write this as
θ = argmin (1/n) Lyi, f(xi) + λ C(θ),
where we call
Lyi , f (x) = − log pY |X (yi |xi ) the negative log-likelihood loss (NLL loss).
Example: logistic regression
Mutiplying the log-likelihood by −1, the cost function for training the logistic regression model is
J(β) = −yi log(πi) − (1 − yi) log(1 − πi).
The loss function
L(y, π) = −y log(π) − (1 − y) log(1 − π)
is also called the cross-entropy loss or log-loss in machine learning.
Why maximum likelihood?
The MLE has many useful theoretical properties that makes it an apppealing choice of estimator. Under technical conditions:
• The MLE is consistent. That is, θ −→ θ⋆, where θ⋆ is the
true parameter.
• The MLE is asymptotically normal.
• The MLE is asymptotically efficient. That is, the MLE has the lowest variance among all consistent estimators of θ when
Why maximum likelihood?
• Another justification for MLE is that the resulting predictive
distribution p(y|x;θMLE) is a close as possible, in a technical sense, to the empirical distribution of the data.
• These results motivate the choice of maximum likelihood as the preferred learning principle for many problems in machine learning.
Optimisation
θ = argmin − l(θ)/n + λ C(θ). θ
This equation defines the estimator, but doesn’t give us any idea about how to compute it!
Basics of optimisation
Optimisation problem
The basic optimisation problem is minimise f (x)
subjectto x∈X
where f is the objective function, x is a design point, and X is the feasible set.
Optimisation problem
In machine learning, we want to find the parameter value θ that minimises a cost function J given by the learning rule,
minimise J (θ) θ
subject to θ ∈ Θ,
where Θ is the feasible set of valid parameter values.
A solution to the optimisation problem minimise f (x)
subjectto x∈X
is a point x∗ such that f(x∗) ≤ f(x) for all x ∈ X The solution may not be unique.
Global minimum vs. local minimum
• We refer to the solution as the global minimum of the objective function.
• Unfortunately, it may be difficult to establish that a candidate solution is a global minimum.
• Often the best that we can is to verify whether a design point is a local minimum.
Global minimum vs. local minimum
Image credit: Inductiveload. Public domain, Wikimedia Commons.
Local minimum
Definition (local minimum). A point x is a local minimum of a function f if there exists a δ > 0 such that f(x) ≤ f(x′) whenever ∥x′ − x∥ < δ.
Local minima
• There are two types of local minima: strong local minima and weak local minima.
• A point x is a strong local minimum of the objective function if there exists a δ > 0 such that f(x) < f(x′) whenever
∥x′ − x∥ < δ.
Local minima
Figure from Algorithms for Optimisation by M.J Kochenderfer and T.A. Wheeler.
Univariate optimisation
Univariate optimisation
For simplicity, we start with the univariate minimisation problem minimise f(x)
where f : R → R is a twice-differentiable objective function.
Derivative
The derivative of the function at a certain point corresponds the slope of the tangent line to the graph at that point.
The derivative measures how quickly the function increases or decreases as we change the input.
Critical point
A critical point of f is a point x that has a derivative of zero, f′(x) = 0.
Our first step is try to find the critical points of the function, since minima can only occur at these points.
Critical point
A critical point can be a local minimum, local maximum or inflexion point.
Image credit: Deep Learning by , , and .
Necessary conditions to be a local minimum
A local minimum must satisfy the following conditions:
1. f′(x) = 0, the first-order necessary condition (FONC).
2. f′′(x) ≥ 0, the second-order necessary condition (SONC). If f′′(x) = 0 the point may or may not be a local minimum.
Sufficient conditions to be a local minimum
A point is guaranteed to be a strong local minimum if:
1. f′(x) = 0, the first-order necessary condition (FONC).
2. f′′(x) > 0.
We call these two conditions together the second-order sufficient condition (SOSC).
Convex objective functions
Convex objective functions have the nice property that a local minimum of such a function is automatically a global minimum.
A local minimum of a convex function occurs at the bottom of a bowl.
Convex functions
Lemma (second derivative characterisation of convexity). Let
X ⊆ R be an interval. A twice-differentiable function f : X → R is convex if and only if f′′(x) ≥ 0 for all x ∈ X.
If f′′(x) > 0 for all x ∈ X, then f : X → R is strictly convex. However, the converse does not hold.
Convex optimisation
• If the objective function is convex, then a local minimum is guaranteed to be a global minimum.
• If the objective function is strictly convex, then it has at most one critical point. Such a point, if it exists, is the unique global minimum.
Example: Bernoulli model
Bernoulli model
We want to solve
minimise J(π) π∈(0,1)
where the cost function is the negative log-likelihood
J(π) = − y log(π) − n − y log(1 − π), ii
as derived earlier in the lecture.
Background: differentiation rules
• Constant factor rule: if f(x) = cg(x), then f′(x) = cg′(x).
• Sum rule: if f(x) = g(x) + h(x), then f′(x) = g′(x) + h′(x). • Natural logarithm rule: if f(x) = log(x), then f′(x) = 1/x.
• Chain rule: if h(x) = f(g(x)), then h′(x) = f′(g(x))g′(x).
• Reciprocal rule: if f(x) = 1/g(x), then f′(x) = −g′(x)/g(x)2.
Bernoulli model: derivative
J(π) = − y log(π) − n − y log(1 − π) ii
Taking the derivative with respect to π,
′ yi n−yi J(π)=− π + 1−π
Bernoulli model: first-order condition
The MLE satisfies the first-order condition
yi n−yi −+ =0,
which implies
π 1 − π Solving for π, we find the unique critical point
ni=1 yi π= n .
π 1 − π yi =n−yi.
Bernoulli model: second-order condition
From the derivative
′ yi n−yi J(π)=− π + 1−π ,
we obtain the second derivative
′′ yi n−yi J(π)=π2 +(1−π)2,
which is positive for all π ∈ (0, 1). Therefore, J is convex on (0, 1) and we can conclude that π is the global minimum.
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com