Machine Learning and Data Mining in Business
Lecture 2: Machine Learning Fundamentals
Discipline of Business Analytics
Learning objectives
• Predictions and decisions.
• Building blocks of learning algorithms.
• Overfitting and the bias-variance trade-off.
Basics of supervised learning
Supervised learning
• In supervised learning, we have a training set of input-output pairs {(x1, y1), . . . , (xn, yn)}, and the goal is to build a model that predicts the output Y as a function of the inputs X1, . . . , Xp.
• In classification, the output Y is a nominal variable.
• In regression, the output Y is a continuous variable.
Example: credit scoring
• Output: whether a bank client defaults on a loan.
• Inputs: income, credit history, savings, assets, loan term,
among others.
• Available data: past loans.
Example: predicting house prices
• Output: the sale price of a residential property.
• Inputs: square footage, number of bedrooms, number of bathrooms, number of garage spots, neighbourhood, house quality, among several others.
• Available data: past house sales.
Training, validation and test sets
We often split the data into training, validation and test sets at the start of a machine learning project:
• Training set. For training machine learning models.
• Validation set. For measuring and comparing the accuracy of
different models.
• Test set. For measuring the accuracy of the final model.
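As a rough sketch of how such a split is often made in code (the 60/20/20 proportions, the synthetic arrays X and y, and the use of scikit-learn's train_test_split are illustrative assumptions, not part of the lecture material):

    # A minimal 60/20/20 train/validation/test split using scikit-learn.
    # The synthetic data and the split proportions are assumptions for illustration.
    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))                      # 1000 cases, 5 inputs
    y = X @ rng.normal(size=5) + rng.normal(size=1000)  # synthetic output

    # First split off the test set, then split the remainder into train/validation.
    X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=1)

    print(len(y_train), len(y_valid), len(y_test))      # 600 200 200

The test set is set aside and used only once, to evaluate the final model.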
A metric is a measure of the quality of a model’s predictions computed on the validation set or any other data.
Example: mean squared error
The most common metric for regression is the mean squared error (MSE),
MSE = (1/nvalid) Σ (ŷi − yi)²,
where the sum runs over the validation cases i = 1, . . . , nvalid, and yi and ŷi are the response values and predictions for the validation set.
Example: accuracy
The accuracy score is the proportion of correctly classified instances,
Accuracy = (1/n) Σ 1(ŷi = yi),
where the sum runs over i = 1, . . . , n, 1(·) is the indicator function, and ŷi is the predicted class for instance i.
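As a minimal sketch of these two metrics computed directly from their definitions (the small NumPy arrays below are made up for illustration):

    # Validation MSE for a regression model and accuracy for a classifier.
    # The response values and predictions are invented for illustration.
    import numpy as np

    # Regression: validation responses and predictions
    y_valid = np.array([3.0, -0.5, 2.0, 7.0])
    y_pred = np.array([2.5, 0.0, 2.0, 8.0])
    mse = np.mean((y_pred - y_valid) ** 2)   # (1/n_valid) * sum of squared errors

    # Classification: true classes and predicted classes
    c_valid = np.array([1, 0, 1, 1, 0])
    c_pred = np.array([1, 0, 0, 1, 0])
    accuracy = np.mean(c_pred == c_valid)    # proportion of correctly classified cases

    print(mse, accuracy)                     # 0.375 0.8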
Generalisation
Our goal in supervised learning is to generalise to future data. To generalise means to predict data that the learning algorithm has never seen before.
Predictions and decisions
Predictions and decisions
We want to turn predictions into actions in the real world. How can we decide which action is best?
Predictions and decisions
• In decision theory, we assume that a decision maker or agent has a set of possible actions, A, to choose from.
• Each possible action has costs and benefits which depend on the state of nature h ∈ H.
• The state of nature is uncertain, but the decision maker can partially infer it from an observed signal x.
Example: fraud detection
• Agent: fraud detection system.
• Situation: a credit card transaction comes in.
• Decision: to authorise or block the transaction.
• State of nature: whether the transaction is legitimate or fraudulent.
Example: credit scoring
• Agent: lender.
• Situation: a loan application comes in.
• Decision: to approve or reject the loan.
• State of nature: whether the applicant would be able to repay the loan.
Example: content recommendation
• Agent: recommender system.
• Situation: user visits a website or app.
• Decision: which content to recommend.
• State of nature: whether the content is relevant to the user.
Loss function
A loss function L(h, a) specifies the loss from taking action a ∈ A when the state of nature is h ∈ H.
The loss function represents the preferences of the decision maker over different outcomes.
Example: credit risk
[Loss table L(h, a): rows are the actions (Approve, Reject), columns are the states of nature (Repayment, Default); each entry gives the loss of taking that action under that state.]
The risk of an action is the expected loss R(a|x) = E_{p(h|x)}[L(h, a)].
Optimal policy
The optimal policy specifies what action to take in each context so as to minimize the risk:
π∗(x) = argmin_{a ∈ A} E_{p(h|x)}[L(h, a)].
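A small numerical sketch of this rule in the credit setting (the loss values and the posterior probabilities below are hypothetical, chosen only to illustrate the computation):

    # Choosing the risk-minimising action given a signal x.
    # The loss matrix and p(h|x) are hypothetical numbers for illustration.
    import numpy as np

    actions = ["approve", "reject"]

    # L[h, a]: loss of action a when the state of nature is h
    L = np.array([
        [0.0, 10.0],    # state = repayment: approving costs 0, rejecting forgoes profit
        [100.0, 0.0],   # state = default: approving loses the loan, rejecting costs 0
    ])

    p_h_given_x = np.array([0.95, 0.05])     # inferred from the observed signal x

    risk = p_h_given_x @ L                   # R(a|x) = sum over h of p(h|x) * L(h, a)
    best_action = actions[int(np.argmin(risk))]
    print(dict(zip(actions, risk)), "->", best_action)   # {'approve': 5.0, 'reject': 9.5} -> approve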
Prediction problem
• In a prediction problem, we can assume that the action has no effect on the state of nature.
• If the action has a potential effect on the state of nature, this is an intervention problem. This is a much harder problem that requires causal inference.
Prediction problem
Supervised learning is appropriate.
Intervention problem
Supervised learning does not address this problem.
Classification
• In classification, the possible states of nature correspond to class labels, H = Y = {1, . . . , C}.
• The action corresponds to a prediction, A = Y. The prediction can map into an action in the real world.
Example: credit risk
Prediction    Action
Default       Reject
Repayment     Approve
Application: cost-sensitive classification
Loss table (rows: true class y, columns: prediction ŷ):

          ŷ = 0    ŷ = 1
y = 0       0       LFP
y = 1      LFN       0
Application: cost-sensitive classification
The agent should predict y = 1 if
R(1|x) < R(0|x),
that is, if
(1 − π(x)) LFP < π(x) LFN,
where π(x) = P(Y = 1|X = x).
Therefore, the optimal classification rule is to predict y = 1 if
π(x) > LFP / (LFP + LFN).
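A short sketch of this classification rule (the losses and predicted probabilities are hypothetical):

    # Cost-sensitive classification: predict y = 1 when pi(x) > LFP / (LFP + LFN).
    # The losses and predicted probabilities are made up for illustration.
    import numpy as np

    L_FP, L_FN = 1.0, 5.0                      # a false negative is 5 times as costly
    threshold = L_FP / (L_FP + L_FN)           # = 1/6, below the default 0.5 cut-off

    pi_x = np.array([0.05, 0.20, 0.55, 0.90])  # estimated P(Y = 1 | X = x) for four cases
    y_hat = (pi_x > threshold).astype(int)

    print(round(threshold, 3), y_hat)          # 0.167 [0 1 1 1]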
Population risk
In supervised learning, let f(x) be a predictive function and p(y, x) be the true joint distribution.
We define the population risk, generalisation error, or test error as
R(f) = E_{p(y,x)}[L(Y, f(X))].
The optimal predictive function f∗ minimises the population risk.
Building blocks of a learning algorithm
Building blocks of a learning algorithm
We can understand supervised learning algorithms in terms of three building blocks:
1. Model.
2. Learning rule.
3. Computational algorithm.
Model
In supervised learning, a model is the set of predictive functions that the algorithm can learn from data.
Even though in principle the predictive functions f can be any function, the model allows us to make progress by restricting the set of functions that will be considered by the learning algorithm.
“All models are wrong, but some are useful.” – George Box.
Example: linear regression
The linear regression model is the set of linear functions
{f(x) = β0 + β1 x1 + … + βp xp : β ∈ R^(p+1)},
where β = (β0, β1, . . . , βp) is the parameter vector.
In practice, we simply state the formula for the predictive function
f(x) = β0 + β1 x1 + … + βp xp.
Why would we want to restrict the set of functions that the learning algorithm can learn?
• Inductive bias: useful assumptions help the learning algorithm to generalise.
• It’s not computationally feasible to search over a large space of functions.
Parametric vs. nonparametric models
• A parametric model is a model that can be indexed by a finite number of parameters.
• A nonparametric model is a model that either cannot be indexed by a finite number of parameters or does not have a fixed structure (in practice, this means that the number of parameters grows with the number of observations).
• Parametric models: linear regression, logistic regression, polynomial regression.
• Nonparametric models: k-Nearest Neighbours, decision trees, random forests, gradient boosting, deep feedforward networks.
Parametric vs. nonparametric models
• Parametric models make relatively strong assumptions. Parametric methods are less flexible but are easier to train and usually more stable and interpretable.
• Nonparametric models avoid assumptions. Nonparametric methods are flexible but typically more difficult to train, less stable, and less interpretable.
Probabilistic models
A probabilistic model for supervised learning is a model for p(y|x) or p(y, x).
For example, the logistic regression model
Yi | Xi = xi ∼ Bernoulli(σ(β0 + Σ_{j=1}^p βj xij))
is probabilistic.
Generalised linear models
A generalised linear model (GLM) has three components:
1. A conditional distribution p(y|x) with mean parameter μi.
2. A linear predictive function f(xi) = β0 + Σ_{j=1}^p βj xij.
3. A smooth and invertible link function l such that l(μi) = β0 + Σ_{j=1}^p βj xij.
Example: logistic regression
Yi | Xi = xi ∼ Bernoulli(πi)   (conditional distribution)
logit(πi) = β0 + Σ_{j=1}^p βj xij   (linear predictor)
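A rough sketch of fitting this GLM on synthetic data using the statsmodels library (the data-generating coefficients are invented; scikit-learn's LogisticRegression would also work, but statsmodels makes the distribution and link structure explicit):

    # Logistic regression fitted as a GLM: Bernoulli conditional distribution, logit link.
    # The synthetic data and true coefficients are assumptions for illustration.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))
    logits = 0.5 + 1.0 * X[:, 0] - 2.0 * X[:, 1]
    y = rng.binomial(1, 1 / (1 + np.exp(-logits)))      # Bernoulli outputs

    model = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial())
    fit = model.fit()
    print(fit.params)                                   # estimates of (beta0, beta1, beta2)
    print(fit.predict(sm.add_constant(X))[:5])          # fitted probabilities pi_i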
Learning rule
The learning rule specifies how the algorithm will learn the model. There are two approaches:
1. Optimisation.
2. Bayesian methods.
Most popular methods are based on optimisation.
Empirical risk minimisation
In the empirical risk minimisation (ERM) approach, we train the model by minimising the empirical risk or training error
f̂ = argmin_{f ∈ F} (1/n) Σ_{i=1}^n L(yi, f(xi)),
where F denotes the model and L is the loss function for optimisation.
Empirical risk minimisation
In practice, we use ERM for parameter estimation:
θ̂ = argmin_θ (1/n) Σ_{i=1}^n L(yi, f(xi; θ)),
where θ is the parameter vector.
In words, we select the parameters that minimise the training error.
The learned function is then f(x; θ̂). We often just write
f̂(x) = f(x; θ̂).
Note: we will use the terms “learn”, “fit”, “train”, and “estimate” the model interchangeably.
Example: linear regression
For the linear regression model with squared error loss, empirical risk minimisation corresponds to the ordinary least squares (OLS) method for parameter estimation.
We learn the parameters as
β̂ = argmin_β Σ_{i=1}^n (yi − β0 − Σ_{j=1}^p βj xij)²,
where β = (β0, β1, . . . , βp), leading to the estimated predictive function
f̂(x) = β̂0 + β̂1 x1 + … + β̂p xp.
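A minimal sketch of this estimator on synthetic data, using NumPy's least-squares routine (the sample size, coefficients, and noise level are illustrative choices):

    # OLS as empirical risk minimisation with squared error loss.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 200, 3
    X = rng.normal(size=(n, p))
    beta_true = np.array([1.0, 2.0, -1.0, 0.5])            # (beta0, beta1, beta2, beta3)
    y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.5, size=n)

    # Minimising the training error in beta has the closed-form least-squares solution.
    X1 = np.column_stack([np.ones(n), X])                   # add the intercept column
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

    # Estimated predictive function f_hat(x) = beta0_hat + beta1_hat*x1 + ... + betap_hat*xp
    f_hat = lambda x_new: beta_hat[0] + x_new @ beta_hat[1:]
    print(beta_hat, f_hat(np.zeros(p)))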
Regularisation
ERM typically leads to overfitting. To address this problem, we add a regularisation term to the objective function for training the model,
θ̂ = argmin_θ {(1/n) Σ_{i=1}^n L(yi, f(xi; θ)) + λ Ω(θ)},
where λ ≥ 0 is a hyperparameter and Ω(θ) is some form of complexity penalty.
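For example, with squared error loss and Ω(θ) = ‖β‖², this is ridge regression. A rough sketch using scikit-learn (alpha plays the role of λ; its value here is arbitrary):

    # Ridge regression: squared error loss plus an L2 complexity penalty.
    # The data and the value of alpha (lambda) are illustrative only.
    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

    model = Ridge(alpha=1.0)        # larger alpha shrinks the coefficients more strongly
    model.fit(X, y)
    print(model.coef_)              # regularised parameter estimates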
Supervised learning in one equation
θ̂ = argmin_θ {(1/n) Σ_{i=1}^n L(yi, f(xi; θ)) + λ Ω(θ)}
This equation describes a wide range of methods. Choose a model f, a loss function L, and a regulariser Ω, and you have a learning algorithm.
Computational algorithm
• The learning rule states an optimisation problem, but does not tell us how to solve it in practice.
• We need to use an optimisation algorithm to compute the solution. Often, there are multiple algorithms available.
• Computation is a major theme in machine learning.
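As an example, a minimal gradient descent loop for ridge-regularised linear regression might look like the sketch below (the step size, iteration count, and λ are arbitrary choices; in practice we would rely on a well-tested optimiser):

    # Gradient descent on the regularised empirical risk for linear regression.
    # Step size, iteration count and lambda are illustrative; the intercept is
    # penalised too for simplicity.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 200, 5
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # include intercept column
    beta_true = rng.normal(size=p + 1)
    y = X @ beta_true + rng.normal(scale=0.5, size=n)

    lam, step, iters = 0.1, 0.05, 2000
    beta = np.zeros(p + 1)
    for _ in range(iters):
        residual = X @ beta - y
        grad = (2 / n) * X.T @ residual + 2 * lam * beta   # gradient of MSE + lam * ||beta||^2
        beta -= step * grad

    print(beta)      # approximate minimiser of the regularised training error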
Study guide: learning algorithms
1. What is the model? Is it parametric or nonparametric? Is it probabilistic? What set of assumptions does it use to generalise?
2. What is the learning rule? What are the components of the cost function for training the model? Why does the loss function make sense for this problem? What type of regularisation does it use?
3. What is the computational algorithm? Can you write the steps? What are the computational challenges? Is the solution exact or approximate?
Overfitting and the bias-variance trade-off
Challenges in machine learning
The two main challenges in machine learning are:
• Overfitting.
• The curse of dimensionality.
Complexity
• The complexity of a learning algorithm is its ability to fit the data.
• The capacity of a model is its ability to express a wide range of predictive functions.
Overfitting occurs when the model is too complex, causing the learning algorithm to memorise random variations in the training data that do not generalise.
A model that overfits has low training error but predicts poorly.
Underfitting occurs when the model is not sufficiently flexible to describe the relationship between the output and the inputs.
A model that underfits struggles to obtain a low training error.
Illustration: nonlinear regression function
[Figure: three fitted curves illustrating underfitting, a near optimal estimator, and overfitting.]
Our challenge is therefore to find the right model complexity.
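One way to see this in practice is to compare polynomial fits of increasing degree on a validation set; a rough sketch on synthetic data (the degrees, sample sizes, and data-generating process are made up):

    # Under- and overfitting with polynomial regression of varying degree.
    # The data-generating process and candidate degrees are chosen for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=200)
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=200)   # nonlinear regression function plus noise

    x_train, y_train = x[:30], y[:30]                     # small training set makes overfitting visible
    x_valid, y_valid = x[30:], y[30:]

    for degree in [1, 3, 10, 15]:
        coefs = np.polyfit(x_train, y_train, degree)
        train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
        valid_mse = np.mean((np.polyval(coefs, x_valid) - y_valid) ** 2)
        print(degree, round(train_mse, 3), round(valid_mse, 3))

Typically, very low degrees underfit (high error on both sets), while very high degrees drive the training error down but push the validation error back up.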
Optimal capacity
Tip: memorise this figure.
Figure from Deep Learning (2016) by Goodfellow, Bengio, and Courville.
Overfitting has many faces
Many factors can contribute to overfitting:
• Excessive capacity.
• The hyperparameters give the learning algorithm too much flexibility to fit the data.
• Noisy data.
• Large number of inputs.
• Irrelevant inputs.
• Outliers.
How to avoid overfitting?
• Inductive bias.
• Regularisation.
• Model selection.
• Feature selection.
• Model ensembles.
The bias-variance trade-off
A good way to understand the problem of overfitting is to relate it to the bias and variance of the estimator.
Bias and variance
Figure from A Few Useful Things to Know about Machine Learning by Pedro Domingos.
Illustration: high bias
Illustration: no bias but high variance
Illustration: no bias
The bias-variance trade-off
Tip: memorise this figure.
The bias-variance trade-off
• The higher the model complexity, the lower the bias. However, increasing model complexity typically increases the variance.
• A method that overfits has high variance, but possibly low bias.
• A method that underfits has high bias, but possibly low variance.
• Our challenge is to find the optimal balance between bias and variance that minimises the expected prediction error.
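A small Monte Carlo sketch of this trade-off, using a shrunken sample mean as the estimator (the true mean, noise level, and shrinkage values are invented for illustration):

    # Bias-variance trade-off for a shrunken mean estimator, estimated by simulation.
    # The true mean, noise level and shrinkage amounts are illustrative choices.
    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n, reps = 2.0, 4.0, 10, 20000

    for shrink in [0.0, 0.3, 0.6, 0.9]:
        estimates = np.array([(1 - shrink) * rng.normal(mu, sigma, n).mean() for _ in range(reps)])
        bias = estimates.mean() - mu
        variance = estimates.var()
        print(shrink, round(bias, 3), round(variance, 3), round(bias**2 + variance, 3))

More shrinkage lowers the variance but increases the squared bias; among the values tested here, the smallest expected squared error is attained at an intermediate amount of shrinkage.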