Machine Learning and Data Mining in Business
Lecture 4: Practical Methodology
Discipline of Business Analytics
Lecture 4: Practical Methodology
Learning objectives
• Interpretability.
• Model selection.
• Hyperparameter optimisation.
• Feature selection.
• Model stacking.
• Model evaluation.
Lecture 4: Practical Methodology for Machine Learning
1. Developing successful machine learning projects
2. Interpretability
3. Model selection
4. Hyperparameter optimisation
5. Feature selection
6. Model stacking
7. Model evaluation
Developing successful machine learning projects
Some principles to follow
1. It’s generalisation that counts.
2. More data beats cleverer algorithms.
3. Data understanding.
4. Develop an end-to-end system as quickly as possible.
5. Experimentation.
6. Rapid iteration.
7. Meaningful baselines.
8. Learn multiple models.
9. Robust evaluation.
10. Don’t just optimise your metrics, eliminate ways your model can fail.
Understanding why a machine learning model makes certain predictions can be important for many reasons:
• It allows us to better trust the model.
• It can provide insights to improve the model.
• It can provide business insights.
• It can help to support human decisions based on the model’s predictions.
• It may be essential to meet regulatory requirements, such as in credit scoring.
Interpretability
Useful questions from Howard and Gugger (2019):
• How confident are we in a certain prediction?
• For a particular prediction, what are the most important factors, and how did they influence that prediction?
• Which inputs are strongest, and which can we ignore?
• Which inputs are redundant?
• How do the predictions vary as we vary the inputs?
Accuracy vs. interpretability
• A simple approach to interpretability is to use simple and interpretable models such as linear regression, logistic regression, and decision trees.
• However, it’s often beneficial to use complex models for prediction, leading to a trade-off between accuracy and interpretability.
Accuracy vs. interpretability
Complex methods tend to be less interpretable than simpler methods.
Figure: the trade-off between flexibility and interpretability. Methods such as subset selection and least squares are the least flexible but most interpretable; generalized additive models and trees sit in between; bagging, boosting, and support vector machines are the most flexible but least interpretable.
• Models such as linear regression and decision trees are said to have intrinsic interpretability.
• Model-agnostic tools can help us to interpret the predictions of “black-box” algorithms.
SHAP values
• The SHAP (Shapley additive explanations) framework assigns each feature an importance value for a particular prediction.
• The feature attribution is called a SHAP value.
• SHAP views any explanation of a model’s prediction as a model in itself, called the explanation model.
SHAP values
SHAP values form an additive explanation,
f(x) = φ_0 + Σ_{j=1}^{p} φ_j,
where f(x) is the original prediction, φ_0 is a base value, and φ_j is the SHAP value for feature j.
In words, the SHAP values add up exactly to the difference between the prediction and a base value.
Example: linear regression
The best explanation model for an interpretable model such as a linear regression is the model itself,
f(x) = β_0 + β_1 x_1 + … + β_p x_p.
In this case, we can directly see that the contribution of input j to the prediction is β_j x_j. The SHAP value is
φ_j = β_j (x_j − x̄_j),
where x̄_j is the sample mean of feature j.
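To make the formula concrete, here is a minimal sketch (scikit-learn and NumPy, with made-up data) that computes the SHAP values of a fitted linear regression by hand and checks that they add up to the prediction minus the base value:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data (hypothetical): 200 observations, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# SHAP values for a linear model: phi_j = beta_j * (x_j - mean(x_j))
x_new = np.array([0.3, -1.2, 0.8])
phi = model.coef_ * (x_new - X.mean(axis=0))   # one SHAP value per feature
phi0 = model.predict(X).mean()                 # base value: the average prediction

# The SHAP values add up to the prediction minus the base value
prediction = model.predict(x_new.reshape(1, -1))[0]
print(phi)
print(np.isclose(phi0 + phi.sum(), prediction))  # True
```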
SHAP values
• SHAP values are based on results from cooperative game theory that ensure that the feature attributions satisfy desirable theoretical properties.
• The theory and implementation are quite advanced, so we do not cover it here.
SHAP feature importance
We can measure the global importance of a feature by computing
I_j = Σ_{i=1}^{n} |φ_j^{(i)}|,
where φ_j^{(i)} is the SHAP value of input j for prediction i.
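The theory is beyond this unit, but in practice SHAP values are usually obtained from a library. A minimal sketch (assuming the shap package is installed; data and model are illustrative) that obtains SHAP values for a tree ensemble and computes the global importance I_j as the sum of absolute SHAP values per feature:

```python
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Illustrative data (hypothetical)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)       # array of shape (n, p)

# Global feature importance I_j: sum of absolute SHAP values for feature j
importance = np.abs(shap_values).sum(axis=0)
print(importance)
```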
SHAP values
Strengths:
• Additive explanations are easy to understand and communicate.
• Solid theoretical basis.
Weaknesses:
• Some implementations are computationally expensive.
• The practical implementation assumes that the inputs are independent.
Model selection
Model selection
Given the training data, each learned model results from the combination of:
• Learning algorithm.
• Hyperparameter values.
• Features.
• Random numbers.
Model selection
Model selection methods estimate the generalisation performance of a model from training data. We use the estimates to:
• Guide experimentation and iteration.
• Select hyperparameters.
• Select features.
• Combine predictions from different models.
• Select a final model for prediction.
Model selection
We want to find an optimal level of model complexity.
Model selection
There are three approaches:
• Validation set.
• Cross-validation.
• Analytical criteria.
Validation set
In the validation set approach, we randomly split the training data into a training set and a validation set. We estimate the models on the training set and compute predictions for the validation set.
We select the model with the best metrics on the validation set.
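A minimal scikit-learn sketch of this procedure (the candidate models, metric, and synthetic data are illustrative stand-ins):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=15.0, random_state=0)

# Randomly split the training data into a training set and a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

candidates = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

# Fit each candidate on the training set and compute its validation metric
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = mean_squared_error(y_val, model.predict(X_val))

# Select the model with the best (lowest) validation error
best = min(val_scores, key=val_scores.get)
print(val_scores, best)
```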
Validation set
Strengths:
• Simple and convenient.
Weaknesses:
• There may not be enough validation cases to reliably estimate performance.
• The metrics can have high variability over random splits.
• Biased estimation of performance since we fit the models with less than the full training data.
K-fold cross-validation
Figure by ethen8181 on Github.
Types of cross-validation
• 5-fold and 10-fold CV. The most common choices are K = 5 or K = 10.
• Leave-one-out CV (LOOCV). If we set K = n, this is called leave-one-out cross-validation. We use all the other observations to predict each observation i.
• Repeated K-fold CV. We repeat the K-fold CV algorithm with multiple random splits, which decreases variance. This is especially helpful when the dataset is not large.
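A brief sketch of 5-fold and repeated 5-fold CV with scikit-learn (the estimator and data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, RepeatedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: average the metric over the 5 held-out folds
cv5 = KFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(model, X, y, cv=cv5, scoring="accuracy").mean())

# Repeated 5-fold CV: repeat with multiple random splits to reduce variance
rcv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
print(cross_val_score(model, X, y, cv=rcv, scoring="accuracy").mean())
```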
Number of folds
• The higher the K, the higher the computational cost.
• The higher the K, the lower the bias for estimating performance (because we want to estimate the performance of the model when trained with all n examples).
• Increasing K may increase variance (because the training sets become more similar to each other).
K-fold cross-validation
Strengths:
• More accurate than the validation set approach.
Weaknesses:
• The estimate depends on the random split (but we can reduce this source of variability with repeated K-fold).
• Biased estimation of performance since we fit the models with less than the full training data.
Nested cross-validation
• Model selection can overfit the validation set. This becomes a problem especially if you use the same validation set to make multiple choices about the learning algorithm.
• In nested cross-validation, we implement two layers of cross-validation to mitigate this issue.
• For example, we may implement an outer loop for hyperparameter optimisation and an inner loop for feature selection.
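A minimal sketch of one common variant of nested cross-validation in scikit-learn, with hyperparameter optimisation in the inner loop and performance estimation for the whole tuning procedure in the outer loop (estimator and grid are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=15, random_state=0)

# Inner loop: hyperparameter optimisation by 5-fold CV
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
tuned_svm = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=inner_cv)

# Outer loop: estimate the generalisation performance of the tuned model
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)
nested_scores = cross_val_score(tuned_svm, X, y, cv=outer_cv)
print(nested_scores.mean())
```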
Hyperparameter optimisation
Hyperparameter optimisation
Hyperparameter optimisation methods optimise a model selection criterion as function of the hyperparameters. The most common techniques are:
• Hand-tuning.
• Grid search.
• Random search.
• Bayesian optimisation.
• Multi-fidelity methods.
• Metaheuristic methods.
Grid search
• In the grid search approach, we specify a list of values for each hyperparameter and evaluate every possible configuration.
• This is only computationally feasible if the number of hyperparameters and possible values is not too high.
Example: k-Nearest Neighbours
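The slide's worked example is not reproduced here, but a hedged sketch of how a grid search for k-nearest neighbours looks with scikit-learn's GridSearchCV (the grid values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Every combination of the listed values is evaluated by 5-fold CV
param_grid = {
    "n_neighbors": [1, 3, 5, 11, 21, 51],
    "weights": ["uniform", "distance"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```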
Random search
In a random search, we specify a statistical distribution for each hyperparameter and randomly sample configurations to evaluate until the procedure exhausts the computational budget.
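A minimal sketch with scikit-learn's RandomizedSearchCV, which samples configurations from the specified distributions until the budget (n_iter trials) is exhausted (model and distributions are illustrative):

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# A statistical distribution for each hyperparameter
param_distributions = {
    "learning_rate": loguniform(1e-3, 1e0),
    "n_estimators": randint(50, 500),
    "max_depth": randint(1, 6),
}

# n_iter sets the computational budget: the number of sampled configurations
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=30,
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```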
Random search
Strengths:
• More efficient than a grid search.
Weaknesses:
• Wastes computation on configurations that are unlikely to perform well based on past trials.
• Not guaranteed to find good hyperparameter values within the time allowed by the computational budget.
Bayesian optimisation
• Bayesian optimisation (BO) methods perform model-based optimisation.
• At each iteration, the algorithm selects a promising hyperparameter configuration to evaluate based on previous trials.
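The details are beyond this unit, but as a hedged sketch, this is roughly how a model-based optimisation loop looks with the Optuna library (assuming Optuna is installed; the model and search space are illustrative). Optuna's default sampler is a tree-structured Parzen estimator, a model-based method in the same spirit as Bayesian optimisation:

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

def objective(trial):
    # The sampler proposes promising values based on previous trials
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 2, 12),
        "max_features": trial.suggest_float("max_features", 0.2, 1.0),
    }
    model = RandomForestClassifier(**params, random_state=0)
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```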
Multi-fidelity optimisation
• Multi-fidelity methods attempt to increase efficiency by combining full evaluations with trials based on subsets of the data or model.
• HyperBand is a popular multi-fidelity optimisation method that balances the number of hyperparameter configurations and the allocated computational budget for each trial.
• Bayesian Optimisation HyperBand is a state-of-the-art method that combines Bayesian optimisation and HyperBand.
Metaheuristic methods
• Metaheuristic optimisation refers to a large class of algorithms designed to find near-optimal solutions to difficult optimisation problems.
• Evolutionary optimisation methods, which are inspired by the theory of natural selection, are commonly used for hyperparameter optimisation.
Feature selection
Feature selection
There are three types of feature selection methods:
• Filter methods select the features before training.
• Wrapper methods evaluate models with different subsets of features.
• Embedded methods refer to learning algorithms that have built-in feature selection properties.
Feature selection
It’s useful to make the following distinction:
• An irrelevant feature is one that has a very weak relationship with the response, whether in isolation or conditional on other features.
• A redundant feature is one that is potentially related to the response, but has a very weak relationship with it conditional on other features.
Feature selection
Features that look irrelevant in isolation may be relevant in combination.
Feature screening
• In feature screening, we remove features that have a weak bivariate relationship with the response according to a suitable measure of dependence.
• Common measures of dependence include the mutual information, φk, and the Pearson correlation.
• We treat the dependence threshold for excluding features as a hyperparameter.
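A minimal sketch of feature screening with scikit-learn, using mutual information as the dependence measure (illustrative data; the number of features kept, which plays the role of the threshold, would be tuned as a hyperparameter):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression

X, y = make_regression(n_samples=500, n_features=50, n_informative=8, random_state=0)

# Keep the k features with the strongest bivariate relationship with the response
screen = SelectKBest(score_func=mutual_info_regression, k=10)
X_screened = screen.fit_transform(X, y)
print(X_screened.shape)                   # (500, 10)
print(screen.get_support(indices=True))   # indices of the retained features
```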
Feature screening
Strengths:
• Low computational cost.
• Helpful when there are many irrelevant features.
Weaknesses:
• Excludes features that look irrelevant in isolation but are relevant in combination.
• Does not remove redundant features.
Best subset selection and stepwise selection
Best subset selection and stepwise selection are wrapper methods for feature selection.
Strengths:
• Most accurate.
Weaknesses:
• High computational cost.
• Overfits if performed without nested (double) cross-validation.
Recursive feature elimination
Recursive feature elimination (RFE) works as follows:
• Train the model with all the features and identify the weakest feature according to a measure of feature importance.
• Train the model again with that feature removed and identify the weakest feature in the new specification.
• Repeat until the model has the desired number of features.
Recursive feature elimination
• RFE is similar to backward selection, but uses a measure of feature importance rather than the training error to select which feature to remove at each step.
• RFE has a lower computational cost than stepwise selection, since it only fits the model once per iteration.
• As usual, we treat the number of features as a hyperparameter.
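A minimal sketch of RFE with scikit-learn (estimator and data are illustrative); RFECV can instead choose the number of features by cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=30, n_informative=6, random_state=0)

# Repeatedly drop the weakest feature (by coefficient magnitude here)
# until the desired number of features remains
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=6, step=1)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # rank 1 marks the retained features
```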
Embedded methods
Two examples of embedded methods are:
• The lasso.
• Tree-based methods.
Model stacking
Choosing the final model for prediction
• Machine learning is iterative: we typically explore multiple learning algorithms, feature engineering strategies, and hyperparameters until we obtain one or more models that perform well.
• We use model selection methods to guide this process, being careful not to overfit the validation set.
• Ultimately, we need to choose a final model for prediction. That is, a candidate model for deployment in a business production system.
Ensemble learning
In ensemble learning, we combine predictions from multiple learning algorithms as our final model.
Ensemble learning, rather than selecting a single best model, usually achieves the best generalisation performance.
Model averaging
A simple method is to compute a weighted average
f_ave(x) = Σ_{m=1}^{M} w_m f_m(x),
where f_1(x), …, f_M(x) are predictions from M different models and w_1, …, w_M are the model weights.
Model averaging
How can we choose the weights? One option is to simply pick the model weights, say w_m = 1/M for a simple average of the models:
f_ave(x) = (1/M) Σ_{m=1}^{M} f_m(x).
This approach has the advantage of not adding variability through the choice of the weights, but can lead to sub-optimal predictions.
Model averaging
Another approach is to select the weights by optimisation, for example
w = argmin_{w_1,…,w_M} (1/n) Σ_{i=1}^{n} L(y_i, Σ_{m=1}^{M} w_m f_m(x_i)),
where we often impose the restriction that the weights are non-negative and sum to one.
This method does not work well if it’s based on the training set. In practice, it places too much weight on the most complex models.
Model averaging
A better approach is to select the weights using a validation set or cross-validation. In the validation set approach, we obtain the model weights as
w = argmin_{w_1,…,w_M} (1/n_val) Σ_{i=1}^{n_val} L(y_i^val, Σ_{m=1}^{M} w_m f_m(x_i^val)),
where f_m(x_i^val) are the validation set predictions from model m.
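A minimal sketch of selecting the weights on a validation set with scipy, for squared error loss and with the weights constrained to be non-negative and to sum to one (the model predictions here are random placeholders for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# val_preds: (n_val, M) matrix of validation set predictions from M fitted models;
# y_val: validation responses. Random placeholders stand in for real predictions.
rng = np.random.default_rng(0)
y_val = rng.normal(size=200)
val_preds = np.column_stack(
    [y_val + rng.normal(scale=s, size=200) for s in (0.3, 0.5, 1.0)]
)
M = val_preds.shape[1]

def val_loss(w):
    # Average squared error of the weighted combination on the validation set
    return np.mean((y_val - val_preds @ w) ** 2)

result = minimize(
    val_loss,
    x0=np.full(M, 1.0 / M),        # start from the simple average
    bounds=[(0.0, 1.0)] * M,       # non-negative weights
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],  # sum to one
)
print(result.x)                    # estimated model weights
```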
Model averaging
Strengths:
• Better generalisation performance than selecting the best individual model (in most cases).
• It is useful to interpret the model weights.
Weaknesses:
• Risk of overfitting the validation set.
• Higher computational cost than using individual models.
• The predictions are harder to interpret.
Model stacking
In model stacking, we go beyond model averaging by specifying a meta-model that takes predictions from different models as inputs.
Model stacking
Source: https://www.kdnuggets.com/2017/02/stacking-models-imropved-predictions.html
Model stacking
• The model averaging procedure that we have just described is a special case of model stacking where the meta-model is a linear model.
• More generally, the meta-model can be any learning algorithm.
• We fit the meta-model using the validation set or cross-validation.
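A minimal sketch with scikit-learn's StackingRegressor, where cross-validated predictions from the base models become the inputs to the meta-model (the models are chosen purely for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=800, n_features=20, noise=10.0, random_state=0)

base_models = [
    ("ridge", Ridge()),
    ("knn", KNeighborsRegressor(n_neighbors=10)),
    ("forest", RandomForestRegressor(n_estimators=200, random_state=0)),
]

# The meta-model (final_estimator) is fitted on cross-validated predictions
# from the base models; here it is linear, but it can be any learning algorithm.
stack = StackingRegressor(estimators=base_models, final_estimator=Ridge(), cv=5)
print(cross_val_score(stack, X, y, cv=5, scoring="neg_mean_squared_error").mean())
```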
Model stacking
Strengths:
• Highest potential for maximising performance.
Weaknesses:
• The higher the complexity of the meta-model, the higher the risk of overfitting the validation set.
• Higher computational cost than using individual models.
• The predictions are harder to interpret.
Model evaluation
Model evaluation
• Model evaluation is the process of estimating the performance of your final model to ensure that it meets the requirements of the project.
• Because model selection and experimentation can overfit the available data, we need to introduce another level of reserved data for the specific purpose of evaluation: the test set.
Model evaluation
• The most important concept in this unit is that the fundamental goal of supervised learning is to generalise to future data.
• The biggest source of failure in machine learning projects is to overestimate how well the models will perform on new data.
• The best way to avoid such failures is to hold out a test set that is never seen or used in any way except to evaluate the final model instance.
Training, validation and test split
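A minimal sketch of a three-way split with scikit-learn (the split proportions are illustrative); the test set is reserved and only used once, to evaluate the final model:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=15.0, random_state=0)

# First reserve a test set that is never used during model selection
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Then split the remainder into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(y_train), len(y_val), len(y_test))   # 600 / 200 / 200
```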
Model selection vs. evaluation
Model selection:
• Validation set.
• Experimentation, hyperparameter optimisation, feature selection, and model stacking.
• Biased estimation.
• Statistical objective.
Evaluation:
• Test set.
• No optimisation of any kind.
• Unbiased estimation.
• Business goals.
Data drift
• The fundamental assumption in machine learning is that the data used to train and evaluate the model is representative of future data that we want the model to generalise to.
• Deploying a machine learning model in a production system is subject to data drift, which occurs when the training data become less representative of real-time data.
• Therefore, machine learning teams need to continuously monitor the performance of their systems.