Re-cap from Week 8 – Decision Tree
Classification and regression tree (continuous and binary variables)
The logic of decision-tree algorithm (information entropy; information gain)
Model evaluation (Lift, Misclassification, ROC)
Model comparison
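As a reminder of the Week 8 logic, information entropy and information gain can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names and the toy "yes"/"no" labels are assumptions made up for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy split (made-up labels): a 50/50 parent node split into two purer children.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4
gain = information_gain(parent, [left, right])
```

A decision tree would evaluate candidate splits this way and prefer the one with the largest gain.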
Learning Objectives
Describe Automated Machine Learning (AML)
Explain why AML is different from traditional data science development
Describe the benefits to the organization of adopting automated machine learning
Load and explore data in DataRobot
Build predictive models using DataRobot
Introduction to AML
AML
“The ability to truly democratise the process is perhaps the most important element of any enterprise machine learning platform. DataRobot automates the entire modelling lifecycle, enabling users to quickly and easily build highly accurate predictive models. The only ingredients needed are curiosity and data – coding and machine learning skills are completely optional!”
– DataRobot website
Benefits of AML
Reduces the technical skill set required to build models
Allows ML capabilities to reach across the whole organisation
Increases the scope of model development
Shortens the time from model development to deployment
Easy to interpret model comparisons
Easy to interpret model outputs and feature explanations
Simplified documentation process
What is DataRobot?
DataRobot is one of the leading providers of automated machine learning.
It runs on the cloud and is accessed via a web browser.
Some of its competitors include Amazon SageMaker and Microsoft Azure’s automated machine learning product.
https://community.datarobot.com/t5/datarobot-community/ct-p/en
Learning Resource at DataRobot
Development Process
FIGURE 10.1, Vidgen et al. 2019
Building Models with DataRobot
Leaderboard
Scores models iteratively and updates the results as the run progresses
A ‘survival of the fittest’ approach is used to build models
While the models and cross-validations are being run, the leaderboard is active and the models jockey for position based on their performance
Once all of the models have been run, the leaderboard is finalised and the models produced can be compared with each other
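The 'survival of the fittest' idea can be sketched as repeated rounds of scoring and elimination. This is only an illustration under stated assumptions, not DataRobot's actual procedure: the `run_rounds` function, the `keep_fraction` parameter, the model names and the toy scoring rule are all hypothetical.

```python
def run_rounds(candidates, score, sample_sizes, keep_fraction=0.5):
    """Score the surviving candidates at each sample size and keep the top fraction."""
    survivors = list(candidates)
    leaderboard = []
    for size in sample_sizes:
        # Rank the current survivors from best to worst on this sample size.
        leaderboard = sorted(((score(name, size), name) for name in survivors), reverse=True)
        keep = max(1, int(len(leaderboard) * keep_fraction))
        survivors = [name for _, name in leaderboard[:keep]]
    return leaderboard, survivors


# Made-up scoring rule: each model has a base score and a gain from more data.
TOY_MODELS = {
    "model_a": (0.60, 0.10),
    "model_b": (0.55, 0.30),
    "model_c": (0.70, 0.05),
    "model_d": (0.50, 0.05),
}


def toy_score(name, sample_pct):
    base, gain = TOY_MODELS[name]
    return base + gain * sample_pct / 100


leaderboard, survivors = run_rounds(TOY_MODELS, toy_score, sample_sizes=[16, 32, 64])
```

Only the models that rank well on a small sample are rerun on a larger one, which is where the saving in computing resources comes from. The trade-off is that a model that would improve a lot with more data can be eliminated early.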
Assess Model Performance
With a single training/holdout split, the results can vary depending on which data are selected for training and which for holdout. A random sample from a large dataset is more likely to capture the essence of the whole dataset than a random sample from a small one. A more robust approach is to use cross-validation.
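The difference between a single holdout split and cross-validation can be sketched with the standard library alone. Everything here is an assumption made for illustration: the 40-row toy dataset with four deliberately flipped labels, the simple threshold "model", and the function names.

```python
import random
from statistics import mean


def make_data():
    """Toy data: label is 1 when x >= 20, with four labels flipped as noise."""
    flipped = {5, 17, 22, 31}
    return [(x, (1 if x >= 20 else 0) ^ (1 if x in flipped else 0)) for x in range(40)]


def fit_threshold(train):
    """'Model': the midpoint between the mean x of each class."""
    mean0 = mean(x for x, y in train if y == 0)
    mean1 = mean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2


def accuracy(threshold, rows):
    return mean(1.0 if (x >= threshold) == (y == 1) else 0.0 for x, y in rows)


def holdout_score(data, seed, holdout_fraction=0.2):
    """Accuracy on one random holdout split."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    n_holdout = int(len(rows) * holdout_fraction)
    holdout, train = rows[:n_holdout], rows[n_holdout:]
    return accuracy(fit_threshold(train), holdout)


def cross_val_score(data, k=5, seed=0):
    """Mean accuracy over k folds; every row is used for validation exactly once."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, fold in enumerate(folds):
        train = [row for j, other in enumerate(folds) if j != i for row in other]
        scores.append(accuracy(fit_threshold(train), fold))
    return mean(scores)


data = make_data()
holdout_scores = [holdout_score(data, seed) for seed in range(10)]
cv_score = cross_val_score(data)
```

Printing `holdout_scores` shows how the estimate moves from one random split to the next, whereas `cv_score` averages over all five folds.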
Image sourced from https://www.datavedas.com/holdout-cross-validation/
Training and Testing (Holdout) Data
Training, Validation and Holdout (TVH)
Cross-Validation in DataRobot
FIGURE 10.15, Vidgen et al. 2019
Auto-Pilot Steps
Partition data into training and holdout (80%/20%).
Subset training data into five folds (80%/5 = 16% of total data per fold).
Select one fold for validation and keep this hidden.
Build models using 16% of the data and validate.
Build models using 32% of the data and validate.
Build models using 64% of the data and validate.
Cross-validate the best models.
For the recommended model, retrain on 100% of the training data (80% of the full dataset) and assess performance on the holdout data (20% of the full dataset).
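The partition arithmetic in the steps above can be checked with a short sketch. It is not DataRobot code: the function name and parameters are hypothetical, and it assumes the 16%, 32% and 64% samples are drawn from the four training folds that are not reserved for validation.

```python
import random


def autopilot_partitions(n_rows, holdout_fraction=0.2, n_folds=5, seed=0):
    """Split row indices into holdout, one validation fold and growing training samples."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Step 1: 20% holdout, 80% training.
    n_holdout = int(n_rows * holdout_fraction)
    holdout, training = indices[:n_holdout], indices[n_holdout:]

    # Step 2: five folds of the training data (16% of all rows each).
    folds = [training[i::n_folds] for i in range(n_folds)]

    # Step 3: hide one fold for validation; the other four hold 64% of all rows.
    validation = folds[0]
    pool = [i for fold in folds[1:] for i in fold]

    # Steps 4-6: samples of 16%, 32% and 64% of all rows (assumed to come from the pool).
    samples = {pct: pool[: n_rows * pct // 100] for pct in (16, 32, 64)}
    return holdout, validation, samples, training


holdout, validation, samples, training = autopilot_partitions(1000)
sizes = {pct: len(rows) for pct, rows in samples.items()}
```

With 1,000 rows this gives 200 holdout rows, 160 validation rows and training samples of 160, 320 and 640 rows. The full 800 training rows are used only for the final retrain in the last step.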
Benefits of Blenders
What is a Blender?
Last lecture I said that DataRobot performs a heuristic search for the ‘best’ model based on the target to be predicted.
But what I didn’t mention was that this ‘best’ model could very well be an ensemble – or blend – of models.
Blender Example
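One simple kind of blend averages the predictions of several models, row by row. The sketch below assumes an average blend only; DataRobot's blenders may combine models differently. The model names, predictions and actual values are made up for illustration.

```python
from statistics import mean


def average_blend(predictions_by_model):
    """Average each row's predictions across all models."""
    return [mean(row) for row in zip(*predictions_by_model.values())]


def mean_squared_error(predicted, actual):
    return mean((p - a) ** 2 for p, a in zip(predicted, actual))


# Made-up predicted probabilities from two hypothetical models.
actual = [1, 0, 1, 0]
predictions = {
    "model_a": [0.9, 0.4, 0.6, 0.1],
    "model_b": [0.6, 0.1, 0.9, 0.4],
}

blend = average_blend(predictions)
mse_by_model = {name: mean_squared_error(p, actual) for name, p in predictions.items()}
mse_blend = mean_squared_error(blend, actual)
```

In this toy case the two models make their larger errors on different rows, so the errors partly cancel and the blend has a lower mean squared error than either model. A blend is not guaranteed to win, which is why blenders compete on the leaderboard like any other model.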
Summary
Unlock Holdout
The model recommended for deployment has been automatically unlocked
However, we can still unlock the holdout for any of the remaining models
Models trained on 100% of the data have an advantage in prediction accuracy over models trained on a smaller dataset
However, there is a risk of overfitting, so it is recommended that you keep the leaderboard sorted by the most robust estimate of performance (e.g. the cross-validation score)
Summary of AML/DataRobot
AML provides everybody in the organisation with the ability to access cutting-edge predictive model building
Information is consistently presented, making it easy for the user to compare and contrast the performance of multiple models
Users can scale the models to accommodate datasets with millions of records and thousands of features
Computing resources are used effectively, as the leaderboard is created through multiple rounds of ‘survival of the fittest’ competition, using small volumes of data at first and only investing further in models that have proven to perform well