Online evaluation
Faculty of Information Technology, Monash University, Australia
FIT5149 Week 5 (additional material)
Two regimes for machine learning evaluation
Offline evaluation:
Happens during the prototyping phase.
Tries out different features, models, and hyperparameters.
An iterative process of many rounds of evaluation against a chosen baseline on a set of chosen evaluation metrics.
Resampling methods: cross-validation, bootstrapping (see the sketch after this list).
Online evaluation:
A/B testing
Multi-Armed Bandits (MAB)
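As a concrete illustration of the offline regime, here is a minimal cross-validation sketch in Python; the scikit-learn dataset and model are placeholders chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data and model; in practice these are your own
# features and your candidate model
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: five offline evaluation rounds,
# each scored on a held-out fold
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```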
An example of A/B testing
Example of A/B testing on a website. By randomly serving visitors two versions of a website that differ only in the design of a single button element, the relative efficacy of the two designs can be measured. (Source: https://en.wikipedia.org/wiki/A/B_testing)
What is A/B testing?
Briefly, A/B testing involves the following steps:
1. Split into randomized control/experimentation groups.
2. Observe behavior of both groups on the proposed methods.
3. Compute test statistics.
4. Compute p-value.
5. Output decision.
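To make steps 2–5 concrete, here is a minimal sketch assuming Bernoulli outcomes (e.g., click/no-click) and synthetic data; the rates 0.10 and 0.11 and the group sizes are hypothetical. It uses Welch's t-test from SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical click-through outcomes for the two randomized groups
control = rng.binomial(1, 0.10, size=5000)    # group A: current design
treatment = rng.binomial(1, 0.11, size=5000)  # group B: proposed design

# Steps 3-4: test statistic and p-value (Welch's t-test, unequal variances)
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# Step 5: decision at significance level alpha
alpha = 0.05
decision = "adopt B" if p_value < alpha else "keep A"
print(f"t = {t_stat:.3f}, p = {p_value:.4f} -> {decision}")
```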
Pitfalls of A/B Testing
A/B tests are easy to understand but tricky to do right. Here is a list of things to watch out for, ranging from the pedantic to the pragmatic.
Complete Separation of Experiences.
Which Metric?
How Much Change Counts as Real Change?
One-Sided or Two-Sided Test?
How Many False Positives Are You Willing to Tolerate?
How Many Observations Do You Need? (see the power-analysis sketch after this list)
Is the Distribution of the Metric Gaussian?
Are the Variances Equal?
What Does the p-Value Mean?
Multiple Models, Multiple Hypotheses.
How Long to Run the Test?
Catching Distribution Drift.
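Several of these questions (one-sided vs. two-sided tests, false-positive tolerance, number of observations) come together in a power analysis done before the test starts. A minimal sketch, assuming a hypothetical 10% baseline conversion rate and an 11% minimum rate worth detecting, using statsmodels:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical rates: 10% baseline; 11% is the smallest change we care about
effect = proportion_effectsize(0.11, 0.10)  # Cohen's h for two proportions

# Observations needed per group for a two-sided z-test
n = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,               # tolerated false-positive rate
    power=0.8,                # 1 - tolerated false-negative rate
    alternative="two-sided",
)
print(f"need about {round(n)} observations per group")
```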
Multi-Armed Bandits: An Alternative
If the ultimate goal is to decide which model or design is the best, then A/B testing is the right framework, along with its many gotchas to watch out for.
However, if the ultimate goal is to maximize total reward, then multi-armed bandits and personalization are the way to go.
The name multi-armed bandit (MAB) comes from gambling.
A slot machine is a one-armed bandit; each time you pull the lever, it outputs a certain reward (most likely negative).
Multi-armed bandits are like a room full of slot machines, each with an unknown random payoff distribution.
The task is to figure out which arm to pull and when, in order to maximize the reward.
There are many MAB algorithms: linear UCB, Thompson sampling (also called Bayesian bandits), and Exp3 are among the best known.
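As an illustration, here is a minimal Thompson sampling (Bayesian bandit) sketch for Bernoulli arms; the hidden payoff probabilities are invented for the simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
true_payoffs = [0.05, 0.10, 0.15]  # hypothetical hidden win rate of each arm
successes = np.ones(3)             # Beta(1, 1) uniform prior per arm
failures = np.ones(3)

total_reward = 0
for _ in range(10_000):
    # Sample each arm's Beta posterior and pull the best-looking arm
    samples = rng.beta(successes, failures)
    arm = int(np.argmax(samples))
    reward = rng.random() < true_payoffs[arm]  # simulate one pull
    successes[arm] += reward
    failures[arm] += 1 - reward
    total_reward += reward

pulls = (successes + failures - 2).astype(int)
print(f"total reward: {total_reward}, pulls per arm: {pulls}")
```

Over time the posterior concentrates on the best arm, so exploration tapers off naturally; this is why bandits can maximize total reward rather than only pick a winner at the end.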
A quick review: Evaluating Machine Learning Models
More details can be found in “Evaluating Machine Learning Models” by Alice Zheng, published by O’Reilly Media, Inc., 2015.