COMP90051 Statistical Machine Learning
Project 2 Description [1] (v3 updated 2021-09-19)
Due date: 4:00pm Friday, 8th October 2021 Weight: 25%; forming combined hurdle with Proj1
Copyright statement: All the materials of this project—including this specification and code skeleton—are
copyright of the University of Melbourne. These documents are licensed for the sole purpose of your assessment
in COMP90051. You are not permitted to share or post these documents.
Academic misconduct: You are reminded that all submitted Project 2 work is to be your own individual
work. Automated similarity checking software will be used to compare all submissions. It is University policy
that academic integrity be enforced. For more details, please see the policy at http://academichonesty.unimelb.edu.au/policy.html. Academic misconduct hearings can determine that students receive zero for an
assessment, a whole subject, or are terminated from their studies. You may not use software libraries or code
snippets found online, from friends/private tutors, or anywhere else. You can only submit your own work.
Multi-armed bandits (MABs) are a simple yet powerful framework for sequential decision-making under
uncertainty. One of the many applications of MABs was pioneered by Yahoo! News, in deciding what news
items to recommend to users based on article content, user profile, and the historical engagement of the user
with articles. Given decision making in this setting is sequential (what do we show next?) and feedback is only
available for articles shown, Yahoo! researchers saw a perfect application for MABs like those (ε-Greedy and UCB) discussed in Lecture 17 [2]. These researchers realised that incorporating some element of user-article
state requires contextual bandits: articles are arms; context per round incorporates information about both
user and article (arm); and {0, 1}-valued rewards represent clicks. Therefore the per round cumulative reward
represents click-through-rate, which is exactly what online services want to maximise to drive user engagement
and advertising revenue. In this project, you will work individually to implement several MAB algorithms.
These are a little more advanced than what was covered in class, coming from real research papers that you will
have to read and understand yourself. By the end of the project you should have developed:
ILO1. A deeper understanding of the MAB setting and common MAB approaches;
ILO2. An appreciation of how MABs are applied;
ILO3. Demonstrable ability to implement ML approaches in code; and
ILO4. Ability to read recent ML papers and understand their focus, contributions, and algorithms enough to be able to implement and apply them (while ignoring details not needed for your task).
Overview
You have three tasks.
1. Implement LinUCB contextual MAB (Li et al., 2010) [7 marks]
2. Implement MLinUCB (Bouneffouf et al., 2020) [9 marks]
3. Implement SquareCB (Foster & Rakhlin, 2020) [9 marks]
[1] Special thanks to Neil Marchant for generously co-designing this project.
[2] We're releasing Lecture 17 with this project to help get you started.
All tasks are to be completed in the provided Python Jupyter notebook proj2.ipynb [3]. Detailed instructions for each task are included in this document. These will require you to consult one or more academic papers. We provide helpful pointers in this project spec to guide your reading and to correct any ambiguities.
MAB algorithms. The project’s tasks require you to implement MAB algorithms by completing provided
skeleton code in proj2.ipynb. All of the MAB algorithms must be implemented as sub-classes of a base MAB
class (defined for you). This ensures all MAB algorithms inherit the same interface, with the following methods:
• __init__(self, n_arms, …, rng): Initialises the MAB with the given number of arms n_arms and pseudo-random number generator rng (a numpy Generator; see https://numpy.org/doc/stable/reference/random/generator.html). Arms are indexed by integers in the set {0, ..., n_arms − 1}. All MAB algorithms take additional algorithm-specific parameters in place of '…'. Hint: When instantiating a MAB, you should set rng to be the random generator initialised for you in the notebook. This will ensure your results are reproducible.
• play(self, context): Plays an arm based on the provided context (a multi-dimensional array that encodes user and arm features). The method must return the integer index of the played arm in the set {0, ..., n_arms − 1}. Note: this method should not update the internal state of the MAB. All play methods should tie-break uniformly at random in this project. If a MAB class's play() method is undefined due to insufficient update() calls, you should by default pick an unplayed arm uniformly at random.
• update(self, arm, context, reward): Updates the internal state of the MAB after playing an arm with integer index arm for the given context, receiving an (optional) real-valued reward. Only class MLinUCB of Task 2 will handle missing rewards; all other classes should do nothing if updated with a missing reward.
Your implementations must conform to this interface. You may implement some functionality in the base
MAB class if you desire—e.g., to avoid duplicating common functionality in each sub-class. Your classes may also
use additional private methods to better structure your code. And you may decide how to use class inheritance.
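For concreteness, the interface described above might look like the following minimal sketch. The actual base class is already defined for you in proj2.ipynb; the sketch below only mirrors the methods listed above and is illustrative, not the provided code.

import numpy as np

class MAB:
    """Illustrative sketch of the base multi-armed bandit interface."""

    def __init__(self, n_arms, rng):
        # arms are indexed 0, ..., n_arms - 1
        self.n_arms = n_arms
        # shared pseudo-random generator, used e.g. for uniform tie-breaking
        self.rng = rng

    def play(self, context):
        # return the integer index of the arm to play; must not mutate state
        raise NotImplementedError

    def update(self, arm, context, reward):
        # fold the observed (arm, context, reward) triple into internal state;
        # reward may be None (missing) -- only MLinUCB acts on that case
        raise NotImplementedError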
Evaluating MAB classes. Each task directs you to evaluate your implementations in some way. You will
need to use the offline_eval() function and dataset file provided. See Appendix A for details.
Python environment. You must use the Python environment used in workshops to ensure markers can
reproduce your results if required. We assume you are using Python ≥ 3.8, numpy ≥ 1.19.0, scikit-learn ≥
0.23.0 and matplotlib ≥ 3.2.0.
Other constraints. You may not use functionality from external libraries/packages, beyond what is imported in the provided Jupyter notebook. You must preserve the structure of the skeleton code—please only insert your own code where specified. You should not add new cells to the
notebook. You may discuss the bandit learning slide deck or Python at a high-level with others, including
via Piazza, but do not collaborate with anyone on direct solutions. You may consult resources to understand
bandits conceptually, but do not make any use of online code whatsoever. (We will run code comparisons against
online partial implementations to enforce these rules. See ‘academic misconduct’ statement above.)
Submission Checklist
You must complete all your work in the provided proj2.ipynb Jupyter notebook. When you are
ready to submit, follow these steps. You may submit multiple times. We will mark your last attempt. Hint: it
[3] We appreciate that while some have entered COMP90051 feeling less confident with Python, many workshops so far have exercised and built up basic Python and Jupyter knowledge. Both are industry-standard tools for the data sciences.
is a good idea to submit early as a backup. Try to complete Task 1 in the first week and submit it; it will help
you understand other tasks and be a fantastic start!
1. Restart your Jupyter kernel and run all cells consecutively.
2. Ensure outputs are saved in the ipynb file, as we may choose not to run your notebook when grading.
3. Rename your completed notebook from proj2.ipynb to username.ipynb where username is your university central username [4].
4. Upload your submission to the Project 2 Canvas page.
Marking
Projects will be marked out of 25. Overall approximately 50%, 30%, 20% of available marks will come from
correctness, code structure & style, and experimentation. Markers will perform code reviews of your submissions
with indicative focus on the following. We will endeavour to provide (indicative not exhaustive) feedback.
1. Correctness: Faithful implementation of the algorithm as specified in the reference or clarified in the
specification with possible updates in the Canvas changelog. It is important that your code performs
other basic functions such as: raising errors if the input is incorrect, working for any dataset that meets
the requirements (i.e., not hard-coded).
2. Code structure and style: Efficient code (e.g., making use of vectorised functions, avoiding recalculation
of expensive results); self-documenting variable names and/or comments; avoiding inappropriate data
structures, duplicated code and illegal package imports.
3. Experimentation: Each task you choose to complete directs you to perform some experimentation with
your implementation, such as evaluation, tuning, or comparison. You will need to choose a reasonable
approach to your experimentation, based on your acquired understanding of the MAB learners.
Late submission policy. Late submissions will be accepted up to 4 days late, with a −2.5 mark penalty per day or part thereof.
Task Descriptions
Task 1: Implement LinUCB Contextual MAB [7 marks total]
In this task, you will implement a MAB learner as a Python class. You are to read up to and including
Section 3.1 of this WWW’2010 paper to understand and then implement the LinUCB learner with disjoint (not
shared) linear models. LinUCB is described in Algorithm 1. This is a contextual bandit—likely the first you’ve
seen—however its workings are a direct mashup of UCB and ridge regression, both of which we cover in lectures.
Lihong Li, Wei Chu, John Langford, Robert E. Schapire, ‘A Contextual-Bandit Approach to Personalized News Article Recommendation’, in Proceedings of the Nineteenth International Conference on World Wide Web (WWW’2010), Raleigh, NC, USA, 2010.
https://arxiv.org/pdf/1003.0146.pdf
Your implementation of WWW’2010 Algorithm 1 should be done within the provided skeleton code for
the LinUCB class. You will need to implement the __init__, play, and update methods. The parameters and
return values for each method are specified in docstrings in proj2.ipynb. Note that tie-breaking in play should
be done uniformly-at-random among value-maximising arms.
[4] LMS/UniMelb usernames look like brubinstein, not to be confused with an email address such as benjamin.rubinstein.
While the idea of understanding LinUCB well enough to implement it correctly may seem daunting, the WWW paper is written for a non-ML audience and is complete in its description, and the pseudo-code is detailed. There is one unfortunate typo, however: pg. 3, column 2, line 3 of the arXiv version linked above should read c_a rather than b_a. The pseudo-code uses the latter (correctly) as shorthand for the former times the contexts. Another hint: for students who haven't seen the term "design matrix" before, it means a feature matrix.
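As a rough reminder of this 'UCB plus ridge regression' structure (not a substitute for reading Algorithm 1, and not the provided skeleton), the per-arm quantity in the paper's disjoint model looks roughly like this in numpy, following the paper's notation for A_a, b_a and α; the concrete values below are placeholders.

import numpy as np

d = 10                     # number of features per arm
A_a = np.eye(d)            # per-arm ridge-regression Gram matrix (initially identity)
b_a = np.zeros(d)          # per-arm response vector
x_a = np.random.rand(d)    # placeholder context features for arm a
alpha = 1.0

A_inv = np.linalg.inv(A_a)
theta_a = A_inv @ b_a                                      # ridge-regression coefficients
p_a = theta_a @ x_a + alpha * np.sqrt(x_a @ A_inv @ x_a)   # upper-confidence score for arm a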
Experiments. Once your class is implemented, it is time to perform some basic experimentation.
(a) Include and run an evaluation on the given dataset where column 1 forms arms, column 2 forms rewards
(and missing indicator column 3 is ignored), and columns 4–103 form contexts:
mab = LinUCB(10, 10, 1.0, rng)
LinUCB_rewards, _ = offline_eval(mab, arms, rewards, contexts, 800)
print("LinUCB average reward ", np.mean(LinUCB_rewards))
(b) Run offline evaluation as above, but now plot the per-round cumulative reward, i.e., (1/T) Σ_{t=1}^{T} r_{t,a}, for T = 1..800, as a function of round T. We have imported matplotlib.pyplot to help here.
(c) Can you tune your MAB's hyperparameters? Devise and run a grid-search-based strategy to select alpha for LinUCB as Python code in your notebook. Output the result of this strategy—which could be a graph, a number, etc. of your choice. (An illustrative sketch for parts (b) and (c) follows.)
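As a rough illustration of parts (b) and (c), the sketch below shows one way the per-round cumulative reward could be plotted and one naive grid search over alpha; the names LinUCB_rewards, arms, rewards, contexts and rng are those used in the notebook, while the grid of alpha values is an arbitrary choice for illustration only.

import numpy as np
import matplotlib.pyplot as plt

# (b) per-round cumulative reward: mean of the first T observed rewards, T = 1..800
T = np.arange(1, len(LinUCB_rewards) + 1)
plt.plot(T, np.cumsum(LinUCB_rewards) / T, label="LinUCB")
plt.xlabel("round T")
plt.ylabel("per-round cumulative reward")
plt.legend()
plt.show()

# (c) a naive grid search over alpha (grid chosen arbitrarily for illustration)
for alpha in [0.1, 0.5, 1.0, 2.0, 5.0]:
    mab = LinUCB(10, 10, alpha, rng)
    rewards_alpha, _ = offline_eval(mab, arms, rewards, contexts, 800)
    print(alpha, np.mean(rewards_alpha))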
Task 2: Implement MLinUCB [9 marks total]
In this task, you will implement a contextual MAB that extends LinUCB (Task 1) for settings where observing
rewards may not always be possible. For example, rewards may be lost due to errors in a physical process, may be costly to obtain, or may be withheld for privacy. The idea of MLinUCB (LinUCB with missing rewards) is
similar to semi-supervised learning: even when MLinUCB doesn’t observe a reward in a round, it still tries to
learn from the context feature vector by guessing the missing reward using clustering of past context-reward
pairs. You are to read up to but not including Theorem 1 of this 2020 arXiv paper:
Djallel Bouneffouf, Sohini Upadhyay, and Yasaman Khazaeni. ‘Contextual Bandit with Missing
Rewards.’ arXiv:2007.06368 [cs.LG]. 2020. https://arxiv.org/pdf/2007.06368.pdf
Your implementation of Algorithm 2 should be done within the provided skeleton code for the MLinUCB
class. You will need to implement the __init__, play, and update methods. Recall that update() will
receive reward=None for rounds with no observed reward. The parameters and return values for each method
are specified in docstrings in proj2.ipynb. Your implementation should make use of k-means clustering as implemented by scikit-learn's KMeans (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), which is imported for you in the proj2.ipynb skeleton. You will need to initialise using the MAB hyperparameters N and m from the paper. Except for the number of clusters, you should just use the KMeans defaults.
Hint: A number of minor errors and omissions in the paper might cause confusion. To help out, we’ve
clarified these and summarised the clarifications in a new pseudo-code algorithm, all in Appendix B.
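If you have not used scikit-learn's KMeans before, the following sketch shows the calls likely to be relevant: fitting on past contexts, then reading off cluster centres and distances. The arrays past_contexts and x_t are placeholders and, per the instruction above, only n_clusters deviates from the KMeans defaults.

import numpy as np
from sklearn.cluster import KMeans

past_contexts = np.random.rand(50, 10)   # placeholder: s previously observed contexts, shape (s, d)
x_t = np.random.rand(10)                 # placeholder: current context, shape (d,)

s, N = past_contexts.shape[0], 10
kmeans = KMeans(n_clusters=min(s, N)).fit(past_contexts)

labels = kmeans.labels_                        # cluster index assigned to each past context
centres = kmeans.cluster_centers_              # cluster centres, shape (min(s, N), d)
dists = np.linalg.norm(centres - x_t, axis=1)  # distance from x_t to each centre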
Experiments. Repeat the same set of experiments from Task 1 here, except for (respectively):
(a) Here run your new class, this time with missing rewards as follows:
mab = MLinUCB(10, 10, 1.0, 10, 3, rng)
MLinUCB_rewards, MLinUCB_ids = offline_eval(mab, arms, rewards_missing, contexts, 800)
print('MLinUCB average reward', np.mean(rewards[MLinUCB_ids]))
Then compare this with LinUCB also evaluated with the same missing reward version of the data.
(b) Once again plot your results for both bandits, this time with the missing-reward data.
(c) This time consider tuning alpha, N and m.
Task 3: Implement SquareCB [9 marks total]
In this task, you will implement a final learner for the regular contextual MAB setting (the same setting as
Task 1), for ‘general purpose’ contextual MABs. Where LinUCB is designed to work only with ridge regression
estimating arm rewards, general-purpose MABs aim to be able to use any supervised learner instead. The
following ICML’2020 paper describes a state-of-the-art approach to general-purpose MABs. (Note that in this paper ‘oracle’ or ‘online regression oracle’ refers to the user-defined supervised learning algorithm. It is an oracle because the bandit can call it for answers without knowing how the oracle works.)
Dylan Foster and Alexander Rakhlin. ‘Beyond UCB: Optimal and Efficient Contextual Bandits
with Regression Oracles.’ In Proceedings of the 37th International Conference on Machine Learning
(ICML’2020), pp. 3199–3210, PMLR, 2020. http://proceedings.mlr.press/v119/foster20a/foster20a.pdf
You are to read this paper up to and not including Theorem 1, then the first paragraph of Section 2.2
and Algorithm 1 (i.e., you might skip Theorem 1 to Section 2.1, and most of Section 2.2 after the algorithm
is described; these skipped discussions are really interesting theoretical developments but not needed for the
task).
You should implement the paper’s Algorithm 1 within the provided skeleton code for the SquareCB class. As
before, you will need to implement the __init__, play, and update methods. The parameters and return values
for each method are specified in docstrings in proj2.ipynb. While the paper allows any ‘online regression oracle’ to be input into Algorithm 1 as parameter ‘SqAlg’, you should hard-code your oracle to be scikit-learn's logistic regression implementation (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), which is imported for you in the proj2.ipynb skeleton. Use the logistic regression class's defaults. We have picked logistic regression since the data in this project has binary click rewards.
This paper appears to be free of errors, so we do not provide a full appendix of clarifications; we do, however, offer some pointers now. Hint: this paper deals with losses rather than rewards. To fit this into our framework, it might help to convert a loss to a reward by negating it. Hint: Line 6 of the algorithm defines the probabilities p_{t,a} of a distribution, but in general these might not sum to one and so might not form a proper distribution. You should hard-code parameter µ to be the number of arms to fix this. Finally, be sure to use a separate model per arm.
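To make these pointers concrete, here is a rough numpy sketch of the probability computation in the paper's Algorithm 1 (lines 5–6), assuming y_hat holds per-arm predicted losses from the oracle (e.g., negated predicted rewards) and gamma is the learning-rate parameter; mu is hard-coded to the number of arms as suggested, and the numbers are placeholders.

import numpy as np

rng = np.random.default_rng(0)           # in the notebook, use the provided rng
y_hat = np.array([0.3, 0.1, 0.5, 0.2])   # placeholder predicted losses, one per arm
gamma = 18.0
n_arms = len(y_hat)
mu = n_arms                              # hard-coded as suggested above

b = int(np.argmin(y_hat))                # arm with the smallest predicted loss
p = 1.0 / (mu + gamma * (y_hat - y_hat[b]))
p[b] = 0.0
p[b] = 1.0 - p.sum()                     # remaining probability mass goes to arm b
arm = rng.choice(n_arms, p=p)            # sample the arm to play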
Experiments. Repeat the same set of experiments as above, except for (respectively):
(a) Here run mab = SquareCB(10, 10, 18.0, rng) with all rewards (ignoring missing-reward column 3), and save the results in SquareCB_rewards instead. Once again also run LinUCB on the same data. Hint: where does γ = 18 come from? It was examined in a reference of the paper, Abe & Long.
(b) When plotting the results on all data (ignore missing rewards), include LinUCB’s curve too.
(c) This time consider tuning gamma.
A Appendix: Details for Off-Policy Evaluation and Dataset
You are directed to experiment with your bandits in each project task. To help with this, we have provided in
the skeleton notebook a useful offline_eval() function, and a dataset. This section describes both.
Off-policy evaluation: The offline_eval() function evaluates bandit classes on previously collected data.
The parameters and return value of the offline_eval function—its interface—are specified in the function’s
docstring. Recall that bandits are interactive methods: to train, they must interact with their environment.
This is true even for evaluating a bandit, as it must be trained at the same time as it is evaluated. But this
requirement to let a bandit learner loose on real data was a major practical challenge for industry deployments of MABs. Typically bandits begin with little knowledge about arm reward structure, and so a bandit must
necessarily suffer poor rewards in beginning rounds. For a company trying out and evaluating dozens of bandits
in their data science groups, this would be potentially prohibitively expensive. A breakthrough was made when
it was realised that MABs can be evaluated offline or off-policy. (‘Policy’ is the function outputting an arm
given a context.) With off-policy evaluation: you collect just one dataset of uniformly-random arm pulls and
resulting rewards; then you evaluate any interesting future bandit learners on that one historical dataset! We
have provided this functionality in offline_eval() for you, which implements Algorithm 3 “Policy Evaluator”
of the WWW’2010 paper referenced in Task 1. In order to understand this algorithm, you could read Section 4
of WWW’2010 [5].
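The provided offline_eval() already implements this for you, but as a rough mental model (and not the provided code), the rejection-style evaluator of WWW’2010 Algorithm 3 looks roughly like the sketch below: logged events whose uniformly-random arm does not match the bandit's choice are simply skipped.

import numpy as np

def offline_eval_sketch(mab, arms, rewards, contexts, n_rounds):
    # Illustrative sketch of WWW'2010 Algorithm 3; not the provided offline_eval().
    per_round_rewards, matched_ids = [], []
    for i in range(len(arms)):
        if len(per_round_rewards) >= n_rounds:
            break
        chosen = mab.play(contexts[i])
        if chosen == arms[i]:                      # only matching logged pulls count
            per_round_rewards.append(rewards[i])
            matched_ids.append(i)
            mab.update(chosen, contexts[i], rewards[i])
    return np.array(per_round_rewards), np.array(matched_ids)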
Dataset. Alongside the skeleton notebook, we provide a 2 MB dataset.txt suitable for validating contextual MAB implementations. You should download this file and place it in the same directory as your local proj2.ipynb. It is formatted as follows:
• 10,000 lines (i.e., rows) corresponding to distinct site visits by users—events in the language of this task;
• Each row comprises 103 space-delimited columns of integers:
– Column 1: The arm played by a uniformly-random policy out of 10 arms (news articles);
– Column 2: The reward received from the arm played—1 if the user clicked, 0 otherwise;
– Column 3: An indicator used in Task 2 only. In that task you will be considering a MAB that can
operate when some rewards are missing. This indicator value says whether the reward should be
missing (1) or available (0); and
– Columns 4–103: The 100-dim flattened context: 10 features per arm (incorporating the content of
the article and its match with the visiting user), first the features for arm 1, then arm 2, etc. up to
arm 10.
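For reference, a snippet along the following lines loads the file and splits out the columns described above (note that numpy uses zero-based column indices, so columns 1–3 of the description become indices 0–2); the exact loading code and the representation of missing rewards used in the notebook may differ.

import numpy as np

data = np.loadtxt("dataset.txt")      # shape (10000, 103)
arms = data[:, 0].astype(int)         # column 1: arm played by the uniformly-random policy
rewards = data[:, 1]                  # column 2: click reward (1 or 0)
missing = data[:, 2].astype(int)      # column 3: missing-reward indicator (Task 2 only)
contexts = data[:, 3:]                # columns 4-103: 100-dim flattened context

# Task 2 variant: blank out the rewards flagged as missing (NaN here; the
# notebook's convention, e.g. None, may differ)
rewards_missing = np.where(missing == 1, np.nan, rewards)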
[5] What might not be clear in the pseudo-code of Algorithm 3 of WWW’2010 is that after the MAB plays (policy written as π) an arm that matches the next logged record, off-policy evaluation notes down the reward as if the MAB really received this reward—for the purposes of the actual evaluation calculation; it also runs update() of the MAB with the played arm a, reward r_a, and the context x_1, ..., x_K over the K arms.
Algorithm 1 MLinUCB, correcting Algorithm 2 of Bouneffouf et al. (2020)
 1: Input: values for α, b_0, A_0, N, m
 2: for all k ∈ K do
 3:   A_k ← A_0, b_k ← b_0
 4: end for
 5: for t = 1 to T do
 6:   Let {x_1, ..., x_s} be the updated contexts up to round t
 7:   Cluster {x_1, ..., x_s} into min{s, N} clusters
 8:   for all k ∈ K do
 9:     θ_k ← A_k^{-1} b_k
10:     p_k ← θ_k^T x_{t,k} + α √( x_{t,k}^T A_k^{-1} x_{t,k} )
11:   end for
12:   Choose arm k_t = argmax_{k ∈ K} p_k, and observe real-valued payoff r_t
13:   if r_t is available from the data then
14:     retrieve r_t from the data
15:   else
16:     m′ ← min{m, s}
17:     r_t ← g(x_t) = ( Σ_{j=1}^{m′} r_j / (0.1 + d_j) ) / ( Σ_{j=1}^{m′} 1 / (0.1 + d_j) )
18:   end if
19:   if s > 0 or r_t successfully retrieved then
20:     A_k ← A_k + x_{t,k} x_{t,k}^T
21:     b_k ← b_k + r_t x_{t,k}
22:   end if
23: end for
B Appendix: Corrected Pseudo-Code for MLinUCB
We now list some corrections to Algorithm 2 of Bouneffouf et al. (2020) for Task 2’s MLinUCB. These are summarised in our pseudo-code Algorithm 1, shown above.
The purpose of inputs A_0, b_0 (a matrix and a vector) is to initialise the model parameters per arm (new line 3), which wasn’t done originally.
There are several places where index k_t was used but left undefined; it can often just be read as k, the arm index. In some places the arm-specific context x_{t,k} should be used rather than the overall context x_t. These issues are largely corrected in Algorithm 1.
A detail omitted in the paper: you can’t have more clusters than the number of observations being clustered.
Hint: if s denotes the number of instances passed to clustering, then form min{s,N} clusters instead of N , and
find the closest min{s,m} clusters instead of m (new lines 7 and 16).
To prevent division by zero in the definition of function g() (Equation 2 of the paper), you should add the constant 0.1 to the denominators (new line 17).
You might need to play/update an arm without seeing a reward for it. You cannot update model parameters
in this case (new line 19).
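Putting the last two corrections together, the imputation on line 17 of Algorithm 1 might be computed along the following lines, where dists holds the distances d_j from x_t to its m′ closest cluster centres and cluster_rewards holds the corresponding rewards r_j; both arrays are placeholders for illustration.

import numpy as np

dists = np.array([0.4, 0.9, 1.3])            # d_j: distances to the m' closest centres
cluster_rewards = np.array([1.0, 0.0, 1.0])  # r_j: rewards associated with those clusters

weights = 1.0 / (0.1 + dists)                # the 0.1 offset prevents division by zero
r_t = np.sum(cluster_rewards * weights) / np.sum(weights)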