COMP90051 Statistical Machine Learning — Project 2 Description¹ (v3, updated 2021-09-19)
Due date: 4:00pm Friday, 8th October 2021 Weight: 25%; forming combined hurdle with Proj1
Copyright statement: All the materials of this project—including this specification and code skeleton—are copyright of the University of Melbourne. These documents are licensed for the sole purpose of your assessment in COMP90051. You are not permitted to share or post these documents.
Academic misconduct: You are reminded that all submitted Project 2 work is to be your own individual work. Automated similarity checking software will be used to compare all submissions. It is University policy that academic integrity be enforced. For more details, please see the policy at http://academichonesty.unimelb.edu.au/policy.html. Academic misconduct hearings can determine that students receive zero for an assessment or a whole subject, or are terminated from their studies. You may not use software libraries or code snippets found online, from friends/private tutors, or anywhere else. You can only submit your own work.
Multi-armed bandits (MABs) are a simple yet powerful framework for sequential decision-making under uncertainty. One of the many applications of MABs was pioneered by Yahoo! News, in deciding what news items to recommend to users based on article content, user profile, and the user's historical engagement with articles. Since decision making in this setting is sequential (what do we show next?) and feedback is only available for articles actually shown, Yahoo! researchers recognised a perfect fit for MABs such as those (ε-Greedy and UCB) discussed in Lecture 17.² These researchers realised that incorporating some element of user-article state requires contextual bandits: articles are arms; the context per round incorporates information about both the user and the article (arm); and {0, 1}-valued rewards represent clicks. The per-round cumulative reward therefore represents the click-through rate, which is exactly what online services want to maximise to drive user engagement and advertising revenue.

In this project, you will work individually to implement several MAB algorithms. These are a little more advanced than what was covered in class, coming from real research papers that you will have to read and understand yourself. By the end of the project you should have developed:
ILO1. A deeper understanding of the MAB setting and common MAB approaches;
ILO2. An appreciation of how MABs are applied;
ILO3. Demonstrable ability to implement ML approaches in code; and
ILO4. Ability to read recent ML papers and understand their focus, contributions, and algorithms well enough to implement and apply them (while ignoring details not needed for your task).
Overview
You have three tasks.
1. Implement LinUCB contextual MAB (Li et al., 2010) [7 marks]
2. Implement MLinUCB (Bouneffouf et al., 2020) [9 marks]
3. Implement SquareCB (Foster & Rakhlin, 2020) [9 marks]
¹ Special thanks to  for generously co-designing this project.
² We're releasing Lecture 17 with this project to help get you started.
All tasks are to be completed in the provided Python Jupyter notebook proj2.ipynb.³ Detailed instructions for each task are included in this document. These will require you to consult one or more academic papers. We provide helpful pointers in this project spec to guide your reading and to correct any ambiguities, marked with a margin symbol.
MAB algorithms. The project’s tasks require you to implement MAB algorithms by completing provided skeleton code in proj2.ipynb. All of the MAB algorithms must be implemented as sub-classes of a base MAB class (defined for you). This ensures all MAB algorithms inherit the same interface, with the following methods:
• __init__(self, n_arms, …, rng): Initialises the MAB with the given number of arms n_arms and pseudo-random number generator rng. Arms are indexed by integers in the set {0, . . . , n_arms − 1}. All MAB algorithms take additional algorithm-specific parameters in place of ‘…’. Hint: When instantiating a MAB, you should set rng to be the random generator initialised for you in the notebook. This will ensure your results are reproducible.
• play(self, context): Plays an arm based on the provided context (a multi-dimensional array that encodes user and arm features). The method must return the integer index of the played arm in the set {0, . . . , n_arms − 1}. Note: this method should not update the internal state of the MAB. All play methods should tie-break uniformly at random in this project. If a MAB class’s play() method is undefined due to insufficient update() calls, you should by default pick an unplayed arm uniformly at random.
• update(self, arm, context, reward): Updates the internal state of the MAB after playing an arm with integer index arm for the given context, receiving an (optional) real-valued reward. Only the MLinUCB class of Task 2 will handle missing rewards; all other classes should do nothing if updated with a missing reward.
Your implementations must conform to this interface. You may implement some functionality in the base MAB class if you desire—e.g., to avoid duplicating common functionality in each sub-class. Your classes may also use additional private methods to better structure your code. And you may decide how to use class inheritance.
Evaluating MAB classes. Each task directs you to evaluate your implementations in some way. You will need to use the offline_eval() function and dataset file provided. See Appendix A for details.
Python environment. You must use the Python environment used in workshops to ensure markers can reproduce your results if required. We assume you are using Python ≥ 3.8, numpy ≥ 1.19.0, scikit-learn ≥ 0.23.0 and matplotlib ≥ 3.2.0.
Other constraints. You may not use functionality from external libraries/packages beyond what is imported in the provided Jupyter notebook (highlighted with a margin mark). You must preserve the structure of the skeleton code—please only insert your own code where specified. You should not add new cells to the notebook. You may discuss the bandit learning slide deck or Python at a high level with others, including via Piazza, but do not collaborate with anyone on direct solutions. You may consult resources to understand bandits conceptually, but do not make any use of online code whatsoever. (We will run code comparisons against online partial implementations to enforce these rules. See the 'academic misconduct' statement above.)
Submission Checklist
You must complete all your work in the provided proj2.ipynb Jupyter notebook. When you are ready to submit, follow these steps. You may submit multiple times; we will mark your last attempt.
³ We appreciate that while some have entered COMP90051 feeling less confident with Python, many workshops so far have exercised and built up basic Python and Jupyter knowledge. Both are industry standard tools for the data sciences.
Hint: it is a good idea to submit early as a backup. Try to complete Task 1 in the first week and submit it; it will help you understand the other tasks and be a fantastic start!
1. Restart your Jupyter kernel and run all cells consecutively.
2. Ensure outputs are saved in the ipynb file, as we may choose not to run your notebook when grading.
3. Rename your completed notebook from proj2.ipynb to username.ipynb, where username is your university central username.⁴
4. Upload your submission to the Project 2 Canvas page.
Marking
Projects will be marked out of 25. Overall approximately 50%, 30%, 20% of available marks will come from correctness, code structure & style, and experimentation. Markers will perform code reviews of your submissions with indicative focus on the following. We will endeavour to provide (indicative not exhaustive) feedback.
1. Correctness: Faithful implementation of the algorithm as specified in the reference or clarified in the specification, with possible updates in the Canvas changelog. It is important that your code also performs basic functions such as raising errors if inputs are invalid and working for any dataset that meets the requirements (i.e., nothing hard-coded).
2. Code structure and style: Efficient code (e.g., making use of vectorised functions, avoiding recalculation of expensive results); self-documenting variable names and/or comments; avoiding inappropriate data structures, duplicated code and illegal package imports.
3. Experimentation: Each task you choose to complete directs you to perform some experimentation with your implementation, such as evaluation, tuning, or comparison. You will need to choose a reasonable approach to your experimentation, based on your acquired understanding of the MAB learners.
Late submission policy. Late submissions will be accepted up to 4 days late, incurring a −2.5 mark penalty per day or part-day.

Task Descriptions
Task 1: Implement LinUCB Contextual MAB [7 marks total]
In this task, you will implement a MAB learner as a Python class. You are to read up to and including Section 3.1 of this WWW'2010 paper to understand and then implement the LinUCB learner with disjoint (not shared) linear models. LinUCB is described in the paper's Algorithm 1. This is a contextual bandit—likely the first you've seen—however its workings are a direct mashup of UCB and ridge regression, both of which we cover in lectures.
Lihong Li, Wei Chu, John Langford, and Robert E. Schapire, 'A Contextual-Bandit Approach to Personalized News Article Recommendation', in Proceedings of the Nineteenth International Conference on World Wide Web (WWW'2010), Raleigh, NC, USA, 2010. https://arxiv.org/pdf/1003.0146.pdf
Your implementation of WWW’2010 Algorithm 1 should be done within the provided skeleton code for the LinUCB class. You will need to implement the __init__, play, and update methods. The parameters and return values for each method are specified in docstrings in proj2.ipynb. Note that tie-breaking in play should be done uniformly-at-random among value-maximising arms.
⁴ LMS/UniMelb usernames look like brubinstein, not to be confused with an email address such as benjamin.rubinstein.
While the idea of understanding LinUCB well enough to implement it correctly may seem daunting, the WWW paper is written for a non-ML audience and is complete in its description, and the pseudo-code is detailed. There is one unfortunate typo, however: pg. 3, column 2, line 3 of the arXiv version linked above should read c_a rather than b_a. The pseudo-code uses the latter (correctly) as shorthand for the former times the contexts.
Another hint: for students who haven't seen the term "design matrix" before, it means a feature matrix.

Experiments. Once your class is implemented, it is time to perform some basic experimentation.
(a) Include and run an evaluation on the given dataset where column 1 forms arms, column 2 forms rewards (and missing indicator column 3 is ignored), and columns 4–103 form contexts:
mab = LinUCB(10, 10, 1.0, rng)
LinUCB_rewards, _ = offline_eval(mab, arms, rewards, contexts, 800)
print("LinUCB average reward ", np.mean(LinUCB_rewards))
(b) Run offline evaluation as above, but now plot the per-round cumulative reward, i.e., (1/T) Σ_{t=1}^{T} r_{t,a}, for T = 1, ..., 800 as a function of the round T. We have imported matplotlib.pyplot to help here.
(c) Can you tune your MAB’s hyperparameters? Devise and run a grid-search based strategy to select alpha for LinUCB as Python code in your notebook. Output the result of this strategy—which could be a graph, number, etc. of your choice.
Task 2: Implement MLinUCB [9 marks total]
In this task, you will implement a contextual MAB that extends LinUCB (Task 1) for settings where observing rewards may not always be possible. For example, rewards may be lost due to errors in a physical process, they may be costly, or withheld due to privacy. The idea of this MLinUCB (LinUCB with missing rewards) is similar to semi-supervised learning: even when MLinUCB doesn’t observe a reward in a round, it still tries to learn from the context feature vector by guessing the missing reward using clustering of past context-reward pairs. You are to read up to but not including Theorem 1 of this 2020 arXiv paper:
Djallel Bouneffouf et al. 'Contextual Bandit with Missing Rewards.' arXiv:2007.06368 [cs.LG], 2020. https://arxiv.org/pdf/2007.06368.pdf
Your implementation of Algorithm 2 should be done within the provided skeleton code for the MLinUCB class. You will need to implement the __init__, play, and update methods. Recall that update() will receive reward=None for rounds with no observed reward. The parameters and return values for each method are specified in docstrings in proj2.ipynb. Your implementation should make use of the k-means clustering implemented in scikit-learn and imported for you in the proj2.ipynb skeleton. You will need to initialise your MAB using the hyperparameters N, m from the paper. Except for the number of clusters, you should just use all of the KMeans defaults.
Hint: A number of minor errors and omissions in the paper might cause confusion. To help out, we’ve clarified these and summarised the clarifications in a new pseudo-code algorithm, all in Appendix B.
Experiments. Repeat the same set of experiments from Task 1 here, except for (respectively):

(a) Run your new class, this time with missing rewards, as follows:
mab = MLinUCB(10, 10, 1.0, 10, 3)
MLinUCB_rewards, MLinUCB_ids = offline_eval(mab, arms, rewards_missing, contexts, 800)
print('MLinUCB average reward', np.mean(rewards[MLinUCB_ids]))
Then compare this with LinUCB also evaluated with the same missing-reward version of the data.
(b) Once again plot your results for both bandits, this time with the missing-reward data.
(c) This time consider tuning alpha, N and m.
Task 3: Implement SquareCB [9 marks total]
In this task, you will implement a final learner for the regular contextual MAB setting (the same setting as Task 1), for ‘general purpose’ contextual MABs. Where LinUCB is designed to work only with ridge regression estimating arm rewards, general-purpose MABs aim to be able to use any supervised learner instead. The following ICML’2020 paper describes a state-of-the-art approach to general-purpose MABs. (Note in this paper ‘oracle’ or ‘online regression oracle’ refers to this user-defined supervised learning algorithm. It is an oracle because the bandit can call it for answers without knowing how the oracle works.)
Dylan Foster and Alexander Rakhlin. 'Beyond UCB: Optimal and Efficient Contextual Bandits with Regression Oracles.' In Proceedings of the 37th International Conference on Machine Learning (ICML'2020), pp. 3199–3210, PMLR, 2020. http://proceedings.mlr.press/v119/foster20a/foster20a.pdf
You are to read this paper up to but not including Theorem 1, then the first paragraph of Section 2.2 and Algorithm 1 (i.e., you may skip from Theorem 1 to the end of Section 2.1, and most of Section 2.2 after the algorithm is described; these skipped discussions contain really interesting theoretical developments, but they are not needed for the task).
You should implement the paper's Algorithm 1 within the provided skeleton code for the SquareCB class. As before, you will need to implement the __init__, play, and update methods. The parameters and return values for each method are specified in docstrings in proj2.ipynb. While the paper allows any 'online regression oracle' to be supplied to Algorithm 1 as parameter 'SqAlg', you should hard-code your oracle to be the logistic regression implementation from scikit-learn that is imported for you in the proj2.ipynb skeleton. Use the logistic regression class's defaults. We have picked logistic regression since the rewards in this project are {0, 1}-valued clicks.
While this paper appears to be free of errors, so we do not provide a full appendix of clarifications, we offer a few pointers now. Hint: this paper deals with losses, not rewards; to fit it into our framework, it may help to convert a loss to a reward by using its negative value. Hint: line 6 of the algorithm defines the p_{t,a} of a distribution, but in general these values might not sum to one to form a proper distribution. You should hard-code parameter μ to be the number of arms to fix this. Finally, you should be sure to use a separate model per arm.
Experiments. Repeat the same set of experiments as above, except for (respectively):
(a) Here run mab = SquareCB(10, 10, 18.0, rng) with all rewards (ignoring the missing-reward indicator, column 3), saving results in SquareCB_rewards instead. Once again, also run LinUCB on the same data. Hint: where does γ = 18 come from? It was examined in a reference of the paper, Abe & Long.
(b) When plotting the results on all data (ignore missing rewards), include LinUCB’s curve too.
(c) This time consider tuning gamma.
Appendix A: Details for Off-Policy Evaluation and Dataset
You are directed to experiment with your bandits in each project task. To help with this, we have provided in the skeleton notebook a useful offline_eval() function, and a dataset. This section describes both.
Off-policy evaluation: The offline_eval() function evaluates bandit classes on previously collected data. The parameters and return value of the offline_eval function—its interface—are specified in the function's docstring. Recall that bandits are interactive methods: to train, they must interact with their environment. This is true even when evaluating a bandit, as it must be trained at the same time as it is evaluated. But this requirement to let a bandit learner loose on real data was a major practical challenge for industry deployments of MABs. Typically bandits begin with little knowledge about arm reward structure, and so a bandit must necessarily suffer poor rewards in the beginning rounds. For a company trying out and evaluating dozens of bandits in its data science groups, this could be prohibitively expensive. A breakthrough was made when it was realised that MABs can be evaluated offline or off-policy. ('Policy' is the function outputting an arm given a context.) With off-policy evaluation, you collect just one dataset of uniformly-random arm pulls and resulting rewards; then you evaluate any interesting future bandit learners on that one historical dataset! We have provided this functionality in offline_eval() for you, which implements Algorithm 3 "Policy Evaluator" of the WWW'2010 paper referenced in Task 1. In order to understand this algorithm, you could read Section 4 of WWW'2010.⁵
Dataset. Alongside the skeleton notebook, we provide a 2 MB dataset.txt suitable for validating contextual MAB implementations. You should download this file and place it in the same directory as your local proj2.ipynb. It is formatted as follows (a loading sketch follows the column list):
• 10,000 lines (i.e., rows) corresponding to distinct site visits by users—events in the language of this task;
• Each row comprises 103 space-delimited columns of integers:
– Column 1: The arm played by a uniformly-random policy out of 10 arms (news articles);
– Column 2: The reward received from the arm played—1 if the user clicked, 0 otherwise;
– Column 3: An indicator used in Task 2, only. In that task you will be considering a MAB that can operate when some rewards are missing. This indicator value says whether the reward should be missing (1) or available (0); and
– Columns 4–103: The 100-dim flattened context: 10 features per arm (incorporating the content of the article and its match with the visiting user), first the features for arm 1, then arm 2, etc. up to arm 10.
⁵ What might not be clear in the pseudo-code of Algorithm 3 of WWW'2010 is that after the MAB plays (policy written as π) an arm that matches the next logged record, off-policy evaluation notes down the reward as if the MAB really received it—for the purposes of the actual evaluation calculation; it also runs update() of the MAB with the played arm a, reward r_a, and the context x_1, ..., x_K over the K arms.
Appendix B: Corrected Pseudo-Code for MLinUCB

We now list some corrections to Algorithm 2 of Bouneffouf et al. (2020) for Task 2's MLinUCB. These are summarised in our pseudo-code Algorithm 1.

Algorithm 1 MLinUCB (correcting Algorithm 2 of Bouneffouf et al. (2020))
1:  Input: values for α, b_0, A_0, N, m
2:  for all k ∈ K do
3:      A_k ← A_0, b_k ← b_0
4:  end for
5:  for t = 1 to T do
6:      Let {x_1, ..., x_s} be the updated contexts up to round t
7:      Cluster {x_1, ..., x_s} into min{s, N} clusters
8:      for all k ∈ K do
9:          θ_k ← A_k^{-1} b_k
10:         p_k ← θ_k^T x_{t,k} + α √(x_{t,k}^T A_k^{-1} x_{t,k})
11:     end for
12:     Choose arm k_t = argmax_{k ∈ K} p_k, and observe real-valued payoff r_t
13:     if r_t is available from the data then
14:         retrieve r_t from the data
15:     else
16:         m' ← min{m, s}
17:         r_t ← g(x_t) = ( Σ_{j=1}^{m'} r_j / (0.1 + d_j) ) / ( Σ_{j=1}^{m'} 1 / (0.1 + d_j) )
18:     end if
19:     if s > 0 or r_t was successfully retrieved then
20:         A_{k_t} ← A_{k_t} + x_{t,k_t} x_{t,k_t}^T
21:         b_{k_t} ← b_{k_t} + r_t x_{t,k_t}
22:     end if
23: end for
The purpose of inputs A_0, b_0 (a matrix and a vector) is to initialise the model parameters per arm (new line 3), which wasn't done originally.

There are several places where the index k_t was used but undefined; it can often just be k for the arm. Also, in some places the arm's specific context x_{t,k} should be used rather than the overall context x_t. These are largely corrected now.

A detail omitted in the paper: you can't have more clusters than the number of observations being clustered. Hint: if s denotes the number of instances passed to clustering, then form min{s, N} clusters instead of N, and find the closest min{s, m} clusters instead of m (new lines 7 and 16).

To prevent division by zero in the definition of function g() (Equation 2 of the paper), you should add the constant 0.1 to the denominators (new line 17).

You might need to play/update an arm without seeing a reward for it. You cannot update the model parameters in this case (new line 19).