
Bandit problems for online pricing

Dr. Tanut Treetanthiploet

October 31, 2021

The idea behind the bandit problem is how to make decisions when facing a trade-off between
exploration and exploitation. In this exercise, you will be asked to prove a theoretical result that guarantees
the performance of a bandit algorithm when applied to an online pricing problem. You will also be asked to
run simulations to study the performance of a few bandit algorithms.

It is worth pointing out that the mathematical formulation of bandit problems differs slightly across the
literature. It is recommended that you write your report and give your presentation using the mathematical
setup described in [7].

Your report and presentation on this topic should be self-contained and should not refer directly to the exercises.
You are advised to begin with a brief introduction to the ideas of bandit problems when each arm corresponds
to different parameters, and to discuss sub-Gaussian distributions. You should then introduce the framework for
online pricing, prove the theoretical results, and present your simulations.

Exercise 1: Show that any bounded random variable with mean 0 is sub-Gaussian.
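For reference, a standard definition (consistent with [7, Chapter 5]) from which your proof can start: a random variable $X$ is $\sigma$-sub-Gaussian if

$\mathbb{E}[\exp(\lambda X)] \le \exp(\lambda^2 \sigma^2 / 2)$ for all $\lambda \in \mathbb{R}$.

One possible route is Hoeffding's lemma, which yields this bound with $\sigma = (u - \ell)/2$ for a mean-zero random variable supported on $[\ell, u]$.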

Exercise 2: Describe at least 3 bandit algorithms for the classical stochastic bandit setting (e.g. when each
arm corresponds to different parameters). Clearly state the assumptions of each algorithm and give
the corresponding theoretical guarantee on the regret for your illustrated examples (you are not required to
prove these regret bounds). Examples of algorithms with regret bounds may be found in [1, 2, 3, 4, 5, 6, 7, 8].

Suppose that an agent wants to choose the price of a single product from a finite set of prices $\{c_1, c_2, \dots, c_K\}$.
When the price $c_k$ is displayed, the number of customers who buy the product on the $t$-th day is

$D_t(c_k) = a c_k + b + \epsilon_t$,     (*)

where $\epsilon_t \overset{\mathrm{IID}}{\sim} N(0, 1)$.

Exercise 3: Suppose that the agent chooses to display the prices $c_{A_1}, c_{A_2}, \dots, c_{A_t}$ and observes the numbers
of customers $D_1, \dots, D_t$. The least-squares estimates of $a$ and $b$ at time $t$ are given by

$(\hat{a}_t, \hat{b}_t) = \operatorname*{argmin}_{a, b} \sum_{s=1}^{t} \big( a c_{A_s} + b - D_s \big)^2.$

Under what condition is $(\hat{a}_t, \hat{b}_t)$ unique? Assume that this condition is satisfied, and find an
expression for $(\hat{a}_t, \hat{b}_t)$ in terms of $c_{A_1}, \dots, c_{A_t}$ and $D_1, \dots, D_t$.
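For the simulations later in this exercise sheet, the following minimal Python sketch computes the closed-form least-squares solution (the function name ols_estimate is ours, not part of the exercise); note that it breaks down exactly when every displayed price is the same, which is a hint towards the uniqueness condition:

import numpy as np

def ols_estimate(prices, demands):
    # Least-squares estimate (a_hat, b_hat) for the model D = a*c + b + noise.
    c = np.asarray(prices, dtype=float)
    d = np.asarray(demands, dtype=float)
    c_bar, d_bar = c.mean(), d.mean()
    denom = np.sum((c - c_bar) ** 2)
    if denom == 0.0:
        # All displayed prices coincide: the minimiser is not unique.
        raise ValueError("at least two distinct prices are required")
    a_hat = np.sum((c - c_bar) * (d - d_bar)) / denom
    b_hat = d_bar - a_hat * c_bar
    return a_hat, b_hat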

Exercise 4: What is the corresponding estimate $(\hat{a}_t, \hat{b}_t)$ for the Explore-Then-Commit (ETC) algorithm
after the first $mK$ attempts? Assume (for this exercise) that $K = 2$. Recall the result from our lecture
that there exists a constant $\sigma > 0$ such that, for $\Delta = |\mu_1 - \mu_2|$ where $\mu_k = a c_k^2 + b c_k$, the expected regret of the ETC
algorithm satisfies

$\mathbb{E}_{a,b}[R(T, \mathcal{A}_{\mathrm{ETC}})] \le \Delta m + \Delta T \exp(-m \Delta^2 / \sigma^2).$

Show that for any $\lambda > 0$, there exists a constant $C$ (depending on $\Delta$, $\sigma$ and $\lambda$) such that the expected
regret of the ETC algorithm with $m = \lfloor \lambda T^{2/3} \rfloor$ satisfies $\mathbb{E}_{a,b}[R(T, \mathcal{A}_{\mathrm{ETC}})] \le C T^{2/3}$. (Hint: you may first
find an upper bound for $\Delta T \exp(-m \Delta^2 / \sigma^2)$ which does not depend on $\Delta$.)
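The following is a minimal simulation sketch of ETC for this pricing problem (useful for Exercise 7); we take the reward on day $t$ to be the revenue $c_{A_t} D_t$, which matches $\mu_k = a c_k^2 + b c_k$, and the function name etc_pricing is ours:

import numpy as np

def etc_pricing(prices, a, b, T, m, rng):
    # Explore-Then-Commit: display each of the K prices m times in
    # round-robin order, then commit to the price with the highest
    # total observed revenue c * D.  Assumes T >= m * K.
    prices = np.asarray(prices, dtype=float)
    K = len(prices)
    arms = [k for _ in range(m) for k in range(K)]  # exploration phase
    revenue = np.zeros(K)
    for k in arms:
        demand = a * prices[k] + b + rng.standard_normal()
        revenue[k] += prices[k] * demand
    best = int(np.argmax(revenue))                  # commit once
    arms += [best] * (T - len(arms))
    return arms                                     # chosen arm indices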


Exercise 5: Assume that $(a, b)$ is sampled from the multivariate normal prior with mean $\begin{pmatrix} -2 \\ 4 \end{pmatrix}$ and
variance $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$. What is the posterior distribution of $(a, b)$? Describe Thompson Sampling (TS) corresponding
to this setting. (You may find [7, Chapter 36] or [9] helpful.)
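As a complement (this sketches an implementation rather than the derivation the exercise asks for): with unit noise variance the posterior remains Gaussian, with precision $\Sigma_0^{-1} + X^\top X$ and mean $\Sigma_t (\Sigma_0^{-1} \mu_0 + X^\top D)$, where the rows of $X$ are $(c_{A_s}, 1)$. A minimal TS sketch under these assumptions (the function name thompson_pricing is ours):

import numpy as np

def thompson_pricing(prices, a, b, T, mu0, Sigma0, rng):
    # Thompson Sampling for D = a*c + b + N(0, 1) with a Gaussian
    # prior N(mu0, Sigma0) on (a, b): sample (a, b) from the current
    # posterior, display the price that maximises the sampled revenue.
    prices = np.asarray(prices, dtype=float)
    precision = np.linalg.inv(Sigma0)                  # Sigma_t^{-1}
    shift = precision @ np.asarray(mu0, dtype=float)   # Sigma_0^{-1} mu0 + X^T D
    arms = []
    for _ in range(T):
        Sigma_t = np.linalg.inv(precision)
        mu_t = Sigma_t @ shift
        a_s, b_s = rng.multivariate_normal(mu_t, Sigma_t)  # posterior draw
        k = int(np.argmax(prices * (a_s * prices + b_s)))  # sampled revenue
        demand = a * prices[k] + b + rng.standard_normal()
        x = np.array([prices[k], 1.0])                 # feature (c, 1)
        precision += np.outer(x, x)                    # conjugate update
        shift += x * demand
        arms.append(k)
    return arms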

Exercise 6: Propose another possible bandit algorithm (other than ETC and Thompson Sampling) that
can be used to solve the stated online pricing problem. You may construct an algorithm yourself or modify/apply
any existing algorithm. You are not expected to prove any theoretical guarantee for the proposed algorithm.
However, you should explain how your algorithm explores options and exploits the available information.
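For illustration only (this is not the required answer): one well-known family you could adapt is a UCB-type rule [2], treating each price as an independent arm; the bonus scale alpha below is a hyperparameter we introduce. How the exploration bonus trades off against the empirical mean revenue is exactly the explore/exploit mechanism you would be expected to explain.

import numpy as np

def ucb_pricing(prices, a, b, T, rng, alpha=2.0):
    # UCB-style rule: play the price maximising average revenue plus an
    # exploration bonus that shrinks as an arm accumulates observations.
    prices = np.asarray(prices, dtype=float)
    K = len(prices)
    pulls = np.zeros(K, dtype=int)
    revenue = np.zeros(K)
    arms = []
    for t in range(T):
        if t < K:
            k = t                                      # play each arm once
        else:
            bonus = np.sqrt(alpha * np.log(t) / pulls)
            k = int(np.argmax(revenue / pulls + bonus))
        demand = a * prices[k] + b + rng.standard_normal()
        revenue[k] += prices[k] * demand
        pulls[k] += 1
        arms.append(k)
    return arms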

Exercise 7: Run 100 repeats of a bandit simulation with $T = 10^3$ for the proposed model (*) with $a = -1$,
$b = 4$ and $c_k = 0.25k + 1$ for $k = 1, \dots, 10$, where you should consider the following algorithms:

(i) Explore-Then-Commit (ETC) algorithm with $m = \lfloor \lambda T^{2/3} \rfloor$ for $\lambda = 0.01, 0.1, 0.5, 1$.

(ii) Thompson Sampling (TS) with prior

• $N\left( \begin{pmatrix} -2 \\ 4 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right)$

• $N\left( \begin{pmatrix} 20 \\ 10 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right)$

• $N\left( \begin{pmatrix} -2 \\ 4 \end{pmatrix}, \begin{pmatrix} 10^{-5} & 0 \\ 0 & 10^{-5} \end{pmatrix} \right)$.

(iii) Your proposed algorithm from Exercise 6, where you should vary the hyperparameter(s) of the algorithm
(if any).

Illustrate the performance of these algorithms by plotting the cumulative pseudo-regret

$R^{\mathrm{pseudo}}(T, \mathcal{A}) := \sum_{t=1}^{T} \big( \mu_{k^*} - \mu_{A_t} \big),$

where $\mu_k = a c_k^2 + b c_k$. Discuss the effect of the hyperparameter(s) in each of these algorithms.
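A minimal harness sketch for this exercise, assuming the etc_pricing function sketched after Exercise 4 (any of the algorithms above can be swapped in, as long as it returns the sequence of chosen arm indices):

import numpy as np

def pseudo_regret(arms, prices, a, b):
    # Cumulative pseudo-regret of one run from the chosen arm indices.
    mu = a * prices ** 2 + b * prices                  # expected revenue mu_k
    return np.cumsum(mu.max() - mu[np.asarray(arms)])

prices = 0.25 * np.arange(1, 11) + 1                   # c_k = 0.25k + 1
a, b, T = -1.0, 4.0, 1000
m = int(np.floor(0.1 * T ** (2 / 3)))                  # ETC with lambda = 0.1
rng = np.random.default_rng(0)
runs = np.stack([pseudo_regret(etc_pricing(prices, a, b, T, m, rng),
                               prices, a, b) for _ in range(100)])
print("mean final pseudo-regret:", runs[:, -1].mean())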

References

[1] S. Agrawal and N. Goyal. Analysis of Thompson Sampling for the multi-armed bandit problem. In
JMLR: Workshop and Conference Proceedings, 2012.

[2] P. Auer. Using Confidence Bounds for Exploitation-Exploration Trade-offs. Journal of Machine Learning
Research, pages 397–422, 2002.

[3] N. Cesa-Bianchi, C. Gentile, G. Lugosi, and G. Neu. Boltzmann exploration done right. In Proceedings
of the 31st International Conference on Neural Information Processing Systems, 2017.

[4] S. Filippi, O. Cappé, A. Garivier, and C. Szepesvári. Parametric bandits: the Generalized Linear
case. In NIPS’10: Proceedings of the 23rd International Conference on Neural Information Processing
Systems, 2010.

[5] E. Kaufmann, O. Cappé, and A. Garivier. On Bayesian Upper Confidence Bounds for Bandit
problems. In Artificial Intelligence and Statistics, pages 592–600, 2012.

[6] T. Lattimore. Regret analysis of the finite-horizon Gittins index strategy for multi-armed bandits.
arXiv:1511.06014, 2015.


[7] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2019.

[8] P. Rusmevichientong, A.J. Mersereau, and J. N. Tsitsiklis. A Structured Multiarmed Bandit Problem
and the Greedy Policy. In Proceedings of the IEEE Conference on Decision and Control, 2009.

[9] D. Russo, B. Van Roy, A. Kazerouni, I. Osband, and Z. Wen. A tutorial on Thompson Sampling.
Foundations and Trends in Machine Learning, 11(1):1–96, 2018.
