
10-601 Introduction to Machine Learning
Machine Learning Department School of Computer Science Carnegie Mellon University
Reinforcement Learning:
Q-Learning + Deep RL
Matt Gormley Lecture 17 Mar. 29, 2021
1

Reminders
• Homework 5: Neural Networks – Out: Thu, Mar. 18
– Due: Mon, Mar. 29 at 11:59pm
• Homework 6: Deep RL
– Out: Mon, Mar. 29
– Due: Wed, Apr. 07 at 11:59pm
2

Q-LEARNING
3

Whiteboard
Q-Learning
– Q-Learning Algorithm
• Case 1: Deterministic Environment
• Case 2: Nondeterministic Environment
– ε-greedy Strategy
6
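To make the whiteboard outline above concrete, here is a minimal tabular sketch, assuming a hypothetical environment object with reset(), step(), and actions() methods (none of these names come from the lecture). The learning rate alpha handles the nondeterministic case; setting alpha = 1 recovers the deterministic-environment update.

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy exploration policy.

    Assumes a hypothetical env exposing reset() -> state,
    step(action) -> (next_state, reward, done), and actions(state).
    """
    Q = defaultdict(float)  # Q[(s, a)], initialized to 0 (finite initial values)

    def epsilon_greedy(s):
        if random.random() < epsilon:                        # explore
            return random.choice(env.actions(s))
        return max(env.actions(s), key=lambda a: Q[(s, a)])  # exploit

    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(s)
            s_next, r, done = env.step(a)
            # temporal-difference target: r + gamma * max_a' Q(s', a')
            target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in env.actions(s_next))
            # alpha = 1 recovers the deterministic update Q(s, a) <- target
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```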

Q-Learning
Remarks
– Q converges to Q* with probability 1.0, assuming…
  1. each (s,a) pair is visited infinitely often
  2. 0 ≤ γ < 1
  3. rewards are bounded: |R(s,a)| < β, for all (s,a)
  4. initial Q values are finite
– Q-Learning is exploration insensitive
  ⇒ any state visitation strategy will work, assuming condition 1 above holds
– May take many iterations to converge in practice
7

Reordering Experiences
8

Designing State Spaces
Q: Do we have to retrain our RL agent every time we change our state space?
A:
Yes. But whether your state space changes from one setting to another is determined by your design of the state representation.
Two examples:
– State Space A: position on map e.g. st = <74, 152>
– State Space B: window of pixel colors centered at current Pac Man location e.g. st =
  [ 0 0 1
    1 0 1
    0 0 1 ]  (a 3×3 binary window)
9
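A small hedged sketch of the two designs (the map array, window size, and agent position are invented for illustration). Representation B depends only on the local window, so a policy learned with it can carry over to a different map, while representation A is tied to one particular map.

```python
import numpy as np

# Hypothetical 2-D pixel map of the game (values are made up).
game_map = np.random.randint(0, 2, size=(300, 300))

def state_A(position):
    """State Space A: the agent's (x, y) position on the map, e.g. (74, 152)."""
    return tuple(position)

def state_B(position, window=3):
    """State Space B: a window of pixel values centered at the agent's location."""
    x, y = position
    half = window // 2
    return game_map[x - half:x + half + 1, y - half:y + half + 1].copy()

s_A = state_A((74, 152))   # changes meaning if the map itself changes
s_B = state_B((74, 152))   # 3x3 window of local pixels, e.g. [[0,0,1],[1,0,1],[0,0,1]]
```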

Q-Learning
Question:
For the R(s,a) values shown on the arrows below, which are the corresponding Q*(s,a) values?
Assume discount factor = 0.5.
[Figure: a small deterministic MDP with immediate rewards R(s,a) of 0, 0, 8, 8, 0, 16, and 0 labeled on the arrows]
Answer:
[Figure: the same MDP with the corresponding Q*(s,a) values filled in]
11
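The figure itself does not reproduce here, but the computation it asks for is the Bellman optimality recursion. A small worked chain, with invented states and rewards in the spirit of the numbers above:

```latex
% Q*(s,a) = R(s,a) + \gamma \max_{a'} Q^*(\delta(s,a), a'), with \gamma = 0.5
% Invented example: s_1 --(R=0)--> s_2 --(R=16)--> absorbing goal state
\begin{align*}
Q^*(s_2, \text{right}) &= 16 + 0.5 \cdot 0 = 16 \\
Q^*(s_1, \text{right}) &= 0 + 0.5 \cdot \max_{a'} Q^*(s_2, a') = 0 + 0.5 \cdot 16 = 8
\end{align*}
```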

DEEP RL EXAMPLES
13

TD Gammon → AlphaGo
Learning to beat the masters at board games
THEN
NOW
“…the world’s top computer program for backgammon, TD-GAMMON (Tesauro, 1992, 1995), learned its strategy by playing over one million practice games against itself…”
(Mitchell, 1997)
14

Atari Example: Reinforcement Learning
Playing Atari with Deep RL
• Setup: RL system observes the pixels on the screen
• It receives rewards as the game score
• Actions decide how to move the joystick / buttons
Figures from David Silver (Intro RL lecture)
15
[Figure: agent/environment interaction loop with observation Ot, action At, and reward Rt]

Playing Atari with Deep RL
Figure 1: Screenshots from five Atari 2600 Games: (Left-to-right) Pong, Breakout, Space Invaders, Seaquest, Beam Rider
Videos:
– Atari Breakout: https://www.youtube.com/watch?v=V1eYniJ0Rnk
– Space Invaders: https://www.youtube.com/watch?v=ePv0Fs9cGgU
Excerpt from Mnih et al. (2013): "…an experience replay mechanism [13] which randomly samples previous transitions, and thereby smooths the training distribution over many past behaviors. We apply our approach to a range of Atari 2600 games implemented in The Arcade Learning Environment (ALE) [3]. Atari 2600 is a challenging RL testbed that presents agents with a high-dimensional visual input (210 × 160 RGB video at 60Hz) and a diverse and interesting set of tasks that were designed to be difficult for human players. Our goal is to create a single neural network agent that is able to successfully learn to play as many of the games as possible. The network was not provided with any game-specific information or hand-designed visual features, and was not privy to the internal state of the emulator; it learned from nothing but the video input, the reward and terminal signals, and the set of possible actions—just as a human player would. Furthermore the network architecture and all hyperparameters used for training were kept constant across the games. So far the network has outperformed all previous RL algorithms on six of the seven games we have attempted and surpassed an expert human player on three of them. Figure 1 provides sample screenshots from five of the games used for training."
16
Figures from Mnih et al. (2013)

Playing Atari with Deep RL
Figure 1: Screenshots from five Atari 2600 Games: (Left-to-right) Pong, Breakout, Space Invaders, Seaquest, Beam Rider

                  B. Rider  Breakout  Enduro   Pong   Q*bert  Seaquest  S. Invaders
Random                 354       1.2       0  -20.4      157       110          179
Sarsa [3]              996       5.2     129    -19      614       665          271
Contingency [4]       1743         6     159    -17      960       723          268
DQN                   4092       168     470     20     1952      1705          581
Human                 7456        31     368     -3    18900     28010         3690

HNeat Best [8]        3616        52     106     19     1800       920         1720
HNeat Pixel [8]       1332         4      91    -16     1325       800         1145
DQN Best              5184       225     661     21     4500      1740         1075

Table 1: The upper table compares average total reward for various learning methods by running an ε-greedy policy with ε = 0.05 for a fixed number of steps. The lower table reports results of the single best performing episode for HNeat and DQN. HNeat produces deterministic policies that always get the same score while DQN used an ε-greedy policy with ε = 0.05.
17
Figures from Mnih et al. (2013)

Deep Q-Learning
Question: What if our state space S is too large to represent with a table?
Examples:
• st = pixels of a video game
• st = continuous values of sensors in a manufacturing robot
• st = sensor output from a self-driving car
Answer: Use a parametric function to approximate the table entries
Key Idea:
1. Use a neural network Q(s,a; θ) to approximate Q*(s,a)
2. Learn the parameters θ via SGD with training examples < st, at, rt, st+1 >
18
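A hedged PyTorch sketch of this key idea (the two-layer network, state and action sizes, optimizer, and learning rate are all assumptions, not the lecture's specification): the network maps a state to one q-value per action, and θ is updated by SGD on the squared TD error toward the bootstrapped target rt + γ max_a' Q(st+1, a'; θ).

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a 4-dimensional state and 2 discrete actions.
STATE_DIM, NUM_ACTIONS, GAMMA = 4, 2, 0.99

# Q(s, .; theta): maps a state to one q-value per action.
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def dqn_update(s, a, r, s_next, done):
    """One SGD step on a single experience < s_t, a_t, r_t, s_{t+1} >."""
    q_sa = q_net(s)[a]                                 # Q(s_t, a_t; theta)
    with torch.no_grad():                              # target treated as a constant
        target = r + (0.0 if done else GAMMA * q_net(s_next).max().item())
    loss = (q_sa - target) ** 2                        # squared TD error (regression)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example call with made-up tensors:
dqn_update(torch.randn(STATE_DIM), a=1, r=1.0, s_next=torch.randn(STATE_DIM), done=False)
```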

Whiteboard
Deep Q-Learning
– Strawman loss function (i.e. what we cannot compute)
– Approximating the Q function with a neural network
– Approximating the Q function with a linear model
– Deep Q-Learning
– function approximators: (state, action) → q-value vs. state → all action q-values
19
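A brief hedged sketch of the two function-approximator signatures from the last whiteboard item (layer and action sizes are arbitrary placeholders):

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, NUM_ACTIONS = 4, 1, 2   # assumed sizes

# Style 1: (state, action) -> a single q-value.
q_sa_net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

# Style 2: state -> a vector of q-values, one per action (the DQN-style parameterization);
# a greedy action then needs only one forward pass instead of one per action.
q_s_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))

s = torch.randn(STATE_DIM)
q_of_a1 = q_sa_net(torch.cat([s, torch.tensor([1.0])]))   # Q(s, a=1) as a scalar
greedy_a = q_s_net(s).argmax().item()                     # argmax over all actions at once
```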

Experience Replay
• Problems with online updates for Deep Q-learning:
  – updates are not i.i.d. as SGD would assume
  – quickly forget rare experiences that might later be useful to learn from
• Uniform Experience Replay (Lin, 1992), sketched below:
  – Keep a replay memory D = {e1, e2, …, eN} of the N most recent experiences et = < st, at, rt, st+1 >
  – Alternate two steps:
    1. Repeat T times: randomly sample ei from D and apply a Q-Learning update to ei
    2. Agent selects an action using an epsilon-greedy policy to receive a new experience that is added to D
• Prioritized Experience Replay (Schaul et al., 2016):
  – similar to Uniform ER, but sample so as to prioritize experiences with high error

20
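A hedged sketch of uniform experience replay (the capacity, batch size, and the surrounding training-loop names are placeholders, not from the lecture):

```python
import random
from collections import deque

class ReplayMemory:
    """Uniform experience replay: keep the N most recent experiences."""

    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)   # old experiences fall off the left

    def add(self, s, a, r, s_next, done):
        self.memory.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.memory, min(batch_size, len(self.memory)))

# Alternating steps from the slide, with a hypothetical env / policy / q_update:
# 1. repeat T times: apply q_update(*e) to each e in memory.sample()
# 2. take an epsilon-greedy action in the env and memory.add(...) the new experience
```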

Game of Go (圍棋)
• 19x19 board
• Players alternately play black/white stones
• Goal is to fully encircle the largest region on the board
• Simple rules, but extremely complex game play
Game 1
Fan Hui (Black), AlphaGo (White); AlphaGo wins by 2.5 points
Figure from Silver et al. (2016)
21

Alpha Go
• State space is too large to represent explicitly, since the # of sequences of moves is O(b^d)
  – Go: b = 250 and d = 150
  – Chess: b = 35 and d = 80
• Key idea:
  – Define a neural network to approximate the value function
  – Train by policy gradient (see the sketch after this slide)
[Figure: training pipeline with the fast rollout policy pπ, the SL policy network pσ (classification on human expert positions), the RL policy network pρ (self-play / policy gradient), the value network vθ(s′) (regression on self-play positions), and the policy output pσ/ρ(a|s)]
Figure 1 | Neural network training pipeline and architecture. a, A fast rollout policy pπ and supervised learning (SL) policy network pσ are trained to predict human expert moves in a data set of positions. A reinforcement learning (RL) policy network pρ is initialized to the SL policy network, and is then improved by policy gradient learning to maximize the outcome against previous versions of the policy network. A new data set is generated by games of self-play with the RL policy network. Finally, a value network vθ is trained by regression to predict the expected outcome (that is, whether the current player wins) in positions from the self-play data set. b, Schematic representation of the neural network architecture used in AlphaGo. The policy network takes a representation of the board position s as its input, passes it through many convolutional layers with parameters σ (SL policy network) or ρ (RL policy network), and outputs a probability distribution over legal moves …
Figure from Silver et al. (2016)
22
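AlphaGo's actual pipeline (SL pretraining, self-play, a value network, and search) is much more involved; purely as a hedged illustration of the "train by policy gradient" bullet, here is a minimal REINFORCE-style update for a generic policy network, with invented sizes and a ±1 game-outcome signal.

```python
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS = 8, 4                  # assumed sizes, much smaller than Go's
policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                       nn.Linear(64, NUM_ACTIONS), nn.Softmax(dim=-1))
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)

def policy_gradient_update(states, actions, outcome):
    """Scale the log-probability of each move played by the final outcome (+1 win / -1 loss)."""
    probs = policy(torch.stack(states))                                   # (T, NUM_ACTIONS)
    log_p = torch.log(probs[torch.arange(len(actions)), torch.tensor(actions)])
    loss = -(outcome * log_p).sum()                                       # ascend the expected outcome
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example call with made-up self-play data (3 moves, game won):
policy_gradient_update([torch.randn(STATE_DIM) for _ in range(3)], actions=[0, 2, 1], outcome=+1.0)
```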

Alpha Go
• Results of a tournament
• From Silver et al. (2016): “a 230 point gap corresponds to a 79% probability of winning”
[Figure 4a: bar chart of Elo ratings (0 to 3,500) for GnuGo, Fuego, Pachi, Crazy Stone, Fan Hui, AlphaGo, and AlphaGo distributed, annotated with Beginner kyu (k), Amateur dan (d), and Professional dan (p) ranks]
Figure 4 | Tournament evaluation of AlphaGo. a, Results of a tournament between different Go programs (see Extended Data Tables 6–11). Each program used approximately 5 s computation time per move. To provide a greater challenge to AlphaGo, some programs (pale upper bars) were given four handicap stones (that is, free moves at the start of every game) against all opponents. Programs were evaluated on an Elo scale …
Figure from Silver et al. (2016)
23

Learning Objectives
Reinforcement Learning: Q-Learning
You should be able to…
1. Apply Q-Learning to a real-world environment
2. Implement Q-learning
3. Identify the conditions under which the Q-learning algorithm will converge to the true value function
4. Adapt Q-learning to Deep Q-learning by employing a neural network approximation to the Q function
5. Describe the connection between Deep Q-Learning and regression
24

ML Big Picture
Learning Paradigms:
What data is available and when? What form of prediction?
• supervised learning
• unsupervised learning
• semi-supervised learning
• reinforcement learning
• active learning
• imitation learning
• domain adaptation
• online learning
• density estimation
• recommender systems
• feature learning
• manifold learning
• dimensionality reduction
• ensemble learning
• distant supervision
• hyperparameter optimization
Problem Formulation:
What is the structure of our output prediction?
boolean → Binary Classification
categorical → Multiclass Classification
ordinal → Ordinal Classification
real → Regression
ordering → Ranking
multiple discrete → Structured Prediction
multiple continuous → (e.g. dynamical systems)
both discrete & cont. → (e.g. mixed graphical models)
Facets of Building ML Systems:
How to build systems that are robust, efficient, adaptive, effective?
1. Data prep
2. Model selection
3. Training (optimization / search)
4. Hyperparameter tuning on validation data
5. (Blind) Assessment on test data
Big Ideas in ML:
Which are the ideas driving development of the field?
• inductive bias
• generalization / overfitting
• bias-variance decomposition
• generative vs. discriminative
• deep nets, graphical models
• PAC learning
• distant rewards
25
Theoretical Foundations:
What principles guide learning?
❑ probabilistic
❑ information theoretic
❑ evolutionary search
❑ ML as optimization
Application Areas
Key challenges?
NLP, Speech, Computer Vision, Robotics, Medicine, Search

Learning Paradigms
26