
10-601 Introduction to Machine Learning
Machine Learning Department School of Computer Science Carnegie Mellon University
Reinforcement Learning:
Markov Decision Processes +
Value Iteration
Matt Gormley Lecture 15 Mar. 24, 2021
1

Reminders
• Homework 5: Neural Networks
– Out: Thu, Mar. 18
– Due: Mon, Mar. 29 at 11:59pm
• Homework 6: Deep RL
– Out: Mon, Mar. 29
– Due: Wed, Apr. 07 at 11:59pm
2

3

LEARNING PARADIGMS
4

Learning Paradigms
[Slides 6–12: slide content not captured in the text extraction]

REINFORCEMENT LEARNING
13

Examples of Reinforcement Learning
• How should a robot behave so as to optimize its “performance”? (Robotics)
• How to automate the motion of a helicopter? (Control Theory)
• How to make a good chess-playing program? (Artificial Intelligence)
© Eric Xing @ CMU, 2006-2011 14

Autonomous Helicopter
Video:

15

Robot in a room
• what’s the strategy to achieve max reward?
• what if the actions were NOT deterministic?
© Eric Xing @ CMU, 2006-2011 16

History of Reinforcement Learning
• Roots in the psychology of animal learning (Thorndike, 1911).
• Another independent thread was the problem of optimal control, and its solution using dynamic programming (Bellman, 1957).
• Idea of temporal difference learning (on-line method), e.g., playing board games (Samuel, 1959).
• A major breakthrough was the discovery of Q- learning (Watkins, 1989).
© Eric Xing @ CMU, 2006-2011 17

What is special about RL?
• RL is learning how to map states to actions, so as to maximize a numerical reward over time.
• Unlike other forms of learning, it is a multistage decision-making process (often Markovian).
• An RL agent must learn by trial-and-error. (Not entirely supervised, but interactive)
• Actions may affect not only the immediate reward but also subsequent rewards (Delayed effect).
© Eric Xing @ CMU, 2006-2011 18

Elements of RL
• A policy
– A map from state space to action space.
– May be stochastic.
• A reward function
– It maps each state (or, state-action pair) to a real number, called reward.
• A value function
– Value of a state (or, state-action pair) is the total expected reward, starting from that state (or, state-action pair).
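In standard notation (a summary added here; the symbols are not written out on this slide), these three elements are:

% policy, reward, and value function in standard RL notation (assumed, not from the slide)
\begin{align*}
  \text{policy:} \quad & \pi : \mathcal{S} \to \mathcal{A} \quad \text{(or } \pi(a \mid s) \text{ if stochastic)} \\
  \text{reward:} \quad & R : \mathcal{S} \times \mathcal{A} \to \mathbb{R} \\
  \text{value:}  \quad & V^{\pi}(s) = \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \;\middle|\; s_0 = s,\ \pi \right]
\end{align*}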
© Eric Xing @ CMU, 2006-2011 19

Policy
Question:
Is this policy optimal: yes or no? Briefly justify your answer.
Answer: (Hint: both yes and no are acceptable answers, I’m interested in your justification.)
© Eric Xing @ CMU, 2006-2011 20

Reward for each step: -2
© Eric Xing @ CMU, 2006-2011 21

Reward for each step: -0.1
© Eric Xing @ CMU, 2006-2011 22

The Precise Goal
• To find a policy that maximizes the Value function.
– transitions and rewards usually not available
• There are different approaches to achieve this goal in various situations.
• Value iteration and Policy iteration are two more classic approaches to this problem. But essentially both are dynamic programming.
• Q-learning is a more recent approach to this problem. Essentially it is a temporal-difference method.
© Eric Xing @ CMU, 2006-2011 23

MARKOV DECISION PROCESSES
24

Markov Decision Process
• For supervised learning the PAC learning framework provided assumptions about where our data came from:
• For reinforcement learning we assume our data comes from a Markov decision process (MDP)
25

Markov Decision Process
Whiteboard
– Components: states, actions, state transition probabilities, reward function
– Markovian assumption
– MDP Model
– MDP Goal: Infinite-horizon Discounted Reward (formalized below)
– deterministic vs. nondeterministic MDP
– deterministic vs. stochastic policy
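A quick formalization of the goal listed above (standard definitions, assumed rather than copied from the whiteboard): the Markov assumption on transitions, and the infinite-horizon discounted objective.

% Markov assumption and discounted-reward objective (standard form, assumed)
\begin{align*}
  p(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t) &= p(s_{t+1} \mid s_t, a_t) \\
  \pi^{*} &= \arg\max_{\pi} \; \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \right], \qquad 0 \le \gamma < 1
\end{align*}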
26


Exploration vs. Exploitation
Whiteboard
– Explore vs. Exploit Tradeoff
– Ex: k-Armed Bandits (see the sketch below)
– Ex: Traversing a Maze
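A minimal sketch of the explore/exploit tradeoff on a k-armed bandit using an epsilon-greedy rule (the arm means, epsilon, and pull count below are illustrative assumptions, not values from the whiteboard):

import random

def epsilon_greedy_bandit(true_means, epsilon=0.1, num_pulls=1000):
    # Sample-average estimate of each arm's expected reward.
    k = len(true_means)
    q_est = [0.0] * k   # estimated value of each arm
    counts = [0] * k    # number of times each arm was pulled
    total_reward = 0.0
    for _ in range(num_pulls):
        if random.random() < epsilon:
            a = random.randrange(k)                    # explore: pick a random arm
        else:
            a = max(range(k), key=lambda i: q_est[i])  # exploit: pick the greedy arm
        r = random.gauss(true_means[a], 1.0)           # noisy reward from the chosen arm
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]         # incremental mean update
        total_reward += r
    return q_est, total_reward

# Example: 3 arms with different (unknown to the learner) mean rewards.
estimates, reward = epsilon_greedy_bandit([0.2, 0.5, 0.8])

A larger epsilon means more exploration (better estimates of every arm) at the cost of pulling suboptimal arms more often.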
30

FIXED POINT ITERATION
31

Fixed Point Iteration for Optimization
• Fixed point iteration is a general tool for solving systems of equations
• It can also be applied to optimization.
1. Given objective function:
2. Compute derivative, set to zero (call this function f ).
3. Rearrange the equation s.t. one of the parameters appears on the LHS.
4. Initialize the parameters.
5. For i in {1,…,K}, update each parameter and increment t:
6. Repeat #5 until convergence.
J(\theta)
\frac{dJ(\theta)}{d\theta_i} = 0 = f(\theta)
0 = f(\theta) \;\Rightarrow\; \theta_i = g(\theta)
\theta_i^{(t+1)} = g(\theta^{(t)})
32

Fixed Point Iteration for Optimization
• Fixed point iteration is a general tool for solving systems of equations
• It can also be applied to optimization.
1. Given objective function:
2. Compute derivative, set to zero (call this function f ).
3. Rearrange the equation s.t. one of the parameters appears on the LHS.
4. Initialize the parameters.
5. For i in {1,…,K}, update each parameter and increment t:
6. Repeat #5 until convergence.
J(x) = \frac{x^3}{3} - \frac{3}{2}x^2 + 2x
\frac{dJ(x)}{dx} = f(x) = x^2 - 3x + 2 = 0
\Rightarrow\; x = \frac{x^2 + 2}{3} = g(x)
[Figure: plot of y = x and y = (x^2 + 2)/3]
33

Fixed Point Iteration for Optimization
We can implement our example in a few lines of python.
J(x) = \frac{x^3}{3} - \frac{3}{2}x^2 + 2x
\frac{dJ(x)}{dx} = f(x) = x^2 - 3x + 2 = 0
\Rightarrow\; x = \frac{x^2 + 2}{3} = g(x)
[Figure: plot of y = x and y = (x^2 + 2)/3]
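The Python script mentioned above is not part of the extracted text; a minimal sketch that reproduces the trace on the next slide (using the file name fixed-point-iteration.py from the slide's command line) is:

# fixed-point-iteration.py -- reconstruction of the slide's example
def f(x):
    # derivative of J(x) = x^3/3 - (3/2)x^2 + 2x
    return x**2 - 3*x + 2

def g(x):
    # fixed-point update obtained by solving f(x) = 0 for x
    return (x**2 + 2) / 3.0

x = 0.0  # initial guess
for i in range(21):
    print("i=%2d x=%.4f f(x)=%.4f" % (i, x, f(x)))
    x = g(x)  # x^(t+1) = g(x^(t))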
34

Fixed Point Iteration for Optimization
$ python fixed-point-iteration.py
i= 0 x=0.0000 f(x)=2.0000
i= 1 x=0.6667 f(x)=0.4444
i= 2 x=0.8148 f(x)=0.2195
i= 3 x=0.8880 f(x)=0.1246
i= 4 x=0.9295 f(x)=0.0755
i= 5 x=0.9547 f(x)=0.0474
i= 6 x=0.9705 f(x)=0.0304
i= 7 x=0.9806 f(x)=0.0198
i= 8 x=0.9872 f(x)=0.0130
i= 9 x=0.9915 f(x)=0.0086
i=10 x=0.9944 f(x)=0.0057
i=11 x=0.9963 f(x)=0.0038
i=12 x=0.9975 f(x)=0.0025
i=13 x=0.9983 f(x)=0.0017
i=14 x=0.9989 f(x)=0.0011
i=15 x=0.9993 f(x)=0.0007
i=16 x=0.9995 f(x)=0.0005
i=17 x=0.9997 f(x)=0.0003
i=18 x=0.9998 f(x)=0.0002
i=19 x=0.9999 f(x)=0.0001
i=20 x=0.9999 f(x)=0.0001
J(x) = \frac{x^3}{3} - \frac{3}{2}x^2 + 2x
\frac{dJ(x)}{dx} = f(x) = x^2 - 3x + 2 = 0
\Rightarrow\; x = \frac{x^2 + 2}{3} = g(x)
[Figure: plot of y = x and y = (x^2 + 2)/3]
35

VALUE ITERATION
36

Definitions for Value Iteration
Whiteboard
– State trajectory
– Value function
– Bellman equations
– Optimal policy
– Optimal value function
– Computing the optimal policy (see the sketch below)
– Ex: Path Planning
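As a companion to the definitions above, a minimal NumPy sketch of value iteration (array layout, names, and tolerance are my own choices, not taken from the lecture):

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    # P[a, s, s'] = p(s' | s, a);  R[s, a] = expected immediate reward.
    num_actions, num_states, _ = P.shape
    V = np.zeros(num_states)
    while True:
        # Bellman optimality backup:
        # Q(s, a) = R(s, a) + gamma * sum_s' p(s' | s, a) * V(s')
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    pi = Q.argmax(axis=1)  # greedy policy w.r.t. the converged value function
    return V, pi

Each sweep applies the Bellman optimality backup to every state; because this backup is a contraction for gamma < 1, the iterates converge to the optimal value function, and the greedy policy with respect to it is an optimal policy.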
37