
CSC 311: Introduction to Machine Learning
Lecture 12 – Reinforcement Learning
University of Toronto, Fall 2021


Reinforcement Learning Problem
Recall: we categorized types of ML by how much information they provide about the desired behavior.
Supervised learning: labels of desired behavior
Unsupervised learning: no labels
Reinforcement learning: reward signal evaluating the outcome of past actions
Bandit problems (Lecture 10) are a simple instance of RL where each decision is independent.
More commonly, we focus on sequential decision making: an agent chooses a sequence of actions, each of which affects the future possibilities available to the agent.

Reinforcement Learning (RL)
An agent observes the world, takes an action, the world's state changes, and the agent aims to achieve long-term reward.

Reinforcement Learning
Most RL is done in a mathematical framework called a Markov Decision Process (MDP).

MDPs: States and Actions
First let’s see how to describe the dynamics of the environment.
The state is a description of the environment in sufficient detail to determine its evolution.
Think of Newtonian physics.
What would be the state variables for a puck sliding on a frictionless table?
Markov assumption: the state at time t + 1 depends directly on the state and action at time t, but not on past states and actions.
To describe the dynamics, we need to specify the transition probabilities P(S_{t+1} | S_t, A_t).
In this lecture, we assume the state is fully observable, a highly nontrivial assumption.
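For concreteness, here is one way the transition probabilities P(S_{t+1} | S_t, A_t) of a small, finite MDP could be represented in Python; the two states, two actions, and all the probabilities below are made up purely for illustration and are not from the lecture.

```python
# A hypothetical two-state, two-action MDP, made up purely for illustration.
# P[(s, a)] maps each next state s' to its probability, i.e. an explicit
# table of P(S_{t+1} = s' | S_t = s, A_t = a).
P = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.7, "hot": 0.3},
    ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
    ("hot", "fast"): {"hot": 1.0},
}

# Sanity check: each conditional distribution sums to 1.
for (s, a), dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```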

MDPs: States and Actions
Suppose you’re controlling a robot hand. What should be the set of states and actions?
In general, the right granularity of states and actions depends on what you’re trying to achieve.

MDPs: Policies
The way the agent chooses the action in each step is called a policy.
We’ll consider two types:
Deterministic policy: A_t = π(S_t) for some function π : S → A
Stochastic policy: A_t ∼ π(· | S_t) for some function π : S → P(A). (Here, P(A) is the set of distributions over actions.)
With stochastic policies, the distribution over rollouts, or trajectories, factorizes:
p(s_1, a_1, …, s_T, a_T) = p(s_1) π(a_1 | s_1) P(s_2 | s_1, a_1) π(a_2 | s_2) ⋯ P(s_T | s_{T−1}, a_{T−1}) π(a_T | s_T)
Note: the fact that policies need consider only the current state is a powerful consequence of the Markov assumption and full observability.
If the environment is partially observable, then the policy needs to depend on the history of observations.
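To see how the factorized distribution over trajectories translates into a sampling procedure, here is a minimal Python sketch; the policy and dynamics dictionaries are hypothetical placeholders in the same style as the toy MDP above, not part of the lecture.

```python
import random

def sample_rollout(p0, P, policy, T):
    """Sample (s_1, a_1, ..., s_T, a_T) from the factorized distribution
    p(s_1) π(a_1|s_1) P(s_2|s_1,a_1) π(a_2|s_2) ... π(a_T|s_T)."""
    def draw(dist):
        # dist is a dict mapping outcomes to probabilities
        outcomes, probs = zip(*dist.items())
        return random.choices(outcomes, weights=probs, k=1)[0]

    s = draw(p0)                      # s_1 ~ p(s_1)
    trajectory = []
    for _ in range(T):
        a = draw(policy[s])           # a_t ~ π(· | s_t)
        trajectory.append((s, a))
        s = draw(P[(s, a)])           # s_{t+1} ~ P(· | s_t, a_t)
    return trajectory

# Example usage with the toy MDP above and a uniformly random policy:
# policy = {s: {"slow": 0.5, "fast": 0.5} for s in ("cool", "hot")}
# print(sample_rollout({"cool": 1.0}, P, policy, T=5))
```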

MDPs: Rewards
In each time step, the agent receives a reward from a distribution that depends on the current state and action:
R_t ∼ R(· | S_t, A_t)
For simplicity, we'll assume rewards are deterministic, i.e.
R_t = r(S_t, A_t)
What’s an example where Rt should depend on At?
The return measures how good the outcome of an episode was.
Undiscounted: G = R_0 + R_1 + R_2 + ⋯
Discounted: G = R_0 + γ R_1 + γ^2 R_2 + ⋯
The goal is to maximize the expected return, E[G].
γ is a hyperparameter called the discount factor which determines how much we care about rewards now vs. rewards later.
What is the effect of large or small γ?
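To make the effect of γ concrete, here is a tiny Python example (with made-up reward numbers, not from the lecture) computing the discounted return under a far-sighted and a myopic discount factor.

```python
def discounted_return(rewards, gamma):
    """G = R_0 + γ R_1 + γ^2 R_2 + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [0, 0, 0, 10]   # a single reward of 10, arriving three steps in the future
print(discounted_return(rewards, gamma=0.99))  # ≈ 9.70: large γ, far-sighted
print(discounted_return(rewards, gamma=0.5))   # = 1.25: small γ, myopic
```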

MDPs: Rewards
How might you define a reward function for an agent learning to play a video game?
Change in score (why not current score?)
Some measure of novelty (this is sufficient for most Atari games!)
Consider two possible reward functions for the game of Go. How do you think the agent’s play will differ depending on the choice?
Option 1: +1 for win, 0 for tie, -1 for loss
Option 2: Agent’s territory minus opponent’s territory (at end)
Specifying a good reward function can be tricky.


Markov Decision Processes
Putting this together, a Markov Decision Process (MDP) is defined by a tuple (S, A, P, R, γ).
S: State space. Discrete or continuous
A: Action space. Here we consider a finite action space, i.e., A = {a_1, …, a_{|A|}}.
P: Transition probability
R: Immediate reward distribution
γ: Discount factor (0 ≤ γ < 1)
Together these define the environment that the agent operates in, and the objectives it is supposed to achieve.

Finding a Policy
Now that we've defined MDPs, let's see how to find a policy that achieves a high return. We can distinguish two situations:
Planning: the agent is given a fully specified MDP.
Learning: the agent interacts with an environment with unknown dynamics, i.e., the environment is a black box that takes in actions and outputs states and rewards.
Which framework would be most appropriate for chess? Super Mario?

Value Functions

Value Function
The value function V^π for a policy π measures the expected return if you start in state s and follow policy π:
V^π(s) := E_π[G_t | S_t = s] = E_π[ Σ_{k=0}^∞ γ^k R_{t+k} | S_t = s ].
This measures the desirability of state s.

Value Function
[Slide credit: D. Silver]
Gridworld example. Rewards: −1 per time-step. Actions: N, E, S, W. States: agent's location.
Arrows represent the policy π(s) for each state s.
Numbers represent the value V^π(s) of each state s.

Bellman equations
The foundation of many RL algorithms is the fact that value functions satisfy a recursive relationship, called the Bellman equation:
V^π(s) = E_π[G_t | S_t = s]
       = E_π[R_t + γ G_{t+1} | S_t = s]
       = Σ_a π(a | s) [ r(s, a) + γ Σ_{s′} P(s′ | s, a) E_π[G_{t+1} | S_{t+1} = s′] ]
       = Σ_a π(a | s) [ r(s, a) + γ Σ_{s′} P(s′ | s, a) V^π(s′) ]
Viewing V^π as a vector (where entries correspond to states), define the Bellman backup operator T^π:
(T^π V)(s) := Σ_a π(a | s) [ r(s, a) + γ Σ_{s′} P(s′ | s, a) V(s′) ]
The Bellman equation says that V^π is a fixed point of the Bellman operator: T^π V^π = V^π.

Value Function
A value function for golf (figure from Sutton and Barto, Reinforcement Learning: An Introduction): the state is the location of the ball, and a penalty (negative reward) of 1 is counted for each stroke until the ball is in the hole, so the value of a state is the negative of the number of strokes needed to reach the hole from that location. Taking the aim as given, the action is the choice of club, assumed to be either a putter or a driver. The figure shows a possible state-value function v_putt(s) for the policy that always uses the putter, and the optimal action-value function q*(s, driver).

State-Action Value Function
A closely related but usefully different function is the state-action value function, or Q-function, Q^π for policy π, defined as:
Q^π(s, a) := E_π[ Σ_{k≥0} γ^k R_{t+k} | S_t = s, A_t = a ].
If you knew Q^π, how would you obtain V^π?
V^π(s) = Σ_a π(a | s) Q^π(s, a).
If you knew V^π, how would you obtain Q^π? Apply a Bellman-like equation:
Q^π(s, a) = r(s, a) + γ Σ_{s′} P(s′ | s, a) V^π(s′)
This requires knowing the dynamics, so in general it's not easy to recover Q^π from V^π.
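To connect the Bellman backup operator T^π to something executable, here is a minimal sketch of iterative policy evaluation on a tabular MDP in Python; the dictionary representation of P, r, and the policy follows the hypothetical toy examples above and is an illustration, not the lecture's own code.

```python
def bellman_backup_v(V, P, r, policy, gamma):
    """One application of the Bellman backup operator T^π to V (a dict over states)."""
    return {
        s: sum(
            policy[s][a]
            * (r[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
            for a in policy[s]
        )
        for s in V
    }

def policy_evaluation(P, r, policy, gamma, tol=1e-8):
    """Iterate V ← T^π V from V = 0 until convergence; the fixed point is V^π."""
    V = {s: 0.0 for s in policy}
    while True:
        V_new = bellman_backup_v(V, P, r, policy, gamma)
        if max(abs(V_new[s] - V[s]) for s in V) < tol:
            return V_new
        V = V_new
```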
State-Action Value Function
Q^π satisfies a Bellman equation very similar to the one for V^π (the proof is analogous):
Q^π(s, a) = r(s, a) + γ Σ_{s′} P(s′ | s, a) Σ_{a′} π(a′ | s′) Q^π(s′, a′) =: (T^π Q^π)(s, a)

Dynamic Programming and Value Iteration

Optimal State-Action Value Function
Suppose you're in state s. You get to pick one action a, and then follow the (fixed) policy π from then on. What do you pick?
arg max_a Q^π(s, a)
If a deterministic policy π is optimal, then it must be the case that for any state s:
π(s) = arg max_a Q^π(s, a),
otherwise you could improve the policy by changing π(s). (See Sutton & Barto for a proper proof.)

Optimal State-Action Value Function
Bellman equation for an optimal policy π*:
Q^{π*}(s, a) = r(s, a) + γ Σ_{s′} P(s′ | s, a) Q^{π*}(s′, π*(s′))
             = r(s, a) + γ Σ_{s′} P(s′ | s, a) max_{a′} Q^{π*}(s′, a′)
Now Q* = Q^{π*} is the optimal state-action value function, and we can rewrite the optimal Bellman equation without mentioning π*:
Q*(s, a) = r(s, a) + γ Σ_{s′} P(s′ | s, a) max_{a′} Q*(s′, a′) =: (T* Q*)(s, a)
It turns out this is sufficient to characterize the optimal policy. So we simply need to solve the fixed point equation T*Q* = Q*, and then we can choose π*(s) = arg max_a Q*(s, a).

So far, we showed that some interesting problems can be reduced to finding fixed points of Bellman backup operators:
Evaluating a fixed policy π: Q ← T^π Q (policy evaluation)
Finding the optimal policy: Q ← T* Q
Idea: keep iterating the backup operator over and over again.
We're treating Q^π or Q* as a vector with |S| · |A| entries. This type of algorithm is an instance of dynamic programming.

An operator f (mapping from vectors to vectors) is a contraction map if
‖f(x_1) − f(x_2)‖ ≤ α ‖x_1 − x_2‖
for some scalar 0 ≤ α < 1 and vector norm ‖·‖.
Let f^(k) denote f iterated k times. A simple induction shows
‖f^(k)(x_1) − f^(k)(x_2)‖ ≤ α^k ‖x_1 − x_2‖.
Let x* be a fixed point of f. Then for any x,
‖f^(k)(x) − x*‖ ≤ α^k ‖x − x*‖.
Hence, iterated application of f, starting from any x, converges exponentially to a unique fixed point.

Finding the Optimal Value Function: Value Iteration
Let's use dynamic programming to find Q*.
Value Iteration: Start from an initial function Q_1. For each k = 1, 2, ..., update Q_{k+1} ← T* Q_k.
Writing out the update in full:
Q_{k+1}(s, a) ← r(s, a) + γ Σ_{s′∈S} P(s′ | s, a) max_{a′∈A} Q_k(s′, a′)
Observe: a fixed point of this update is exactly a solution of the optimal Bellman equation, which we saw characterizes the Q-function of an optimal policy.

Value Iteration
Claim: The value iteration update is a contraction map:
‖T*Q_1 − T*Q_2‖_∞ ≤ γ ‖Q_1 − Q_2‖_∞
Here ‖·‖_∞ denotes the L∞ norm, defined as ‖x‖_∞ = max_i |x_i|.
If this claim is correct, then value iteration converges exponentially to the unique fixed point. The exponential decay factor is γ (the discount factor), which means longer-term planning is harder.

T* is a Contraction (Optional)
|(T*Q_1)(s, a) − (T*Q_2)(s, a)|
= | [ r(s, a) + γ Σ_{s′} P(s′ | s, a) max_{a′} Q_1(s′, a′) ] − [ r(s, a) + γ Σ_{s′} P(s′ | s, a) max_{a′} Q_2(s′, a′) ] |
= γ | Σ_{s′} P(s′ | s, a) [ max_{a′} Q_1(s′, a′) − max_{a′} Q_2(s′, a′) ] |
≤ γ Σ_{s′} P(s′ | s, a) max_{a′} |Q_1(s′, a′) − Q_2(s′, a′)|
≤ γ max_{s′,a′} |Q_1(s′, a′) − Q_2(s′, a′)| Σ_{s′} P(s′ | s, a)
= γ max_{s′,a′} |Q_1(s′, a′) − Q_2(s′, a′)|
= γ ‖Q_1 − Q_2‖_∞
This is true for any (s, a), so ‖T*Q_1 − T*Q_2‖_∞ ≤ γ ‖Q_1 − Q_2‖_∞, which is what we wanted to show.
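As a concrete companion to the value iteration update Q_{k+1} ← T* Q_k, here is a minimal tabular sketch in Python; it assumes the same hypothetical dictionary representation of P and r as the earlier examples, and is an illustration rather than the lecture's own code.

```python
def bellman_optimality_backup(Q, P, r, actions, gamma):
    """One application of T* to a tabular Q, stored as a dict keyed by (state, action)."""
    return {
        (s, a): r[(s, a)]
        + gamma * sum(p * max(Q[(s2, a2)] for a2 in actions) for s2, p in P[(s, a)].items())
        for (s, a) in Q
    }

def value_iteration(P, r, states, actions, gamma, tol=1e-8):
    """Iterate Q ← T*Q until the sup-norm change is below tol; returns Q ≈ Q*
    and the greedy policy π*(s) = arg max_a Q*(s, a)."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        Q_new = bellman_optimality_backup(Q, P, r, actions, gamma)
        if max(abs(Q_new[sa] - Q[sa]) for sa in Q) < tol:
            Q = Q_new
            break
        Q = Q_new
    policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
    return Q, policy
```

Because T* is a γ-contraction in the sup-norm, the loop above shrinks the distance to Q* by a factor of γ per iteration, which is exactly the exponential convergence argument from the contraction-map slide.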
Value Iteration Recap
So far, we've focused on planning, where the dynamics are known.
The optimal Q-function is characterized in terms of a Bellman fixed point update. Since the Bellman operator is a contraction map, we can just keep applying it repeatedly, and we'll converge to a unique fixed point.
What are the limitations of value iteration?
It assumes known dynamics.
It requires explicitly representing Q* as a vector.
|S| can be extremely large, or infinite.
|A| can be infinite (e.g. continuous voltages in robotics).
But value iteration is still a foundation for a lot of more practical RL algorithms.

Towards Learning
Now let's focus on reinforcement learning, where the environment is unknown. How can we apply learning?
1 Learn a model of the environment, and do planning in the model (i.e. model-based reinforcement learning). You already know how to do this in principle, but it's very hard to get to work. Not covered in this course.
2 Learn a value function (e.g. Q-learning, covered in this lecture).
3 Learn a policy directly (e.g. policy gradient, not covered in this course).
How can we deal with extremely large state spaces? Function approximation: choose a parametric form for the policy and/or value function (e.g. linear in features, neural net, etc.).

Q-Learning

Monte Carlo Estimation
Recall the optimal Bellman equation:
Q*(s, a) = r(s, a) + γ E_{S′∼P(·|s,a)}[ max_{a′} Q*(S′, a′) ]
Problem: we need to know the dynamics to evaluate the expectation.
Monte Carlo estimation of an expectation μ = E[X]: repeatedly sample X and update μ ← μ + α(X − μ).
Idea: Apply Monte Carlo estimation to the Bellman equation by sampling S′ ∼ P(· | s, a) and updating:
Q(s, a) ← Q(s, a) + α [ r(s, a) + γ max_{a′} Q(S′, a′) − Q(s, a) ]
The term in brackets is the Bellman error.
This is an example of temporal difference learning, i.e. updating our predictions to match our later predictions (once we have more information).

Monte Carlo Estimation
Problem: Every iteration of value iteration requires updating Q for every state. There could be lots of states, and we only observe transitions for states that are visited.
Idea: Have the agent interact with the environment, and only update Q for the states that are actually visited.
Problem: We might never visit certain states if they don't look promising, so we'll never learn about them.
Idea: Have the agent sometimes take random actions so that it eventually visits every state.
ε-greedy policy: a policy which picks arg max_a Q(s, a) with probability 1 − ε and a random action with probability ε. (Typical value: ε = 0.05)
Combining all three ideas gives an algorithm called Q-learning.

Q-Learning with ε-Greedy Policy
Parameters: learning rate α, exploration parameter ε
Initialize Q(s, a) for all (s, a) ∈ S × A.
The agent starts at state S_0. For time step t = 0, 1, ...:
Choose A_t according to the ε-greedy policy, i.e., A_t = arg max_{a∈A} Q(S_t, a) with probability 1 − ε, or a uniformly random action in A with probability ε.
Take action A_t in the environment. The state changes from S_t to S_{t+1} ∼ P(· | S_t, A_t).
Observe S_{t+1} and R_t (could be r(S_t, A_t), or could be stochastic).
Update the action-value function at the state-action pair (S_t, A_t):
Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_t + γ max_{a′∈A} Q(S_{t+1}, a′) − Q(S_t, A_t) ]
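The pseudocode above maps almost directly onto code. Below is a minimal tabular Q-learning sketch in Python; the env.reset()/env.step(a) interface is an assumed, simplified Gym-style environment, not something defined in the lecture.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_steps, alpha=0.1, gamma=0.99, epsilon=0.05):
    """Tabular Q-learning with an ε-greedy behaviour policy.

    Assumes a simplified, hypothetical Gym-style environment:
    env.reset() -> state and env.step(a) -> (next_state, reward, done).
    """
    Q = defaultdict(float)            # Q[(s, a)], implicitly initialized to 0
    s = env.reset()
    for _ in range(num_steps):
        # ε-greedy action selection
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: Q[(s, a_)])
        s_next, r, done = env.step(a)
        # TD update toward the Bellman target (no bootstrap from terminal states)
        target = r if done else r + gamma * max(Q[(s_next, a_)] for a_ in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = env.reset() if done else s_next
    return Q
```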
Exploration vs. Exploitation
ε-greedy is a simple mechanism for managing the exploration-exploitation tradeoff:
π_ε(S; Q) = arg max_{a∈A} Q(S, a) with probability 1 − ε, or a uniformly random action in A with probability ε.
The ε-greedy policy ensures that most of the time (with probability 1 − ε) the agent exploits its incomplete knowledge of the world by choosing the best action (i.e., the one with the highest action-value), but occasionally (with probability ε) it explores other actions.
Without exploration, the agent may never find some good actions.
ε-greedy is one of the simplest, but most widely used, methods for trading off exploration and exploitation. The exploration-exploitation tradeoff is an important topic of research.

Examples of Exploration-Exploitation in the Real World
Restaurant Selection
Exploitation: Go to your favourite restaurant
Exploration: Try a new restaurant
Online Banner Advertisements
Exploitation: Show the most successful advert
Exploration: Show a different advert
Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location
Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move
[Slide credit: D. Silver]

An Intuition on Why Q-Learning Works (Optional)
Consider a tuple (S, A, R, S′). The Q-learning update is
Q(S, A) ← Q(S, A) + α [ R + γ max_{a′∈A} Q(S′, a′) − Q(S, A) ].
To understand this better, let us focus on its stochastic equilibrium, i.e., where the expected change in Q(S, A) is zero. We have
E[ R + γ max_{a′∈A} Q(S′, a′) − Q(S, A) | S, A ] = 0  ⇒  (T*Q)(S, A) = Q(S, A).
So at the stochastic equilibrium, (T*Q)(S, A) = Q(S, A). Because the fixed point of the Bellman optimality operator is unique (and is Q*), Q is the same as the optimal action-value function Q*.

Off-Policy Learning
The Q-learning update again:
Q(S, A) ← Q(S, A) + α [ R + γ max_{a′} Q(S′, a′) − Q(S, A) ].
Notice: this update doesn't mention the policy anywhere. The only thing the policy is used for is to determine which states are visited.
This means we can follow whatever policy we want (e.g. ε-greedy), and Q-learning still converges to the optimal Q-function. Algorithms like this are known as off-policy algorithms, and this is an extremely useful property.
Policy gradient (another popular RL algorithm, not covered in this course) is an on-policy algorithm. Encouraging exploration is much harder in that case.

Function Approximation

Function Approximation
So far, we've been assuming a tabular representation of Q: one entry for every state/action pair. This is impractical to store for all but the simplest problems, and it doesn't share structure between related states.
Solution: approximate Q using a parameterized function, e.g.
linear function approximation: Q(s, a) = w^T ψ(s, a), or
compute Q with a neural net.
Update Q using backprop:
t ← r(s_t, a_t) + γ max_a Q(s_{t+1}, a)
θ ← θ + α (t − Q(s_t, a_t)) ∇_θ Q(s_t, a_t).

Function Approximation
Approximating Q with a neural net is a decades-old idea, but DeepMind got it to work really well on Atari games in 2013 ("deep Q-learning").
They used a very small network by today's standards.
Main technical innovation: store experience into a replay buffer, and perform Q-learning using stored experience.
This gains sample efficiency by separating environment interaction from optimization: we don't need new experience for every SGD update!
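As a sketch of the semi-gradient update above, here is Q-learning with linear function approximation Q(s, a) = w^T ψ(s, a) in NumPy; the feature map psi, the actions list, and the transition tuple are hypothetical stand-ins, and this is an illustration under those assumptions, not DeepMind's DQN.

```python
import numpy as np

def q_value(w, psi, s, a):
    """Linear function approximation: Q(s, a) = w^T ψ(s, a)."""
    return float(w @ psi(s, a))

def q_learning_step(w, psi, actions, transition, alpha=0.01, gamma=0.99):
    """One semi-gradient Q-learning update from a single transition (s, a, r, s_next).

    For a linear approximator, ∇_w Q(s, a) = ψ(s, a), so the update is
    w ← w + α (target − Q(s, a)) ψ(s, a).
    """
    s, a, r, s_next = transition
    target = r + gamma * max(q_value(w, psi, s_next, a2) for a2 in actions)
    td_error = target - q_value(w, psi, s, a)
    return w + alpha * td_error * psi(s, a)
```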
Mnih et al., Nature 2015. Human-level control through deep reinforcement learning.
The network was given raw pixels as observations.
The same architecture was shared between all games.
It assumes a fully observable environment, even though that's not the case.
After about a day of training on a particular game, it often beat "human-level" performance (the number of points reached within 5 minutes of play).
It did very well on reactive games, and poorly on ones that require planning (e.g. Montezuma's Revenge).
https://www.youtube.com/watch?v=V1eYniJ0Rnk
https://www.youtube.com/watch?v=4MlZncshy1Q

Recap and Other Approaches
All of the approaches discussed here estimate the value function first; they are called value-based methods.
There are methods that directly optimize the policy, i.e., policy search methods.
Model-based RL methods estimate the true, but unknown, model of the environment P by an estimate P̂, and use the estimate P̂ in order to plan.
There are also hybrid methods.

Reinforcement Learning Resources
R. Sutton and A. Barto, Reinforcement Learning: An Introduction, 2nd edition, 2018.
Cs. Szepesvári, Algorithms for Reinforcement Learning, 2010.