Neural Net Introduction
Reinforcement Learning
Main sources:
Marsland “Machine Learning” (CRC) Chapter 11
Sutton and Barto “Reinforcement Learning” (MIT)
© Eric Braude 2012-15
In reinforcement learning, a random action is taken. If it leads to a favorable outcome, it is strengthened.
Learning Goals
Recognize Reinforcement Learning potential
Use basic RL techniques
Apply Markov decision processes to ML
Compare end-state with discounting
Reinforcement Learning
Definition
Basic Technique
Finite Markov Decision Processes
Discounting, Example, References
Definition: Context
Learning what to do in an uncertain environment … with a clear goal
so as to maximize … a (numerical) reward.
Reinforcement learning is often applicable in less-than-certain situations. (If an environment is fully certain, there is usually little left to learn.) There must be a reward of some kind so that actions can be evaluated.
Definition*
Learning what to do in an uncertain environment (i.e., map situations to actions) with a clear goal
so as to maximize a (possibly delayed) reward.
Learner must discover what to do by trials (interacting with the environment).
…
Sutton & Barto p3
It is also assumed that the effects and reward of an action can only be determined by actually taking the action.
Definition*
Learning what to do in an uncertain environment (i.e., map situations to actions) with a clear goal
so as to maximize a (possibly delayed) reward.
Learner must discover what to do by trials.
Trade-off of exploitation (what it knows) and exploration (what it tries to find out) of an agent …
Sutton & Barto p3
A reinforcement-learning agent thus decides either to exploit what it already knows or to take an action to find additional knowledge. This is a bit like some card games, such as poker.
Definition*
Learning what to do in an uncertain environment (i.e., map situations to actions) with a clear goal
so as to maximize a (possibly delayed) reward.
Learner must discover what to do by trials.
Trade-off of exploitation and exploration of an agent …
…that —
— has explicit goals
— can sense aspects of its environment
— can select actions to influence its environment
Sutton & Barto p3
What makes reinforcement learning (RL) particularly interesting is that actions may influence not just the agent but the environment itself.
Benefit of Reinforcement Learning
No model needed
Don’t need to understand the whole environment
No explicit search performed
Not a search method
Analogous to GA’s
As the figure notes, RL has the benefit of operating without a model, somewhat like genetic algorithms, and is not fundamentally a search technique.
Examples
Playing chess
Controlling a chemical process
Simulation
E.g., newborn gazelle
Trash-collecting robot
Person making a meal
Sutton & Barto p4
The figure lists a few examples. “Playing chess” here refers not to chess in general but to playing against a particular opponent.
Reinforcement Learning
Definition
Basic Technique
Finite Monte Carlo Decision Processes
Discounting, Example, References
In this section, we describe the basic reinforcement learning algorithm.
Architecture of Reinforcement Learning
Marsland
[Diagram: state → action → Environment → goal state]
Example state: US pandemic data today
Example action: stop social distancing
Example goal state: US pandemic data that enables 90% of normal life
Reinforcement learning takes place in some environment. State reflects the circumstances under which an action is taken. What is strengthened is the value attached to a state, so that the agent makes better decisions over time.
Architecture of Reinforcement Learning
Full diagram and following based on Marsland
[Diagram: state → Environment]
Value V of a state = probability of attaining the goal state from here
Technically, a state is defined by particular values of particular variables. For example, the state of a Person object may be defined as sick if temperature > 100.4 and bathing = False.
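As a minimal sketch (the Person class and its attributes are hypothetical, following the example just given), a state can be read off from variable values in Python:

from dataclasses import dataclass

@dataclass
class Person:
    temperature: float  # degrees Fahrenheit
    bathing: bool

def state_of(p: Person) -> str:
    # "Sick" state from the example: temperature > 100.4 and not bathing
    return "sick" if p.temperature > 100.4 and not p.bathing else "not sick"

print(state_of(Person(temperature=101.2, bathing=False)))  # prints: sick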
Example: Tic-Tac-Toe. A state could be
a board configuration or a ranked set of rules …
(Assume we play X)
[Board with three X’s in a row; value of such a state*: 1]
[Board with three O’s in a row; value of such a state: 0]
* Interpreted as a specific board configuration
Consider, as an example, playing tic-tac-toe as player X.
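For illustration only (not from the sources), the initial values described on the slide could be assigned to board configurations along these lines:

# Value of a tic-tac-toe state (board configuration), from X's point of view
def all_lines(board):  # board: 3x3 list of lists containing 'X', 'O', or ' '
    rows = [board[r] for r in range(3)]
    cols = [[board[r][c] for r in range(3)] for c in range(3)]
    diagonals = [[board[i][i] for i in range(3)], [board[i][2 - i] for i in range(3)]]
    return rows + cols + diagonals

def state_value(board) -> float:
    # 1 if X has three in a row, 0 if O does, otherwise 0.5 (not yet known)
    if any(line == ['X', 'X', 'X'] for line in all_lines(board)):
        return 1.0
    if any(line == ['O', 'O', 'O'] for line in all_lines(board)):
        return 0.0
    return 0.5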
Architecture of Reinforcement Learning
Marsland
[Diagram: state → action → Environment → new state, with reward r from taking the action]
Example reward: business picks up 60%
Reinforcement learning takes place in an environment. State reflects the circumstances under which the action is taken.
Immediate and End-Point Reward Components
Learn via …
immediately visible reward
and / or
payoff in the end
Marsland
The reward is either immediate (i.e., visible right away) or measured at the end state. The latter is more informative in theory, but its estimation is less reliable.
Architecture of Reinforcement Learning
Marsland
[Diagram: state → action → Environment → new state and reward → update the value V of this state (vis-à-vis the goal state)]
States have values, and when an action is taken with positive reward, the state’s value is increased. For example, if your state is “daydreaming” and the action is to wake from your reverie, which results in dreaming up an improvement in your business (your new state), then you would increase the value of the daydreaming state.
Example: Learning Routes to Hotel in Old City
Marsland
[Map of Squares A–F; the hotel is in one of the squares]
State: “At square X”
Marsland takes as an example learning to navigate in an old city. The squares are shown in the figure. It is assumed that the squares are well marked. In this example, a state corresponds to a location. The goal is to learn by reinforcement how to get to our hotel.
Example: Learning Routes to Hotel in Old City
Marsland
[Map excerpt showing the hotel in Square F]
Start at random
Will recognize the goal when we arrive
It is assumed that we will recognize our hotel (in square F) when we see it.
Recall that we learn in RL only via actions, so we begin by being randomly placed, and take random actions, reinforcing beneficial results as we go.
Example: Given Map
Marsland
[Map showing the connections among Squares A–F]
The figure shows the actual map of connections among squares.
Examples: Reward of State/Actions
Marsland
[Diagram of Squares A, B, E, and F with sample rewards:
100 = maximum (reaches goal)
-10 = negative (leaves goal)
-5 = penalty for wasting a move
10 = B better than A]
This figure shows a selection of rewards:
100 when a connection yields the hotel,
-10 when we leave our hotel,
-5 when we simply start and end at the same place, and
10 when we know that our destination is better than our source.
Examples: Reward of State/Actions
Marsland
[Map of Squares A–F with a reward on every connection: 100 for moves that reach F (the hotel), -10 for moves that leave F, -5 for a move that starts and ends at the same square, and values such as 0, 10, 30, and 40 for the remaining moves]
This figure shows what an RL outcome could look like. It would enable us to find a route to our hotel from anywhere. We discuss next how to create this.
Initialization of Values (Little Knowledge)
Goal State: 1
Other states:
Typically: 0.5
Sutton & Barto p3
Since we have little knowledge, we can begin by giving the goal state the value 1 and all others significantly less—0.5, say.
Initialization
[Map with each square’s value initialized: Squares A–E at 0.5, Square F (the goal) at 1.0]
The figure shows the result.
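A minimal Python sketch of this initialization; the adjacency of the squares is an assumption for illustration, with the hotel in Square F:

# Adjacency among the squares is hypothetical; the hotel (goal) is in F.
connections = {
    'A': ['B', 'D'],
    'B': ['A', 'C', 'E'],
    'C': ['B', 'F'],
    'D': ['A', 'E'],
    'E': ['B', 'D', 'F'],
    'F': ['C', 'E'],
}
GOAL = 'F'
V = {square: 0.5 for square in connections}  # little knowledge: 0.5 everywhere
V[GOAL] = 1.0                                # the goal state starts at 1.0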
To Move
Mostly: Greedily (exploitation)
Occasionally: Randomly (exploration)
Sutton & Barto p3
Now we need to take an action (a move, good or bad) in order to make progress. Most of the time, we move rationally, i.e., in accordance with our best knowledge (this is “exploitation”), but we know that this needs some shake-up because our knowledge itself is imperfect. As a result, we occasionally make a random move (“exploration”).
“Policy” Options (for Selecting Actions)
Greedy: Highest reward for this state at this time
ε-Greedy: Greedy, but with a small probability of a random other action
Softmax: rescaled probability of the highest reward compared with the rewards of the other actions
Dampen over time
Adapted from Marsland
The figure shows three overall methods for carrying this out: pure exploitation (greedy), ε-greedy (a name for the “mostly greedy, occasionally random” approach we outlined before), and an approach that normalizes the options at a given state. The latter converts the estimated rewards to numbers that sum to 1, combining exploitation and exploration, and the exploration part is decreased over time.
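A hedged sketch of these policies in Python; the action list, reward estimates, and the temperature parameter tau are placeholders, not from Marsland:

import math, random

def epsilon_greedy(actions, estimated_reward, epsilon=0.1):
    # Mostly exploit the best-looking action; explore at random with probability epsilon
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=estimated_reward)

def softmax_choice(actions, estimated_reward, tau=1.0):
    # Rescale estimated rewards into probabilities that sum to 1;
    # tau (the "temperature") can be dampened over time to reduce exploration
    weights = [math.exp(estimated_reward(a) / tau) for a in actions]
    total = sum(weights)
    return random.choices(actions, weights=[w / total for w in weights])[0]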
Example Action; e.g., starting at D, (randomly) move to E
[Map with values: Squares A–E at 0.5, Square F at 1.0]
Suppose, for example, that the random start place selected is D. No next-square is better than any other (except back to D) so we pick one at random. Suppose it’s the alley to E.
Update Value of a State:
Get Closer to Value of Next State
V(st) ← V(st) + α[V(st+1) – V(st)]
where the old V(st) is the original value of this state, V(st+1) is the value of the next state, α is the learning rate, and the new V(st) on the left is the updated value of this state.
A key question is how to reinforce the value of a state after taking an action when the answer is not black-and-white. We have already seen the more obvious reinforcements. The figure shows a general way to reinforce the value of a state, based on the difference between its value and the value of the next state. The learning rate is to be determined, but let’s take it as 0.2 for now, and suppose that we transition from a state worth 10 to a state worth 30. Since the second state is more valuable, the first state should be upgraded in value because it is a way to get there. The formula upgrades the value of the first state to
10 + 0.2[30 – 10] = 14.
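The same update as a one-line Python function, checked against the worked example:

def update_value(v_current, v_next, alpha):
    # V(st) <- V(st) + alpha * (V(st+1) - V(st))
    return v_current + alpha * (v_next - v_current)

print(update_value(10, 30, 0.2))  # 14.0, as computed above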
1. Move (e.g., from D) Greedily Most of the Time
[Map with values: Squares A–E at 0.5, Square F at 1.0]
No update for D’s value results.
Record this.
The figure shows no reinforcement going from a state to one of equal value.
2. Move & Propagate Backwards (e.g., α = 0.25)
[Move from Square E to Square F, which has value 1.0]
V(st) ← V(st) + α[V(st+1) – V(st)]
New value of E: 0.5 + 0.25[1.0 – 0.5] = 0.625
The figure shows an increase in the value of state (square) E.
3. Propagate Backwards (e.g., α = 0.25)
[Map with values: E now 0.625, F 1.0, the other squares 0.5]
V(st) ← V(st) + α[V(st+1) – V(st)]
New value of D: 0.5 + 0.25[0.625 – 0.5] = 0.53125
This change in value propagates backwards through the state diagram, as shown in the figure.
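A short sketch that reproduces the two updates above (the square values are those in the figures; everything else is illustrative):

alpha = 0.25
V = {'A': 0.5, 'B': 0.5, 'C': 0.5, 'D': 0.5, 'E': 0.5, 'F': 1.0}

def update(V, s, s_next, alpha):
    V[s] = V[s] + alpha * (V[s_next] - V[s])

update(V, 'E', 'F', alpha)   # move from E into F (the hotel)
print(V['E'])                # 0.625
update(V, 'D', 'E', alpha)   # a later move from D to E
print(V['D'])                # 0.53125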
Reduce α Over Time
V(st) ← V(st) + α[V(st+1) – V(st)]
(α is the learning rate; the new V(st) on the left replaces the old V(st).)
The wild card so far is the value of α. For learning in general, we tend to allow a lot of leeway at the beginning but reduce it over time. We can do this by, for example, reducing the value of α itself by 10% every time it is applied in an update.
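For example (the 10% reduction is just the illustration mentioned above):

alpha = 0.25
for step in range(100):      # e.g., 100 value updates
    # ... apply V(st) <- V(st) + alpha * (V(st+1) - V(st)) here ...
    alpha *= 0.9             # reduce the learning rate by 10% after each update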
Reinforcement Learning
Definition
Basic Technique
Finite Markov Decision Processes
Discounting, Example, References
Simplifying Assumption: Markovian Process
(Multiple possible next states are allowed.)
Probability of next state St+1 and reward Rt+1
depend only on current state St and action At
(i.e., not on how St was arrived at)
Markov processes may sound sophisticated but the opposite is true: they depend only on how matters stand (current state), not on how the state was arrived at.
Advantages of the Markov Simplification
No model required
Conceptually simple
Barto & Sutton p23
The advantages of simplifying in this way are that we do not depend on a model and that the process becomes uncomplicated. As with most simplifications, experience tells us whether what is lost is acceptable.
Example of Markov Process
http://setosa.io/ev/markov-chains/
can be used, …
“… to check how frequently a new dam will overflow, which depends on the number of rainy days in a row.”
… to include real-world rules such as
“if it’s sunny one day, then the next day is also much more likely to be sunny.”
The reference in the figure is a simple Markov process that can be used, for example, (quoting from the site) “to check how frequently a new dam will overflow, which depends on the number of rainy days in a row.” It allows us to include real-world rules such as “if it’s sunny one day, then the next day is also much more likely to be sunny.”
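A small sketch of such a Markov chain; the transition probabilities are made up, chosen only to encode “sunny tends to stay sunny”:

import random

# Hypothetical next-day probabilities; they depend only on today's weather
transitions = {
    'sunny': {'sunny': 0.9, 'rainy': 0.1},   # sunny tends to stay sunny
    'rainy': {'sunny': 0.5, 'rainy': 0.5},
}

def simulate(start, days):
    state, history = start, [start]
    for _ in range(days):
        options = transitions[state]
        state = random.choices(list(options), weights=list(options.values()))[0]
        history.append(state)
    return history

print(simulate('sunny', 7))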
Reinforcement Learning with Markov Dec. Proc.
Barto & Sutton p48
[Diagram: (1) the agent, in state St, takes action At on the environment; (2) it receives reward Rt. St and Rt are random variables.]
Recall that reinforcement learning takes place in some environment, where the state St reflects the circumstances under which the action is taken; it is the value of the state that is strengthened. The state and reward are treated as random variables, since initially we have no knowledge about them.
Reinforcement Learning with MDP’s
Barto & Sutton p48
[Diagram: (1) the agent, in state St, takes action At on the environment; (2) it receives reward Rt; (3) the environment returns the new state St+1 and reward Rt+1.]
Once an action is taken in a Markov decision process, the state and reward are updated.
Recycling Robot (Barto & Sutton)
At each step, robot must decide:
(1) actively search for a can,
(2) wait for someone to bring it a can, or
(3) go to home base and recharge
Searching is better* but runs down the battery; if the battery runs out, the robot needs rescue!
Decisions made per current energy level
high, low.
Value = # cans collected
* All other things being equal. (Adapted from Barto & Sutton)
Barto and Sutton describe a recycling robot implemented via reinforcement learning, with the rules shown in the figure.
Recycling Robot ctd.
Barto & Sutton
S = {high, low} (battery)  (set of states)
A(high) = {search, wait}  (actions available when high)
A(low) = {search, wait, recharge}
Rsearch = expected # cans while searching  (reward from search)
Rwait = expected # cans while waiting
[Rsearch > Rwait]
The resulting states, actions, and rewards are shown in the figure.
Probability Notations
1. Begin searching, at high energy
p(high energy afterwards) = α (vs. low)
2. Begin searching, at low energy
p(low energy afterwards) = β (vs. depleted)
Barto & Sutton
We can parameterize the probability of energy levels after searching for a can. This depends on whether the robot begins searching when its energy is high (which leaves energy either high or low afterwards) or begins when its energy is low (a high-risk action, which results in low or depleted).
Transition Probabilities
Barto & Sutton
[MDP transition diagram for the recycling robot; e.g., α is the probability of the high outcome, given the search action from the high state.]
The figure concentrates on the robot’s energy level (the key factor). “Depleted” is not shown because it is a dead end at which the robot is incapable of any action. Assuming that the robot’s maintainer has set all of the parameter values, this MDP state diagram (or its tabular form) contains all the information the robot needs to operate. For example, when the robot starts, it is in the high state, with a choice of searching for a can or waiting for one to be deposited with it. The choice is made by comparing Rwait with Rsearch. Given that Rwait < Rsearch, the robot will search for a standard time. At the conclusion, its energy will be either high again or low. The probabilities are used for decision-making and for simulation. Simulation mode allows maintainers to set the reward quantities. Not shown is data collection for the number of cans found, which also influences the values of the parameters.
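A sketch of the recycling robot’s MDP as a transition table, loosely following Barto & Sutton; the numeric values of α, β, and the rewards are placeholders that the robot’s maintainer would set:

# p[(state, action)] -> list of (probability, next_state, reward)
ALPHA, BETA = 0.8, 0.6          # placeholder transition parameters set by the maintainer
R_SEARCH, R_WAIT = 3.0, 1.0     # placeholder expected # cans (R_SEARCH > R_WAIT)
R_RESCUE = -3.0                 # placeholder penalty for running the battery down

p = {
    ('high', 'search'):  [(ALPHA, 'high', R_SEARCH), (1 - ALPHA, 'low', R_SEARCH)],
    ('high', 'wait'):    [(1.0, 'high', R_WAIT)],
    ('low', 'search'):   [(BETA, 'low', R_SEARCH),
                          (1 - BETA, 'high', R_RESCUE)],  # depleted: rescued and recharged
    ('low', 'wait'):     [(1.0, 'low', R_WAIT)],
    ('low', 'recharge'): [(1.0, 'high', 0.0)],
}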
Reinforcement Learning
Definition
Basic Technique
Finite Markov Decision Processes
Discounting, Example, References
Learning with MDP’s
IF “FINAL STEP” MAKES SENSE*:
Add up rewards over the episode’s time steps
IF NOT, APPLY DISCOUNTING.
* (for the application in question)
In deciding on actions, there are two possibilities.
1. If the concept of the “last step” makes sense, it is possible to compute the overall rewards for various actions, and to select the one with highest value.
2. In many cases, it is not feasible to think all the way to a final step. This is closer to the real world. In this case, a dampening technique known as discounting is used.
Discounting
Marsland
Discounted reward at time t = rt+1 + …
(rt+1 is the reward from the next action)
To calculate long-term reward, we can use a discount factor. For example, if I am trying to evaluate the eventual payoff of buying a restaurant, I can map out a sequence of what will follow but I discount each of the steps to reflect my uncertainty about whether they will come to pass. If my time frame is in months, I may calculate the payoff in terms of …
Buy restaurant………..$10K profit first month then …
Remodel…………………$5K additional profit second month then …
Modernize menu……$3K additional profit third month
But we don’t simply add these ($18K), because the further out in time an event is, the less certain we can be about it; so we apply a discount factor (let’s say 20%), obtaining:
$10K + 20%·$5K + 20%·20%·$3K = $11,120. This reflects the increasing uncertainty of the $5K and $3K.
Discounting: Factoring Subsequent Rewards
Adapted from Barto & Sutton
Discount factor γ, 0 ≤ γ < 1
Discounted reward at time t = rt+1 + γ·rt+2 + γ²·rt+3 + …
Guaranteed to converge
(rt+1 is the reward from the action following time t)
The figure shows this mathematically.
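A short sketch of the computation, checked against the restaurant example above (γ = 0.2):

def discounted_return(rewards, gamma):
    # rewards[0] is r(t+1), rewards[1] is r(t+2), ...
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([10_000, 5_000, 3_000], gamma=0.2))  # 11120.0, the restaurant example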
Applications
Crowdsourced reinforcement
https://www.youtube.com/watch?v=cbHlC_SJKMQ
RL can be applied in many ways. The application referenced uses RL to train a robot to lift diverse objects. It begins with rudimentary gripping skills and computes rewards via crowdsourcing every time it takes an action.
TensorFlow RL Libraries
https://tensorlayer.readthedocs.io/en/stable/modules/rein.html
https://tensorlayer.readthedocs.io/en/stable/user/examples.html#reinforcement-learning
TensorFlow has RL libraries, including those referenced above.
Summary
Reinforcement learning strengthens actions that lead to rewards
Define states and actions
Markov decision processes bring probability into RL
Use discounting when there is no clear ending
RL is based on rewards and actions. Learning takes place when an action is taken. In realistic applications, we often use Markov decision processes, an approach that simplifies by ignoring how the system arrived at a state. Another heuristic is discounting of future outcomes, a way of dealing with the impracticability, in the real world, of seeing all chains of actions through to final conclusions.