Final Exam Practice
Total: 100 marks, 120 mins
1. (20) For the average reward setting, the differential state value function is defined as

    v_\pi(s) = E_\pi\Big[ \sum_{k=t+1}^{\infty} \big( R_k - r(\pi) \big) \,\Big|\, S_t = s \Big],

where the average reward is

    r(\pi) = \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} E[R_t].

Derive the following Bellman equation for the differential value function:

    v_\pi(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r \mid s, a) \big[ r - r(\pi) + v_\pi(s') \big].
Ans.

    v_\pi(s) = E_\pi\Big[ \sum_{k=t+1}^{\infty} (R_k - r(\pi)) \,\Big|\, S_t = s \Big]
             = E_\pi\Big[ (R_{t+1} - r(\pi)) + \sum_{k=t+2}^{\infty} (R_k - r(\pi)) \,\Big|\, S_t = s \Big]
             = E_\pi\big[ R_{t+1} - r(\pi) \mid S_t = s \big] + E_\pi\Big[ E_\pi\Big[ \sum_{k=t+2}^{\infty} (R_k - r(\pi)) \,\Big|\, S_{t+1} \Big] \,\Big|\, S_t = s \Big]
             = E_\pi\big[ R_{t+1} - r(\pi) + v_\pi(S_{t+1}) \mid S_t = s \big]
             = \sum_a \pi(a|s) \sum_{s', r} p(s', r \mid s, a) \big[ r - r(\pi) + v_\pi(s') \big],

where the second-to-last step uses the Markov property and the definition of v_\pi(S_{t+1}), and the last step expands the expectation using the definitions \pi(a|s) = \Pr(A_t = a \mid S_t = s) and p(s', r \mid s, a) = \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a).
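As a numerical illustration of the identity just derived, the following Python sketch (the two-state MDP, policy, and rewards below are made up for illustration) computes r(\pi) from the stationary distribution, solves for the differential values, and checks the Bellman equation.

    import numpy as np

    # Hypothetical 2-state, 2-action MDP used only to illustrate the identity.
    # P[s, a, s'] = transition probability, Rm[s, a, s'] = expected reward.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.3, 0.7]]])
    Rm = np.array([[[1.0, 0.0], [0.0, 2.0]],
                   [[0.5, 0.5], [1.0, 1.5]]])
    pi = np.array([[0.4, 0.6],   # pi(a|s) for s = 0
                   [0.7, 0.3]])  # pi(a|s) for s = 1

    # State-to-state transition matrix and expected one-step reward under pi.
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_s = np.einsum('sa,sat,sat->s', pi, P, Rm)

    # Stationary distribution mu of P_pi gives the average reward r(pi).
    evals, evecs = np.linalg.eig(P_pi.T)
    mu = np.real(evecs[:, np.argmax(np.real(evals))])
    mu = mu / mu.sum()
    r_pi = float(mu @ r_s)

    # Differential values: solve (I - P_pi) v = r_s - r(pi) with v[0] fixed to 0
    # (the system only determines v up to an additive constant).
    A = np.vstack([np.eye(2) - P_pi, [1.0, 0.0]])
    b = np.append(r_s - r_pi, 0.0)
    v, *_ = np.linalg.lstsq(A, b, rcond=None)

    # Check the derived Bellman equation:
    # v(s) = sum_a pi(a|s) sum_{s'} P[s,a,s'] * (Rm[s,a,s'] - r(pi) + v[s'])
    rhs = np.einsum('sa,sat->s', pi, P * (Rm - r_pi)) + np.einsum('sa,sat,t->s', pi, P, v)
    print(np.allclose(v, rhs))   # True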
2. [Pseudocode with steps 1-11.]

Either give the full pseudocode or describe it in terms of modifications to the above.
Ans.
* In step 3, replace "V(s)" with "Q(s, a) for all a ∈ A", and "V(terminal) = 0" with "Q(terminal, ·) = 0".
* Switch steps 6 & 7.
* Add before step 9: "A' ← action given by π for S'".
* Replace step 9 with: Q(S, A) ← Q(S, A) + α[ R + γ Q(S', A') − Q(S, A) ].
* Add to step 10: "A ← A'".
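The resulting algorithm is tabular Sarsa. As a sketch only, here is one episode in Python; the environment interface (env.reset() returning a state index, env.step(a) returning (next_state, reward, done)) and the ε-greedy choice for π are assumptions made for illustration.

    import numpy as np

    def epsilon_greedy(Q, s, n_actions, eps, rng):
        """Pick an action for state s: random with prob. eps, else greedy w.r.t. Q."""
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    def sarsa_episode(env, Q, alpha=0.1, gamma=0.99, eps=0.1, rng=None):
        """One episode of tabular Sarsa, mirroring the modified pseudocode above."""
        rng = rng or np.random.default_rng()
        n_actions = Q.shape[1]
        s = env.reset()                                    # initialize S
        a = epsilon_greedy(Q, s, n_actions, eps, rng)      # choose A from S using pi
        done = False
        while not done:
            s_next, r, done = env.step(a)                  # take A, observe R, S'
            a_next = epsilon_greedy(Q, s_next, n_actions, eps, rng)   # A' <- pi for S'
            target = r if done else r + gamma * Q[s_next, a_next]     # Q(terminal, .) = 0
            Q[s, a] += alpha * (target - Q[s, a])          # step 9 (modified)
            s, a = s_next, a_next                          # S <- S', A <- A'
        return Q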
3. (20) Show that the following off-policy TD update for V is correct for a transition S, A ~ b, R, S':

    V(S) \leftarrow V(S) + \alpha \big[ \rho(A|S)(R + \gamma V(S')) - V(S) \big],

by showing the following:

    E_{A \sim b}\big[ \rho(A|S)(R + \gamma V(S')) - V(S) \mid S = s \big] = E_{A \sim \pi}\big[ R + \gamma V(S') - V(S) \mid S = s \big].

Here \rho(A|s) = \pi(A|s) / b(A|s), where \pi and b are policy distributions, with the assumption that \pi(A|s) > 0 \Rightarrow b(A|s) > 0.
Ans. Let's first work on the target:

    E_{A \sim b}\big[ \rho(A|S)(R + \gamma V(S')) \mid S = s \big]
      = \sum_{a, s', r} \Pr(A = a, S' = s', R = r \mid S = s) \, \rho(a|s) (r + \gamma V(s'))
      = \sum_{a, s', r} b(a|s) \, p(s', r \mid s, a) \, \rho(a|s) (r + \gamma V(s'))
      = \sum_{a, s', r} \pi(a|s) \, p(s', r \mid s, a) (r + \gamma V(s'))        [since b(a|s) \rho(a|s) = \pi(a|s)]
      = E_{A \sim \pi}\big[ R + \gamma V(S') \mid S = s \big].                   (1)

On the other hand,

    E_{A \sim b}[ V(S) \mid S = s ] = V(s) = E_{A \sim \pi}[ V(S) \mid S = s ].  (2)

Therefore, the identity follows from (1) − (2).
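As a quick sanity check of this identity, a Python sketch on a made-up single-state example (the policies, rewards, and next-state values below are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical single state with 3 actions; b covers pi (b > 0 wherever pi > 0).
    pi = np.array([0.6, 0.3, 0.1])     # target policy pi(a|s)
    b  = np.array([0.3, 0.3, 0.4])     # behavior policy b(a|s)
    rho = pi / b                       # importance sampling ratios

    r  = np.array([1.0, -2.0, 0.5])    # reward for each action (illustrative)
    vp = np.array([0.2, 0.0, -0.4])    # V(s') reached deterministically by each action
    gamma = 0.9
    target = r + gamma * vp            # R + gamma * V(S') as a function of the action

    # Exact expectations: E_b[rho * (R + gamma V(S'))] equals E_pi[R + gamma V(S')].
    lhs = np.sum(b * rho * target)
    rhs = np.sum(pi * target)
    print(np.isclose(lhs, rhs))        # True

    # Monte Carlo version: sample actions from b and weight by rho.
    acts = rng.choice(3, size=200_000, p=b)
    print(np.mean(rho[acts] * target[acts]))   # approx. rhs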
4. (20) Give the specification of the off-policy Expected Sarsa control method.
Ans. For the transition S, A, R, S', where the action A is drawn from the behavior policy distribution b, update the action value the following way:

    Q(S, A) \leftarrow Q(S, A) + \alpha \big[ R + \gamma \sum_a \pi(a|S') Q(S', a) - Q(S, A) \big],

where \alpha > 0 and the target averages Q(S', ·) under the target policy \pi. For control, \pi is a policy computed from Q, such as \epsilon-greedy (or greedy), and the method is off-policy because the actions are taken from b rather than \pi.
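A minimal Python sketch of this update; the ε-greedy choice for the target policy π and the tabular Q array are illustrative assumptions, not part of the question.

    import numpy as np

    def target_policy_probs(Q, s, eps=0.1):
        """Epsilon-greedy target policy pi(.|s) derived from Q (illustrative choice)."""
        n = Q.shape[1]
        probs = np.full(n, eps / n)
        probs[np.argmax(Q[s])] += 1.0 - eps
        return probs

    def expected_sarsa_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99, eps=0.1):
        """Off-policy Expected Sarsa: A was drawn from a behavior policy b, but the
        bootstrap target averages Q(S', .) under the target policy pi."""
        if done:
            expected_next = 0.0                        # Q(terminal, .) = 0
        else:
            pi_next = target_policy_probs(Q, s_next, eps)
            expected_next = float(np.dot(pi_next, Q[s_next]))
        Q[s, a] += alpha * (r + gamma * expected_next - Q[s, a])
        return Q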
5. (20) An agent is in a 3-state MDP with states S = {1, 2, 3}, and in each state the agent has two actions {1, 2}. The agent uses Tabular Dyna-Q. Assume the agent observed the following trajectory:

    S_0 = 1, A_0 = 1, R_1 = -1, S_1 = 2, A_1 = 2, R_2 = -1, S_2 = 3, A_2 = 1, R_3 = 2, S_3 = 1, A_3 = 1, R_4 = 2, S_4 = 1.

Which of the following are possible (or not possible) simulated transitions {S, A, R, S'}, given the above observed trajectory, with a deterministic model
and random search control?

    i.   {S = 1, A = 1, R = 2,  S' = 1}
    ii.  {S = 3, A = 1, R = 2,  S' = 1}
    iii. {S = 1, A = 1, R = -1, S' = 2}
    iv.  {S = 2, A = 2, R = -1, S' = 1}
    v.   {S = 3, A = 2, R = 2,  S' = 3}

Just mention possible or not possible for each.
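For reference, a short Python sketch of how Tabular Dyna-Q's deterministic model and random search control generate simulated transitions (the trajectory literal follows the reconstruction above): a candidate {S, A, R, S'} is possible only if (S, A) has been visited and the model's stored outcome for it is exactly (R, S').

    import numpy as np

    # Observed trajectory (as reconstructed above): list of (S, A, R, S') tuples.
    trajectory = [(1, 1, -1, 2), (2, 2, -1, 3), (3, 1, 2, 1), (1, 1, 2, 1)]

    # Deterministic model: keeps the latest observed outcome for each (S, A) pair.
    model = {}
    for s, a, r, s_next in trajectory:
        model[(s, a)] = (r, s_next)

    # Random search control: planning samples uniformly among previously seen (S, A).
    def simulate(rng):
        s, a = list(model.keys())[rng.integers(len(model))]
        r, s_next = model[(s, a)]
        return (s, a, r, s_next)

    # A candidate simulated transition is possible iff it matches the model exactly.
    def possible(s, a, r, s_next):
        return model.get((s, a)) == (r, s_next)

    rng = np.random.default_rng(0)
    print(simulate(rng))              # one simulated transition
    print(possible(1, 1, -1, 2))      # earlier outcomes of a pair can be overwritten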
6. (20) The same as the worksheet 11 question regarding gradient computations for neural networks, except that the activation g is tanh instead of ReLU.
Ans.

    g(a) = tanh(a) = (e^a - e^{-a}) / (e^a + e^{-a}),
    g'(x) = \partial g(x) / \partial x = 1 - g(x)^2,
    g(0) = 0.

(a)  x_i = g(\psi_i) = (e^{\psi_i} - e^{-\psi_i}) / (e^{\psi_i} + e^{-\psi_i}).

(b)  The chain-rule computation is the same as in the worksheet 11 solution, with g'(\psi) = 1 - g(\psi)^2 in place of the ReLU derivative. Here the gradient terms through A are 0 (as A_{ij} = 0 \,\forall\, i, j), and g(\psi_i) = g(0) = 0, so the remaining terms of the form g(\psi_i) s_j are 0 as well; hence the gradients are 0.
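A quick numerical check in Python of the tanh facts used above (the test points and finite-difference step are arbitrary):

    import numpy as np

    def g(a):
        """tanh activation, written out as (e^a - e^-a) / (e^a + e^-a)."""
        return (np.exp(a) - np.exp(-a)) / (np.exp(a) + np.exp(-a))

    def g_prime(a):
        """Derivative of tanh: 1 - tanh(a)^2."""
        return 1.0 - g(a) ** 2

    a = np.linspace(-3, 3, 7)
    h = 1e-6
    finite_diff = (g(a + h) - g(a - h)) / (2 * h)    # central difference approximation

    print(np.allclose(g(a), np.tanh(a)))             # matches NumPy's tanh
    print(np.allclose(finite_diff, g_prime(a), atol=1e-6))
    print(g(0.0) == 0.0)                             # g(0) = 0, used in part (b)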