
Final Exam Practice
Total marks: 100
Time: 120 mins

1. (20) For the average reward setting, the differential state value function is defined as

$$v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=1}^{\infty}\big(R_{t+k} - r(\pi)\big)\,\Big|\,S_t = s\right],$$

where the average reward is $r(\pi) = \lim_{h\to\infty}\frac{1}{h}\sum_{t=1}^{h}\mathbb{E}\big[R_t\big]$.

Derive the following Bellman equation for the differential value function, showing all steps:

$$v_\pi(s) = \sum_a \pi(a \mid s)\sum_{s',r} p(s', r \mid s, a)\big[r - r(\pi) + v_\pi(s')\big].$$
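One possible route for the derivation (a sketch, not necessarily the intended model solution, using only the definition and the notation above): split off the first term of the sum, condition on the first action and transition, and reuse the definition for the remaining sum.

$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi\big[R_{t+1} - r(\pi) \mid S_t = s\big] + \mathbb{E}_\pi\!\Big[\sum_{k=2}^{\infty}\big(R_{t+k} - r(\pi)\big) \,\Big|\, S_t = s\Big] \\
&= \sum_a \pi(a \mid s)\sum_{s',r} p(s', r \mid s, a)\big[r - r(\pi)\big] + \sum_a \pi(a \mid s)\sum_{s',r} p(s', r \mid s, a)\, v_\pi(s') \\
&= \sum_a \pi(a \mid s)\sum_{s',r} p(s', r \mid s, a)\big[r - r(\pi) + v_\pi(s')\big].
\end{aligned}
$$

The second line uses the Markov property: given $S_{t+1} = s'$, the expected remaining differential return from time $t+1$ onward is $v_\pi(s')$.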

2. (20) [Question text not legible in the source scan; only answer-line numbering remains.]

3. (20) Show that the following off-policy TD update for $v_\pi$ is correct for a transition $S, A, R, S'$:

$$V(S) \leftarrow V(S) + \alpha\, \rho(A \mid S)\big(R + \gamma V(S') - V(S)\big),$$

by showing the following:

$$\mathbb{E}_{A \sim b}\big[\rho(A \mid S)\big(R + \gamma V(S') - V(S)\big) \,\big|\, S = s\big] = \mathbb{E}_{A \sim \pi}\big[R + \gamma V(S') - V(S) \,\big|\, S = s\big].$$

Here $\rho(A \mid S) = \dfrac{\pi(A \mid S)}{b(A \mid S)}$, where $\pi$ and $b$ are distributions, with the assumption $\pi(A \mid s) > 0 \implies b(A \mid s) > 0$.
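One way to approach this (a sketch for discrete actions; the shorthand $\delta(a)$ is introduced here and is not part of the question): write the behavior-policy expectation as a sum over actions and cancel $b(a \mid s)$ against the denominator of $\rho$.

$$
\begin{aligned}
\mathbb{E}_{A \sim b}\big[\rho(A \mid S)\,\delta(A) \mid S = s\big]
&= \sum_{a:\, b(a \mid s) > 0} b(a \mid s)\,\frac{\pi(a \mid s)}{b(a \mid s)}\,\delta(a)
= \sum_{a:\, b(a \mid s) > 0} \pi(a \mid s)\,\delta(a) \\
&= \sum_a \pi(a \mid s)\,\delta(a)
= \mathbb{E}_{A \sim \pi}\big[\delta(A) \mid S = s\big],
\end{aligned}
$$

where $\delta(a) = \mathbb{E}\big[R + \gamma V(S') - V(S) \mid S = s, A = a\big]$, which is the same under both policies because $R$ and $S'$ depend only on $(s, a)$. The coverage assumption matters in the third equality: if $b(a \mid s) = 0$ then $\pi(a \mid s) = 0$ as well, so the actions excluded from the behavior-policy sum contribute nothing to the target-policy sum.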

4. (20) Give the specification of the off-policy Expected Sarsa control method.
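For reference, a minimal Python sketch of the tabular update at the heart of Expected Sarsa (the function names, and the step-size and discount values, are illustrative assumptions, not part of the question). The expectation in the target is taken under the target policy; when that policy differs from the behavior policy that generated the data (e.g., greedy target, epsilon-greedy behavior), the method is off-policy:

import numpy as np

def target_policy_probs(Q, s, eps=0.0):
    """Action probabilities of the target policy in state s.
    eps=0.0 gives a greedy target (off-policy w.r.t. an epsilon-greedy
    behavior policy); eps>0 gives an epsilon-greedy target."""
    n_actions = Q.shape[1]
    probs = np.full(n_actions, eps / n_actions)
    probs[np.argmax(Q[s])] += 1.0 - eps
    return probs

def expected_sarsa_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Expected Sarsa backup for the transition (s, a, r, s_next)."""
    pi_next = target_policy_probs(Q, s_next)
    expected_v = np.dot(pi_next, Q[s_next])   # E_pi[ Q(s', .) ]
    Q[s, a] += alpha * (r + gamma * expected_v - Q[s, a])
    return Q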

5. (20) An agent is in a 3-state MDP, $S \in \{1, 2, 3\}$, where each state has two actions $\{1, 2\}$. The agent has observed the following trajectory:

$S_0 = 1,\; A_0 = 1,\; R_1 = -1,\; S_1 = 2,\; A_1 = 2,\; R_2 = -1,\; S_2 = 3,\; A_2 = 1,\; R_3 = 2,\; S_3 = 2,\; A_3 = 1,\; R_4 = 2,\; S_4 = 1.$

The agent uses Tabular Dyna-Q.
Which of the following are possible (or not possible) simulated transitions $\{S, A, R, S'\}$ given the above observed trajectory, with a deterministic model and random search control?

i. $\{S=1, A=1, R=2, S'=1\}$
ii. $\{S=3, A=1, R=2, S'=1\}$
iii. $\{S=1, A=1, R=1, S'=2\}$
iv. $\{S=2, A=2, R=-1, S'=3\}$
v. $\{S=2, A=2, R=-2, S'=3\}$

Just mention possible or not possible for each.
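As context for how such simulated transitions are generated, here is a minimal Python sketch of the deterministic model and random search control used by tabular Dyna-Q planning (the names update_model and simulate are illustrative, not from the question). With a deterministic model, the agent stores for each previously visited state-action pair exactly the last reward and next state it observed, and random search control only ever picks previously visited pairs, so a simulated transition is possible only if it reproduces an observed (S, A, R, S') exactly:

import random

# Deterministic tabular model: model[(s, a)] = (r, s_next),
# overwritten with the most recent observation for that pair.
model = {}

def update_model(s, a, r, s_next):
    """Record an observed transition."""
    model[(s, a)] = (r, s_next)

def simulate():
    """Random search control: pick a previously observed (s, a) uniformly,
    then return the stored reward and next state for that pair."""
    s, a = random.choice(list(model.keys()))
    r, s_next = model[(s, a)]
    return s, a, r, s_next

# The observed trajectory from the question, split into (S, A, R, S') steps.
trajectory = [(1, 1, -1, 2), (2, 2, -1, 3), (3, 1, 2, 2), (2, 1, 2, 1)]
for s, a, r, s_next in trajectory:
    update_model(s, a, r, s_next)

print(simulate())  # always one of the four stored transitions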

6. (20) The same as the worksheet 11 question regarding gradient computations for neural networks, except that the activation g(z) is tanh instead of ReLU.
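The only change relative to the ReLU version is the local derivative of the activation: for $g(z) = \tanh(z)$, $g'(z) = 1 - \tanh^2(z)$, in place of the 0/1 indicator used for ReLU. A minimal numpy sketch of one forward/backward pass through a single tanh hidden layer with squared-error loss (the layer shapes and names are illustrative assumptions, not the worksheet's notation):

import numpy as np

def forward_backward(x, W1, b1, W2, b2, y):
    """One hidden layer with tanh, squared-error loss; returns gradients."""
    z1 = W1 @ x + b1
    h = np.tanh(z1)                      # g(z) = tanh(z)
    y_hat = W2 @ h + b2                  # linear output layer
    dy = y_hat - y                       # dL/dy_hat for L = 0.5*||y_hat - y||^2
    dW2 = np.outer(dy, h)
    db2 = dy
    dh = W2.T @ dy
    dz1 = dh * (1.0 - np.tanh(z1) ** 2)  # g'(z) = 1 - tanh^2(z)
    dW1 = np.outer(dz1, x)
    db1 = dz1
    return dW1, db1, dW2, db2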