COMP9444 Neural Networks and Deep Learning Quiz 7 (Reinforcement Learning)
This is an optional quiz to test your understanding of the material from Week 7.
1. Explain the difference between the following paradigms, in terms of what is presented to the agent, and what the agent aims to do:
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Supervised Learning: Each training item includes an input and a target output. The aim is to predict the output, given the input (for the training set as well as an unseen test set).
Unsupervised Learning: Each training item consists of only an input (no target value). The aim is to learn hidden features, or to infer whatever structure you can, from the data (input items).
Reinforcement Learning: An agent chooses actions in a simulated environment, observing its state and receiving rewards along the way. The aim is to maximize the cumulative reward.
2. Describe the elements (sets and functions) that are needed to give a formal description of a reinforcement learning environment. What is the difference between a deterministic environment and a stochastic environment?
Formally, a reinforcement learning environment is defined by a set of states S, a set of actions A, a transition function δ and a reward function ℛ. For a deterministic environment, δ and ℛ are single-valued functions:
δ : S × A → S and ℛ : S × A → ℝ
For a stochastic environment, δ and/or ℛ are not single-valued, but instead define a probability distribution on S or ℝ.
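To make the distinction concrete, the following is a minimal sketch (not part of the quiz answer) of a toy environment; the state space, actions, rewards and the 0.8 success probability are all made-up illustrative values.

import random

STATES  = [0, 1, 2, 3]          # S: set of states
ACTIONS = ["left", "right"]     # A: set of actions

def delta_deterministic(s, a):
    # delta : S x A -> S, a single-valued transition function
    return max(0, s - 1) if a == "left" else min(3, s + 1)

def reward(s, a):
    # R : S x A -> reals, the reward for performing action a in state s
    return 1.0 if (s, a) == (2, "right") else 0.0

def delta_stochastic(s, a):
    # Stochastic version: the next state is drawn from a probability
    # distribution over S (here the intended move succeeds with
    # probability 0.8, otherwise the agent stays where it is).
    return delta_deterministic(s, a) if random.random() < 0.8 else s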
3. Name three different models of optimality in reinforcement learning, and give a formula for calculating each one.
Finite horizon reward: Σ_{0 ≤ i < h} r_{t+i}
Infinite discounted reward: Σ_{i ≥ 0} γ^i r_{t+i}, where 0 ≤ γ < 1
Average reward: lim_{h → ∞} (1/h) Σ_{0 ≤ i < h} r_{t+i}
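As a concrete illustration (not part of the quiz answer), the three measures can be computed for a short sample of rewards; the reward values, γ and the horizon below are arbitrary examples, and the infinite sum is necessarily truncated at the horizon h.

rewards = [1.0, 0.0, 2.0, 1.0, 0.5]   # r_{t+i} for i = 0 .. h-1 (example values)
gamma   = 0.9                          # discount factor, 0 <= gamma < 1
h       = len(rewards)

finite_horizon = sum(rewards)                                       # sum of r_{t+i} for 0 <= i < h
discounted     = sum(gamma**i * r for i, r in enumerate(rewards))   # sum of gamma^i * r_{t+i}, truncated at h
average        = finite_horizon / h                                 # (1/h) * sum, before letting h -> infinity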
4. What is the definition of:
a. the optimal policy
b. the value function
c. the Q-function?
a. The optimal policy is the function π* : S → A which maximizes the infinite discounted reward.
b. The value function Vπ(s) is the expected infinite discounted reward obtained by following policy π starting from state s. If π = π* is optimal, then V*(s) = Vπ*(s) is the maximum (expected) infinite discounted reward obtainable from state s.
c. The Q-function Qπ(s, a) is the expected infinite discounted reward received by an agent who begins in state s, first performs action a and then follows policy π for all subsequent timesteps. If π = π* is optimal, then Q*(s, a) = Qπ*(s, a) is the maximum (expected) discounted reward obtainable from s, if the agent is forced to take action a in the first timestep but can act optimally thereafter.

5. Assuming a stochastic environment, discount factor γ and learning rate η, write the equation for
a. Temporal Difference learning TD(0)
b. Q-Learning
Remember to define any symbols you use.

a. TD(0): V(s_t) ← V(s_t) + η [ r_t + γ V(s_{t+1}) − V(s_t) ]
b. Q-Learning: Q(s_t, a_t) ← Q(s_t, a_t) + η [ r_t + γ max_b Q(s_{t+1}, b) − Q(s_t, a_t) ]
where s_t is the state at time t, a_t is the action performed at time t, r_t is the reward received at time t, and s_{t+1} is the state at time t+1.
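The two update rules translate directly into tabular code. The sketch below is not part of the quiz answer: V and Q are stored in dictionaries, and the values of eta, gamma and the helper names are illustrative assumptions.

from collections import defaultdict

eta, gamma = 0.1, 0.9                  # learning rate and discount factor (example values)
V = defaultdict(float)                 # V[s], initialised to 0
Q = defaultdict(float)                 # Q[(s, a)], initialised to 0

def td0_update(s_t, r_t, s_next):
    # V(s_t) <- V(s_t) + eta [ r_t + gamma V(s_{t+1}) - V(s_t) ]
    V[s_t] += eta * (r_t + gamma * V[s_next] - V[s_t])

def q_learning_update(s_t, a_t, r_t, s_next, actions):
    # Q(s_t, a_t) <- Q(s_t, a_t) + eta [ r_t + gamma max_b Q(s_{t+1}, b) - Q(s_t, a_t) ]
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s_t, a_t)] += eta * (r_t + gamma * best_next - Q[(s_t, a_t)])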