COMP9444 Neural Networks and Deep Learning
Quiz 8: Deep RL and Unsupervised Learning
This is an optional quiz to test your understanding of Deep RL and Unsupervised Learning.
1. Write out the steps in the REINFORCE algorithm, making sure to define any symbols you use.
for each trial
    run trial and collect states s_t, actions a_t, and reward r_total
    for t = 1 to length(trial)
        θ ← θ + η ( r_total – b ) ∇θ log πθ( a_t | s_t )
    end
end
θ = parameters of the policy, η = learning rate, r_total = total reward received during the trial,
b = baseline (constant), ∇θ = gradient with respect to θ, πθ( a_t | s_t ) = probability of performing action a_t in state s_t.
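As an illustration only (not part of the original answer), a minimal Python sketch of this update for a tabular softmax policy; the state/action counts and the trial data are made-up toy values:

import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_update(theta, trial, r_total, eta=0.01, b=0.0):
    # trial is a list of (s_t, a_t) pairs; r_total is the total reward for the trial
    for s_t, a_t in trial:
        probs = softmax(theta[s_t])
        grad_log_pi = -probs              # gradient of log pi_theta(a_t | s_t) w.r.t. theta[s_t, :]
        grad_log_pi[a_t] += 1.0           # ... equals one_hot(a_t) - probs for a softmax policy
        theta[s_t] += eta * (r_total - b) * grad_log_pi
    return theta

# Toy usage: 3 states, 2 actions, one trial with total reward 5.
theta = reinforce_update(np.zeros((3, 2)), trial=[(0, 1), (2, 0), (1, 1)], r_total=5.0)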
2. In the context of Deep Q-Learning, explain the following:
a. Experience Replay
The agent(s) choose actions according to their current Q-function, using an ε-greedy strategy, and contribute to a central database of experiences in the form ( s_t, a_t, r_t, s_{t+1} ). Another thread samples experiences asynchronously from the experience database, and updates the Q-function by gradient descent, to minimize
[ r_t + γ max_b Q_w( s_{t+1}, b ) – Q_w( s_t, a_t ) ]²
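A rough Python sketch of this scheme with a tabular Q-function (the buffer size, learning rate and helper names are illustrative assumptions, not from the course):

import random
from collections import deque

buffer = deque(maxlen=10_000)        # central database of experiences
Q = {}                               # Q[(s, a)], defaulting to 0.0
gamma, eta = 0.99, 0.1

def q(s, a):
    return Q.get((s, a), 0.0)

def store(s_t, a_t, r_t, s_next):
    buffer.append((s_t, a_t, r_t, s_next))     # (s_t, a_t, r_t, s_{t+1})

def replay_step(actions, batch_size=32):
    # Sample past experiences and move Q(s_t, a_t) towards r_t + gamma * max_b Q(s_{t+1}, b)
    for s_t, a_t, r_t, s_next in random.sample(buffer, min(batch_size, len(buffer))):
        target = r_t + gamma * max(q(s_next, b) for b in actions)
        Q[(s_t, a_t)] = q(s_t, a_t) + eta * (target - q(s_t, a_t))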
b. Double Q-Learning
Two sets of Q values are maintained. The current Q-network w is used to select actions, and a slightly older Q-network w̄ is used for the target value.
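Continuing the tabular sketch above, the Double Q-Learning target could be computed roughly as follows (the function names are illustrative):

def double_q_target(s_next, r_t, actions, q_current, q_old, gamma=0.99):
    # The current network w selects the action; the older network w-bar evaluates it.
    a_star = max(actions, key=lambda b: q_current(s_next, b))
    return r_t + gamma * q_old(s_next, a_star)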
3. What is the Energy function for these architectures:
a. Boltzmann Machine
b. Restricted Boltzmann Machine
Remember to define any variables you use.
a. Boltzmann Machine
E(x) = -( Σ_{i<j} x_i w_{ij} x_j + Σ_i b_i x_i )
where x_i = activation of node i (0 or 1), w_{ij} = connection weight between nodes i and j, and b_i = bias of node i.
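A small numerical check of this formula in Python (the weights, biases and state are made-up toy values):

import numpy as np

def boltzmann_energy(x, W, b):
    # E(x) = -( sum_{i<j} x_i W_ij x_j + sum_i b_i x_i ), with x_i in {0, 1}
    n = len(x)
    pair_term = sum(x[i] * W[i, j] * x[j] for i in range(n) for j in range(i + 1, n))
    return -(pair_term + b @ x)

x = np.array([1, 0, 1])
W = np.array([[ 0.0, 0.5, -1.0],
              [ 0.5, 0.0,  2.0],
              [-1.0, 2.0,  0.0]])    # symmetric weights, zero diagonal
b = np.array([0.1, 0.2, 0.3])
print(boltzmann_energy(x, W, b))     # -( W[0,2] + 0.1 + 0.3 ) = 0.6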
b. Restricted Boltzmann Machine
E( v, h ) = -( Σ_i b_i v_i + Σ_j c_j h_j + Σ_{i,j} v_i w_{ij} h_j )
where v = visible unit activations, h = hidden unit activations, b_i, c_j = biases of the visible and hidden units, and w_{ij} = connection weight between visible unit i and hidden unit j.
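A corresponding toy check of the RBM energy (W here is indexed visible × hidden; all values are illustrative):

import numpy as np

def rbm_energy(v, h, W, b, c):
    # E(v, h) = -( sum_i b_i v_i + sum_j c_j h_j + sum_{i,j} v_i W_ij h_j )
    return -(b @ v + c @ h + v @ W @ h)

v = np.array([1, 0, 1])              # visible unit activations
h = np.array([0, 1])                 # hidden unit activations
W = np.array([[ 0.2, -0.5],
              [ 1.0,  0.3],
              [-0.1,  0.4]])         # W[i, j]: visible unit i <-> hidden unit j
b = np.array([0.1, 0.0, -0.2])       # visible biases
c = np.array([0.5, -0.3])            # hidden biases
print(rbm_energy(v, h, W, b, c))     # -( -0.1 - 0.3 - 0.1 ) = 0.5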
4. The Variational Auto-Encoder is trained to maximize
E_{z∼qφ(z|x_i)} [ log pθ( x_i | z ) ] – D_KL( qφ(z|x_i) || p(z) )
Briefly state what each of these two terms aims to achieve.
The first term enforces that any sample z drawn from the conditional distribution qφ(z|x_i) should, when fed to the decoder, produce something approximating x_i.
The second term encourages the distribution qφ(z|x_i) to approximate the Normal distribution p(z) (by minimizing the KL-divergence between the two distributions).
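A sketch of how the two terms are commonly estimated for a Gaussian encoder qφ(z|x) = N(μ, diag(σ²)), a Bernoulli decoder, and a standard Normal prior p(z); the single-sample Monte Carlo estimate and all shapes are illustrative assumptions:

import numpy as np

def elbo_terms(x, mu, log_var, decode):
    # One-sample Monte Carlo estimate of E_{z~q}[log p_theta(x | z)],
    # using the reparameterisation z = mu + sigma * eps.
    eps = np.random.randn(*mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps
    x_hat = decode(z)                              # Bernoulli means from the decoder
    recon = np.sum(x * np.log(x_hat + 1e-8) + (1 - x) * np.log(1 - x_hat + 1e-8))
    # Closed-form D_KL( N(mu, sigma^2) || N(0, I) )
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return recon, kl                               # the VAE maximizes recon - kl

# Toy usage: 2-d latent, 4-d binary input, a fixed stand-in decoder.
decode = lambda z: 1.0 / (1.0 + np.exp(-(z.sum() + np.zeros(4))))
recon, kl = elbo_terms(np.array([1., 0., 1., 1.]), np.zeros(2), np.zeros(2), decode)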
5. Generative Adversarial Networks traditionally made use of a two-player zero-sum game between a Generator Gθ and a Discriminator Dψ, to compute
min_θ max_ψ V( Gθ, Dψ )
a. Give the formula for V( Gθ, Dψ ).
V( Gθ, Dψ ) = E_{x∼p_data} [ log Dψ(x) ] + E_{z∼p_model} [ log( 1 - Dψ( Gθ(z) ) ) ]
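A minibatch estimate of this value function might look as follows in Python (D and G are stand-in callables returning probabilities and generated samples; this is a sketch, not the course's reference code):

import numpy as np

def gan_value(D, G, x_real, z, eps=1e-8):
    # Monte Carlo estimate of V(G_theta, D_psi) on one minibatch
    real_term = np.mean(np.log(D(x_real) + eps))        # E_{x~p_data}[ log D(x) ]
    fake_term = np.mean(np.log(1.0 - D(G(z)) + eps))    # E_{z~p_model}[ log(1 - D(G(z))) ]
    return real_term + fake_term                        # D ascends this, G descends it (zero-sum)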
b. Explain why it may be advantageous to change the GAN algorithm so that the game is no longer zero-sum, and write the formula that the Generator would try to maximize in that case.
The quality of the generated images tends to improve if the Generator instead tries to maximize
E_{z∼p_model} [ log Dψ( Gθ(z) ) ]
This forces the Generator to put emphasis on improving the poor-quality images, rather than taking the images that are already good and making them slightly better.
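In the same sketch as above, the Generator's non-saturating objective would be estimated as:

import numpy as np

def generator_objective(D, G, z, eps=1e-8):
    # Non-saturating objective the Generator maximizes: E_{z~p_model}[ log D(G(z)) ]
    return np.mean(np.log(D(G(z)) + eps))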
6. In the context of GANs, briefly explain what is meant by mode collapse, and list three different methods for avoiding it.
Mode collapse is when the Generator produces only a small subset of the desired range of images, or converges to a single image (with minor variations).
Methods for avoiding mode collapse include: Conditioning Augmentation, Minibatch Features and Unrolled GANs.