
COMP9444 Neural Networks and Deep Learning
10b. Summary
⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 1
McCulloch & Pitts Model of a Single Neuron
[Figure: a single neuron with inputs x1, x2, weights w1, w2 and bias weight w0 = −th, computing s = w1x1 + w2x2 + w0 = w1x1 + w2x2 − th and output g(s)]
x1, x2 are inputs
w1, w2 are synaptic weights
w0 is a bias weight
th is a threshold
g is transfer function
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 2
Perceptron Learning Rule
Adjust the weights as each input is presented.
recall: s = w1x1 + w2x2 + w0
if g(s) = 0 but should be 1:
    w_k ← w_k + η x_k
    w0 ← w0 + η
    so s ← s + η (1 + ∑_k x_k^2)
if g(s) = 1 but should be 0:
    w_k ← w_k − η x_k
    w0 ← w0 − η
    so s ← s − η (1 + ∑_k x_k^2)
otherwise, weights are unchanged. (η > 0 is called the learning rate)
Theorem: This will eventually learn to classify the data correctly, as long as they are linearly separable.
COMP9444 ⃝c Alan Blair, 2017-20
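A minimal NumPy sketch of the perceptron learning rule above. The data (the OR function), learning rate and epoch limit are hypothetical choices for illustration only.

import numpy as np

# hypothetical linearly separable data: the OR function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 1, 1, 1])

w = np.zeros(2)   # w1, w2
w0 = 0.0          # bias weight
eta = 0.1         # learning rate

for epoch in range(100):
    errors = 0
    for x, target in zip(X, t):
        s = w @ x + w0
        g = 1 if s >= 0 else 0          # step transfer function
        if g == 0 and target == 1:      # output too low: add eta * x
            w += eta * x
            w0 += eta
            errors += 1
        elif g == 1 and target == 0:    # output too high: subtract eta * x
            w -= eta * x
            w0 -= eta
            errors += 1
    if errors == 0:                     # converged (guaranteed if separable)
        break

print(w, w0)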

COMP9444 20T3 Review 3
Limitations of Perceptrons
Problem: many useful functions are not linearly separable (e.g. XOR)
[Plots of (a) I1 and I2, (b) I1 or I2, (c) I1 xor I2 in the (I1, I2) plane: AND and OR are linearly separable, but XOR is not]
Possible solution:
x1 XOR x2 can be written as: (x1 AND x2) NOR (x1 NOR x2)
Recall that AND, OR and NOR can be implemented by perceptrons.
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review
4
Multi-Layer Neural Networks
[Figure: a two-layer network computing XOR, built from an AND unit and NOR units with weights +1, −1 and bias weights −1.5, +0.5, +0.5]
Problem: How can we train it to learn a new function? (credit assignment)
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 5
Types of Learning
􏰁 Supervised Learning
◮ agent is presented with examples of inputs and their target outputs
􏰁 Reinforcement Learning
◮ agent is not presented with target outputs, but is given a reward signal, which it aims to maximize
􏰁 Unsupervised Learning
◮ agent is only presented with the inputs themselves, and aims to find structure in these inputs
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 6
Ockham’s Razor
“The most likely hypothesis is the simplest one consistent with the data.”
[Figure: three hypotheses fitted to the same o/x data, labelled inadequate, good compromise, and over-fitting]
Since there can be noise in the measurements, in practice need to make a tradeoff between simplicity of the hypothesis and how well it fits the data.
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review
7
Two-Layer Neural Network
[Figure: output units a_i connected to hidden units a_j by weights W_j,i; hidden units connected to input units a_k by weights W_k,j]
Normally, the numbers of input and output units are fixed,
but we can choose the number of hidden units. COMP9444
⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 8
Local Search in Weight Space
Problem: because of the step function, the landscape will not be smooth but will instead consist almost entirely of flat local regions and “shoulders”, with occasional discontinuous jumps.
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3
Review 9
Key Idea
[Plots of three transfer functions: (a) Step function, (b) Sign function, (c) Sigmoid function]
Replace the (discontinuous) step function with a differentiable function, such as the sigmoid:
g(s) = 1 / (1 + e^(−s))
or hyperbolic tangent
g(s) = tanh(s) = (e^s − e^(−s)) / (e^s + e^(−s)) = 2 / (1 + e^(−2s)) − 1
COMP9444 ⃝c Alan Blair, 2017-20
COMP9444 20T3 Review 10
Gradient Descent (4.3)
Recall that the error function E is (half) the sum over all input patterns of the square of the difference between actual output and desired output
E = (1/2) ∑ (z − t)^2
The aim is to find a set of weights for which E is very low.
If the functions involved are smooth, we can use multi-variable calculus to adjust the weights in such a way as to take us in the steepest downhill
direction.
w ← w − η ∂E/∂w
Parameter η is called the learning rate. COMP9444
⃝c Alan Blair, 2017-20
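As a concrete illustration of the update w ← w − η ∂E/∂w, here is a hedged NumPy sketch that fits a single linear unit z = w·x + b by gradient descent on the squared error. The toy data and learning rate are invented for the example.

import numpy as np

# hypothetical training data for a linear unit z = w*x + b
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.0, 3.0, 5.0, 7.0])   # generated by w = 2, b = 1

w, b = 0.0, 0.0
eta = 0.05                           # learning rate

for step in range(2000):
    z = w * x + b
    # E = 0.5 * sum((z - t)^2), so dE/dw = sum((z - t) * x), dE/db = sum(z - t)
    dE_dw = np.sum((z - t) * x)
    dE_db = np.sum(z - t)
    w -= eta * dE_dw                 # steepest-descent update
    b -= eta * dE_db

print(w, b)                          # should approach 2 and 1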

COMP9444 20T3 Review 11
Variations on Backprop
􏰁 Cross Entropy
◮ problem: least squares error function unsuitable for classification,
where target = 0 or 1
◮ mathematical theory: maximum likelihood
◮ solution: replace with cross entropy error function
􏰁 Weight Decay
◮ problem: weights “blow up”, and inhibit further learning
◮ mathematical theory: Bayes’ rule
◮ solution: add weight decay term to error function
􏰁 Momentum
◮ problem: weights oscillate in a “rain gutter”
◮ solution: weighted average of gradient over time
COMP9444
⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 12
Cross Entropy
For classification tasks, target t is either 0 or 1, so better to use
E = −t log(z) − (1−t) log(1−z)
This can be justified mathematically, and works well in practice – especially when negative examples vastly outweigh positive ones.
It also makes the backprop computations simpler:
∂E/∂z = (z − t) / (z(1 − z))
if z = 1/(1 + e^(−s)), then
∂E/∂s = (∂E/∂z)(∂z/∂s) = z − t
COMP9444 ⃝c Alan Blair, 2017-20
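A small NumPy check (a sketch with made-up values, not part of the original slides) of the simplification ∂E/∂s = z − t when cross entropy is combined with a sigmoid output:

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

s = np.array([-2.0, 0.5, 3.0])   # hypothetical pre-activations
t = np.array([0.0, 1.0, 1.0])    # targets are 0 or 1
z = sigmoid(s)

# cross entropy error and its gradient with respect to s
E = -np.sum(t * np.log(z) + (1 - t) * np.log(1 - z))
dE_ds = z - t                    # the simple form derived above

# compare against a numerical gradient as a sanity check
eps = 1e-6
num = np.zeros_like(s)
for i in range(len(s)):
    sp, sm = s.copy(), s.copy()
    sp[i] += eps
    sm[i] -= eps
    Ep = -np.sum(t * np.log(sigmoid(sp)) + (1 - t) * np.log(1 - sigmoid(sp)))
    Em = -np.sum(t * np.log(sigmoid(sm)) + (1 - t) * np.log(1 - sigmoid(sm)))
    num[i] = (Ep - Em) / (2 * eps)

print(dE_ds, num)                # the two should agree closely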

COMP9444 20T3 Review 13
Bayes’ Rule (3.11)
The formula for conditional probability can be manipulated to find a relationship when the two variables are swapped:
P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
→ Bayes’ rule: P(a | b) = P(b | a) P(a) / P(b)
This is often useful for assessing the probability of an underlying cause after an effect has been observed:
P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)
COMP9444 ⃝c Alan Blair, 2017-20
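A hedged numeric illustration of Bayes' rule; the probabilities below are invented for the example and are not taken from the slides.

# hypothetical values: P(Cause), P(Effect | Cause), P(Effect | not Cause)
p_cause = 0.01
p_effect_given_cause = 0.9
p_effect_given_not_cause = 0.05

# total probability of the effect
p_effect = (p_effect_given_cause * p_cause
            + p_effect_given_not_cause * (1 - p_cause))

# Bayes' rule: P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)
p_cause_given_effect = p_effect_given_cause * p_cause / p_effect
print(round(p_cause_given_effect, 3))   # about 0.154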

COMP9444 20T3 Review 14
Bayesian Inference
H is a class of hypotheses
P(D | h) = probability of data D being generated under hypothesis h ∈ H.
P(h | D) = probability that h is correct, given that data D were observed.
Bayes’ Theorem:
P(h | D) P(D) = P(D | h) P(h)
P(h | D) = P(D | h) P(h) / P(D)
P(h) is called the prior.
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 15
Weight Decay (5.2.2)
Sometimes we add a penalty term to the loss function which encourages the neural network weights w_j to remain small:
E = (1/2) ∑_i (z_i − t_i)^2 + (λ/2) ∑_j w_j^2
This can prevent the weights from “saturating” to very high values.
It is sometimes referred to as “elastic weights” because the weights experience a force as if there were a spring pulling them back towards the origin according to Hooke’s Law.
The scaling factor λ needs to be determined empirically.
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 16
Momentum (8.3)
If landscape is shaped like a “rain gutter”, weights will tend to oscillate without much improvement.
Solution: add a momentum factor
δw ← α δw − η ∂E/∂w
w ← w + δw
Hopefully, this will dampen sideways oscillations but amplify downhill motion by a factor of 1/(1 − α).
COMP9444 ⃝c Alan Blair, 2017-20
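A minimal NumPy sketch of the momentum update. The "rain gutter" error surface (a quadratic that is steep in one direction and shallow in the other) and the hyperparameters are hypothetical.

import numpy as np

# hypothetical "rain gutter": steep in one direction, shallow in the other
def grad_E(w):
    return np.array([20.0 * w[0], 0.2 * w[1]])

w = np.array([1.0, 1.0])
dw = np.zeros_like(w)                   # running update (momentum term)
eta, alpha = 0.05, 0.9

for step in range(200):
    dw = alpha * dw - eta * grad_E(w)   # delta_w <- alpha * delta_w - eta * dE/dw
    w = w + dw                          # w <- w + delta_w

print(w)                                # both components should be close to 0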

COMP9444 20T3 Review 17
Dropout (7.12)
Nodes are randomly chosen to not be used, with some fixed probability (usually, one half).
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3
Review
18
Hinton Diagrams
􏰁 used to visualize higher dimensions
􏰁 white = positive, black = negative
[Figure: Hinton diagrams for a driving network with a 30×32 Sensor Input Retina, 4 Hidden Units and 30 Output Units (Sharp Left, Straight Ahead, Sharp Right)]
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 19
Limitations of Two-Layer Neural Networks
Some functions cannot be learned with a 2-layer sigmoidal network.
[Plot of the Twin Spirals dataset]
For example, this Twin Spirals problem cannot be learned with a 2-layer network, but it can be learned using a 3-layer network if we include shortcut connections between non-consecutive layers.
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 20
Vanishing / Exploding Gradients
Training by backpropagation in networks with many layers is difficult.
When the weights are small, the differentials become smaller and smaller as we backpropagate through the layers, and end up having no effect.
When the weights are large, the activations in the higher layers will saturate to extreme values. As a result, the gradients at those layers will become very small, and will not be propagated to the earlier layers.
When the weights have intermediate values, the differentials will sometimes get multiplied many times in places where the transfer function is steep, causing them to blow up to large values.
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 21
Activation Functions (6.3)
[Plots of four activation functions: Sigmoid, Rectified Linear Unit (ReLU), Hyperbolic Tangent, Scaled Exponential Linear Unit (SELU)]
COMP9444 ⃝c Alan Blair, 2017-20
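The four activation functions listed above, written as a NumPy sketch. The SELU constants used here are the commonly quoted published values; treat them as assumptions rather than as part of the slides.

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def tanh(s):
    return np.tanh(s)

def relu(s):
    return np.maximum(0.0, s)

def selu(s, lam=1.0507, alpha=1.6733):   # commonly quoted constants (assumed)
    return lam * np.where(s > 0, s, alpha * (np.exp(s) - 1.0))

s = np.linspace(-4, 4, 9)
print(sigmoid(s), tanh(s), relu(s), selu(s), sep="\n")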

COMP9444 20T3 Review 22
Convolutional Network Components
􏰁 convolution layers: extract shift-invariant features from the previous layer
􏰁 subsampling or pooling layers: combine the activations of multiple units from the previous layer into one unit
􏰁 fully connected layers: collect spatially diffuse information
􏰁 output layer: choose between classes
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 23
Softmax (6.2.2)
Consider a classification task with N classes, and assume zj is the output of the unit corresponding to class j.
We assume the network’s estimate of the probability of each class j is proportional to exp(z_j). Because the probabilities must add up to 1, we need to normalize by dividing by their sum:
Prob(i) = exp(z_i) / ∑_{j=1}^{N} exp(z_j)
log Prob(i) = z_i − log ∑_{j=1}^{N} exp(z_j)
If the correct class is i, we can treat −logProb(i) as our cost function. The first term pushes up the correct class i, while the second term mainly pushes down the incorrect class j with the highest activation (if j ̸= i).
COMP9444 ⃝c Alan Blair, 2017-20
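A numerically stable NumPy sketch of the softmax probabilities and the −log Prob(i) cost described above; the output values and the choice of correct class are hypothetical.

import numpy as np

def softmax(z):
    z = z - np.max(z)                 # subtract max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

z = np.array([2.0, -1.0, 0.5])        # hypothetical outputs for N = 3 classes
probs = softmax(z)

correct = 0                           # suppose class 0 is the correct class
loss = -np.log(probs[correct])        # -log Prob(i) = -z_i + log sum_j exp(z_j)

# gradient of the loss with respect to z: probs - one_hot(correct)
grad = probs.copy()
grad[correct] -= 1.0

print(probs, loss, grad)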

COMP9444 20T3 Review
24
Convolutional Neural Networks
Z^i_{j,k} = g( b_i + ∑_l ∑_{m=0}^{M−1} ∑_{n=0}^{N−1} K^i_{l,m,n} V^l_{j+m, k+n} )
The same weights are applied to the next M × N block of inputs, to compute the next hidden unit in the convolution layer (“weight sharing”).
COMP9444 ⃝c Alan Blair, 2017-20
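A direct (unoptimized) NumPy sketch of the convolution formula above, with hypothetical tensor shapes; real frameworks use much faster implementations.

import numpy as np

def conv_layer(V, K, b, g=np.tanh):
    """V: input of shape (L, J, W)  (channels, height, width)
       K: kernels of shape (I, L, M, N), b: biases of shape (I,)
       returns Z of shape (I, J-M+1, W-N+1): stride 1, no padding."""
    Lc, J, W = V.shape
    I, _, M, N = K.shape
    Z = np.zeros((I, J - M + 1, W - N + 1))
    for i in range(I):
        for j in range(Z.shape[1]):
            for k in range(Z.shape[2]):
                # Z[i,j,k] = g(b_i + sum_{l,m,n} K[i,l,m,n] * V[l,j+m,k+n])
                Z[i, j, k] = g(b[i] + np.sum(K[i] * V[:, j:j+M, k:k+N]))
    return Z

V = np.random.randn(3, 8, 8)          # hypothetical 3-channel 8x8 input
K = np.random.randn(4, 3, 3, 3)       # 4 filters of size 3x3
b = np.zeros(4)
print(conv_layer(V, K, b).shape)      # (4, 6, 6)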

COMP9444 20T3 Review 25
Stride with Zero Padding
When combined with zero padding of width P, j takes on the values 0,s,2s,…,(J+2P−M)
k takes on the values 0,s,2s,…,(K+2P−N)
The next layer is (1+(J+2P−M)/s) by (1+(K+2P−N)/s)
COMP9444 ⃝c Alan Blair, 2017-20
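A small helper (a sketch of the size formula above, with a hypothetical example) for computing the spatial size of the next layer:

def conv_output_size(J, K, M, N, stride=1, padding=0):
    # next layer is (1 + (J + 2P - M)/s) by (1 + (K + 2P - N)/s)
    out_h = 1 + (J + 2 * padding - M) // stride
    out_w = 1 + (K + 2 * padding - N) // stride
    return out_h, out_w

# e.g. a 32x32 input, 5x5 filters, stride 1, zero padding 2 keeps the size at 32x32
print(conv_output_size(32, 32, 5, 5, stride=1, padding=2))   # (32, 32)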

COMP9444 20T3 Review
26
Convolutional Filters
[Figure: visualizations of learned convolutional filters in the First Layer, Second Layer and Third Layer]
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 27
Weight Initialization
In order to have healthy forward and backward propagation, each term in the product must be approximately equal to 1. Any deviation from this could cause the activations to either vanish or saturate, and the differentials to either decay or explode exponentially.
Var[z] ≃ ( ∏_{i=1}^{D} G_0 n_i^{in} Var[w^{(i)}] ) Var[x]
Var[∂/∂x] ≃ ( ∏_{i=1}^{D} G_1 n_i^{out} Var[w^{(i)}] ) Var[∂/∂z]
We therefore choose the initial weights {w^{(i)}_{jk}} in each layer (i) such that
G_1 n_i^{out} Var[w^{(i)}] = 1
COMP9444 ⃝c Alan Blair, 2017-20
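A hedged NumPy sketch of this initialization rule: weights in a layer are drawn with variance 1/(G1 · n_out), which is exactly the condition G1 n_out Var[w] = 1. The default G1 = 1 and the layer sizes below are assumptions for illustration; the appropriate G1 depends on the transfer function.

import numpy as np

def init_weights(n_in, n_out, G1=1.0):
    # choose Var[w] = 1 / (G1 * n_out) so that G1 * n_out * Var[w] = 1
    std = np.sqrt(1.0 / (G1 * n_out))
    return np.random.randn(n_in, n_out) * std

W = init_weights(256, 128)
print(W.var())   # should be close to 1/128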

COMP9444 20T3 Review 28
Batch Normalization
We can normalize the activations x^{(i)}_k of node k in layer (i) relative to the mean and variance of those activations, calculated over a mini-batch of training items:
x̂^{(i)}_k = ( x^{(i)}_k − Mean[x^{(i)}_k] ) / √( Var[x^{(i)}_k] )
These activations can then be shifted and re-scaled to
y^{(i)}_k = β^{(i)}_k + γ^{(i)}_k x̂^{(i)}_k
β^{(i)}_k, γ^{(i)}_k are additional parameters, for each node, which are trained by backpropagation along with the other parameters (weights) in the network. After training is complete, Mean[x^{(i)}_k] and Var[x^{(i)}_k] are either pre-computed on the entire training set, or updated using running averages.
COMP9444 ⃝c Alan Blair, 2017-20
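A forward-pass sketch of batch normalization in NumPy, using training-time statistics over a mini-batch; the epsilon constant and the example shapes are assumptions added for the illustration.

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: mini-batch activations of shape (batch_size, num_nodes)."""
    mean = x.mean(axis=0)                 # per-node mean over the mini-batch
    var = x.var(axis=0)                   # per-node variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    y = beta + gamma * x_hat              # shift and re-scale with learned parameters
    return y

x = np.random.randn(16, 8) * 3.0 + 5.0    # hypothetical mini-batch
gamma = np.ones(8)
beta = np.zeros(8)
y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # roughly 0 and 1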

COMP9444 20T3 Review 29
Residual Networks
Idea: Take any two consecutive stacked layers in a deep network and add a “skip” connection which bypasses these layers and is added to their output.
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 30
Dense Networks
Recently, good results have been achieved using networks with densely connected blocks, within which each layer is connected by shortcut connections to all the preceding layers.
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 31
Neural Texture Synthesis
1. pretrain CNN on ImageNet (VGG-19)
2. pass input texture through CNN; compute feature map F^l_{ik} for the ith filter at spatial location k in layer (depth) l
3. compute the Gram matrix for each pair of features
G^l_{ij} = ∑_k F^l_{ik} F^l_{jk}
4. feed (initially random) image into CNN
5. compute L2 distance between Gram matrices of original and new image
6. backprop to get gradient on image pixels
7. update image and go to step 5.
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3
Review
32
Neural Style Transfer
[Figure: a content image and a style image combined to produce a new image]
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 33
Neural Style Transfer
For Neural Style Transfer, we minimize a cost function which is
E_total = α E_content + β E_style
= (α/2) ∑_{i,k} ( F^l_{ik}(x) − F^l_{ik}(x_c) )^2 + β ∑_{l=0}^{L} ( w_l / (4 N_l^2 M_l^2) ) ∑_{i,j} ( G^l_{ij} − A^l_{ij} )^2
where
x_c, x = content image, synthetic image
F^l_{ik} = ith filter at position k in layer l
N_l, M_l = number of filters, and size of feature maps, in layer l
w_l = weighting factor for layer l
G^l_{ij}, A^l_{ij} = Gram matrices for style image, and synthetic image
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3
Review
34
Recurrent Networks
􏰁 Processing Temporal Sequences
􏰁 Sliding Window
􏰁 Recurrent Network Architectures
􏰁 Hidden Unit Dynamics
􏰁 Long Short Term Memory
COMP9444
⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 35
Sliding Window
The simplest way to feed temporal input to a neural network is the “sliding window” approach, first used in the NetTalk system (Sejnowski & Rosenberg, 1987).
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 36
Simple Recurrent Network (Elman, 1990)
􏰁 at each time step, hidden layer activations are copied to “context” layer
􏰁 hidden layer receives connections from input and context layers
􏰁 the inputs are fed one at a time to the network; it uses the context layer to “remember” whatever information is required for it to produce the correct output
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 37
Back Propagation Through Time
􏰁 we can “unroll” a recurrent architecture into an equivalent feedforward architecture, with shared weights
􏰁 applying backpropagation to the unrolled architecture is referred to as “backpropagation through time”
􏰁 we can backpropagate just one timestep, or a fixed number of timesteps, or all the way back to the beginning of the sequence
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 38
Oscillating Solution for a^n b^n
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 39
Long Range Dependencies
􏰁 Simple Recurrent Networks (SRNs) can learn medium-range dependencies but have difficulty learning long range dependencies
􏰁 Long Short Term Memory (LSTM) and Gated Recurrent Units (GRU) can learn long range dependencies better than SRN
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 40
Long Short Term Memory
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 41
Gated Recurrent Unit
GRU is similar to LSTM but has only two gates instead of three.
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 42
Co-occurrence Matrix (2-word window)
[Table: co-occurrence counts within a 2-word window, over the vocabulary: a, all, and, bought, cat, caught, crooked, found, he, house, in, little, lived, man, mile, mouse, sixpence, stile, there, they, together, upon, walked, was, who]
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review
43
Word Embeddings
COMP9444
⃝c Alan Blair, 2017-20
[Figure: 2-dimensional word embeddings, with words such as “a”, “crooked”, “man”, “mouse”, “house”, “stile”, “together”, “who”, “they”, “there”, “was” positioned by their learned vectors]

COMP9444 20T3 Review 44
Singular Value Decomposition
Co-occurrence matrix X can be decomposed as X = U S V^T where U, V are unitary (all columns have unit length) and S is diagonal.
[Figure: the L×M matrix X factorized into U (columns u_1, u_2, …, u_k, …), the diagonal matrix S (singular values s_1, s_2, …, s_N) and V^T]
Columns 1 to n of row k of U then provide an n-dimensional vector representing the kth word in the vocabulary.
SVD is computationally expensive, proportional to L × M^2 if L ≥ M. Can we do something similar with less computation, and incrementally?
COMP9444 ⃝c Alan Blair, 2017-20
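A small NumPy sketch of extracting word vectors from a co-occurrence matrix via SVD. The random toy matrix, vocabulary size and embedding dimension are assumptions; real use would start from counts like those on the previous slide.

import numpy as np

L, M, n = 25, 25, 5                      # vocabulary size and embedding dimension (assumed)
X = np.random.poisson(0.3, size=(L, M))  # stand-in for a real co-occurrence matrix

# X = U S V^T ; the first n columns of row k of U give a vector for the kth word
U, S, Vt = np.linalg.svd(X, full_matrices=False)
word_vectors = U[:, :n]

print(word_vectors.shape)                # (25, 5): one 5-dimensional vector per word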

COMP9444 20T3 Review 45
Continuous Bag Of Words
􏰁 If several context words are each used independently to predict the center word, the hidden activation becomes a sum (or average) over all the context words
􏰁 Note the difference between this and NetTalk – in word2vec (CBOW) all context words share the same input-to-hidden weights
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 46
word2vec Skip-Gram Model
􏰁 try to predict the context words, given the center word
􏰁 this skip-gram model is similar to CBOW, except that in this case a single input word is used to predict multiple context words
􏰁 all context words share the same hidden-to-output weights
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review
47
Hierarchical Softmax
[n′ = child(n)] = +1 if n′ is the left child of node n, −1 otherwise.
σ(u) = 1/(1 + exp(−u))
prob(w = w_t) = ∏_{j=1}^{L(w)−1} σ( [n(w, j+1) = child(n(w, j))] v′_{n(w,j)}^T h )
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 48
Negative Sampling
The idea of negative sampling is that we train the network to increase its estimation of the target word j∗ and reduce its estimate not of all the words in the vocabulary but just a subset of them Wneg, drawn from an appropriate distribution.
E = −log σ( v′_{j*}^T h ) − ∑_{j∈W_neg} log σ( −v′_j^T h )
This is a simplified version of Noise Contrastive Estimation (NCE). It is not guaranteed to produce a well-defined probability distribution, but in practice it does produce high-quality word embeddings.
COMP9444 ⃝c Alan Blair, 2017-20
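A sketch of the negative-sampling loss and its gradients in NumPy; the vocabulary size, embedding dimension and the sampled word indices are all hypothetical.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

V, d = 1000, 50                          # hypothetical vocabulary size, embedding size
W_out = 0.1 * np.random.randn(V, d)      # output ("context") vectors v'_j
h = 0.1 * np.random.randn(d)             # hidden activation for the input word

target = 42                              # index j* of the observed context word
negatives = np.random.randint(0, V, 5)   # W_neg: a few sampled negative words

# E = -log sigma(v'_{j*} . h) - sum_{j in W_neg} log sigma(-v'_j . h)
E = -np.log(sigmoid(W_out[target] @ h))
E -= np.sum(np.log(sigmoid(-W_out[negatives] @ h)))

# gradients with respect to the target and negative output vectors
grad_target = (sigmoid(W_out[target] @ h) - 1.0) * h
grad_negs = sigmoid(W_out[negatives] @ h)[:, None] * h[None, :]

print(E, grad_target.shape, grad_negs.shape)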

COMP9444 20T3 Review 49
Word Vector Arithmetic
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 50
Bidirectional Recurrent Encoder
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 51
Attention Mechanism
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 52
Reinforcement Learning Framework
􏰁 An agent interacts with its environment.
􏰁 There is a set S of states and a set A of actions.
􏰁 At each timestep t, the agent is in some state s_t. It must choose an action a_t, whereupon it goes into state s_{t+1} = δ(s_t, a_t) and receives reward r_t = R(s_t, a_t)
􏰁 Agent has a policy π : S → A. We aim to find an optimal policy π* which maximizes the cumulative reward.
􏰁 In general, δ, R and π can be multi-valued, with a random element, in which case we write them as probability distributions
δ(s_{t+1} = s | s_t, a_t),  R(r_t = r | s_t, a_t),  π(a_t = a | s_t)
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3
Review 53
Models of optimality
Is a fast nickel worth a slow dime?
Finite horizon reward:  ∑_{i=0}^{h−1} r_{t+i}
Infinite discounted reward:  ∑_{i=0}^{∞} γ^i r_{t+i},  0 ≤ γ < 1
Average reward:  lim_{h→∞} (1/h) ∑_{i=0}^{h−1} r_{t+i}
􏰁 Finite horizon reward is simple computationally
􏰁 Infinite discounted reward is easier for proving theorems
􏰁 Average reward is hard to deal with, because we can’t sensibly choose between a small reward soon and a large reward very far in the future.
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 54
RL Approaches
􏰁 Value Function Learning
◮ TD-Learning
◮ Q-Learning
􏰁 Policy Learning
◮ Hill Climbing
◮ Policy Gradients
◮ Evolutionary Strategy
􏰁 Actor-Critic
◮ combination of Value and Policy learning
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 55
Exploration / Exploitation Tradeoff
Most of the time we should choose what we think is the best action. However, in order to ensure convergence to the optimal strategy, we must occasionally choose something different from our preferred action, e.g.
􏰁 choose a random action 5% of the time, or
􏰁 use Softmax (Boltzmann distribution) to choose the next action:
P(a) = e^{R(a)/T} / ∑_{b∈A} e^{R(b)/T}
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 56
Temporal Difference Learning
Let’s first assume that R and δ are deterministic. Then the (true) value V*(s) of the current state s should be equal to the immediate reward plus the discounted value of the next state
V*(s) = R(s, a) + γ V*(δ(s, a))
We can turn this into an update rule for the estimated value, i.e.
V(s_t) ← r_t + γ V(s_{t+1})
If R and δ are stochastic (multi-valued), it is not safe to simply replace V(s) with the expression on the right hand side. Instead, we move its value fractionally in this direction, proportional to a learning rate η
V(s_t) ← V(s_t) + η [ r_t + γ V(s_{t+1}) − V(s_t) ]
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 57
Q-Learning
For a deterministic environment, π*, Q* and V* are related by
π*(s) = argmax_a Q*(s, a)
Q*(s, a) = R(s, a) + γ V*(δ(s, a))
V*(s) = max_b Q*(s, b)
So
Q*(s, a) = R(s, a) + γ max_b Q*(δ(s, a), b)
This allows us to iteratively approximate Q by
Q(s_t, a_t) ← r_t + γ max_b Q(s_{t+1}, b)
If the environment is stochastic, we instead write
Q(s_t, a_t) ← Q(s_t, a_t) + η [ r_t + γ max_b Q(s_{t+1}, b) − Q(s_t, a_t) ]
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 58
Policy Gradients
If r_total = +1 for a win and −1 for a loss, we can simply multiply the log probability by r_total. Differentials can be calculated using the gradient
∇_θ [ r_total ∑_{t=1}^{m} log π_θ(a_t | s_t) ] = r_total ∑_{t=1}^{m} ∇_θ log π_θ(a_t | s_t)
The gradient of the log probability can be calculated nicely using Softmax.
If r_total takes some other range of values, we can replace it with (r_total − b) where b is a fixed value, called the baseline.
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 59
REINFORCE Algorithm
We then get the following REINFORCE algorithm:
for each trial
    run trial and collect states s_t, actions a_t, and reward r_total
    for t = 1 to length(trial)
        θ ← θ + η (r_total − b) ∇_θ log π_θ(a_t | s_t)
    end
end
This algorithm has successfully been applied, for example, to learn to play the game of Pong from raw image pixels.
COMP9444 ⃝c Alan Blair, 2017-20
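As a concrete illustration of the tabular Q-Learning update and ε-greedy exploration described above, here is a minimal sketch on a made-up deterministic chain environment; the environment, rewards and hyperparameters are all hypothetical.

import numpy as np

n_states, n_actions = 5, 2           # hypothetical chain: action 0 = left, 1 = right
gamma, eta, epsilon = 0.9, 0.5, 0.5  # discount, learning rate, exploration rate

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0   # reward for reaching the right end
    return s_next, r

Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

for episode in range(1000):
    s = int(rng.integers(n_states))              # start each episode in a random state
    for t in range(20):
        # epsilon-greedy: usually exploit, sometimes explore
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # Q(s,a) <- Q(s,a) + eta [ r + gamma max_b Q(s',b) - Q(s,a) ]
        Q[s, a] += eta * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))          # learned greedy policy: 1 ("right") in every state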
COMP9444 20T3 Review 60
Deep Q-Network
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 61
Deep Q-Learning with Experience Replay
􏰁 choose actions using current Q function (ε-greedy)
􏰁 build a database of experiences (s_t, a_t, r_t, s_{t+1})
􏰁 sample asynchronously from database and apply update, to minimize
[ r_t + γ max_b Q_w(s_{t+1}, b) − Q_w(s_t, a_t) ]^2
􏰁 removes temporal correlations by sampling from variety of game situations in random order
􏰁 makes it easier to parallelize the algorithm on multiple GPUs
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 62
Double Q-Learning
􏰁 if the same weights w are used to select actions and evaluate actions, this can lead to a kind of confirmation bias
􏰁 could maintain two sets of weights w and w̄, with one used for selection and the other for evaluation (then swap their roles)
􏰁 in the context of Deep Q-Learning, a simpler approach is to use the current “online” version of w for selection, and an older “target” version w̄ for evaluation; we therefore minimize
[ r_t + γ Q_w̄(s_{t+1}, argmax_b Q_w(s_{t+1}, b)) − Q_w(s_t, a_t) ]^2
􏰁 a new version of w̄ is periodically calculated from the distributed values of w, and this w̄ is broadcast to all processors.
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 63
Advantage Actor Critic
Recall that in the REINFORCE algorithm, a baseline b could be subtracted from r_total
θ ← θ + η (r_total − b) ∇_θ log π_θ(a_t | s_t)
In the actor-critic framework, r_total is replaced by Q(s_t, a_t)
θ ← θ + η_θ Q(s_t, a_t) ∇_θ log π_θ(a_t | s_t)
We can also subtract a baseline from Q(s_t, a_t). This baseline must be independent of the action a_t, but it could be dependent on the state s_t. A good choice of baseline is the value function V_u(s), in which case the Q function is replaced by the advantage function
A_w(s, a) = Q(s, a) − V_u(s)
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 64
Asynchronous Advantage Actor Critic
􏰁 use policy network to choose actions
􏰁 learn a parameterized Value function V_u(s) by TD-Learning
􏰁 estimate Q-value by n-step sample
Q(s_t, a_t) = r_{t+1} + γ r_{t+2} + … + γ^{n−1} r_{t+n} + γ^n V_u(s_{t+n})
􏰁 update policy by
θ ← θ + η_θ [ Q(s_t, a_t) − V_u(s_t) ] ∇_θ log π_θ(a_t | s_t)
􏰁 update Value function by minimizing
[ Q(s_t, a_t) − V_u(s_t) ]^2
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 65
Hopfield Network
E(x) = −( (1/2) ∑_{i,j} x_i w_{ij} x_j + ∑_i b_i x_i )
Start with an initial state x and then repeatedly try to “flip” neuron activations one at a time, in order to reach a lower-energy state. If we choose to modify neuron x_i, its new value should be
x_i ← +1 if ∑_j w_{ij} x_j + b_i > 0,
x_i ← x_i if ∑_j w_{ij} x_j + b_i = 0,
x_i ← −1 if ∑_j w_{ij} x_j + b_i < 0.
This ensures that the energy E(x) will never increase. It will eventually reach a local minimum.
COMP9444 ⃝c Alan Blair, 2017-20

COMP9444 20T3 Review 66
Boltzmann Machine (20.1)
The Boltzmann Machine uses exactly the same energy function as the Hopfield network:
E(x) = −( ∑_{i<j} x_i w_{ij} x_j + ∑_i b_i x_i )