COMP9444
Neural Networks and Deep Learning
2a. Probability and Backprop Variations

© Alan Blair, 2017-20

Textbook, Sections 3.1-3.5, 3.9-3.13, 5.2.2, 5.5, 8.3

Outline

Probability and Random Variables (3.1-3.2)
Probability for Continuous Variables (3.3)
Gaussian Distribution (3.9.3)
Conditional Probability (3.5)
Bayes' Rule (3.11)
Entropy and KL-Divergence (3.13)
Cross Entropy (3.13)
Maximum Likelihood (5.5)
Weight Decay (5.2.2)
Momentum (8.3)
Probability (3.1)

Begin with a set Ω – the sample space (e.g. 6 possible rolls of a die).
Each ω ∈ Ω is a sample point / possible world / atomic event.

A probability space or probability model is a sample space with an assignment P(ω) for every ω ∈ Ω such that

    0 ≤ P(ω) ≤ 1,    ∑_ω P(ω) = 1

e.g. P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6.

Random Events

A random event A is any subset of Ω:

    P(A) = ∑_{ω ∈ A} P(ω)

e.g. P(die roll < 5) = P(1) + P(2) + P(3) + P(4) = 1/6 + 1/6 + 1/6 + 1/6 = 2/3
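As a minimal sketch of these definitions (the dictionary P and the helper prob are our own illustration, not from the slides), the following Python snippet builds the fair-die probability space and sums P(ω) over the sample points in an event:

# probability space for a fair six-sided die: P(ω) = 1/6 for each outcome
P = {omega: 1/6 for omega in range(1, 7)}

def prob(event):
    """P(A) = sum of P(ω) over all ω in the event A (a subset of the sample space)."""
    return sum(P[omega] for omega in event)

print(prob(set(P)))          # total probability over the whole sample space = 1.0
print(prob({1, 2, 3, 4}))    # P(die roll < 5) = 2/3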
Random Variables (3.2)

A random variable is a function from sample points to some range (e.g. the Reals or Booleans).
For example, Odd(3) = true.

P induces a probability distribution for any random variable X:

    P(X = x_i) = ∑_{ω : X(ω) = x_i} P(ω)

e.g. P(Odd = true) = P(1) + P(3) + P(5) = 1/6 + 1/6 + 1/6 = 1/2

Probability for Continuous Variables (3.3)

e.g. P(X = x) = U[18, 26](x) = uniform density between 18 and 26

[figure: uniform density of height 0.125 on the interval from 18 to 26]

Here P is a density; it integrates to 1. P(X = 20.5) = 0.125 really means

    lim_{dx→0} P(20.5 ≤ X ≤ 20.5 + dx) / dx = 0.125
Gaussian Distribution (3.9.3)

    P_{μ,σ}(x) = (1 / (√(2π) σ)) e^{−(x−μ)² / 2σ²}

μ = mean, σ = standard deviation

[figure: bell-shaped Gaussian density centred at the mean μ]

Multivariate Gaussian:

    P_{μ,σ}(x) = ∏_i P_{μ_i,σ_i}(x_i)

Probability and Logic

Logically related events must have related probabilities.
For example, P(A∨B) = P(A) + P(B) − P(A∧B)

[figure: Venn diagram of overlapping events A and B, with intersection A∧B]
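A short sketch (our own, not from the slides) of the univariate density and of the diagonal multivariate Gaussian as a product of independent univariate factors:

import numpy as np

def gaussian(x, mu, sigma):
    """Univariate Gaussian density P_{mu,sigma}(x)."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def multivariate_gaussian(x, mu, sigma):
    """Diagonal multivariate Gaussian: product of independent univariate factors."""
    return np.prod([gaussian(xi, mi, si) for xi, mi, si in zip(x, mu, sigma)])

print(gaussian(0.0, 0.0, 1.0))                               # peak of the standard normal, about 0.399
print(multivariate_gaussian([0.0, 1.0], [0.0, 0.0], [1.0, 2.0]))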
Conditional Probability (3.5)

If P(B) ≠ 0, then the conditional probability of A given B is

    P(A|B) = P(A∧B) / P(B)

[figure: Venn diagram of events A and B, with intersection A∧B]

Bayes' Rule (3.11)

The formula for conditional probability can be manipulated to find a relationship when the two variables are swapped:

    P(A∧B) = P(A|B) P(B) = P(B|A) P(A)

    → Bayes' rule:  P(A|B) = P(B|A) P(A) / P(B)

This is often useful for assessing the probability of an underlying cause after an effect has been observed:

    P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect)
Example: Medical Diagnosis

Question: Suppose we have a test for a type of cancer which occurs in 1% of patients. The test has a sensitivity of 98% and a specificity of 97%. If a patient tests positive, what is the probability that they have the cancer?

Answer: There are two random variables: Cancer (true or false) and Test (positive or negative). The probability P(cancer) = 0.01 is called a prior, because it represents our estimate of the probability before we have done the test (or made some other observation). The sensitivity and specificity are interpreted as follows:

    P(positive|cancer) = 0.98,  and  P(negative|¬cancer) = 0.97

Bayes' Rule for Medical Diagnosis

[probability tree]
    Cancer = Yes (0.01):  Test = Pos (0.98):  P(Yes, Pos) = 0.01
                          Test = Neg (0.02):  P(Yes, Neg) = 0.00
    Cancer = No  (0.99):  Test = Pos (0.03):  P(No,  Pos) = 0.03
                          Test = Neg (0.97):  P(No,  Neg) = 0.96

    P(cancer|positive) = P(positive|cancer) P(cancer) / P(positive)
                       = 0.98 × 0.01 / (0.98 × 0.01 + 0.03 × 0.99)
                       ≈ 0.01 / (0.01 + 0.03) = 1/4
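The same calculation as a small Python sketch (the helper name bayes_posterior is ours, not from the slides):

def bayes_posterior(prior, likelihood, likelihood_given_not):
    """P(cause|effect) = P(effect|cause) P(cause) / P(effect), where
    P(effect) = P(effect|cause) P(cause) + P(effect|¬cause) P(¬cause)."""
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

# cancer example: prior 1%, sensitivity 98%, false positive rate = 1 - specificity = 3%
print(bayes_posterior(prior=0.01, likelihood=0.98, likelihood_given_not=0.03))  # ≈ 0.248, i.e. about 1/4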
Example: Light Bulb Defects

Question: You work for a lighting company which manufactures 60% of its light bulbs in Factory A and 40% in Factory B. One percent of the light bulbs from Factory A are defective, while two percent of those from Factory B are defective. If a random light bulb turns out to be defective, what is the probability that it was manufactured in Factory A?

Answer: There are two random variables: Factory (A or B) and Defect (Yes or No). In this case, the prior is:

    P(A) = 0.6,  P(B) = 0.4

The conditional probabilities are:

    P(defect|A) = 0.01,  and  P(defect|B) = 0.02

Bayes' Rule for Light Bulb Defects

[probability tree]
    Factory = A (0.6):  Defect = Yes (0.01):  P(A, Yes) = 0.006
                        Defect = No  (0.99):  P(A, No ) = 0.594
    Factory = B (0.4):  Defect = Yes (0.02):  P(B, Yes) = 0.008
                        Defect = No  (0.98):  P(B, No ) = 0.392

    P(A|defect) = P(defect|A) P(A) / P(defect)
                = 0.01 × 0.6 / (0.01 × 0.6 + 0.02 × 0.4)
                = 0.006 / (0.006 + 0.008) = 3/7
Entropy and KL-Divergence (3.13)

The entropy of a discrete probability distribution p = ⟨p_1, ..., p_n⟩ is

    H(p) = ∑_{i=1}^{n} p_i (−log₂ p_i)

Given two probability distributions p = ⟨p_1, ..., p_n⟩ and q = ⟨q_1, ..., q_n⟩ on the same set Ω, the Kullback-Leibler Divergence between p and q is

    D_KL(p||q) = ∑_{i=1}^{n} p_i (log₂ p_i − log₂ q_i)

KL-Divergence is like a "distance" from one probability distribution to another. But, it is not symmetric:

    D_KL(p||q) ≠ D_KL(q||p)

Entropy and Huffman Coding

Entropy is the number of bits per symbol achieved by a (block) Huffman Coding scheme.

Example 1: H(⟨0.5, 0.5⟩) = 1 bit.

Suppose we want to encode, in zeros and ones, a long message composed of the two letters A and B, which occur with equal frequency. This can be done efficiently by assigning A=0, B=1. In other words, one bit is needed to encode each letter.

[figure: Huffman tree with branches 0 → A and 1 → B]
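A quick numerical sketch of these two formulas (plain Python; the function names are ours):

import numpy as np

def entropy(p):
    """H(p) = sum_i p_i (-log2 p_i), in bits."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * -np.log2(p)))

def kl_divergence(p, q):
    """D_KL(p||q) = sum_i p_i (log2 p_i - log2 q_i), in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log2(p) - np.log2(q))))

print(entropy([0.5, 0.5]))                       # 1.0 bit
print(entropy([0.5, 0.25, 0.25]))                # 1.5 bits
p, q = [0.5, 0.25, 0.25], [1/3, 1/3, 1/3]
print(kl_divergence(p, q), kl_divergence(q, p))  # not equal: KL-Divergence is not symmetric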
Entropy and Huffman Coding

Example 2: H(⟨0.5, 0.25, 0.25⟩) = 1.5 bits.

Suppose we need to encode a message consisting of the letters A, B and C, and that B and C occur equally often but A occurs twice as often as the other two letters. In this case, the most efficient code would be A=0, B=10, C=11. The average number of bits needed to encode each letter is 1.5.

[figure: Huffman tree with branches 0 → A and 1 → (0 → B, 1 → C)]

Entropy and KL-Divergence

If the samples occur in some other proportion, we would need to "block" them together in order to encode them efficiently. But, the average number of bits required by the most efficient coding scheme is given by

    H(⟨p_1, ..., p_n⟩) = ∑_{i=1}^{n} p_i (−log₂ p_i)

D_KL(p||q) is the number of extra bits we need to transmit if we designed a code for q() but it turned out that the samples were drawn from p() instead:

    D_KL(p||q) = ∑_{i=1}^{n} p_i (log₂ p_i − log₂ q_i)
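To make the "extra bits" interpretation concrete, here is a small sketch (our own illustration) comparing the expected code length when the code is designed for q but the symbols are actually drawn from p:

import numpy as np

def avg_bits(true_p, code_q):
    """Expected code length, in bits, of an ideal code designed for code_q
    when symbols are actually drawn from true_p: -sum_i p_i log2 q_i."""
    true_p, code_q = np.asarray(true_p, float), np.asarray(code_q, float)
    return float(np.sum(true_p * -np.log2(code_q)))

p = [0.5, 0.25, 0.25]          # true symbol frequencies (A, B, C)
q = [1/3, 1/3, 1/3]            # frequencies the code was (wrongly) designed for

H_p = avg_bits(p, p)           # 1.5 bits: the best we can do, H(p)
cost = avg_bits(p, q)          # larger: H(p) + D_KL(p||q)
print(H_p, cost, cost - H_p)   # the difference is exactly D_KL(p||q), about 0.085 bits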
Continuous Entropy and KL-Divergence

The entropy of a continuous distribution p() is

    H(p) = ∫_θ p(θ)(−log p(θ)) dθ

For a multivariate Gaussian distribution, H(p) = ∑_i log σ_i (plus a constant).

KL-Divergence:

    D_KL(p||q) = ∫_θ p(θ)(log p(θ) − log q(θ)) dθ

Forward KL-Divergence

Given P, choose Gaussian Q to minimize D_KL(P||Q)

[figure: a broad Gaussian Q spread out to cover both modes of a bimodal P]

Q must not be small in any place where P is large.
Reverse KL-Divergence

Given P, choose Gaussian Q to minimize D_KL(Q||P)

[figure: a narrow Gaussian Q concentrated on a single mode of a bimodal P]

Q just needs to be concentrated in some place where P is large.

KL-Divergence in Deep Learning

KL-Divergence is used in a number of deep learning algorithms, including Variational Autoencoders (Week 10).

KL-Divergence is also used in some policy-based deep reinforcement learning algorithms such as Variational Inference or Trust Region Policy Optimization (TRPO) (but we will not cover these in detail).
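The mode-covering versus mode-seeking behaviour pictured above can be reproduced numerically. The sketch below (our own illustration; the grid, the bimodal target and the brute-force search are assumptions, not from the slides) fits a single Gaussian Q to a bimodal P by minimizing either the forward or the reverse KL-Divergence on a discretised grid:

import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

# target P: a mixture of two well-separated Gaussians (bimodal)
p = 0.5 * gauss(x, -4.0, 1.0) + 0.5 * gauss(x, 4.0, 1.0)

def kl(a, b):
    """Discretised D_KL(a||b) for densities a, b sampled on the grid x."""
    return float(np.sum(a * (np.log(a + 1e-12) - np.log(b + 1e-12))) * dx)

# brute-force search over the Gaussian's mean and standard deviation
best_fwd, best_rev = None, None
for mu in np.linspace(-6, 6, 121):
    for sigma in np.linspace(0.5, 6, 56):
        q = gauss(x, mu, sigma)
        fwd, rev = kl(p, q), kl(q, p)
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

print("forward KL D(P||Q):", best_fwd)   # broad Q near mu = 0, covering both modes
print("reverse KL D(Q||P):", best_rev)   # narrow Q locked onto one of the modes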
Variations on Backprop

Cross Entropy
◮ problem: least squares error function not suitable for classification, where target = 0 or 1
◮ mathematical theory: maximum likelihood
◮ solution: replace with cross entropy error function

Weight Decay
◮ problem: weights "blow up", and inhibit further learning
◮ mathematical theory: Bayes' rule
◮ solution: add weight decay term to error function

Momentum
◮ problem: weights oscillate in a "rain gutter"
◮ solution: weighted average of gradient over time

Cross Entropy

For classification tasks, target t is either 0 or 1, so better to use

    E = −t log(z) − (1−t) log(1−z)

This can be justified mathematically, and works well in practice – especially when negative examples vastly outweigh positive ones.

It also makes the backprop computations simpler:

    ∂E/∂z = (z − t) / (z(1 − z))

If z = 1 / (1 + e^{−s}), then

    ∂E/∂s = (∂E/∂z)(∂z/∂s) = z − t
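A small sketch (our own, not from the slides) checking that, for a sigmoid output, the gradient of the cross entropy error with respect to the pre-activation really is z − t:

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def cross_entropy(z, t):
    """E = -t log(z) - (1-t) log(1-z)"""
    return -t * np.log(z) - (1 - t) * np.log(1 - z)

s, t = 0.7, 1.0
z = sigmoid(s)

analytic = z - t                      # ∂E/∂s from the derivation above

eps = 1e-6                            # numerical gradient for comparison
numeric = (cross_entropy(sigmoid(s + eps), t) - cross_entropy(sigmoid(s - eps), t)) / (2 * eps)

print(analytic, numeric)              # the two values agree closely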
Maximum Likelihood (5.5)

H is a class of hypotheses.

P(D|h) = probability of data D being generated under hypothesis h ∈ H.
P(D|h) is called the likelihood; log P(D|h) is the log likelihood.

ML Principle: Choose h ∈ H which maximizes this likelihood, i.e. maximizes P(D|h) [or, equivalently, maximizes log P(D|h)].

In our case, each hypothesis h is a choice of parameters for a neural network or other function applied to input values x, while the observed data D consist of the target values t.

Least Squares Line Fitting

[figure: data points (x, t) with a fitted line f(x), plotted against x]
Derivation of Least Squares

Suppose the data are generated by a linear function f() plus Gaussian noise with mean zero and standard deviation σ. Then

    P(D|h) = P(t|f) = ∏_i (1 / (√(2π) σ)) e^{−(t_i − f(x_i))² / 2σ²}

    log P(t|f) = ∑_i [ −(1 / 2σ²)(t_i − f(x_i))² − log(σ) − (1/2) log(2π) ]

    f_ML = argmax_{f∈H} log P(t|f)
         = argmin_{f∈H} ∑_i (t_i − f(x_i))²

(Note: we do not need to know σ)

Derivation of Cross Entropy (3.9.1)

For classification tasks, t is either 0 or 1.
Assume t is generated by hypothesis f as follows:

    P(1 | f(x_i)) = f(x_i)
    P(0 | f(x_i)) = 1 − f(x_i)

i.e.

    P(t_i | f(x_i)) = f(x_i)^{t_i} (1 − f(x_i))^{1−t_i}

then

    log P(t|f) = ∑_i t_i log f(x_i) + (1 − t_i) log(1 − f(x_i))

    f_ML = argmax_{f∈H} ∑_i t_i log f(x_i) + (1 − t_i) log(1 − f(x_i))

(Can also be generalized to multiple classes.)
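The following sketch (ours; the synthetic data and candidate lines are assumptions for illustration) checks numerically that, under the Gaussian noise model above, log P(t|f) differs from the negative sum of squared errors only by a scale factor and a constant, so maximizing the likelihood ranks hypotheses exactly as minimizing squared error does:

import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
x = np.linspace(0, 1, 20)
t = 2.0 * x + 1.0 + sigma * rng.standard_normal(x.shape)     # targets: a line plus Gaussian noise

def log_likelihood(f_vals, t, sigma):
    """log P(t|f) = sum_i [ -(t_i - f(x_i))^2 / 2 sigma^2 - log(sigma) - 0.5 log(2 pi) ]"""
    return np.sum(-(t - f_vals)**2 / (2 * sigma**2) - np.log(sigma) - 0.5 * np.log(2 * np.pi))

def sum_squared_error(f_vals, t):
    return np.sum((t - f_vals)**2)

for (a, b) in [(2.0, 1.0), (1.5, 1.2), (2.5, 0.8)]:          # a few candidate lines f(x) = a x + b
    f_vals = a * x + b
    print(log_likelihood(f_vals, t, sigma), sum_squared_error(f_vals, t))
# the candidate with the highest log likelihood is exactly the one with the lowest squared error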
Weight Decay (5.2.2)

Sometimes we add a penalty term to the loss function which encourages the neural network weights w_j to remain small:

    E = (1/2) ∑_i (z_i − t_i)² + (λ/2) ∑_j w_j²

This can prevent the weights from "saturating" to very high values.

It is sometimes referred to as "elastic weights" because the weights experience a force as if there were a spring pulling them back towards the origin according to Hooke's Law.

The scaling factor λ needs to be determined from experience, or empirically.

Bayesian Inference

H is a class of hypotheses.

P(D|h) = probability of data D being generated under hypothesis h ∈ H.
P(h|D) = probability that h is correct, given that data D were observed.

Bayes' Theorem:

    P(h|D) P(D) = P(D|h) P(h)

    P(h|D) = P(D|h) P(h) / P(D)

P(h) is called the prior.
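Referring back to the weight decay penalty above, here is a minimal sketch (ours, not from the slides): the term (λ/2) ∑_j w_j² simply adds λ w_j to the gradient of each weight, which pulls the weights back towards the origin at every update.

import numpy as np

def weight_decay_step(w, grad_data, lam, eta):
    """One gradient descent step on E = E_data + (lam/2) * sum_j w_j^2;
    the penalty contributes lam * w to the gradient."""
    return w - eta * (grad_data + lam * w)

w = np.array([3.0, -2.0])
for _ in range(100):
    w = weight_decay_step(w, grad_data=np.zeros_like(w), lam=0.1, eta=0.1)
print(w)   # with the data gradient held at zero, the weights shrink geometrically towards zero

In frameworks such as PyTorch, the same penalty is typically applied through the optimizer's weight_decay argument rather than added to the loss by hand.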
Weight Decay as MAP Estimation (5.6.1)

We assume a Gaussian prior distribution for the weights, i.e.

    P(w) = ∏_j (1 / (√(2π) σ_0)) e^{−w_j² / 2σ_0²}

Then

    P(w|t) = P(t|w) P(w) / P(t)
           = (1 / P(t)) ∏_i (1 / (√(2π) σ)) e^{−(z_i − t_i)² / 2σ²} ∏_j (1 / (√(2π) σ_0)) e^{−w_j² / 2σ_0²}

    log P(w|t) = −(1 / 2σ²) ∑_i (z_i − t_i)² − (1 / 2σ_0²) ∑_j w_j² + constant

    w_MAP = argmax_{w∈H} log P(w|t)
          = argmin_{w∈H} (1/2) ∑_i (z_i − t_i)² + (λ/2) ∑_j w_j²

where λ = σ² / σ_0².

Momentum (8.3)

If landscape is shaped like a "rain gutter", weights will tend to oscillate without much improvement.

Solution: add a momentum factor

    δw ← α δw − η ∂E/∂w
    w  ← w + δw

Hopefully, this will dampen sideways oscillations but amplify downhill motion by 1/(1−α).
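A short sketch (ours; the quadratic "rain gutter" loss and the particular values of η and α are illustrative assumptions) of the momentum update on a loss that is steep in one direction and shallow in the other:

import numpy as np

def grad(w):
    """Gradient of E(w) = 0.5 * (100 * w[0]^2 + w[1]^2): steep along w[0], shallow along w[1]."""
    return np.array([100.0 * w[0], w[1]])

w = np.array([1.0, 1.0])
delta_w = np.zeros_like(w)
eta, alpha = 0.025, 0.9

for step in range(200):
    delta_w = alpha * delta_w - eta * grad(w)   # weighted average of gradients over time
    w = w + delta_w

print(w)   # both components approach zero; plain gradient descent with the same eta
           # would diverge along the steep w[0] direction, since eta * 100 > 2

This is essentially the update performed by SGD with momentum in libraries such as PyTorch, up to minor implementation details in how the momentum buffer is scaled.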