2a: Probability, Generalization and Overfitting
Week 2: Overview
In this module, we will briefly review certain topics from probability which are essential for deep
learning, and we will introduce the issue of generalization and overfitting in supervised learning.
We will then discuss cross entropy and softmax, which are used for classification tasks as alternatives
to the sum squared error loss function. Finally, we will describe weight decay, momentum, and
adaptive moment estimation (Adam). Along the way, we will need to introduce the mathematical
concepts of Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) Estimation.
Weekly learning outcomes
By the end of this module, you will be able to:
compute Entropy and KL-Divergence for discrete probability distributions
compute probabilities using Bayes’ Rule
describe the basics of supervised learning
explain how to avoid overfitting in neural networks
explain and compute cross entropy and softmax
explain weight decay and momentum
Probability
This section briefly summarizes a few topics from Probability which are fundamental to neural
networks and deep learning.
Gaussian Distributions
The Gaussian distribution with mean $\mu$ and standard deviation $\sigma$ is given by

$$P_{\mu,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / 2\sigma^2}$$
The $d$-dimensional multivariate Gaussian with mean $\mu$ and covariance $\Sigma$ is given by

$$P_{\mu,\Sigma}(x) = \frac{1}{\sqrt{(2\pi)^d\,|\Sigma|}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$$
If $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$ is diagonal, the multivariate Gaussian reduces to a product of 1-dimensional Gaussians:

$$P_{\mu,\Sigma}(x) = \prod_{i=1}^{d} P_{\mu_i,\sigma_i}(x_i)$$
The multivariate Gaussian with $\mu = 0$, $\Sigma = I$ is called the Standard Normal distribution.
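To make these formulas concrete, here is a minimal NumPy sketch (not part of the original notes; the function names are our own) which evaluates the 1-dimensional density and checks that a diagonal-covariance multivariate Gaussian factorizes into a product of 1-dimensional Gaussians:

```python
import numpy as np

def gaussian_1d(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def gaussian_diag(x, mu, sigmas):
    """Density of a multivariate Gaussian with diagonal covariance diag(sigmas**2)."""
    d = len(mu)
    cov_det = np.prod(sigmas ** 2)                 # determinant of a diagonal matrix
    quad = np.sum((x - mu) ** 2 / sigmas ** 2)     # (x-mu)^T Sigma^{-1} (x-mu)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * cov_det)

x = np.array([0.5, -1.0])
mu = np.array([0.0, 0.0])
sigmas = np.array([1.0, 2.0])

# The joint density equals the product of the per-dimension densities.
joint = gaussian_diag(x, mu, sigmas)
product = np.prod([gaussian_1d(x[i], mu[i], sigmas[i]) for i in range(2)])
print(joint, product)  # the two values agree
```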
Conditional Probability
If $P(B) \neq 0$, then the conditional probability of $A$ given $B$ is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
Bayes’ Rule
The formula for conditional probability can be manipulated to find a relationship when the two
variables are swapped:

$$P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$

$$\Rightarrow \text{Bayes' Rule:} \quad P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
This is often useful for assessing the probability of an underlying cause after an effect has been
observed:

$$P(\text{Cause} \mid \text{Effect}) = \frac{P(\text{Effect} \mid \text{Cause})\,P(\text{Cause})}{P(\text{Effect})}$$
Example: Light Bulb Defects
Question: You work for a lighting company which manufactures 60% of its light bulbs in Factory A
and 40% in Factory B. One percent of the light bulbs from Factory A are defective, while two percent
of those from Factory B are defective. If a random light bulb turns out to be defective, what is the
probability that it was manufactured in Factory A?
Answer: There are two random variables: Factory (A or B) and Defect (Yes or No). The prior
probabilities (before the bulb has been tested) are:
$$P(A) = 0.6, \quad P(B) = 0.4$$
The conditional probabilities are:
$$P(\text{Defect} \mid A) = 0.01, \quad P(\text{Defect} \mid B) = 0.02$$

$$P(A \mid \text{Defect}) = \frac{P(\text{Defect} \mid A)\,P(A)}{P(\text{Defect})} = \frac{0.01 \times 0.6}{0.01 \times 0.6 + 0.02 \times 0.4} = \frac{0.006}{0.006 + 0.008} = \frac{3}{7}$$
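As a quick sanity check, the light bulb calculation can be reproduced in a few lines of Python (variable names are illustrative):

```python
# A minimal check of the light-bulb calculation above.
p_A, p_B = 0.6, 0.4                       # priors: which factory made the bulb
p_def_A, p_def_B = 0.01, 0.02             # likelihoods: P(Defect | factory)

p_defect = p_def_A * p_A + p_def_B * p_B  # total probability of a defect
p_A_given_defect = p_def_A * p_A / p_defect

print(p_A_given_defect)  # 0.42857... = 3/7
```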
Entropy and Huffman Coding
The entropy of a discrete probability distribution $p = \langle p_1, \ldots, p_n \rangle$ is

$$H(p) = \sum_{i=1}^{n} p_i\,(-\log_2 p_i)$$
One way to think of entropy is as the number of bits per symbol achieved by a (block) Huffman Coding
scheme.
Example 1: H(⟨0.5, 0.5⟩) = 1 bit.
Suppose we want to encode, in zeros and ones, a long message composed of the two letters A and B,
which occur with equal frequency. This can be done efficiently by assigning $A = 0$, $B = 1$. In other
words, one bit is needed to encode each letter.
Example 2: H(⟨0.5, 0.25, 0.25⟩) = 1.5 bits.
Suppose we need to encode a message consisting of the letters A, B and C, and that B and C occur
equally often but A occurs twice as often as the other two letters. In this case, the most efficient code
would be $A = 0$, $B = 10$, $C = 11$. The average number of bits needed to encode each letter is 1.5.
If the symbols occur in some other proportion, we may need to "block" them together in order to
encode them efficiently. But, asymptotically, the average number of bits required by the most
efficient coding scheme is given by the entropy

$$H(\langle p_1, \ldots, p_n \rangle) = \sum_{i=1}^{n} p_i\,(-\log_2 p_i)$$
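A short sketch, assuming base-2 logarithms as in the formula above, which reproduces the entropies from Examples 1 and 2:

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution (zero-probability terms contribute 0)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(np.sum(-nz * np.log2(nz)))

print(entropy([0.5, 0.5]))         # 1.0 bit
print(entropy([0.5, 0.25, 0.25]))  # 1.5 bits
```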
KL-Divergence
Given two probability distributions $p = \langle p_1, \ldots, p_n \rangle$ and $q = \langle q_1, \ldots, q_n \rangle$ on the same set $\Omega$, the
Kullback-Leibler Divergence between $p$ and $q$ is

$$D_{KL}(p \parallel q) = \sum_{i=1}^{n} p_i\,(\log_2 p_i - \log_2 q_i)$$
In coding theory, $D_{KL}(p \parallel q)$ is the number of extra bits we need to transmit if we designed a code
for $q()$ but it turned out that the samples were drawn from $p()$ instead.
KL-Divergence is like a "distance" from one probability distribution to another. However, it is not
symmetric:

$$D_{KL}(p \parallel q) \neq D_{KL}(q \parallel p)$$
Generally speaking, $D_{KL}(p \parallel q)$ will be large if there exist some symbol(s) $i$ for which $q_i$ is very
small but $p_i$ is large. This is because the code for symbol $i$ will be long, but it will occur frequently.
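The asymmetry is easy to verify numerically. Below is a minimal sketch (our own helper function, not from the notes) which computes the divergence in both directions for a pair of example distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """KL-Divergence D_KL(p || q) in bits, assuming q_i > 0 wherever p_i > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * (np.log2(p[mask]) - np.log2(q[mask]))))

p = [0.8, 0.2]
q = [0.5, 0.5]
print(kl_divergence(p, q))  # ≈ 0.278 bits
print(kl_divergence(q, p))  # ≈ 0.322 bits — the two directions differ
```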
Entropy and KL-Divergence for Continuous Distributions
The entropy of a continuous distribution $p()$ is

$$H(p) = \int_{\theta} p(\theta)\,(-\log p(\theta))\,d\theta$$
The KL-Divergence between two continuous distributions $p()$ and $q()$ is

$$D_{KL}(p \parallel q) = \int_{\theta} p(\theta)\,(\log p(\theta) - \log q(\theta))\,d\theta$$
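For continuous distributions, the integral can be evaluated numerically. The sketch below (an illustration with made-up parameters, not from the notes) integrates the KL-Divergence between two 1-dimensional Gaussians using scipy.integrate.quad and compares the result against the standard closed-form expression for Gaussian-to-Gaussian KL:

```python
import numpy as np
from scipy.integrate import quad

def gauss(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

mu1, s1 = 0.0, 1.0   # parameters of p()
mu2, s2 = 1.0, 2.0   # parameters of q()

# Numerical D_KL(p || q) in nats: integrate p(x) (log p(x) - log q(x)).
integrand = lambda x: gauss(x, mu1, s1) * (np.log(gauss(x, mu1, s1))
                                           - np.log(gauss(x, mu2, s2)))
kl_numeric, _ = quad(integrand, -20, 20)

# Closed form for two 1-D Gaussians (a standard identity).
kl_closed = np.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5

print(kl_numeric, kl_closed)  # both ≈ 0.4431
```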
Forward KL-Divergence
Suppose we are under attack from a drone and we want to track its position. Let's assume the likely
location of the drone is given by a probability distribution $P()$ and we want to approximate $P()$ with
a Gaussian distribution $Q()$. In this case, we may wish to minimize the forward KL-Divergence
$D_{KL}(P \parallel Q)$. This ensures there will be no place where $Q$ is very small but $P$ is large, so we will not
move to that place and get hit by the drone.
[image source: https://blog.evjang.com/2016/08/variational-bayes.html]
Reverse KL-Divergence
Now consider a situation where we are learning to play a video game, and $P()$ is the distribution of
actions chosen by a human expert in a particular game state. In this case, we may prefer to minimize
the reverse KL-Divergence $D_{KL}(Q \parallel P)$. This ensures there will be no place where $P$ is very small
but $Q$ is large, so we will not choose an action which is very unlikely to ever be chosen by the human
player.
[image source: https://blog.evjang.com/2016/08/variational-bayes.html]
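The difference between the two directions can be demonstrated with a small numerical experiment (our own illustration, with made-up parameters): fit a single Gaussian $Q$ to a bimodal $P$ by grid search, once minimizing the forward KL and once minimizing the reverse KL. The forward fit spreads out to cover both modes, while the reverse fit locks onto one mode:

```python
import numpy as np

# Grid of points and a bimodal "true" distribution P (two well-separated modes).
xs = np.linspace(-10, 10, 2001)

def normal(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

p = 0.5 * normal(xs, -4.0, 1.0) + 0.5 * normal(xs, 4.0, 1.0)
p /= p.sum()  # normalize over the grid

def kl(a, b):
    """Discrete KL-Divergence in nats over the grid, guarding against underflow."""
    mask = a > 0
    return np.sum(a[mask] * np.log(a[mask] / np.maximum(b[mask], 1e-300)))

# Grid-search a single Gaussian Q over (mu, sigma) for each objective.
best_fwd = best_rev = None
for mu in np.linspace(-6, 6, 61):
    for sigma in np.linspace(0.5, 6.0, 56):
        q = normal(xs, mu, sigma)
        q /= q.sum()
        f, r = kl(p, q), kl(q, p)
        if best_fwd is None or f < best_fwd[0]:
            best_fwd = (f, mu, sigma)
        if best_rev is None or r < best_rev[0]:
            best_rev = (r, mu, sigma)

print("forward KL optimum (mu, sigma):", best_fwd[1:])  # wide Q covering both modes
print("reverse KL optimum (mu, sigma):", best_rev[1:])  # narrow Q on a single mode
```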
Further Reading
Gaussian function
Bayes’ Theorem
Kullback–Leibler divergence
Textbook Deep Learning (Goodfellow, Bengio, Courville, 2016):
Probability and Information Theory (Chapter 3)
Generalisation and Overfitting
Supervised, Reinforcement and Unsupervised Learning
Three types of learning will be discussed in this course: Supervised Learning (Weeks 1-5,7),
Reinforcement Learning (Week 8) and Unsupervised Learning (Week 9).
Supervised Learning: system is presented with examples of inputs and their target outputs,
Reinforcement Learning: system is not presented with target outputs, but is given a reward
signal, which it aims to maximize,
Unsupervised Learning: system is only presented with the inputs themselves, and aims to
find structure in these inputs.
Supervised Learning
For Supervised Learning, we have a training set and a test set, each consisting of a set of items.
Each item specifies a number of input attributes and a target value.
The system is presented with the input and target values for all items in the training set; it goes
through some kind of learning procedure, and must then predict the output for each item in the test
set.
Various learning paradigms are available for Supervised Learning (Decision Trees, K-Nearest-
Neighbors, Support Vector Machine, Random Forest, etc.). In this course, we will concentrate on
Neural Networks.
Curve Fitting Example
The aim of Supervised Learning is to accurately predict the target value for all items in the test set,
based on the input attributes.
One common mistake that we need to avoid is building a model which fits the training set very well,
but makes poor predictions on the test set. This is known as overfitting. In contrast, a system which
achieves good accuracy on both the training and test set is said to achieve good generalization.
Consider, for example, the task of fitting a curve to the following points:
Which curve do you think gives the "best fit" to these data?
Straight line?
Parabola?
Fourth order polynomial?
Something else?
Ockham’s Razor
Here is another example. The x’s and o’s represent training items for which the target value is 1 and
0, respectively. The aim is to draw a curve such that test set observations on one side of the curve are
predicted to be x and on the other side are predicted to be o.
[figure: three decision boundaries, labelled "inadequate", "good compromise" and "over-fitting"]
“The most likely hypothesis is the simplest one consistent with the data.”
This principle was popularized by William of Ockham in the fourteenth century and came to be
known as "Ockham's Razor".
Because there can be noise in the measurements, we now understand this principle as a tradeoff
between simplicity of the hypothesis and how well it fits the data. In the above example, the straight
line on the left side is considered inadequate to fit the data. The curve on the right hand side is seen
as being too complicated, thus overfitting to the training data. The curve in the middle is seen as
achieving a good compromise because it is relatively simple, but is still able to classify almost all the
training data correctly.
Noise in the training data
You will notice that if we choose the middle curve in the above example, there are two training items
near the boundary which get incorrectly classified. We consider this acceptable because there may be
noise in the training data. This noise often arises as an accumulation of small changes due to factors
which are not included in the model (roundoff errors, friction, air resistance, limitations in measuring
equipment, etc.).
Looking again at the curve fitting example, we see that the 5th data point is correctly predicted by the
blue curve but not by the green curve. In this case, the discrepancy is quite large and would more
likely be caused by a macroscopic factor such as breakdown of the measuring equipment or human
error in preparing the datasets. Our choice of model may therefore depend on how well we trust the
data. We would normally prefer the green curve on the assumption that the 5th data point is some
kind of anomaly. But, if we feel very confident that the inputs and targets in the training data are
precisely correct, we may instead prefer the blue curve, which is more complicated but classifies all
the training data correctly.
Outliers
The 5th data point in the curve fitting example is known as an outlier.
Another famous example of an outlier is from the results of the 2000 US Presidential Election, in
which George W. Bush defeated Al Gore in Florida by fewer than 1000 votes.
This diagram shows the number of votes received by a third party candidate called Pat Buchanan in
all Florida counties on the vertical axis, compared to the prediction from a regression model based
on total votes cast as well as the proportion of Democratic leaning voters on the horizontal axis. We
see that the actual vote in Palm Beach County exceeds the predicted value by 1500 votes (see
http://faculty.washington.edu/mtbrett/ for further details).
When we see such a big difference between an actual observation and what was predicted by the
model, we should normally look in more detail to see if we can find an alternate explanation. Indeed,
many scientific advances have been made in this way, to the point where some have suggested that
the most important phrase in scientific discovery is not "Eureka!" but rather "That's funny...".
In the case of the Palm Beach voting anomaly, the most likely explanation lies in the design of the
electoral ballot, which was known as a "butterfly ballot". It seems that many voters intended to vote
for Al Gore, but accidentally voted for Pat Buchanan instead.
Training, validation and test error
One way to avoid overfitting in neural networks is to limit the number of hidden nodes, connections
or independent parameters in the network. In order to determine the optimum values, we often
divide the data into Training, Validation and Test sets. Typically, as the number of hidden nodes is
increased, both the training and validation set error will decrease initially, but after a certain point the
training error will continue to decrease while the validation error will plateau or even slightly increase.
The (approximate) value for which the validation error is minimal is likely to achieve low test set error
as well.
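The effect is easy to see in a toy model-selection experiment (our own sketch, using polynomial degree as a stand-in for the number of free parameters): training error keeps falling as the model grows, while validation error reaches a minimum and then rises:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a noisy quadratic, split into training / validation / test sets.
x = rng.uniform(-1, 1, 90)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(0, 0.1, 90)
x_tr, x_va, x_te = x[:30], x[30:60], x[60:]
y_tr, y_va, y_te = y[:30], y[30:60], y[60:]

def mse(deg, x_eval, y_eval):
    """Fit a degree-`deg` polynomial on the training set, evaluate MSE elsewhere."""
    coeffs = np.polyfit(x_tr, y_tr, deg)
    return np.mean((np.polyval(coeffs, x_eval) - y_eval) ** 2)

# Training error decreases monotonically; validation error turns back up.
for deg in range(1, 10):
    print(deg, round(mse(deg, x_tr, y_tr), 5), round(mse(deg, x_va, y_va), 5))
```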
Limiting the number of training epochs has also been tried – although, there is some evidence that
this is not necessary when using weight decay (discussed later in this module). Also, it may happen that
the validation error increases temporarily before decreasing further, so you have to train for a
sufficient number of epochs to understand what is really happening.
In this example, the validation set error is much larger than the training set error; but, it is still
decreasing at epoch 6000.
Dropout
Another way to avoid overfitting in neural networks is Dropout.
[image source: Srivastava, 2014]
For each minibatch, we randomly choose a subset of nodes which will not be used in the training of
that minibatch. Each node is chosen with some fixed probability (usually, one half). When training is
finished and the network is deployed, all nodes are used, but their activations are multiplied by the
probability that they were retained during training. Thus, the activation received by
each unit during testing is the average of what it would have received during training.
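A minimal NumPy sketch of this scheme (illustrative only; dropout is applied here to a vector of activations, and p_keep plays the role of the fixed probability mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.5  # each node is kept with this probability during training

def dropout_train(activations):
    """Training time: zero each activation independently with probability 1 - p_keep."""
    mask = rng.random(activations.shape) < p_keep
    return activations * mask

def dropout_test(activations):
    """Test time: use all nodes, scaled by p_keep, matching the training-time average."""
    return activations * p_keep

a = np.ones(8)
print(dropout_train(a))  # roughly half the activations are zeroed
print(dropout_test(a))   # every activation scaled by 0.5
```

Note that many modern implementations instead use "inverted dropout", which divides by p_keep at training time so that no scaling is needed at test time; the two schemes are equivalent in expectation.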
Dropout forces the network to achieve redundancy because it must deal with situations where some
features are missing. Another way to view dropout is that it implicitly (and efficiently) simulates an
ensemble of different architectures.
Ensembling
Ensembling is a method where a number of different classifiers are trained on the same task, and the
final class is decided by "voting" among them. In order to benefit from ensembling, we need to have
diversity in the different classifiers. For example, we could train three neural networks with different
architectures, three Support Vector Machines with different dimensions and kernels, as well as two
other classifiers, and ensemble all of them to produce a final result. (Kaggle Competition entries are
often constructed in this way.)
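A minimal sketch of the voting step, with made-up predictions from three hypothetical classifiers:

```python
import numpy as np

# Hypothetical predicted class labels from three different classifiers
# on the same five test items.
preds = np.array([
    [0, 1, 1, 2, 0],   # classifier 1
    [0, 1, 2, 2, 0],   # classifier 2
    [1, 1, 1, 2, 2],   # classifier 3
])

# Majority vote per item: the most common label across classifiers.
ensemble = [np.bincount(preds[:, i]).argmax() for i in range(preds.shape[1])]
print(ensemble)  # [0, 1, 1, 2, 0]
```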
Dropout as an Implicit Ensemble
In the case of dropout, a different architecture is created for each minibatch by removing the nodes
that are dropped out. The trick of multiplying the output of each node by its retention probability
implicitly averages the output over all of these different models. Thus, we get the benefit of
ensembling but in a very efficient way.
References
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R., 2014. Dropout: A simple
way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-
1958.
Further Reading
Textbook Deep Learning (Goodfellow, Bengio, Courville, 2016):
Overfitting in Neural Networks (5.2)
Ensembling (7.11) and Dropout (7.12)
Exercise: Probability
Consider these two probability distributions on the same space $\Omega = \{A, B, C, D\}$:

$$p = \langle \tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{8} \rangle, \qquad q = \langle \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{8}, \tfrac{1}{2} \rangle$$

Question 1: Construct a Huffman tree for each distribution $p$ and $q$.
Question 2: Compute the Entropy $H(p)$.
Question 3: Compute the KL-Divergence in each direction, $D_{KL}(p \parallel q)$ and $D_{KL}(q \parallel p)$. Which one is larger? Why?
Question 4: One bag contains 2 red balls and 3 white balls. Another bag contains 3 red balls and 2 green balls. One of these bags is chosen at random, and two balls are drawn randomly from that bag, without replacement. Both of the balls turn out to be red. What is the probability that the first bag is the one that was chosen?
Quiz 2: Probability
This is a Quiz to test your understanding of the material from Week 2 on Probability.
You must attempt to answer each question yourself, before looking at the sample answer.

Question 1: Write the formula for a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$.
Question 2: Write the formula for Bayes' Rule, in terms of a cause $A$ and an effect $B$.
Question 3: Write the formula for the Entropy $H(p)$ of a continuous probability distribution $p()$.
Question 4: Write the formula for the Kullback-Leibler Divergence $D_{KL}(p \parallel q)$ between two continuous probability distributions $p()$ and $q()$.
Week 2 Wednesday video