When Models Meet Data 2
Liang Zheng
Australian National University
liang. .au
Overfitting
• The aim of a machine learning predictor is to perform well on
unseen data.
• We simulate the unseen data by holding out a proportion of the
whole dataset.
• This held-out set is called the test set.
• In practice, we split data into a training set and a test set.
• Training set: fit the model
• Test set: not seen during training, used to evaluate generalization
performance
• It is important for the user to not cycle back to a new round of
training after having observed the test set.
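A minimal sketch of such a hold-out split in Python (assuming the data are a NumPy feature matrix X and label vector y; the function name and the 80/20 ratio are illustrative, not from the slides):

```python
import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=0):
    """Randomly hold out a proportion of the data as a test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))        # shuffle example indices
    n_test = int(len(X) * test_ratio)    # size of the held-out test set
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Illustrative data: 100 examples with 3 features each
X = np.random.randn(100, 3)
y = np.random.randn(100)
X_train, y_train, X_test, y_test = train_test_split(X, y)
```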
• Empirical risk minimization can lead to overfitting.
• the predictor fits too closely to the training data and does not
generalize well to new data
[Figure: two regression fits of the same training data, with axes 𝑥 and 𝑦. The simple model fits the training data less well (a larger empirical risk) but is a good machine learning model; the complex model fits the training data very well (a very small empirical risk) but is a poor machine learning model due to overfitting.]
[Figure: regression and classification examples of a good model versus a poor model.]
8.2.3 Regularization to Reduce Overfitting
• When overfitting happens, we have
• very small average loss on the training set but large average loss on the
test set
• Given a predictor 𝑓, overfitting occurs when
• the risk estimate from the training data, $\mathbf{R}_{\text{emp}}(f, \boldsymbol{X}_{\text{train}}, \boldsymbol{y}_{\text{train}})$, underestimates the expected risk $\mathbf{R}_{\text{true}}(f)$. In other words,
• $\mathbf{R}_{\text{emp}}(f, \boldsymbol{X}_{\text{train}}, \boldsymbol{y}_{\text{train}})$ is much smaller than $\mathbf{R}_{\text{true}}(f)$, which is estimated using $\mathbf{R}_{\text{emp}}(f, \boldsymbol{X}_{\text{test}}, \boldsymbol{y}_{\text{test}})$
• Overfitting usually occurs when
• we have little data and a complex hypothesis class
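The following illustrative sketch (not from the slides) shows both symptoms at once: with little data and a complex hypothesis class (a degree-9 polynomial), the empirical risk on the training data is tiny while the risk on held-out test data is much larger. The data and model choice are assumptions made only for this example.

```python
import numpy as np

def empirical_risk(f, x, y):
    """Average squared-error loss of predictor f on the data (x, y)."""
    return np.mean((y - f(x)) ** 2)

rng = np.random.default_rng(0)

# Little data plus a complex hypothesis class (degree-9 polynomial)
x_train = rng.uniform(-1, 1, size=10)
y_train = np.sin(np.pi * x_train) + 0.3 * rng.standard_normal(10)
x_test = rng.uniform(-1, 1, size=200)
y_test = np.sin(np.pi * x_test) + 0.3 * rng.standard_normal(200)

coeffs = np.polyfit(x_train, y_train, deg=9)       # least-squares polynomial fit
f = lambda x: np.polyval(coeffs, x)

print("R_emp(train):", empirical_risk(f, x_train, y_train))  # very small
print("R_emp(test): ", empirical_risk(f, x_test, y_test))    # much larger: overfitting
```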
• How to prevent overfitting?
• We can bias the search for the minimizer of empirical
risk by introducing a penalty term
• The penalty term makes it harder for the optimizer to
return an overly flexible predictor
• Adding such a penalty term is called regularization.
• Regularization is an approach that discourages
complex or extreme solutions to an optimization
problem.
• Example
• Least-squares problem
• To regularize this formulation, we add a penalty term
• The additional term ‖𝜽‖² is called the regularizer or penalty term, and the parameter 𝜆 is the regularization parameter.
• 𝜆 enables a trade-off between minimizing the loss on the training set and the amplitude of the parameters 𝜽
• The amplitude of the parameters in 𝜽 often becomes relatively large when we run into overfitting
• 𝜆 is a hyperparameter
Least-squares problem:
$$\min_{\boldsymbol{\theta}} \; \frac{1}{N} \lVert \boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta} \rVert^{2}$$
Regularized least-squares problem:
$$\min_{\boldsymbol{\theta}} \; \frac{1}{N} \lVert \boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta} \rVert^{2} + \lambda \lVert \boldsymbol{\theta} \rVert^{2}$$
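A minimal NumPy sketch of the regularized least-squares problem above, solved in closed form. The helper name ridge_fit, the synthetic data, and the choice 𝜆 = 0.1 are illustrative assumptions, not from the slides.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/N)||y - X theta||^2 + lam * ||theta||^2 in closed form."""
    N, D = X.shape
    # Setting the gradient to zero gives (X^T X / N + lam * I) theta = X^T y / N
    return np.linalg.solve(X.T @ X / N + lam * np.eye(D), X.T @ y / N)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
theta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ theta_true + 0.1 * rng.standard_normal(20)

theta_unreg = ridge_fit(X, y, lam=0.0)   # plain least squares
theta_reg = ridge_fit(X, y, lam=0.1)     # regularized solution
print(np.linalg.norm(theta_unreg), np.linalg.norm(theta_reg))
```

With 𝜆 = 0 this reduces to ordinary least squares; increasing 𝜆 shrinks the amplitude ‖𝜽‖ of the fitted parameters.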
8.2.4 Cross-Validation to Assess the Generalization
Performance
• We mentioned that we split a dataset into a training set and a
test set
• we measure generalization error by applying the predictor on
test data.
• This data is also sometimes referred to as the validation set.
• The validation set is drawn from the same overall dataset and has no overlap with the training data.
• We want the training set to be large
• That leaves the validation set small
• A small validation set makes the performance estimate less stable (large variance)
• So we want the training set to be large, but we also want the validation set to be large
• How can we satisfy these contradictory objectives?
• Cross-validation: 𝐾-fold cross-validation
Example: 𝐾 = 5
Cross-validation
• 𝐾-fold cross-validation partitions the data into 𝐾 chunks
• 𝐾 − 1 chunks form the training set ℛ
• The remaining chunk is the validation set 𝒱
• This procedure is repeated for all 𝐾 choices for the validation set,
and the performance of the model from the 𝐾 runs is averaged
Example: 𝐾 = 5
Cross-validation
• Formally, we partition our data into two sets 𝒟 = ℛ ∪ 𝒱 such that they do not overlap, i.e., ℛ ∩ 𝒱 = ∅
• We train our model on ℛ (the training set)
• We evaluate our model on 𝒱 (the validation set)
• We have 𝐾 partitions. In each partition 𝑘:
• The training set $\mathcal{R}^{(k)}$ produces a predictor $f^{(k)}$
• $f^{(k)}$ is applied to the validation set $\mathcal{V}^{(k)}$ to compute the empirical risk $R(f^{(k)}, \mathcal{V}^{(k)})$
• All the empirical risks are averaged to approximate the expected
generalization error
$$\mathbb{E}_{\mathcal{V}}\big[R(f, \mathcal{V})\big] \approx \frac{1}{K} \sum_{k=1}^{K} R\big(f^{(k)}, \mathcal{V}^{(k)}\big)$$
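A sketch of 𝐾-fold cross-validation under these definitions; the function names k_fold_cv and fit_least_squares, and the synthetic data, are illustrative assumptions rather than anything defined in the slides.

```python
import numpy as np

def fit_least_squares(X, y):
    """Fit a linear predictor by ordinary least squares."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda X_new: X_new @ theta

def squared_error_risk(f, X, y):
    """Empirical risk R(f, V): average squared error on the set (X, y)."""
    return np.mean((y - f(X)) ** 2)

def k_fold_cv(X, y, fit, risk, K=5, seed=0):
    """Average the empirical risks over K train/validation partitions."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)   # K non-overlapping chunks
    risks = []
    for k in range(K):
        val_idx = folds[k]                                              # validation set V^(k)
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])  # training set R^(k)
        f_k = fit(X[train_idx], y[train_idx])                           # predictor f^(k)
        risks.append(risk(f_k, X[val_idx], y[val_idx]))                 # R(f^(k), V^(k))
    return float(np.mean(risks))                                        # approximates E[R(f, V)]

# Illustrative synthetic data
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + 0.1 * rng.standard_normal(50)
print(k_fold_cv(X, y, fit_least_squares, squared_error_risk, K=5))
```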
Cross-validation: some observations
• Each training set $\mathcal{R}^{(k)}$ is limited, so it may not produce the best predictor $f^{(k)}$
• Each validation set $\mathcal{V}^{(k)}$ is limited, so each individual estimate $R(f^{(k)}, \mathcal{V}^{(k)})$ is inaccurate
• After averaging over the 𝐾 runs, the result is stable and indicative
• An extreme: leave-one-out cross-validation, where the validation
set only contains one example.
• A potential drawback: computational cost
• Training 𝐾 models can be time-consuming
• If the model has several hyperparameters to tune, running cross-validation for every candidate setting becomes expensive
• This problem can be solved by parallel computing, given enough
computational resources
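Because the 𝐾 runs are independent of each other, they can be distributed across workers. A toy sketch using Python's standard-library ProcessPoolExecutor; the fold-evaluation function and synthetic data are illustrative assumptions only.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def evaluate_fold(k, K=5, N=50, seed=0):
    """Train and evaluate one fold; folds are independent, so they can run in parallel."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((N, 3))
    y = X @ np.array([1.0, -1.0, 2.0]) + 0.1 * rng.standard_normal(N)
    folds = np.array_split(rng.permutation(N), K)
    val = folds[k]
    train = np.concatenate([folds[j] for j in range(K) if j != k])
    theta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    return np.mean((y[val] - X[val] @ theta) ** 2)   # risk on fold k

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        risks = list(pool.map(evaluate_fold, range(5)))  # all folds in parallel
    print(np.mean(risks))
```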
Check your understanding
• When your model works poorly on the training set, your model
will also work poorly on the test set.
• When your model works poorly on the training set, your model
may also have overfitting.
• Overfitting happens when your model is too complex given your
training data.
• Regularization alleviates overfitting by improving the complexity
of your training data.
• In 𝐾-fold cross-validation, we will get more stable test accuracy if
𝐾 increases.
• In 2-fold cross-validation, you can obtain 2 results from the 2 test sets, and they may differ a lot from each other.