
When Models Meet Data 2

Liang Zheng
Australian National University
liang. .au

Overfitting

• The aim of a machine learning predictor is to perform well on
unseen data.

• We simulate the unseen data by holding out a proportion of the
whole dataset.

• This held-out set is called the test set.

• In practice, we split the data into a training set and a test set (a minimal splitting sketch follows this list).

• Training set: fit the model

• Test set: not seen during training, used to evaluate generalization
performance

• It is important for the user not to cycle back to a new round of
training after having observed the test set.

• Empirical risk minimization can lead to overfitting.

• the predictor fits too closely to the training data and does not
generalize well to new data
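Below is a minimal sketch of the hold-out split described above. It uses scikit-learn's train_test_split; the synthetic data, the 80/20 split ratio, and the linear model are illustrative assumptions, not something prescribed in the slides.

```python
# Minimal hold-out split sketch: fit on the training set only,
# evaluate generalization on the held-out test set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))                      # synthetic inputs (illustrative)
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)   # synthetic targets (illustrative)

# Hold out 20% of the data as the test set (a common choice, assumed here).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)            # fit on the training set only
print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test  MSE:", mean_squared_error(y_test, model.predict(X_test)))  # generalization estimate
```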

[Figures: regression and classification examples of a good model versus a poor model, plotted with axes 𝑥 and 𝑦. The simple model fits the training data less well (a larger empirical risk) but is a good machine learning model; the complex model fits the training data very well (a very small empirical risk) but is a poor machine learning model due to overfitting.]


8.2.3 Regularization to Reduce Overfitting

• When overfitting happens, we have
• very small average loss on the training set but large average loss on the test set

• Given a predictor 𝑓, overfitting occurs when
• the risk estimate from the training data, 𝐑_emp(𝑓, 𝑿_train, 𝒚_train),

underestimates the expected risk 𝐑_true(𝑓). In other words,

• 𝐑_emp(𝑓, 𝑿_train, 𝒚_train) is much smaller than 𝐑_true(𝑓), which is estimated
using 𝐑_emp(𝑓, 𝑿_test, 𝒚_test)

• Overfitting usually occurs when
• we have little data and a complex hypothesis class
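To make this concrete, here is a small illustrative sketch (not from the slides): with little data and a complex hypothesis class, the empirical risk on the training data is tiny while the risk estimated on held-out test data is much larger. The degree-9 polynomial, the noise level, and the sample sizes are all assumptions chosen for illustration.

```python
# Illustrative sketch: a complex hypothesis class + little data => overfitting.
# R_emp on the training set is tiny, while the test-based estimate of R_true is large.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X_train = rng.uniform(-1, 1, size=(10, 1))                  # very little training data
y_train = np.sin(3 * X_train[:, 0]) + 0.2 * rng.standard_normal(10)
X_test = rng.uniform(-1, 1, size=(500, 1))                  # stands in for unseen data
y_test = np.sin(3 * X_test[:, 0]) + 0.2 * rng.standard_normal(500)

# Degree-9 polynomial: a complex hypothesis class for only 10 points.
complex_model = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())
complex_model.fit(X_train, y_train)

r_emp_train = mean_squared_error(y_train, complex_model.predict(X_train))
r_emp_test = mean_squared_error(y_test, complex_model.predict(X_test))
print(f"empirical risk on training data: {r_emp_train:.4f}")   # typically near zero
print(f"empirical risk on test data:     {r_emp_test:.4f}")    # typically much larger
```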

• How to prevent overfitting?

• We can bias the search for the minimizer of empirical
risk by introducing a penalty term

• The penalty term makes it harder for the optimizer to
return an overly flexible predictor

• The penalty term is called regularization.

• Regularization is an approach that discourages
complex or extreme solutions to an optimization
problem.

• Example: the least-squares problem

$$\min_{\boldsymbol{\theta}} \; \frac{1}{N}\,\lVert \boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta} \rVert^{2}$$

• To regularize this formulation, we add a penalty term:

$$\min_{\boldsymbol{\theta}} \; \frac{1}{N}\,\lVert \boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta} \rVert^{2} + \lambda\,\lVert \boldsymbol{\theta} \rVert^{2}$$

• The additional term ‖𝜽‖² is called the regularizer or penalty term,
and the parameter 𝜆 is the regularization parameter.

• 𝜆 enables a trade-off between minimizing the loss on the training
set and the amplitude of the parameters 𝜽

• It often happens that the amplitude of the parameters in 𝜽
becomes relatively large if we run into overfitting

• 𝜆 is a hyperparameter
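A minimal sketch of the regularized least-squares objective above, using the closed-form solution of the penalized problem; the synthetic data, the dimensions, and the value 𝜆 = 0.1 are illustrative assumptions.

```python
# Sketch of regularized least squares via the closed-form solution:
#   theta = (X^T X + N * lam * I)^{-1} X^T y
# minimizes (1/N)||y - X theta||^2 + lam * ||theta||^2.
import numpy as np

rng = np.random.default_rng(2)
N, D = 20, 15                                     # few examples, many features (illustrative)
X = rng.standard_normal((N, D))
theta_true = rng.standard_normal(D)
y = X @ theta_true + 0.1 * rng.standard_normal(N)

def fit_least_squares(X, y, lam):
    """Minimize (1/N)||y - X theta||^2 + lam * ||theta||^2 in closed form."""
    N, D = X.shape
    return np.linalg.solve(X.T @ X + N * lam * np.eye(D), X.T @ y)

theta_unreg = fit_least_squares(X, y, lam=0.0)    # plain least squares
theta_reg = fit_least_squares(X, y, lam=0.1)      # regularized least squares

# The penalty shrinks the amplitude of the parameters, as the slide notes.
print("||theta|| without regularization:", np.linalg.norm(theta_unreg))
print("||theta|| with    regularization:", np.linalg.norm(theta_reg))
```

Larger values of 𝜆 shrink the parameters more aggressively; in practice 𝜆 is a hyperparameter chosen, for example, by cross-validation.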

8.2.4 Cross-Validation to Assess the Generalization
Performance

• We mentioned that we split a dataset into a training set and a
test set

• we measure generalization error by applying the predictor on
test data.

• This data is also sometimes referred to as the validation set.

• The validation set is drawn from the same overall dataset but has no overlap with
the training data.

• We want the training set to be large

• That leaves the validation set small

• A small validation set makes the performance estimate less stable (high variance)

• Basically, we want the training set to be large

• But we also want the validation set to be large

• How can we satisfy these contradictory objectives?

• Cross-validation: 𝐾-fold cross-validation

Example: 𝐾 = 5

Cross-validation

• 𝐾-fold cross-validation partitions the data into 𝐾 chunks

• 𝐾 − 1 chunks form the training set ℛ

• The remaining chunk is the validation set 𝒱

• This procedure is repeated for all 𝐾 choices for the validation set,
and the performance of the model from the 𝐾 runs is averaged

Example: 𝐾 = 5

Cross-validation

• Formally, we partition our training set into two sets 𝒟 = ℛ ∪ 𝒱,
such that they do not overlap, i.e., ℛ ∩ 𝒱 = ∅

• We train our model on ℛ (training set)

• We evaluate our model on 𝒱 (validation set)

• We have 𝐾 partitions. In each partition 𝑘:
• the training set ℛ^(𝑘) produces a predictor 𝑓^(𝑘)

• 𝑓^(𝑘) is applied to the validation set 𝒱^(𝑘) to compute the empirical risk
𝑅(𝑓^(𝑘), 𝒱^(𝑘))

• All the empirical risks are averaged to approximate the expected
generalization error

$$\mathbb{E}_{\mathcal{V}}\big[R(f,\mathcal{V})\big] \;\approx\; \frac{1}{K}\sum_{k=1}^{K} R\big(f^{(k)},\mathcal{V}^{(k)}\big)$$
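The following is a from-scratch sketch of this procedure: partition the data into 𝐾 chunks, train on 𝐾 − 1 of them, evaluate on the held-out chunk, and average the 𝐾 empirical risks. The choice 𝐾 = 5, the linear model, and the synthetic data are illustrative assumptions.

```python
# K-fold cross-validation sketch: average the empirical risk R(f^(k), V^(k)) over the K folds.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

K = 5
indices = rng.permutation(len(X))
folds = np.array_split(indices, K)              # partition the data into K chunks

risks = []
for k in range(K):
    val_idx = folds[k]                          # the k-th chunk is the validation set V^(k)
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])  # remaining K-1 chunks form R^(k)

    f_k = LinearRegression().fit(X[train_idx], y[train_idx])               # predictor f^(k) trained on R^(k)
    risks.append(mean_squared_error(y[val_idx], f_k.predict(X[val_idx])))  # R(f^(k), V^(k))

print("per-fold risks:", np.round(risks, 4))
print("estimated generalization error (average):", np.mean(risks))
```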

Cross-validation – some understandings

• The training set in each fold is limited, so it may not produce the best predictor 𝑓^(𝑘)

• The validation set in each fold is also limited, so it produces an inaccurate estimate of
𝑅(𝑓^(𝑘), 𝒱^(𝑘))

• After averaging over the 𝐾 folds, the results are more stable and indicative

• An extreme case is leave-one-out cross-validation, where the validation
set contains only one example.

• A potential drawback: computational cost
• The repeated training can be time-consuming

• If the model has several hyperparameters to tune, each candidate setting needs its own
round of cross-validation, which multiplies the cost

• This problem can be alleviated by parallel computing, since the 𝐾 folds are
independent, given enough computational resources
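As a sketch of the two points above: leave-one-out cross-validation is simply 𝐾-fold with 𝐾 equal to the number of examples, and because the folds are independent they can be evaluated in parallel. The snippet below uses scikit-learn's LeaveOneOut and cross_val_score with n_jobs=-1; the data and the estimator are illustrative assumptions.

```python
# Leave-one-out CV = K-fold with K equal to the number of examples.
# The K runs are independent, so they parallelize naturally (n_jobs=-1 uses all cores).
import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)

scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),                 # one example per validation set
                         scoring="neg_mean_squared_error",
                         n_jobs=-1)                        # run the folds in parallel
print("leave-one-out estimate of the generalization error:", -scores.mean())
```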

Check your understanding

• When your model works poorly on the training set, your model
will also work poorly on the test set.

• When your model works poorly on the training set, your model
may also be overfitting.

• Overfitting happens when your model is too complex given your
training data.

• Regularization alleviates overfitting by improving the complexity
of your training data.

• In 𝐾-fold cross-validation, we will get more stable test accuracy if
𝐾 increases.

• In 2-fold cross-validation, you can obtain 2 results from the 2 test
sets, and they may differ a lot from each other.