Supervised Learning III
Classification, Regularization
Detecting overfitting
Plot model complexity versus objective function on test/train data
As the model becomes more complex, the error on the training data keeps decreasing, while the error on the test data eventually starts to increase
Horizontal axis: measure of model complexity
In this example, we use the maximum order of the polynomial basis functions.
Vertical axis: for regression, it would be the sum of squared errors (SSE) or the mean squared error (MSE)
For classification, the vertical axis would be classification error rate or cross-entropy error function
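A minimal sketch of this diagnostic on synthetic 1-D data (the sine-plus-noise example and the degree range are assumptions, not from the slides): fit polynomials of increasing degree M and compare training and test MSE.

```python
# Fit polynomials of increasing degree and compare train/test mean squared
# error to make overfitting visible (synthetic sine-plus-noise data).
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 15)
x_test = rng.uniform(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 100)

for degree in range(10):                      # model complexity M = 0..9
    coeffs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"M={degree}: train MSE={mse_train:.3f}, test MSE={mse_test:.3f}")
```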
Overcoming overfitting
• Basic ideas
– Use more training data
– Regularization methods
– Cross-validation
Solution: use more data
M=9, increase N
What if we do not have a lot of data?
Overcoming overfitting
• Basic ideas
– Use more training data
– Regularization methods
– Cross-validation
Supervised Learning III
Regularization
Solution: Regularization
M = 9 fit without regularization; the learned coefficients blow up:
w0 = 0.35, w1 = 232.37, w2 = -5321.83, w3 = 48568.31, w4 = -231639.30,
w5 = 640042.26, w6 = -1061800.52, w7 = 1042400.18, w8 = -557682.99, w9 = 125201.43
Regularization penalizes large coefficients like these.
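As a hedged illustration (the synthetic data and the lambda value are assumptions), the sketch below fits the same degree-9 polynomial basis with and without an l2 penalty and prints the coefficients, which shrink dramatically once the penalty is added.

```python
# Degree-9 polynomial fit with and without an L2 penalty: regularization
# keeps the coefficients small.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 10)

Phi = np.vander(x, 10, increasing=True)       # polynomial basis, M = 9
lam = 1e-3                                    # illustrative regularization weight

w_unreg = np.linalg.lstsq(Phi, y, rcond=None)[0]
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(10), Phi.T @ y)

print("unregularized:", np.round(w_unreg, 2))
print("ridge (lambda=1e-3):", np.round(w_ridge, 2))
```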
Regularized Linear Regression
[Figure: price vs. size of house]
Regularized cost function:
J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]
Gradient descent for regularized linear regression. Repeat:
\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}
\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right], \quad j = 1, \dots, n
The ordinary update for \theta_j is replaced with this one, which also shrinks \theta_j on every step.
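A minimal sketch of one regularized gradient-descent step, assuming X already includes a leading column of ones and that alpha and lam denote the learning rate and regularization weight:

```python
# One step of gradient descent for regularized linear regression.
import numpy as np

def gradient_descent_step(theta, X, y, alpha, lam):
    m = len(y)
    error = X @ theta - y                     # h_theta(x) - y for all examples
    grad = (X.T @ error) / m                  # unregularized gradient
    grad[1:] += (lam / m) * theta[1:]         # penalize theta_1..theta_n, not theta_0
    return theta - alpha * grad
```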
Regularized Normal Equation
\theta = \left(X^T X + \lambda L\right)^{-1} X^T y, \quad L = \mathrm{diag}(0, 1, 1, \dots, 1) \in \mathbb{R}^{(n+1)\times(n+1)}
Suppose m \le n (m = #examples, n = #features). Then X^T X is non-invertible/singular.
If \lambda > 0, the matrix X^T X + \lambda L is invertible, so the regularized normal equation still has a unique solution.
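A sketch of this closed-form solution in numpy (the helper name and the toy data are illustrative); the penalty matrix has a zero in the top-left corner so the intercept \theta_0 is not regularized:

```python
# Regularized normal equation: solvable even when X^T X is singular.
import numpy as np

def regularized_normal_equation(X, y, lam):
    n = X.shape[1]                      # X includes a leading column of ones
    L = np.eye(n)
    L[0, 0] = 0.0                       # do not regularize theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

# More features than examples, so X^T X is singular, yet lam > 0 keeps the
# system solvable:
rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 10))])   # m=5, n=10 features
y = rng.normal(size=5)
print(regularized_normal_equation(X, y, lam=1.0))
```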
Regularized Logistic Regression
Hypothesis: h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}
Cost function: J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2
Goal: minimize J(\theta)
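A sketch of the regularized logistic-regression cost and gradient (function names are illustrative; X is assumed to include a bias column of ones):

```python
# Regularized logistic regression: cost and gradient.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y, lam):
    m = len(y)
    h = sigmoid(X @ theta)
    cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    cost += (lam / (2 * m)) * np.sum(theta[1:] ** 2)   # do not regularize theta_0
    grad = (X.T @ (h - y)) / m
    grad[1:] += (lam / m) * theta[1:]
    return cost, grad

# Quick check on tiny synthetic data (first column of X is the bias term):
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
print(cost_and_grad(np.zeros(2), X, y, lam=0.1))
```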
Many types of Regularization
• Most common are l1 and l2
• l1 is often used to create sparsity
Image credit: https://en.wikipedia.org/wiki/Regularization_(mathematics)
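A quick, hedged comparison of l2 (Ridge) and l1 (Lasso) regularization using scikit-learn; the data and regularization strengths are arbitrary, but the sparsity effect of l1 shows up in the count of nonzero coefficients:

```python
# l2 (Ridge) vs. l1 (Lasso) on the same data: l1 drives many coefficients
# exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, 100)  # only 2 features matter

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("nonzero coefficients (l2):", np.sum(np.abs(ridge.coef_) > 1e-6))
print("nonzero coefficients (l1):", np.sum(np.abs(lasso.coef_) > 1e-6))
```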
Supervised Learning III
Bias-Variance
Bias vs Variance
• Understanding how different sources of error lead to bias and variance helps us improve model fitting
• Error due to Bias: the difference between the expected (or average) prediction of our model and the correct value we are trying to predict. (Imagine repeating the whole model-fitting process on many independently drawn training sets and averaging the resulting predictions.)
• Error due to Variance: The variance is how much the predictions for a given point vary between different realizations of the model.
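A minimal sketch of that thought experiment: repeat the fit on many synthetic training sets (the sine-plus-noise data and the deliberately simple degree-1 model are assumptions) and estimate the squared bias and the variance of the prediction at one query point:

```python
# Estimate bias^2 and variance of a simple model at one query point by
# refitting on many independent training sets.
import numpy as np

rng = np.random.default_rng(0)
x0, true_f = 0.8, np.sin(2 * np.pi * 0.8)      # query point and true value

preds = []
for _ in range(1000):                           # 1000 independent training sets
    x = rng.uniform(0, 1, 20)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 20)
    c = np.polyfit(x, y, 1)                     # simple (high-bias) model
    preds.append(np.polyval(c, x0))

preds = np.array(preds)
print("bias^2   :", (preds.mean() - true_f) ** 2)
print("variance :", preds.var())
```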
Graphical Illustration
[Figure: dartboard diagram showing the four combinations of low/high bias and low/high variance]
http://scott.fortmann-roe.com/docs/BiasVariance.html
The Bias-Variance Trade-off
There is a trade-off between bias and variance:
• Less complex models (fewer parameters) tend to have high bias and low variance
• More complex models (more parameters) tend to have low bias and high variance
• Optimal model will have a balance
Which is worse?
• A gut feeling many people have is that they should minimize bias even at the expense of variance
• This is mistaken logic. It is true that a low-bias, high-variance model can perform well in some sort of long-run average sense. However, in practice modelers are always dealing with a single realization of the data set
• In these cases, long run averages are irrelevant, bias and variance are equally important, and one should not be improved at an excessive expense to the other.
How to deal with bias/variance
• Can deal with high variance by
– Bagging, e.g. Random Forest (see the sketch after this list)
– Bagging trains multiple models on random subsamples of the data and averages their predictions
• Can deal with high bias by
– Decreasing regularization / increasing the complexity of the model
– Choosing the model complexity this way is also known as model selection
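A minimal sketch of the bagging idea using scikit-learn's random forest (the dataset and hyperparameters are illustrative, not from the slides):

```python
# A bagged ensemble (random forest) reduces variance compared with a single
# deep decision tree.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("single tree R^2 on test:", tree.score(X_te, y_te))
print("random forest R^2 on test:", forest.score(X_te, y_te))
```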
Supervised Learning III
Model selection and training/validation/test sets
[Figure: price vs. size of house fit with polynomials of increasing degree, ranging from underfit to overfit; slide credit: Andrew Ng]
Model selection
[Figure: price vs. size scatter plot with training points (x train) and test points (x test) marked]
Hyperparameters (e.g., degree of polynomial, regularization weight, learning rate) must be selected prior to training.
How to choose them?
Try several values, choose one with the lowest test error?
Problem: test error is likely an overly optimistic estimate of generalization error because we “cheat” by fitting the hyperparameter to the actual test examples.
Train/Validation/Test Sets
Solution: split data into three sets.
For each value of a hyperparameter, train on the train set and evaluate the learned parameters on the validation set.
Pick the model with the hyperparameter value that achieved the lowest validation error.
Report this model's test set error.

Size   Price
2104   400    (train)
1600   330    (train)
2400   369    (train)
1416   232    (train)
3000   540    (train)
1985   300    (train)
1534   315    (validation)
1427   199    (validation)
1380   212    (test)
1494   243    (test)
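A minimal sketch of this procedure on synthetic data (the sine example, split sizes, and degree range are assumptions): choose the polynomial degree on the validation set, then report the test error once.

```python
# Pick a hyperparameter (polynomial degree) on a validation set; report test
# error only for the chosen model.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 3, 60)
y = np.sin(x) + rng.normal(0, 0.1, 60)
x_tr, y_tr = x[:36], y[:36]          # ~60% train
x_val, y_val = x[36:48], y[36:48]    # ~20% validation
x_te, y_te = x[48:], y[48:]          # ~20% test

def mse(c, xs, ys):
    return np.mean((np.polyval(c, xs) - ys) ** 2)

best_deg, best_val = None, np.inf
for deg in range(1, 10):
    c = np.polyfit(x_tr, y_tr, deg)
    err = mse(c, x_val, y_val)
    if err < best_val:
        best_deg, best_val = deg, err

c = np.polyfit(x_tr, y_tr, best_deg)
print("chosen degree:", best_deg, " test MSE:", mse(c, x_te, y_te))
```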
N-Fold Cross Validation
• What if we don’t have enough data for train/test/validation sets?
• Solution: use N-fold cross validation.
• Split training set into train/validation sets N times
• Report the average error over the N validation sets, e.g. N = 10:
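A sketch of N-fold cross-validation with N = 10 using scikit-learn's KFold splitter (the synthetic data and the candidate degrees are illustrative):

```python
# 10-fold cross-validation for choosing a polynomial degree.
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
x = rng.uniform(0, 3, 50)
y = np.sin(x) + rng.normal(0, 0.1, 50)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for deg in (1, 3, 9):
    fold_errors = []
    for train_idx, val_idx in kf.split(x.reshape(-1, 1)):
        c = np.polyfit(x[train_idx], y[train_idx], deg)
        fold_errors.append(np.mean((np.polyval(c, x[val_idx]) - y[val_idx]) ** 2))
    print(f"degree {deg}: mean validation MSE over 10 folds = {np.mean(fold_errors):.4f}")
```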
Diagnosing bias vs. variance
Suppose your learning algorithm is performing less well than you were hoping (J_{train}(\theta) or J_{cv}(\theta) is high). Is it a bias problem or a variance problem?
[Figure: training error J_{train}(\theta) and cross-validation error J_{cv}(\theta) plotted against model complexity]
Bias (underfit): J_{train}(\theta) will be high, and J_{cv}(\theta) \approx J_{train}(\theta)
Variance (overfit): J_{train}(\theta) will be low, and J_{cv}(\theta) \gg J_{train}(\theta)
Learning Curves: High bias
[Figure: error vs. training set size m; J_{train}(\theta) and J_{cv}(\theta) both converge to a high error. The accompanying price vs. size plots show an underfit model. Slide credit: Andrew Ng]
If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.
Learning Curves: High variance (high model complexity and small \lambda)
[Figure: error vs. training set size m; J_{train}(\theta) stays low while J_{cv}(\theta) is much higher, with the gap narrowing as m grows. The accompanying price vs. size plots show an overfit model. Slide credit: Andrew Ng]
If a learning algorithm is suffering from high variance, getting more training data is likely to help.
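A minimal sketch of computing a learning curve by hand (synthetic data; a deliberately simple degree-1 model so the high-bias pattern appears): train on growing subsets and print J_train and J_cv.

```python
# Learning curve: training vs. cross-validation error as the training set grows.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 3, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)
x_tr, y_tr, x_cv, y_cv = x[:150], y[:150], x[150:], y[150:]

degree = 1                                     # degree-1 model: expect high bias
for m in (10, 25, 50, 100, 150):
    c = np.polyfit(x_tr[:m], y_tr[:m], degree)
    err_train = np.mean((np.polyval(c, x_tr[:m]) - y_tr[:m]) ** 2)
    err_cv = np.mean((np.polyval(c, x_cv) - y_cv) ** 2)
    print(f"m={m:3d}: J_train={err_train:.3f}, J_cv={err_cv:.3f}")
```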
Debugging a learning algorithm
Suppose you have implemented regularized linear regression to predict housing prices. However, when you test your hypothesis on a new set of houses, you find that it makes unacceptably large errors in its predictions. What should you try next?
To fix high variance: get more training examples, try a smaller set of features, or try increasing \lambda.
To fix high bias: try getting additional features, try adding polynomial features, or try decreasing \lambda.
Supervised learning
Training set: \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}
Unsupervised learning
Training set: \{x^{(1)}, x^{(2)}, \dots, x^{(m)}\} (no labels)
Clustering
• Gene analysis
• Social network analysis
• Types of voters
• Trending news
[Figures: clustering illustration showing unlabeled data points and cluster centroids. Slide credit: Andrew Ng]
Next Class
Unsupervised Learning I: Clustering
clustering, k-means, Gaussian mixtures. Reading: Bishop 9.1-9.2
PSet 2 Out
• Due in 1 week: 9/24 11:59pm GMT -5 (Boston Time)
• Regression, gradient descent