Supervised Learning III
Classification, Regularization
Recall: Logistic Regression
Hypothesis: h𝜃(𝑥) = 1 / (1 + e^(−𝜃ᵀ𝑥))
𝜃: parameters
𝐷 = {(𝑥(𝑖), 𝑦(𝑖))}: data
Cost Function:
𝐽(𝜃) = −(1/m) Σi [ 𝑦(𝑖) log h𝜃(𝑥(𝑖)) + (1 − 𝑦(𝑖)) log(1 − h𝜃(𝑥(𝑖))) ]
Goal: minimize cost
Cross Entropy Cost
• Cross entropy compares a distribution q to a reference distribution p: H(p, q) = −Σ p log q
• Here q is the predicted probability of y = 1 given x, and the reference distribution is p = y(i), which is either 1 or 0
Maximum Likelihood Derivation of Logistic Regression Cost
• We can derive the Logistic Regression cost using Maximum Likelihood, and find its derivative w.r.t. 𝜃 (left as exercise):
(∂/∂𝜃j) 𝐽(𝜃) = (1/m) Σi (h𝜃(𝑥(𝑖)) − 𝑦(𝑖)) 𝑥j(𝑖)
• The cost is convex, but it has no direct closed-form solution
Gradient descent for Logistic Regression
Cost: 𝐽(𝜃)
Want min𝜃 𝐽(𝜃): Repeat {
𝜃j := 𝜃j − α (∂/∂𝜃j) 𝐽(𝜃)
}
(simultaneously update all 𝜃j)
Gradient descent for Logistic Regression
Cost: 𝐽(𝜃)
Want min𝜃 𝐽(𝜃): Repeat {
𝜃j := 𝜃j − α (1/m) Σi (h𝜃(𝑥(𝑖)) − 𝑦(𝑖)) 𝑥j(𝑖)
}
(simultaneously update all 𝜃j)
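The update rule above can be sketched in NumPy. This is a minimal illustration, not from the slides: the toy data, learning rate, and iteration count are made up for the example.

```python
import numpy as np

# Toy 1-D data: y = 1 when x is large. A column of ones is prepended
# so that theta[0] acts as the intercept.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
m = len(y)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta):
    # Cross-entropy cost J(theta)
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

theta = np.zeros(2)
alpha = 0.5
for _ in range(5000):
    h = sigmoid(X @ theta)
    grad = X.T @ (h - y) / m      # (1/m) * sum_i (h - y) x_j
    theta -= alpha * grad         # simultaneous update of all theta_j

print(theta, cost(theta))
```

Since the cost is convex, gradient descent with a small enough step size drives the cross-entropy steadily down.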
Decision boundary
[Figure: two classes in the (x1, x2) plane separated by a linear decision boundary]
Predict “y = 1” if 𝜃ᵀ𝑥 ≥ 0
Non-linear decision boundaries
-1
x2
1
1 -1
x1
Predict “ “ if
x2
Supervised Learning III
Non-linear features
What to do if the data is nonlinear?
Nonlinear basis functions
Transform the input/feature:
𝜙(𝑥): 𝑥 ∈ ℝ² → 𝑧 = 𝑥1 · 𝑥2
Another example
How to transform the input/feature?
𝜙(𝑥): 𝑥 ∈ ℝ² → 𝑧 = (𝑥1², 𝑥1 · 𝑥2, 𝑥2²)
Transformed training data: linearly separable
Intuition: suppose 𝜃 = (1, 0, 1)ᵀ. Then 𝜃ᵀ𝑧 = 𝑥1² + 𝑥2², i.e., the sq. distance to the origin!
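The transform can be checked on synthetic data. The two-ring dataset below is invented for illustration: classes that are not linearly separable in (x1, x2) become separable by a single threshold after the quadratic mapping.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two rings around the origin: radius 0.5 -> class 1, radius 2.0 -> class 0.
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.where(np.arange(200) < 100, 0.5, 2.0)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = (np.arange(200) < 100).astype(float)

def phi(X):
    """phi(x): R^2 -> R^3, z = (x1^2, x1*x2, x2^2)."""
    return np.column_stack([X[:, 0] ** 2, X[:, 0] * X[:, 1], X[:, 1] ** 2])

Z = phi(X)
theta = np.array([1.0, 0.0, 1.0])   # theta^T z = x1^2 + x2^2 = squared radius

# In z-space a single linear threshold separates the classes perfectly.
pred = (Z @ theta < 1.0).astype(float)
print((pred == y).mean())  # -> 1.0
```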
Non-linear basis functions
• We can use a nonlinear mapping, or basis function:
𝜙(𝑥): 𝑥 ∈ ℝᴺ → 𝑧 ∈ ℝᴹ
• where M is the dimensionality of the new feature/input 𝑧 (or 𝜙(𝑥))
• Note that M could be greater than N, less than N, or the same
Example with regression
Add more polynomial basis functions
Being too adaptive leads to better results on the training data, but worse results on data that has not been seen!
[Figure: polynomial fits of increasing order; an intermediate order gives a good fit]
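The effect can be reproduced with a small experiment. This is a sketch in the spirit of the classic curve-fitting example (noisy samples of sin(2πx) are an assumption, not data from the slides): training error keeps falling as the polynomial order M grows, while error on held-out points does not.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of sin(2*pi*x) as training data.
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)          # noiseless target for evaluation

def fit_poly(x, y, M):
    """Least-squares fit with polynomial basis functions 1, x, ..., x^M."""
    Phi = np.vander(x, M + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta

def mse(theta, x, y):
    Phi = np.vander(x, len(theta), increasing=True)
    return np.mean((Phi @ theta - y) ** 2)

results = {}
for M in (1, 3, 9):
    theta = fit_poly(x_train, y_train, M)
    results[M] = (mse(theta, x_train, y_train), mse(theta, x_test, y_test))
    print(M, results[M])
```

With 10 training points, M = 9 interpolates the data (near-zero training error) yet wiggles wildly between them.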
Supervised Learning III
Overfitting
Overfitting
Parameters for higher-order polynomials are very large:

        M = 0    M = 1    M = 3     M = 9
𝜃0      0.19     0.82     0.31      0.35
𝜃1               -1.27    7.99      232.37
𝜃2                        -25.43    -5321.83
𝜃3                        17.37     48568.31
𝜃4                                  -231639.30
𝜃5                                  640042.26
𝜃6                                  -1061800.52
𝜃7                                  1042400.18
𝜃8                                  -557682.99
𝜃9                                  125201.43
Overfitting disaster
Fitting the housing price data with M = 3
Note that the price would go to zero (or negative) if you buy bigger houses!
This is called poor generalization/overfitting.
Detecting overfitting
Plot model complexity versus the objective function on training and test data.
As the model becomes more complex, performance on the training data keeps improving, while the error on the test data eventually starts to increase.
Horizontal axis: a measure of model complexity. In this example, we use the maximum order of the polynomial basis functions.
Vertical axis: for regression, it would be SSE or mean SE (MSE); for classification, it would be the classification error rate or the cross-entropy error function.
Overcoming overfitting
• Basic ideas
– Use more training data
– Regularization methods
– Cross-validation
Solution: use more data
M=9, increase N
What if we do not have a lot of data?
Supervised Learning III
Regularization
Solution: Regularization
• Use regularization:
– Add a λ‖𝜃‖₂² term to the SSE cost function
– “L-2” norm squared, i.e., sum of squared elements Σj 𝜃j²
– Penalizes large 𝜃j (such as the huge M = 9 coefficients above)
– λ controls the amount of regularization
• Later, we will derive regularized linear regression from Bayesian linear regression
Regularized Linear Regression
[Figure: price vs. size of house, with a smoother regularized fit]
Gradient descent for Linear Regression: Repeat {
𝜃0 := 𝜃0 − α (1/m) Σi (h𝜃(𝑥(𝑖)) − 𝑦(𝑖)) 𝑥0(𝑖)
𝜃j := 𝜃j − α [ (1/m) Σi (h𝜃(𝑥(𝑖)) − 𝑦(𝑖)) 𝑥j(𝑖) + (λ/m) 𝜃j ]   (j = 1, …, n)
}
i.e., replace the gradient with the regularized gradient; the intercept 𝜃0 is not regularized.
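The regularized update can be sketched as follows (a minimal illustration with made-up synthetic data; the λ/m convention matches the update rule above):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])   # x0 = 1 (intercept)
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 0.1, m)

alpha, lam = 0.1, 1.0
theta = np.zeros(n + 1)
for _ in range(2000):
    err = X @ theta - y
    grad = X.T @ err / m                 # (1/m) * sum_i (h - y) x_j
    grad[1:] += (lam / m) * theta[1:]    # regularize theta_1..theta_n only
    theta -= alpha * grad                # simultaneous update
print(theta)
```

The loop converges to the same solution as the regularized normal equation, since the regularized SSE cost is strictly convex.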
Regularized Normal Equation
𝜃 = (𝑋ᵀ𝑋 + λ · diag(0, 1, …, 1))⁻¹ 𝑋ᵀ𝑦
Suppose m ≤ n (#examples ≤ #features). Then 𝑋ᵀ𝑋 is non-invertible/singular.
If λ > 0, the regularized matrix is always invertible.
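A quick check of the invertibility claim, on made-up data with fewer examples than features:

```python
import numpy as np

rng = np.random.default_rng(3)

# Fewer examples than features: X^T X is singular.
m, n = 5, 8
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])   # m x (n+1)
y = rng.normal(size=m)

lam = 1.0
R = np.eye(n + 1)
R[0, 0] = 0.0        # do not regularize the intercept theta_0

# Regularized normal equation: (X^T X + lambda * R) theta = X^T y
theta = np.linalg.solve(X.T @ X + lam * R, X.T @ y)
print(theta)
```

`X.T @ X` alone has rank at most m = 5 here, so the unregularized normal equation has no unique solution; adding λ·R makes the system solvable for any λ > 0.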
Regularized Logistic Regression
Hypothesis: h𝜃(𝑥) = 1 / (1 + e^(−𝜃ᵀ𝑥))
𝜃: parameters
𝐷 = {(𝑥(𝑖), 𝑦(𝑖))}: data
Cost Function:
𝐽(𝜃) = −(1/m) Σi [ 𝑦(𝑖) log h𝜃(𝑥(𝑖)) + (1 − 𝑦(𝑖)) log(1 − h𝜃(𝑥(𝑖))) ] + λ‖𝜃‖₂²
Goal: minimize cost
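The regularized cost can be computed directly. This sketch follows the formula above; leaving the intercept 𝜃0 out of the penalty is a common convention assumed here, and the toy data is invented:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reg_log_cost(theta, X, y, lam):
    """Cross-entropy cost plus lam * ||theta_1:||^2 (intercept left unpenalized)."""
    h = sigmoid(X @ theta)
    ce = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    return ce + lam * np.sum(theta[1:] ** 2)

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.array([0.0, 3.0])
print(reg_log_cost(theta, X, y, 0.0), reg_log_cost(theta, X, y, 0.1))
```

With λ > 0 the penalty raises the cost of large weights, which discourages the overconfident boundaries that overfit.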
Supervised Learning III
Bias-Variance
Bias vs Variance
• Understanding how different sources of error lead to bias and variance helps us improve model fitting
• Error due to Bias: The error due to bias is taken as the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict (imagine you could repeat the whole model fitting process on many datasets)
• Error due to Variance: The variance is how much the predictions for a given point vary between different realizations of the model.
Graphical Illustration
[Figure: dartboard diagrams for the four combinations of low/high bias and low/high variance]
http://scott.fortmann-roe.com/docs/BiasVariance.html
The Bias-Variance Trade-off
Hence, there is a trade-off between bias and variance:
• Less complex models (fewer parameters) have high bias and low variance
• More complex models (more parameters) have low bias and high variance
• The optimal model will have a balance
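The trade-off can be measured empirically by repeating the whole fitting process on many datasets, as the definitions above suggest. This simulation is a sketch (the true function sin(2πx), noise level, and dataset counts are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 20)
f = np.sin(2 * np.pi * x)                    # the "correct value" to predict

def bias2_and_variance(M, n_datasets=200, noise=0.3):
    """Refit a degree-M polynomial on many noisy datasets, then decompose the error."""
    Phi = np.vander(x, M + 1, increasing=True)
    preds = np.empty((n_datasets, len(x)))
    for i in range(n_datasets):
        y = f + rng.normal(0, noise, len(x))   # a fresh realization of the data
        theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        preds[i] = Phi @ theta
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - f) ** 2)      # (average prediction - truth)^2
    variance = np.mean(preds.var(axis=0))      # spread across realizations
    return bias2, variance

results = {M: bias2_and_variance(M) for M in (1, 3, 9)}
for M, (b2, v) in results.items():
    print(M, b2, v)
```

As M grows, the bias² column shrinks while the variance column grows, matching the bullets above.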
Which is worse?
• A gut feeling many people have is that they should minimize bias even at the expense of variance
• This is mistaken logic. It is true that a high variance and low bias model can perform well in some sort of long-run average sense. However, in practice, modelers are always dealing with a single realization of the data set
• In these cases, long-run averages are irrelevant, bias and variance are equally important, and one should not be improved at an excessive expense of the other.
How to deal with bias/variance
• Can deal with high variance by
– Bagging, e.g., Random Forest
– Bagging trains multiple models on random subsamples of the data, then averages their predictions
• Can deal with high bias by
– Decreasing regularization / increasing the complexity of the model
– Also known as model selection
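A minimal sketch of the idea behind bagging (not Random Forest itself): fit a high-variance model on bootstrap resamples of the data and average the predictions. The dataset and the degree-9 base model are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(5)

x = np.linspace(0, 1, 30)
y_true = np.sin(2 * np.pi * x)
y = y_true + rng.normal(0, 0.3, 30)
Phi = np.vander(x, 10, increasing=True)      # degree-9 basis: a high-variance model

def fit_predict(idx):
    """Fit the degree-9 model on the rows in idx, predict on all of x."""
    theta, *_ = np.linalg.lstsq(Phi[idx], y[idx], rcond=None)
    return Phi @ theta

# Bagging: fit on bootstrap resamples of the data, then average the predictions.
bags = [fit_predict(rng.integers(0, 30, 30)) for _ in range(100)]
bagged = np.mean(bags, axis=0)

avg_individual_mse = np.mean([np.mean((b - y_true) ** 2) for b in bags])
bagged_mse = np.mean((bagged - y_true) ** 2)
print(avg_individual_mse, bagged_mse)
```

Averaging cannot increase this error: the mean squared error of the averaged prediction is always at most the average of the individual errors, with the gap equal to the variance across the bootstrap fits.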
Supervised Learning III
Model selection and training/validation/test sets
Overfitting example
Once parameters (𝜃0, 𝜃1, …, 𝜃4) were fit to some set of data (the training set), the error of the parameters as measured on that data (the training error 𝐽(𝜃)) is likely to be lower than the actual generalization error.
[Figure: price vs. size of house with a high-order polynomial fit]
One solution is to regularize, but how can we choose the regularization weight 𝜆?
Choosing weight 𝜆
[Figure: three fits of price vs. size]
𝜆 = 100: high bias (underfit)
𝜆 = 1: “just right”
𝜆 = 0.01: high variance (overfit)
Andrew Ng
Model selection
[Figure: train (x) and test (x) points of price vs. size]
Hyperparameters (e.g., degree of polynomial, regularization weight, learning rate) must be selected prior to training.
How to choose them?
Try several values, and choose the one with the lowest test error?
Problem: the test error is likely an overly optimistic estimate of the generalization error, because we “cheat” by fitting the hyperparameter to the actual test examples.
Train/Validation/Test Sets
Solution: split the data into three sets.
For each value of a hyperparameter, train on the training set, and evaluate the learned parameters on the validation set.
Pick the model with the hyperparameter that achieved the lowest validation error.
Report this model’s test set error.

Size   Price   Set
2104   400     train
1600   330     train
2400   369     train
1416   232     train
3000   540     train
1985   300     train
1534   315     validation
1427   199     validation
1380   212     test
1494   243     test
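The procedure above can be sketched end to end. Everything here is synthetic and illustrative (the data-generating model, the 60/20/20 split, the candidate λ values, and the ridge-style fit are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic housing-style data: price depends on size, plus noise.
size = rng.uniform(1000, 3000, 60)
price = 100 + 0.1 * size + rng.normal(0, 20, 60)

# 60/20/20 split into train / validation / test
idx = rng.permutation(60)
tr, va, te = idx[:36], idx[36:48], idx[48:]

def fit_ridge(x, y, lam, M=9):
    """Degree-M polynomial fit (on rescaled size) with an L2 penalty."""
    Phi = np.vander((x - 2000) / 1000, M + 1, increasing=True)
    reg = lam * np.eye(M + 1)
    reg[0, 0] = 0.0                      # intercept not penalized
    return np.linalg.solve(Phi.T @ Phi + reg, Phi.T @ y)

def mse(theta, x, y, M=9):
    Phi = np.vander((x - 2000) / 1000, M + 1, increasing=True)
    return np.mean((Phi @ theta - y) ** 2)

# Pick lambda on the validation set, then report error once on the test set.
lams = [0.01, 1.0, 100.0]
val_errs = {lam: mse(fit_ridge(size[tr], price[tr], lam), size[va], price[va])
            for lam in lams}
best = min(val_errs, key=val_errs.get)
test_err = mse(fit_ridge(size[tr], price[tr], best), size[te], price[te])
print(best, test_err)
```

The test set is touched exactly once, after the hyperparameter is fixed, so the reported error is an unbiased estimate of generalization error.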
Diagnosing bias vs. variance
Suppose your learning algorithm is performing less well than you were hoping (𝐽CV(𝜃) or 𝐽test(𝜃) is high). Is it a bias problem or a variance problem?
[Figure: training error 𝐽train(𝜃) and cross-validation error 𝐽CV(𝜃) versus model complexity]
Bias (underfit): 𝐽train(𝜃) will be high, 𝐽CV(𝜃) ≈ 𝐽train(𝜃)
Variance (overfit): 𝐽train(𝜃) will be low, 𝐽CV(𝜃) >> 𝐽train(𝜃)
Learning Curves: High bias
[Figure: error versus m (training set size); 𝐽train(𝜃) and 𝐽CV(𝜃) quickly converge to the same high plateau]
If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.
Learning Curves: High variance
(high-order model and small 𝜆)
[Figure: error versus m (training set size); a large gap between 𝐽CV(𝜃) and 𝐽train(𝜃) that narrows as m grows]
If a learning algorithm is suffering from high variance, getting more training data is likely to help.
Andrew Ng
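Learning curves can be generated numerically. This sketch uses invented data (noisy sin(2πx)) and shows the high-bias case: a straight-line model plateaus at a high error no matter how much data it gets.

```python
import numpy as np

rng = np.random.default_rng(7)

# Fixed cross-validation set; training sets of growing size m.
x_cv = rng.uniform(0, 1, 50)
y_cv = np.sin(2 * np.pi * x_cv) + rng.normal(0, 0.2, 50)

def design(x, M):
    return np.vander(x, M + 1, increasing=True)

def learning_curve(M, sizes):
    """For each training set size m, return (m, J_train, J_CV)."""
    rows = []
    for m in sizes:
        x_tr = rng.uniform(0, 1, m)
        y_tr = np.sin(2 * np.pi * x_tr) + rng.normal(0, 0.2, m)
        theta, *_ = np.linalg.lstsq(design(x_tr, M), y_tr, rcond=None)
        j_train = np.mean((design(x_tr, M) @ theta - y_tr) ** 2)
        j_cv = np.mean((design(x_cv, M) @ theta - y_cv) ** 2)
        rows.append((m, j_train, j_cv))
    return rows

# M = 1 is too simple for sin(2*pi*x): both errors plateau at a high value,
# so adding data cannot help much (high bias).
curve = learning_curve(1, [5, 20, 80])
for m, jt, jc in curve:
    print(m, jt, jc)
```

Rerunning `learning_curve` with a high M and small training sets would instead show a large 𝐽CV − 𝐽train gap that shrinks as m grows, the high-variance signature.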
Debugging a learning algorithm
Suppose you have implemented regularized linear regression to predict housing prices. However, when you test your hypothesis on a new set of houses, you find that it makes unacceptably large errors in its predictions. What should you try next?
To fix high variance:
• Get more training examples
• Try smaller sets of features
• Try increasing 𝜆
To fix high bias:
• Try getting additional features
• Try adding polynomial features
• Try decreasing 𝜆