
CMPUT 366 F20: Supervised Learning V
James Wright & Vadim Bulitko
November 19, 2020

Lecture Outline
Overfitting
PM 7.4

Overfitting
The learner makes predictions based on regularities that occur in the training data but not in the underlying population
failure to generalize from training data
Reason 1: learning spurious patterns: in training data there may be coincidental associations that are not reflective of the process underlying the data
example: more pictures of tanks taken on sunny days, more pictures without tanks taken on cloudy days. Learning agent learns that sunny pictures are predictive of tanks.
Reason 2: overconfidence in the learned model. The unseen data is assumed to be more like the training data than is plausible
example: just because my training data does not contain the word “squeegee” does not mean there is a literally zero percent chance of encountering it

Example: Restaurant Ratings
A website collects ratings for restaurants on a scale of 1 to 5 stars
The website wants to display the best restaurants
best restaurants = restaurants that future diners will rate the highest
The mean of recorded ratings for a restaurant optimizes the squared error on the training data
If the website just lists the restaurants with the highest mean rating then restaurants with a few ratings would get too-high predictions
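The point above can be checked in a few lines of Python (toy ratings invented for illustration): the mean minimizes squared error on the recorded ratings, yet ranking by raw means puts a one-review restaurant on top.

```python
# Toy ratings, invented for illustration.
def mean_rating(ratings):
    return sum(ratings) / len(ratings)

def squared_error(ratings, prediction):
    return sum((r - prediction) ** 2 for r in ratings)

established = [4, 3, 5, 4, 4, 3, 4]  # many recorded ratings
newcomer = [5]                       # a single recorded rating

# The mean minimizes squared error on the training ratings:
# nudging the prediction in either direction only increases it.
m = mean_rating(established)
for other in (m - 0.1, m + 0.1):
    assert squared_error(established, m) < squared_error(established, other)

# But ranking by raw mean puts the one-review newcomer on top.
print(mean_rating(newcomer) > mean_rating(established))  # → True
```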

Regression to the Mean
Extreme predictions do not perform well on test cases
Examples
children of very tall parents are likely to be shorter than either parent
the Sports Illustrated Cover curse: Players who have just appeared on the cover of Sports Illustrated often perform much worse subsequently
there is no rating higher than five stars, so the only possible noise in the data is too-low ratings

Model Complexity
Adding more parameters to a model can almost always fit the training data better
but doing so can also cause overfitting to the training data and loss of performance on test data
Intuition:
simple models cannot represent much, so they are forced to prioritize the largest/most important effects
complex models can represent more effects, including small, unimportant, and/or spurious effects

Example: Fitting Polynomials
A linear model may not hit every training datum
A sufficiently high-degree polynomial will
Which model’s predictions are more credible?
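A small numpy sketch of this comparison, with invented data from a noisy linear ground truth. The high-degree fit hits every training point and drives training error to essentially zero, which is exactly the symptom to distrust.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)
y = 2.0 * x + rng.normal(0.0, 0.2, size=x.shape)  # noisy linear data

line = np.polyfit(x, y, deg=1)    # simple model: 2 parameters
wiggle = np.polyfit(x, y, deg=7)  # 8 parameters: interpolates all 8 points

def train_error(coeffs):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# The degree-7 polynomial "hits" every training datum...
print(train_error(wiggle) < train_error(line))  # → True
# ...but its near-zero training error comes from fitting the noise, not the signal.
```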

Big Data
More examples usually give better predictions
But this is not a cure-all because:
often more examples E come with more features X of the examples
more features require more examples for efficient learning

Causes of Test-data Error
Bias Variance Noise

Bias
Types:
Representation bias: hypothesis class does not contain a model close enough to the ground truth
Search bias: algorithm was not able to find a good enough hypothesis in the hypothesis space
Examples:
decision trees can represent any function of categorical variables: low representational bias
the space of decision trees is too large to search systematically: high search bias
linear regression is a very simple class of models: high representation bias
an optimal linear model can be found analytically: zero search bias
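The "zero search bias" claim can be made concrete: ordinary least squares has the closed-form solution w = (X^T X)^(-1) X^T y, so no search over hypotheses is needed. A minimal sketch with invented data:

```python
import numpy as np

# Column of ones for the intercept, plus one feature.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])  # exactly y = 1 + 2x

# Normal equations: solve (X^T X) w = X^T y directly, no search involved.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w, [1.0, 2.0]))  # → True: intercept 1, slope 2 recovered
```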

Variance
The smaller the training data set, the more different we can expect our model estimates to be
Restaurant Example: how different would the estimates be from two training sets of 1 rating each? How different would they be from two training sets of 100,000 ratings each?
Variance is the error from having too little data to train from
from having too complex a model for the amount of data that we have
more complex models require more data to fit
Bias-variance tradeoff (for a given fixed amount of data):
complex models will contain better hypotheses but be harder to estimate
simple models will be easier to estimate, but representational bias means they cannot be as accurate

Noise
There may be randomness in our training data
Example 1: quantum physics
Example 2: a coin toss
Example 3: the ice cream trucks only come out when it is sunny, but our data set does not record the weather

Techniques to Avoid Overfitting
Pseudocounts: explicitly account for regression to the mean
Regularization: explicitly trade off between fitting the data and model complexity
Cross-validation: detect overfitting using some of the training data set aside

Pseudocounts or Prior Counts
When we have not observed all the values of a variable, those values should not be assigned probability zero. If we do not have much data, we should not be making extreme predictions
Solution: prepend some fictional (prior) observations to the set of training data
In case our prediction is the mean of the training data values:
(v1 + … + vn) / n
Prepend c copies of a fictional value a0 to the training data:
(c·a0 + v1 + … + vn) / (c + n)
The value of a0 is due to other knowledge, not found in the vi
The value of c determines the impact
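A minimal sketch of the corrected mean above (the function name pseudo_mean is mine, not from the slides):

```python
# Prepend c copies of a prior value a0 to the observed values,
# then take the ordinary mean of the combined list.
def pseudo_mean(values, a0, c):
    return (c * a0 + sum(values)) / (c + len(values))

# With no observations, the prediction falls back to the prior value a0.
print(pseudo_mean([], a0=3.0, c=5))             # → 3.0
# A single 5-star rating is pulled toward the prior instead of predicting 5.
print(round(pseudo_mean([5], a0=3.0, c=5), 3))  # → 3.333
```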

Pseudocounts: The Restaurant Example
Suppose we have a new restaurant with a single 5-star rating: v = 5, n = 1
Overfitting: the uncorrected prediction, the mean, is 5

predicting 5 as the restaurant’s future rating will likely be inaccurate
Pseudocount correction: (v + c·a0) / (1 + c)
a0 is the average rating of all restaurants
c is obtained by solving a′ = (v + c·a0) / (1 + c) for c
here n = 1, v = 5, and a′ is the average rating of all restaurants with a 5-star rating
Solve for c when a0 = 3, v = 5, n = 1, a′ = 3.5
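Rearranging a′ = (v + c·a0) / (1 + c) gives c = (v − a′) / (a′ − a0); with the numbers above, c = (5 − 3.5) / (3.5 − 3) = 3. A quick check (the function name solve_c is mine):

```python
def solve_c(v, a0, a_prime):
    # From a' * (1 + c) = v + c * a0:  c * (a' - a0) = v - a'
    return (v - a_prime) / (a_prime - a0)

c = solve_c(v=5, a0=3, a_prime=3.5)
print(c)  # → 3.0

# Sanity check: the corrected prediction equals the target average a'.
assert (5 + c * 3) / (1 + c) == 3.5
```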

Regularization
We should not choose a complex model unless there is a clear need for it
Instead of optimizing only for training error, optimize training error plus a penalty for complexity:
loss(E, h) = error(E, h) + λ regularizer(h)
regularizer(h) is the complexity of the hypothesis h
λ is the regularization parameter: indicates how important hypothesis complexity is compared to training data error
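For the L2 (ridge) regularizer, this trade-off has a closed form: w = (X^T X + λI)^(-1) X^T y. A sketch with invented data, showing that increasing λ shrinks the learned weight:

```python
import numpy as np

def ridge(X, y, lam):
    """L2-regularized least squares: minimizes error + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 2.0, 4.0, 6.0])  # exactly y = 2x

w0 = ridge(X, y, lam=0.0)    # no penalty: recovers the slope 2
w10 = ridge(X, y, lam=10.0)  # heavy penalty: weight shrunk toward 0
print(float(w0[0]))              # → 2.0
print(abs(w10[0]) < abs(w0[0]))  # → True
```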

Types of Regularizers
Number of parameters
Degree of polynomial
L2 regularizer (“ridge regularizer”): sum of squares of weights
prefers models with smaller weights
L1 regularizer (“lasso regularizer”): sum of absolute values of weights
prefers models with fewer nonzero weights
often used for feature selection: only features with nonzero weights are used

(Hold Out) Cross-Validation
Previous methods require us to already know how simple a model should be:
How many pseudocounts to add?
What should the regularization parameter be?
Ideally we would like to be able to answer these questions from the data
Can we use the test data to see which of these work best?
Use some of the training data as an estimate of the test data

(Hold Out) Cross-Validation
Want to use more data for training
Want to use more data for validation
One gets larger, the other gets smaller, since they are disjoint
How do we solve this problem?
Use each available example for both training and validation but not at the same time

k-Fold Cross-Validation
1. Randomly partition training data into k approximately equal-sized sets (folds)
2. Repeat k times:
train on k − 1 folds, test on the remaining (held out, left out) fold
3. Select the hyperparameter value with the best validation performance (measured over all held out folds)
4. Train on all training data with the selected hyperparameters
test the resulting hypothesis on test data, untouched until now
Each example is used exactly once for validation and k − 1 times for training
Extreme case: k = n is called leave-one-out cross-validation
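The numbered steps above can be sketched in pure Python. This is a minimal illustration, not the textbook's code: the toy learner predicts the training mean, and the function names are mine.

```python
import random

def k_fold_error(data, k, train, error, seed=0):
    """Average held-out error over k folds (steps 1-2 above)."""
    shuffled = data[:]                         # shuffle a copy, not the original
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k roughly equal-sized folds
    total = 0.0
    for i in range(k):
        held_out = folds[i]
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        total += error(held_out, train(training))  # train on k-1 folds
    return total / k

# Toy learner: predict the mean of the training values.
train = lambda d: sum(d) / len(d)
error = lambda d, h: sum((x - h) ** 2 for x in d) / len(d)

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
score = k_fold_error(data, k=3, train=train, error=error)
print(score > 0)  # → True: held-out points differ from the training mean
```

To tune a hyperparameter, one would compute this score once per candidate value and keep the value with the lowest average held-out error (step 3), then retrain on all the training data (step 4).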

k-Fold Cross-Validation Example

Summary
Overfitting is when a learned model fails to generalize due to overconfidence and/or learning spurious regularities
Bias-variance tradeoff: More complex models can be more accurate, but also require more data to train
Techniques to reduce overfitting:
pseudocounts: add fictional observations
regularization: penalize model complexity
cross-validation: reserve part of the training data to estimate test error