
CS229 Bias-Variance and Error Analysis
Yoann Le Calonnec October 2, 2017
1 The Bias-Variance Tradeoff
Assume you are given a well-fitted machine learning model $\hat{f}$ that you want to apply to some test dataset. For instance, the model could be a linear regression whose parameters were computed using some training set different from your test set. For each point $x$ in your test set, you want to predict the associated target $y \in \mathbb{R}$, and compute the mean squared error (MSE)

$$E_{(x,y)\sim \text{test set}}\,|\hat{f}(x) - y|^2$$
You now realize that this MSE is too high, and try to find an explanation to this result:
• Overfitting: the model is too closely related to the examples in the training set and doesn’t generalize well to other examples.
• Underfitting: the model didn’t gather enough information from the training set, and doesn’t capture the link between the features x and the target y.
• The data is simply noisy, that is, the model is neither overfitting nor underfitting, and the high MSE is simply due to the amount of noise in the dataset.
Our intuition can be formalized by the Bias-Variance tradeoff.
Assume that the points in your training/test set are all taken from a similar distribution, with
$$y_i = f(x_i) + \varepsilon_i, \quad \text{where the noise } \varepsilon_i \text{ satisfies } E(\varepsilon_i) = 0,\ \mathrm{Var}(\varepsilon_i) = \sigma^2,$$
and your goal is to compute $f$. By looking at your training set, you obtain an estimate $\hat{f}$ of $f$.
Now use this estimate with your test set, meaning that for each example $j$ in the test set, your prediction for $y_j = f(x_j) + \varepsilon_j$ is $\hat{f}(x_j)$. Here, $x_j$ is a fixed real number (or vector if the feature space is multi-dimensional), thus $f(x_j)$ is fixed, and $\varepsilon_j$ is a real random variable with mean 0 and variance $\sigma^2$. The crucial observation is that $\hat{f}(x_j)$ is random, since it depends on the values $\varepsilon_i$ from the training set. That's why talking about the bias $E(\hat{f}(x) - f(x))$ and the variance of $\hat{f}$ makes sense.
We can now compute our MSE on the test set by computing the following expectation with respect to the possible training sets (since $\hat{f}$ is a random variable, being a function of the choice of the training set):
$$
\begin{aligned}
\text{Test MSE} &= E\left[(y - \hat{f}(x))^2\right] \\
&= E\left[(\varepsilon + f(x) - \hat{f}(x))^2\right] \\
&= E(\varepsilon^2) + E\left[(f(x) - \hat{f}(x))^2\right] \\
&= \sigma^2 + \left(E\left[f(x) - \hat{f}(x)\right]\right)^2 + \mathrm{Var}\left(f(x) - \hat{f}(x)\right) \\
&= \sigma^2 + \mathrm{Bias}\left(\hat{f}(x)\right)^2 + \mathrm{Var}\left(\hat{f}(x)\right)
\end{aligned}
$$
There is nothing we can do about the first term $\sigma^2$, as we cannot predict the noise $\varepsilon$ by definition. (The cross term $2\,E[\varepsilon\,(f(x) - \hat{f}(x))]$ vanishes because the test noise $\varepsilon$ has mean zero and is independent of $\hat{f}$, and $\mathrm{Var}(f(x) - \hat{f}(x)) = \mathrm{Var}(\hat{f}(x))$ since $f(x)$ is deterministic.) The bias term is due to underfitting, meaning that on average, $\hat{f}$ does not predict $f$ well. The last term is closely related to overfitting: the prediction $\hat{f}$ follows the training values $y_{\text{train}}$ too closely and varies a lot with the choice of training set.
To sum up, we can understand our MSE as follows:

High bias      ←→  Underfitting
High variance  ←→  Overfitting
Large $\sigma^2$   ←→  Noisy data
Hence, when analyzing the performance of a machine learning algorithm, we must always ask ourselves how to reduce the bias without increasing the variance, and respectively how to reduce the variance without increasing the bias. Most of the time, reducing one will increase the other, and there is a tradeoff between bias and variance.
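To make the decomposition concrete, here is a minimal simulation sketch (not part of the original notes): it repeatedly draws training sets from $y_i = f(x_i) + \varepsilon_i$, fits polynomials of different degrees with NumPy, and estimates the bias and variance of $\hat{f}$ at a fixed test point. The true function $\sin(2\pi x)$, the noise level, and the polynomial degrees are illustrative assumptions.

```python
# Empirical check of Test MSE ≈ sigma^2 + Bias^2 + Variance at a fixed test point.
import numpy as np

rng = np.random.default_rng(0)

def f(x):                          # "true" regression function (assumed for the demo)
    return np.sin(2 * np.pi * x)

sigma = 0.3                        # noise standard deviation
n_train, n_trials = 30, 2000       # training-set size and number of resampled training sets
x_test = 0.35                      # a single fixed test point x

for degree in [1, 3, 9]:           # underfitting, reasonable fit, overfitting
    preds = np.empty(n_trials)
    for t in range(n_trials):
        x_tr = rng.uniform(0, 1, n_train)
        y_tr = f(x_tr) + rng.normal(0, sigma, n_train)   # y_i = f(x_i) + eps_i
        coefs = np.polyfit(x_tr, y_tr, degree)           # f-hat depends on the training noise
        preds[t] = np.polyval(coefs, x_test)
    bias2 = (preds.mean() - f(x_test)) ** 2
    var = preds.var()
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}, "
          f"expected test MSE = {sigma**2 + bias2 + var:.4f}")
```

Low-degree fits show a large bias term, high-degree fits a large variance term, while the $\sigma^2$ floor is the same in every case.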
2 Error Analysis
Understanding whether our poor test error is due to high bias or high variance is important, but it is just as crucial to know which parts of the machine learning pipeline are responsible for that error.
Consider the machine learning pipeline in Figure 1. The algorithm is divided into several steps:
1. The inputs are taken from a camera image
2. Preprocess to remove the background of the image. For instance, if the images are taken from a security camera, the background is always the same, and we could remove it easily by keeping only the pixels that changed between images.
3. Detect the position of the face.
4. Detect the eyes – Detect the nose – Detect the mouth
Figure 1: Face recognition pipeline
5. Final logistic regression step to predict the label
If you build a complicated system like this one, you might want to figure out how much error is attributable to each of the components, that is, how good each of these green boxes is. Indeed, if one of these boxes is really problematic, you might want to spend more time trying to improve the performance of that one green box. How do you decide which part to focus on?
One thing we can do is plug in the ground truth for each component and see how the accuracy changes. Let's say the overall accuracy of the system is 85% (pretty bad). You can now take your development set and manually give it the perfect background removal; that is, instead of using your background removal algorithm, specify the perfect background removal yourself (using Photoshop, for instance), and look at how much that affects the performance of the overall system.
Now let’s say the accuracy only improves by 0.1%. This gives us an upperbound, that is even if we worked for years on background removal, it wouldn’t help our system by more than 0.1%.
Now let’s give the pipeline the perfect face detection by specifying the position of the face manually, see how much we improve the performance, and so on.
The results are reported in Table 1.
Looking at the table, we know that working on the background removal won't help much. It also tells us where the biggest jumps are. We notice that having an accurate face detection mechanism really improves the performance, and similarly, the eyes really help make the prediction more accurate.
Error analysis is also useful when publishing a paper, since it's a convenient way to analyze the error of an algorithm and explain which parts should be improved.

Component                          Accuracy
Overall system                     85%
Preprocess (remove background)     85.1%
Face detection                     91%
Eyes segmentation                  95%
Nose segmentation                  96%
Mouth segmentation                 97%
Logistic regression                100%

Table 1: Accuracy when providing the system with the perfect component
Ablative analysis
While error analysis tries to explain the difference between current performance and perfect performance, ablative analysis tries to explain the difference between some baseline (much poorer) performance and current performance.
For instance, suppose you have built a good anti-spam classifier by adding lots of clever features to logistic regression
• Spelling correction
• Sender host features
• Email header features
• Email text parser features
• Javascript parser
• Features from embedded images
and your question is: How much did each of these components really help?
In this example, let's say that simple logistic regression without any clever features gets 94% performance, but when adding these clever features, we get 99.9% performance. In ablative analysis, what we do is start from the current level of performance, 99.9%, and slowly take away these features one at a time to see how this affects performance. The results are provided in Table 2.
When presenting the results in a paper, ablative analysis really helps identify the features that contributed to decreasing the misclassification rate. Instead of simply giving the loss/error rate of the algorithm, we can provide evidence that some specific features are actually more important than others.
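One possible way to organize this experiment is sketched below (hypothetical: the `train_and_evaluate` helper, which retrains the classifier using only the listed features and returns development-set accuracy, is assumed for illustration). Features are removed one at a time, in order, starting from the full system.

```python
# Hypothetical sketch of ablative analysis. `train_and_evaluate(features)` is
# an assumed helper that retrains logistic regression using only `features`
# and returns accuracy on the development set.

FEATURES = ["spelling_correction", "sender_host", "email_header",
            "email_text_parser", "javascript_parser", "embedded_images"]

def ablative_analysis(train_and_evaluate):
    remaining = list(FEATURES)
    rows = [("overall system", train_and_evaluate(list(remaining)))]
    for feature in FEATURES:
        remaining.remove(feature)   # take this feature away...
        rows.append((f"without {feature}", train_and_evaluate(list(remaining))))  # ...and retrain
    for name, accuracy in rows:
        print(f"{name:35s} {accuracy:.1%}")
    return rows
```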
Component                      Accuracy
Overall system                 99.9%
Spelling correction            99.0%
Sender host features           98.9%
Email header features          98.9%
Email text parser features     95%
Javascript parser              94.5%
Features from images           94.0%

Table 2: Accuracy when removing features from logistic regression
Analyze your mistakes
Assume you are given a dataset with pictures of animals, and your goal is to identify pictures of cats that you would eventually send to the members of a community of cat lovers. You notice that there are many pictures of dogs in the original dataset, and wonder whether you should build a special algorithm to identify the pictures of dogs so that you avoid sending dog pictures to the cat lovers.
One thing you can do is take 100 examples from your development set that are misclassified, and count how many of these 100 mistakes are dogs. If 5% of them are dogs, then even if you came up with a perfect solution to identify dogs, your error would only go down by 5% (relative), that is, your accuracy would go up from 90% to 90.5%. However, if 50 of these 100 errors are dogs, then you could improve your accuracy to reach 95%.
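A minimal sketch of this counting procedure is below; `predictions` are the model's outputs on the development set and `is_dog` stands in for the manual inspection of each sampled mistake (both are hypothetical names, not from the original notes).

```python
# Count how many of ~100 sampled dev-set mistakes fall into one category (dogs)
# and bound the accuracy gain from fixing that category perfectly.
import random

def potential_gain(dev_set, predictions, is_dog, sample_size=100):
    errors = [ex for ex, pred in zip(dev_set, predictions) if pred != ex["label"]]
    sample = random.sample(errors, min(sample_size, len(errors)))
    frac_dogs = sum(is_dog(ex) for ex in sample) / len(sample)
    error_rate = len(errors) / len(dev_set)
    # Best case: fixing every dog mistake removes that fraction of the errors,
    # e.g. a 10% error rate with 5% dog mistakes can drop to at most 9.5%.
    return error_rate, error_rate * (1 - frac_dogs)
```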
By analyzing your mistakes, you can focus on what's really important. If you notice that 80 out of your 100 mistakes are blurry images, then work hard on correctly classifying those blurry images. If you notice that 70 out of the 100 errors are great cats, then focus on this specific task of identifying great cats.
In brief, do not waste your time improving parts of your algorithm that won't really help decrease your error rate, and focus on what really matters.