Practical Advice for Applying Machine Learning
Machine Learning
Kate Saenko
Outline
• Machine learning system design
• How to improve a model’s performance?
• Feature engineering/pre-processing
• Learning with large datasets
Machine learning system design
Practical Advice for Applying Machine Learning
Example: Building a spam classifier
From: cheapsales@buystufffromme.com
To: ang@cs.stanford.edu
Subject: Buy now!
Deal of the week! Buy now!
Rolex w4tchs – $100
Med1cine (any kind) – $50
Also low cost M0rgages available.
From: Alfred Ng
To: ang@cs.stanford.edu
Subject: Christmas dates?
Hey Andrew,
Was talking to Mom about plans for Xmas. When do you get off work. Meet Dec 22?
Alf
Example: Building a spam classifier
Supervised learning. x = features of email, y = spam (1) or not spam (0).
Features x: choose 100 words indicative of spam/not spam.
From: cheapsales@buystufffromme.com
To: ang@cs.stanford.edu
Subject: Buy now!
Deal of the week! Buy now!
Note: in practice, take the most frequently occurring n words (10,000 to 50,000) in the training set, rather than manually picking 100 words.
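For concreteness, a minimal scikit-learn sketch of building such a bag-of-words feature matrix; the two emails and the max_features value below are placeholders, not the lecture's data:

from sklearn.feature_extraction.text import CountVectorizer

# Build a vocabulary of the n most frequent words in the training emails and
# represent each email by binary word-presence features (toy data below).
train_emails = [
    "Deal of the week! Buy now!",
    "Was talking to Mom about plans for Xmas.",
]
vectorizer = CountVectorizer(max_features=10_000, binary=True)  # top-n vocabulary
X_train = vectorizer.fit_transform(train_emails)                # one row per email
print(vectorizer.get_feature_names_out(), X_train.toarray())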
Example: Building a spam classifier
How should you spend your time to make it have low error?
– Collect lots of data
– Develop sophisticated features based on email routing information (from email header).
– Develop sophisticated features for message body, e.g. should “discount” and “discounts” be treated as the same word? How about “deal” and “Dealer”? Features about punctuation?
– Develop sophisticated algorithm to detect misspellings (e.g. m0rtgage, med1cine, w4tches.)
Recommended approach
– Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
– Plot learning curves to decide if more data, more features, etc. are likely to help.
– Error analysis: Manually examine the examples (in cross validation set) that your algorithm made errors on. See if you spot any systematic trend in what type of examples it is making errors on.
Error Analysis
500 examples in cross validation set
Algorithm misclassifies 100 emails.
Manually examine the 100 errors, and categorize them based on:
(i) What type of email it is
(ii) What cues (features) you think would have helped the algorithm classify them correctly
Type of email: Pharma / Replica-fake / Steal passwords / Other
Helpful cues: Deliberate misspellings (m0rgage, med1cine, etc.) / Unusual email routing / Unusual (spamming) punctuation
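A minimal sketch of how one might tally such an error analysis in code; the category labels below are hypothetical examples, not counts from the lecture:

from collections import Counter

# One hand-assigned type label per misclassified email, plus cue labels where relevant.
error_types = ["pharma", "steal_passwords", "pharma", "replica", "other"]
cue_labels = ["misspelling", "unusual_routing", "misspelling", "spam_punctuation"]
print(Counter(error_types).most_common())  # which email type dominates the errors
print(Counter(cue_labels).most_common())   # which cue/feature would help most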
The importance of numerical evaluation
Should discount/discounts/discounted/discounting be treated as the same word?
• Can use “stemming” software (e.g. the “Porter stemmer”)
• Not always right, e.g. universe/university
Error analysis may not be helpful for deciding whether this is likely to improve performance. The only solution is to try it and see if it works.
Need numerical evaluation (e.g., cross validation error) of algorithm’s performance with and without stemming.
• Without stemming: Acc = 0.89
• With stemming: Acc = 0.92
• Distinguish upper vs. lower case (Mom/mom): Acc = 0.91
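A minimal sketch of the stemming step itself, using NLTK's Porter stemmer; the accuracy numbers above would come from cross-validation on the spam data, not from this snippet:

from nltk.stem import PorterStemmer

# Porter stemming collapses inflected word forms onto a common stem.
stemmer = PorterStemmer()
for word in ["discount", "discounts", "discounted", "discounting",
             "universe", "university"]:
    print(word, "->", stemmer.stem(word))
# discount/discounts/discounted/discounting all map to "discount";
# universe and university collapse to the same stem, a known failure case.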
Algorithm vs. Data
E.g. classify between confusable words: {to, two, too}, {then, than}.
“For breakfast I ate ___ eggs.”
Algorithms tried:
– Perceptron (logistic regression)
– Winnow
– Memory-based
– Naïve Bayes
[Banko and Brill, 2001]
“It’s not who has the best algorithm that wins.
It’s who has the most data.”
[Figure: accuracy vs. training set size (millions) for the four algorithms]
Large data rationale
Assume the features x ∈ R^n have sufficient information to predict y accurately.
Example: For breakfast I ate ____ eggs.
Counterexample: predict housing price from only size (square feet) and no other features.
Useful test: given the input x, can a human expert confidently predict y?
Large data rationale
Use a learning algorithm with many parameters (e.g. logistic regression/linear regression with many features; a neural network with many hidden units), so that training error will be small (low bias).
Use a very large training set (unlikely to overfit), so that training error ≈ test error, and hence test error will also be small (low variance).
Feature Pre-processing
Practical Advice for Applying Machine Learning
Some figures from How to Win a Data Science Competition, coursera.org
Preprocessing: feature scaling
• Some models are influenced by feature scaling, while others are not.
• Non-tree-based models can be easily influenced by feature scaling:
  • Linear models
  • Nearest neighbor classifiers
  • Neural networks
• Conversely, tree-based models are not influenced by feature scaling:
  • Decision trees
  • Random forests
  • Gradient boosted trees
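As a quick check of this, a hedged sketch comparing a k-NN classifier and a decision tree with and without standardization; the dataset and exact scores are illustrative choices, not from the lecture:

from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# k-NN accuracy typically changes a lot when features are rescaled;
# the decision tree is essentially unaffected.
X, y = load_wine(return_X_y=True)
models = [
    ("kNN, raw features", KNeighborsClassifier()),
    ("kNN, standardized", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ("tree, raw features", DecisionTreeClassifier(random_state=0)),
    ("tree, standardized", make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))),
]
for name, model in models:
    print(name, cross_val_score(model, X, y, cv=5).mean())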
Preprocessing: feature scaling
[Figures (coursera.org): behavior of tree-based vs. non-tree-based models under unequal feature scales]
Preprocessing: scaling
sklearn.preprocessing.MinMaxScaler
sklearn.preprocessing.StandardScaler
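A minimal sketch of what these two scalers do, on a made-up feature column:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [10.0], [100.0], [1000.0]])
print(MinMaxScaler().fit_transform(x).ravel())    # rescaled to the [0, 1] range
print(StandardScaler().fit_transform(x).ravel())  # rescaled to mean 0, std 1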
Preprocessing: outliers
[Figures (coursera.org): examples of outliers in feature values]
Preprocessing: rank
The rank transform removes the relative distances between feature values and replaces them with a consistent interval representing each value's rank (think: first, second, third, ...). This moves outliers closer to the other feature values (at a regular interval), and can be a valid approach for linear models, kNN, and neural networks.
rank( [0, -100, 1e5] ) == [1, 0, 2]
scipy.stats.rankdata
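A minimal sketch of the rank transform via scipy; note that rankdata returns 1-based ranks, so subtract 1 to match the 0-based example above:

import numpy as np
from scipy.stats import rankdata

x = np.array([0, -100, 1e5])
print(rankdata(x) - 1)  # -> [1., 0., 2.]; the outlier 1e5 is now simply "the largest"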
Preprocessing: log / sqrt
Log transforms can be useful for non-tree-based models. They pull large feature values away from the extremes and closer to the bulk of the feature values, and they also make values close to zero more distinguishable from each other.
• Log transform: np.log(1 + x)
• Raising to a power < 1: np.sqrt(x + 2/3)
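A minimal sketch of both transforms on a made-up, heavy-tailed feature:

import numpy as np

x = np.array([0.0, 1.0, 10.0, 1000.0])
print(np.log1p(x))         # log(1 + x): compresses large values, spreads out small ones
print(np.sqrt(x + 2 / 3))  # a power < 1 has a similar, milder effect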
Other useful feature transformations
sklearn.preprocessing contains many of these:
• Scaling sparse data
• Whitening
• Unit vector norm normalization
• Mapping to a uniform/Gaussian distribution
• Encoding categorical features (OrdinalEncoder, OneHotEncoder)
• Imputation of missing values
• Non-linear feature transforms (polynomial, etc.)
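A hedged sketch touching a few of the transforms listed above (toy data throughout; note that imputation actually lives in sklearn.impute rather than sklearn.preprocessing):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, QuantileTransformer

x_num = np.array([[1.0], [5.0], [np.nan], [100.0]])
x_cat = np.array([["red"], ["blue"], ["red"], ["green"]])

x_filled = SimpleImputer(strategy="median").fit_transform(x_num)           # missing values
print(QuantileTransformer(n_quantiles=4).fit_transform(x_filled).ravel())  # map to uniform
print(OneHotEncoder().fit_transform(x_cat).toarray())                      # categorical encoding
print(PolynomialFeatures(degree=2).fit_transform(x_filled))                # non-linear features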
Feature scaling: summary
• Numeric feature preprocessing is different for tree-based and non-tree-based models:
  • Tree-based models (e.g., decision trees) do not depend on scaling
  • Non-tree-based models (e.g., logistic regression, neural networks, k-NN) hugely depend on scaling
• The feature scaling techniques most often used are:
  • MinMaxScaler – to [0, 1]
  • StandardScaler – to mean == 0, std == 1
  • Rank – sets the spaces between sorted values to be equal
  • log(1 + x) and sqrt(1 + x)
  • other useful pre-processing transforms...
Learning with large datasets
Practical Advice for Applying Machine Learning
with slides from How to Win a Data Science Competition, coursera.org
Learning with large datasets
How much data should we use for training? m = 100,000,000, or m = 1,000? Suppose we train with gradient descent:
[Figures: error vs. training set size (learning curves)]
Linear regression with gradient descent
h_θ(x) = Σ_{j=0..n} θ_j x_j
J_train(θ) = (1/2m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))²
Repeat {
  θ_j := θ_j − α (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) x_j^(i)   (for every j = 0, …, n)
}
Batch gradient descent vs. Stochastic gradient descent
Batch gradient descent:
Repeat {
  θ_j := θ_j − α (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) x_j^(i)   (for every j)
}
Stochastic gradient descent:
cost(θ, (x^(i), y^(i))) = (1/2) (h_θ(x^(i)) − y^(i))²
J_train(θ) = (1/m) Σ_{i=1..m} cost(θ, (x^(i), y^(i)))
Stochastic gradient descent
1. Randomly shuffle (reorder) training examples
2. Repeat {
     for i := 1, …, m {
       θ_j := θ_j − α (h_θ(x^(i)) − y^(i)) x_j^(i)   (for every j)
     }
   }
Mini-batch gradient descent
• Batch gradient descent: use all m examples in each iteration
• Stochastic gradient descent: use 1 example in each iteration
• Mini-batch gradient descent: use b examples in each iteration
Mini-batch gradient descent
Say b = 10, m = 1000.
Repeat {
  for i := 1, 11, 21, 31, …, 991 {
    θ_j := θ_j − α (1/10) Σ_{k=i..i+9} (h_θ(x^(k)) − y^(k)) x_j^(k)   (for every j)
  }
}
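A minimal numpy sketch of mini-batch gradient descent for linear regression with b = 10; the synthetic data, learning rate, and epoch count are illustrative choices, not the lecture's:

import numpy as np

rng = np.random.default_rng(0)
m, b, alpha = 1000, 10, 0.05
X = np.c_[np.ones(m), rng.normal(size=(m, 2))]      # x_0 = 1 intercept term
true_theta = np.array([0.5, 2.0, -1.0])
y = X @ true_theta + rng.normal(scale=0.1, size=m)

theta = np.zeros(3)
for epoch in range(20):
    perm = rng.permutation(m)                       # randomly shuffle the examples
    for start in range(0, m, b):                    # sweep over mini-batches of size b
        idx = perm[start:start + b]
        grad = X[idx].T @ (X[idx] @ theta - y[idx]) / b
        theta -= alpha * grad                       # update using only b examples
print(theta)                                         # should be close to true_theta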
Checking for convergence
Batch gradient descent:
Plot J_train(θ) as a function of the number of iterations of gradient descent.
Stochastic gradient descent:
During learning, compute cost(θ, (x^(i), y^(i))) before updating θ using (x^(i), y^(i)).
Every 1000 iterations (say), plot cost(θ, (x^(i), y^(i))) averaged over the last 1000 examples processed by the algorithm.
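A minimal sketch of this convergence check for per-example SGD; the synthetic data and the block size of 1000 are illustrative:

import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(5000), rng.normal(size=(5000, 2))]
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=5000)

theta, alpha = np.zeros(3), 0.01
recent, curve = [], []
for i in rng.permutation(len(y)):                    # one shuffled pass over the data
    recent.append(0.5 * (X[i] @ theta - y[i]) ** 2)  # cost BEFORE the update
    theta -= alpha * (X[i] @ theta - y[i]) * X[i]    # stochastic gradient step
    if len(recent) == 1000:
        curve.append(np.mean(recent))                # one point on the convergence plot
        recent = []
print(curve)                                          # should trend downward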
Checking for convergence
Plot cost(θ, (x^(i), y^(i))), averaged over the last 1000 (say) examples.
[Figures: four example plots of the averaged cost vs. no. of iterations]
Stochastic gradient descent
1. Randomly shuffle dataset.
2. Repeat {
     for i := 1, …, m {
       θ_j := θ_j − α (h_θ(x^(i)) − y^(i)) x_j^(i)   (for every j)
     }
   }
Learning rate α is typically held constant. Can slowly decrease α over time if we want θ to converge (e.g. α = const1 / (iterationNumber + const2)).
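A minimal sketch of such a decaying schedule; const1 and const2 are hypothetical values that would be tuned per problem:

const1, const2 = 1.0, 50.0
for iteration_number in range(1, 6):
    alpha = const1 / (iteration_number + const2)  # slowly shrinks toward 0 so θ can converge
    print(iteration_number, alpha)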
Summary
• Machine learning system design
• How to improve a model’s performance?
• Feature pre-processing
• Learning with large datasets