
Practical Advice for Applying Machine Learning
Machine Learning Kate Saenko

Outline
Kate Saenko, CS542 Machine Learning
• Machine learning system design
• How to improve a model’s performance?
• Feature engineering/pre-processing
• Learning with large datasets

Machine learning system design
Practical Advice for Applying Machine Learning

Example: Building a spam classifier
From: cheapsales@buystufffromme.com To: ang@cs.stanford.edu
Subject: Buy now!
Deal of the week! Buy now!
Rolex w4tchs – $100
Med1cine (any kind) – $50
Also low cost M0rgages available.
From: Alfred Ng
To: ang@cs.stanford.edu Subject: Christmas dates?
Hey Andrew,
Was talking to Mom about plans for Xmas. When do you get off work. Meet Dec 22?
Alf
Andrew Ng

Example: Building a spam classifier
Supervised learning. x = features of email. y = spam (1) or not spam (0).
Features x: Choose 100 words indicative of spam/not spam.
From: cheapsales@buystufffromme.com To: ang@cs.stanford.edu
Subject: Buy now!
Deal of the week! Buy now!
Note: In practice, take the most frequently occurring n words (10,000 to 50,000) in the training set, rather than manually picking 100 words.
Andrew Ng
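
A minimal sketch of this feature construction, assuming scikit-learn; the two example emails and labels below are made up for illustration.

```python
# Sketch: bag-of-words features for a spam classifier (assumes scikit-learn;
# the example emails and labels are hypothetical).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

emails = [
    "Deal of the week! Buy now! Rolex w4tchs $100",
    "Was talking to Mom about plans for Xmas. Meet Dec 22?",
]
labels = [1, 0]  # 1 = spam, 0 = not spam

# Keep the n most frequent words in the training set as the vocabulary
# (in practice n is ~10,000-50,000; tiny here only for illustration).
vectorizer = CountVectorizer(max_features=10000, binary=True)
X = vectorizer.fit_transform(emails)   # one 0/1 indicator vector per email

clf = LogisticRegression().fit(X, labels)
```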

Example: Building a spam classifier
How should you spend your time to drive its error down?
– Collect lots of data
– Develop sophisticated features based on email routing information (from email header).
– Develop sophisticated features for message body, e.g. should “discount” and “discounts” be treated as the same word? How about “deal” and “Dealer”? Features about punctuation?
– Develop sophisticated algorithm to detect misspellings (e.g. m0rtgage, med1cine, w4tches.)
Andrew Ng

Recommended approach
– Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
– Plot learning curves to decide if more data, more features, etc. are likely to help.
– Error analysis: Manually examine the examples (in cross validation set) that your algorithm made errors on. See if you spot any systematic trend in what type of examples it is making errors on.
Andrew Ng

Error Analysis
500 examples in cross validation set
Algorithm misclassifies 100 emails.
Manually examine the 100 errors, and categorize them based on:
(i) What type of email it is:
– Pharma
– Replica/fake
– Steal passwords
– Other
(ii) What cues (features) you think would have helped the algorithm classify them correctly:
– Deliberate misspellings (m0rgage, med1cine, etc.)
– Unusual email routing
– Unusual (spamming) punctuation
Andrew Ng

The importance of numerical evaluation
Should discount/discounts/discounted/discounting be treated as the same word?
• Can use "stemming" software (e.g. the Porter stemmer)
• Not always right: e.g. universe/university map to the same stem.
Error analysis may not be helpful for deciding whether this is likely to improve performance. The only way to know is to try it and see if it works.
Need a numerical evaluation (e.g., cross-validation accuracy) of the algorithm's performance with and without stemming:
• Without stemming: Acc = 0.89
• With stemming: Acc = 0.92
• Also distinguish upper vs. lower case (Mom/mom): Acc = 0.91
Andrew Ng
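
A minimal sketch of such a with/without comparison, assuming scikit-learn and NLTK's Porter stemmer; load_spam_dataset is a hypothetical loader for a labeled email set.

```python
# Sketch: compare cross-validation accuracy with and without stemming
# (assumes scikit-learn and NLTK; load_spam_dataset is hypothetical).
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

emails, labels = load_spam_dataset()   # hypothetical: list of email bodies, 0/1 labels

stemmer = PorterStemmer()

def stemmed_analyzer(doc, base=CountVectorizer().build_analyzer()):
    # Tokenize as usual, then map discount/discounts/discounted -> "discount"
    return [stemmer.stem(token) for token in base(doc)]

plain   = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
stemmed = make_pipeline(CountVectorizer(analyzer=stemmed_analyzer),
                        LogisticRegression(max_iter=1000))

acc_plain   = cross_val_score(plain,   emails, labels, cv=5, scoring="accuracy").mean()
acc_stemmed = cross_val_score(stemmed, emails, labels, cv=5, scoring="accuracy").mean()
print(f"without stemming: {acc_plain:.3f}   with stemming: {acc_stemmed:.3f}")
```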

Algorithm vs. Data
E.g. classifying between confusable words: {to, two, too}, {then, than}
"For breakfast I ate ___ eggs."
Algorithms compared [Banko and Brill, 2001]:
– Perceptron (logistic regression)
– Winnow
– Memory-based
– Naïve Bayes
[Figure: test accuracy vs. training set size (millions of words) for all four algorithms]
"It's not who has the best algorithm that wins. It's who has the most data."

Large data rationale
Assume the features x ∈ Rⁿ contain sufficient information to predict y accurately. Example: "For breakfast I ate ____ eggs."
Counterexample: predicting housing price from only the size (feet²) and no other features.
Useful test: given the input x, can a human expert confidently predict y?

Large data rationale
Use a learning algorithm with many parameters (e.g. logistic regression/linear regression with many features; neural network with many hidden units).
Use a very large training set (unlikely to overfit)

Feature Pre-processing
Practical Advice for Applying Machine Learning
Some figures from How to Win a Data Science Competition, coursera.org

Preprocessing: feature scaling
• Some models are influenced by feature scaling, while others are not.
• Non-tree-based models can be easily influenced by feature scaling:
  • Linear models
  • Nearest neighbor classifiers
  • Neural networks
• Conversely, tree-based models are not influenced by feature scaling:
  • Decision trees
  • Random forests
  • Gradient boosted trees
(A quick demonstration of this difference follows below.)
Kate Saenko, CS542 Machine Learning
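
A sketch of the demonstration mentioned above, on synthetic data (assuming scikit-learn): blow up the scale of one feature and compare a decision tree with k-NN, with and without standardization.

```python
# Sketch: trees are insensitive to feature scale, k-NN is not
# (assumes scikit-learn; data is synthetic).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=5, shuffle=False, random_state=0)
X_bad = X.copy()
X_bad[:, -1] *= 1e6   # blow up the scale of an uninformative feature

for name, model in [("tree", DecisionTreeClassifier(random_state=0)),
                    ("knn",  KNeighborsClassifier())]:
    raw    = cross_val_score(model, X_bad, y, cv=5).mean()
    scaled = cross_val_score(make_pipeline(StandardScaler(), model), X_bad, y, cv=5).mean()
    # typically: tree accuracy barely changes; knn drops on raw features and recovers once rescaled
    print(f"{name}: raw={raw:.3f}  scaled={scaled:.3f}")
```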

Preprocessing: feature scaling
[Figure: "Tree-based models" vs. "Non-tree-based models" under feature scaling (coursera.org)]

Unequal feature scales
[Figure illustrating features on very different scales (coursera.org)]

Preprocessing: scaling
[Figures illustrating scaling transforms (coursera.org)]

Preprocessing: scaling
sklearn.preprocessing.MinMaxScaler
sklearn.preprocessing.StandardScaler
Kate Saenko, CS542 Machine Learning
coursera.org
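
A minimal usage sketch of the two scalers named above (scikit-learn, toy arrays): fit the scaler on training data only and reuse its statistics on new data.

```python
# Sketch: the two scalers named on the slide (assumes scikit-learn; toy data).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test  = np.array([[2.5, 500.0]])

# MinMaxScaler: rescales each feature to [0, 1] based on the training min/max.
minmax = MinMaxScaler().fit(X_train)
print(minmax.transform(X_train))
print(minmax.transform(X_test))    # reuse the training statistics on new data

# StandardScaler: rescales each feature to mean 0, standard deviation 1.
standard = StandardScaler().fit(X_train)
print(standard.transform(X_test))
```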

Preprocessing: outliers
[Figures on handling outliers (coursera.org)]
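
A common way to handle outliers for non-tree models (an assumption here, not taken from these slides) is to clip feature values to lower and upper percentiles, sometimes called winsorization; a minimal numpy sketch:

```python
# Sketch: clip ("winsorize") a feature to its 1st-99th percentiles
# (a common outlier treatment for non-tree models, assumed here; toy data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 2.5, 1.5, 1e6])     # one extreme outlier
lower, upper = np.percentile(x, [1, 99])
x_clipped = np.clip(x, lower, upper)
print(x_clipped)   # the 1e6 value is pulled back toward the rest of the data
```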

Preprocessing: rank
Rank transform removes the relative distance between feature values and replaces them with a consistent interval representing feature value ranking (think: first, second, third…). This moves outliers closer to other feature values (at a regular interval), and can be a valid approach for linear models, kNN, and neural networks.
rank([0, -100, 1e5]) == [2, 1, 3]   (ranks start at 1)
scipy.stats.rankdata
Kate Saenko, CS542 Machine Learning
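
A short usage sketch of scipy.stats.rankdata on toy data; the scheme for placing unseen values among the sorted training values is one simple option, not prescribed by the slide.

```python
# Sketch: rank transform with scipy (toy data).
import numpy as np
from scipy.stats import rankdata

x_train = np.array([0.0, -100.0, 1e5])
print(rankdata(x_train))            # [2. 1. 3.] -- ranks start at 1

# To give unseen values a consistent rank, place them among the sorted
# training values (a simple, assumed scheme; not the only option).
x_test = np.array([-5.0, 2e5])
print(np.searchsorted(np.sort(x_train), x_test))   # [1 3]
```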

Preprocessing: log / sqrt
Log transforms can be useful for non-tree-based models. They compress very large feature values, pulling extremes closer to the bulk of the data, and they make values near zero more distinguishable from each other.
log transform
np.log(1+x)
Raising to the power < 1:
np.sqrt(x + 2/3)
Kate Saenko, CS542 Machine Learning

Other useful feature transformations
sklearn.preprocessing contains many of these:
• Scaling sparse data
• Whitening
• Unit vector norm normalization
• Mapping to a uniform/Gaussian distribution
• Encoding categorical features (OrdinalEncoder, OneHotEncoder)
• Imputation of missing values
• Non-linear feature transforms (polynomial, etc.)
Kate Saenko, CS542 Machine Learning

Feature scaling: summary
• Numeric feature preprocessing is different for tree and non-tree models:
  • Tree-based models (e.g., decision trees) do not depend on scaling
  • Non-tree-based models (e.g., logistic regression, NN, k-NN) hugely depend on scaling
• Feature scaling techniques most often used are:
  • MinMaxScaler – to [0, 1]
  • StandardScaler – to mean == 0, std == 1
  • Rank – sets spaces between sorted values to be equal
  • log(1+x) and sqrt(1+x)
  • other useful pre-processing transforms…
Kate Saenko, CS542 Machine Learning

Learning with large datasets
Practical Advice for Applying Machine Learning
with slides from How to Win a Data Science Competition, coursera.org

Learning with large datasets
How much data should we use for training? m = 100,000,000? Or m = 1,000?
Suppose we are using gradient descent:
[Figure: two learning curves of error vs. training set size]
Andrew Ng

Linear regression with gradient descent
Hypothesis: h_θ(x) = Σ_{j=0..n} θ_j x_j
Cost function: J_train(θ) = (1/2m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))²
Repeat {
  θ_j := θ_j − α (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) x_j^(i)   (for every j = 0, …, n)
}
Andrew Ng

Batch gradient descent vs. stochastic gradient descent
Batch gradient descent:
Repeat {
  θ_j := θ_j − α (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) x_j^(i)   (for every j)
}
Stochastic gradient descent:
cost(θ, (x^(i), y^(i))) = ½ (h_θ(x^(i)) − y^(i))²
J_train(θ) = (1/m) Σ_{i=1..m} cost(θ, (x^(i), y^(i)))
Andrew Ng

Stochastic gradient descent
1. Randomly shuffle (reorder) the training examples
2. Repeat {
     for i = 1, …, m {
       θ_j := θ_j − α (h_θ(x^(i)) − y^(i)) x_j^(i)   (for every j)
     }
   }
Andrew Ng

Batch gradient descent: use all m examples in each iteration
Stochastic gradient descent: use 1 example in each iteration
Mini-batch gradient descent: use b examples in each iteration
Andrew Ng

Mini-batch gradient descent
Say b = 10, m = 1000.
Repeat {
  for i = 1, 11, 21, 31, …, 991 {
    θ_j := θ_j − α (1/10) Σ_{k=i..i+9} (h_θ(x^(k)) − y^(k)) x_j^(k)   (for every j)
  }
}
(A numpy sketch of these updates follows the summary.)
Andrew Ng

Checking for convergence
Batch gradient descent: plot J_train(θ) as a function of the number of iterations of gradient descent.
Stochastic gradient descent: during learning, compute cost(θ, (x^(i), y^(i))) before updating θ using (x^(i), y^(i)). Every 1000 iterations (say), plot cost(θ, (x^(i), y^(i))) averaged over the last 1000 examples processed by the algorithm.
Andrew Ng

Checking for convergence
Plot cost(θ, (x^(i), y^(i))), averaged over the last 1000 (say) examples.
[Figure: four example plots of the averaged cost vs. no. of iterations]
Andrew Ng

Stochastic gradient descent
1. Randomly shuffle the dataset.
2. Repeat {
     for i = 1, …, m {
       θ_j := θ_j − α (h_θ(x^(i)) − y^(i)) x_j^(i)   (for every j)
     }
   }
The learning rate α is typically held constant. We can slowly decrease α over time if we want θ to converge, e.g. α = const1 / (iterationNumber + const2).
Andrew Ng

Summary
• Machine learning system design
• How to improve a model's performance?
• Feature pre-processing
• Learning with large datasets
Kate Saenko, CS542 Machine Learning
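
To make the update rules above concrete, here is a minimal numpy sketch of mini-batch (and, with b = 1, stochastic) gradient descent for linear regression; the data and constants are synthetic and illustrative, not taken from the slides.

```python
# Sketch: stochastic / mini-batch gradient descent for linear regression
# (numpy only; synthetic data, illustrative constants).
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # x_0 = 1 intercept term
true_theta = np.array([4.0, 2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=m)

theta = np.zeros(n + 1)
b = 10                       # mini-batch size (b = 1 gives stochastic gradient descent)
const1, const2 = 5.0, 50.0   # learning-rate decay constants (illustrative)

for epoch in range(20):
    perm = rng.permutation(m)            # 1. randomly shuffle the training examples
    X_shuf, y_shuf = X[perm], y[perm]
    for t, i in enumerate(range(0, m, b)):
        Xb, yb = X_shuf[i:i + b], y_shuf[i:i + b]
        alpha = const1 / (epoch * (m // b) + t + const2)   # slowly decreasing learning rate
        grad = Xb.T @ (Xb @ theta - yb) / len(yb)          # (1/b) * sum of per-example gradients
        theta -= alpha * grad

print(theta)   # should be close to true_theta
```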