PowerPoint Presentation
Lecture 7: Introduction to Machine Learning
C.-C. Hung
Kennesaw State University
(Slides used in the classroom only)
Some slides are from Michael Scherger
*
Read chapters 18, 19, 20, and 21 in our textbook.
– What is machine learning?
– Supervised vs unsupervised learning
– Regression and classification
– Some basic algorithms
Slides modified from different websites
Lecture overview
A branch of artificial intelligence, concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data.
As intelligence requires knowledge, it is necessary for the computers to acquire knowledge.
What is machine learning?
*
What is Learning?
Herbert Simon: “Learning is any process by which a system improves performance from experience.”
What is the task for machine learning?
Classification and Regression
Problem solving / planning / control
*
Classification
Assign object/event to one of a given finite set of categories.
Medical diagnosis
Credit card applications or transactions
Fraud detection in e-commerce
Worm detection in network packets
Spam filtering in email
Recommended articles in a newspaper
Recommended books, movies, music, or jokes
Financial investments
DNA sequences
Spoken words
Handwritten letters
Astronomical images
*
Problem Solving / Planning / Control
Performing actions in an environment in order to achieve a goal.
Solving calculus problems
Playing checkers, chess, or backgammon
Balancing a pole
Driving a car
Flying a plane, helicopter, or rocket
Controlling an elevator
Controlling a character in a video game
Controlling a mobile robot
*
Learning and power law of practice
Computational studies of learning may help us understand learning in humans and other biological organisms.
The power law of practice states that the logarithm of the reaction time for a particular task decreases linearly with the logarithm of the number of practice trials taken. It is an example of the learning curve effect on performance (wikipedia).
(Plot: log(perf. time) vs. log(# training trials) is a straight line.)
*
Defining the Learning Task
Improve on task, T, with respect to
performance metric, P, based on experience, E.
T: Playing checkers
P: Percentage of games won against an arbitrary opponent
E: Playing practice games against itself
T: Recognizing hand-written words
P: Percentage of words correctly classified
E: Database of human-labeled images of handwritten words
T: Driving on four-lane highways using vision sensors
P: Average distance traveled before a human-judged error
E: A sequence of images and steering commands recorded while observing a human driver.
T: Categorize email messages as spam or legitimate.
P: Percentage of email messages correctly classified.
E: Database of emails, some with human-given labels
Supervised learning (learning with a teacher)
Supervised Machine Learning
Images in the training set must be annotated with the “correct answer” (label) that the model is expected to predict.
Label: Contains a motorbike
Recognition task and supervision
Slide credit: L. Lazebnik
*
Unsupervised learning (learning without a teacher)
Unsupervised Machine Learning
Machine learning problems
Clustering: find natural groups of samples in unlabelled data.
Density estimation: make a statistical model of the data (e.g., Gaussian, Parzen windows, K-NN, mixture density models).
Regression: fit lines or other functions to data (de-noising or compression).
Classification: find functions separating the classes (e.g., Support Vector Machine, SVM).
Clustering Strategies/Algorithms
K-means clustering algorithm
Iteratively re-assign points to the nearest cluster center.
Agglomerative clustering algorithm
Start with each point as its own cluster and iteratively merge the closest clusters.
Mean-shift clustering
Estimate modes of probability density function (pdf).
Spectral clustering
Split the nodes in a graph based on assigned links with similarity weights.
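A minimal NumPy sketch (not from the slides) of the k-means loop described above: iteratively re-assign points to the nearest cluster center and move each center to the mean of its points. It assumes NumPy is available and uses random initialization from the data.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Toy k-means: iteratively re-assign points to the nearest cluster center."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iters):
        # assignment step: index of the nearest center for each point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Example: two well-separated blobs of points
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, k=2)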
The machine learning framework
Apply a prediction function to a feature representation of the image to get the desired output:
f(image) = “apple”
f(image) = “tomato”
f(image) = “cow”
Slide credit: L. Lazebnik
*
The machine learning framework
y = f(x)   (y: output, f: prediction function, x: image features)
Training: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the prediction function f by minimizing the prediction error on the training set.
Testing: apply f to a never-before-seen test example x and output the predicted value y = f(x).
Slide credit: L. Lazebnik
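A toy illustration (not from the slides, assuming NumPy) of “estimate f by minimizing prediction error on the training set”, with a single numeric feature x and a linear prediction function:

import numpy as np

# training set {(x1, y1), ..., (xN, yN)}
x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([2.1, 3.9, 6.2, 8.1])

# choose f(x) = w * x + b and fit (w, b) by least squares on the training set
A = np.stack([x_train, np.ones_like(x_train)], axis=1)
w, b = np.linalg.lstsq(A, y_train, rcond=None)[0]

# testing: apply f to a never-before-seen example
x_test = 5.0
y_pred = w * x_test + b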
*
Steps
(Diagram) Training: Training Images + Training Labels → Image Features → Training → Learned model.
(Diagram) Testing: Test Image → Image Features → Learned model → Prediction.
Slide credit: D. Hoiem and L. Lazebnik
*
What are features?
Raw pixels
Histograms
SIFT descriptors
…
Slide credit: L. Lazebnik
*
Generative vs. Discriminative Classifiers
Generative Models
Represent both the data and the labels
Often, makes use of conditional independence and priors
Examples
Naïve Bayes classifier
Bayesian network
Models of data may apply to future prediction problems
Slide credit: D. Hoiem
Generative classifiers try to model the data. Discriminative classifiers try to predict the label.
*
Generative vs. Discriminative Classifiers
Discriminative Models
Learn to directly predict the labels from the data
Often, assume a simple boundary (e.g., linear)
Examples
Logistic regression
SVM
Boosted decision trees
Often easier to predict a label from the data than to model the data
Slide credit: D. Hoiem
Generative classifiers try to model the data. Discriminative classifiers try to predict the label.
*
Instance-Based Learning
Idea:
Similar examples have similar label.
Classify new examples like similar training examples.
Algorithm:
Given some new example x for which we need to predict its class y.
Find most similar training examples.
Classify x “like” these most similar examples.
Example: K-Nearest Neighbors (K-NN).
Questions:
How to determine similarity?
How many similar training examples to consider?
How to resolve inconsistencies among the training examples?
Nearest Neighbor (NN) Model
f(x) = label of the training example nearest to x
All we need is a distance function (i.e., similarity) for our inputs
No training required!
(Figure: a test example plotted among training examples from class 1 and training examples from class 2.)
Slide credit: L. Lazebnik
*
K-NN Classifiers
Learning by analogy:
Tell me who your friends are and I’ll tell you who you are
A new example is assigned to the most common class among the (K) examples that are most similar to it.
K-Nearest Neighbor (K-NN) Algorithm
To determine the class of a new example E:
Calculate the distance between E and all examples in the training set
Select K-nearest examples to E in the training set
Assign E to the most common class among its K-nearest neighbors
(Figure: the new example’s nearest neighbors carry labels Response and No response; the assigned class here is Response.)
K-NN Classifier
Each example is represented with a set of numerical attributes
“Closeness” is defined in terms of the Euclidean distance between two examples.
The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:
D(X, Y) = sqrt[ (x1 - y1)² + (x2 - y2)² + … + (xn - yn)² ]
Distance(John, Rachel) = sqrt[ (35 - 41)² + (95 - 215)² + (3 - 2)² ]   (income measured in $K)
John:
Age=35
Income=95K
No. of credit cards=3
Rachel:
Age=41
Income=215K
No. of credit cards=2
Distance Between Neighbors
K-NN: Instance-Based Learning
No model is built: Store all training examples
Any processing is delayed until a new instance must be classified.
Example: 3-NN
(Figure: the new example’s three nearest neighbors carry labels Response and No response; the majority label among them gives the class: Respond.)
K-NN Classifier
Customer   Age   Income   No. of credit cards   Response
John        35    35K      3                     No
Rachel      22    50K      2                     Yes
Hannah      63    200K     1                     No
Tom         59    170K     1                     No
Nellie      25    40K      4                     Yes
David       37    50K      2                     ?
Example: 3-NN Classifier
Customer   Age   Income (K)   No. of cards   Response   Distance from David
John        35    35           3              No         sqrt[ (35 - 37)² + (35 - 50)² + (3 - 2)² ] = 15.16
Rachel      22    50           2              Yes        sqrt[ (22 - 37)² + (50 - 50)² + (2 - 2)² ] = 15
Hannah      63    200          1              No         sqrt[ (63 - 37)² + (200 - 50)² + (1 - 2)² ] = 152.23
Tom         59    170          1              No         sqrt[ (59 - 37)² + (170 - 50)² + (1 - 2)² ] = 122
Nellie      25    40           4              Yes        sqrt[ (25 - 37)² + (40 - 50)² + (4 - 2)² ] = 15.74
David       37    50           2              ?
The three nearest neighbors of David are Rachel (15), John (15.16), and Nellie (15.74), with labels Yes, No, Yes, so the 3-NN prediction for David is: Yes.
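A short sketch (not from the slides, assuming NumPy) that reproduces the 3-NN calculation above for David:

import numpy as np

# (age, income in $K, number of credit cards), response
train = {
    "John":   ((35,  35, 3), "No"),
    "Rachel": ((22,  50, 2), "Yes"),
    "Hannah": ((63, 200, 1), "No"),
    "Tom":    ((59, 170, 1), "No"),
    "Nellie": ((25,  40, 4), "Yes"),
}
david = np.array([37, 50, 2])

# Euclidean distance from David to every training example
dists = {name: float(np.linalg.norm(np.array(x) - david))
         for name, (x, _) in train.items()}
# -> Rachel 15.0 < John 15.2 < Nellie 15.7 < Tom < Hannah

# 3 nearest neighbors and their majority label
nearest = sorted(dists, key=dists.get)[:3]        # ['Rachel', 'John', 'Nellie']
labels = [train[name][1] for name in nearest]     # ['Yes', 'No', 'Yes']
prediction = max(set(labels), key=labels.count)   # 'Yes'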
Distance Metrics
*
Minkowski distance: the general Lp norm (Manhattan/L1 distance for p = 1, Euclidean/L2 distance for p = 2).
Summary: k-NN Algorithm
To determine the class of a new sample P,
Step 1: Define an initial value of k and a set of sample pixels (N), where N is greater than k.
Step 2: Calculate the distance (i.e. feature similarity) between sample P and each of a set of samples (N).
Step 3: Select k-nearest sample pixels to P in the dataset.
Step 4: Assign P to the most common class among its k-nearest neighbors.
Curse-of-Dimensionality
Prediction accuracy can quickly degrade when the number of attributes grows.
Irrelevant attributes easily “swamp” information from relevant attributes
When many irrelevant attributes, similarity/distance measure becomes less reliable
Remedy
Try to remove irrelevant attributes in pre-processing step
Weight attributes differently
Increase k (but not too much)
*
Prediction accuracy can quickly degrade when number of attributes grows.
Imagine attribute vector (y,r1,r2,r3,…,rN),
first feature always tells the label
r1,..rn take random binary values.
What is the probability that nearest neighbor has same label? P(y=yNN)
Similarity measure is number of mismatches.
Assume we have only two training examples, one for each label. Both training examples have distance N+1 (all bits different)
(-1,+1,-1,+1,-1,+1,…,-1)
(+1,-1,+1,-1,+1,-1,…,+1)
If N=0, then P(y=yNN)=1
If N=2, then P(y=yNN)=3/4
need both random attributes to match the wrong examples
(4 outcomes on random attributes, and only one leads to a mistake)
If N=4, then P(y=yNN)=11/16
need three random attributes to match the wrong examples
(16 outcomes on random attributes, and 5 lead to a mistake)
If N=100, then P(y=yNN) almost 1/2
need N/2+1 random attributes to match the wrong examples
For every random attribute, we need exponentially more training examples.
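A quick Monte Carlo check of this argument (a sketch, not from the slides; it assumes NumPy and the setup above: two training examples with complementary bits and a query whose random attributes are drawn uniformly):

import numpy as np

def p_correct(N, trials=200_000, seed=0):
    """Estimate P(nearest neighbor has the correct label) with N random attributes."""
    rng = np.random.default_rng(seed)
    # mismatches with the correct example ~ Binomial(N, 1/2);
    # distance to the wrong example is 1 (label bit) + (N - mismatches)
    k = rng.binomial(N, 0.5, size=trials)
    correct = k < 1 + (N - k)          # strictly closer to the correct example
    ties = k == 1 + (N - k)            # break ties at random
    return correct.mean() + 0.5 * ties.mean()

for N in (0, 2, 4, 100):
    print(N, round(p_correct(N), 3))   # ~1.0, 0.75, 0.688, ~0.54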
Advantages and Disadvantages of K-NN
Advantages:
Simple to implement; requires little tuning; often performs quite well (try it first on a new learning problem).
Disadvantages:
Needs a distance/similarity measure and attributes that “match” the target function (predicting a person’s height may depend on different attributes than predicting their IQ).
For large training sets, every classification requires a pass through the entire dataset, which can be prohibitive.
Prediction accuracy can quickly degrade when the number of attributes grows.
*
Generalization
How well does a learned model generalize from the data it was trained on to a new test set?
Training set (labels known)
Test set (labels unknown)
Slide credit: L. Lazebnik
*
Generalization
Components of generalization error
Bias: how much does the average model, over all training sets, differ from the true model?
Error due to inaccurate assumptions/simplifications made by the model.
Variance: how much do models estimated from different training sets differ from each other?
Slide credit: L. Lazebnik
*
Generalization
Underfitting: model is too “simple” to represent all the relevant class characteristics
High bias and low variance
High training error and high test error
Overfitting: model is too “complex” and fits irrelevant characteristics (noise) in the data
Low bias and high variance
Low training error and high test error
Slide credit: L. Lazebnik
*
No Free Lunch Theorem
No classifier is inherently better than any other: averaged over all possible problems, all learning algorithms perform equally well, so you can only get generalization through assumptions that fit your problem.
Slide credit: L. Lazebnik
*
Bias-Variance Trade-off
Models with too few parameters are inaccurate because of a large bias (not enough flexibility).
Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample).
Slide credit: D. Hoiem
http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_bias_variance.htm
*
Bias-Variance Trade-off
E(MSE) = noise2 + bias2 + variance
See the following for explanations of bias-variance (also Bishop’s “Neural Networks” book):
http://www.inf.ed.ac.uk/teaching/courses/mlsc/Notes/Lecture4/BiasVariance.pdf
Unavoidable error
Error due to incorrect assumptions
Error due to variance of training samples
Slide credit: D. Hoiem
From the link above:
You can only get generalization through assumptions. Thus, in order to minimize the MSE, we need to minimize both the bias and the variance; however, this is not trivial to do. For instance, neglecting the input data and predicting the output somehow (e.g., just a constant) would certainly minimize the variance of our predictions: they would always be the same, so the variance would be zero, but the bias of our estimate (i.e., the amount by which we are off the real function) would be tremendously large. On the other hand, a neural network could perfectly interpolate the training data, i.e., predict y = t for every data point. This makes the bias term vanish entirely, since E(y) = f (insert this into the squared-bias term above to verify), but the variance term becomes equal to the variance of the noise, which may be significant (see also Bishop, Chapter 9, and the Geman et al. paper). In general, finding an optimal bias-variance tradeoff is hard, but acceptable solutions can be found, e.g., by means of cross-validation or regularization.
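A small simulation (not from the slides, assuming NumPy) of the decomposition E(MSE) = noise² + bias² + variance, estimating bias² and variance for polynomial fits of different degrees to a noisy sine function:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
f_true = np.sin(2 * np.pi * x)   # the true function
sigma = 0.3                      # noise standard deviation

def bias_variance(degree, n_datasets=500):
    """Average squared bias and variance of degree-d polynomial fits at the points x."""
    preds = []
    for _ in range(n_datasets):
        y = f_true + rng.normal(0, sigma, size=x.shape)   # a fresh training set
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - f_true) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

for d in (1, 3, 9):
    b2, var = bias_variance(d)
    print(f"degree {d}: bias^2 ~= {b2:.3f}, variance ~= {var:.3f}, noise^2 = {sigma**2:.3f}")
# low degree: high bias, low variance; high degree: low bias, higher variance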
*
Bias-variance tradeoff
(Figure: training error and test error vs. model complexity; low complexity gives high bias and low variance (underfitting), high complexity gives low bias and high variance (overfitting).)
Slide credit: D. Hoiem
*
Bias-variance tradeoff
(Figure: test error vs. model complexity, plotted for many training examples and for few training examples, spanning the high-bias/low-variance to low-bias/high-variance regimes.)
Slide credit: D. Hoiem
*
Effect of Training Size
(Figure: for a fixed prediction model, training error and testing error vs. the number of training examples; the gap between the two curves is the generalization error.)
Slide credit: D. Hoiem
*
The perfect classification algorithm
Objective function: encodes the right loss for the problem.
Parameterization: makes assumptions that fit the problem.
Regularization: right level of regularization for amount of training data.
Regularization is used to prevent the model from overfitting the training sample; it adds a penalty term to the error function that is minimized.
Slide credit: L. Lazebnik
*
The perfect classification algorithm
Training algorithm: can find parameters that maximize objective function on training set.
Inference algorithm: can solve for objective function in evaluation.
Slide credit: L. Lazebnik
*
Remember…
No classifier is inherently better than any other: you need to make assumptions to generalize
Three kinds of error
Inherent: unavoidable
Bias: due to over-simplifications
Variance: due to inability to perfectly estimate parameters from limited data
Slide credit: D. Hoiem
Slide credit: D. Hoiem
You can only get generalization through assumptions.
*
How to reduce variance?
Choose a simpler classifier
Regularize the parameters
Get more training data
Slide credit: D. Hoiem
The simpler classifier or regularization could increase bias and lead to more error.
*
Regularization
The regularization parameter (lambda) is an input to the model, so what we usually want to know is how to select the value of lambda.
The regularization parameter reduces overfitting, which reduces the variance of our estimated regression parameters; however, it does this at the expense of adding bias to our estimate.
Increasing lambda results in less overfitting but also greater bias. So the real question is “How much bias are we willing to tolerate in the estimate?”
Slide credit: D. Hoiem
The simpler classifier or regularization could increase bias and lead to more error.
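One common way to pick lambda is cross-validation. A sketch (not from the slides) using scikit-learn’s ridge regression; it assumes scikit-learn and NumPy are installed and uses synthetic data:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
w_true = np.array([3.0, -2.0] + [0.0] * 8)
y = X @ w_true + rng.normal(scale=1.0, size=60)

# try several values of the regularization parameter and keep the best CV score
for lam in (0.01, 0.1, 1.0, 10.0, 100.0):
    scores = cross_val_score(Ridge(alpha=lam), X, y, cv=5)   # R^2 by default
    print(f"lambda={lam:6.2f}  mean CV R^2 = {scores.mean():.3f}")
# small lambda: low bias, higher variance; large lambda: more bias, less variance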
*
Regression
Linear Regression
Ordinary Least Squares (OLS): fit a linear model by minimizing the sum of squared residuals.
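A minimal OLS sketch (not from the slides, assuming NumPy): fit y ≈ Xβ by minimizing the sum of squared residuals, here via a least-squares solver rather than the explicit normal equations:

import numpy as np

# toy data: y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=50)

# design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# OLS estimate: beta = (X^T X)^{-1} X^T y
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta          # close to (2, 3)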
An Example for Regression
Predict the rental price (target/label) of a new sample from its features:
Sample   Size (Feature #1)   Floor (Feature #2)   Broadband Rate (Feature #3)   Energy Expense $ (Feature #4)   Rental Price $ (Target/Label)
1        500                 4                    8                             100                             320
2        630                 5                    24                            120                             400
3        700                 4                    8                             180                             410
4        880                 12                   50                            230                             600
5        920                 14                   8                             250                             570
6        1000                9                    24                            300                             620
New      820                 6                    30                            200                             ?  (the answer comes from the model)
*
Another Example for Regression
Example: Given Teen Birth Rate and Poverty Level
https://online.stat.psu.edu/stat462/node/101/
The plot of the data (birth rate on the vertical axis) shows a generally linear relationship, on average, with a positive slope. As the poverty level increases, the birth rate for 15-to-17-year-old females tends to increase as well.
*
Logistic Regression
Probabilistic Approach
Understanding the sigmoid: σ(z) = 1 / (1 + e^(-z))
Logistic Regression Model: assume the log-odds is linear in x, so that
P(y = 1 | x) = 1 / (1 + exp(-(w·x + b)))
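A small sketch of the model (not from the slides, assuming NumPy): the sigmoid squashes the linear score w·x + b into a probability, and the label is predicted by thresholding at 0.5. The parameter values here are made up for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """P(y = 1 | x) = sigmoid(w . x + b) for each row of X."""
    return sigmoid(X @ w + b)

# example with made-up parameters
X = np.array([[0.5, 1.2], [-1.0, 0.3]])
w = np.array([2.0, -1.0])
b = 0.1
probs = predict_proba(X, w, b)        # probabilities in (0, 1)
labels = (probs >= 0.5).astype(int)   # predicted class labels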
L1 vs. L2
While practicing machine learning, you may have come across a choice between the mysterious “L1” and “L2”.
Usually the two decisions are:
1) L1-norm vs. L2-norm loss function; and
2) L1-regularization vs. L2-regularization.
Slide credit: D. Hoiem
*
Regularized Linear Models:
L1-Norm vs. L2-Norm
Ridge regression and lasso regression are two different techniques for increasing the robustness against colinearity of ordinary least squares regression.
Both of these algorithms attempt to minimize a cost function. The cost is a function of two terms: one, the residual sum of squares (RSS), taken from ordinary least squares; the other, an additional regularizer penalty. The second term is an L2-norm in ridge regression, and an L1-norm in lasso regression.
Slide credit: D. Hoiem
The simpler classifier or regularization could increase bias and lead to more error.
*
L1: As an error function
L1-norm loss function is also known as least absolute deviations (LAD), least absolute errors (LAE).
It is basically minimizing the sum of the absolute differences (S) between the target values (Yi) and the estimated values f(xi):
S = Σ | Yi - f(xi) |
Slide credit: D. Hoiem
*
L2: As an error function
L2-norm loss function is also known as least squares error (LSE).
It is basically minimizing the sum of the squares of the differences (S) between the target values (Yi) and the estimated values f(xi):
S = Σ ( Yi - f(xi) )²
Slide credit: D. Hoiem
*
L1-norm vs. L2-norm: as an error function
The differences between the L1-norm and the L2-norm as a loss function can be summarized as follows: the L1 loss penalizes errors linearly and is therefore more robust to outliers, while the L2 loss penalizes large errors quadratically, is more sensitive to outliers, but has a smooth, unique minimizer.
Slide credit: D. Hoiem
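A tiny numeric comparison of the two loss functions on data with one outlier (a sketch, not from the slides, assuming NumPy; the constant prediction minimizing the L1 loss is the median, and the one minimizing the L2 loss is the mean):

import numpy as np

y = np.array([1.0, 1.1, 0.9, 1.0, 10.0])   # the last value is an outlier

def l1_loss(pred):  # least absolute deviations
    return np.sum(np.abs(y - pred))

def l2_loss(pred):  # least squares error
    return np.sum((y - pred) ** 2)

best_l1 = np.median(y)   # 1.0 -> barely affected by the outlier
best_l2 = np.mean(y)     # 2.8 -> pulled toward the outlier
print(l1_loss(best_l1), l2_loss(best_l2))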
*
L1 and L2: As Regularization
Regularization is a very important technique in machine learning to prevent overfitting.
Mathematically speaking, it adds a regularization term in order to prevent the coefficients from fitting the training data so perfectly that the model overfits. The difference between L1 and L2 is that the L2 penalty is the sum of the squares of the weights, while the L1 penalty is the sum of the absolute values of the weights:
Slide credit: D. Hoiem
*
L1 Regularization on least squares (the lasso): minimize Σ ( Yi - f(xi) )² + λ Σ | wj |
Slide credit: D. Hoiem
*
L2 Regularization on least squares (ridge regression): minimize Σ ( Yi - f(xi) )² + λ Σ wj²
Slide credit: D. Hoiem
*
L1 vs. L2: As Regularization
The difference between their properties can be summarized as follows:
Sparsity means that only very few entries in a matrix (or vector) are non-zero. The L1 penalty tends to produce many coefficients that are zero or very small with only a few large coefficients (built-in feature selection), while the L2 penalty shrinks all coefficients but rarely drives them exactly to zero.
Slide credit: D. Hoiem
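A quick illustration of the sparsity property using scikit-learn’s Ridge and Lasso (a sketch, not from the slides; it assumes scikit-learn and NumPy and uses synthetic data where only 2 of 10 features matter):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.array([3.0, -2.0] + [0.0] * 8)      # only the first two features matter
y = X @ w_true + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.round(ridge.coef_, 2))   # all coefficients small but non-zero (L2 shrinks)
print(np.round(lasso.coef_, 2))   # most coefficients exactly zero (L1 is sparse)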
*
More on classification
Assign input vector to one of two or more classes
Any decision rule divides input space into decision regions separated by decision boundaries
Slide credit: L. Lazebnik
*
Nearest Neighbor Classifier
Assign label of nearest training data point to each test data point
Voronoi partitioning of feature space
for two-category 2D and 3D data
from Duda et al.
Source: D. Lowe
*
Voronoi Diagram
Voronoi Diagram: the partitioning of a plane with points into convex polygons such that each polygon contains exactly one generating point and every point in a given polygon is closer to its generating point than to any other.
A Voronoi diagram is sometimes also known as a Dirichlet tessellation.
Source: D. Lowe
*
Voronoi diagrams have linear complexity {|v|, |e| = O(n)}
Intuition: Not all bisectors are Voronoi edges!
(Figure: e denotes a Voronoi edge; the points pi are the site points.)
Nearest Neighbor Classifier
Assign label of nearest training data point to each test data point.
Assume that there are 5 training data points. How do we draw a Voronoi diagram for nearest neighbor classifier?
Source: D. Lowe
*
Voronoi Diagram: Nearest Neighbor Classifier
Source: D. Lowe
*
K-NN Exercise
Given a query patient with the feature vector (high, yes, no, no), which category should we assign using 3-NN?
*
Distance metric: Number of matching attributes
Example 1:
Most similar example: number 2 (1 mismatch, 4 match) -> classify yes
Second most similar example: number 1 (2 mismatch, 3 match) -> classify yes
Example 2:
Most similar example: number 3 (1 mismatch, 4 match) -> classify no
Second most similar example: number 1 (2 mismatch, 3 match) -> classify yes/no
Summary: Classifiers
Nearest-neighbor and k-nearest-neighbor classifiers
L1, L2, χ2 distance, quadratic distance, histogram intersection
Support vector machines
Neural networks,
Decision trees,
Random Forests
Slide credit: D. Hoiem
Recap: K-NN
The k-NN classification rule is to assign to a test sample the majority category label of its k nearest training samples
In practice, k is usually chosen to be odd, so as to avoid ties
The k = 1 rule is generally called the nearest-neighbor classification rule
What to remember about classifiers
No free lunch: machine learning algorithms are tools, not dogmas.
Try simple classifiers first.
Better to have smart features and simple classifiers than simple features and smart classifiers.
Use increasingly powerful classifiers with more training data (bias-variance tradeoff).
Slide credit: D. Hoiem
Many classifiers to choose from
Support Vector Machine (SVM)
Neural networks
Naïve Bayes
Bayesian network
Logistic regression
Randomized Forests
Boosted Decision Trees
Restricted Boltzmann Machines (RBMs)
Convolutional Neural Networks (CNN)
etc.
Which is the best one?
Slide credit: D. Hoiem
Some Machine Learning References
General
Tom Mitchell, Machine Learning, McGraw Hill, 1997
Christopher Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995
Adaboost
Friedman, Hastie, and Tibshirani, “Additive logistic regression: a statistical view of boosting”, Annals of Statistics, 2000
SVMs
http://www.support-vector.net/icml-tutorial.pdf
Slide credit: D. Hoiem
Questions & Suggestions?
The End
*
Appendix
*
Using Logistic Regression
Quick, simple classifier (try it first).
Outputs a probabilistic label confidence.
Use L2 or L1 regularization.
L1 does feature selection and is robust to irrelevant features but slower to train
(http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/)
Slide credit: D. Hoiem
Generative classifiers try to model the data. Discriminative classifiers try to predict the label.
*
Parameters regularization
Very brief tour of some classifiers
K-nearest neighbor
SVM
Neural networks
Boosted Decision Trees
Naïve Bayes
Bayesian network
Logistic regression
Randomized Forests
RBMs
etc.
Classifiers: Logistic Regression
Maximize the likelihood of the label given the data, assuming a log-linear model.
(Figure: two classes, male and female, plotted by height (x1) against pitch of voice (x2) and separated by a linear decision boundary.)
Classifiers: Logistic Regression
Maximize likelihood of label given data, assuming a log-linear model.
Slide credit: D. Hoiem
Generative classifiers try to model the data. Discriminative classifiers try to predict the label.
*
k-NN Algorithm
Intuitively, the k-NN algorithm assigns to each new query instance the majority class among its k nearest neighbors
K-NN: Simple Illustration
q is + under 1-NN, but – under 5-NN
*
Why Study Machine Learning?
Cognitive Science
Computational studies of learning may help us understand learning in humans and other biological organisms.
Hebbian neural learning
“Neurons that fire together, wire together.”
Human’s relative difficulty of learning disjunctive concepts vs. conjunctive ones.
Power law of practice: log(perf. time) decreases linearly with log(# training trials).
*
History of Machine Learning
1950s
Samuel’s checker player
Selfridge’s Pandemonium
1960s:
Neural networks: Perceptron
Pattern recognition
Learning in the limit theory
Minsky and Papert prove limitations of Perceptron
1970s:
Symbolic concept induction
Winston’s arch learner
Expert systems and the knowledge acquisition bottleneck
Quinlan’s ID3
Michalski’s AQ and soybean diagnosis
Scientific discovery with BACON
Mathematical discovery with AM
*
History of Machine Learning (cont.)
1980s:
Advanced decision tree and rule learning
Explanation-based Learning (EBL)
Learning and planning and problem solving
Utility problem
Analogy
Cognitive architectures
Resurgence of neural networks (connectionism, backpropagation)
Valiant’s PAC Learning Theory
Focus on experimental methodology
1990s
Data mining
Adaptive software agents and web applications
Text learning
Reinforcement learning (RL)
Inductive Logic Programming (ILP)
Ensembles: Bagging, Boosting, and Stacking
Bayes Net learning
*
History of Machine Learning (cont.)
2000s
Support vector machines
Kernel methods
Graphical models
Statistical relational learning
Transfer learning
Sequence labeling
Collective classification and structured outputs
Computer Systems Applications
Compilers
Debugging
Graphics
Security (intrusion, virus, and worm detection)
Email management
Personalized assistants that learn
Learning in robotics and vision
Modeling image pixel labels as MRF (Ising)
Slides by R. Huang – Rutgers University
(Figure: a real image and its label image; gray circles are image pixels, white circles are the (segmented) labels, and the labels are inferred from the pixels via the Bayesian posterior.)
Given two events A and B, and suppose that Pr(A) > 0. Then
Pr(A | B) = Pr(B | A) Pr(A) / Pr(B)
Example:
Bayes’ Rule
R: It is a rainy day
W: The grass is wet
Pr(R | W) = ?
Pr(R) = 0.8
Pr(W | R):        R      ¬R
W                 0.7    0.4
¬W                0.3    0.6
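Filling in the question above with Bayes’ rule and the law of total probability (using the numbers given on this slide):
Pr(W) = Pr(W | R) Pr(R) + Pr(W | ¬R) Pr(¬R) = 0.7 × 0.8 + 0.4 × 0.2 = 0.64
Pr(R | W) = Pr(W | R) Pr(R) / Pr(W) = 0.56 / 0.64 = 0.875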
Bayes’ Rule
R: It rains
W: The grass is wet
Information Pr(W|R)
Inference Pr(R|W)
Pr(W | R):        R      ¬R
W                 0.7    0.4
¬W                0.3    0.6
(Diagram: R → W)
Bayes’ Rule
R: It rains
W: The grass is wet
Information: Pr(E|H)
Inference: Pr(H|E)
Pr(H | E) = Pr(E | H) Pr(H) / Pr(E)
posterior = likelihood × prior / evidence   (hypothesis H, evidence E)
Using Naïve Bayes
Simple thing to try for categorical data
Very fast to train/test
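A minimal sketch (not from the slides, standard library only) of training a naïve Bayes classifier on categorical data by counting, with add-r smoothing; the tiny dataset and the smoothing denominator are illustrative choices, not the lecture’s:

from collections import Counter, defaultdict

# toy categorical dataset: (outlook, windy) -> play?
data = [(("sunny", "no"), "yes"), (("sunny", "yes"), "no"),
        (("rainy", "no"), "yes"), (("rainy", "yes"), "no"),
        (("sunny", "no"), "yes")]

# training = counting
class_counts = Counter(label for _, label in data)
feat_counts = defaultdict(Counter)            # (feature index, label) -> value counts
for x, label in data:
    for j, v in enumerate(x):
        feat_counts[(j, label)][v] += 1

def predict(x, r=1.0):
    """Pick the class maximizing P(y) * prod_j P(x_j | y), with add-r smoothing."""
    best, best_score = None, -1.0
    for label, cy in class_counts.items():
        score = cy / len(data)
        for j, v in enumerate(x):
            counts = feat_counts[(j, label)]
            # smooth over the seen values plus one bucket for unseen values
            score *= (counts[v] + r) / (cy + r * (len(counts) + 1))
        if score > best_score:
            best, best_score = label, score
    return best

print(predict(("rainy", "no")))   # 'yes'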
Classifiers: Linear SVM
Find a linear function to separate the classes:
f(x) = sgn(w · x + b)
(Figure: two classes, x and o, in the (x1, x2) plane.)
Classifiers: Linear SVM
Find a linear function to separate the classes:
f(x) = sgn(w · x + b)
(Figure: the separating line that maximizes the margin between the two classes.)
*
Classifiers: Linear SVM
Find a linear function to separate the classes:
f(x) = sgn(w · x + b)
(Figure: two classes, x and o, in the (x1, x2) plane, separated by the learned linear function.)
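A sketch of fitting such a linear separator with scikit-learn (not from the slides; assumes scikit-learn and NumPy are installed). LinearSVC learns w and b, and prediction is sgn(w · x + b):

import numpy as np
from sklearn.svm import LinearSVC

# two linearly separable classes in the (x1, x2) plane
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=(-2, -2), size=(20, 2)),
               rng.normal(loc=(+2, +2), size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = LinearSVC(C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

x_new = np.array([1.0, 1.5])
pred = np.sign(w @ x_new + b)     # same result as clf.predict([x_new])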
*
Nonlinear SVMs
Datasets that are linearly separable work out great. But what if the dataset is just too hard? We can map it to a higher-dimensional space:
(Figure: 1-D data on the x axis that is not linearly separable becomes separable after the mapping x → (x, x²).)
Slide credit: Andrew Moore
*
Nonlinear SVMs
General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable, via a mapping Φ: x → φ(x).
Slide credit: Andrew Moore
*
Nonlinear SVMs
The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that
K(xi , xj) = φ(xi ) · φ(xj)
(to be valid, the kernel function must satisfy Mercer’s condition)
This gives a nonlinear decision boundary in the original feature space:
Σi αi yi K(xi, x) + b = 0
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
*
Nonlinear kernel: Example
Consider the mapping φ(x) = (x, x²). Then
φ(x) · φ(y) = xy + x²y² = K(x, y),
so the kernel K(x, y) = xy + x²y² computes the dot product in the lifted space without ever forming φ(x) explicitly.
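A two-line numeric check of this identity (a sketch, not from the slides, assuming NumPy):

import numpy as np

def phi(x):                      # explicit lifting x -> (x, x^2)
    return np.array([x, x ** 2])

def K(x, y):                     # kernel computed without lifting
    return x * y + (x ** 2) * (y ** 2)

x, y = 1.5, -2.0
assert np.isclose(phi(x) @ phi(y), K(x, y))   # same value either way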
*
Kernels for bags of features
Histogram intersection kernel: I(h1, h2) = Σi min( h1(i), h2(i) )
Generalized Gaussian kernel: K(h1, h2) = exp( -(1/A) D(h1, h2)² )
D can be the (inverse) L1 distance, Euclidean distance, χ2 distance, etc.; A is a normalization constant.
J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, Local Features and Kernels for Classifcation of Texture and Object Categories: A Comprehensive Study, IJCV 2007
*
Summary: SVMs for image classification
Pick an image representation (in our case, bag of features)
Pick a kernel function for that representation
Compute the matrix of kernel values between every pair of training examples
Feed the kernel matrix into your favorite SVM solver to obtain support vectors and weights
At test time: compute kernel values for your test example and each support vector, and combine them with the learned weights to get the value of the decision function
Slide credit: L. Lazebnik
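A sketch of that pipeline with scikit-learn’s SVC and a precomputed kernel matrix (not from the slides; assumes scikit-learn and NumPy, and an RBF kernel stands in here for a bag-of-features kernel):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 5))
y_train = rng.integers(0, 2, size=40)
X_test = rng.normal(size=(10, 5))

def kernel_matrix(A, B, gamma=0.5):
    """K[i, j] = exp(-gamma * ||A_i - B_j||^2) between every pair of examples."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

# 1) kernel values between every pair of training examples
K_train = kernel_matrix(X_train, X_train)
clf = SVC(kernel="precomputed").fit(K_train, y_train)

# 2) at test time: kernel values between each test example and the training examples
K_test = kernel_matrix(X_test, X_train)
pred = clf.predict(K_test)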
*
What about multi-class SVMs?
Unfortunately, there is no “definitive” multi-class SVM formulation.
In practice, we have to obtain a multi-class SVM by combining multiple two-class SVMs.
Slide credit: L. Lazebnik
*
What about multi-class SVMs?
One vs. others
Training: learn an SVM for each class vs. the others
Testing: apply each SVM to test example and assign to it the class of the SVM that returns the highest decision value
One vs. one
Training: learn an SVM for each pair of classes
Testing: each learned SVM “votes” for a class to assign to the test example
Slide credit: L. Lazebnik
*
SVMs: Pros and cons
Pros
Many publicly available SVM packages:
http://www.kernel-machines.org/software
Kernel-based framework is very powerful, flexible.
SVMs work very well in practice, even with very small training sample sizes.
*
SVMs: Pros and cons
Cons
No “direct” multi-class SVM, must combine two-class SVMs.
Computation, memory
During training time, must compute matrix of kernel values for every pair of examples.
Learning can take a very long time for large-scale problems.
*
Two ways to think about classifiers
What is the objective? What are the parameters? How are the parameters learned? How is the learning regularized? How is inference performed?
How is the data modeled? How is similarity defined? What is the shape of the boundary?
Slide credit: D. Hoiem
Comparison
Naïve Bayes. Learning objective: maximize Σi [ Σj log P(xij | yi; θ) + log P(yi; θ0) ]. Training: counting with smoothing, e.g. θkj = ( Σi δ(xij = k ∧ yi = 1) + r ) / ( Σi δ(yi = 1) + K r ), assuming x in {0, 1}. Inference: classify as positive if θ1·x + θ0·(1 - x) > 0, where θ1j = log [ P(xj = 1 | y = 1) / P(xj = 1 | y = 0) ] and θ0j = log [ P(xj = 0 | y = 1) / P(xj = 0 | y = 0) ].
Logistic Regression. Learning objective: maximize Σi log P(yi | xi, θ), where P(yi | xi, θ) = 1 / (1 + exp(-yi θ·xi)). Training: gradient ascent. Inference: θ·x > 0.
Linear SVM. Learning objective: minimize ½ ||θ||² + λ Σi ξi such that yi θ·xi ≥ 1 - ξi for all i. Training: linear programming (or quadratic programming). Inference: θ·x > 0.
Kernelized SVM. Learning objective: complicated to write. Training: quadratic programming. Inference: Σi yi αi K(x̂, xi) > 0.
Nearest Neighbor. Learning objective: none (“most similar features, same label”). Training: record the data. Inference: ŷ = yi, where xi is the training example most similar to x̂.
Slide credit: D. Hoiem
*
K-nearest neighbor (K-NN)
1-nearest neighbor (1-NN), 3-nearest neighbor (3-NN), 5-nearest neighbor (5-NN)
(Figures: the same two-class data, x and o, in the (x1, x2) plane; a query point + is classified by its 1, 3, or 5 nearest neighbors.)
Using K-NN
Simple, a good one to try first.
With infinite examples, 1-NN provably has an error rate that is at most twice the Bayes-optimal error.
*
Why Study Machine Learning?
The Time is Ripe
Many basic effective and efficient algorithms available.
Large amounts of on-line data available.
Large amounts of computational resources available.