Jean Honorio
Purdue University
(originally prepared by Tommi Jaakkola, MIT CSAIL)
CS373 Data Mining and Machine Learning
Lecture 1
Course topics
• Supervised learning
- linear and non-linear classifiers, kernels
- rating, ranking, collaborative filtering
- model selection, complexity, generalization
- Conditional Random Fields, structured prediction
• Unsupervised learning, modeling
- mixture models, topic models
- Hidden Markov Models
- Bayesian networks
- Markov Random Fields, factor graphs
Machine learning
• Learning from examples
[Figure: a training set of websites, each labeled +1 or −1, and a new website whose label (?) is to be predicted]
• The training set of labeled examples specifies the learning task only implicitly.
• Our goal is to accurately label new websites that were not part of the training set.
• classifier = mapping from websites to (predicted) labels
“Examples”
• We first have to represent the examples (websites) in a manner that can be easily mapped to labels.
[Figure: a news article (“White House officials consulted with the Justice Department in preparing a list of U.S. attorneys who would be removed.” NYT 03/13/07) is reduced to a bag of words, and then to a vector of counts]
• counts: a vector whose coordinates (features) specify how many times (or whether) particular words appeared in the article
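To make the representation concrete, here is a minimal Python sketch (not from the slides; the tokenizer and the tiny vocabulary are simplified, illustrative choices) of mapping a document to a bag-of-words count vector:

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Map a document to a vector of word counts over a fixed vocabulary."""
    tokens = text.lower().split()          # a deliberately simple tokenizer
    counts = Counter(tokens)
    # one coordinate (feature) per vocabulary word, in a fixed order
    return [counts[word] for word in vocabulary]

article = ("White House officials consulted with the Justice Department "
           "in preparing a list of U.S. attorneys who would be removed.")
vocabulary = ["white", "house", "officials", "justice", "attorneys", "budget"]
print(bag_of_words(article, vocabulary))   # [1, 1, 1, 1, 1, 0]
```

Replacing each count with a 0/1 indicator gives the “whether a word appeared” variant mentioned above.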
The learning task
• The training set is now a set of labeled points.
• Our goal is to find a “good” classifier, based on the training set, so that it correctly labels any new website.
[Figure: labeled training points (+ and −) in feature space, along with a new, not-yet-labeled website (?)]
Part 1: Model selection. What type of classifiers should we consider?
Part 2: Estimation. How do we select the best classifier in the set?
Part 1: allowing all classifiers?
• We can easily construct a “silly classifier” that perfectly classifies any distinct set of training points.
• But it doesn’t “generalize” (it doesn’t classify new points very well).
[Figure: training points (+ and −) and a new, unlabeled website (?)]
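As an illustration (a minimal sketch, not from the slides), a lookup-table classifier that simply memorizes the training points: it has zero training error on any distinct set of points, but on a new point it can only guess.

```python
def memorizing_classifier(training_points):
    """The 'silly classifier': memorize every training point and its label."""
    table = {x: y for x, y in training_points}

    def classify(x):
        if x in table:           # perfect on points seen during training ...
            return table[x]
        return +1                # ... but an arbitrary guess on anything new

    return classify

train = [((1.0, 2.0), +1), ((0.5, -1.0), -1), ((2.0, 0.3), +1)]
f = memorizing_classifier(train)
print(all(f(x) == y for x, y in train))   # True: zero training error
print(f((3.0, 3.0)))                      # +1, regardless of the true label
```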
Part 1: allowing few classifiers?
• We could instead consider very few alternatives, such as the two candidate classifiers in the figure.
• But neither one classifies even the training points very well.
[Figure: training points (+ and −), a new website (?), and two fixed candidate decision boundaries]
Part 1: linear classifiers
• A linear classifier (through the origin) with parameters θ divides the space into positive and negative halves: it predicts sign(θ · x) for an input vector x.
• decision boundary: the set of points with θ · x = 0, a hyperplane through the origin orthogonal to θ
Recall: cos(angle) = dot product of two unit vectors; cos(90°) = 0
[Figure: the parameter vector θ, the decision boundary orthogonal to it, and the positive (+) and negative (−) halves of the space]
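A minimal Python sketch of such a classifier; the parameter vector theta below is an arbitrary illustrative choice, and ties (θ · x = 0) are broken toward +1.

```python
def predict(theta, x):
    """Linear classifier through the origin: the sign of the dot product theta · x."""
    score = sum(t * xi for t, xi in zip(theta, x))
    return +1 if score >= 0 else -1

theta = (2.0, -1.0)                  # parameters (illustrative)
print(predict(theta, (1.0, 1.0)))    # theta·x =  1.0 -> +1 (positive half)
print(predict(theta, (0.0, 3.0)))    # theta·x = -3.0 -> -1 (negative half)
```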
Part 2: estimation
• We can use the training error as a surrogate criterion for finding the best linear classifier through the origin
• Other choices are possible (and often preferable)
• The perceptron algorithm considers each training point in turn, adjusting the parameters to correct any mistakes
• The algorithm will converge (no more mistakes) if the training points are linearly separable; otherwise it will not converge
Perceptron algorithm
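A minimal Python sketch of the mistake-driven perceptron updates for a classifier through the origin, assuming labels y_t in {−1, +1}; the function name and the epoch cap are illustrative choices.

```python
def perceptron(points, labels, max_epochs=100):
    """Perceptron for a linear classifier through the origin.

    points: list of feature vectors x_t; labels: list of y_t in {-1, +1}.
    """
    theta = [0.0] * len(points[0])                 # start from the zero vector
    for _ in range(max_epochs):                    # repeated sweeps over the training set
        mistakes = 0
        for x, y in zip(points, labels):
            if y * sum(t * xi for t, xi in zip(theta, x)) <= 0:   # mistake on (x, y)
                theta = [t + y * xi for t, xi in zip(theta, x)]   # theta <- theta + y * x
                mistakes += 1
        if mistakes == 0:                          # a full mistake-free pass: converged
            return theta
    return theta                                   # not separable (or cap reached)
```

If the training points are linearly separable through the origin, some pass is eventually mistake-free and the loop stops; otherwise the updates keep cycling until the epoch cap is reached.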
Perceptron algorithm: motivation
• If we make a mistake on the t-th training point, then y_t (θ · x_t) ≤ 0, and we update θ ← θ + y_t x_t.
• After the update, we have y_t ((θ + y_t x_t) · x_t) = y_t (θ · x_t) + y_t² ‖x_t‖² = y_t (θ · x_t) + ‖x_t‖², since y_t² = 1.
• So y_t (θ · x_t), the agreement between the classifier and the correct label, increases based on the update.
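A quick numeric check of this with made-up numbers:

```python
x, y = (1.0, 2.0), -1                            # a training point the classifier gets wrong
theta = (0.5, 1.0)
before = y * (theta[0]*x[0] + theta[1]*x[1])     # -2.5 <= 0: a mistake
theta = (theta[0] + y*x[0], theta[1] + y*x[1])   # perceptron update: theta <- theta + y*x
after = y * (theta[0]*x[0] + theta[1]*x[1])      #  2.5
print(before, after, after - before)             # the gain is 5.0 = ||x||^2
```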
Perceptron algorithm
• Iterative updates based on mistakes
[Figure: the decision boundary after successive mistake-driven updates on the training points (+ and −)]
Perceptron algorithm (take 2)
• Iterative updates based on mistakes
[Figure: another sequence of mistake-driven updates on the training points (+ and −)]
Perceptron algorithm (take 3)
• Iterative updates based on mistakes
[Figure: a third sequence of mistake-driven updates on the training points (+ and −)]
“Margin”
• We can get a handle on convergence by assuming that there exists a target classifier with good properties.
• One such property is the margin, i.e., how close the separating boundary is to the points.
Recall: cos(angle) = dot product of two unit vectors; dotted distance = radius * cos(angle)
[Figure: training points (+ and −), the target decision boundary, and the (dotted) distance from the boundary to the nearest point]
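A minimal Python sketch, with made-up data, of the margin of a target classifier through the origin: the smallest distance from any training point to the decision boundary, i.e., the minimum over t of y_t (θ* · x_t) / ‖θ*‖.

```python
import math

def margin(theta_star, points, labels):
    """Smallest signed distance from a training point to the boundary theta_star · x = 0."""
    norm = math.sqrt(sum(t * t for t in theta_star))
    return min(y * sum(t * xi for t, xi in zip(theta_star, x)) / norm
               for x, y in zip(points, labels))

points = [(1.0, 1.0), (2.0, 0.5), (-1.0, -1.5), (-0.5, -2.0)]   # illustrative data
labels = [+1, +1, -1, -1]
print(margin((1.0, 1.0), points, labels))   # positive iff (1, 1) separates the data
```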
Perceptron convergence theorem
• If there exists θ* with ‖θ*‖ = 1 such that y_t (θ* · x_t) ≥ γ for every training point, and ‖x_t‖ ≤ R for every training point, then the perceptron algorithm makes at most (R/γ)² mistakes (on the training set).
• Note that the result does NOT depend on the dimension of the feature vectors nor on the number of training points.
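A small self-contained check of the bound on made-up separable data: count the mistakes the perceptron makes and compare them to (R/γ)², where R bounds ‖x_t‖ and γ is the margin of a unit-norm θ*.

```python
import math

points = [(1.0, 1.0), (2.0, 0.5), (-1.0, -1.5), (-0.5, -2.0)]   # illustrative separable data
labels = [+1, +1, -1, -1]

# margin gamma of the unit-norm target theta* = (1, 1)/sqrt(2), and radius R
theta_star = (1 / math.sqrt(2), 1 / math.sqrt(2))
gamma = min(y * (theta_star[0]*x[0] + theta_star[1]*x[1]) for x, y in zip(points, labels))
R = max(math.sqrt(x[0]**2 + x[1]**2) for x in points)

# run the perceptron until a full pass makes no mistakes, counting every mistake
theta, mistakes = [0.0, 0.0], 0
while True:
    errors = 0
    for x, y in zip(points, labels):
        if y * (theta[0]*x[0] + theta[1]*x[1]) <= 0:
            theta = [theta[0] + y*x[0], theta[1] + y*x[1]]
            mistakes += 1
            errors += 1
    if errors == 0:
        break

print(mistakes, "<=", (R / gamma) ** 2)   # e.g. 1 <= 2.125 for this data
```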