
Jean Honorio
Purdue University

(originally prepared by Tommi Jaakkola, MIT CSAIL)

CS373 Data Mining and
Machine Learning

Lecture 1

Course topics
• Supervised learning

- linear and non-linear classifiers, kernels
- rating, ranking, collaborative filtering
- model selection, complexity, generalization
- Conditional Random Fields, structured prediction

• Unsupervised learning, modeling
- mixture models, topic models
- Hidden Markov Models
- Bayesian networks
- Markov Random Fields, factor graphs

Machine learning
• Learning from examples

[Figure: a training set of websites, each labeled +1 or -1]

The training set of labeled examples specifies the learning task only implicitly.

Our goal is to accurately label new websites that were not part of the training set.

classifier = mapping from websites to labels
new website -> predicted label

“Examples”
• We will have to first represent the examples (websites) in a manner that can be easily mapped to labels

“White House officials consulted with the Justice Department in preparing a list of U.S. attorneys who would be removed.” (NYT 03/13/07)

news article -> bag of words -> counts

[Figure: the article's words scattered as an unordered bag: White, House, officials, consulted, with, the, Justice, Department, in, preparing, a, list, of, U.S., attorneys, who, would, be, removed]

counts: a vector whose coordinates (features) specify how many times (or whether) particular words appeared in the article

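As a concrete illustration, here is a minimal sketch of turning a snippet of text into such a count vector. The tokenization and the small vocabulary below are illustrative assumptions, not a prescribed preprocessing pipeline.

from collections import Counter

def bag_of_words(text, vocabulary):
    """Map a document to a vector of word counts over a fixed vocabulary."""
    # Crude tokenization for illustration: split on whitespace, strip basic
    # punctuation, lowercase. A real pipeline would tokenize more carefully.
    tokens = [w.strip('.,"()').lower() for w in text.split()]
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

article = ("White House officials consulted with the Justice Department "
           "in preparing a list of U.S. attorneys who would be removed.")
vocab = ["white", "house", "justice", "attorneys", "football"]
print(bag_of_words(article, vocab))   # -> [1, 1, 1, 1, 0]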
The learning task
• The training set is now a set of labeled points

• Our goal is to find a “good” classifier based on the training set, so that it correctly labels any new website

[Figure: labeled training points (+/-) in feature space, with “?” marking a new website to be classified]

Part 1: Model selection
what type of classifiers should we consider?

Part 2: Estimation
how to select the best classifier in the set?

Part 1: allowing all classifiers?

• We can easily construct a “silly classifier” that perfectly classifies any distinct set of training points

• But it doesn’t “generalize” (it doesn’t classify new points very well)

[Figure: labeled training points and the new website to classify]

Part 1: allowing few classifiers?

• We could instead consider very few alternatives, such as the two fixed classifiers in the figure

• But neither one classifies even the training points very well

[Figure: two candidate classifiers drawn over the labeled training points and the new website]

Part 1: linear classifiers
• A linear classifier (through the origin) with parameters θ divides the space into positive and negative halves: a point x is labeled sign(θ·x), i.e., +1 when θ·x > 0 and -1 when θ·x < 0

decision boundary: the set of points where θ·x = 0 (the hyperplane through the origin orthogonal to θ)

Recall:
cos(angle) = dot product of two unit vectors
cos(90°) = 0, so points on the decision boundary are orthogonal to θ

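A minimal sketch of this decision rule, with made-up parameter values for illustration:

import numpy as np

def predict(theta, x):
    """Linear classifier through the origin: label = sign(theta . x)."""
    return 1 if np.dot(theta, x) > 0 else -1

theta = np.array([0.5, -1.0, 2.0])   # parameter vector (illustrative values)
x_new = np.array([1.0, 0.0, 1.0])    # feature vector of a new website
print(predict(theta, x_new))         # -> 1, since theta . x = 2.5 > 0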
Part 2: estimation
• We can use the training error as a surrogate criterion
for finding the best linear classifier through the origin

• Other choices are possible (and often preferable)

• The perceptron algorithm considers each training point
in turn, adjusting the parameters to correct any mistakes

• The algorithm will converge (no mistakes) if the training
points are linearly separable, otherwise it won’t converge

Perceptron algorithm
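A minimal sketch of the mistake-driven procedure described above: cycle through the training points and, whenever a point is misclassified, add y_t x_t to the parameters. The cap on the number of passes is an illustrative choice, not part of the algorithm's statement.

import numpy as np

def perceptron(X, y, max_passes=100):
    """Perceptron through the origin: on a mistake, theta <- theta + y_t * x_t."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(max_passes):
        mistakes = 0
        for t in range(n):
            if y[t] * np.dot(theta, X[t]) <= 0:   # mistake (or exactly on the boundary)
                theta = theta + y[t] * X[t]       # correct it
                mistakes += 1
        if mistakes == 0:                         # a full pass with no mistakes: converged
            break
    return theta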

Perceptron algorithm: motivation
• If we make a mistake on the t-th training point, then y_t (θ·x_t) ≤ 0

• After the update θ' = θ + y_t x_t, we have
  y_t (θ'·x_t) = y_t (θ·x_t) + y_t² (x_t·x_t) = y_t (θ·x_t) + ||x_t||²
  (since y_t = ±1, so y_t² = 1)

• So y_t (θ·x_t), the agreement between the classifier and the label on this point, increases based on the update
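A quick numeric check of this identity, with made-up values for illustration:

import numpy as np

theta = np.array([1.0, -0.5])
x_t, y_t = np.array([0.5, 2.0]), +1
before = y_t * np.dot(theta, x_t)        # -0.5: a mistake, since it is <= 0
theta_new = theta + y_t * x_t            # the perceptron update
after = y_t * np.dot(theta_new, x_t)     # 3.75
print(after - before, np.dot(x_t, x_t))  # both 4.25: agreement grew by ||x_t||^2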

Perceptron algorithm
• Iterative updates based on mistakes

[Figure: the decision boundary after each successive mistake-driven update on the training points]

Perceptron algorithm (take 2)
• Iterative updates based on mistakes

[Figure: another run of the mistake-driven updates on the same training points]

Perceptron algorithm (take 3)
• Iterative updates based on mistakes

[Figure: yet another run of the mistake-driven updates]

“Margin”
• We can get a handle on convergence by assuming that there exists a target classifier with good properties

• One such property is margin, i.e., how close the separating boundary is to the points

[Figure: labeled training points with the target classifier's decision boundary and its margin to the nearest point]

Recall:
cos(angle) = dot product of two unit vectors
dotted distance = radius * cos(angle)
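Following the hint above, a point's distance to a boundary through the origin is its projection onto the unit vector θ/||θ||. A minimal sketch with illustrative numbers:

import numpy as np

theta_star = np.array([3.0, 4.0])   # the target classifier's parameters
x, y = np.array([1.0, 2.0]), +1     # a labeled training point

# Signed distance of x to the boundary {x : theta_star . x = 0}:
# project x onto the unit vector theta_star / ||theta_star||.
distance = y * np.dot(theta_star, x) / np.linalg.norm(theta_star)
print(distance)                     # (3*1 + 4*2) / 5 = 2.2

# The margin of the training set is the smallest such distance over all points.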

Perceptron convergence theorem
• If there exists θ* with ||θ*|| = 1 such that y_t (θ*·x_t) ≥ γ for every training point t,
and ||x_t|| ≤ R for every t, then the perceptron algorithm makes at most

(R / γ)²

mistakes (on the training set).

• Note that the result does NOT depend on the number of training points or on the dimension of the examples.
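To make the statement concrete, here is a small sketch that counts the perceptron's mistakes on a toy separable dataset and compares them to (R/γ)². The dataset and the choice of θ* are illustrative assumptions.

import numpy as np

# Toy separable data: the label is the sign of the first coordinate.
X = np.array([[2.0, 1.0], [1.5, -1.0], [-2.0, 0.5], [-1.0, -1.5]])
y = np.array([+1, +1, -1, -1])

# Run the perceptron (through the origin), counting mistakes.
theta, mistakes = np.zeros(2), 0
for _ in range(100):                          # passes over the data
    changed = False
    for t in range(len(y)):
        if y[t] * np.dot(theta, X[t]) <= 0:   # mistake
            theta += y[t] * X[t]
            mistakes += 1
            changed = True
    if not changed:                           # no mistakes in a full pass
        break

theta_star = np.array([1.0, 0.0])             # a unit-norm separator for this data
gamma = min(y[t] * np.dot(theta_star, X[t]) for t in range(len(y)))  # margin = 1.0
R = max(np.linalg.norm(x) for x in X)                                # radius = sqrt(5)
print(mistakes, (R / gamma) ** 2)             # mistakes should not exceed (R/gamma)^2 = 5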