CS373 Data Mining and Machine Learning
Lecture 6
Jean Honorio
Purdue University
(originally prepared by Tommi Jaakkola, MIT CSAIL)
Today’s topics
• Multi-way classification
- reducing multi-class to binary
- margin based solution
• Rating (ordinal regression)
- reduction to binary problems
- SVM solution, on-line solution
• Ranking
- ranking SVM
Reducing multi-class to binary
• We train several classifiers. For a test point, we output
a string of binary predictions. We then check which row of
the code matrix is closest to the string; that row's class
is the multi-way label.
• Properties of good code matrices
- “binary codes” (rows) should be well-separated: if minimum
Hamming distance between rows is H, we can make at most H/2
mistakes (good error correction)
- Which seems better: one-versus-all or all-pairs?
- binary tasks (columns) should be easy to solve
- Which seems better: one-versus-all or all-pairs?
[Figure: a code matrix with classes 1-4 as rows and binary tasks as columns; the row for class 2 is its code string]
Reducing multi-class to binary
• To classify a new example x, we compare its string of
discriminant values against each row y of the code matrix:

ŷ = argmax_y Σ_j M(y, j) f_j(x),  y ∈ {1, ..., 4}

where M(y, j) ∈ {-1, +1} is the target binary label for the
jth classifier if the multi-class label is y, f_j(x) is the
discriminant function value of the jth classifier in response
to the new example, and each term M(y, j) f_j(x) measures
their agreement.
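To make the train-then-decode pipeline concrete, here is a minimal Python sketch; the 4-class one-vs-all code matrix, the use of scikit-learn's LinearSVC as the binary learner, and the function names are illustrative assumptions rather than part of the lecture.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Illustrative 4-class one-vs-all code matrix: rows = classes, columns = binary tasks.
M = np.array([[+1, -1, -1, -1],
              [-1, +1, -1, -1],
              [-1, -1, +1, -1],
              [-1, -1, -1, +1]])

def train_binary_tasks(X, y, M):
    """Train one binary classifier per column j of the code matrix;
    column j relabels each training point as M[y, j] in {-1, +1}."""
    return [LinearSVC().fit(X, M[y, j]) for j in range(M.shape[1])]

def decode(classifiers, M, x):
    """Predict the class whose code word agrees most with the string
    of discriminant values f_j(x)."""
    f = np.array([clf.decision_function(x.reshape(1, -1))[0]
                  for clf in classifiers])
    return int(np.argmax(M @ f))  # argmax_y sum_j M[y, j] * f_j(x)
```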
Output codes, error correction
• If the loss is the hinge loss loss(z) = max(0,1-z), then the
multi-class training error is at most

(l / ρ) × (average hinge loss over the l binary tasks)

where ρ is the minimum Hamming distance between code words.
The second factor is small if each binary task can be solved
well; the first factor is small if code words are well-separated.
(See Theorem 1 in [2] if interested.)
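For a feel of the error-correction property, a short sketch (the matrix here is illustrative) that computes the minimum Hamming distance H between rows; for one-versus-all, any two code words differ in exactly two positions, so H = 2 regardless of the number of classes.

```python
import numpy as np
from itertools import combinations

def min_hamming(M):
    """Minimum Hamming distance between any two rows of a code matrix."""
    return min(np.sum(r1 != r2) for r1, r2 in combinations(M, 2))

one_vs_all = 2 * np.eye(4) - 1   # +1 on the diagonal, -1 elsewhere
print(min_hamming(one_vs_all))   # -> 2: very little error correction
```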
Today’s topics
• Multi-way classification
- reducing multi-class to binary
- margin based solution
• Rating (ordinal regression)
- reduction to binary problems
- SVM solution, on-line solution
• Ranking
- ranking SVM
Rating problems
• A common prediction problem in recommender
systems involves rating items (movies, products) on the
basis of some known features about such objects
• The rating scale is often 1-5 stars assigned to the object
• The key difference between rating problems and multi-
way classification problems is that the rating scale is
ordinal (e.g., 1<2<3<4<5) while class labels in multi-way
classification problems are category symbols
Ordinal regression: setup
• Each item is associated with a feature vector
- e.g., product description, movie features, etc.
• We wish to predict an ordinal label y ∈ {1, ..., k}
for each item (reflecting views of one user)
• As in the multi-class setting, we translate each rating
into a set of binary labels
• There are many ways to translate ratings into binary
labels...
Binary translation
binary task   +            -
1             {2,3,4,5}    {1}
2             {3,4,5}      {2}
3             {4,5}        {3}
4             {5}          {4}
These are independent
partitions that do not enforce
the ordinal scale
• We can create more relevant partitions by “sliding”
across the ordinal scale
Binary translation
binary task   +            -
1             {2,3,4,5}    {1}
2             {3,4,5}      {1,2}
3             {4,5}        {1,2,3}
4             {5}          {1,2,3,4}
The partitions are now
dependent ... need to enforce
consistency of the binary
labels across partitions
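For concreteness, a minimal sketch of the sliding translation (the function name is mine); the monotone run of +1s followed by -1s is exactly the consistency that the shared-parameter classifiers below will enforce.

```python
def sliding_labels(y, k=5):
    """Binary task j (j = 1, ..., k-1) asks: is the rating greater
    than j?  A rating y in {1, ..., k} yields k-1 binary labels."""
    return [+1 if y > j else -1 for j in range(1, k)]

print(sliding_labels(3))  # [1, 1, -1, -1]: 3 > 1 and 3 > 2, but not > 3 or > 4
```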
• We can specify a set of classifiers with shared parameters
that always produce consistent binary labels
Ordinal regression
[Figure: example items rated 4, 2, 3, 3, 1, 2, 5 projected onto a common scale; the four sliding partitions {2,3,4,5} vs {1}, {3,4,5} vs {1,2}, {4,5} vs {1,2,3}, {5} vs {1,2,3,4} are each separated by their own threshold]
The jth binary classifier is

h_j(x) = sign( θ · x - b_j ),  j = 1, ..., 4

where θ · x is the common prediction and the thresholds
b_1 ≤ b_2 ≤ b_3 ≤ b_4 are different but ordered. Because the
score θ · x is shared and the thresholds are ordered, the
binary labels are always consistent with some rating.
Ordinal regression, 2nd view
• Each item is associated with a feature vector
- e.g., product description, movie features, etc.
• We wish to predict an ordinal label y ∈ {1, ..., k}
for each item (reflecting views of one user)
• We assume that there exists an underlying continuous
scale from which ratings are obtained via thresholding
[Figure: an underlying continuous scale carved by the ordered thresholds into segments corresponding to ratings 1, ..., k]
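Under this view prediction is just locating the continuous score among the ordered thresholds; a minimal sketch with assumed names:

```python
import numpy as np

def predict_rating(theta, b, x):
    """Rating = 1 + the number of ordered thresholds b_1 <= ... <= b_{k-1}
    that the underlying continuous score theta . x exceeds."""
    score = np.dot(theta, x)
    return 1 + int(np.sum(score > np.asarray(b)))
```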
Ordinal regression, SVM style
• Given a training set {(x_t, y_t)}, t = 1, ..., n, with ratings
y_t ∈ {1, ..., k} (e.g., 4, 2, 3, 3, 1, 1, 2, 5), we obtain k-1
binary labels from each observed rating: y_t^j = +1 if y_t > j,
-1 otherwise
• Each binary label contributes one binary classification
constraint, giving an SVM-style problem over the shared weights
and ordered thresholds:

minimize (1/2)||θ||² + C Σ_t Σ_j ξ_t^j
subject to y_t^j (θ · x_t - b_j) ≥ 1 - ξ_t^j, ξ_t^j ≥ 0,
for all t and j = 1, ..., k-1
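One way to actually solve this with an off-the-shelf binary SVM is to fold the thresholds into the weight vector as extra features (the same trick noted for PRank below). The sketch here is an illustration using scikit-learn's LinearSVC, not the lecture's own derivation, and it does not explicitly enforce the ordering of the thresholds.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_ordinal_svm(X, y, k=5, C=1.0):
    """Expand each (x_t, y_t) into k-1 binary examples, one per threshold j.
    Appending -e_j to x_t makes the score w . [x_t, -e_j] = theta . x_t - b_j,
    so all binary problems share theta while keeping their own b_j."""
    rows, labels = [], []
    for x, r in zip(X, y):
        for j in range(1, k):
            e = np.zeros(k - 1)
            e[j - 1] = -1.0                       # contributes -b_j to the score
            rows.append(np.concatenate([x, e]))
            labels.append(+1 if r > j else -1)    # binary label y_t^j
    svm = LinearSVC(C=C, fit_intercept=False).fit(np.array(rows), labels)
    d = X.shape[1]
    return svm.coef_[0][:d], svm.coef_[0][d:]     # theta, thresholds b
```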
Ordinal regression, PRank
• We can also define a mistake driven perceptron
algorithm for solving ordinal regression problems
• The updates are modified slightly due to shared
parameters�
On each example (x_t, y_t):
- form the target binary labels y_t^j = +1 if y_t > j, -1
otherwise, for j = 1, ..., k-1
- identify all binary mistakes: E = { j : y_t^j (θ · x_t - b_j) ≤ 0 }
- perform a collective update based on the mistakes:
θ ← θ + ( Σ_{j∈E} y_t^j ) x_t
- update the thresholds of each mistaken classifier:
b_j ← b_j - y_t^j for every j ∈ E
Note: having a threshold is equivalent to having an extra feature whose value is -1 for every sample.
Thus, the update rule for b_j is not surprising.
• Lemma: if the thresholds are set to zero initially, they
maintain the correct ordering in the course of the algorithm.
(See Lemma 1 in [1] if interested in the proof.)
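A direct transcription of the update into Python (a sketch; variable names are mine):

```python
import numpy as np

def prank_fit(X, y, k=5, epochs=10):
    """Mistake-driven perceptron for ordinal regression (PRank-style):
    a shared weight vector theta plus k-1 thresholds b."""
    theta = np.zeros(X.shape[1])
    b = np.zeros(k - 1)           # zero initialization keeps b ordered (Lemma)
    for _ in range(epochs):
        for x, r in zip(X, y):
            yb = np.where(r > np.arange(1, k), 1.0, -1.0)  # target binary labels
            mistakes = yb * (theta @ x - b) <= 0           # identify all mistakes
            if mistakes.any():
                theta += np.sum(yb[mistakes]) * x          # collective update
                b[mistakes] -= yb[mistakes]                # threshold updates
    return theta, b
```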
PRank, mistake bound
• Theorem: Assume that there exist θ* and ordered thresholds b*
with

||θ*||² + ||b*||² = 1

such that every binary constraint is satisfied with margin γ > 0:

y_t^j (θ* · x_t - b*_j) ≥ γ for all t and j = 1, ..., k-1,

then the algorithm makes at most

(k - 1)(R² + 1) / γ²

binary mistakes on the training set, where R = max_t ||x_t||.
(See Theorem 2 in [1] if interested in the proof.)
Today’s topics
• Multi-way classification
- reducing multi-class to binary
- margin based solution
• Rating (ordinal regression)
- reduction to binary problems
- SVM solution, on-line solution
• Ranking
- ranking SVM
Ranking
• Rating products, movies, etc. using a few values (e.g.,
1-5 stars) results in a partial ranking of the items
• Many rating / classification problems are better viewed
as ranking problems
- suggest movies in the order of user interest in them,
- rank websites to display in response to a query,
- suggest genes relevant to a particular disease condition, etc.
• By casting the learning problem as a ranking problem
we can also incorporate other types of data / feedback
- e.g., click through data from users
Ranking example
• We would like to rank n websites (find top sites to
display) in response to a few query words
• The available data contain user selections (clicks) of
websites out of those displayed to them
From selections to preferences
• We can interpret a user click as a statement that they
prefer the selected link over others displayed in the
context of the query
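A minimal sketch of the extraction step, under the assumed data layout that each log record lists the displayed websites plus the index of the clicked one:

```python
def preference_pairs(displayed, clicked_idx):
    """Interpret a click as: the clicked website is preferred over every
    other website displayed in the context of the same query."""
    clicked = displayed[clicked_idx]
    return [(clicked, other) for i, other in enumerate(displayed)
            if i != clicked_idx]

print(preference_pairs(["A", "B", "C"], 1))  # [('B', 'A'), ('B', 'C')]
```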
Ranking function
• Our goal is to estimate a ranking function over pairs
f(x, y) such that its values are consistent with the
observed preferences:

f(x, y_i) > f(x, y_j) whenever y_i is preferred to y_j
in the context x

• We can parameterize this function in terms of feature
vectors extracted from each pair (context, website):

f(x, y) = θ · φ(x, y)
where the features could be, e.g., counts of how often the
query words appear in the website's title, text, or URL
• The ranking function gives rise to a total ordering of the
pairs via projection onto the parameter vector
[Figure: feature vectors φ(x, 1), φ(x, 2), φ(x, 3) projected onto θ; the ordering of the projections gives 1 > 2 and 2 > 3]
SVM rank
• A training set of order relations between pairs:
y_i ≻ y_j for observed preferences in context x
• An SVM style algorithm for finding a consistent ranking
function:

minimize (1/2)||θ||² + C Σ ξ_ij
subject to θ · φ(x, y_i) - θ · φ(x, y_j) ≥ 1 - ξ_ij, ξ_ij ≥ 0,
for each observed preference y_i ≻ y_j
• It is important to appropriately weight or choose which
constraints to include
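A minimal sketch of the standard reduction that makes this trainable with an off-the-shelf binary SVM: every preference constraint is a hinge constraint on the difference vector φ(x, y_i) - φ(x, y_j), so a linear SVM on the differences recovers θ (LinearSVC and all names here are illustrative assumptions).

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_rank_svm(preferences, C=1.0):
    """preferences: list of (phi_preferred, phi_other) feature-vector pairs.
    theta . (phi_i - phi_j) >= 1 - xi is a binary hinge constraint on the
    difference vector; mirrored negatives give the SVM both classes."""
    diffs = np.array([pi - pj for pi, pj in preferences])
    X = np.vstack([diffs, -diffs])
    y = np.array([1] * len(diffs) + [-1] * len(diffs))
    svm = LinearSVC(C=C, fit_intercept=False).fit(X, y)
    return svm.coef_[0]

# rank candidates for a new context by sorting their scores:
# order = np.argsort(-candidate_features @ theta)
```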
The effect of ranking constraints
[Figure: left, a ranking solution satisfying the constraints 1 > 2, 2 > 3, 3 > 4; right, the solution after also requiring 4 > 5, with one constraint now violated]
• Adding a single constraint can have a large effect on the
ranking solution