
CS373 Data Mining and
Machine Learning

Lecture 6
Jean Honorio

Purdue University

(originally prepared by Tommi Jaakkola, MIT CSAIL)

Today’s topics
• Multi-way classification

- reducing multi-class to binary
- margin based solution

• Rating (ordinal regression)
- reduction to binary problems
- SVM solution, on-line solution

• Ranking
- ranking SVM

Reducing multi-class to binary
• We train several classifiers. For a test point, we output a string (multi-way label). We then check which matrix row is closest to the string.

• Properties of good code matrices
- "binary codes" (rows) should be well-separated: if the minimum Hamming distance between rows is H, we can still decode correctly after making fewer than H/2 binary mistakes (good error correction)
- Which seems better: one-versus-all or all-pairs?
- binary tasks (columns) should be easy to solve
- Which seems better: one-versus-all or all-pairs?

[Figure: an example code matrix with classes 1-4 as rows and binary tasks as columns; the row for class 2 is the code string for class 2.]
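For concreteness, here is a small sketch (mine, not from the slides) that builds the two standard code matrices for k = 4 classes; in the all-pairs matrix, a 0 entry marks classes that a column ignores:

import numpy as np

def one_vs_all_matrix(k):
    # k rows (classes) x k columns (binary tasks);
    # column j separates class j (+1) from all other classes (-1)
    M = -np.ones((k, k), dtype=int)
    np.fill_diagonal(M, 1)
    return M

def all_pairs_matrix(k):
    # one column per unordered pair (a, b): class a -> +1,
    # class b -> -1, every other class -> 0 (ignored by that task)
    cols = []
    for a in range(k):
        for b in range(a + 1, k):
            col = np.zeros(k, dtype=int)
            col[a], col[b] = 1, -1
            cols.append(col)
    return np.stack(cols, axis=1)

print(one_vs_all_matrix(4))   # rows differ in exactly 2 positions: H = 2
print(all_pairs_matrix(4))    # k(k-1)/2 = 6 columns, but each task is easier

This makes the trade-off in the question above visible: one-versus-all has poor row separation but uses all the data per task, while all-pairs has many easy tasks, each trained on only two classes.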

Reducing multi-class to binary

[Figure: the code matrix M, with classes y (1-4) as rows and binary tasks j as columns; entry M(y, j) is the target binary label for the jth classifier if the multi-class label is y.]

• To decode a new example x, let f_j(x) be the discriminant function value of the jth classifier in response to x; each term M(y, j) f_j(x) measures agreement, and we predict

    y_hat = argmax_y sum_j M(y, j) f_j(x)
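A minimal sketch of this decoding rule (the function name and the hypothetical classifier outputs are mine):

import numpy as np

def decode(M, f):
    # M: (num_classes, num_tasks) code matrix with +/-1 entries
    # f: (num_tasks,) real-valued outputs of the binary classifiers
    # score each class row by its total agreement M(y, j) * f_j(x)
    agreement = M @ f
    return int(np.argmax(agreement))

# toy example: 4 classes, one-versus-all code matrix
M = -np.ones((4, 4), dtype=int)
np.fill_diagonal(M, 1)
f = np.array([-0.3, 1.2, -0.7, 0.1])   # hypothetical classifier outputs
print(decode(M, f))                     # -> 1 (row 1 agrees most with f)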

Output codes, error correction
• If the loss is the hinge loss, loss(z) = max(0, 1-z), then roughly

    multi-class training error <= (L / H) * (average hinge loss over examples and binary tasks)

  where L is the number of binary tasks and H is the minimum Hamming distance between rows:
- the bound is small if each binary task can be solved well (small average hinge loss)
- the bound is small if the code words are well-separated (large H)

(See Theorem 1 in [2] if interested.)
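Both quantities in the bound are easy to compute; a sketch for a +/-1 code matrix M, hypothetical classifier outputs F, and true classes Y (all names mine):

import numpy as np

def min_hamming(M):
    # minimum Hamming distance between any two rows of the code matrix
    k = M.shape[0]
    return min(int((M[a] != M[b]).sum())
               for a in range(k) for b in range(a + 1, k))

def avg_hinge(F, Y, M):
    # F: (n, tasks) classifier outputs, Y: (n,) true class indices
    # average of loss(z) = max(0, 1 - z) over all examples and tasks
    Z = M[Y] * F                      # margins M(y_t, j) * f_j(x_t)
    return float(np.maximum(0.0, 1.0 - Z).mean())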

Today’s topics
• Multi-way classification

- reducing multi-class to binary
- margin based solution

• Rating (ordinal regression)
- reduction to binary problems
- SVM solution, on-line solution

• Ranking
- ranking SVM

Rating problems
• A common prediction problem in recommender systems involves rating items (movies, products) on the basis of some known features of such objects

• The rating scale is often 1-5 stars assigned to the object
• The key difference between rating problems and multi-way classification problems is that the rating scale is ordinal (e.g., 1<2<3<4<5) while class labels in multi-way classification problems are category symbols

Ordinal regression: setup
• Each item is associated with a feature vector
- e.g., product description, movie features, etc.
• We wish to predict an ordinal label y in {1, ..., k} for each item (reflecting the views of one user)
• As in the multi-class setting, we translate each rating into a set of binary labels
• There are many ways to translate ratings into binary labels...

Binary translation
[Figure: a table with one column per rating y in {1, ..., 5} and one row per binary task; the four tasks split the ratings as {2,3,4,5} (+) vs {1} (-), {3,4,5} vs {2}, {4,5} vs {3}, and {5} vs {4}.]
• These are independent partitions that do not enforce the ordinal scale
• We can create more relevant partitions by "sliding" across the ordinal scale

Binary translation
[Figure: the "sliding" partitions {2,3,4,5} (+) vs {1} (-), {3,4,5} vs {1,2}, {4,5} vs {1,2,3}, and {5} vs {1,2,3,4}.]
• The partitions are now dependent ... we need to enforce consistency of the binary labels across partitions
• We can specify a set of classifiers with shared parameters that always produce consistent binary labels

Ordinal regression
[Figure: the sliding partitions applied to a sequence of observed ratings 4, 2, 3, 3, 1, 2, 5.]
• The classifiers share a common prediction theta.x; only the thresholds differ, and they are ordered: the jth classifier predicts sign(theta.x - b_j) with b_1 <= ... <= b_{k-1}

Ordinal regression, 2nd view
• Each item is associated with a feature vector
- e.g., product description, movie features, etc.
• We wish to predict an ordinal label y for each item (reflecting the views of one user)
• We assume that there exists an underlying continuous scale from which ratings are obtained via thresholding
[Figure: the underlying continuous scale theta.x cut by the ordered thresholds into regions labeled with ratings 1, ..., k; the prediction is the rating of the region containing theta.x.]

Ordinal regression, SVM style
• Given a training set {(x_t, y_t)} with ratings such as 4, 2, 3, 3, 1, 1, 2, 5, we obtain k-1 binary labels from each observed rating: y_t^(j) = +1 if y_t > j, and -1 otherwise
• Each binary label gives a binary classification constraint, y_t^(j) (theta.x_t - b_j) >= 1, softened with slack variables as in the standard SVM

Ordinal regression, PRank
• We can also define a mistake driven perceptron algorithm for solving ordinal regression problems
• The updates are modified slightly due to the shared parameters:
- identify all binary mistakes: all j with y_t^(j) (theta.x_t - b_j) <= 0
- perform a collective update based on the mistakes: theta <- theta + (sum of y_t^(j) over mistaken j) x_t
- update the threshold of each mistaken classifier: b_j <- b_j - y_t^(j)
Note: having a threshold is equivalent to having an extra feature, in which all samples have -1. Thus, the update rule for b_j is not surprising.
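Written out as code, a minimal sketch of this update (the names prank_update / prank_predict and the dense NumPy style are mine):

import numpy as np

def prank_update(theta, b, x, y):
    # one PRank step on example (x, y), rating y in {1, ..., k};
    # theta is the shared weight vector, b holds the k-1 thresholds
    j = np.arange(1, len(b) + 1)
    yb = np.where(y > j, 1, -1)                         # binary label of task j
    tau = np.where(yb * (theta @ x - b) <= 0, yb, 0)    # all binary mistakes
    theta = theta + tau.sum() * x                       # collective weight update
    b = b - tau                                         # per-task threshold update
    return theta, b

def prank_predict(theta, b, x):
    # predicted rating: smallest r with theta.x below threshold b_r
    below = np.flatnonzero(theta @ x - b < 0)
    return int(below[0]) + 1 if below.size else len(b) + 1

Starting from theta = np.zeros(d) and b = np.zeros(k - 1) and sweeping over the training examples gives the online algorithm; by the lemma on the next slide, the thresholds then stay ordered.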
Ordinal regression, PRank
• Lemma: if the thresholds are set to zero initially, they will maintain the correct ordering in the course of the algorithm (see Lemma 1 in [1] if interested in the proof)

PRank, mistake bound
• Theorem: Assume that there exists a unit-norm (theta*, b*) and a margin gamma > 0 such that y_t^(j) (theta*.x_t - b*_j) >= gamma for all examples and tasks, and that ||x_t|| <= R. Then the algorithm makes at most (k-1)(R^2 + 1) / gamma^2 binary mistakes on the training set (see Theorem 2 in [1] if interested in the proof)

Today's topics
• Multi-way classification
- reducing multi-class to binary
- margin based solution
• Rating (ordinal regression)
- reduction to binary problems
- SVM solution, on-line solution
• Ranking
- ranking SVM

Ranking
• Rating products, movies, etc. using a few values (e.g., 1-5 stars) results in a partial ranking of the items
• Many rating / classification problems are better viewed as ranking problems
- suggest movies in the order of user interest in them,
- rank websites to display in response to a query,
- suggest genes relevant to a particular disease condition, etc.
• By casting the learning problem as a ranking problem we can also incorporate other types of data / feedback
- e.g., click-through data from users

Ranking example
• We would like to rank n websites (find top sites to display) in response to a few query words
• The available data contain user selections (clicks) of websites out of those displayed to them
[Figure: a list of displayed results for a query, with the clicked links marked x.]

From selections to preferences
• We can interpret a user click as a statement that they prefer the selected link over others displayed in the context of the query
[Figure: each clicked link (x) is preferred to the other links displayed for the same query.]

Ranking function
• Our goal is to estimate a ranking function over pairs f(x, y) such that its values are consistent with the observed preferences: f(x, y) > f(x, y') whenever website y is preferred to y' in context x
• We can parameterize this function in terms of feature vectors extracted from each pair (context, website): f(x, y) = theta.phi(x, y), where the features phi(x, y) could be, e.g., properties of how website y relates to query x

Ranking function
• The ranking function gives rise to a total ordering of the pairs via projection to the parameter vector
[Figure: three (context, website) pairs projected onto theta, yielding the ordering 1 > 2, 2 > 3.]
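A small sketch (assuming the feature vectors phi(x, y) are given) of ordering candidates by projection onto the parameter vector:

import numpy as np

def rank_candidates(theta, phis):
    # phis: feature vectors phi(x, y), one per candidate website y;
    # order candidates by their projection onto the parameter vector
    scores = np.asarray([theta @ p for p in phis])
    return np.argsort(-scores)   # indices from highest to lowest f(x, y)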

SVM rank
• A training set of order relations between pairs
• An SVM style algorithm for finding a consistent ranking function:

    minimize (1/2) ||theta||^2 + C * sum of slacks xi
    subject to theta.phi(x, y) - theta.phi(x, y') >= 1 - xi_{y,y'}, xi_{y,y'} >= 0,
    for every observed preference "y over y' in context x"

• It is important to appropriately weight or choose which constraints to include
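As an illustration only (the slide's formulation is the constrained QP above), here is a primal subgradient step on the equivalent soft-margin objective; the names prefs, C, and lr are my assumptions:

import numpy as np

def svm_rank_step(theta, prefs, C=1.0, lr=0.01):
    # one subgradient step on 0.5*||theta||^2 + C * sum of hinge losses;
    # prefs: list of (phi_top, phi_low) pairs, recording that the pair
    # with features phi_top was preferred to the one with phi_low
    grad = theta.copy()                      # gradient of 0.5*||theta||^2
    for phi_top, phi_low in prefs:
        margin = theta @ (phi_top - phi_low)
        if margin < 1:                       # violated (or active) constraint
            grad -= C * (phi_top - phi_low)
    return theta - lr * grad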

The effect of ranking constraints
• Adding a single constraint can have a large effect on the ranking solution
[Figure: left, the solution consistent with the constraints 1 > 2, 2 > 3, 3 > 4; right, the solution after adding the constraint 4 > 5, with the violated constraint highlighted.]