Data Mining and Machine Learning
Fall 2018, Homework 3 [VERSION 2.0]
(due on Sep 23, 11:59pm EST)
Jean Honorio jhonorio@purdue.edu
The homework is worth a total of 10 points. Your code should be in
Python 2.7. For clarity, the algorithms presented here will assume zero-based
indices for arrays, vectors, matrices, etc. Please read the submission instructions
at the end. Failure to comply with the submission instructions will cause
your grade to be reduced.
In this homework, we will focus on rating (ordinal regression). You can use
the script createsepratingdata.py to create some synthetic separable data:
import numpy as np
import scipy.linalg as la
# Input: number of samples n, number of features d, number of labels k
# Output: matrix X of features, with n rows (samples), d columns (features)
# X[i,j] is the j-th feature of the i-th sample
# vector y of labels, with n rows (samples), 1 column
# y[i] is the label (1 or 2 … or k) of the i-th sample
# Example on how to call the script:
# import createsepratingdata
# X, y = createsepratingdata.run(10,2,3)
def run(n,d,k):
    if n < k: raise ValueError('n should be at least k')
    X = np.random.random((n,d))
    y = np.zeros((n,1))
    i = 0
    for r in range(1,k+1):
        j = r*n/k
        X[i:j,0] = X[i:j,0] + 1.5*r
        y[i:j] = r
        i = j
    U = la.orth(np.random.random((d,d)))
    X = np.dot(X,U)
    return (X,y)

Here are the questions:

1) [4 points] Implement the PRank algorithm for rating (ordinal regression),
introduced in Lecture 6. Recall that:

s_{t,l} = −1 if y_t ≤ l + 1
          +1 if y_t > l + 1
The algorithm to be implemented is as follows.
Input: number of iterations L, number of labels k, training data x_t ∈ R^d,
y_t ∈ {1, . . . , k} for t = 0, . . . , n−1
Output: θ ∈ R^d, b ∈ R^{k−1}
θ ← 0
for l = 0, . . . , k−2 do
    b_l ← l
end for
for iter = 1, . . . , L do
    for t = 0, . . . , n−1 do
        E ← { l | s_{t,l}(θ · x_t − b_l) ≤ 0 }
        if E is not an empty set then
            θ ← θ + (∑_{l∈E} s_{t,l}) x_t
            for l ∈ E do
                b_l ← b_l − s_{t,l}
            end for
        end if
    end for
end for
The header of your Python script ratingprank.py should be:
# Input: number of iterations L
# number of labels k
# matrix X of features, with n rows (samples), d columns (features)
# X[i,j] is the j-th feature of the i-th sample
# vector y of labels, with n rows (samples), 1 column
# y[i] is the label (1 or 2 … or k) of the i-th sample
# Output: vector theta of d rows, 1 column
# vector b of k-1 rows, 1 column
def run(L,k,X,y):
# Your code goes here
return (theta, b)
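As an illustration only, here is a minimal sketch of one possible way to translate
the pseudocode above into NumPy (the variable names and the vectorized computation
of s_{t,l} and E are our own choices, not requirements):

import numpy as np

# Illustrative sketch of the PRank updates (not the only valid implementation).
def run(L,k,X,y):
    n, d = X.shape
    theta = np.zeros((d,1))                               # theta <- 0
    b = np.arange(k-1, dtype=float).reshape((k-1,1))      # b_l <- l
    for it in range(L):
        for t in range(n):
            x_t = X[t,:].reshape((d,1))
            # s_{t,l} = -1 if y_t <= l+1, +1 otherwise, for l = 0,...,k-2
            s = np.where(y[t,0] <= np.arange(1,k), -1.0, 1.0).reshape((k-1,1))
            # E = { l : s_{t,l} (theta . x_t - b_l) <= 0 }
            E = (s * (np.dot(theta.T, x_t) - b) <= 0).flatten()
            if np.any(E):
                theta = theta + np.sum(s[E]) * x_t
                b[E] = b[E] - s[E]
    return (theta, b)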
2) [2 points] Implement the following rating (ordinal regression) predictor
function, introduced in Lecture 6. Here we present k conditionals (if) for clarity.
Your implementation should work for any number k.
Input: number of labels k, θ ∈ R^d, b ∈ R^{k−1}, testing point x ∈ R^d
Output: label ∈ {1, . . . , k}
if θ · x ≤ b_0 then label ← 1
if b_0 < θ · x ≤ b_1 then label ← 2
if b_1 < θ · x ≤ b_2 then label ← 3
...
if b_{k−3} < θ · x ≤ b_{k−2} then label ← k − 1
if b_{k−2} < θ · x then label ← k
The header of your Python script ratingpred.py should be:
# Input: number of labels k
# vector theta of d rows, 1 column
# vector b of k-1 rows, 1 column
# vector x of d rows, 1 column
# Output: label (1 or 2 ... or k)
def run(k,theta,b,x):
# Your code goes here
return label
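For illustration, a minimal sketch of one way to collapse the chain of conditionals
above into a single loop (names and structure are our own choices):

import numpy as np

# Illustrative sketch: return the smallest label whose threshold is not exceeded.
def run(k,theta,b,x):
    score = float(np.dot(theta.T, x))      # theta . x
    label = k                              # default case: theta . x > b_{k-2}
    for l in range(k-1):
        if score <= b[l,0]:
            label = l + 1
            break
    return label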
3) [4 points] Now we ask you to implement the following support vector machine
for rating (ordinal regression), introduced in Lecture 6. Recall that:

s_{i,l} = −1 if y_i ≤ l + 1
          +1 if y_i > l + 1
The problem to be implemented is as follows.
minimize     (1/2) θ · θ
subject to   s_{i,l}(x_i · θ − b_l) ≥ 1   for i = 0, . . . , n−1 and l = 0, . . . , k−2
             b_l ≤ b_{l+1}                for l = 0, . . . , k−3
First some general notation for clarity: for any integers p and q, let I_{p×p} ∈ R^{p×p}
be the identity matrix with p rows and p columns. Let 0_{p×q} ∈ R^{p×q} be a matrix
of zeros, with p rows and q columns. Let 1_{p×q} ∈ R^{p×q} be a matrix of ones, with
p rows and q columns.
Now, recall that we have n training samples and that x_i ∈ R^d for i = 0, . . . , n−1.
Let H = [ I_{d×d}        0_{d×(k−1)}
          0_{(k−1)×d}    0_{(k−1)×(k−1)} ] ∈ R^{(d+k−1)×(d+k−1)}. Let f = 0_{(d+k−1)×1}.
Let A ∈ R^{(n(k−1)+k−2)×(d+k−1)} be defined as follows. (Recall that x_{i,j} is the
j-th feature of the i-th sample.)
A =
[ −s_{0,0}x_{0,0}        −s_{0,0}x_{0,1}        . . .  −s_{0,0}x_{0,d−1}        s_{0,0}     0          0  . . .  0  0
  −s_{0,1}x_{0,0}        −s_{0,1}x_{0,1}        . . .  −s_{0,1}x_{0,d−1}        0           s_{0,1}    0  . . .  0  0
  ...
  −s_{0,k−2}x_{0,0}      −s_{0,k−2}x_{0,1}      . . .  −s_{0,k−2}x_{0,d−1}      0           0          0  . . .  0  s_{0,k−2}
  ...
  −s_{n−1,0}x_{n−1,0}    −s_{n−1,0}x_{n−1,1}    . . .  −s_{n−1,0}x_{n−1,d−1}    s_{n−1,0}   0          0  . . .  0  0
  −s_{n−1,1}x_{n−1,0}    −s_{n−1,1}x_{n−1,1}    . . .  −s_{n−1,1}x_{n−1,d−1}    0           s_{n−1,1}  0  . . .  0  0
  ...
  −s_{n−1,k−2}x_{n−1,0}  −s_{n−1,k−2}x_{n−1,1}  . . .  −s_{n−1,k−2}x_{n−1,d−1}  0           0          0  . . .  0  s_{n−1,k−2}
  0                      0                      . . .  0                        1           −1         0  . . .  0  0
  0                      0                      . . .  0                        0           1         −1  . . .  0  0
  ...
  0                      0                      . . .  0                        0           0          0  . . .  1  −1 ]
Let c = [ −1_{(n(k−1))×1}
          0_{(k−2)×1}     ] ∈ R^{n(k−1)+k−2}. (The first n(k−1) rows encode each constraint
s_{i,l}(x_i · θ − b_l) ≥ 1 rewritten as −s_{i,l}(x_i · θ) + s_{i,l} b_l ≤ −1; the last k−2 rows encode
each b_l ≤ b_{l+1} rewritten as b_l − b_{l+1} ≤ 0.) Since θ ∈ R^d and b ∈ R^{k−1}, let us define
z = [ θ
      b ] ∈ R^{d+k−1},
and we can rewrite the rating SVM problem as:
minimize     (1/2) z^T H z + f^T z
subject to   Az ≤ c
Fortunately, the package cvxopt can solve exactly the above problem by doing:
import cvxopt as co
z = np.array(co.solvers.qp(co.matrix(H,tc='d'),co.matrix(f,tc='d'),
    co.matrix(A,tc='d'),co.matrix(c,tc='d'))['x'])
The header of your Python script ratingsvm.py should be:
# Input: number of labels k
# matrix X of features, with n rows (samples), d columns (features)
# X[i,j] is the j-th feature of the i-th sample
# vector y of labels, with n rows (samples), 1 column
# y[i] is the label (1 or 2 … or k) of the i-th sample
# Output: vector theta of d rows, 1 column
# vector b of k-1 rows, 1 column
def run(k,X,y):
# Your code goes here
return (theta, b)
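As an illustration only, here is a minimal sketch of one way to assemble H, f, A,
and c row by row and then call cvxopt exactly as shown above (the loop structure is
an assumption about how you might organize the code, not a requirement):

import numpy as np
import cvxopt as co

# Illustrative sketch: build the QP described above and solve it with cvxopt.
def run(k,X,y):
    n, d = X.shape
    m = d + k - 1                                    # size of z = [theta; b]
    H = np.zeros((m,m))
    H[:d,:d] = np.eye(d)                             # identity block for theta
    f = np.zeros((m,1))
    A = np.zeros((n*(k-1)+k-2, m))
    c = np.zeros((n*(k-1)+k-2, 1))
    row = 0
    for i in range(n):
        for l in range(k-1):
            s_il = -1.0 if y[i,0] <= l+1 else 1.0
            A[row,:d] = -s_il * X[i,:]               # -s_{i,l} x_i
            A[row,d+l] = s_il                        # s_{i,l} in the column of b_l
            c[row,0] = -1.0                          # encodes s_{i,l}(x_i . theta - b_l) >= 1
            row += 1
    for l in range(k-2):
        A[row,d+l] = 1.0                             # encodes b_l <= b_{l+1}
        A[row,d+l+1] = -1.0
        row += 1
    z = np.array(co.solvers.qp(co.matrix(H,tc='d'),co.matrix(f,tc='d'),
        co.matrix(A,tc='d'),co.matrix(c,tc='d'))['x'])
    theta = z[:d].reshape((d,1))
    b = z[d:].reshape((k-1,1))
    return (theta, b)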
Notice that for prediction you can reuse the run function in ratingpred.py that
you wrote for question 2.
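For example, you could sanity-check everything end to end on the synthetic data
(assuming the script names above; this snippet is only an illustration):

import createsepratingdata, ratingprank, ratingpred

X, y = createsepratingdata.run(30, 2, 3)        # 30 samples, 2 features, 3 labels
theta, b = ratingprank.run(100, 3, X, y)        # 100 PRank iterations
# or: import ratingsvm; theta, b = ratingsvm.run(3, X, y)
errors = 0
for i in range(X.shape[0]):
    if ratingpred.run(3, theta, b, X[i,:].reshape((2,1))) != y[i,0]:
        errors += 1
print(errors)                                   # separable data: expect few or no errors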
SOME POSSIBLY USEFUL THINGS.
Python 2.7 is available on the servers antor and data. From the terminal, you
can use your Career account to start an ssh session:
ssh username@data.cs.purdue.edu
OR
ssh username@antor.cs.purdue.edu
From the terminal, to start Python:
python
Inside Python, to check whether you have Python 2.7:
import sys
print (sys.version)
Inside Python, to check whether you have the package cvxopt:
import cvxopt
From the terminal, to install the Python package cvxopt:
pip install --user cvxopt
OR
python -m pip install --user cvxopt
More information at https://cvxopt.org/install/index.html
SUBMISSION INSTRUCTIONS.
Your code should be in Python 2.7. We only need the Python scripts
(.py files). We do not need the Python (compiled) bytecodes (.pyc files).
You will get 0 points if your code does not run. You will get 0 points if you
fail to include the Python scripts (.py files) even if you mistakenly include the
bytecodes (.pyc files). We will deduct points if you do not use the right name for
the Python scripts (.py) as described on each question, or if the input/output
matrices/vectors/scalars have a different type/size from what is described on
each question. Homeworks are to be solved individually. We will run plagiarism
detection software.
Please, submit a single ZIP file through Blackboard. Your Python scripts
(ratingprank.py, ratingpred.py, ratingsvm.py) should be directly inside
the ZIP file. There should not be any folder inside the ZIP file, just
Python scripts. The ZIP file should be named according to your Career account.
For instance, if my Career account is jhonorio, the ZIP file should be named
jhonorio.zip