程序代写代做代考 data mining python algorithm Data Mining and Machine Learning

Data Mining and Machine Learning

Fall 2018, Homework 2

(due on Sep 11, 11.59pm EST)

Jean Honorio jhonorio@purdue.edu

The homework is based on a total of 10 points. Your code should be in
Python 2.7. For clarity, the algorithms presented here will assume zero-based
indices for arrays, vectors, matrices, etc. Please read the submission instructions
at the end. Failure to comply to the submission instructions will cause
your grade to be reduced.

In this homework, we will focus on classification for separable data. You can
use the following script createsepdata.py to create some synthetic separable
data:

import numpy as np

import scipy.linalg as la

# Input: number of samples n

# number of features d

# Output: numpy matrix X of features, with n rows (samples), d columns (features)

# X[i,j] is the j-th feature of the i-th sample

# numpy vector y of labels, with n rows (samples), 1 column

# y[i] is the label (+1 or -1) of the i-th sample

# Example on how to call the script:

# import createsepdata

# X, y = createsepdata.run(10,3)

def run(n,d):

y = np.ones((n,1))

y[n/2:] = -1

X = np.random.random((n,d))

idx_row, idx_col = np.where(y==1)

X[idx_row,0] = 0.1+X[idx_row,0]

idx_row, idx_col = np.where(y==-1)

X[idx_row,0] = -0.1-X[idx_row,0]

U = la.orth(np.random.random((d,d)))

X = np.dot(X,U)

return (X,y)

1

In this homework, you will implement algorithms that depend on a kernel
function K. You should call the following script K.py

import numpy as np

import math

# Input: numpy vector x of d rows, 1 column

# numpy vector z of d rows, 1 column

# Output: kernel K(x,z) = exp(-1/2 * norm(x-z)^2)

# Example on how to call the script:

# import K

# v = K.run( np.array([[1], [4], [3]]) , np.array([[2], [7], [-1]]) )

def run(x,z):

x = x.flatten()

z = z.flatten()

if x.size != z.size:

raise ValueError

return math.exp(-1/2 * np.sum((x-z) ** 2))

Here are the questions:

1) [4 points] Implement the following perceptron algorithm with kernels, intro-
duced in Lecture 3.

Input: number of iterations L, training data xt ∈ Rd, yt ∈ {+1,−1} for
t = 0, . . . , n− 1
Output: α ∈ Rn
α← 0
for iter = 1, . . . , L do

for t = 0, . . . , n− 1 do

if yt

(
n−1∑
i=0

αiyiK(xi, xt)

)
≤ 0 then

αt ← αt + 1
end if

end for
end for

The header of your Python script kerperceptron.py should be:

# Input: number of iterations L

# numpy matrix X of features, with n rows (samples), d columns (features)

# X[i,j] is the j-th feature of the i-th sample

# numpy vector y of labels, with n rows (samples), 1 column

# y[i] is the label (+1 or -1) of the i-th sample

# Output: numpy vector alpha of n rows, 1 column

def run(L,X,y):

# Your code goes here

return alpha

2

2) [2 points] Implement the following predictor function with kernels, introduced
in Lecture 3.

Input: α ∈ Rn, training data xt ∈ Rd, yt ∈ {+1,−1} for t = 0, . . . , n − 1,
testing point z ∈ Rd
Output: label ∈ {+1,−1}

if

n−1∑
i=0

αiyiK(xi, z) > 0 then

label← +1
else

label← −1
end if

The header of your Python script kerpred.py should be:

# Input: numpy vector alpha of n rows, 1 column

# numpy matrix X of features, with n rows (samples), d columns (features)

# X[i,j] is the j-th feature of the i-th sample

# numpy vector y of labels, with n rows (samples), 1 column

# y[i] is the label (+1 or -1) of the i-th sample

# numpy vector z of d rows, 1 column

# Output: label (+1 or -1)

def run(alpha,X,y,z):

# Your code goes here

return label

3) [4 points] Now we ask you to implement the following dual support vector
machines (DSVM) problem, introduced in Lecture 4.

maximize

n−1∑
i=0

αi −
1

2

n−1∑
i,j=0

αiαjyiyjK(xi, xj)

subject to αi ≥ 0 for i = 0, . . . , n− 1

Let f = (−1,−1, . . . ,−1)T ∈ Rn be an n-dimensional vector of minus ones. Let
b = (0, 0, . . . , 0)

T ∈ Rn be an n-dimensional vector of zeros. Let A ∈ Rn×n be a
matrix with n rows and n columns, with minus ones on the diagonal (ai,i = −1
for i = 0, . . . , n− 1) and zeros on the off-diagonal (ai,j = 0 for i 6= j). In other
words, A is the negative of the identity matrix. Let H ∈ Rn×n be a matrix
with n rows and n columns, where hi,j = yiyjK(xi, xj) for all i, j = 0, . . . , n−1.
Since α ∈ Rn, we can rewrite the DSVM problem as:

minimize
1

2
αTHα+ fTα

subject to Aα ≤ b

Fortunately, the package cvxopt can solve exactly the above problem by doing:

3

import cvxopt as co

alpha = np.array(co.solvers.qp(co.matrix(H,tc=’d’),co.matrix(f,tc=’d’),

co.matrix(A,tc=’d’),co.matrix(b,tc=’d’))[’x’])

The header of your Python script kerdualsvm.py should be:

# Input: numpy matrix X of features, with n rows (samples), d columns (features)

# X[i,j] is the j-th feature of the i-th sample

# numpy vector y of labels, with n rows (samples), 1 column

# y[i] is the label (+1 or -1) of the i-th sample

# Output: numpy vector alpha of n rows, 1 column

def run(X,y):

# Your code goes here

return alpha

Notice that for prediction you can reuse the kerpred.py script that you
wrote for question 2.

SOME POSSIBLY USEFUL THINGS.
Python 2.7 is available at the servers antor and data. From the terminal, you
can use your Career account to start a ssh session:

ssh username@data.cs.purdue.edu

OR

ssh username@antor.cs.purdue.edu

From the terminal, to start Python:

python

Inside Python, to check whether you have Python 2.7:

import sys

print (sys.version)

Inside Python, to check whether you have the package cvxopt:

import cvxopt

From the terminal, to install the Python package cvxopt:

pip install –user cvxopt

OR

python -m pip install –user cvxopt

More information at https://cvxopt.org/install/index.html

4

SUBMISSION INSTRUCTIONS.
Your code should be in Python 2.7. We only need the Python scripts
(.py files). We do not need the Python (compiled) bytecodes (.pyc files).
You will get 0 points if your code does not run. You will get 0 points in you
fail to include the Python scripts (.py files) even if you mistakingly include the
bytecodes (.pyc files). We will deduct points, if you do not use the right name for
the Python scripts (.py) as described on each question, or if the input/output
matrices/vectors/scalars have a different type/size from what is described on
each question. Homeworks are to be solved individually. We will run plagiarism
detection software.
Please, submit a single ZIP file through Blackboard. Your Python scripts
(kerperceptron.py, kerpred.py, kerdualsvm.py) should be directly inside
the ZIP file. There should not be any folder inside the ZIP file, just
Python scripts. The ZIP file should be named according to your Career account.
For instance, if my Career account is jhonorio, the ZIP file should be named
jhonorio.zip

5