CMSC5741 Big Data Tech. & Apps.
Assignment 2
Due Date: 23:59 Dec.8, 2018
Submission Instruction:
For this assignment, please submit electronic version only. We don’t accept
hard copy. For the programming questions, you need to submit BOTH your
codes and your results. Submit codes as zipped tar file and the output of your
program in plain-text file. You should place the relevant files (if any) in their
separate directory (preferable 1, 2, 3… for each question). For other questions,
answer them in one word document.
Please compress all your files as one zip file named by your student id, e.g.,
10xxxxxxxx.zip, and submit to cmsc5741@cse.cuhk.edu.hk with email title
“CMSC5741 Asg#1, Your name, Your student ID”.
Policy on Late Submission:
1). Late within 2 days – 30% deduction.
2). Late after 2 days – not allowed.
Part A. PCA Programming (20 points)
The following problem is based on lecture 6.
Principal component analysis provides a way of creating an optimal low-dimensional
representation of a dataset. In this exercise, we will do such a PCA analysis on handwritten
digits.
About the dataset (mnist-subset.zip): This is a subset of the MNIST dataset containing just
the digits 0,1 and 2. Next, begin with the homework starter source file (named with
example.py), which contains basic utilities for reading and displaying MNIST dataset.
1. Write a function to perform PCA on a group of images.
2. Use the PCA function from question 1 to compute the Digit-0-Space. Plot the mean
image and then the first 20 eigenvectors (as images). Plot the eigenvalues (in
decreasing order) as a function of dimension (for the first 100 dimensions). Describe
what you find in both plots.
3. Use the PCA function from question 1 to compute the Digit-2-Space. Plot the mean
image and then the first 20 eigenvectors (as images). Plot the eigenvalues (in
decreasing order) as a function of dimension (for the first 100 dimensions). Describe
what you find in both plots.
4. Use the PCA function from question 1 to compute the Digit-Space. Plot the mean
image and then the first 20 eigenvectors (as images). Plot the eigenvalues (in
decreasing order) as a function of dimension (for the first 100 dimensions). Compare
and contrast what you find in this plots to the ones you created in questions 2 and 3.
For this question, submit your code as zipped tar file. For question 2-3, submit your image in
jpg,or png file, submit the description in a plain text.
mailto:cmsc5741@cse.cuhk.edu.hk
Part B. Matrix Factorization (30 points)
The following problem is based on lecture 7.
In this question, you will implement a Probabilistic Matrix Factorization algorithm to predict
the potential ratings that users would assign to items. Distributed with this file is
“pmf_data.zip” which contains two files. The file named “train.dat” contains all the training
data. Each line of this file is a tab-separated triplet, of which the first column is the user id,
the second column is the movie id and the last column is the rating the user assigned to the
movie, e.g. “AH3QC2PC1VTGP 0000143561 2.0”. The other file named “test.dat”
contains all the challenges. Each line of this file is a tab-separated numbers, of which the
first is the user id and the second is the movie id, e.g. “A3R5OBKS7OM2IR 0000143502”.
Based on the training data given in “train.dat”, you are to implement a Probabilistic Matrix
Factorization method. Train your model on the training data, and output your predicted
ratings for all the challenges, in the exactly same order as the “test.dat”. Only output the
predicted number (Don’t print the user id and movie id; you can output integers or floats),
one number each line.
Your score for this question will be based on your RMSE value you can achieve. The
definition of RMSE you can find from here: http://en.wikipedia.org/wiki/Root-mean-
square_deviation
Your prediction will be compared with the ground truth. So try to be as accurate as possible.
You can modify the PMF algorithm where you see as appropriate. You will submit two
files for this assignment: one file containing the source code of your program and another txt
file containing your response to all the challenges.
In this question, you need to concern the efficiency of your code. For naïve implementation,
C/C++ and Java are fast enough to finish the training within one day, Python will take
several days to finish the training.
Part C. SVM Programming (25 points)
The following problem is based on lecture 10.
You are required to use SVM with stochastic gradient decent method introduced in the
lecture to a Google search snippets classification task. Distributed with this file is
“svm_data.zip” which contains three files. You will run your code on the given dataset
(train.dat) and then conduct prediction on the test dataset (test.dat). You can find more
details about the dataset in README. The assessment of your score is your accuracy on
the test dataset. (Tips: since there is no label given in the test dataset, you need to create a
validation dataset from the training set in order to tune parameters).
Description for the dataset:
There are 8 classes (e.g. business, computers) in the dataset. The number of instances in
the training are 10,060 and 2,280 respectively.
The format of the dataset is as follows:
Each line represents an instance, where the last word is the class label, previous words are
the search snippets.
Training and testing datasets have the same format except that in the testing dataset the last
word (i.e. class label) is missing, which you need to predict.
You will need to preprocess the input data by yourself, and are encouraged to use LIBSVM
or other SVM libraries you can get to accomplish the task.
http://en.wikipedia.org/wiki/Root-mean-square_deviation
http://en.wikipedia.org/wiki/Root-mean-square_deviation
You need to submit your code and a plain text of the results. The format of your result is:
each line only contains the class label, e.g. business, sports. And we will mark according to
your accuracy on testing dataset.
Part D. Online Learning Programming (25 points)
The following problem is based on lecture 11.
You are required to implement a version of Perceptron algorithm introduced in the lecture.
You will run your code on the given dataset (online_data.zip) and then conduct prediction on
the test dataset. All the training data can be only processed once, just like a stream input.
You need to output:
1. Final precision, recall and F1 score on the testing dataset.
2. ROC curve (as image) on the testing dataset. The definition of ROC curve can be
found in https://en.wikipedia.org/wiki/Receiver_operating_characteristic.
Description for the dataset:
There are two classes (e.g. +1 and -1) in the dataset. The number of instances in the
training and testing datasets are 29,298 and 3,263 respectively.
The format of the dataset is as follows:
Each line represents an instance, where the first filed is the class label (e.g. +1, -1), followed
by non-zero features, each of which is denoted as “featureno:value”; for example “39:1”
represents the value of feature 39 is 1. Training and testing datasets have the same format.
You need to submit your code and a word file containing the results and ROC curve image.
https://en.wikipedia.org/wiki/Receiver_operating_characteristic