2017/18 COM6012 – Quiz 3

Assignment Brief

Deadline: 9:50am on Wednesday 02 May 2018 (50 minutes, starting at 9:00am)

Late Penalty: 1% will be deducted for every minute of lateness.

How and what to submit: Submit ONE file (.txt or .pdf) via MOLE under the assignment
“Quiz 3”. The file must contain your code and the required information, including comments
and program output (Info1, Info2, Info3 and Info4 in the task descriptions). Typically, you
can just copy and paste from your interactive window into this submission file. Name your
file “LastName_Quiz3.xxx”.

Format: You will be given several tasks and must write Scala programs to complete them. No
template will be given, but you can modify existing code to complete the quiz. It is ‘open
book’: you have access to the internet and teaching materials but must complete the quiz
independently, without getting help from another person. You are encouraged to add
specific comments to your code.

Scope: Lab Sessions 8 and 9 and Lectures 8 and 9. You are expected to have completed
Lab Sessions 8 and 9. Please use interactive mode to develop and test your code and to get
the output for submission.

Assessment criteria (20 marks in total)

1. Being able to perform PCA to reduce the dimensionality of data.
2. Being able to perform K-means clustering and evaluate the results.

On Plagiarism: http://www.dcs.shef.ac.uk/intranet/teaching/public/assessment/plagiarism.html
———————————————————–

Quiz 3 tasks

We will work on the Semeion Handwritten Digit Data Set from UCI:
https://archive.ics.uci.edu/ml/datasets/Semeion+Handwritten+Digit
The data file is provided for your convenience: semeion.data

The data set is a binary matrix of size 1593 x 266. Each of the 1593 rows is one (binary)
handwritten digit image. Each image is of size 16×16 and is represented as a vector of 256
values. The last 10 bits (0/1) of each row indicate the class label.
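
As an illustration of this row layout, the following plain-Scala sketch splits one row into its 256 pixel values and the position of the set label bit. It assumes that the values in a row are separated by whitespace. The names SemeionRow, parse and labelIndex are hypothetical, and the sketch is not tied to any particular Spark API.

```scala
object SemeionRow {
  val NumPixels = 256    // 16 x 16 image flattened into one vector
  val NumLabelBits = 10  // trailing 0/1 values that encode the class label

  /** Splits one whitespace-separated row into (pixels, labelIndex).
    * labelIndex is the position of the first label bit equal to 1,
    * or -1 if no label bit is set.
    */
  def parse(line: String): (Array[Double], Int) = {
    val values = line.trim.split("\\s+").map(_.toDouble)
    require(
      values.length == NumPixels + NumLabelBits,
      s"expected ${NumPixels + NumLabelBits} values, got ${values.length}"
    )
    // dropRight removes the trailing label bits and keeps the 256 pixels.
    val pixels = values.dropRight(NumLabelBits)
    val labelBits = values.takeRight(NumLabelBits)
    (pixels, labelBits.indexWhere(_ == 1.0))
  }
}
```

For a synthetic row of 256 zeros followed by a label block whose fourth bit is set, parse returns an array of length 256 and a labelIndex of 3.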

Task 1: Read the data in and remove the last 10 bits. Add a comment to indicate how the
10 bits are removed. [3 marks]

Info1: A comment in your code explaining how the last 10 bits were removed.

Task 2: Perform k-means clustering on the 1593 images to cluster them into 10 clusters,
using the original 256 variables for each image. Evaluate the clustering quality with a
clustering evaluation metric. Report the metric for your clustering result. [5 marks]

Info2: Explain what your chosen evaluation metric is and report its value for your
clustering result.
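
As one example of what a clustering evaluation metric measures, the plain-Scala sketch below computes the within-cluster sum of squared errors: the sum, over all points, of the squared Euclidean distance to the nearest cluster centre (lower is better). This is only one possible metric and is not necessarily the one expected here. The names ClusterCost, squaredDistance and wssse are hypothetical.

```scala
object ClusterCost {
  type Point = Array[Double]

  /** Squared Euclidean distance between two points of equal dimension. */
  def squaredDistance(a: Point, b: Point): Double = {
    require(a.length == b.length, "points must have the same dimension")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  /** Sum over all points of the squared distance to the nearest centre. */
  def wssse(points: Seq[Point], centres: Seq[Point]): Double = {
    require(centres.nonEmpty, "at least one centre is required")
    points.map { p =>
      // Distance from p to its closest centre.
      centres.map(c => squaredDistance(p, c)).reduce((d1, d2) => math.min(d1, d2))
    }.sum
  }
}
```

For the four points (0,0), (0,2), (10,0), (10,2) and the two centres (0,1), (10,1), every point lies at squared distance 1 from its nearest centre, so wssse returns 4.0.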

Task 3: Find the smallest number P of principal components that retains at least 80% of the
variance. Report the number P and the largest 5 eigenvalues. [8 marks]
[If you cannot determine the required P, choose P=5 and report the largest 5 eigenvalues,
but you will receive no more than 5 marks for this task.]

Info3: The number P and the largest 5 eigenvalues.
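
The fraction of variance kept by the top P principal components equals the sum of the P largest eigenvalues of the covariance matrix divided by the sum of all eigenvalues. The plain-Scala sketch below shows how the smallest such P could be found once the eigenvalues are available; how they are obtained is left open. The names VarianceThreshold and smallestP are hypothetical, and the eigenvalues in the usage note are made up for illustration, not computed from semeion.data.

```scala
object VarianceThreshold {
  /** Smallest P such that the P largest eigenvalues account for at least
    * `fraction` of the total variance (the sum of all eigenvalues).
    */
  def smallestP(eigenvalues: Seq[Double], fraction: Double): Int = {
    require(eigenvalues.nonEmpty, "at least one eigenvalue is required")
    require(eigenvalues.forall(_ >= 0.0), "eigenvalues must be non-negative")
    require(fraction > 0.0 && fraction <= 1.0, "fraction must be in (0, 1]")

    val sorted = eigenvalues.sortWith(_ > _)  // largest first
    val total = sorted.sum
    require(total > 0.0, "total variance must be positive")

    // cumulative(i) is the sum of the (i + 1) largest eigenvalues.
    val cumulative = sorted.scanLeft(0.0)(_ + _).tail
    val idx = cumulative.indexWhere(_ / total >= fraction)

    // If rounding keeps the final ratio just below `fraction`, use all components.
    if (idx >= 0) idx + 1 else sorted.length
  }
}
```

With the made-up eigenvalues 4, 3, 2, 1 the cumulative fractions are 0.4, 0.7, 0.9 and 1.0, so smallestP(Seq(4.0, 3.0, 2.0, 1.0), 0.8) returns 3.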

Task 4: Perform k-means clustering on the 1593 images to cluster them into 10 clusters,
using the PCA features determined in Task 3, i.e., P principal components instead of 256.
Use the same metric as in Task 2 to evaluate the clustering result and report which
clustering result (Task 2 or Task 4) is better according to the metric. [4 marks]

Info4: The metric (value) for this clustering result and which clustering result (Task 2 or
Task 4) is better according to the metric.