Assignment 3
Due: 15 Nov 2018, 11:59pm

1 Submission Guidelines

Assignments should be submitted to comp4331fall18@gmail.com as attachments.

You need to zip the following two files together:

– A3 itsc stuid report.pdf/.docx: Please put your entire report in this file. (Attachments should be original .pdf or .docx, NOT compressed)

– A3 itsc stuid code.zip: This zip file contains all your source code for this assignment.

All attachments, including report and code, should be named in the format: Ax itsc stuid.zip. For example, for a student with ITSC account sdiaa and student ID 20171234, the 1st assignment would be named: A1 sdiaa 20171234.zip.

Submissions not following the rules above are NOT accepted.
20 marks will be deducted for every 24 hours after the deadline.
Your grade will be based on correctness, efficiency and clarity.
For questions, email sdiaa@connect.ust.hk or trasier1207@gmail.com.
Plagiarism will result in a mark of zero.

COMP 4331 Data Mining, Fall 2018


2 Major Tasks

This assignment consists of the following tasks:
• To acquire a better understanding of clustering methods.
• To learn to implement K-means for clustering.
• To learn to implement Fuzzy Clustering EM for clustering.
• To learn to use a DBSCAN model for clustering.

2.1 Clustering Methods

You are required to implement the three clustering methods below and report the running time of each. Use Euclidean distance as the distance metric.

  • K-means: You are required to implement the K-means clustering method yourself. You are not allowed to use any existing K-means package (basic computation packages, e.g., NumPy and SciPy, are fine). You should test your algorithm with different K values, K ∈ {2, 10, 20, 30}.
  • Fuzzy Clustering EM: You are required to implement the fuzzy clustering method using the EM algorithm (refer to slides 11ClusAdvanced.pdf). You are not allowed to use any existing EM package (basic computation packages, e.g., NumPy and SciPy, are fine). Test your algorithm with K = 2.
  • DBSCAN: You are required to use the DBSCAN model for clustering. You may use the DBSCAN model implemented by scikit-learn. Test your model on the dataset with ε = 0.12 and MinPts = 3.

    Each model is required to output a text file with clusters (e.g., https://github.com/comp4331fall18/sample-code-DM/blob/master/Assignment3-Dataset/outputfile_sample.txt). Your programs should be written in such a way that the TA can run them easily to verify the results reported by you.
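As a rough illustration of the K-means requirement above, the following is a minimal sketch of the standard Lloyd iteration with Euclidean distance, using only NumPy. The function name kmeans, the random initialisation from data points, and the max_iter and seed parameters are assumptions made for this sketch; the assignment does not prescribe them, and the output file format is not covered here.

```python
import numpy as np


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` into k groups with Lloyd's algorithm.

    Returns (labels, centers). `max_iter` and `seed` are illustrative
    parameters, not values prescribed by the assignment.
    """
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumed initialisation: k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)

    for _ in range(max_iter):
        # Assignment step: Euclidean distance from every point to every center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing, so the centers are stable
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers


if __name__ == "__main__":
    demo = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
    labels, centers = kmeans(demo, k=2)
    print(labels)
    print(centers)
```

The result depends on the initial centers, so when comparing K ∈ {2, 10, 20, 30} it may help to fix the seed so that reported timings and clusters are reproducible.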

2.2 Data Set

You are required to test your models on the given dataset, https://github.com/comp4331fall18/sample-code-DM/blob/master/Assignment3-Dataset/dataset.mat. The dataset contains 500 two-dimensional points.
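One possible way to read the dataset, assuming dataset.mat has been downloaded to the working directory, is scipy.io.loadmat. The name of the variable stored inside the file is not stated in this handout, so the hypothetical helper load_points below simply takes the first non-metadata entry; adjust it once you have inspected the file.

```python
import numpy as np
from scipy.io import loadmat


def load_points(path):
    """Return the first data array stored in a MATLAB .mat file.

    Assumption: the file holds a single variable with one point per row.
    """
    mat = loadmat(path)
    # loadmat adds bookkeeping keys such as __header__; skip them.
    data_keys = [key for key in mat if not key.startswith("__")]
    if not data_keys:
        raise ValueError(f"no data variables found in {path}")
    return np.asarray(mat[data_keys[0]], dtype=float)
```

Calling load_points("dataset.mat") on a downloaded copy should give an array of shape (500, 2) if the file matches the description above.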

2.3 Report Writing

You are also expected to report the time (measured with the Python time package) required by each method to complete the task, excluding the time needed to load the data files. For K-means, report the running time for each K setting. For Fuzzy Clustering EM, report the sum of squared error (SSE) and the center points in each iteration.
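The reporting requirements for Fuzzy Clustering EM can be sketched as follows. This assumes a common textbook formulation: membership weights proportional to inverse squared Euclidean distance in the E-step, centers as weight-squared means in the M-step, and SSE as the weight-squared sum of squared distances. The course slides may define these details differently, and the names fuzzy_em, p, tol, max_iter and seed are assumptions of this sketch. Timing uses time.perf_counter from the Python time package and covers only the clustering call, not data loading.

```python
import time

import numpy as np


def fuzzy_em(points, k=2, p=2, max_iter=100, tol=1e-6, seed=0):
    """Fuzzy clustering by alternating E- and M-steps.

    Prints the SSE and the centers after every iteration and returns
    (weights, centers, sse_history). All defaults are illustrative.
    """
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumed initialisation: k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    history = []

    for it in range(1, max_iter + 1):
        # E-step: weights proportional to 1 / squared distance to each center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against a point lying on a center
        inv = 1.0 / d2
        weights = inv / inv.sum(axis=1, keepdims=True)

        # M-step: each center is the mean of all points, weighted by w^p.
        wp = weights ** p
        new_centers = (wp.T @ points) / wp.sum(axis=0)[:, None]

        # SSE of the current weights measured against the updated centers.
        new_d2 = ((points[:, None, :] - new_centers[None, :, :]) ** 2).sum(axis=2)
        sse = float((wp * new_d2).sum())
        history.append(sse)
        print(f"iteration {it}: SSE = {sse:.6f}, centers = {new_centers.tolist()}")

        shift = np.abs(new_centers - centers).max()
        centers = new_centers
        if shift < tol:
            break  # centers have effectively stopped moving
    return weights, centers, history


demo = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
start = time.perf_counter()  # start timing after the data is in memory
weights, centers, history = fuzzy_em(demo, k=2)
elapsed = time.perf_counter() - start
print(f"Fuzzy Clustering EM took {elapsed:.4f} s")
```

The same start/stop pattern around each call can be reused to time K-means for every K and the DBSCAN run. Note that the printing inside the loop is included in the measured time here; move it out of the timed region if that matters for your report.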


3 Grading Scheme

• K-means:
  – Build the K-means model. (15 points)
  – Output 4 text files with cluster information based on different K values. (10 points)
• Fuzzy Clustering EM:
  – Build the Fuzzy Clustering EM model. (15 points)
  – Print the SSE and center points in each iteration. (5 points)
  – Output the text file with cluster information. (10 points)
• DBSCAN:
  – Build the DBSCAN model. (5 points)
  – Output the text file with cluster information. (10 points)
• Project Report (30 points)

While you may discuss general ideas about the assignment with your classmates, your submission should be based on your own independent effort. If you seek help from any person or reference source, you should state it clearly in your submission. Failure to do so is considered plagiarism, which will lead to appropriate disciplinary action.
