Homework 4
Due: Monday by 11:59pm. Points: 100. Submitting: a file upload. Available: May 19 at 12pm – May 30 at 11:59pm (12 days).
In this assignment, we are working on a list of 1200 bitstrings (https://canvas.ucdavis.edu/courses/666488/files/16365329/download?download_frd=1), each of which contains 16 bits.


We will apply Agglomerative Clustering, K-means Clustering, and PCA to this dataset.
Here is the zip (https://canvas.ucdavis.edu/courses/666488/files/16365330/download?download_frd=1) that contains the downloaded data and the problems.
Background and Data Information
For a bitstring in this dataset, we describe it as b = b_15 b_14 … b_1 b_0, where b_15 is often known as the most significant bit (MSB) and b_0 as the least significant bit (LSB).
There are duplicated bitstrings in this dataset, but they will not affect this assignment. Don’t worry about them.
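For concreteness, here is a minimal loading sketch. It assumes the download is a plain-text file with one 16-character bitstring per line and uses a hypothetical filename (bitstrings.txt); adjust the path to wherever you saved the file.

```python
import numpy as np

# Assumed filename; replace with the actual path of the downloaded file.
with open("bitstrings.txt") as f:
    strings = [line.strip() for line in f if line.strip()]

# Turn each 16-bit string into a length-16 vector of 0/1 integers,
# so that every bit can be treated as a feature by the clustering algorithms.
X = np.array([[int(ch) for ch in s] for s in strings])
print(X.shape)  # expected: (1200, 16)
```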
Equivalence Relation
This is an important concept for Exercise 1.
Let’s say we have two bitstrings, x and y. We can flip one bit in x to get another bitstring x', so that the difference between x and x' is only one bit. We define this transformation as x → x'.
We call two bitstrings x and y equivalent (x ~ y) if there exists a sequence x = s_0 → s_1 → … → s_n = y, where every s_i in the sequence belongs to the dataset and the difference between consecutive bitstrings s_i and s_{i+1} is only one bit.

It can be seen that equivalence is both symmetric (x ~ y implies y ~ x) and transitive (x ~ y and y ~ z together imply x ~ z).
We can say that the elements in the above sequence form an equivalence class. Given a new bitstring z, if z ~ s_i for some s_i in the sequence, then z will be added to the above equivalence class, and by the transitive property of equivalence relations, z ~ x and z ~ y.
Example
Let’s say we have 4 bitstrings, each of them 4 bits long; call them s_1, s_2, s_3, and s_4. We can say s_1 ~ s_2 and s_2 ~ s_3. However, s_4 is not equivalent to any of them: there may be sequences that would connect s_3 to s_4 through an intermediate bitstring differing from each of them by one bit, but none of the candidate intermediate bitstrings is in our dataset. As a result, s_1, s_2, and s_3 form an equivalence class, whereas s_4 on its own is the other. Ultimately, there are two classes.
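As a sanity check on the definition (not required by the exercises, which use clustering instead), one can count the equivalence classes in the real dataset directly: treat each unique bitstring as a node, connect pairs at Hamming distance exactly 1, and count connected components with a breadth-first search. This sketch assumes the `strings` list from the loading sketch above.

```python
from collections import deque

unique = sorted(set(strings))            # duplicates do not change the classes
members = set(unique)

def one_bit_neighbors(s):
    """All bitstrings in the dataset reachable from s by flipping exactly one bit."""
    for i in range(len(s)):
        flipped = s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:]
        if flipped in members:           # only dataset members count
            yield flipped

seen = set()
n_classes = 0
for start in unique:
    if start in seen:
        continue
    n_classes += 1                       # found a new equivalence class
    queue = deque([start])
    seen.add(start)
    while queue:                         # BFS over the whole class
        cur = queue.popleft()
        for nb in one_bit_neighbors(cur):
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)

print("number of equivalence classes:", n_classes)
```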
Libraries that can be used: numpy, scipy, pandas, scikit-learn, matplotlib, seaborn
Any libraries used in the discussion materials are also allowed.
Exercises
Exercise 1 – Agglomerative Clustering (40 points in total)
Using agglomerative clustering with a distance threshold for early stopping, we can calculate the number of equivalence classes by counting the number of clusters. In order to perform agglomerative clustering, we have to consider what parameters may be used:
Exercise 1.1 – Choosing Parameters (20 points)
Explain how you would pick these parameters:
Which linkage rule should be used? (single-linkage, complete-linkage, or average-linkage)
Which distance function should be used? (Euclidean distance, Manhattan distance, or cosine distance)
What should the distance threshold be?
How the distance threshold works: whenever two clusters are picked as candidates for merging, the distance between those clusters is compared to the distance threshold. If the distance is smaller than the threshold, the clusters merge and the algorithm continues; otherwise, they will not be merged.
How to choose a linkage rule: think about how you would figure out which equivalence class the string 0001 belongs to in the previously given example.
Exercise 1.2 – Agglomerative Clustering for Equivalence Classes (20 points)
Perform the agglomerative clustering with the parameters you picked in the above three questions. Show the frequency (number of members) of each cluster. You are encouraged to create a bar chart to show the distribution, as it will help you in Exercise 2, but printing only the numbers is also fine.
The value of distance_threshold in the arguments should be slightly higher than what you picked. This is because we only merge two clusters when their distance is strictly smaller than the threshold.
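A possible sketch of this step is below. The linkage, metric, and threshold here are illustrative assumptions only (pick and justify your own in Exercise 1.1); `X` is the 1200x16 bit array from the loading sketch. Newer scikit-learn versions take the `metric` argument where older ones used `affinity`.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

# Illustrative parameter choices only -- substitute the ones you justified
# in Exercise 1.1. distance_threshold is set slightly above the assumed
# threshold because merges only happen strictly below it.
agg = AgglomerativeClustering(
    n_clusters=None,          # let the distance threshold decide the cluster count
    distance_threshold=1.5,   # "slightly higher" than an assumed threshold of 1
    linkage="single",
    metric="manhattan",       # use affinity="manhattan" on older scikit-learn
)
labels = agg.fit_predict(X)

# Frequency (number of members) of each cluster.
ids, counts = np.unique(labels, return_counts=True)
print(dict(zip(ids, counts)))

plt.bar(ids, counts)
plt.xlabel("cluster id")
plt.ylabel("number of members")
plt.show()
```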
Exercise 2 – K-Means Clustering (30 points in total)
Let’s see how k-means behaves differently from agglomerative clustering.
Exercise 2.1 – K-Means Clustering for Equivalence Classes (20 points)

Re-cluster the dataset with k-means, but this time with the number of clusters you obtained from Exercise 1. Show the frequency (number of members) of each cluster. Again, you are encouraged to create a bar chart, but printing the numbers is also fine.
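A corresponding sketch, assuming `labels` and `X` from the agglomerative sketch above:

```python
import numpy as np
from sklearn.cluster import KMeans

n_clusters = len(np.unique(labels))   # cluster count obtained in Exercise 1
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
km_labels = km.fit_predict(X)

# Frequency (number of members) of each k-means cluster.
ids, counts = np.unique(km_labels, return_counts=True)
print(dict(zip(ids, counts)))
```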
Exercise 2.2 – Difference between Agglomerative Clustering and K-Means Clustering (10 points)
Compare the result from Exercise 2.1 with that from Exercise 1.2, and explain:
How the two results are different
Why there is such a difference
Exercise 3 – Principal Component Analysis (30 points in total)
We can visualize how the bitstrings are distributed using principal component analysis.
Exercise 3.1 – Generate 2 Clusters (10 points)
Re-do the k-means clustering on our dataset, but this time only consider k=2. Show the frequency (number of members) of each cluster.
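A brief sketch under the same assumptions (`X` as before); keep the labels around for the plot in Exercise 3.2:

```python
import numpy as np
from sklearn.cluster import KMeans

labels2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
ids, counts = np.unique(labels2, return_counts=True)
print(dict(zip(ids, counts)))   # frequency of each of the two clusters
```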
Exercise 3.2 – PCA for Feature Extraction (20 points)
Retrieve the projected dataset with PCA, using n_components=2.
Generate a scatter plot to visualize the projected points, where they should be colored differently based on the assigned cluster in Exercise 3.1.
In the first principal component, print out the weights of all features.
Report which feature has the highest positive weight in the first principal component.
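A sketch of this step, assuming `X` and the k=2 cluster labels `labels2` from Exercise 3.1:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the 16-dimensional bit vectors onto the first two principal components.
pca = PCA(n_components=2)
X_proj = pca.fit_transform(X)

# Scatter plot of the projected points, colored by the k=2 cluster assignment.
plt.scatter(X_proj[:, 0], X_proj[:, 1], c=labels2, cmap="coolwarm", s=10)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()

# Weights of all 16 features in the first principal component,
# and the index of the feature with the highest positive weight.
weights = pca.components_[0]
print(weights)
print("feature with the highest positive weight:", int(np.argmax(weights)))
```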
