
Lecture 9: Unsupervised Learning
Semester 1, 2022, CIS
Copyright © University of Melbourne 2022. All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without
written permission from the author.


Acknowledgement: , &

• Classification
• Evaluation of supervised machine learners
• Feature Selection
• Unsupervised learning: Clustering
  • Concepts
  • Evaluation

Clustering

A possible clustering of the weather dataset
Outlook   Temperature  Humidity  Windy  Cluster
sunny     hot          high      FALSE  ?
sunny     hot          high      TRUE   ?
overcast  hot          high      FALSE  ?
rainy     mild         high      FALSE  ?
rainy     cool         normal    FALSE  ?
rainy     cool         normal    TRUE   ?
overcast  cool         normal    TRUE   ?
sunny     mild         high      FALSE  ?
sunny     cool         normal    FALSE  ?
rainy     mild         normal    FALSE  ?
sunny     mild         normal    TRUE   ?
overcast  mild         high      TRUE   ?
overcast  hot          normal    FALSE  ?
rainy     mild         high      TRUE   ?

Clustering over the weather dataset (cf. outputs)
Outlook   Temperature  Humidity  Windy  Cluster  Play
sunny     hot          high      FALSE  0        no
sunny     hot          high      TRUE   0        no
overcast  hot          high      FALSE  0        yes
rainy     mild         high      FALSE  1        yes
rainy     cool         normal    FALSE  1        yes
rainy     cool         normal    TRUE   1        no
overcast  cool         normal    TRUE   1        yes
sunny     mild         high      FALSE  0        no
sunny     cool         normal    FALSE  1        yes
rainy     mild         normal    FALSE  1        yes
sunny     mild         normal    TRUE   1        yes
overcast  mild         high      TRUE   1        yes
overcast  hot          normal    FALSE  0        yes
rainy     mild         high      TRUE   1        no

Clustering
• Clustering is unsupervised
• The class of an example is not known (or at least not used)
• Finding groups of items that are similar
• Success often measured subjectively
• Applications in pattern recognition, spatial data analysis, medical diagnosis, . . .

Clustering, basic contrasts
Exclusive vs. overlapping
• Can an item be in more than one cluster?
Deterministic vs. probabilistic (Hard vs. soft clustering)
• Can an item be partially or weakly in a cluster?
Hierarchical vs. partitioning
• Do the clusters have subset relationships between them? e.g. nested in a tree?
Partial vs. complete
• In some cases, we only want to cluster some of the data
Heterogeneous vs. homogeneous
• Clusters of widely different sizes, shapes, and densities
Incremental vs. batch
• Is the whole set of items clustered in one go?

Exclusive vs. overlapping clustering
Can an item be in more than one cluster?

Deterministic vs. probabilistic clustering
Can an item be partially or weakly in a cluster?
Hard (deterministic) assignment:

Instance  Cluster
1         2
2         3
3         2
4         1
5         2
6         2
7         4

Soft (probabilistic) assignment:

Instance  Cluster 1  Cluster 2  Cluster 3  Cluster 4
1         0.01       0.87       0.12       0.00
2         0.05       0.25       0.67       0.03
3         0.00       0.98       0.02       0.00
4         0.45       0.39       0.08       0.08
5         0.01       0.99       0.00       0.00
6         0.07       0.75       0.08       0.10
7         0.23       0.10       0.20       0.47

Hierarchical vs. partitioning clustering
Do the clusters have subset relationships between them? e.g. nested in a tree?

Partial vs. complete
In some cases, we only want to cluster some of the data

Heterogeneous vs. homogeneous
Clusters of widely different sizes, shapes, and densities

Incremental vs. batch clustering

Clustering, Desiderata
• Scalability; high dimensionality
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Able to deal with noise and outliers
• Insensitive to order of input records

What is a good clustering?

Two clusters?

Four clusters?

Six clusters?

Types of Evaluation
Unsupervised
• Measures the goodness of a clustering structure without respect to external information. Includes measures of cluster cohesion (compactness, tightness), and measures of cluster separation (isolation, distinctiveness).
Supervised
• Measures the extent to which the clustering structure discovered by a clustering algorithm matches some external structure. For instance, entropy can measure how well cluster labels match externally supplied class labels.

Unsupervised Evaluation I
A “good” cluster should have one or both of:
• High cluster cohesion: instances in a given cluster should be closely related to each other

cohesion(C_i) = \sum_{x,y \in C_i} Distance(x, y)

• High cluster separation: instances in different clusters should be distinct from each other

separation(C_i, C_j) = \sum_{x \in C_i, y \in C_j, j \neq i} Distance(x, y)
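As a small illustration, a sketch of both measures in Python (assuming NumPy arrays of numeric points and Euclidean distance; the function names are ours, not from any particular library):

import numpy as np

def euclidean(x, y):
    # Euclidean (L2) distance between two points
    return np.sqrt(np.sum((x - y) ** 2))

def cohesion(cluster):
    # sum of pairwise distances within one cluster (x = y pairs contribute 0); lower = tighter
    return sum(euclidean(x, y) for x in cluster for y in cluster)

def separation(cluster_i, cluster_j):
    # sum of distances between points of two different clusters; higher = more separated
    return sum(euclidean(x, y) for x in cluster_i for y in cluster_j)

C1 = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1]])
C2 = np.array([[5.0, 5.0], [5.2, 4.8]])
print(cohesion(C1), separation(C1, C2))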

Unsupervised Evaluation II
The most common measure for evaluating cluster quality is the Sum of Squared Error (SSE), or Scatter:

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)

• x: a data point in cluster C_i
• m_i: the representative point for cluster C_i
• For each point, the error is the distance to the nearest cluster centroid
• To get SSE, we square these errors and sum them

Unsupervised Evaluation II
The most common measure for evaluating cluster quality is the Sum of Squared Error (SSE), or Scatter:

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)

• One can show that the m_i that minimises SSE corresponds to the center (mean) of the cluster
• At ‘application time’: given two clusterings, we can choose the one with the smallest error
• One easy way to reduce SSE is to increase k, the number of clusters
• However, a good clustering with smaller k can have a lower SSE than a poor clustering with higher k
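A minimal NumPy sketch of computing SSE for a given assignment (array and variable names are illustrative):

import numpy as np

def sse(X, labels, centroids):
    # sum of squared distances from each point to the centroid of its assigned cluster
    return sum(np.sum((X[labels == i] - m) ** 2) for i, m in enumerate(centroids))

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == i].mean(axis=0) for i in range(2)])
print(sse(X, labels, centroids))   # small, since points sit close to their cluster means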

Sum of squared errors: Example
SSE for Categorical Attributes
SSE = SSE_1 + SSE_2 = 38

Supervised Evaluation
If labels are available, evaluate the degree to which class labels are consistent within a cluster and different across clusters
• Purity: probability of the class with the highest probability (representation) in each cluster. Higher is better.

purity = \sum_{i=1}^{k} \frac{|C_i|}{N} \max_j P_i(j)

• Entropy: entropy of the probability distribution over class labels in each cluster. Lower is better.

entropy = \sum_{i=1}^{k} \frac{|C_i|}{N} H(x_i)

• where x_i is the distribution of class labels in cluster i and N is the total number of instances

Supervised Evaluation Example I
Ex. 1: Calculate the entropy and purity of the following cluster output
purity = \sum_{i=1}^{k} \frac{|C_i|}{N} \max_j P_i(j)
entropy = \sum_{i=1}^{k} \frac{|C_i|}{N} H(x_i)

Supervised Evaluation Example II
Ex. 2: Calculate the entropy and purity of the following cluster output
entropy_1 = −1 × log(1) − 0 × log(0) = 0
entropy_2 = −0.6 × log(0.6) − 0.4 × log(0.4) = 0.97
purity_1 = max(1, 0) = 1
purity_2 = max(0.6, 0.4) = 0.6
(logarithms base 2; 0 × log(0) is taken as 0)

Supervised Evaluation Example II
Ex. 2: Calculate the entropy and purity of the following cluster output
entropy = (2/12) × 0 + (10/12) × 0.97 = 0.81
purity = (2/12) × 1 + (10/12) × 0.6 = 0.67
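A short Python sketch that reproduces the Ex. 2 numbers from the class counts per cluster (cluster 1: 2 + 0 instances, cluster 2: 6 + 4 instances):

import numpy as np

def cluster_entropy(counts):
    # entropy (base 2) of a class-count vector; zero counts are dropped since 0*log(0) = 0
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

def overall_entropy_purity(clusters):
    # weight each cluster by its share of the N instances
    N = sum(sum(c) for c in clusters)
    entropy = sum(sum(c) / N * cluster_entropy(c) for c in clusters)
    purity = sum(max(c) / N for c in clusters)
    return entropy, purity

print(overall_entropy_purity([[2, 0], [6, 4]]))   # ≈ (0.81, 0.67)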

Similarity / Proximity / Closeness
Clustering finds groups of instances in the dataset which
• are similar or close to each other within a group
• are different or separated from instances in other clusters.
A key component of any clustering algorithm is a measure of the distance between any two points.

Measuring Similarity / Proximity / Closeness
Data points in Euclidean space
• Euclidean (L2) distance
• Manhattan (L1) distance

Measuring Similarity / Proximity / Closeness
Discrete values
• Hamming distance (discrepancy between the bit strings)
d   a  b  c
a   0  1  1
b   1  0  1
c   1  1  0
For two bit strings, the number of positions at which the corresponding symbols are different

Measuring Similarity / Proximity / Closeness
Text documents
• Cosine similarity
• Jaccard measure
Other measures
• Correlation
• Graph-based measures
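A sketch of several of these measures in plain Python/NumPy (SciPy's scipy.spatial.distance module offers ready-made equivalents):

import numpy as np

def euclidean(x, y):        # L2 distance
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):        # L1 distance
    return np.sum(np.abs(x - y))

def hamming(a, b):          # positions at which two equal-length strings differ
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard(A, B):          # set overlap: |A ∩ B| / |A ∪ B|
    return len(A & B) / len(A | B)

print(hamming("1011101", "1001001"))                          # 2
print(jaccard({"the", "cat", "sat"}, {"the", "cat", "ran"}))  # 0.5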

k-means Clustering
Given k, the k-means algorithm proceeds as follows:
1. Select k points to act as seed cluster centroids
2. repeat
3. Assign each instance to the cluster with the nearest centroid
4. Recompute the centroid of each cluster
5. until the centroids don’t change
k-means is an exclusive, deterministic, partitioning, batch clustering method.
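A minimal NumPy sketch of this loop (random seeding, Euclidean distance; empty clusters are not handled, and in practice sklearn.cluster.KMeans would be the usual choice):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # 1. pick k seed centroids
    for _ in range(max_iter):                                     # 2. repeat
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                             # 3. assign to nearest centroid
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):                 # 5. until centroids don't change
            break
        centroids = new_centroids                                 # 4. recompute centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)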

Example, Iterations
Demo: https: //www.naftaliharris.com/blog/visualizing-k-means-clustering/

k-means Clustering – Details
• Initial centroids are often chosen randomly.
• Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Nearest’ is based on proximity/similarity metric.
• K-means will converge for common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
• Often the stopping condition is changed to ‘Until relatively few points change clusters’ (this way the stopping criterion will not depend on the type of similarity or dimensionality)

Example, Impact of initial seeds

Example, Different outcomes

Shortcomings of k-means

k-means, Pros and Cons
Strengths
• relatively efficient: O(ndki), where n is no. instances, d is no. attributes, k is no. clusters, and i is no. iterations; normally k, i ≪ n
• unfortunately we cannot know the value of i a priori!
• can be extended to hierarchical clustering
Weaknesses
• tends to converge to local minimum; sensitive to seed instances
• need to specify k in advance
• not able to handle non-convex clusters, or clusters of differing densities or sizes
• “mean” ill-defined for nominal or categorical attributes
• may not work well when the data contains outliers

How to choose the number of clusters?
• Calculate SSE for different numbers of clusters K

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} (x − m_i)^2

• As K increases, we will have a smaller number of instances in each cluster → SSE decreases
• Elbow method: choose the K at which, going from K to K + 1, the drop in SSE starts to diminish
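For example, a sketch of the elbow method with scikit-learn, whose KMeans exposes the final SSE as inertia_ (assuming scikit-learn is installed; the data below is made up for illustration):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1.0, (50, 2)) for c in (0, 5, 10)])   # three loose blobs

# SSE (inertia) for increasing K; look for the K after which the drop levels off
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))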

Hierarchical Clustering
Bottom-up (= agglomerative) clustering
• Start with single-instance clusters
• Iteratively merge clusters in a bottom-up fashion
Top-down (= divisive) clustering
• Start with one universal cluster
• Iteratively split clusters in a top-down fashion
In contrast to k-means clustering, hierarchical clustering only requires a measure of similarity between groups of data points. No seeds, no k value.

Agglomerative Clustering (bottom-up)
Input: A distance function
1. Compute the proximity matrix, if necessary
2. repeat
3. Merge the closest two clusters
4. Update the proximity matrix to reflect the proximity between the new cluster and the original clusters
5. until only one cluster remains
Output: A dendrogram: a tree that represents the hierarchical clustering of all instances into successively smaller groups.

Example, Step 1

Example, Step 2

Example, Step 3
(Figure: intermediate clusters C2–C5, points p1 … p12, and the corresponding proximity matrix)

Graph-based measure of Proximity
Updating the proximity matrix:
• Single Link: Minimum distance between any two points in the two
clusters. (most similar members)
• Complete Link: Maximum distance between any two points in the two clusters. (most dissimilar members)
• Group Average: Average distance between all points (pairwise).

Agglomerative Clustering Example
An initial proximity matrix with five clusters (aka. five points)

      1     2     3     4     5
1  1.00  0.90  0.10  0.65  0.20
2  0.90  1.00  0.70  0.60  0.50
3  0.10  0.70  1.00  0.40  0.30
4  0.65  0.60  0.40  1.00  0.80
5  0.20  0.50  0.30  0.80  1.00

What are the two closest points?

Agglomerative Clustering Example
An initial proximity matrix with five clusters (aka. five points)
      1     2     3     4     5
1  1.00  0.90  0.10  0.65  0.20
2  0.90  1.00  0.70  0.60  0.50
3  0.10  0.70  1.00  0.40  0.30
4  0.65  0.60  0.40  1.00  0.80
5  0.20  0.50  0.30  0.80  1.00
Merge points 1 & 2 into a new cluster: {1,2}

Agglomerative Clustering Example
An initial proximity matrix with five clusters (aka. five points)
      1     2     3     4     5
1  1.00  0.90  0.10  0.65  0.20
2  0.90  1.00  0.70  0.60  0.50
3  0.10  0.70  1.00  0.40  0.30
4  0.65  0.60  0.40  1.00  0.80
5  0.20  0.50  0.30  0.80  1.00
Merge points 1 & 2 into a new cluster: {1,2}
Update (single link): proximity of {1,2} to each other cluster is that of its most similar member
  {1,2}–3: 0.70    {1,2}–4: 0.65    {1,2}–5: 0.50
Update (complete link): proximity of {1,2} to each other cluster is that of its most dissimilar member
  {1,2}–3: 0.10    {1,2}–4: 0.60    {1,2}–5: 0.20

Agglomerative Clustering Example
An initial proximity matrix with five clusters (aka. five points)
      1     2     3     4     5
1  1.00  0.90  0.10  0.65  0.20
2  0.90  1.00  0.70  0.60  0.50
3  0.10  0.70  1.00  0.40  0.30
4  0.65  0.60  0.40  1.00  0.80
5  0.20  0.50  0.30  0.80  1.00
The two resulting dendrograms:
Single link Complete link

Quick coding demo!
https://docs.scipy.org/doc/scipy-1.2.1/reference/cluster.
hierarchy.html
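A minimal example with the SciPy routines documented at that link (single-link agglomerative clustering of a few 2-D points; 'complete' and 'average' are the other linkage options discussed above):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 0.5]])

Z = linkage(X, method='single')                    # build the dendrogram bottom-up
labels = fcluster(Z, t=3, criterion='maxclust')    # cut it into 3 flat clusters
print(labels)

# dendrogram(Z) plots the tree (requires matplotlib)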

Top-down clustering: Bi-secting k-means
Variant of K-means that can produce a partitional or a hierarchical clustering.
Basic Idea: to obtain K clusters, split the set of all points into two clusters, select one of these clusters to split, and so on, until K clusters have been produced.
1. Initialize the list of clusters to contain the cluster consisting of all points.
2. repeat
3. Pick a cluster from the current dendrogram.
4. {Perform several ‘trial’ bisections of the chosen cluster.}
5. for i = 1 to number of trials do
6. Bisect the selected cluster using basic k-means.
7. end for
8. Select the two clusters from the bisection with the lowest total SSE.
9. Add these two clusters to the dendrogram.
10. until we have produced (a dendrogram of) k clusters.
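A rough sketch of this procedure, using scikit-learn's basic k-means for the bisections; picking the cluster with the largest SSE to split next is one common choice and an assumption of this sketch:

import numpy as np
from sklearn.cluster import KMeans

def sse(points):
    return float(np.sum((points - points.mean(axis=0)) ** 2))

def bisecting_kmeans(X, k, n_trials=5):
    clusters = [X]                                          # 1. one cluster with all points
    while len(clusters) < k:                                # 2. repeat until k clusters
        chosen = clusters.pop(int(np.argmax([sse(c) for c in clusters])))
        best = None
        for trial in range(n_trials):                       # 5.-7. several trial bisections
            labels = KMeans(n_clusters=2, n_init=1, random_state=trial).fit_predict(chosen)
            parts = [chosen[labels == 0], chosen[labels == 1]]
            if best is None or sum(map(sse, parts)) < sum(map(sse, best)):
                best = parts                                # 8. keep the lowest-SSE bisection
        clusters.extend(best)                               # 9. add the two new clusters
    return clusters

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1.0, (40, 2)) for c in ([0, 0], [6, 0], [3, 6])])
print([len(c) for c in bisecting_kmeans(X, k=3)])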

Final Thoughts
Clustering is in the eyes of the beholder
• “The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art […]”1 Cluster validation is an open research area!
Other variants of unsupervised learning (beyond the scope of the subject)
• Density Estimation: Estimate the underlying (unobservable) probability distribution that explains the observed variables (features). E.g., Gaussian mixture models (also: ‘soft’ K-means)
• Dimensionality Reduction: Map original features to a smaller set of features that still explain the observed data well (either a subset, or a new set of features). E.g., Principal Component Analysis
1 http://homepages.inf.ed.ac.uk/rbf/BOOKS/JAIN/Clustering_Jain_Dubes.pdf

• Unsupervised learning: Clustering
  • Concepts
  • Evaluation
Next up…
• Semi-supervised learning
• Anomaly detection
• Ethics in ML

References
Resources:
Tan, Steinbach, Kumar (2006) Introduction to Data Mining. Chapter 8, Cluster Analysis http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf
Jain, Dubes (1988) Algorithms for Clustering Data. http: //homepages.inf.ed.ac.uk/rbf/BOOKS/JAIN/Clustering_Jain_Dubes.pdf
