Unsupervised learning: clustering

1. Unsupervised learning: clustering
2. Concepts
3. Methods
4. Evaluation
5. Semi-supervised learning

Clustering
Definition:
1. Clustering is unsupervised: the class labels are not known in advance
2. Finding groups of items that are similar
Basic contrasts
1. Exclusive vs overlapping
a) Can an item be in more than one cluster?
2. Deterministic (hard) vs probabilistic (soft) [hard vs soft clustering]
a) "Instance i belongs to class 1" with certainty: hard clustering
b) "Instance i belongs to class 1 with probability 0.7 and class 2 with probability 0.3": soft clustering
3. Hierarchical vs partitional
a) Do the clusters have subset relationships between them? e.g., nested in a tree?
4. Partial vs complete
a) In some cases, we only want to cluster some of the data
5. Heterogeneous vs homogeneous
a) Can clusters have widely different sizes, shapes, and densities?
6. Incremental vs batch
a) Are clusters built up incrementally, or is all the data partitioned in one pass?

The ideal clustering algorithm
1. Scalability; high dimensionality
a) Can handle many attributes and very high-dimensional data
2. Ability to deal with different types of attributes
a) Numerical / categorical
3. Discovery of clusters with arbitrary shape
4. Able to deal with noise and outliers
5. Insensitive to the order of the input records
a) Especially in incremental clustering: feeding the data in order 1, 2, 3 should give the same result as order 3, 2, 1
How to evaluate a good clustering?
1. Measures of cluster cohesion (compactness, tightness)
2. Measures of cluster separation (isolation, distinctiveness)
3. Evaluation
a) Unsupervised evaluation
i. High cluster cohesion
ii. High cluster separation
iii. Sum of squared error (SSE)
b) Supervised evaluation
i. If labels are available, evaluate the degree to which class labels are consistent within a cluster and differ across clusters (labels are not used when building the model, only at evaluation time)
ii. Entropy (both SSE and entropy are sketched after this list)
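A minimal sketch of both metrics, assuming NumPy is available; the function names, arrays, and class counts are illustrative, not from the course material:

```python
import numpy as np

def sse(X, labels, centroids):
    # Unsupervised: sum of squared distances from each point to its cluster centroid.
    return sum(np.sum((X[labels == k] - c) ** 2) for k, c in enumerate(centroids))

def cluster_entropy(class_counts):
    # Supervised: entropy of the class-label distribution inside one cluster;
    # 0 when the cluster is pure, larger when labels are mixed.
    p = class_counts / class_counts.sum()
    p = p[p > 0]  # avoid log(0)
    return -np.sum(p * np.log2(p))

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 0.5], [5.0, 5.5]])
print(sse(X, labels, centroids))          # 1.0
print(cluster_entropy(np.array([3, 1])))  # ~0.81 bits
```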
Methods: the core ingredient is distance
Clustering finds groups of instances in a dataset such that:
1. Instances within a group are similar or close to each other
2. Instances in different groups are distinct or well separated from other clusters
Distance measures (see the sketch after this list)
1. Data point in Euclidean space
a) Euclidean distance
b) Manhattan distance
2. Discrete values
a) Hamming distance
3. Text documents
a) Cosine similarity
b) Jaccard measure
4. Other measure
a) Correlation
b) Graph-based measures
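A minimal sketch of the measures above (NumPy assumed; the function names are my own, and Jaccard is shown on sets of words as a stand-in for documents):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def hamming(x, y):
    # Number of positions where two discrete-valued vectors differ.
    return int(np.sum(x != y))

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard(a, b):
    # a and b are sets, e.g. the sets of words in two documents.
    return len(a & b) / len(a | b)

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(x, y))   # 5.0
print(manhattan(x, y))   # 7.0
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```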
Key topic: K-means clustering
1. Definition:
i. Randomly choose k initial centroids, then repeatedly assign each point to its nearest centroid and recompute each centroid as the mean of its assigned points, until the assignments stop changing
ii. Random initialisation means different runs can produce different results
iii. Cannot be used on heterogeneous problems (clusters of very different sizes, shapes, or densities)
2. Advantage
a) Relatively efficient (often converges within roughly 4-5 iterations)
b) Can be extended to hierarchical clustering (split once, then split the resulting clusters again)
3. Disadvantage
a) Local optimum, sensitive to seed instances
b) Need to specify k in advance
c) Cannot handle non-convex clusters
d) The mean is hard to define for nominal or categorical attributes
e) May not work well when data contains outliers
4. How to choose the number of clusters?
a) Elbow method (see the sketch after this list)
i. Increase k one step at a time; pick the k at which moving from k to k+1 no longer gives a large drop in SSE (the "elbow" of the SSE curve)
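A minimal elbow-method sketch using scikit-learn's KMeans (scikit-learn assumed installed; the data is synthetic, and `inertia_` is scikit-learn's name for the SSE to the nearest centroid):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three well-separated blobs.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

# Fit k-means for several k and watch where the SSE drop flattens out.
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # SSE drops sharply until k = 3, then levels off
```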
Hierarchical clustering: how clusters are built
1. Bottom-up clustering
a) Agglomerative clustering (steps; see the sketch after this list)
i. Compute the proximity matrix
ii. Repeat:
1. Merge the two closest clusters
2. Update the proximity matrix
iii. Until only one cluster remains
b) How to compute the proximity matrix?
i. Graph-based measures of proximity
a) Single link
b) Complete link
c) Group average

2. Top-down clustering
a) Bisecting k-means: start with all points in one cluster, then repeatedly pick a cluster and split it in two with k-means (k = 2)
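A minimal agglomerative-clustering sketch using SciPy (assumed available), showing the three linkage criteria above; the data is synthetic:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))  # synthetic data

# 'single' = single link, 'complete' = complete link, 'average' = group average.
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)  # builds the merge tree bottom-up
    flat = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 flat clusters
    print(method, flat)
```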

Semi-supervised learning
a. Semi-supervised learning approaches
a) Clustering + majority voting (see the sketch after this list)
i. Combines a supervised and an unsupervised model
ii. Find clusters; for each cluster, take the most common label among its labelled members and apply it to the unlabelled members
iii. Advantage:
1. Effectively more labelled data, so performance can improve
iv. Disadvantage:
1. The few labelled instances may not be representative; if one happens to be an outlier, the propagated labels can be badly wrong
b) Self-training (known as ‘bootstrapping’)
b. Active learning
c. Semi-supervised learning vs active learning
d. Query strategies
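A minimal sketch of the clustering + majority voting idea from above, assuming scikit-learn; the helper name and the -1 convention for "unlabelled" are illustrative:

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def cluster_and_vote(X, y_partial, k):
    # y_partial uses -1 for unlabelled instances (an assumed convention).
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    y_out = y_partial.copy()
    for c in range(k):
        members = clusters == c
        known = y_partial[members & (y_partial != -1)]
        if len(known) > 0:
            majority = Counter(known.tolist()).most_common(1)[0][0]
            # Propagate the cluster's most common label to its unlabelled members.
            y_out[members & (y_partial == -1)] = majority
    return y_out

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)
y_partial = np.array([0, -1, 1, -1])
print(cluster_and_vote(X, y_partial, k=2))  # expected: [0 0 1 1]
```

Note that if a cluster contains no labelled members, this sketch simply leaves its points unlabelled; a real pipeline would need a fallback for that case.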
