
Unsupervised Learning & Data Clustering
Problem Set-up

Define the set-up of unsupervised learning


Learning from Unlabeled Data
|Given a training set of n unlabeled samples {x(i)}
|What can we learn from the samples?
➔We could estimate the overall distribution of the data without knowing their label.
➔We could figure out the groupings of the samples (if any).
➔We could identify some features that may be more important than others.

An Example
|Illustrating how structures/groupings of unlabeled samples may relate to the (unknown) labels of the samples
➔If we know the labels, we may find the densities of the classes.
➔What may we see if we have no label for the data samples?

Another Example
|A density estimated from unlabeled samples may help us to identify densities of different classes
|If we know there are three classes in the data, each having a normal distribution …

A Mixture-Density Model
|Assume a parametric model like this:
-The samples come from C classes.
-The prior probabilities P(ω_j) for each class are known, for j = 1, …, C.
-The forms of p(x | ω_j, θ_j), j = 1, …, C, are known.
-The C parameter vectors θ_1, θ_2, …, θ_C are unknown.
|Samples from this distribution are given, but the labels of the training samples are unknown.

A Mixture-Density Model (cont’d)
|What is the PDF of the unlabeled samples?
p(x | θ) = Σ_{j=1}^{C} p(x | ω_j, θ_j) P(ω_j)
where θ = (θ_1, θ_2, …, θ_C)
|Can we learn θ from unlabeled samples drawn from this mixture density?

Illustrating Mixture-Density Model
|An example: with the assumption of 4 classes

Illustrating Mixture-Density Model
|An example: with the assumption of 2 classes

The Question of Identifiability
|Can we learn a unique θ from a set of unlabeled samples from a mixture density?
-For continuous features (with PDFs), the answer is often “Yes”.
|An example in the discrete case (with a PMF):
-Two coins with P(head) being p and q, respectively.
-Randomly pick one coin and toss it; record the outcome.
-With only the outcomes of N tosses, but not knowing which coin was used each time (→ unsupervised), can we figure out p and q?
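A minimal sketch (in Python, with made-up numbers, not from the lecture) of why the answer here is "no": when each coin is picked with equal probability, the unlabeled tosses follow a single Bernoulli distribution with P(head) = (p + q)/2, so any pair (p, q) with the same average is indistinguishable from the outcomes alone.

```python
# Marginal probability of 'head' when the chosen coin is not recorded.
def prob_head_unlabeled(p, q, pick_first=0.5):
    return pick_first * p + (1.0 - pick_first) * q

print(prob_head_unlabeled(0.3, 0.7))  # 0.5
print(prob_head_unlabeled(0.4, 0.6))  # 0.5 -- same distribution of outcomes
```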

Unsupervised Learning
Gaussian Mixture Models and the EM Algorithm

Define the Gaussian Mixture Model
Illustrate the Expectation-Maximization Algorithm

The Gaussian Mixture Models
|The mixture model:
p(x | θ) = Σ_{j=1}^{C} p(x | ω_j, θ_j) P(ω_j)
|GMM: each component density is a Gaussian distribution.
-Can be a good approximation to many real data distributions.
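As a small illustration, here is a sketch of evaluating a 1-D GMM density as a weighted sum of Gaussian component densities. The weights, means, and variances below are arbitrary illustrative values, and the function name is my own, not from the lecture.

```python
import numpy as np

def gmm_pdf(x, weights, means, variances):
    """Evaluate p(x | theta) = sum_j w_j * N(x; m_j, v_j) at the points in x."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for w, m, v in zip(weights, means, variances):
        total += w * np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)
    return total

# Example: a two-component mixture evaluated at two points.
print(gmm_pdf([0.0, 2.5], weights=[0.5, 0.5], means=[0.0, 3.0], variances=[1.0, 1.0]))
```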

If We Do Have Labels…
p(x | θ) = Σ_{j=1}^{C} p(x | ω_j, θ_j) P(ω_j)
|This becomes supervised learning for each component (class).
|It is more difficult without labels.

Unsupervised Case
|Consider an iterative method using the maximum likelihood estimation concept.
|Consider a 3-component 1-d example.
|What are the parameters in this case?
|We might have some initial (imprecise) guesses for the parameters, e.g., obtained via the k-means algorithm.
-How can we improve the initial guesses?

Unsupervised Case (cont’d)
|Iterate over t:
Given the parameter estimates from iteration t-1,
Step 1 (E-step). For each sample j, compute its probability of belonging to class k.
Step 2 (M-step). Update the estimates of the parameters.
This is an example of the Expectation-Maximization (EM) algorithm.
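A minimal runnable sketch of these two steps for a 1-D GMM is below. The variable names (priors, means, variances for P(ω_j) and θ_j) are my own, the initialization is a simple random choice of samples, and a proper convergence test is omitted for brevity.

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    priors = np.full(k, 1.0 / k)                   # P(omega_j)
    means = rng.choice(x, size=k, replace=False)   # initial (imprecise) guesses
    variances = np.full(k, np.var(x))
    for _ in range(n_iter):
        # Step 1 (E-step): probability that sample i belongs to component j,
        # given the current parameter estimates.
        resp = np.stack([priors[j] * np.exp(-0.5 * (x - means[j]) ** 2 / variances[j])
                         / np.sqrt(2 * np.pi * variances[j])
                         for j in range(k)], axis=1)        # shape (n, k)
        resp /= resp.sum(axis=1, keepdims=True)
        # Step 2 (M-step): re-estimate priors, means, and variances.
        nk = resp.sum(axis=0)
        priors = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
    return priors, means, variances
```

Calling em_gmm_1d(x, 3) on a 1-D sample array x returns the estimated mixing probabilities, component means, and component variances after the iterations.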

Unsupervised Learning
The k-Means Algorithm

Discuss the basics of data clustering
Illustrate the k-Means Algorithm

Finding the Clusters/Groupings of the Samples
|A few basic questions to answer
-How to represent the clusters?
➔ We will use the centroid to represent a cluster.
-Which cluster should a sample be assigned to (i.e., membership)?
➔ We will use the similarity to the centroid to determine the membership.
-What similarity measure to use?
• E.g., Euclidean distance

More on Similarity Measures
|If we use Euclidean distance as the measure:
-It is invariant to translations & rotations of the feature space.
-But it is not invariant to more general transformations, e.g., if one feature is scaled.
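A tiny illustration (made-up points) of this scaling issue: rescaling one feature changes which of two points is nearer to a query under Euclidean distance.

```python
import numpy as np

a, b, query = np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([0.0, 0.0])
print(np.linalg.norm(a - query) < np.linalg.norm(b - query))   # True: a is closer

scale = np.array([10.0, 1.0])       # e.g., the first feature measured in different units
print(np.linalg.norm(a * scale - query * scale) <
      np.linalg.norm(b * scale - query * scale))               # False: now b is closer
```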

More on Similarity Measures (cont’d)
|Other types of similarity measures
|E.g., cosine similarity
|E.g., distance on a graph, like shortest path.
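A short sketch of these two alternatives: cosine similarity computed directly with NumPy, and a shortest-path distance computed with the networkx library (an assumed dependency, not mentioned in the lecture).

```python
import numpy as np
import networkx as nx

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(np.array([1.0, 1.0]), np.array([2.0, 2.0])))  # 1.0: same direction

# Distance on a graph: length of the shortest path between two nodes.
G = nx.Graph([("a", "b"), ("b", "c")])
print(nx.shortest_path_length(G, "a", "c"))  # 2
```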

Clustering as Optimization
|The sum-of-squared-error criterion/cost
-Let D_i be the subset of samples assigned to cluster i.
-Let n_i be the number of samples in D_i, and m_i the mean of those samples.
-The sum of squared error is:
J_e = Σ_i Σ_{x∈D_i} ||x − m_i||²
➔Well-separated, compact data “clouds” tend to give small errors when the clusters coincide with the clouds.

Clustering as Optimization (cont’d)
➔An optimization problem to solve for finding a “good” clustering: find the partition of the data that minimizes J_e.
➔If the membership of a sample is determined by the distance to the means m_i,
➔then the task is to find the optimal set of {m_i}.
➔The problem is NP-hard.

k-Means Clustering
|Input: Given n data samples
|Goal: Partition them into k clusters/sets D_i, with respective center/mean vectors μ_1, μ_2, …, μ_k, so as to minimize
Σ_{i=1}^{k} Σ_{x∈D_i} ||x − μ_i||²
|Comparing with the mixture models:
-Here we do “hard” assignment of the membership to a sample (simply based on its distance to the cluster center).

The Basic k-Means Algorithm
Given: n samples, a number k.
initialize 1, 2, …, k(randomly selected)
do classify n samples according to nearest i
recompute i
until no change in i
return 1, 2, …, k End

Illustrating the Algorithm

Unsupervised Learning
Analyzing the k-Means Algorithm

Discuss the weaknesses of the k-means algorithm
Discuss a few common techniques for potential improvement

Properties of the k-Means Algorithm
|The algorithm will converge when the cluster centers no longer change.
|But the result may not be an optimal solution.
➔Sensitivity to initialization

Another Example
|What can we do to improve?
➔ The natural grouping seems to be well defined.
➔ For k=3, what will be the clusters?

A Few Common “Tricks”
|Multiple runs with different initial centers (see the sketch below).
|Choosing the point furthest from the previous centers.
-Drawback: might be sensitive to “outliers”.
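A short sketch of the multiple-runs trick using scikit-learn (an assumed dependency): n_init random initializations are tried and the run with the lowest sum-of-squared error (inertia_) is kept.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data
km = KMeans(n_clusters=3, init="random", n_init=20, random_state=0).fit(X)
print(km.inertia_)   # sum-of-squared error of the best of the 20 runs
```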

Other Variants of Basic k-Means
|k-Means++:
-New centers are chosen with probabilities that are a function of the distance to the closest prior centers (a minimal seeding sketch follows below).
-Kind of between “random” and “furthest point” techniques.
|Hierarchical approaches
-Agglomerative vs. divisive.
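A minimal sketch of the k-means++ seeding step described above (the names are my own): after the first center is drawn uniformly, each new center is drawn with probability proportional to the squared distance to the closest center chosen so far.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                 # first center: uniform at random
    for _ in range(k - 1):
        # squared distance of each sample to its closest center chosen so far
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                           # farther points are more likely
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

The returned array can be used in place of random initialization in the basic k-means loop sketched earlier.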

The Question of Choosing k
|Two trivial extremes
-If k=1, the error is the variance of the samples.
-If k=n, the error can become 0.
|What is a proper choice of k in between?