UNSUPERVISED LEARNING
Machine Learning for Financial Data
March 2021
Contents
◦ Introduction
◦ Cluster Analysis
◦ K-Means Clustering
◦ K-Modes Clustering
◦ Density-based Clustering
Introduction
Machine learning focuses primarily on supervised learning but the vast majority of the available data is unlabelled!
▪ Most of the applications of ML today are based on supervised learning
▪ The vast majority of the available data is unlabeled
▪ Having the input features X but not the labels y
▪ To develop a regular binary classifier to predict whether an item shown in a picture is defective or not, you will need to label every single picture as “defective” or “normal”
▪ Labelling generally requires human experts to manually go through all the pictures
▪ A long, costly, and tedious task, so usually done on only a small subset of the available pictures
▪ The labeled dataset will be quite small and the classifier’s performance will be disappointing
▪ Every time any change is made to the system, the labelling process will need to be repeated
Unsupervised Learning
Unsupervised learning refers to the use of ML algorithms to identify patterns in datasets containing data points that are neither classified nor labeled. The algorithms are thus allowed to classify, label and/or group the data points contained within the datasets without having any external guidance in performing that task. The ML algorithms will group data points according to similarities and differences even though there are no categories provided.
Unsupervised learning algorithms can only learn from the samples themselves as there are no data labels to learn from
▪ In unsupervised learning there is no hidden teacher, so the main goal cannot be minimizing the prediction error with respect to the ground truth
▪ Unsupervised learning algorithms have to learn some pieces of information without any formal indication
▪ The only option is to learn from the samples themselves
▪ An unsupervised algorithm is usually aimed at discovering the similarities and patterns among samples or reproducing an input distribution given a set of features drawn from it
Unsupervised learning can be more unpredictable than supervised learning, such as creating clutter instead of order
▪ Unsupervised learning can be more unpredictable than a supervised learning model
▪ An unsupervised learning system might, for example, figure out on its own how to sort cats from dogs
▪ Such an unsupervised learning system might also add unforeseen and undesired categories to deal with unusual breeds, creating clutter instead of order
▪ ML systems capable of unsupervised learning are often associated with generative learning models
▪ Chatbots, self-driving cars, facial recognition programs, expert systems and robots are among the systems that may use either supervised or unsupervised learning approaches, or both
Cluster Analysis / Clustering
Clustering
The task of identifying like with like and assigning them to clusters or groups of similar instances. Just like in classification, each instance gets assigned to a group. However, unlike classification, clustering is an unsupervised task. Also, clustering has no notion of correctness.
Classification uses labelled data whereas clustering uses unlabelled data
Samples with labels
Samples without labels
Are you able to identify the clusters?
Clustering algorithms can identify the 3 clusters fairly well making only 5 mistakes out of 150 samples!
Data preparation use cases
▪ Data analysis
▪ When you analyze a new dataset, it can be helpful to run a clustering algorithm, and then analyze each cluster separately
▪ Dimensionality reduction
▪ Once a dataset has been clustered, it is usually possible to measure each instance’s affinity with each cluster (affinity is any measure of how well an instance fits into a cluster)
▪ Each instance’s feature vector x can then be replaced with the vector of its cluster affinities
▪ If there are k clusters, then this vector is k-dimensional
▪ This vector is typically much lower-dimensional than the original feature vector, but it can preserve enough information for further processing (see the sketch below)
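A minimal sketch of this affinity-based representation, assuming scikit-learn and a synthetic numeric dataset; KMeans.transform returns each instance's distance to every centroid, and the inverted-distance affinity below is just one illustrative choice, not a prescribed formula.

```python
# Minimal sketch: replacing feature vectors with k-dimensional cluster affinities.
# The dataset is synthetic and the inverted-distance affinity is illustrative only.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, n_features=10, centers=5, random_state=42)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)

# transform() gives each instance's distance to each of the k centroids
distances = kmeans.transform(X)        # shape: (500, 5)
affinity = 1.0 / (1.0 + distances)     # closer to a centroid -> higher affinity

print(X.shape, "->", affinity.shape)   # (500, 10) -> (500, 5)
```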
Data preparation use cases
▪ Semi-supervised learning
▪ If you only have a few labels, you could perform clustering and propagate the labels to all the instances in the same cluster
▪ This technique can greatly increase the number of labels available for a subsequent supervised learning algorithm, and thus improve its performance
▪ Anomaly detection (outlier detection)
▪ Any instance that has a low affinity to all the clusters is likely to be an anomaly (see the sketch below)
▪ For example, if you have clustered the users of your website based on their behavior, you can detect users with unusual behavior, such as an unusual number of requests per second
▪ Anomaly detection is particularly useful in detecting defects in manufacturing, or for fraud detection
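A minimal sketch of clustering-based anomaly detection, assuming scikit-learn and synthetic data; the 95th-percentile cut-off is an arbitrary illustrative threshold, not a recommended setting.

```python
# Minimal sketch: flag instances with low affinity (large distance) to all clusters.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Distance to the closest centroid; a large value means low affinity to every cluster
closest_dist = kmeans.transform(X).min(axis=1)
threshold = np.percentile(closest_dist, 95)   # illustrative cut-off
anomalies = X[closest_dist > threshold]

print(f"{len(anomalies)} instances flagged as potential anomalies")
```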
Customer segmentation, recommendation system & image segmentation use cases
▪ Customer segmentation
▪ For marketing campaigns and recommender systems
▪ Search engines
▪ Some search engines let you search for images that are similar to a reference image
▪ To build such a system, you would first apply a clustering algorithm to all the images in your database; similar images would end up in the same cluster
▪ Then when a user provides a reference image, all you need to do is use the trained clustering model to find this image’s cluster, and you can then simply return all the images from this cluster
▪ Image segmentation
▪ By clustering pixels according to their color, then replacing each pixel’s color with the mean color of its cluster, it is possible to considerably reduce the number of different colors in the image (see the sketch below)
▪ Image segmentation is used in many object detection and tracking systems, as it makes it easier to detect the contour of each object
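A minimal sketch of colour-based image segmentation with scikit-learn; the random array stands in for a real RGB image and the choice of 8 colour clusters is arbitrary.

```python
# Minimal sketch: colour quantisation by clustering pixel colours with K-Means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(120, 160, 3)).astype(float)  # placeholder RGB image

pixels = image.reshape(-1, 3)                        # one row per pixel
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pixels)

# Replace each pixel's colour with the mean colour (centroid) of its cluster
segmented = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)
print("distinct colours after segmentation:",
      len(np.unique(segmented.reshape(-1, 3), axis=0)))
```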
Clustering algorithms group samples according to their similarities, which capture the distances between samples
$$d_{sim}(\bar{x}_i, \bar{x}_j) = \frac{1}{\delta(\bar{x}_i, \bar{x}_j) + \epsilon}$$

$$\delta(\bar{x}_i, \bar{x}_j) = \sqrt{\sum_{l=1}^{m} (x_{il} - x_{jl})^2}$$

$$C_i = \left\{ \bar{x}_j : d_{sim}(\bar{x}_j, \bar{\mu}_i) > d_{sim}(\bar{x}_j, \bar{\mu}_k), \; k \in \{1, 2, \cdots, i-1, i+1, \cdots, K\} \right\}$$

$d_{sim}$ measures the similarity between 2 vectors
$\delta$ measures the Euclidean distance between 2 vectors
$X = \{\bar{x}_1, \bar{x}_2, \cdots, \bar{x}_N\}$ is the dataset to be clustered
$N$ is the number of data points in the dataset
$\bar{x}_i = (x_{i1}, x_{i2}, \cdots, x_{im})$ is a sample vector
$m$ is the number of features in a vector
$\epsilon$ is a constant introduced to avoid division by 0
$C_i$, $C_k$ are clusters generated by the clustering algorithm
$\bar{\mu}_i$ is a representative vector of $C_i$
$\bar{\mu}_k$ is a representative vector of $C_k$
$K$ is the number of clusters
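A minimal NumPy sketch of the similarity measure defined above; the vectors and the value of $\epsilon$ are illustrative.

```python
# Minimal sketch of the inverse-distance similarity measure defined above.
import numpy as np

def euclidean_distance(x_i, x_j):
    """delta: Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((x_i - x_j) ** 2))

def similarity(x_i, x_j, eps=1e-9):
    """d_sim: inverse-distance similarity; eps avoids division by zero."""
    return 1.0 / (euclidean_distance(x_i, x_j) + eps)

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([1.5, 1.0, 2.0])
print(similarity(x1, x2))   # larger values mean more similar samples
```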
Clustering algorithms produce different types of clustering results
▪ Hierarchy vs Partitions
▪ Exclusive membership vs Multiple membership
▪ Hard assignment vs Soft assignment
▪ Complete vs Incomplete
K-Means Clustering
Linear Space
The challenge is to get a computer to identify the same three clusters that are relatively obvious to the naked eye
Select the number of clusters (K=3) to identify in the dataset and randomly select 3 data points as cluster centroids
Centroids of 3 new clusters
For each data point, find the closest centroid and assign the data point to the corresponding cluster
Distance from the 1st data point to the blue cluster
Assign the 1st data point to the blue cluster
For each cluster, calculate the new centroid using the cluster’s data points
For each data point, re-assign it to the cluster of the closest centroid
No change in this case
The clustering algorithm has converged!
Is that the end? No, not when working in a linear space!
The quality of the clustering can be assessed by adding up the variation within each cluster
total variation within the clusters
As far as the algorithm is concerned, it is not clear whether this clustering is the best one and should therefore be returned as the predicted clustering.
The algorithm can only repeat the process with different initial centroids and rate the quality of each clustering using the total variation within the clusters.
Calculate the total variation resulting from using the 3 randomly picked new centroids
total variation of another set of clusters
Iterate the clustering with new centroids and record the corresponding total variation
The algorithm will do a few iterations of clustering (it will do as many as you tell it to do) and suggest the one with the least total variation. A sketch of this restart-and-compare procedure in code follows below.
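A minimal sketch of the restart-and-compare procedure using scikit-learn, assuming synthetic data; n_init controls the number of random initialisations and the fit retained is the one with the lowest inertia (total within-cluster variation).

```python
# Minimal sketch: repeated random initialisations, keeping the clustering
# with the lowest total within-cluster variation (inertia).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=150, centers=3, random_state=1)

# n_init restarts with fresh random centroids; the best run (lowest inertia) is kept
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

print("total within-cluster variation (inertia):", kmeans.inertia_)
print("first 10 cluster assignments:", kmeans.labels_[:10])
```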
Multi-dimensional Space
In the same fashion, initial centroids are selected in the multi-dimensional space
The Euclidean distance of each data point from the three centroids is then measured to decide the cluster assignment

$$d = \sqrt{x^2 + y^2}$$

where $x$ and $y$ are the differences between the data point and a centroid along each axis; each data point is assigned to the closest cluster.
The centre of each cluster is then calculated and all data points will be re-clustered using the new centres
Repeat the process until the centroid values converge or the maximum iteration limit has been reached
Recalculating the centroids effectively produces a locally optimal clustering, but it may not be globally optimal
Hyperparameter Tuning
Unsupervised learning has no ground truth to evaluate model performance
▪ Understanding the performance of unsupervised learning methods is inherently much more difficult than supervised learning methods because there is no ground truth available
▪ Moreover, K-means explicitly requires the number of clusters as a hyperparameter
▪ K-means performance can be evaluated based on different K clusters
▪ We can also use the elbow method or the silhouette coefficient to find the optimal number of clusters K for the unsupervised learning model
It is genuinely ambiguous how many clusters there are in a dataset and there is no way to decide this automatically
4 clusters? OR 2 clusters?
Sometimes the number of clusters to use is imposed by external constraints (e.g., later or downstream processing)
(figure: t-shirt sizes — XS, S, M, L, XL — clustered by height and weight)
Elbow Method
◦ The elbow method is used to select the optimal number of clusters by visual inspection
◦ Inertia is used as the cost function: $\sum_{i=1}^{N} \lVert x_i - \mu_i \rVert^2$, where $\mu_i$ is the centroid closest to the data point $x_i$ and $N$ is the number of data points in the dataset
◦ The elbow method requires drawing a line plot of the cost function against the number of clusters
◦ The elbow point is the point of the plot after which the plot starts to flatten out (a code sketch follows below)
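A minimal sketch of the elbow method with scikit-learn and matplotlib, assuming synthetic data; the range of K values is illustrative.

```python
# Minimal sketch of the elbow method: plot inertia against the number of clusters
# and look for the point where the curve starts to flatten out.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=7)

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=7).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("inertia (within-cluster sum of squares)")
plt.title("Elbow method")
plt.show()
```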
Ideally, the average intra-cluster distance should be much less than the inter-cluster distance to the nearest neighbour cluster
▪ $a(x_i)$ = average intra-cluster distance for $x_i$: the average distance between $x_i$ and the other points in the same cluster
▪ $b(x_i)$ = average inter-cluster distance for $x_i$ to its nearest neighbour cluster: the average distance between $x_i$ and the points in the nearest other cluster
▪ Objectives
▪ Points in the same cluster should be as similar as possible
▪ Points in different clusters should be as dissimilar as possible
▪ When $a(x_i) > b(x_i)$, it is likely that the data point $x_i$ has been misclassified
Silhouette Coefficient
◦ Evaluates the quality of clustering
◦ The coefficient ranges from -1 to 1
◦ Ideally, $a(x_i) = 0$ and $b(x_i) = \infty$, giving $S(x_i) = 1$ and suggesting dense, well-separated clusters
◦ In the worst-case scenario, $a(x_i) = \infty$ and $b(x_i) = 0$, giving $S(x_i) = -1$ and suggesting wrong clustering
◦ S(𝑥𝑖) near 0 suggests overlapping clusters with data points very close to the cluster boundary of the nearest neighbor cluster
◦ The coefficient is calculated for each data point in the dataset
◦ Plotting the data points against their silhouette coefficients provides the silhouette plot
$$S(x_i) = \frac{b(x_i) - a(x_i)}{\max\big(a(x_i), b(x_i)\big)}$$
Silhouette score is calculated for each data point in the dataset – that is for all data points in all clusters
$$S(x_i) = \frac{b(x_i) - a(x_i)}{\max\big(a(x_i), b(x_i)\big)}$$
1 means the data point is far away from the neighboring clusters meaning minimal confusion and good clustering (positive means the data point is closer to the assigned cluster than it is to neighboring clusters)
0 means the data point lies on the boundary between the assigned cluster and the next closest cluster
-1 means the data point is assigned to an incorrect cluster and the data point in fact likely belongs to a neighboring cluster
𝑎 = Mean Intra−cluster Distance
Mean distance between a data point and all other data points in the same cluster
𝑏 = Mean Nearest−cluster Distance
Mean distance between a data point and all other data points of the nearest neighbour cluster
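A minimal sketch of computing the silhouette coefficient with scikit-learn, assuming synthetic data; silhouette_score gives the dataset average while silhouette_samples gives the per-point values used in a silhouette plot.

```python
# Minimal sketch: silhouette coefficients for a K-Means clustering.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=300, centers=4, random_state=3)
labels = KMeans(n_clusters=4, n_init=10, random_state=3).fit_predict(X)

print("average silhouette coefficient:", silhouette_score(X, labels))

# Per-point coefficients; values below 0 indicate likely misclassified points
per_point = silhouette_samples(X, labels)
print("points with negative coefficients:", (per_point < 0).sum())
```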
The Silhouette plot shows two clusters that are dense and well-separated
The Silhouette plot shows three clusters that are dense except for one cluster
The Silhouette plot shows four clusters that are also dense and well-separated
The Silhouette plot shows five clusters that are not so dense
The Silhouette plot shows six clusters that are not dense
Misclassified data points are shown on the left of the Silhouette Plot
SILHOUETTE PLOT (average silhouette coefficient: 0.16)
◦ Cluster 1: 16 data points, average silhouette coefficient 0.03
◦ Cluster 2: 11 data points, average silhouette coefficient 0.26
◦ Cluster 3: 20 data points, average silhouette coefficient 0.21
Outlier points are those with the Silhouette coefficient value less than 0
The optimal K is chosen based on the number of outliers and the average Silhouette coefficients
◦ 2 clusters: average silhouette coefficient 0.705
◦ 3 clusters: average silhouette coefficient 0.588
◦ 4 clusters: average silhouette coefficient 0.651
◦ 5 clusters: average silhouette coefficient 0.564
◦ 6 clusters: average silhouette coefficient 0.450
A sketch of this comparison in code follows below.
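A minimal sketch of the comparison, assuming scikit-learn and synthetic data; it loops over candidate values of K and reports the average silhouette coefficient for each.

```python
# Minimal sketch: choosing K by comparing average silhouette coefficients.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=2, random_state=11)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=11).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"{k} clusters: average silhouette coefficient = {scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("best K by average silhouette coefficient:", best_k)
```

In practice the number of points with negative coefficients would also be inspected, as on the slide above.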
K-Means in a Nutshell
1. Feature Data Types — Numerical.
2. Target Data Types — Categorical.
3. Key Principles — Likeness is described as a function of Euclidean distance. The goal is to find K centroids (and therefore clusters) that minimize the within-cluster Euclidean distances. Will group together all data points in the space until no points are left.
4. Hyperparameters — Number of clusters (K).
5. Data Assumptions — The distance metric assumes clusters are spheres, features are uncorrelated, and the data is normalized.
6. Performance — Fast. Very scalable due to linear time and memory complexity. Even cluster sizes.
7. Accuracy — Will always converge, but only to a local optimum. May not produce meaningful clusters in a sparse feature space with outliers. Intuition fails in high dimensions, so dimensionality reduction is advised as part of the pre-processing.
8. Explainability
K-Modes Clustering
K-Modes Clustering
K-Modes clustering extends K-Means clustering by replacing cluster means with cluster modes, which are updated based on frequency. It is widely used for grouping categorical data. It defines clusters based on the number of matching categories between data points, using a simple similarity measure.
The algorithm is essentially the same as K-Means except the cost function is based on equality over categories
$$P(W, Q) = \sum_{l=1}^{K} \sum_{i=1}^{N} w_{il} \cdot d_{sim}(x_i, q_l)$$

$P$ is the cost function for the clustering
$W$ is an $N \times K$ matrix of 0s and 1s representing cluster membership
$N$ is the number of data points in the dataset
$K$ is the number of clusters
$Q$ is the set of cluster centroid vectors
$X$ is the dataset to be clustered

$$d_{sim}(x_i, q_l) = \sum_{j=1}^{m} \delta(x_{ij}, q_{lj})$$

$d_{sim}$ measures the similarity between 2 vectors and $m$ is the number of features in a vector

$$\delta(x_{ij}, q_{lj}) = \begin{cases} 1 & \text{if } x_{ij} = q_{lj} \\ 0 & \text{if } x_{ij} \neq q_{lj} \end{cases}$$

$\delta$ measures the similarity between 2 features
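A minimal K-Modes sketch, assuming the third-party kmodes package is installed (pip install kmodes); the column names and categories below are made up for illustration.

```python
# Minimal sketch, assuming the third-party `kmodes` package is installed
# (pip install kmodes); the categorical data below is made up for illustration.
import pandas as pd
from kmodes.kmodes import KModes

data = pd.DataFrame({
    "sector": ["tech", "bank", "bank", "tech", "energy", "energy"],
    "region": ["asia", "asia", "europe", "europe", "asia", "europe"],
    "rating": ["A", "B", "B", "A", "C", "C"],
})

km = KModes(n_clusters=2, init="Huang", n_init=5, random_state=0)
labels = km.fit_predict(data)

print("cluster assignments:", labels)
print("cluster modes (one row per cluster):")
print(km.cluster_centroids_)
```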
Density-based Clustering
Density-based Spatial Clustering of Applications with Noise (DBSCAN)
Shortcomings of Simple Clustering
▪ Clustering algorithms discussed so far are suitable for finding spherical-shaped clusters or convex clusters
▪ In other words, they work well only for compact and well-separated clusters
▪ Moreover, they are also severely affected by the presence of noise and outliers in the dataset
▪ Unfortunately, real life data may exhibit arbitrary shapes and properties (including multiple shapes)
K-Means runs into problems with clusters of different sizes
GROUND TRUTH THREE CLUSTERS FROM K-MEANS
K-Means runs into problems with clusters of different densities
GROUND TRUTH THREE CLUSTERS FROM K-MEANS
K-Means runs into problems with clusters of non-spherical or non-convex shapes
GROUND TRUTH TWO CLUSTERS FROM K-MEANS
The shortcomings of K-Means with cluster sizes can be dealt with by first using more clusters and then putting them together
GROUND TRUTH CLUSTERS FROM K-MEANS
The shortcomings of K-Means with cluster densities can be dealt with by first using more clusters and then putting them together
GROUND TRUTH CLUSTERS FROM K-MEANS
The shortcomings of K-Means with cluster shapes can be dealt with by first using more clusters and then putting them together
GROUND TRUTH CLUSTERS FROM K-MEANS
DBSCAN
DBSCAN is a density-based clustering algorithm. Given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors) and marks as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away).
DBSCAN provides a more flexible and direct solution to address the shape and size issues with K-Means
ORIGINAL DATA vs CLUSTERS & NOISE POINTS FROM DBSCAN
DBSCAN
◦ From each unvisited data point, measure the distance to every other point in the dataset
◦ All points that fall within the radius of neighborhood will be considered as neighbors
◦ If the number of neighbors reaches the minimum neighbor point threshold, the points are grouped together as a new cluster
◦ Data points not reachable from any cluster will be considered as noise
◦ Repeat the process until all data points are assigned to clusters or marked as noise (a code sketch follows below)
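A minimal DBSCAN sketch using scikit-learn on a synthetic two half-moons dataset; the values of eps and min_samples are illustrative and would normally be tuned.

```python
# Minimal sketch: DBSCAN on a non-convex (two half-moons) dataset.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighbourhood radius, min_samples the minimum neighbour threshold
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                       # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("noise points:", np.sum(labels == -1))
```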
Unlike other clustering algorithms, not all data points are classified – unclassified data points are considered noise
(figure: an ε-neighborhood with MinPts = 6, illustrating core, border, and noise points)
▪ Core Point
▪ At least a minimum number of data points (MinPts)
within its radius of neighborhood (𝜀-neighborhood)
▪ All core points within the 𝜀-neighborhood of a core point
are grouped as a cluster
▪ Border Point
▪ Lies within the 𝜀-neighborhood of a core point but not a core point itself due to not having enough MinPts in its 𝜀-neighborhood
▪ Will be grouped in the cluster of its nearest core point
▪ Noise Point
▪ Not reachable from any cluster
▪ A noise point has fewer than MinPts data points in its neighborhood and is not associated with any core point
▪ Excluded from clustering
Hyperparameters can be tricky to tune
▪ Radius of Neighborhood (𝜀 / epsilon)
▪ The maximum distance between a point and its neighbours for them to fall within the same neighbourhood (a k-distance plot, sketched after this list, is a common heuristic for choosing it)
▪ The greater the value, the fewer clusters are found because clusters eventually merge into other clusters
▪ Minimum Neighbor Points
▪ Required to produce a new cluster
▪ A larger value assures a more robust cluster but may exclude some smaller clusters as it attempts to merge them in a larger one
▪ Increases with the size of the dataset
▪ A smaller value may extract many clusters with possible inclusion of noise
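A minimal sketch of one common heuristic for choosing epsilon, assuming scikit-learn, matplotlib, and the same synthetic two half-moons data as above: sort every point's distance to its k-th nearest neighbour and look for the knee of the curve.

```python
# Minimal sketch of a k-distance plot for choosing epsilon.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

k = 5  # match the intended MinPts value
# k + 1 neighbours because the nearest neighbour of each point is the point itself
neighbors = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = neighbors.kneighbors(X)

k_distances = np.sort(distances[:, -1])   # distance to the k-th nearest other point
plt.plot(k_distances)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {k}-th nearest neighbour")
plt.title("k-distance plot (the knee suggests a value for epsilon)")
plt.show()
```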
DBSCAN moves through all data points to form clusters based on neighbourhood and density
DBSCAN in a Nutshell
1. Feature Data Types — Numerical. Should be scaled.
2. Target Data Types — Categorical.
3. Key Principles — Expands the distance metric with the notion of density; clusters are therefore high-density areas. Cluster membership is based on the neighbourhood radius and the number of data points in the neighbourhood. Identifies core, border, and noise points. Noise points are excluded from clustering, so the algorithm is less prone to the distortion caused by outliers.
4. Hyperparameters — K is not required. Neighbourhood radius (epsilon). Minimum data points per neighbourhood.
5. Data Assumptions — Will find clusters of arbitrary shapes and sizes, including highly complex data.
6. Performance — Will often immensely outperform K-Means (in practice, this often happens with highly intertwined, yet still discrete, data, such as a feature space containing two half-moons). Parameter tuning can be challenging. Finds non-convex and non-linearly separable clusters.
7. Accuracy — Difficulties with clusters of varying density and high-dimensional data.
8. Explainability
References
References
▪ “The Unsupervised Learning Workshop”, Aaron Jones, Christopher Kruger, Benjamin Johnston, Packt Publishing, July 2020
▪ “Hands-On Unsupervised Learning Using Python”, Ankur A. Patel, O’Reilly Media, Inc., March 2019
References
▪ “K-Means Clustering using sklearn and Python”, Dhiraj K, October 2019 (https://heartbeat.fritz.ai/k-means-clustering-using-sklearn-and-python-4a054d67b187)
▪ “K-Means Clustering Explained with Python Example”, Ajitesh Kumar, September 2020 (https://vitalflux.com/k-means-clustering-explained-with-python-example/)
▪ “K-Means Clustering Elbow Method & SSE Plot – Python”, Ajitesh Kumar, September 2020 (https://vitalflux.com/k-means-elbow-point-method-sse-inertia-plot-python/)
▪ “K-Means Silhouette Score Explained with Python Example”, Ajitesh Kumar, September 2020 (https://vitalflux.com/kmeans-silhouette-score-explained-with-python-example/)
▪ “K-Modes Clustering”, Shailja Jaiswal, July 2020 (https://medium.com/@shailja.nitp2013/k-modesclustering-ef6d9ef06449)
▪ “How to Create an Unsupervised Learning Model with DBSCAN”, Anasse Bari, Mohamed Chaouchi & Tommy Jung (https://www.dummies.com/programming/big-data/data-science/how-to-create-an-unsupervised-learning-model-with-dbscan/)
▪ “Scikit-Learn – Clustering: Density-Based Clustering of Applications with Noise [DBSCAN]”, June 2020 (https://coderzcolumn.com/tutorials/machine-learning/scikit-learn-sklearn-clustering-dbscan)
▪ “A Step by Step approach to Solve DBSCAN Algorithms by tuning its hyper parameters”, Mohanty Sandip, March 2020 (https://medium.com/@mohantysandip/a-step-by-step-approach-to-solve-dbscan-algorithms-by-tuning-its-hyper-parameters-93e693a91289)
References
▪ “DBSCAN Python Example: The Optimal Value For Epsilon (EPS)”, Cory Maklin, June 2019 (https://towardsdatascience.com/machine-learning-clustering-dbscan-determine-the-optimal-value-for-epsilon-eps-python-example-3100091cfbc)
▪ “DBSCAN: Density-Based Clustering Essentials” (https://www.datanovia.com/en/lessons/dbscan-density-based-clustering-essentials/)
THANK YOU