Re-Cap from Week 3 – Data Exploration
Data Characteristics (Data Types, Cardinality, Data Distribution, Outliers, Missing Data)
Chart Types (Bar Chart, Histogram, Line Chart, Scatter Plot, Bubble Chart, Geo Map, Box Plot, Tree Map, Heat Map, Pie Chart)
Data Visualization (Benefits, SAS VA demo)
Key Principles of Data Visualization
Keep your audience in mind
Keep it simple – less is better
Keep the context – tell a story
Week 4
Lecture Outcomes
The learning outcomes from this week’s lecture are:
Discuss applications of clustering
Describe different types of clustering algorithms
Perform clustering in SAS VA
Use cluster matrices and parallel coordinates plots to refine the number of clusters
Understand the characteristics of different clusters
Describe the role of clustering in predictive analytic models
Clustering and Segmentation
Clustering
Clustering is an unsupervised learning process that groups together people or objects with similar characteristics
Clustering is exploratory and useful when trying to understand a new dataset
Allows us to discover if there are any naturally occurring patterns within the data
Clustering VS Segmentation
Clustering and segmentation are closely related but differ in emphasis
Clustering is the technical process for unsupervised groupings, while segmentation is the application of creating segments of customers or markets
Thus, clustering can be used to segment consumer groups
https://medium.com/analytics-vidhya/customer-segmentation-for-differentiated-targeting-in-marketing-using-clustering-analysis-3ed0b883c18b
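As a minimal sketch of this idea (the customer attributes and values below are hypothetical, and scikit-learn is assumed to be available), clustering can group customers into segments without any labels:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer attributes: [annual spend ($), visits per month]
customers = np.array([
    [200, 1], [250, 2], [220, 1],       # low-spend, infrequent
    [900, 5], [1100, 6],                # mid-spend
    [4000, 12], [3800, 10], [4200, 11]  # high-spend, frequent
], dtype=float)

# Standardise so both attributes contribute comparably to the distance measure
X = StandardScaler().fit_transform(customers)

# Unsupervised grouping into 3 clusters, then used as marketing segments
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(segments)  # cluster label (segment) for each customer
```

Each resulting label can then be treated as a segment and profiled against the original attributes.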
Applications of Clustering
In predictive modelling, firms can utilise segments by creating separate predictive models for each segment
If the effects of influencing factors are found to differ between heterogeneous groups, then separate models based on segments can be created to better capture the underlying behaviours driving consumer decision-making in each group
A prime example is Netflix’s recommendations algorithm
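A simplified sketch of this segment-then-model idea (the features, target, and group structure below are made up for illustration; scikit-learn is assumed):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: two customer features and a spending target whose
# relationship to the features differs between two underlying groups
X = rng.normal(size=(200, 2))
group = (X[:, 0] > 0).astype(int)
y = np.where(group == 1, 3.0 * X[:, 1] + 5, -2.0 * X[:, 1] + 1)
y = y + rng.normal(scale=0.1, size=200)

# Step 1: discover segments without using the target (unsupervised)
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: fit a separate predictive model for each segment
models = {}
for s in np.unique(segments):
    mask = segments == s
    models[s] = LinearRegression().fit(X[mask], y[mask])
    print(f"segment {s}: coefficients {models[s].coef_.round(2)}")
```

If the fitted coefficients differ noticeably between segments, segment-specific models are likely to outperform a single pooled model.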
Algorithms
Clustering Algorithm
FIGURE 5.1, Vidgen et al. 2019
Hierarchical Clustering
FIGURE 5.2, Vidgen et al. 2019
Hierarchical Clustering
Key takeaways:
At the start, each of the N objects is its own cluster. The two items closest to each other are joined, leaving N-1 clusters; among these N-1 clusters, the next two closest are joined, and so on until a single cluster contains all N objects.
Along the way, the algorithm produces every possible number of clusters, from N single-object clusters down to one cluster containing everything.
The dendrogram can be used to determine the best number of clusters. A common heuristic is to cut where the vertical distance between successive merges is largest, i.e. at the longest vertical line not crossed by any of the horizontal lines joining clusters (see the sketch below).
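A small illustrative sketch of agglomerative hierarchical clustering using SciPy (the data points are made up); the recorded merge distances are what a dendrogram visualises:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical 2-D observations forming two loose groups
points = np.array([[1, 2], [1.5, 1.8], [1.2, 2.1],
                   [8, 8], [8.5, 7.8], [7.9, 8.3]])

# Agglomerative clustering: repeatedly merge the two closest clusters.
# Z records each merge and the distance at which it happened.
Z = linkage(points, method='ward')

# Cutting where the gap between successive merge distances is largest
# suggests two clusters for this data
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)

# dendrogram(Z) would draw the full merge tree for visual inspection
```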
Hierarchical Clustering
(Image source: NSS.COM)
K-means Clustering
FIGURE 5.3, Vidgen et al. 2019
K-means Clustering
Key takeaways:
Assume the k centres are initially scattered at random across the variable space. Each observation is allocated to the cluster of the centre it is closest to, and each centre is then recomputed as the midpoint (mean) of the observations in its cluster.
Because the centres have moved, some observations may now be closer to a different centroid, so cluster assignments are updated after the centroids are recalculated.
Any additions to or removals from a cluster change its centre again. This assign-and-update cycle repeats until recomputing the centroids no longer changes the cluster assignments; at that point the algorithm terminates and the clusters are final (illustrated in the sketch below).
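The assign-then-update loop described above can be sketched directly in NumPy (a simplified illustration only, not the implementation used by SAS VA):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen observations as the initial centres
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each observation joins its closest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centre moves to the mean of its cluster
        # (an empty cluster keeps its previous centre)
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer move, i.e. clusters are stable
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

X = np.array([[1, 1], [1.2, 0.8], [5, 5], [5.1, 4.9], [9, 1], [8.8, 1.2]])
labels, centres = kmeans(X, k=3)
print(labels, centres, sep="\n")
```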
Distance Measures
Methods to measure closeness of observations:
Euclidean distance
Squared Euclidean distance
In a 2-dimensional space, the distance between the data points $(x_1, y_1)$ and $(x_2, y_2)$ is: $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
In a 3-dimensional space, the distance between the data points $(x_1, y_1, z_1)$ and $(x_2, y_2, z_2)$ is: $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$
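For concreteness, both distance measures can be written as small Python functions (a generic sketch for points of any dimension):

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points of any dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def squared_euclidean(p, q):
    """Euclidean distance without the square root; it penalises large
    differences more heavily and is cheaper to compute."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

print(euclidean((1, 2), (4, 6)))          # 5.0
print(squared_euclidean((1, 2), (4, 6)))  # 25
```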
SAS VA Demo
Model
Evaluation
Cluster Matrices
Parallel Coordinates Plot (5 Clusters)
Parallel Coordinates Plot (3 Clusters)
Parallel Coordinates Plot (4 Bins)
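Outside SAS VA, a comparable parallel coordinates view can be sketched with pandas and matplotlib (the column names, values, and cluster labels below are hypothetical):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(0)

# Hypothetical clustered data: three variables plus a cluster label column.
# In practice, variables are usually standardised first so scales are comparable.
df = pd.DataFrame({
    "income": rng.normal(50, 10, 90),
    "age": rng.normal(40, 8, 90),
    "spend": rng.normal(200, 50, 90),
    "cluster": np.repeat(["c1", "c2", "c3"], 30),
})

# One line per observation; each vertical axis is a variable, coloured by cluster
parallel_coordinates(df, class_column="cluster", colormap="viridis")
plt.show()
```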
Summary
Summary of Clustering
Clustering is a representative technique of descriptive analytics: it enables natural exploration of data by grouping similar objects or people into segments, revealing patterns and similarities within groups and differences across them
These groupings can be used by firms to customize content, advertisements, services, products, and other offerings, to create higher value for customers
Clustering can be used to enhance predictive models through data reduction and removing outlier data
Selecting the number of clusters and interpreting the underlying meaning of each group is as much an art as it is a science