IE 332 in-class session
Nov 3rd: A3 Unsupervised Learning

Unsupervised learning: Cluster Analysis

● Use Fall 2020 as an example
● Steps of a Cluster Analysis:
○ Step 1: Select variables for clustering.
○ Step 2: Scale the data (sometimes optional, because many clustering packages now do the scaling for you).
○ Step 3: Define the similarity (distance) measure. (How to calculate the distance?)
○ Step 4: Decide the clustering method (k-means or hierarchical?).
○ Step 5: Decide the number of clusters.
○ Step 6: Evaluate the clustering result.
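
Steps 2 to 4 can be sketched in base R as follows. This is a minimal illustration, not the course solution: the data frame `cust` and its values are made up for the example, and Euclidean distance is assumed as the similarity measure.

```r
# Hypothetical data: x and y are on very different scales
cust <- data.frame(
  x = c(1, 2, 9, 10),
  y = c(100, 300, 200, 400)
)

# Step 2: scale each column to mean 0 and standard deviation 1,
# so that y does not dominate the distance calculation
scaled <- scale(cust)

# Step 3: pairwise Euclidean distances on the scaled data
d <- dist(scaled, method = "euclidean")

# Step 4: two candidate methods, both asked for 2 clusters
km <- kmeans(scaled, centers = 2, nstart = 25)    # k-means
hc <- cutree(hclust(d, method = "complete"), k = 2)  # hierarchical

print(round(scaled, 3))
print(d)
print(km$cluster)
print(hc)
```

Without the call to `scale()`, the distances here would be driven almost entirely by the y column, which is why Step 2 comes before Step 3.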

Find the optimal number of clusters: WSS
● WSS: within-cluster sum of squares
● https://uc-r.github.io/kmeans_clustering
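
As a rough sketch of what WSS measures, the block below recomputes it by hand (the sum of squared distances from each point to its assigned cluster center) and compares it with `tot.withinss` from `kmeans`. The data are synthetic and the name `cust` is reused only for illustration. The block then tabulates WSS for k = 1 to 10, the numbers one would inspect for an elbow.

```r
set.seed(332)

# Hypothetical data: two well-separated groups of 30 points each
cust <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5, sd = 0.5), ncol = 2)
)

# WSS by hand for k = 2: squared distance of every point to its own center
km2 <- kmeans(cust, centers = 2, nstart = 25)
manual_wss <- sum((cust - km2$centers[km2$cluster, ])^2)

# WSS for a range of k values
ks <- 1:10
wss <- sapply(ks, function(k) {
  kmeans(cust, centers = k, nstart = 25)$tot.withinss
})

print(c(manual = manual_wss, kmeans = km2$tot.withinss))
print(data.frame(k = ks, wss = round(wss, 2)))
```

WSS keeps shrinking as k grows, so the smallest WSS alone cannot pick k; the usual heuristic is to look for the k after which the decrease levels off.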

kmeans function in R
● Observe the parameters of the kmeans function: x, centers, nstart
● km <- kmeans(x = cust, centers = k, nstart = 25)
● nstart: attempts multiple initial configurations and reports on the best one
● Observe the results from the kmeans function: cluster, centers, totss, tot.withinss, etc.
● wss <- km$tot.withinss

Calculate WSS of kmeans in R
● wss <- km$tot.withinss

fviz_nbclust function from factoextra package
● Why not choose k = 10?

Find the optimal number of clusters: Silhouette (pronounced si-lu-wet)
● Silhouette (S)
● Mean intra-cluster distance (I): mean distance between the observation and all other data points in the same cluster
● Mean nearest-cluster distance (N): mean distance between the observation and all data points of the next nearest cluster
● S = (N - I) / max(N, I) (calculate this for each data sample)
● https://dzone.com/articles/kmeans-silhouette-score-explained-with-python-exam
● The higher, the better

Calculate Silhouette of kmeans in R
● library(cluster): silhouette(km$cluster, dist(df))
● library(factoextra)

Find the optimal number of clusters: Problem based
● WSS says 2, Silhouette says 5: which one to pick?
● What is the objective of our problem? Minimize the distance between each customer and its assigned facility, plus the cost of opening this facility.
● If we ignore the fixed_cost, what is function findObj doing? Does it look familiar? (Hint: we talked about this a few slides ago)
● Answer the question in the WSS slide (Why not choose k = 10?)
○ If k is too large, it suffers from an overfitting problem.
○ It is also subject to practical considerations. (The cost to open each facility is high.)
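
To make the comparison between the silhouette criterion and a problem-based criterion concrete, here is a minimal base-R sketch. Everything in it is an assumption made for illustration: the synthetic customers, the helper names `mean_silhouette` and `total_cost`, the fixed cost of 20 per facility, and the use of plain Euclidean distance to the cluster center. It is not the course's findObj function.

```r
set.seed(332)

# Hypothetical customers: three well-separated groups of 20 points
cust <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2),
  cbind(rnorm(20, mean = 0, sd = 0.5), rnorm(20, mean = 5, sd = 0.5))
)

# Average of S = (N - I) / max(N, I) over all observations (needs k >= 2)
mean_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))
  ids <- sort(unique(cluster))
  s <- numeric(nrow(x))
  for (i in seq_len(nrow(x))) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) { s[i] <- 0; next }  # convention for singleton clusters
    # I: mean distance to the other points in the same cluster (d[i, i] is 0)
    intra <- sum(d[i, own]) / (sum(own) - 1)
    # N: smallest mean distance to the points of any other cluster
    others <- ids[ids != cluster[i]]
    nearest <- min(sapply(others, function(g) mean(d[i, cluster == g])))
    s[i] <- (nearest - intra) / max(nearest, intra)
  }
  mean(s)
}

# Assumed objective: total customer-to-center distance plus a fixed cost per center
total_cost <- function(x, km, fixed_cost) {
  assigned <- km$centers[km$cluster, , drop = FALSE]
  sum(sqrt(rowSums((x - assigned)^2))) + fixed_cost * nrow(km$centers)
}

ks <- 2:6
results <- t(sapply(ks, function(k) {
  km <- kmeans(cust, centers = k, nstart = 25)
  c(k = k,
    silhouette = mean_silhouette(cust, km$cluster),
    cost = total_cost(cust, km, fixed_cost = 20))
}))
print(round(results, 3))
```

The silhouette column rewards well-separated clusters, while the cost column penalizes every extra facility. When the two disagree, the cost column is the one tied to the stated objective.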
Select variables for clustering
● Examine your best result from part (a) and identify any significant issues with the result with respect to how well it actually solves the problem.
● Instead of finding clusters based on 3 dimensions, we have to find them based on x-y coordinates only.
● How to deal with the priority column? "Moreover, you want to locate the facility closer to those customers who have a higher priority"

Adding dummy customers
● Priorities can be included by adding dummy customers to the existing data set. For example, if $p_j$ is 3, then we will create 2 more dummy customers for this customer.
● What if the priority is not an integer?
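
The dummy-customer idea can be sketched as below for integer priorities only. The data frame, its column names (`x`, `y`, `priority`) and its values are hypothetical. A customer with priority 3 appears three times in the expanded data (the original plus 2 dummies), and clustering then uses the x-y coordinates only.

```r
# Hypothetical customers; the third one has priority 3
cust <- data.frame(
  x = c(0, 1, 10),
  y = c(0, 1, 10),
  priority = c(1, 1, 3)
)

# Repeat each row 'priority' times and keep only the coordinates
expanded <- cust[rep(seq_len(nrow(cust)), times = cust$priority), c("x", "y")]
rownames(expanded) <- NULL

# With a single facility, the center is the mean of the rows it is given
km_plain    <- kmeans(cust[, c("x", "y")], centers = 1)
km_weighted <- kmeans(expanded, centers = 1)

print(expanded)
print(km_plain$centers)     # ignores priority
print(km_weighted$centers)  # pulled toward the high-priority customer
```

Because `rep()` needs whole-number counts, this sketch does not cover non-integer priorities; that case needs a different treatment.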