Foundations of Data Analytics and Machine Learning
Summer 2022
Week 3: Foundations of Learning
• k-Nearest Neighbors
• Decision Trees
• Clustering
10 min quick review!
Data pre-processing
ØWhy is splitting the dataset needed?
ØWhy is standardization needed?
ØErrors (what could go wrong):
ØHaving multiple train-test splits in the code.
ØStandardizing the data before splitting.
ØFitting the standard scaler on the test/validation set.
Validation Set
ØWe still want to track the loss/accuracy on a data set not used for training
ØIdea: set aside a separate data set, called the validation set
ØTrack validation accuracy in the training curve
ØMake decisions about model architecture using the validation set
Standardization is the key
x: (Area, # bedrooms, # washrooms)
x: (2500, 5, 5)
x: (2400, 2, 3)
x: (2350, 5, 5)
Euclidean distance
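To make this concrete, here is a minimal sketch (using the values above; the feature names and the scikit-learn usage are illustrative assumptions) showing how the raw Area column dominates the Euclidean distance until the features are standardized:

import numpy as np
from sklearn.preprocessing import StandardScaler

# (Area, # bedrooms, # washrooms) -- values from the slide above
X = np.array([[2500, 5, 5],
              [2400, 2, 3],
              [2350, 5, 5]], dtype=float)

# Raw Euclidean distances: Area dominates, so the first house looks closer to the
# second one than to the third, even though the third matches its room counts exactly
print(np.linalg.norm(X[0] - X[1]))   # ~100.1
print(np.linalg.norm(X[0] - X[2]))   # 150.0

# After standardization every feature contributes on a comparable scale,
# and the first house is now closest to the third
X_std = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_std[0] - X_std[1]))
print(np.linalg.norm(X_std[0] - X_std[2]))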
k-Nearest Neighbor Classifier
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
Today’s focus is on Foundations of Learning
ØSupervised Learning:
Øk-Nearest Neighbors
ØDecision Trees
ØClustering Strategies (unsupervised):
Øk-Means
ØAgglomerative Clustering
k-Nearest-Neighbors (Supervised Learning)
k-Nearest Neighbor Classification
ØDistance-based supervised learning approach
ØInstance-based learning or “lazy” learning (just stores the training data)
ØFlexible and makes no assumptions on data distribution
ØTypically used for classification, but can also be extended to regression
k-Nearest Neighbor Classification
ØWorks on multidimensional data
ØFor visualization purposes we will use a 2-dimensional example
Ø+ represents unknown test samples that we want to classify
Simple Algorithm
ØWe can assume that similar samples will be located close together
ØSimple Algorithm
Calculate distance → Obtain the nearest neighbor → Determine labels
ØFor a single nearest Neighbor
How do we measure distance?
(D: the dimension of x or y)
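The metric assumed in the examples here is the Euclidean distance; a reconstruction of the usual formula is:

d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{j=1}^{D} (x_j - y_j)^2}

Other metrics (e.g., Manhattan distance) can be substituted; scikit-learn's KNeighborsClassifier exposes this through its metric parameter.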
Decision Boundary
ØCan generate arbitrary test points on the plane and apply kNN
ØThe boundary between regions of input space assigned to different categories.
Voronoi Diagrams (k=1)
Euclidean distance
Source: Wikipedia
Spherical Voronoi – Source:
Selection of k
ØQ: What happens if we let k be very small?
ØQ: What happens if we let k be very large?
Tradeoffs in choosing k?
ØSmall k:
ØGood at capturing fine-grained patterns
ØMay overfit, i.e. be sensitive to random noise
Excellent on training data, not so good on new data: too complex
ØLarge k:
ØMakes stable predictions by averaging over many examples
ØMay underfit, i.e. fail to capture important regularities
Not so good on training data or on new data: too simple
What is the best k?
ØSelect the k based on the best performance on the validation set, report the results on the test set
ØGenerally, the performance on the validation set will be better than on the test set
ØQ: How do you think it will perform on the training set?
Normalization
ØNearest Neighbors can be sensitive to the ranges of different features
ØOften, the units are arbitrary
ØSimple fix: normalize each dimension to be zero mean and unit variance, i.e., compute the mean μ_j and standard deviation σ_j of each dimension j, and take x̃_j = (x_j − μ_j) / σ_j
ØCaution: depending on the problem, the scale might be important!
KNN Code Example (Google Colab)
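A minimal sketch of the kind of pipeline such a notebook might contain (the iris dataset, split sizes, and candidate k values are assumptions, not the actual Colab contents). Note that the scaler is fit on the training set only and k is chosen on the validation set:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# One train/validation/test split (60/20/20), created once
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Fit the scaler on the training set only, then apply it to all three sets
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = scaler.transform(X_train), scaler.transform(X_val), scaler.transform(X_test)

# Choose k by validation accuracy, then report the test accuracy once
best_k, best_acc = 1, 0.0
for k in (1, 3, 5, 7, 9, 11):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc

final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(f"best k={best_k}, val acc={best_acc:.3f}, test acc={final.score(X_test, y_test):.3f}")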
Decision Trees (Supervised Learning)
Decision Trees
ØA rule-based supervised learning algorithm
ØPowerful algorithm capable of fitting complex datasets
ØCan be applied to classification (discrete) and regression (continuous) tasks.
ØHighly interpretable!
ØA fundamental component of Random Forests, which are among the most widely used Machine Learning algorithms today
Lemon Vs. Orange!
Flowchart-like structure!
Test example
Constructing a Decision Tree
ØDecision trees make predictions by recursively splitting on different attributes according to a tree structure
[Figure: training samples plotted by width (cm), partitioned by the recursive splits]
What if the attributes are discrete?
Attributes: Features (inputs)! Discrete or Continuous
Output is Discrete (Classification)
e.g., “Wait for table” vs. “Go somewhere else”
Output is Continuous (Regression)
Source: GDCoder
ØInstead of predicting a class at each leaf node, predict a value based on the average of all instances at the leaf node.
Summary: Discrete vs Continuous Output
ØClassification Tree:
Ødiscrete output
Øoutput node (leaf) typically set to the most common value
ØRegression Tree:
Øcontinuous output
Øoutput node (leaf) value typically set to the mean value in data
Generalization
ØDecision trees can fit any function arbitrarily closely
ØCould potentially create a leaf for each example in the training dataset
ØNot likely to generalize to test data!
ØNeed some way to prune the tree!
Managing Overfitting
ØAdd parameters to reduce potential for overfitting
ØParameters include:
Ødepth of tree
Øminimum number of samples required to split a node or to form a leaf
Gini Impurity
ØFormally: the likelihood of incorrectly classifying a new instance of a random variable, if that new instance were randomly classified according to the distribution of class labels in the data set.
ØIntuitively: Chance of making a mistake (lower is better)
G_i = 1 − Σ_{k=1}^{J} p_{i,k}², where J is the number of classes and p_{i,k} is the ratio of class-k instances among the samples at the ith node.
Model Interpretation
1 − ((50/150)² + (50/150)² + (50/150)²) = 0.667
1 − ((50/50)² + (0/50)² + (0/50)²) = 0
1 − ((0/54)² + (49/54)² + (5/54)²) ≈ 0.168
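The three values above can be reproduced with a short helper; the class counts 50/50/50, 50/0/0 and 0/49/5 are taken from the worked example, while the function itself is just a generic sketch of the formula:

import numpy as np

def gini(counts):
    # Gini impurity 1 - sum_k p_k^2 for the class counts at a node
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([50, 50, 50]))  # 0.667 (root node)
print(gini([50, 0, 0]))    # 0.0   (pure leaf)
print(gini([0, 49, 5]))    # ~0.168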
Random Forests
ØOne of the most popular variants of decision trees
ØAddresses overfitting by training multiple trees on random subsamples of the features (and data), among other things
ØMajority vote of all the trees is used to make the final output
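A hedged sketch of what this looks like with scikit-learn's RandomForestClassifier; the dataset and hyperparameter values are illustrative choices, not values from the lecture:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree sees a bootstrap sample of the data and a random subset of
# features at each split; the forest prediction is the majority vote
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))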
Comparison to k-NN
ØThere are many advantages of Decision Trees over k-Nearest Neighbors:
ØGood with discrete attributes
ØRobust to scale of inputs (does not require normalization)
ØEasily handle missing values
ØGood at handling lots of attributes, especially when only a few are important
ØFast test time
ØMore interpretable
ØDecision trees are not good at handling rotations (transformations) of the data, since their splits are axis-aligned
Decision Trees Code Example (Google Colab)
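A minimal sketch of what such an example might cover (the dataset and hyperparameter values are assumptions): a DecisionTreeClassifier with the overfitting-control parameters mentioned earlier, plus plot_tree to inspect the learned rules:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth and min_samples_leaf limit how finely the tree can split,
# which reduces overfitting (no feature scaling is needed for trees)
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)
print("train acc:", tree.score(X_train, y_train))
print("test acc:", tree.score(X_test, y_test))

plot_tree(tree, filled=True)  # visualize the learned rules
plt.show()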
Clustering Strategies (Unsupervised Learning)
Clustering
ØClustering algorithms group samples/instances based on similarities in features
ØInput: set of samples/instances described by features
ØOutput: assigned cluster (group) for each sample/instance
ØClustering methods are unsupervised techniques and do not have access to the sample/instance labels
Clustering Strategies
Øk-Means Clustering
ØAssigns each point to the nearest cluster center
ØAgglomerative clustering
ØAssumes each point is a cluster and iteratively merges the closest clusters
k-Means
ØMost well-known clustering method
ØDistance-based unsupervised learning algorithm, NOT to be confused with k-NN.
Ø Algorithm:
1. Assign each sample/instance to its closest mean
2. Update the means based on the assignment
3. Repeat until convergence
Ø Requires:
—Selection of the number of clusters ‘k’ (hyperparameter)
—Centre of each cluster is randomly initialized at the start
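A minimal NumPy sketch of the algorithm above (random initialization from the data, assign, update, repeat); the convergence test and the toy data are illustrative assumptions:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly initialize the k centers at existing data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 1. Assign each sample to its closest mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2. Update the means based on the assignment (assumes no cluster becomes empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 3. Repeat until convergence (centers stop moving)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centers, labels = kmeans(X, k=2)
print(centers)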
K-Means Example
• K is number of clusters
• STEP 1: Guess center locations
• STEP 2: Map out which data point is closest to which center
• STEP 3: Each center moves to the centroid of all points it “owns”
• STEPS 2 and 3 repeat until the assignments no longer change
Doesn’t always work… can get stuck!
What are we trying to do?
Note: N points, K clusters Source: Ethan Fetaya, , Emad Andrews
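Written out (a standard formulation, assumed here to match the slide), k-means minimizes the sum of squared distances J of the N points to their assigned cluster centers:

J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2, \qquad r_{nk} = 1 \text{ if } \mathbf{x}_n \text{ is assigned to cluster } k, \text{ else } 0

The assignment step minimizes J over the assignments r_{nk}, and the update step minimizes J over the means \mu_k, which is why each step can only decrease J (see below).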
Slide credit: Ethan Fetaya, , Emad Andrews
k-Means Convergence
ØWhenever an assignment is changed, the sum squared distances, J, of data points from their assigned cluster centers is reduced
ØWhenever a cluster center is moved (to the mean of its assigned points), J is reduced
ØTest for convergence: if the assignments do not change in the assignment step, we have converged (to at least a local minimum).
credit: Ethan Fetaya, , Emad Andrews
Local Minima
ØThere is nothing to prevent k-means from getting stuck at a local minimum
Options for avoiding local minima:
Øtry many random starting points
Øsplit a big cluster into two
Ømerge two nearby clusters
credit: Ethan Fetaya, , Emad Andrews
How to choose number of clusters (k)?
ØAs k increases, the sum of squared distances goes towards zero
ØIf the plot looks like an arm, then the elbow of the arm is the optimal k
Øe.g., the elbow is at k=3 indicating the optimal number of clusters is 3
[Elbow plot: sum of squared distance vs. “k” number of clusters]
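A hedged sketch of producing such a plot with scikit-learn; KMeans.inertia_ is exactly the sum of squared distances discussed here, and the synthetic data is an illustrative assumption:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

ks = range(1, 10)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel('"k" number of clusters')
plt.ylabel("sum of squared distance")
plt.show()  # look for the "elbow" (here around k = 3)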
Shape of K-Means Clusters
ØK-means splits the space according to the closest mean:
Source: scikit-learn
ØNote that the clusters form convex regions.
Convex Region
ØA region is convex if any line between two points in the region remains in the region.
K-Means with Non-Convex Clusters
K-Means (K = 2)
K-means is unable to separate non-convex clusters
Agglomerative Clustering
ØA type of Hierarchical Clustering
Ø Algorithm:
1. Start with each point in its own cluster
2. Each step merges the two “closest” clusters
Source: MachineLearningStories
Example: Agglomerative Clustering
[Dendrogram: cutting at a distance threshold determines the # of clusters]
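A minimal sketch using SciPy's hierarchical clustering utilities to build a dendrogram and cut it at a distance threshold (the Ward linkage, threshold value, and toy data are assumptions):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Every point starts as its own cluster; the two closest clusters are merged at each step
Z = linkage(X, method="ward")

dendrogram(Z)   # the full merge history
plt.show()

# Cutting the dendrogram at a distance threshold yields the flat clusters
labels = fcluster(Z, t=10.0, criterion="distance")
print(labels)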
Comparison of Clustering Methods
ØThere are many clustering algorithms to choose from
ØPerformance will depend on your data
Source: scikit-learn
Applications in Computer Vision
ØReplace samples with the mean of their cluster (vector quantization)
ØVisual example:
ØInputs: color pixels represented as RGB values
ØOutputs: cluster (average RGB) value obtained using k-means
ØQ: How can this be applied for compression?
ØQ: How can this be applied for image segmentation?
Source: PRML Bishop 2006
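A hedged sketch of the vector-quantization idea (the sample image and the value of k are assumptions): cluster the RGB pixel values with k-means and replace each pixel by its cluster's average color. Storing only k colors plus one small index per pixel is what enables compression, and the per-pixel cluster map can double as a crude segmentation.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

img = load_sample_image("china.jpg") / 255.0   # H x W x 3 RGB image
pixels = img.reshape(-1, 3)                    # one row per pixel

k = 8                                          # number of colors to keep
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)

# Replace every pixel by the mean RGB value of its cluster
quantized = km.cluster_centers_[km.labels_].reshape(img.shape)
print(quantized.shape)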
K-Means Code Example (Google Colab)
ØLab: Q/A Support Session
ØHelp with Python and Project 1
ØWeek 4 Lecture – Measuring Uncertainty
ØProbability Theory
ØSummary Statistics
ØMultivariate Gaussians
ØPerformance Metrics