Foundations of Data Analytics and Machine Learning
Summer 2022
Week 3: Foundations of Learning
• k-Nearest Neighbors
• Decision Trees
• Clustering
10 min quick review!
Data pre-processing
ØWhy is splitting the dataset needed?
ØWhy is standardization needed?
ØErrors (what could go wrong):
ØHaving multiple train-test splits in the code.
ØStandardizing the data before splitting.
ØFitting the standard scaler on the test/validation set.
Validation Set
ØWe still want to track the loss/accuracy on a data set not used for training
ØIdea: set aside a separate data set, called the validation set
ØTrack validation accuracy in the training curve
ØMake decisions about model architecture using the validation set
Standardization is the key
x: (Area, # bedrooms, # washrooms)
x: (2500, 5, 5)
x: (2400, 2, 3)
x: (2350, 5, 5)
Euclidean distance
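To make this concrete, here is a minimal sketch (using the values above; the feature names and the scikit-learn usage are illustrative assumptions) showing how the raw Area column dominates the Euclidean distance until the features are standardized:

import numpy as np
from sklearn.preprocessing import StandardScaler

# (Area, # bedrooms, # washrooms) -- values from the slide above
X = np.array([[2500, 5, 5],
              [2400, 2, 3],
              [2350, 5, 5]], dtype=float)

# Raw Euclidean distances: Area dominates, so the first house looks closer to the
# second one than to the third, even though the third matches its room counts exactly
print(np.linalg.norm(X[0] - X[1]))   # ~100.1
print(np.linalg.norm(X[0] - X[2]))   # 150.0

# After standardization every feature contributes on a comparable scale,
# and the first house is now closest to the third
X_std = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_std[0] - X_std[1]))
print(np.linalg.norm(X_std[0] - X_std[2]))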
k-Nearest Neighbor Classifier
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
Today’s focus is on Foundations of Learning
ØSupervised Learning:
Øk-Nearest Neighbors
ØDecision Trees
ØClustering Strategies (unsupervised):
Øk-Means
ØAgglomerative Clustering
k-Nearest-Neighbors (Supervised Learning)
k-Nearest Neighbor Classification
ØDistance-based supervised learning approach
ØInstance-based learning or “lazy” learning (just stores the training data)
ØFlexible and makes no assumptions on data distribution
ØTypically used for classification, but can also be extended to regression
k-Nearest Neighbor Classification
ØWorks on multidimensional data
ØFor visualization purposes we will use a 2-dimensional example
Ø+ represents unknown test samples that we want to classify
Simple Algorithm
ØWe can assume that similar samples will be located close together
ØSimple Algorithm
Calculate distance → Obtain the nearest neighbor → Determine labels
ØFor a single nearest Neighbor
How do we measure distance?
(D: the dimension of x or y)
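The metric assumed in the examples here is the Euclidean distance; a reconstruction of the usual formula is:

d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{j=1}^{D} (x_j - y_j)^2}

Other metrics (e.g., Manhattan distance) can be substituted; scikit-learn's KNeighborsClassifier exposes this through its metric parameter.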
Decision Boundary
ØCan generate arbitrary test points on the plane and apply kNN
ØThe boundary between regions of input space assigned to different categories.
Voronoi Diagrams (k=1)
Euclidean distance
Source: Wikipedia
Spherical Voronoi – Source:
Selection of k
ØQ: What happens if we let k be very small?
ØQ: What happens if we let k be very large?
Tradeoffs in choosing k?
ØSmall k:
ØGood at capturing fine-grained patterns
ØMay overfit, i.e. be sensitive to random noise
Excellent on training data, not so good on new data: too complex
ØLarge k:
ØMakes stable predictions by averaging over many examples
ØMay underfit, i.e. fail to capture important regularities
Not so good on training data or on new data: too simple
What is the best k?
ØSelect the k based on the best performance on the validation set, report the results on the test set
ØGenerally, the performance on the validation set will be better than on the test set
ØQ: How do you think it will perform on the training set?
Normalization
ØNearest Neighbors can be sensitive to the ranges of different features
ØOften, the units are arbitrary
ØSimple fix: normalize each dimension to be zero mean and unit variance, i.e., compute the mean μ_j and standard deviation σ_j of each dimension j, and take x̃_j = (x_j − μ_j) / σ_j
ØCaution: depending on the problem, the scale might be important!
KNN Code Example (Google Colab)
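A minimal sketch of the kind of pipeline such a notebook might contain (the iris dataset, split sizes, and candidate k values are assumptions, not the actual Colab contents). Note that the scaler is fit on the training set only and k is chosen on the validation set:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# One train/validation/test split (60/20/20), created once
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Fit the scaler on the training set only, then apply it to all three sets
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = scaler.transform(X_train), scaler.transform(X_val), scaler.transform(X_test)

# Choose k by validation accuracy, then report the test accuracy once
best_k, best_acc = 1, 0.0
for k in (1, 3, 5, 7, 9, 11):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc

final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(f"best k={best_k}, val acc={best_acc:.3f}, test acc={final.score(X_test, y_test):.3f}")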
Decision Trees (Supervised Learning)
Decision Trees
ØA rule-based supervised learning algorithm
ØPowerful algorithm capable of fitting complex datasets
ØCan be applied to classification (discrete) and regression (continuous) tasks.
ØHighly interpretable!
ØA fundamental component of Random Forests, which are among the most widely used Machine Learning algorithms today
Lemon Vs. Orange!
Flowchart-like structure!
Test example
Constructing a Decision Tree
ØDecision trees make predictions by recursively splitting on different attributes according to a tree structure
[Figure: training samples plotted by width (cm), partitioned by the recursive splits]
What if the attributes are discrete?
Attributes: Features (inputs)! Discrete or Continuous
Output is Discrete (Classification)
e.g., “Wait for table” vs. “Go somewhere else”
Output is Continuous (Regression)
Source: GDCoder
ØInstead of predicting a class at each leaf node, predict a value based on the average of all instances at the leaf node.
Summary: Discrete vs Continuous Output
ØClassification Tree:
Ødiscrete output
Øoutput node (leaf) typically set to the most common value
ØRegression Tree:
Øcontinuous output
Øoutput node (leaf) value typically set to the mean value in data
Generalization
ØDecision trees can fit any function arbitrarily closely
ØCould potentially create a leaf for each example in the training dataset
ØNot likely to generalize to test data!
ØNeed some way to prune the tree!
Managing Overfitting
ØAdd parameters to reduce potential for overfitting
ØParameters include:
Ødepth of tree
Øminimum number of samples required to split a node or to form a leaf
Gini Impurity
ØFormally: the likelihood of incorrectly classifying a new instance of a random variable, if that new instance were randomly classified according to the distribution of class labels in the data set.
ØIntuitively: Chance of making a mistake (lower is better)
G_i = 1 − Σ_{k=1}^{J} p_{i,k}², where J is the number of classes and p_{i,k} is the ratio of class-k instances among the samples at the ith node.
Model Interpretation
1 − ((50/150)² + (50/150)² + (50/150)²) = 0.667
1 − ((50/50)² + (0/50)² + (0/50)²) = 0
1 − ((0/54)² + (49/54)² + (5/54)²) ≈ 0.168
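The three values above can be reproduced with a short helper; the class counts 50/50/50, 50/0/0 and 0/49/5 are taken from the worked example, while the function itself is just a generic sketch of the formula:

import numpy as np

def gini(counts):
    # Gini impurity 1 - sum_k p_k^2 for the class counts at a node
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([50, 50, 50]))  # 0.667 (root node)
print(gini([50, 0, 0]))    # 0.0   (pure leaf)
print(gini([0, 49, 5]))    # ~0.168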
Random Forests
ØOne of the most popular variants of decision trees
ØAddresses overfitting by training multiple trees on random subsamples of the features (and data), among other things
ØMajority vote of all the trees is used to make the final output
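A hedged sketch of what this looks like with scikit-learn's RandomForestClassifier; the dataset and hyperparameter values are illustrative choices, not values from the lecture:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree sees a bootstrap sample of the data and a random subset of
# features at each split; the forest prediction is the majority vote
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))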
Comparison to k-NN
ØThere are many advantages of Decision Trees over k-Nearest Neighbors:
ØGood with discrete attributes
ØRobust to scale of inputs (does not require normalization)
ØEasily handle missing values
ØGood at handling lots of attributes, especially when only a few are important
ØFast test time
ØMore interpretable
ØDecision trees are not good at handling rotations (transformations) of the data, since their splits are axis-aligned
Decision Trees Code Example (Google Colab)
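A minimal sketch of what such an example might cover (the dataset and hyperparameter values are assumptions): a DecisionTreeClassifier with the overfitting-control parameters mentioned earlier, plus plot_tree to inspect the learned rules:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth and min_samples_leaf limit how finely the tree can split,
# which reduces overfitting (no feature scaling is needed for trees)
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)
print("train acc:", tree.score(X_train, y_train))
print("test acc:", tree.score(X_test, y_test))

plot_tree(tree, filled=True)  # visualize the learned rules
plt.show()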
Clustering Strategies (Unsupervised Learning)
Clustering
ØClustering algorithms group samples/instances based on similarities in features
ØInput: set of samples/instances described by features
ØOutput: assigned cluster (group) for each sample/instance
ØClustering methods are unsupervised techniques and do not have access to the sample/instance labels
Clustering Strategies
Øk-Means Clustering
ØAssigns each point to the nearest cluster center
ØAgglomerative clustering
ØAssumes each point is a cluster and iteratively merges the closest clusters
k-Means
ØMost well-known clustering method
ØDistance-based unsupervised learning algorithm, NOT to be confused with k-NN.
Ø Algorithm:
1. Assign each sample/instance to its closest mean
2. Update the means based on the assignment
3. Repeat until convergence
Ø Requires:
—Selection of the number of clusters ‘k’ (hyperparameter)
—Centre of each cluster is randomly initialized at the start
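A minimal NumPy sketch of the algorithm above (random initialization from the data, assign, update, repeat); the convergence test and the toy data are illustrative assumptions:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly initialize the k centers at existing data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 1. Assign each sample to its closest mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2. Update the means based on the assignment (assumes no cluster becomes empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 3. Repeat until convergence (centers stop moving)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centers, labels = kmeans(X, k=2)
print(centers)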
K-Means Example
• K is number of clusters
• STEP 1: Guess center locations
• STEP 2: Map out which data point is closest to which center
• STEP 3: Each center moves to the centroid of all points it “owns”
• STEPS 2 and 3 repeat until the assignments no longer change
Doesn’t always work… can get stuck!
What are we trying to do?
Note: N points, K clusters Source: Ethan Fetaya, , Emad Andrews
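Written out (a standard formulation, assumed here to match the slide), k-means minimizes the sum of squared distances J of the N points to their assigned cluster centers:

J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2, \qquad r_{nk} = 1 \text{ if } \mathbf{x}_n \text{ is assigned to cluster } k, \text{ else } 0

The assignment step minimizes J over the assignments r_{nk}, and the update step minimizes J over the means \mu_k, which is why each step can only decrease J (see below).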
Slide credit: Ethan Fetaya, , Emad Andrews
k-Means Convergence
ØWhenever an assignment is changed, the sum squared distances, J, of data points from their assigned cluster centers is reduced
ØWhenever a cluster center is moved (to the mean of its assigned points), J is reduced
ØTest for convergence: if the assignments do not change in the assignment step, we have converged (to at least a local minimum).
credit: Ethan Fetaya, , Emad Andrews
Local Minima
ØThere is nothing to prevent k-means from getting stuck at a local minimum
Options for avoiding local minima:
Øtry many random starting points
Øsplit a big cluster into two
Ømerge two nearby clusters
credit: Ethan Fetaya, , Emad Andrews
How to choose number of clusters (k)?
ØAs k increases, the sum of squared distances goes towards zero
ØIf the plot looks like an arm, then the elbow of the arm is the optimal k
Øe.g., the elbow is at k=3 indicating the optimal number of clusters is 3
[Elbow plot: sum of squared distance vs. “k” number of clusters]
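A hedged sketch of producing such a plot with scikit-learn; KMeans.inertia_ is exactly the sum of squared distances discussed here, and the synthetic data is an illustrative assumption:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

ks = range(1, 10)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel('"k" number of clusters')
plt.ylabel("sum of squared distance")
plt.show()  # look for the "elbow" (here around k = 3)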
Shape of K-Means Clusters
ØK-means splits the space according to the closest mean:
Source: scikit-learn
ØNote that the clusters form convex regions.
Convex Region
ØA region is convex if any line between two points in the region remains in the region.
K-Means with Non-Convex Clusters
K-Means (K = 2)
K-means is unable to separate non-convex clusters
Agglomerative Clustering
ØA type of Hierarchical Clustering
Ø Algorithm:
1. Start with each point in its own cluster
2. Each step merges the two “closest” clusters
Source: MachineLearningStories
Example: Agglomerative Clustering
[Dendrogram: cutting at a distance threshold determines the # of clusters]
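A minimal sketch using SciPy's hierarchical clustering utilities to build a dendrogram and cut it at a distance threshold (the Ward linkage, threshold value, and toy data are assumptions):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Every point starts as its own cluster; the two closest clusters are merged at each step
Z = linkage(X, method="ward")

dendrogram(Z)   # the full merge history
plt.show()

# Cutting the dendrogram at a distance threshold yields the flat clusters
labels = fcluster(Z, t=10.0, criterion="distance")
print(labels)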
Comparison of Clustering Methods
ØThere are many clustering algorithms to choose from
ØPerformance will depend on your data
Source: scikit-learn
Applications in Computer Vision
ØReplace samples with the mean of their cluster (vector quantization)
ØVisual example:
ØInputs: color pixels represented as RGB values
ØOutputs: cluster (average RGB) value obtained using k-means
ØQ: How can this be applied for compression?
ØQ: How can this be applied for image segmentation?
Source: PRML Bishop 2006
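A hedged sketch of the vector-quantization idea (the sample image and the value of k are assumptions): cluster the RGB pixel values with k-means and replace each pixel by its cluster's average color. Storing only k colors plus one small index per pixel is what enables compression, and the per-pixel cluster map can double as a crude segmentation.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

img = load_sample_image("china.jpg") / 255.0   # H x W x 3 RGB image
pixels = img.reshape(-1, 3)                    # one row per pixel

k = 8                                          # number of colors to keep
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)

# Replace every pixel by the mean RGB value of its cluster
quantized = km.cluster_centers_[km.labels_].reshape(img.shape)
print(quantized.shape)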
K-Means Code Example (Google Colab)
ØLab: Q/A Support Session
ØHelp with Python and Project 1
ØWeek 4 Lecture – Measuring Uncertainty
ØProbability Theory
ØSummary Statistics
ØMultivariate Gaussians
ØPerformance Metrics