CSCI 567: Machine Learning
Lecture 9, Nov 3
Administrivia
• Project details are out
• Form groups (of 4) by Nov 11; the minimum team size is 3.
• Q5 on HW4 will help you get started on it.
• We’ll give an overview of the project and general tips in today’s discussion.
• HW4 is due in about two weeks (Nov 16 at 2pm).
• We’ll release another question on PCA tomorrow.
• Today’s plan:
• Finish ensemble methods
• Unsupervised learning:
• Clustering
Ensemble methods:
Ensemble methods
• Random forests
• Boosting: Basics
• Adaboost
• Gradient boosting
Collect T subsets, each of some fixed size (say m), by sampling with replacement from the training data.
Let f_t(x) be the classifier (such as a decision tree) obtained by training on subset t ∈ {1, . . . , T}. Then the aggregated classifier f_agg(x) is given by:
f_agg(x) = (1/T) Σ_{t=1}^T f_t(x) for regression,
f_agg(x) = Majority Vote{f_t(x)}_{t=1}^T for classification.
• Reduces overfitting (i.e., variance)
• Can work with any type of classifier (here focus on trees)
• Easy to parallelize (can train multiple trees in parallel)
• But loses interpretability compared to a single decision tree (true for all ensemble methods)
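For concreteness, here is a minimal sketch of bagging with decision trees (using scikit-learn stumps; the helper names and integer-label assumption are mine, not from the slides):

```python
# Bagging sketch: train T trees on bootstrap samples and aggregate by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=50, m=None, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    m = m or n                                 # size of each bootstrap sample
    trees = []
    for _ in range(T):
        idx = rng.integers(0, n, size=m)       # sample with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    # Majority vote over the T trees (classification); assumes integer labels 0, 1, ...
    votes = np.stack([t.predict(X) for t in trees]).astype(int)   # shape (T, n_test)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```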
Ensemble methods
• Random forests
• Boosting: Basics
• Adaboost
• Gradient boosting
Random forests: When growing a tree on a bootstrapped (i.e. subsampled) dataset,
before each split select m ≤ p of the p input variables at random as candidates for splitting.
When m = p → same as bagging
When m < p → random forests
m is a hyperparameter, tuned via cross-validation
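In scikit-learn the per-split parameter m corresponds to `max_features`; a hedged sketch of tuning it by cross-validation (the dataset and the candidate grid are illustrative choices, not from the lecture):

```python
# Random forest: bagging + a random subset of m features considered at each split.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    param_grid={"max_features": ["sqrt", 0.3, 0.6, 1.0]},  # 1.0 = all features -> plain bagging
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```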
Random forests
Ensemble methods
• Random forests
• Boosting: Basics
• Adaboost
• Gradient boosting
Boosting: Idea
The boosted predictor is of the form f_boost(x) = sign(h(x)), where
h(x) = Σ_{t=1}^T α_t f_t(x) for α_t ≥ 0 and f_t ∈ F.
The goal is to minimize ℓ(h(x), y) for some loss function ℓ.
Q: We know how to find the best predictor in F on some data, but how do we find the best weighted combination h(x)?
A: Minimize the loss by a greedy approach, i.e. find α_t, f_t(x) one by one for t = 1, . . . , T.
Specifically, let h_t(x) = Σ_{τ=1}^t α_τ f_τ(x). Suppose we have found h_{t−1}(x); how do we find α_t, f_t(x)?
Find the α_t, f_t(x) which minimize the loss ℓ(h_t(x), y).
Different loss functions ℓ give different boosting algorithms:
ℓ(h(x), y) = (h(x) − y)² → least squares boosting,
ℓ(h(x), y) = exp(−h(x) y) → AdaBoost.
Ensemble methods
• Random forests
• Boosting: Basics
• Adaboost
• Gradient boosting
Given a training set S and a base algorithm A, initialize D_1 to be uniform.
For t = 1, . . . , T:
• obtain a weak classifier f_t(x) ← A(S, D_t)
• calculate the weight α_t of f_t(x) as
α_t = (1/2) ln((1 − ε_t) / ε_t)   (so α_t > 0 ⟺ ε_t < 0.5)
where ε_t = Σ_{i : f_t(x_i) ≠ y_i} D_t(i) is the weighted error of f_t(x).
• update the distribution
D_{t+1}(i) ∝ D_t(i) e^{−α_t y_i f_t(x_i)} = D_t(i) × e^{−α_t} if f_t(x_i) = y_i, D_t(i) × e^{α_t} otherwise.
Output the final classifier f_boost = sgn(Σ_{t=1}^T α_t f_t(x)).
AdaBoost: Full algorithm
Put more weight on difficult-to-classify instances and less on those already handled well.
New weak learners are added sequentially, focusing their training on the more difficult patterns.
Adaboost: Example
10 data points in R²
The size of each + or − indicates its weight, which starts from the uniform distribution D_1.
weak classifiers = vertical or horizontal half-planes
The base algorithm is a decision stump:
Go through the calculations in the example to make sure you understand the algorithm
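For reference, a compact NumPy sketch of the algorithm above with decision stumps as the weak learner (a sketch only: labels are assumed to be ±1, 0 < ε_t < 0.5, and all names are mine, not from the slides):

```python
import numpy as np

def stump_fit(X, y, D):
    """Weak learner A(S, D): pick the (feature, threshold, sign) stump with lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = s * np.where(X[:, j] <= thr, 1, -1)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, s)
    return best                                  # (eps_t, feature, threshold, sign)

def adaboost(X, y, T=10):
    n = len(y)
    D = np.full(n, 1.0 / n)                      # D_1 uniform
    stumps, alphas = [], []
    for _ in range(T):
        eps, j, thr, s = stump_fit(X, y, D)
        alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t > 0  <=>  eps_t < 0.5
        pred = s * np.where(X[:, j] <= thr, 1, -1)
        D = D * np.exp(-alpha * y * pred)        # up-weight mistakes, down-weight correct points
        D /= D.sum()
        stumps.append((j, thr, s)); alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    score = sum(a * s * np.where(X[:, j] <= thr, 1, -1)
                for (j, thr, s), a in zip(stumps, alphas))
    return np.sign(score)                        # f_boost = sgn(sum_t alpha_t f_t(x))
```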
Ensemble methods:
Gradient boosting
Ensemble methods
• Random forests
• Boosting: Basics
• Adaboost
• Gradient boosting
Gradient Boosting
Recall h_t(x) = Σ_{τ=1}^t α_τ f_τ(x). For AdaBoost (exponential loss), given h_{t−1}(x), we found what f_t(x) should be.
Gradient boosting provides an iterative approach for a general (any) loss function ℓ(h(x), y):
• For all training datapoints (x_i, y_i), compute the negative gradient (the "pseudo-residual")
r_i = − ∂ℓ(h(x_i), y_i) / ∂h(x_i), evaluated at h(x_i) = h_{t−1}(x_i).
• Use the weak learner to find the f_t which fits (x_i, r_i) as well as possible:
f_t = argmin_{f ∈ F} Σ_i (r_i − f(x_i))².
• Update h_t(x) = h_{t−1}(x) + η f_t(x), for some step size η.
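A minimal sketch of these steps for squared loss, where the negative gradient r_i is just the residual y_i − h_{t−1}(x_i); the tree depth and step size are illustrative choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, T=100, eta=0.1, max_depth=2):
    """Gradient boosting for squared loss: fit each weak learner to the current residuals."""
    h = np.zeros(len(y))                # h_0(x) = 0
    learners = []
    for _ in range(T):
        r = y - h                       # negative gradient of 0.5*(h - y)^2 at h = h_{t-1}
        f = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        h = h + eta * f.predict(X)      # h_t = h_{t-1} + eta * f_t
        learners.append(f)
    return learners

def gb_predict(learners, X, eta=0.1):
    return eta * sum(f.predict(X) for f in learners)
```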
Gradient Boosting
Usually we add some regularization term to prevent overfitting (e.g. penalize the size of the trees).
Gradient boosting is extremely successful!
A variant, XGBoost, is one of the most popular algorithms for structured data (tables of numbers and categories where each feature typically has some meaning, unlike images or text).
(For example, in Kaggle competitions back in 2015, 17 out of 29 winning solutions used XGBoost.)
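For reference, a hedged usage sketch with XGBoost's scikit-learn wrapper (the hyperparameter values are placeholders for illustration, not recommendations from the lecture):

```python
# Requires `pip install xgboost`; typical workflow on a tabular dataset.
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=300,    # number of boosting rounds T
    learning_rate=0.1,   # step size eta
    max_depth=3,         # size of each tree (regularization)
    reg_lambda=1.0,      # L2 regularization on leaf weights
)
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```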
Unsupervised
learning: PCA
A simplistic taxonomy of ML
• Supervised learning: aim to predict outputs of future datapoints
• Unsupervised learning: aim to discover hidden patterns and explore data
• Reinforcement learning: aim to make sequential decisions
Principal Component Analysis (PCA)
• Introduction
• Formalizing the problem
• How to use PCA, and examples
• Solving the PCA optimization problem
• Conclusion
Acknowledgement & further reading
Our presentation is closely based on the lecture notes for CS168 at Stanford:
https://web.stanford.edu/class/cs168/l/l7.pdf
https://web.stanford.edu/class/cs168/l/l8.pdf
You can refer to these notes for further reading.
Also review our Linear algebra Colab notebooks:
https://colab.research.google.com/drive/1Wo6pC6ybebfIV8Nz06aHAVa1amjYDGQO?usp=sharing
We’ll start with a simple and fundamental unsupervised learning problem: dimensionality reduction.
Goal: reduce the dimensionality of a dataset so that
• it is easier to visualize and discover patterns
• it takes less time and space to process for any downstream application (classification, regression, etc)
• noise is reduced
There are many approaches; we focus on a linear method: Principal Component Analysis (PCA).
Dimensionality reduction & PCA
Consider the following dataset:
• 17 features, each represents the average consumption of some food
• 4 data points, each represents some country.
What can you tell?
Hard to say anything looking at all these 17 features.
Picture from http://setosa.io/ev/principal-component-analysis/
See https://people.duke.edu/~hpgavin/SystemID/References/Richardson-PCA-2009.pdf for more details
PCA: Motivation
PCA can help us! The projection of the data onto its first principal component:
i.e. we reduce the dimensionality from 17 to just 1.
Now one data point is clearly different from the rest!
PCA: Motivation
That turns out to be the data from Northern Ireland, the only country not on the island of Great Britain out of the 4 samples.
We can also interpret the components: PC1 tells us that the Northern Irish eat more grams of fresh potatoes and fewer of fresh fruits and alcoholic drinks.
PCA: Motivation
We can find the second (and more) principal components of the data too:
PCA: Motivation
And the components themselves are interpretable too:
See https://people.duke.edu/~hpgavin/SystemID/References/Richardson-PCA-2009.pdf for more details
Principal Component Analysis (PCA)
• Introduction
• Formalizing the problem
• How to use PCA, and examples
• Solving the PCA optimization problem
• Conclusion
Suppose we have a dataset of n datapoints x_1, x_2, . . . , x_n ∈ R^d.
The high-level goal of PCA is to find a set of k principal components (PCs) / principal vectors v_1, . . . , v_k ∈ R^d such that for each x_i,
x_i ≈ Σ_{j=1}^k α_{ij} v_j
for some coefficients α_{ij} ∈ R.
High-level goal
Preprocessing the data
• Before we apply PCA, we usually preprocess the data to center it
• In many applications, it is also important to scale each coordinate properly. This is especially true if
the coordinates are in different units or scales.
Figure from CS168 notes
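A minimal sketch of this preprocessing (centering, and optionally scaling each coordinate to unit variance); whether to scale depends on the application, as noted above:

```python
import numpy as np

def preprocess(X, scale=True):
    """Center each coordinate; optionally divide by its standard deviation."""
    Xc = X - X.mean(axis=0)                       # centering: subtract the per-feature mean
    if scale:
        std = Xc.std(axis=0)
        Xc = Xc / np.where(std > 0, std, 1.0)     # avoid dividing by zero for constant features
    return Xc
```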
Objective function for PCA
Key difference from supervised learning problems:
No labels given, which means no ground-truth to measure the quality of the answer!
However, we can still write an optimization problem based on our high-level goal.
For clarity, we first discuss the special case of k = 1.
Optimization problem for finding the 1st principal component v_1:
v_1 = argmax_{v : ‖v‖_2 = 1} (1/n) Σ_{i=1}^n ⟨x_i, v⟩²
Figure from CS168 notes
An example:
Objective function for larger values of k
The generalization of the original formulation to general k is to find a k-dimensional subspace S such that the points are as close to it as possible:
S = argmin_{k-dim subspaces S} Σ_{i=1}^n (distance between x_i and subspace S)²
By the same reasoning as for k = 1, this is equivalent to
S = argmax_{k-dim subspaces S} Σ_{i=1}^n (length of x_i's projection on S)²
It is useful to think of the subspace S as the span of k orthonormal vectors v_1, . . . , v_k ∈ R^d:
• k = 1: the span is a line through the origin.
• k = 2: if v_1, v_2 are linearly independent, the span is a plane through the origin, and so on.
Formal problem solved by PCA:
Given x_1, . . . , x_n ∈ R^d and a parameter k ≥ 1, compute orthonormal vectors v_1, . . . , v_k ∈ R^d to maximize
(1/n) Σ_{i=1}^n Σ_{j=1}^k ⟨x_i, v_j⟩²
Equivalent view:
• Pick v_1 to be the variance-maximizing direction.
• Pick v_2 to be the variance-maximizing direction orthogonal to v_1.
• Pick v_3 to be the variance-maximizing direction orthogonal to v_1 and v_2, and so on.
Principal Component Analysis (PCA)
• Introduction
• Formalizing the problem
• How to use PCA, and examples
• Solving the PCA optimization problem
• Conclusion
Input: n datapoints x_1, x_2, . . . , x_n ∈ R^d, and the number of components k we want.
Step 1: Perform PCA to get the top k principal components v_1, . . . , v_k ∈ R^d.
Step 2: For each datapoint x_i, define its "v_1-coordinate" as ⟨x_i, v_1⟩, its "v_2-coordinate" as ⟨x_i, v_2⟩, and so on. Therefore we associate k coordinates to each datapoint x_i, where the j-th coordinate denotes the extent to which x_i points in the direction of v_j.
Step 3: We now have a new "compressed" dataset where each datapoint is k-dimensional. For visualization, we can plot the point x_i in R^k as the point (⟨x_i, v_1⟩, ⟨x_i, v_2⟩, . . . , ⟨x_i, v_k⟩).
Using PCA for data compression and visualization
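Steps 2–3 in code, assuming the centered data is stacked as an (n, d) matrix X and the top-k principal components as the columns of a (d, k) matrix V (my notation, for illustration):

```python
import numpy as np

def pca_coordinates(X, V):
    """X: (n, d) centered data; V: (d, k) orthonormal principal components.
    Returns the (n, k) compressed dataset whose j-th column holds <x_i, v_j>."""
    return X @ V

def pca_reconstruct(Z, V):
    """Map k-dimensional coordinates back to R^d: x_i ≈ sum_j <x_i, v_j> v_j."""
    return Z @ V.T

# For visualization with k = 2:  Z = pca_coordinates(X, V); plt.scatter(Z[:, 0], Z[:, 1])
```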
Visualization example: Do Genomes Encode Geography?
Dataset: genomes of 1,387 Europeans (each individual’s genotype at 200,000 locations in the genome)
n = 1,387, d ≈ 200,000
Project the datapoints onto top 2 PCs
Plot shown below; looks remarkably like the map of Europe!
From the paper "Genes mirror geography within Europe", Novembre et al., Nature 2008.
Compression example: 256×256 (≈65K pixels) images of about 2,500 faces, all framed similarly
n = 2,500, d ≈ 65,000
We can represent each image with high accuracy
using only 100-150 principal components!
The principal components (called eigenfaces here)
are themselves interpretable too!
From the paper "Eigenfaces for recognition", Turk & Pentland, Journal of Cognitive Neuroscience 1991.
Principal Component Analysis (PCA)
• Introduction
• Formalizing the problem
• How to use PCA, and examples
• Solving the PCA optimization problem
• Conclusion
How to solve the PCA optimization problem?
The diagonal case
Let's solve argmax_{v : ‖v‖_2 = 1} vᵀAv for the special case where A is a diagonal matrix.
Any d ! d matrix A can be thought of as a function that maps points in Rd back to points in Rd:
The matrix A = diag(2, 1) maps (x, y) to (2x, y):
Figure from CS168 notes
Points on the circle {(x, y) : x² + y² = 1} are mapped to the ellipse {(x, y) : x²/4 + y² = 1}.
So what direction v should maximize vᵀAv for diagonal A?
It should be the direction of maximum stretch (here e_1 = (1, 0), the axis with the largest diagonal entry):
Diagonals in disguise
The previous figure, rotated 45 degrees.
The matrix still does nothing other than stretch
out different orthogonal axes, possibly
with these axes being a “rotated
version” of the original ones.
How do we formalize the concept of a rotation in high dimensions as a matrix operation?
Answer: Orthogonal matrix (also called orthonormal matrix).
Recall that we want to find v_1 = argmax_{v : ‖v‖_2 = 1} vᵀAv.
Now consider A that can be written as A = QDQᵀ for an orthogonal matrix Q and a diagonal matrix D with diagonal entries λ_1 ≥ λ_2 ≥ . . . ≥ λ_d ≥ 0.
What is the direction which gets stretched the maximum?
(Informal answer) The maximum possible stretch by D is λ_1, and the direction of maximum stretch under D is e_1. Multiplying by Q or Qᵀ preserves lengths, so the direction of maximum stretch under A = QDQᵀ is the v such that Qᵀv = e_1, i.e. v = (Qᵀ)⁻¹ e_1 = Q e_1.
General covariance matrices
When k = 1, the solution to argmax_{v : ‖v‖_2 = 1} vᵀAv is the first column of Q, where A = XᵀX = QDQᵀ with Q orthogonal and D diagonal with sorted entries. (Here X is the n × d matrix whose i-th row is the centered datapoint x_iᵀ.)
General values of k
What is the solution to the PCA objective for general values of k?
Solution: Pick the first k columns of Q, where the covariance matrix XᵀX = QDQᵀ with Q orthogonal and D diagonal with sorted entries.
Since Q is orthogonal, the first k columns of Q are orthonormal vectors. These are called the top k principal components (PCs).
Eigenvalues & eigenvectors
How to compute the top k columns of Q in the decomposition XᵀX = QDQᵀ?
Solution: Eigenvalue decomposition!
Eigenvectors: axes of stretch in geometric intuition
Eigenvalues: stretch factors
PCA boils down to computing the k eigenvectors of the covariance matrix XᵀX that have the largest eigenvalues.
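Putting it together, a sketch of PCA via the eigendecomposition of XᵀX (using numpy.linalg.eigh, which is suited to symmetric matrices; function names are mine):

```python
import numpy as np

def pca(X, k):
    """Return the top-k principal components of X as the columns of a (d, k) matrix,
    plus all eigenvalues sorted in decreasing order."""
    Xc = X - X.mean(axis=0)                 # center the data first
    A = Xc.T @ Xc                           # (unnormalized) covariance matrix X^T X
    eigvals, eigvecs = np.linalg.eigh(A)    # eigh: ascending eigenvalues, orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]       # re-sort eigenvalues in decreasing order
    return eigvecs[:, order[:k]], eigvals[order]
```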
Principal Component Analysis (PCA)
• Introduction
• Formalizing the problem
• How to use PCA, and examples
• Solving the PCA optimization problem
• Conclusion
For visualization, we usually choose k to be small and just pick the first few principal components.
In other applications such as compression, it is a good idea to plot the eigenvalues and look at how quickly they decay. A lot of data is close to being low rank, so the eigenvalues may decay quickly and become small.
We can also choose k based on how much variance we want to capture. Suppose we want to capture 90% of the variance in the data. Then we can pick the smallest k such that
(Σ_{j=1}^k λ_j) / (Σ_{j=1}^d λ_j) ≥ 0.9,
where λ_1 ≥ · · · ≥ λ_d are the sorted eigenvalues.
Note that Σ_{j=1}^d λ_j = trace(XᵀX), so there is no need to actually find all the eigenvalues.
How many PCs to use?
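A sketch of the 90%-variance rule above (here `eigvals` are the eigenvalues λ_1, . . . , λ_d of the covariance matrix; the function name is mine):

```python
import numpy as np

def choose_k(eigvals, target=0.90):
    """Smallest k whose top-k eigenvalues capture `target` fraction of the total variance."""
    eigvals = np.sort(eigvals)[::-1]
    frac = np.cumsum(eigvals) / eigvals.sum()    # eigvals.sum() equals trace(X^T X)
    return int(np.searchsorted(frac, target) + 1)
```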
When and why does PCA fail?
1. Data is not properly scaled/normalized.
2. Non-orthogonal structure in the data: PCs are forced to be orthogonal, and there may not be many orthogonal components in the data which are all interpretable.
3. Non-linear structure in the data.