Concise Machine Learning
Jonathan Richard Shewchuk May 26, 2020
Department of Electrical Engineering and Computer Sciences University of California at Berkeley
Berkeley, California 94720
Abstract
This report contains lecture notes for UC Berkeley’s introductory class on Machine Learning. It covers many methods for classification and regression, and several methods for clustering and dimensionality reduction. It is concise because nothing is included that cannot be written or spoken in a single semester’s lectures (with whiteboard lectures and almost no slides!) and because the choice of topics is limited to a small selection of particularly useful, popular algorithms.
Supported in part by the National Science Foundation under Awards CCF-1423560 and CCF-1909204, in part by the University of California Lab Fees Research Program, and in part by an Alfred P. Sloan Research Fellowship. The claims in this document are those of the author. They are not endorsed by the sponsors or the U.S. Government.
Keywords: machine learning, classification, regression, density estimation, dimensionality reduction, clus- tering, perceptrons, support vector machines (SVMs), Gaussian discriminant analysis, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), logistic regression, decision trees, random forests, ensemble learning, bagging, boosting, AdaBoost, neural networks, convolutional neural networks (CNNs, ConvNets), nearest neighbor search, least-squares linear regression, logistic regression, polynomial regres- sion, ridge regression, Lasso, bias-variance tradeoff, maximum likelihood estimation (MLE), principal com- ponents analysis (PCA), singular value decomposition (SVD), random projection, latent factor analysis, latent semantic indexing, k-means clustering, hierarchical clustering, spectral graph clustering, the kernel trick, learning theory
Contents
1 Introduction 1
2 Linear Classifiers and Perceptrons 7
3 Perceptron Learning; Maximum Margin Classifiers 13
4 Soft-Margin Support Vector Machines; Features 18
5 Machine Learning Abstractions and Numerical Optimization 25
6 Decision Theory; Generative and Discriminative Models 30
7 Gaussian Discriminant Analysis, including QDA and LDA 35
8 Eigenvectors and the Anisotropic Multivariate Normal Distribution 40
9 Anisotropic Gaussians, Maximum Likelihood Estimation, QDA, and LDA 46
10 Regression, including Least-Squares Linear and Logistic Regression 53
11 More Regression; Newton’s Method; ROC Curves 58
12 Statistical Justifications; the Bias-Variance Decomposition 64
13 Shrinkage: Ridge Regression, Subset Selection, and Lasso 70
14 Decision Trees 75
15 More Decision Trees, Ensemble Learning, and Random Forests 80
16 The Kernel Trick 88
17 Neural Networks 93
18 Neurobiology; Variations on Neural Networks 100
19 Better Neural Network Training; Convolutional Neural Networks 107
20 Unsupervised Learning and Principal Components Analysis 115
21 The Singular Value Decomposition; Clustering 124
22 Spectral Graph Clustering 132
23 Multiple Eigenvectors; Random Projection; Applications 140
24 Boosting; Nearest Neighbor Classification 149
25 Nearest Neighbor Algorithms: Voronoi Diagrams and k-d Trees 154
A Bonus Lecture: Learning Theory 159
B Bonus Mini-Lecture: Latent Factor Analysis 165
About this Report
This report compiles my lecture notes for UC Berkeley’s class CS 189/289A, Machine Learning, which is both an undergraduate and introductory graduate course. I hope it will serve as a fast introduction to the subject for readers who are already comfortable with vector calculus, linear algebra, probability, and statistics. Please consult my CS 189/289A web page1 as an addendum to this report; it includes an extended description of each lecture and additional web links and reading assignments related to the lectures. Consider this report and the web page to be living documents; both will be refined a bit every time I teach the class.
The term “lecture notes” has shifted to include long textbook-style treatments written by professors as supplements to their classes. Not so here. This report compiles the actual notes that I lecture from. I call it Concise Machine Learning because I include almost nothing that I do not have time to write or speak during one fourteen-week semester of twice-weekly 80-minute lectures. (After holidays and the midterm exam, that amounts to 25 lectures.) Words that appear [in brackets] are spoken; everything else is written on the “whiteboard”—in my class, a tablet computer. My whiteboard software permits me to incorporate (and write on) figures, included here. However, I am largely anti-Powerpoint and I resort to prepared slides for just three or four brief segments during the semester.
These notes might be ideal for mathematically sophisticated readers who want to learn the basics of machine learning as quickly as possible. But they’re not ideal for everybody. The time limitation necessitates that many details are omitted. I believe that the most mathematically well-prepared readers will be able to fill in those details themselves. But many readers, including most students who take the class, will need additional readings or discussion sections for greater detail. My class web page lists additional readings for most of the lectures, many of them from two textbooks that have been kindly made available for free on the web: An Introduction to Statistical Learning with Applications in R,2 by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, Springer, New York, 2013, ISBN # 978-1-4614-7137-0; and The Elements of Statistical Learning: Data Mining, Inference, and Prediction,3 second edition, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, Springer, New York, 2008. Wikipedia also has excellent introductions to many machine learning algorithms. Readers wanting the verbose kind of “lecture notes” should consider the fine ones written by Stanford University’s Andrew Ng.4 I have no interest in duplicating these efforts; instead, I’m aiming for the neglected niche of “shortest introduction.” (And perhaps also “best stolen illustrations.”)
The other thing that makes this report concise is the choice of topics. CS 189/289A was introduced at UC Berkeley in the spring of 2013 by Prof. Jitendra Malik, and most of his topic choices remain intact here. Jitendra told me that he only taught a machine learning algorithm if he or his collaborators had used it successfully for some application. He said, “the machine learning course is too important to leave to the machine learning experts”—that is, users of machine learning algorithms often have a more clear-eyed view of their usefulness than inventors of machine learning algorithms.
I thank Peter Bartlett, Alyosha Efros, Isabelle Guyon, and Jitendra Malik—the previous teachers of CS 189/289A—for their lectures and lecture notes, from which I learned the topic myself. While I’ve given the lectures my own twist and rearranged the material a lot, I am ultimately making incremental improvements (and perhaps incremental worsenings) to a structure they handed down to me. I also thank Carlos Flores for sharing screenshots from my lectures.
1 http://www.cs.berkeley.edu/∼jrs/189/
2 http://faculty.marshall.usc.edu/gareth-james/ISL/
3 http://statweb.stanford.edu/∼tibs/ElemStatLearn/
4 http://cs229.stanford.edu/notes2020spring/
1 Introduction
CS 189 / 289A [Spring 2020] Machine Learning
Jonathan Shewchuk
http://www.cs.berkeley.edu/∼jrs/189
Questions: Please use Piazza, not email. [Piazza has an option for private questions, but please use public for most questions so other people can benefit.]
For personal matters only, jrs@berkeley.edu
Discussion sections (20, Tue/Wed):
Attend any section. If room is too full, please go to another one.
[We might have a few advanced sections, including research discussion or exam problem preparation.] Sections start Tuesday. [Next week.]
Homework 1 (see Piazza) due next Wednesday. HW party on Monday. [Probably.]
[Enrollment: 744 students max. 413 waitlisted. Expecting many drops. EECS grads have highest priority; undergrads second; non-EECS grads third; no hope for concurrent enrollment students.]
[Textbooks: Available free online. Linked from class web page.]
[Covers of the two textbooks: Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning; James, Witten, Hastie, and Tibshirani, An Introduction to Statistical Learning.]
Prerequisites
Math 53 (vector calculus)
Math 54, Math 110, or EE 16A+16B (linear algebra)
CS 70, EECS 126, or Stat 134 (probability)
NOT CS 188 [Might still be listed as a prerequisite, but we’re having it removed.]
Grading: 189
40% 7 Homeworks. Late policy: 5 slip days total
20% Midterm: Tentatively Monday, March 16, in class (6:30–8 pm)
40% Final Exam: Friday, May 15, 3–6 PM (Exam group 19)
Grading: 289A
40% HW 20% Midterm 20% Final 20% Project
Cheating
– Discussion of HW problems is encouraged.
– All homeworks, including programming, must be written individually.
– We will actively check for plagiarism.
– Typical penalty is a large NEGATIVE score, but I reserve right to give an instant F for even one violation, and will always give an F for two.
[Last time I taught CS 61B, we had to punish roughly 100 people for cheating. It was very painful. Please don’t put me through that again.]
CORE MATERIAL
– Finding patterns in data; using them to make predictions.
– Models and statistics help us understand patterns.
– Optimization algorithms “learn” the patterns.
[The most important part of this is the data. Data drives everything else.
You cannot learn much if you don’t have enough data.
You cannot learn much if your data sucks.
But it’s amazing what you can do if you have lots of good data.
Machine learning has changed a lot in the last two decades because the internet has made truly vast quantities of data available. For instance, with a little patience you can download tens of millions of photographs. Then you can build a 3D model of Paris.
Some techniques that had fallen out of favor, like neural networks, have come back big in recent years because researchers found that they work so much better when you have vast quantities of data.]
CLASSIFICATION
creditcards.pdf (ISL, Figure 4.1) [The problem of classification. We are given data points, each belonging to one of two classes. Then we are given additional points whose class is unknown, and we are asked to predict what class each new point is in. Given the credit card balance and annual income of a cardholder, predict whether they will default on their debt.]
– Collect training data: reliable debtors & defaulted debtors
– Evaluate new applicants (prediction)
[Draw this figure by hand. classify.pdf]
[Draw 2 colors of dots, almost but not quite linearly separable.]
[“How do we classify a new point?” Draw a point in a third color.]
[One possibility: look at its nearest neighbor.]
[Another possibility: draw a linear decision boundary; label it.]
[Those are two different models for the nature of this data.]
[We’ll learn some ways to draw these linear decision boundaries in the next several lectures. But for now, let’s compare these two methods.]
classnear.pdf, classlinear.pdf (ESL, Figures 2.3 & 2.1) [Here are two examples of classifiers for the same data. At left we have a nearest neighbor classifier, which classifies a new point by finding the nearest point in the input data, and assigning it the same class. At right we have a linear classifier, which guesses that everything above the line is brown, and everything below the line is blue.]
classnear.pdf, classnear15.pdf (ESL, Figures 2.3 & 2.2) [At right we have a 15-nearest neighbor classifier. Instead of looking at the nearest neighbor of a new point, it looks at the 15 nearest neighbors and lets them vote for the correct class. The 1-nearest neighbor classifier at left has a big advantage: it classifies all the training data correctly, whereas the 15-nearest neighbor classifier at right misclassifies some of the training points. But the right figure has an advantage too. Somebody please tell me what.]
[The left figure is an example of what’s called overfitting. In the left figure, observe how intricate the decision boundary that separates the positive examples from the negative examples is. It’s a bit too intricate to reflect reality. In the right figure, the decision boundary is smoother. Intuitively, that smoothness is probably more likely to correspond to reality.]
[Slides: classifying digits. The classification pipeline: collect training images, positive and negative; at training time, compute feature vectors for the positive and negative example images and train a classifier; at test time, use the classifier on new images.]
sevensones.pdf [In this simplified digit recognition problem, we are given handwritten 7’s and 1’s, and we are asked to learn to distinguish the 7’s from the 1’s.]
Express these images as vectors
[A 4 × 4 grid of pixel intensities, written out as a vector of 16 numbers.]
Images are points in 16-dimensional space. Linear decision boundary is a hyperplane.
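[A small Python/NumPy sketch, not part of the lecture: flattening a tiny image into a feature vector. The 4 × 4 pixel values and the random weight vector are made up for illustration.]

import numpy as np

# A hypothetical 4 x 4 grayscale image of pixel intensities (values made up).
image = np.array([[3, 3, 3, 3],
                  [3, 0, 0, 2],
                  [3, 0, 0, 1],
                  [3, 3, 3, 3]])
x = image.flatten()                 # a point in 16-dimensional space
print(x.shape)                      # (16,)

# A linear decision boundary in this space is a hyperplane {x : w . x + alpha = 0}.
w = np.random.randn(16)             # made-up weights, just to illustrate the decision rule
alpha = 0.0
print(np.sign(w @ x + alpha))       # which side of the hyperplane x falls on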
Validation
– Train a classifier: it learns to distinguish 7 from not 7
– Test the classifier on NEW images
2 kinds of error:
– Training set error: fraction of training images not classified correctly
[This is zero with the 1-nearest neighbor classifier, but nonzero with the 15-nearest neighbor and
linear classifiers we’ve just seen.]
– Test set error: fraction of misclassified NEW images, not seen during training.
[When I underline a word or phrase, that usually means it’s a definition. If you want to do well in this course, my advice to you is to memorize the definitions I cover in class.]
outliers: points whose labels are atypical (e.g. solvent borrower who defaulted anyway).
overfitting: when the test error deteriorates because the classifier becomes too sensitive to outliers or other spurious patterns.
[In machine learning, the goal is to create a classifier that generalizes to new examples we haven’t seen yet. Overfitting is counterproductive to that goal. So we’re always seeking a compromise: we want decision boundaries that make fine distinctions without being downright superstitious.]
Most ML algorithms have a few hyperparameters that control over/underfitting, e.g. k in k-nearest neighbors.
overfitlabeled.pdf (modified from ESL, Figure 2.4) [Training error and test error as a function of k, the # of nearest neighbors, from 151 down to 1. Training error keeps falling as k shrinks, but test error is lowest near k = 7: larger k underfits, smaller k overfits. The Bayes error and a linear classifier are marked for comparison.]
We select them by validation:
– Hold back a subset of the labeled data, called the validation set.
– Train the classifier multiple times with different hyperparameter settings.
– Choose the settings that work best on validation set.
Now we have 3 sets:
training set used to learn model weights
validation set used to tune hyperparameters, choose among different models
test set used as FINAL evaluation of model. Keep in a vault. Run ONCE, at the very end.
[It’s very bad when researchers in medicine or pharmaceuticals peek into the test set prematurely!]
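[A sketch of hyperparameter selection by validation, not part of the lecture, using scikit-learn’s k-nearest neighbor classifier on synthetic data; the split sizes and the candidate values of k are arbitrary choices.]

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
# Split once into training/validation/test; the test set stays "in a vault" until the end.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_k, best_acc = None, -1.0
for k in [1, 3, 5, 7, 11, 21, 45, 101]:
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = clf.score(X_val, y_val)            # validation accuracy, used only to tune k
    if acc > best_acc:
        best_k, best_acc = k, acc

final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("chosen k:", best_k, "test accuracy:", final.score(X_test, y_test))  # run ONCE, at the end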
Kaggle.com:
– Runs ML competitions, including our HWs
– We use 2 data sets:
“public” set labels available during competition
“private” set revealed only after due date
[If your public results are a lot better than your private results, we will know that you overfitted.]
Techniques [taught in this class, NOT a complete list]
Supervised learning:
– Classification: is this email spam?
– Regression: how likely is it that this patient has cancer?
Unsupervised learning:
– Clustering: which DNA sequences are similar to each other?
– Dimensionality reduction: what are common features of faces? common differences?
2 Linear Classifiers and Perceptrons

CLASSIFIERS
You are given a sample of n observations, each with d features [aka predictors]. Some observations belong to class C; some do not.
Example: Observations are bank loans Features are income & age (d = 2)
Some are in class “defaulted,” some are not
Goal: Predict whether future borrowers will default, based on their income & age.
Represent each observation as a point in d-dimensional space, called a sample point / a feature vector / independent variables.
classify3.pdf [Three plots of C’s and X’s in income–age space, with decision boundaries of increasing complexity; the third is labeled “overfitting.”]
[Draw this by hand; decision boundaries last. ]
[We draw these lines/curves separating C’s from X’s. Then we use these curves to predict which future borrowers will default. In the last example, though, we’re probably overfitting, which could hurt our predic- tions.]
decision boundary: the boundary chosen by our classifier to separate items in the class from those not.
overfitting: When sinuous decision boundary fits sample points so well that it doesn’t classify future points well.
[A reminder that underlined phrases are definitions, worth memorizing.]
Some (not all) classifiers work by computing a
decision function: A function f (x) that maps a sample point x to a scalar such that f(x) > 0 if x ∈ class C;
f(x) ≤ 0 if x ∉ class C.
Aka predictor function or discriminant function.
For these classifiers, the decision boundary is {x ∈ Rd : f (x) = 0} [That is, the set of all points where the decision function is zero.] Usually, this set is a (d − 1)-dimensional surface in Rd.
{x : f (x) = 0} is also called an isosurface of f for the isovalue 0. f has other isosurfaces for other isovalues, e.g., {x : f (x) = 1}.
radiusplot.pdf, radiusiso.pdf [3D plot and isocontour plot of the cone f(x, y) = √(x² + y²) − 3.] [Imagine a decision function in Rd, and imagine its (d − 1)-dimensional isosurfaces.]
radiusiso3d.pdf
[One of these spheres could be the decision boundary.]
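[A tiny sketch, not part of the lecture, that evaluates the decision function above at a couple of made-up points; its isovalue-0 contour, the circle of radius 3, could serve as a decision boundary.]

import numpy as np

def f(x, y):
    # the decision function plotted above: a cone shifted down by 3
    return np.sqrt(x**2 + y**2) - 3.0

for p in [(1.0, 1.0), (3.0, 4.0)]:            # example points, chosen arbitrarily
    print(p, round(f(*p), 3), np.sign(f(*p)))
# (1, 1) lies inside the radius-3 circle, so f < 0; (3, 4) lies outside, so f > 0.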
linear classifier: The decision boundary is a line/plane.
Usually uses a linear decision function. [Sometimes no decision fn.]
Math Review
[I will write vectors in matrix notation.]
Vectors: x = [x1 x2 x3 x4 x5]⊤, a column vector with components x1, . . . , x5.
Think of x as a point in 5-dimensional space.
Conventions (often, but not always):
X    uppercase roman = matrix, random variable, set
x    lowercase roman = vector
α    Greek = scalar
Other scalars:
     n = # of sample points
     d = # of features (per point) = dimension of sample points
     i, j, k = indices
f( ), s( ), . . .    function (often scalar)

inner product (aka dot product): x · y = x1y1 + x2y2 + . . . + xdyd, also written x⊤y
Clearly, f(x) = w · x + α is a linear function in x.

Euclidean norm: ∥x∥ = √(x · x) = √(x1² + x2² + . . . + xd²)
∥x∥ is the length (aka Euclidean length) of a vector x.
Given a vector x, x/∥x∥ is a unit vector (length 1).
“Normalize a vector x”: replace x with x/∥x∥.

Use dot products to compute angles:
cos θ = (x · y) / (∥x∥ ∥y∥) = (x/∥x∥) · (y/∥y∥), where x/∥x∥ and y/∥y∥ both have length 1.
acute: x · y > 0    right: x · y = 0    obtuse: x · y < 0
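[A quick NumPy check, not part of the lecture, of the dot product, norm, normalization, and angle formulas above; the two vectors are made up.]

import numpy as np

x = np.array([3.0, 4.0])                       # example vectors (made up)
y = np.array([1.0, 0.0])

print(x @ y)                                   # inner product x . y = 3
print(np.linalg.norm(x))                       # Euclidean norm ||x|| = 5
x_unit = x / np.linalg.norm(x)                 # "normalize a vector x"
print(np.linalg.norm(x_unit))                  # 1.0
cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(np.degrees(np.arccos(cos_theta)))        # angle between x and y, about 53.13 degrees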
Given a linear decision function f (x) = w · x + α, the decision boundary is H = {x : w · x = −α}.
The set H is called a hyperplane. (A line in 2D, a plane in 3D.)
[A hyperplane is what you get when you generalize the idea of a plane to higher dimensions. The three most important things to understand about a hyperplane is (1) it has dimension d − 1 and it cuts the d-dimensional space into two halves; (2) it’s flat; and (3) it’s infinite.]
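[A small sketch, not part of the lecture: evaluating w · x + α at a few made-up points to see which side of the hyperplane H each one lies on (positive, negative, or exactly zero, i.e., on H).]

import numpy as np

w = np.array([1.0, 2.0])            # a made-up normal vector; d = 2, so H is a line
alpha = -3.0                        # H = {x : w . x = -alpha} = {x : x1 + 2 x2 = 3}

for x in [np.array([3.0, 0.0]), np.array([0.0, 0.0]), np.array([5.0, 5.0])]:
    value = w @ x + alpha
    print(x, value)                 # 0.0: on H; -3.0: one side of H; 12.0: the other side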
Theorem: Let x, y be 2 points that lie on H. Then w · (y − x) = 0.
Proof: w · (y − x) = −α − (−α) = 0. [Therefore, w is orthogonal to any vector that lies on H.]
w is called the normal vector of H,
because (as the theorem shows) w is normal (perpendicular) to H. [I.e., w is perpendicular to every line through any pair of points on H.]
hyperplane.pdf [A hyperplane H with normal vector w; the parallel hyperplanes w · x = 1, w · x = 0, and w · x = −2 are shown.] [Draw black part first, then red parts.]
If w is a unit vector, then w · x + α is the signed distance from x to H.
I.e., positive on w’s side of H; negative on other side.
Moreover, the distance from H to the origin is α. [How do we know that?] Hence α = 0 if and only if H passes through origin.
[w does not have to be a unit vector for the classifier to work.
If w is not a unit vector, w · x + α is the signed distance times some real.
If you want to fix that, you can rescale the equation by computing ∥w∥ and dividing both w and α by ∥w∥.]
The coefficients in w, plus α, are called weights (or parameters or regression coefficients). [That’s why we call the vector w; “w” stands for “weights.”]
The input data is linearly separable if there exists a hyperplane that separates all the sample points in class C from all the points NOT in class C.
[At the beginning of this lecture, I showed you one plot that’s linearly separable and two that are not.]
[We will investigate some linear classifiers that only work for linearly separable data and some that do a decent job with non-separable data. Obviously, if your data are not linearly separable, a linear classifier cannot do a perfect job. But we’re still happy if we can find a classifier that usually predicts correctly.]
A Simple Classifier
Centroid method: compute mean μC of all points in class C and mean μX of all points NOT in C. We use the decision function
    f(x) = (μC − μX) · x − (μC − μX) · (μC + μX)/2
where μC − μX is the normal vector and (μC + μX)/2 is the midpoint between μC and μX,
so the decision boundary is the hyperplane that bisects the line segment w/ endpoints μC, μX.
centroid.pdf [Draw data, then μC, μX, then line & normal.]
[In this example, there’s clearly a better linear classifier that classifies every sample point correctly.
Note that this is hardly the worst example I could have given.
If you’re in the mood for an easy puzzle, pull out a sheet of paper and think of an example, with lots of sample points, where the centroid method misclassifies every sample point but one.]
[Nevertheless, there are circumstances where this method works well, like when all your positive examples come from one Gaussian distribution, and all your negative examples come from another.]
[We can sometimes improve this classifier by adjusting the scalar term to minimize the number of misclas- sified points. Then the hyperplane has the same normal vector, but a different position.]
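[A sketch of the centroid method, not part of the lecture, on made-up Gaussian data; this is the situation mentioned above where the method tends to work well.]

import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2))     # points in class C
X = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(50, 2))   # points NOT in class C

mu_C, mu_X = C.mean(axis=0), X.mean(axis=0)
w = mu_C - mu_X                                   # normal vector of the decision boundary
alpha = -w @ (mu_C + mu_X) / 2.0                  # so the boundary bisects the segment mu_C--mu_X

def f(x):                                         # decision function f(x) = w . x + alpha
    return w @ x + alpha

print(f(np.array([2.0, 2.0])) > 0)    # True: predicted to be in class C
print(f(np.array([-2.0, -2.0])) > 0)  # False: predicted to be not in class C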
Perceptron Algorithm (Frank Rosenblatt, 1957)
Slow, but correct for linearly separable points.
Uses a numerical optimization algorithm, namely, gradient descent.
[Poll:
How many of you know what gradient descent is?
How many of you know what a linear program is?
How many of you know what the simplex algorithm for linear programming is? How many of you know what a quadratic program is?
We’re going to learn what most of these things are. As machine learning people, we will be heavy users of optimization methods. Unfortunately, I won’t have time to teach you algorithms for many optimization problems, but we’ll learn a few. To learn more, take EECS 127.]
Consider n sample points X1, X2, ..., Xn.
[The reason I’m using capital X here is because we typically store these vectors as rows of a matrix X. So
the subscript picks out a row of X, representing a specific sample point.]
For each sample point, the label yi = 1 if Xi ∈ class C, and yi = −1 if Xi ∉ C.
For simplicity, consider only decision boundaries that pass through the origin. (We’ll fix this later.)
Goal: find weights w such that
Xi · w ≥ 0 if yi = 1, and
Xi · w ≤ 0 if yi = −1. [remember, Xi · w is the signed distance]
Equivalently: yi Xi · w ≥ 0. ← inequality called a constraint.
Idea: We define a risk function R that is positive if some constraints are violated. Then we use optimization
to choose w that minimizes R. [That’s how we train a perceptron classifier.] Define the loss function
L(z, yi) =  0       if yi z ≥ 0, and
            −yi z   otherwise.
[Here, z is the classifier’s prediction, and yi is the correct answer.]
If z has the same sign as yi, the loss function is zero (happiness).
But if z has the wrong sign, the loss function is positive.
[For each sample point, you want to get the loss function down to zero, or as close to zero as possible. It’s called the “loss function” because the bigger it is, the bigger a loser you are.]
Define risk function (aka objective function or cost function)
    R(w) = ∑_{i=1}^{n} L(Xi · w, yi)
         = ∑_{i∈V} −yi Xi · w,    where V is the set of indices i for which yi Xi · w < 0.
If w classifies all X1, . . . , Xn correctly, then R(w) = 0. Otherwise, R(w) is positive, and we want to find a better w.
Goal: Solve this optimization problem:
    Find w that minimizes R(w).
riskplot.pdf [Plot of risk R(w). Every point in the dark green flat spot is a minimum. We’ll look at this more next lecture.]
3 Perceptron Learning; Maximum Margin Classifiers
Perceptron Algorithm (cont’d)
Recall:
– linear decision fn f(x) = w · x    (for simplicity, no α)
– decision boundary {x : f(x) = 0}    (a hyperplane through the origin)
– sample points X1, X2, . . . , Xn ∈ Rd; class labels y1, . . . , yn = ±1
– goal: find weights w such that yi Xi · w ≥ 0
– goal, revised: find w that minimizes R(w) = ∑_{i∈V} −yi Xi · w    [risk function]
  where V is the set of indices i for which yi Xi · w < 0.
[Our original problem was to find a separating hyperplane in one space, which I’ll call x-space. But we’ve transformed this into a problem of finding an optimal point in a different space, which I’ll call w-space. It’s important to understand transformations like this, where a geometric structure in one space becomes a point in another space.]
Objects in x-space transform to objects in w-space:
    x-space                          w-space
    hyperplane: {z : w · z = 0}      point: w
    point: x                         hyperplane: {z : x · z = 0}
Point x lies on hyperplane {z : w · z = 0} ⇔ w · x = 0 ⇔ point w lies on hyperplane {z : x · z = 0} in w-space.
[So a hyperplane transforms to its normal vector. And a sample point transforms to the hyperplane whose normal vector is the sample point.]
[In this algorithm, the transformations happen to be symmetric: a hyperplane in x-space transforms to a point in w-space the same way that a hyperplane in w-space transforms to a point in x-space. That won’t always be true for the decision boundaries we use this semester.]
If we want to enforce inequality x · w ≥ 0, that means
– in x-space, x should be on the same side of {z : w · z = 0} as w
– in w-space, w ″ ″ ″ ″ ″ ″ ″ {z : x · z = 0} as x
xwspace.pdf [Draw this by hand: the x-space sample points (two X’s and a C) and the corresponding lines in w-space.]
[Observe that the x-space sample points are the normal vectors for the w-space lines. We can choose w to be anywhere in the shaded region.]
[For a sample point x in class C, w and x must be on the same side of the hyperplane that x transforms into. For a point x not in class C (marked by an X), w and x must be on opposite sides of the hyperplane that x transforms into. These rules determine the shaded region above, in which w must lie.]
[Again, what have we accomplished? We have switched from the problem of finding a hyperplane in x-space to the problem of finding a point in w-space. That’s a better fit to how we think about optimization algorithms.]
[Let’s take a look at the risk function these three sample points create.]
riskplot.pdf, riskiso.pdf [Plot & isocontours of risk R(w). Note how R’s creases match the w-space drawn above.]
[In this plot, we can choose w to be any point in the bottom pizza slice; all those points minimize R.] [We have an optimization problem; we need an optimization algorithm to solve it.]
An optimization algorithm: gradient descent on R.
[Draw the typical steps of gradient descent on the plot of R.]
Given a starting point w, find gradient of R with respect to w; this is the direction of steepest ascent. Take a step in the opposite direction. Recall [from your vector calculus class]
∇R(w) = [∂R/∂w1  ∂R/∂w2  · · ·  ∂R/∂wd]⊤    and    ∇(z · w) = [z1  z2  · · ·  zd]⊤ = z,
so
    ∇R(w) = ∑_{i∈V} ∇(−yi Xi · w) = −∑_{i∈V} yi Xi.
At any point w, we walk downhill in direction of steepest descent, −∇R(w).
w ← arbitrary nonzero starting point (good choice is any yi Xi)
while R(w) > 0
    V ← set of indices i for which yi Xi · w < 0
    w ← w + ε ∑_{i∈V} yi Xi
return w
ε > 0 is the step size aka learning rate, chosen empirically. [Best choice depends on input problem!]
Problem: Slow! Each step takes O(nd) time. [Can we improve this?]
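[A Python sketch of the gradient descent pseudocode above, not part of the lecture; the toy data, step size, and iteration cap are arbitrary choices.]

import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([3, 3], 1, (20, 2)), rng.normal([-3, -3], 1, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])    # labels +1 / -1
eps = 0.01                                    # step size (learning rate), chosen arbitrarily

w = y[0] * X[0]                               # starting point: some y_i X_i
for _ in range(10000):                        # cap the number of steps
    V = y * (X @ w) < 0                       # mask of violated constraints y_i X_i . w >= 0
    if not V.any():                           # R(w) = 0: every point classified correctly
        break
    w = w + eps * (y[V] @ X[V])               # w <- w + eps * sum_{i in V} y_i X_i
print("final w:", w, "violations:", int((y * (X @ w) < 0).sum()))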
Optimization algorithm 2: stochastic gradient descent
Idea: each step, pick one misclassified Xi;
do gradient descent on loss fn L(Xi · w, yi).
Called the perceptron algorithm. Each step takes O(d) time. [Not counting the time to search for a misclassified Xi.]
while some yi Xi · w < 0
    w ← w + ε yi Xi
return w
[Stochastic gradient descent is quite popular and we’ll see it several times more this semester, especially for neural nets. However, stochastic gradient descent does not work for every problem that gradient descent works for. The perceptron risk function happens to have special properties that guarantee that stochastic gradient descent will always succeed.]
What if separating hyperplane doesn’t pass through origin? Add a fictitious dimension. Decision fn is
    f(x) = w · x + α = [w1  w2  α] · [x1  x2  1]⊤
Now we have sample points in Rd+1, all lying on hyperplane xd+1 = 1.
Run perceptron algorithm in (d + 1)-dimensional space. [We are simulating a general hyperplane in
d dimensions by using a hyperplane through the origin in d + 1 dimensions.]
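[A sketch of the perceptron algorithm with a fictitious dimension, not part of the lecture. The training points are generated from a made-up hyperplane with a gap around it, so they are guaranteed linearly separable and the loop terminates.]

import numpy as np

rng = np.random.default_rng(2)
pts = rng.uniform(-3, 3, (60, 2))
true_w = np.array([1.0, -2.0])                    # a made-up "true" separator with offset 1
y = np.where(pts @ true_w + 1.0 > 0, 1.0, -1.0)
keep = np.abs(pts @ true_w + 1.0) > 0.3           # keep only points away from the separator
X, y = pts[keep], y[keep]
X1 = np.hstack([X, np.ones((len(X), 1))])         # fictitious dimension: append x_{d+1} = 1

eps = 1.0                                         # step size, chosen arbitrarily
w = y[0] * X1[0]                                  # start at some y_i X_i
while True:
    wrong = np.where(y * (X1 @ w) < 0)[0]         # misclassified points
    if len(wrong) == 0:
        break
    i = wrong[0]                                  # pick one misclassified X_i
    w = w + eps * y[i] * X1[i]
print("w =", w[:2], "alpha =", w[2])              # the last weight plays the role of alpha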
[The perceptron algorithm was invented in 1957 by Frank Rosenblatt at the Cornell Aeronautical Laboratory. It was originally designed not to be a program, but to be implemented in hardware for image recognition on a 20 × 20 pixel image. Rosenblatt built a Mark I Perceptron Machine that ran the algorithm, complete with electric motors to do weight updates.]
Mark I perceptron.jpg (from Wikipedia, “Perceptron”) [The Mark I Perceptron Machine. This is what it took to process a 20 × 20 image in 1957.]
[Then he held a press conference where he predicted that perceptrons would be “the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.” We’re still waiting on that.]
[One interesting aspect of the perceptron algorithm is that it’s an “online algorithm,” which means that if new data points come in while the algorithm is already running, you can just throw them into the mix and keep looping.]
[Perceptron Convergence Theorem: If data is linearly separable, perceptron algorithm will find a linear classifier that classifies all data correctly in at most O(r2/γ2) iterations, where r = max ∥Xi∥ is “radius of data” and γ is the “maximum margin.”]
[I’ll define “maximum margin” shortly.]
[We’re not going to prove this, because perceptrons are obsolete.]
[Although the step size/learning rate doesn’t appear in that big-O expression, it does have an effect on the running time, but the effect is hard to characterize. The algorithm gets slower if ε is too small because it has to take lots of steps to get down the hill. But it also gets slower if ε is too big for a different reason: it jumps right over the region with zero risk and oscillates back and forth for a long time.]
[Although stochastic gradient descent is faster for this problem than gradient descent, the perceptron algo- rithm is still slow. There’s no reliable way to choose a good step size ε. Fortunately, optimization algorithms have improved a lot since 1957. You can get rid of the step size by using any decent modern “line search” al- gorithm. Better yet, you can find a better decision boundary much more quickly by quadratic programming, which is what we’ll talk about next.]
MAXIMUM MARGIN CLASSIFIERS
The margin of a linear classifier is the distance from the decision boundary to the nearest sample point. What if we make the margin as wide as possible?
maxmargin.pdf [Draw this by hand: two classes of sample points (C’s and X’s), the decision boundary w · x + α = 0, and the margin lines w · x + α = 1 and w · x + α = −1.]
We enforce the constraints
    yi(w · Xi + α) ≥ 1    for i ∈ [1, n]
[Notice that the right-hand side is a 1, rather than a 0 as it was for the perceptron algorithm. It’s not obvious, but this is a better way to formulate the problem, partly because it makes it impossible for the weight vector w to get set to zero.]
Recall: if ∥w∥ = 1, signed distance from hyperplane to Xi is w · Xi + α.
Otherwise, it’s (w · Xi + α)/∥w∥. [We’ve normalized the expression to get a unit weight vector.]
Hence the margin is min_i (1/∥w∥) |w · Xi + α| ≥ 1/∥w∥. [We get the inequality by substituting the constraints.]
There is a slab of width 2/∥w∥ containing no sample points [with the hyperplane running along its middle].
To maximize the margin, minimize ∥w∥. Optimization problem:
    Find w and α that minimize ∥w∥²
    subject to yi(Xi · w + α) ≥ 1    for all i ∈ [1, n]
Called a quadratic program in d + 1 dimensions and n constraints.
It has one unique solution! [If the points are linearly separable; otherwise, it has no solution.]
[A reason we use ∥w∥² as an objective function, instead of ∥w∥, is that the length function ∥w∥ is not smooth at w = 0, whereas ∥w∥² is smooth everywhere. This makes optimization easier.]
The solution gives us a maximum margin classifier, aka a hard-margin support vector machine (SVM).
[Technically, this isn’t really a support vector machine yet; it doesn’t fully deserve that name until we add features and kernels, which we’ll study in later lectures.]
[Let’s see what these constraints look like in weight space.]
weight3d.pdf, weightcross.pdf [This is an example of what the linear constraints look like in the 3D weight space (w1, w2, α) for the SVM we’ve been studying with three training points. The SVM is looking for the point nearest the origin that lies above the blue plane (representing an in-class training point) but below the red and pink planes (representing out-of-class training points). In this example, that optimal point lies where the three planes intersect. At right we see a 2D cross section w1 = 1/17 of the 3D space, because the optimal solution lies in this cross section. The constraints say that the solution must lie in the leftmost pizza slice, while being as close to the origin as possible, so the optimal solution is where the three lines meet.]
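[A sketch of the hard-margin quadratic program, not part of the lecture, using the cvxpy library (an arbitrary choice of QP solver); the four training points are made up and linearly separable.]

import numpy as np
import cvxpy as cp

# minimize ||w||^2 subject to y_i (X_i . w + alpha) >= 1 for all i
X = np.array([[3.0, 3.0], [4.0, 2.0], [0.0, 0.0], [-1.0, 1.0]])   # made-up separable data
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
alpha = cp.Variable()
constraints = [cp.multiply(y, X @ w + alpha) >= 1]
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
prob.solve()

print("w =", w.value, "alpha =", alpha.value)
print("margin 1/||w|| =", 1.0 / np.linalg.norm(w.value))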
4 Soft-Margin Support Vector Machines; Features
SOFT-MARGIN SUPPORT VECTOR MACHINES (SVMs)
Solves 2 problems:
– Hard-margin SVMs fail if data not linearly separable.
– ″ ″ ″ sensitive to outliers.
sensitive.pdf (ISL, Figure 9.5) [Example where one outlier moves the hard-margin SVM decision boundary a lot.]
Idea: Allow some points to violate the margin, with slack variables.
Modified constraint for point i:
    yi(Xi · w + α) ≥ 1 − ξi
and ξi ≥ 0
[Observe that the only difference between these constraints and the hard-margin constraints we saw last lecture is the extra slack term ξi.]
[We also impose new constraints, that the slack variables are never negative.]
[This inequality ensures that all sample points that don’t violate the margin are treated the same; they all have ξi = 0. Point i has nonzero ξi if and only if it violates the margin.]
slacker+.pdf [A margin where some points have slack; points that violate the margin are marked with their slack distances ξi/∥w∥, and the margin is 1/∥w∥.]
[For soft-margin SVMs, we redefine the word “margin.” The margin is no longer the distance from the decision boundary to the nearest sample point. Instead, we define the margin to be 1/∥w∥.]
To prevent abuse of slack, we add a loss term to the objective fn.
Optimization problem:
    Find w, α, and ξ that minimize ∥w∥² + C Σ_{i=1}^n ξi
    subject to yi(Xi · w + α) ≥ 1 − ξi   for all i ∈ [1, n]
               ξi ≥ 0                    for all i ∈ [1, n]
. . . a quadratic program in d + n + 1 dimensions and 2n constraints.
[It's a quadratic program because its objective function is quadratic and its constraints are linear inequalities.]
C > 0 is a scalar regularization hyperparameter that trades off:
              small C                                           big C
  desire      maximize margin 1/∥w∥                             keep most slack variables zero or small
  danger      underfitting (misclassifies much training data)   overfitting (awesome training, awful test)
  outliers    less sensitive                                    very sensitive
  boundary    more "flat"                                       more sinuous
[The last row only applies to nonlinear decision boundaries, which we'll discuss next. Obviously, a linear decision boundary can't be "sinuous."]
Use validation to choose C.
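[A quick way to see this tradeoff is to fit soft-margin SVMs with several values of C and compare them on held-out data. A minimal sketch, assuming scikit-learn and synthetic overlapping data made up for illustration:]

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs, so some slack is unavoidable.
X = np.vstack([rng.normal(loc=[-1, -1], scale=1.2, size=(100, 2)),
               rng.normal(loc=[+1, +1], scale=1.2, size=(100, 2))])
y = np.array([-1] * 100 + [+1] * 100)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

for C in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(kernel='linear', C=C).fit(X_train, y_train)
    w = clf.coef_[0]
    print(f"C={C:6}:  margin 1/||w|| = {1 / np.linalg.norm(w):.3f},  "
          f"validation accuracy = {clf.score(X_val, y_val):.3f}")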
svmC.pdf (ISL, Figure 9.7)
[Examples of how the slab varies with C. Smallest C at upper left; largest C at lower right.]
[One way to think about slack is to pretend that slack is money we can spend to buy permission for a sample point to violate the margin. The further a point penetrates the margin, the bigger the fine you have to pay. We want to make the margin as wide as possible, but we also want to spend as little money as possible. If the regularization parameter C is small, it means we're willing to spend lots of money on violations so we can get a wider margin. If C is big, it means we're cheap and we won't pay much for violations, even though we'll suffer a narrower margin. If C is infinite, we're back to a hard-margin SVM.]
FEATURES
Q: How to do nonlinear decision boundaries?
A: Make nonlinear features that lift points into a higher-dimensional space.
High-d linear classifier → low-d nonlinear classifier.
[Features work with all classifiers—not only linear classifiers like perceptrons and SVMs, but also classifiers
that are not linear.]
Example 1: The parabolic lifting map
Φ:Rd →Rd+1
Φ(x) = [x  ∥x∥²]⊤   ← lifts x onto paraboloid xd+1 = ∥x∥²
[We’ve added one new feature, ∥x∥2. Even though the new feature is just a function of other input features,
it gives our linear classifier more power.] Find a linear classifier in Φ-space.
It induces a sphere classifier in x-space.
[Draw this by hand.]
Theorem: Φ(X1), . . ., Φ(Xn) are linearly separable iff X1, . . ., Xn are separable by a hypersphere.
(Possibly an ∞-radius hypersphere = hyperplane.)
Proof: Consider hypersphere in Rd w/center c & radius ρ. Points inside:
∥x − c∥2 < ρ2
∥x∥² − 2c · x + ∥c∥² < ρ²
[−2c⊤ 1] Φ(x) < ρ² − ∥c∥²
where [−2c⊤ 1] is the normal vector of a hyperplane in Rd+1 and Φ(x) = [x  ∥x∥²]⊤.
Hence points inside sphere ↔ same side of hyperplane in Φ-space.
[The implication works in both directions.]
[Hyperspheres include hyperplanes as a special, degenerate case. A hyperplane is essentially a hypersphere with infinite radius. So hypersphere decision boundaries can do everything hyperplane decision boundaries can do, plus a lot more. With the parabolic lifting map, if you pick a hyperplane in Φ-space that is vertical, you get a hyperplane in x-space.]
circledec.pdf
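[A small sketch of the lifting map in action: append ∥x∥² to each point, fit a linear classifier in Φ-space, and read off the circle it induces in x-space. The ring-shaped data and the use of scikit-learn's LinearSVC are illustrative assumptions, not part of the lecture.]

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
# Class C: points near the origin; class X: points in a ring around them (not linearly separable in R^2).
radii = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
angles = rng.uniform(0, 2 * np.pi, 200)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.array([1] * 100 + [-1] * 100)

def lift(X):
    # Phi(x) = [x1, x2, ||x||^2]: the parabolic lifting map
    return np.column_stack([X, (X ** 2).sum(axis=1)])

clf = LinearSVC(C=100.0, max_iter=10000).fit(lift(X), y)
print("training accuracy in Phi-space:", clf.score(lift(X), y))   # ~1.0

# The linear decision fn w . Phi(x) + alpha = 0 is a circle in x-space (assuming w3 != 0):
w1, w2, w3 = clf.coef_[0]
alpha = clf.intercept_[0]
center = -0.5 * np.array([w1, w2]) / w3
radius = np.sqrt((center ** 2).sum() - alpha / w3)
print("induced circular boundary: center", center, "radius", radius)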
Example 2: Axis-aligned ellipsoid/hyperboloid decision boundaries
[Draw examples of axis-aligned ellipse & hyperbola.] In 3D, these have the formula
Ax1² + Bx2² + Cx3² + Dx1 + Ex2 + Fx3 + α = 0   [Here, the capital letters are scalars, not matrices.]
Φ:Rd →R2d
Φ(x) = [x1² ... xd² x1 ... xd]⊤
Hyperplane is [A, B, C, D, E, F] · Φ(x) + α = 0   [the vector [A, B, C, D, E, F] is w]
[We’ve turned d input features into 2d features for our linear classifier. If the points are separable by an axis-aligned ellipsoid or hyperboloid, per the formula above, then the points lifted to Φ-space are separable by a hyperplane whose normal vector is (A, B, C, D, E, F).]
Example 3: Ellipsoid/hyperboloid
[Draw example of non-axis-aligned ellipse.]
3D formula: [for a general ellipsoid or hyperboloid]
Ax1² + Bx2² + Cx3² + Dx1x2 + Ex2x3 + Fx3x1 + Gx1 + Hx2 + Ix3 + α = 0
Φ : Rd → R^((d²+3d)/2)
[Now, our decision function can be any degree-2 polynomial.]
Isosurface defined by this equation is called a quadric. [In the special case of two dimensions, it’s also known as a conic section. So our decision boundary can be an arbitrary conic section.]
[You’ll notice that there is a quadratic blowup in the number of features, because every pair of input features creates a new feature in Φ-space. If the dimension is large, these feature vectors are getting huge, and that’s going to impose a serious computational cost. But it might be worth it to find good classifiers for data that aren’t linearly separable.]
Example 4: Decision fn is degree-p polynomial
E.g., a cubic in R²:
Φ(x) = [x1³ x1²x2 x1x2² x2³ x1² x1x2 x2² x1 x2]⊤
Φ : Rd → R^O(d^p)
[Now we’re really blowing up the number of features! If you have, say, 100 features per sample point and you want to use degree-4 decision functions, then each lifted feature vector has a length of roughly 4 million, and your learning algorithm will take approximately forever to run.]
[However, later in the semester we will learn an extremely clever trick that allows us to work with these huge feature vectors very quickly, without ever computing them. It’s called “kernelization” or “the kernel trick.” So even though it appears now that working with degree-4 polynomials is computationally infeasible, it can actually be done quickly.]
degree5.pdf
[Hard-margin SVMs with degree 1/2/5 decision functions. Observe that the margin tends to get wider as the degree increases.]
[Increasing the degree accomplishes two things.
– First, the data might become linearly separable when you lift them to a high enough degree, even if the original data are not linearly separable.
– Second, raising the degree can widen the margin, so you might get a more robust decision boundary that generalizes better to test data.
However, if you raise the degree too high, you will overfit the data and then generalization will get worse.]
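[A sketch of degree-p lifting done two ways: explicitly, by building all monomials up to degree p and training a linear SVM on them, and implicitly, with a polynomial kernel that induces a similar (not identical) feature space. The data are synthetic and PolynomialFeatures/SVC are scikit-learn conveniences, not part of the lecture.]

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC, SVC

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0] ** 2 - X[:, 1] > 0, 1, -1)   # a quadratic (parabolic) true boundary

for p in [1, 2, 5]:
    Phi = PolynomialFeatures(degree=p, include_bias=False)   # all monomials up to degree p
    lifted = Phi.fit_transform(X)
    explicit = LinearSVC(C=10.0, max_iter=20000).fit(lifted, y)
    implicit = SVC(kernel='poly', degree=p, coef0=1.0, C=10.0).fit(X, y)  # kernel-trick version
    print(f"degree {p}: {lifted.shape[1]} lifted features, "
          f"explicit acc {explicit.score(lifted, y):.2f}, "
          f"kernel acc {implicit.score(X, y):.2f}")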
overfit.pdf
[Training vs. test error for degree 1/2/5 decision functions. (Artist's conception; these aren't actual calculations, just hand-drawn guesses. What happens to this picture as sample size grows? Please send me email if you know where to find figures like this with actual data.) In this example, a degree-2 decision gives the smallest test error.]
[You should search for the ideal degree—not too small, not too big. It’s a balancing act between underfitting
and overfitting. The degree is an example of a hyperparameter that can be optimized by validation.]
[If you’re using both polynomial features and a soft-margin SVM, now you have two hyperparameters: the degree and the regularization hyperparameter C. Generally, the optimal C will be different for every polynomial degree, so when you change the degree, you have to run validation again to find the best C for that degree.]
[So far I’ve talked only about polynomial features. But features can get much more complicated than polynomials, and they can be tailored to fit a specific problem. Let’s consider a type of feature you might use if you wanted to implement, say, a handwriting recognition algorithm.]
Example 5: Edge detection
Edge detector: algorithm for approximating grayscale/color gradients in image, e.g., – tap filter
– Sobel filter
– oriented Gaussian derivative filter
[Images are discrete, not continuous fields, so approximation of gradients is necessary.]
[See “Image Derivatives” on Wikipedia.]
Collect line orientations in local histograms (each having 12 orientation bins per region); use histograms as features (instead of raw pixels).
[Image histograms.]
Paper: Maji & Malik, 2009.
[If you want to, optionally, use these features in future homeworks and try to win the Kaggle competition, this paper is a good online resource.]
[When they use a linear SVM on the raw pixels, Maji & Malik get an error rate of 15.38% on the test set. When they use a linear SVM on the histogram features, the error rate goes down to 2.64%.]
[Many applications can be improved by designing application-specific features. There’s no limit but your own creativity and ability to discern the structure hidden in your application.]
orientgrad.png
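[A rough sketch of the idea: approximate the image gradient with Sobel filters, then histogram the gradient orientations in each cell of a grid and concatenate the histograms into one feature vector. This is a simplification of Maji & Malik's features; the 12-bin choice follows the lecture, while the grid size, magnitude weighting, and use of scipy's Sobel filter are illustrative assumptions.]

import numpy as np
from scipy.ndimage import sobel

def orientation_histogram_features(image, grid=(4, 4), bins=12):
    """image: 2D array of grayscale values. Returns a concatenated orientation-histogram feature vector."""
    gx = sobel(image, axis=1)            # approximate horizontal derivative
    gy = sobel(image, axis=0)            # approximate vertical derivative
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx)     # in (-pi, pi]

    features = []
    rows = np.array_split(np.arange(image.shape[0]), grid[0])
    cols = np.array_split(np.arange(image.shape[1]), grid[1])
    for r in rows:
        for c in cols:
            cell_or = orientation[np.ix_(r, c)].ravel()
            cell_mag = magnitude[np.ix_(r, c)].ravel()
            # weight each orientation by its gradient magnitude
            hist, _ = np.histogram(cell_or, bins=bins, range=(-np.pi, np.pi), weights=cell_mag)
            features.append(hist)
    return np.concatenate(features)

# e.g., a random 28x28 "image": 4x4 cells x 12 bins = 192 features
print(orientation_histogram_features(np.random.rand(28, 28)).shape)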
5 Machine Learning Abstractions and Numerical Optimization
ML ABSTRACTIONS [some meta comments on machine learning]
[When you write a large computer program, you break it down into subroutines and modules. Many of you know from experience that you need to have the discipline to impose strong abstraction barriers between different modules, or your program will become so complex you can no longer manage nor maintain it.]
[When you learn a new subject, it helps to have mental abstraction barriers, too, so you know when you can replace one approach with a different approach. I want to give you four levels of abstraction that can help you think about machine learning. It’s important to make mental distinctions between these four things, and the code you write should have modules that reflect these distinctions as well.]
APPLICATION/DATA
data labeled (classified) or not?
yes: labels categorical (classification) or quantitative (regression)?
no: similarity (clustering) or positioning (dimensionality reduction)?
MODEL [what kinds of hypotheses are permitted?]
e.g.:
– decision fns: linear, polynomial, logistic, neural net, . . .
– nearest neighbors, decision trees
– features
– low vs. high capacity (affects overfitting, underfitting, inference)
OPTIMIZATION PROBLEM
– variables, objective fn, constraints
e.g., unconstrained, convex program, least squares, PCA
OPTIMIZATION ALGORITHM e.g., gradient descent, simplex, SVD
[In this course, we focus primarily on the middle two levels. As a data scientist, you might be given an application, and your challenge is to turn it into an optimization problem that we know how to solve. We will talk a bit about optimization algorithms, but usually you’ll use an optimization code that’s faster and more robust than what you would write yourself.]
[The second level, the model, has a huge effect on the success of your learning algorithm. Sometimes you get a big improvement by tailoring the model or its features to fit the structure of your specific data. The model also has a big effect on whether you overfit or underfit. And if you want a model that you can interpret so you can do inference, the model has to have a simple structure. Lastly, you have to pick a model that leads to an optimization problem that can be solved. Some optimization problems are just too hard.]
[It’s important to understand that when you change something in one level of this diagram, you probably have to change all the levels underneath it. If you switch your model from a linear classifier to a neural net, your optimization problem changes, and your optimization algorithm changes too.]
26 Jonathan Richard Shewchuk
[Not all machine learning methods fit this four-level decomposition. Nevertheless, for everything you learn in this class, think about where it fits in this hierarchy. If you don’t distinguish which math is part of the model and which math is part of the optimization algorithm, this course will be very confusing for you.]
OPTIMIZATION PROBLEMS
[I want to familiarize you with some types of optimization problems that can be solved reliably and effi- ciently, and the names of some of the optimization algorithms used to solve them. An important skill for you to develop is to be able to go from an application to a well-defined optimization problem. That skill depends on your ability to recognize well-studied types of optimization problems.]
Unconstrained
Goal: Find w that minimizes (or maximizes) a continuous objective fn f (w). f is smooth if its gradient is continuous too.
A global minimum of f is a value w such that f(w) ≤ f(v) for every v.
A local minimum of f is a value w such that f(w) ≤ f(v) for every v in a tiny ball centered at w.
[In other words, you cannot walk downhill from w.]
Usually, finding a local minimum is easy;
finding the global minimum is hard. [or impossible]
Exception: A function is convex if for every x, y ∈ Rd ,
the line segment connecting (x, f (x)) to (y, f (y)) does not go below f (·).
E.g., perceptron risk fn is convex and nonsmooth.
[Draw this by hand. ]
minima.pdf
[Draw this by hand. ]
convex.pdf
[When you sum together convex functions, you always get a convex function. The perceptron risk function
is a sum of convex loss functions.]
A [continuous] convex function [on a closed, convex domain] has either
– no minimum (goes to −∞), or
– just one local minimum, or
– a connected set of local minima that are all global minima with equal f .
[The perceptron risk function has the last of these three.]
[In the last two cases, if you walk downhill, you eventually reach a global minimum.]
[However, there are many applications where you don’t have a convex objective function, and your machine learning algorithm has to settle for finding a local minimum. For example, neural nets try to optimize an objective function that has lots of local minima; they rarely find a global minimum.]
Algs for smooth f :
– Gradient descent:
  – blind [with learning rate]; repeat: w ← w − ε ∇f(w)
  – with line search:
    – secant method
    – Newton–Raphson (may need Hessian matrix of f)
  – stochastic (blind)   [trains on one point per iteration, or a small batch]
– Newton's method (needs Hessian matrix)
– Nonlinear conjugate gradient   [uses the secant or Newton–Raphson line search methods]
Algs for nonsmooth f :
– Gradient descent:
  – blind
  – with direct line search (e.g., golden section search)
– BFGS [Broyden–Fletcher–Goldfarb–Shanno]
These algs find a local minimum. [They don’t reliably find a global minimum, because that’s very hard.] [If you’re optimizing over a d-dimensional space, the Hessian matrix is a d × d matrix and it’s usually dense,
so most methods that use the Hessian are computationally infeasible when d is large.]
line search: finds a local minimum along the search direction by solving an optimization problem in 1D.
[. . . instead of using a blind step size like the perceptron algorithm does. Solving a 1D problem is much easier than solving a higher-dimensional one.]
[Neural nets are unconstrained optimization problems with many, many local minima. They sometimes benefit from line searches or second-order optimization algorithms, but when the input data set is very large, researchers often favor the dumb, blind, stochastic versions of gradient descent.]
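[A minimal sketch of blind gradient descent with a fixed learning rate ε, applied to a smooth convex function; the particular function and step size are made up for illustration.]

import numpy as np

def grad_descent(grad_f, w0, epsilon=0.1, iterations=100):
    """Blind gradient descent: repeat  w <- w - epsilon * grad f(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(iterations):
        w = w - epsilon * grad_f(w)
    return w

# Example: f(w) = ||w - [3, -1]||^2 is smooth and convex; its gradient is 2 (w - [3, -1]),
# so the unique (global) minimum is at [3, -1].
grad_f = lambda w: 2.0 * (w - np.array([3.0, -1.0]))
print(grad_descent(grad_f, w0=[0.0, 0.0]))   # converges near [3, -1]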
Constrained Optimization (smooth equality constraints)
Goal: Find w that minimizes (or maximizes) f(w)
      subject to g(w) = 0   [← observe that this is an isosurface]
where g is a smooth fn      [g may be a vector, encoding multiple constraints]
Alg: Use Lagrange multipliers.   [to transform constrained to unconstrained optimization]
Linear Program
Linear objective fn + linear inequality constraints.
Goal:
Find w that maximizes (or minimizes) c · w subject to Aw ≤ b
where A is n × d matrix, b ∈ Rn, expressing n linear constraints:
Ai · w ≤ bi,
i ∈ [1, n]
[Draw this by hand: in w-space, the feasible region (shaded), the objective direction c, the optimum, and the active constraints.]
linprog.pdf
The set of points w that satisfy all constraints is a convex polytope called the feasible region F [shaded]. The optimum is the point in F that is furthest in the direction c. [What does convex mean?]
A point set P is convex if for every p, q ∈ P, the line segment with endpoints p, q lies entirely in P.
[What is a polytope? Just a polyhedron, generalized to higher dimensions.]
The optimum achieves equality for some constraints (but not most), called the active constraints of the optimum. [In the figure above, there are two active constraints. In an SVM, active constraints correspond to the sample points that touch or violate the slab, and they’re also known as support vectors.]
[Sometimes, there is more than one optimal point. For example, in the figure above, if c pointed straight up, every point on the top horizontal edge would be optimal. The set of optimal points is always convex.]
Example: EVERY feasible point (w, α) gives a linear classifier:
    Find w, α that maximizes 0
    subject to yi(w · Xi + α) ≥ 1   for all i ∈ [1, n]
IMPORTANT: The data are linearly separable iff the feasible region is not the empty set.
→ Also true for maximum margin classifier (quadratic program)
Algs for linear programming:
– Simplex (George Dantzig, 1947)
[Indisputably one of the most important and useful algorithms of the 20th century.]
[Walks along edges of polytope from vertex to vertex until it finds optimum.]
– Interior point methods
[Linear programming is very different from unconstrained optimization; it has a much more combinatorial flavor. If you knew which constraints would be the active constraints once you found the solution, it would be easy; the hard part is figuring out which constraints should be the active ones. There are exponentially many possibilities, so you can’t afford to try them all. So linear programming algorithms tend to have a very discrete, computer science feeling to them, like graph algorithms, whereas unconstrained optimization algorithms tend to have a continuous, numerical mathematics feeling.]
[Linear programs crop up everywhere in engineering and science, but they're usually in disguise. An ex-
tremely useful talent you should develop is to recognize when a problem is a linear program.]
[A linear program solver can find a linear classifier, but it can’t find the maximum margin classifier. We need something more powerful.]
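[A sketch of that feasibility LP: a zero objective with the constraints yi(w · Xi + α) ≥ 1, handed to scipy's linprog (which wants constraints in the form A_ub v ≤ b_ub). The solver succeeds iff the feasible region is nonempty, i.e., iff the data are linearly separable. The toy data are made up.]

import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """X: n x d sample points, y: labels in {-1, +1}.
    Feasibility LP: find (w, alpha) with y_i (X_i . w + alpha) >= 1 for all i."""
    n, d = X.shape
    # Rewrite as  -y_i (X_i . w + alpha) <= -1, i.e.  A_ub @ [w, alpha] <= b_ub.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    c = np.zeros(d + 1)                 # maximize 0: any feasible point will do
    result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + 1))
    return result.success, result.x

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 2.0]])
y = np.array([-1, -1, 1, 1])
ok, w_alpha = linearly_separable(X, y)
print("separable:", ok, " (w, alpha) =", w_alpha)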
Quadratic Program
Quadratic, convex objective fn + linear inequality constraints.
Goal: Find w that minimizes f(w) = w⊤Qw + c⊤w
subject to Aw ≤ b
where Q is a symmetric, positive definite matrix.
A matrix is positive definite if w⊤Qw > 0 for all w ≠ 0.
Only one local minimum! [Which is therefore the global minimum.]
[What if Q is not positive definite? If Q is indefinite, then f is not convex, the minimum is not always unique, and quadratic programming is NP-hard. If Q is positive semidefinite, meaning w⊤Qw ≥ 0 for all w, then f is convex and quadratic programming is tractable, but there may be infinitely many solutions.]
Example: Find maximum margin classifier.
quadratic.pdf [Draw two polygons on these isocontours—one with one active constraint, and one with two—and show the constrained minimum for each polygon. “In an SVM, we are looking for the point in this polygon that’s closest to the origin.”]
Algs for quadratic programming:
– Simplex-like [commonly used for general-purpose quadratic programs, but not as good for SVMs as the following two algorithms that specifically exploit properties of SVMs]
– Sequential minimal optimization (SMO, used in LIBSVM)
– Coordinate descent (used in LIBLINEAR)
[One clever idea SMO uses is that it does a line search that uses the Hessian, but it’s cheap to compute because SMO doesn’t walk in the direction of steepest descent; instead it walks along just two coordinate axes at a time.]
Numerical optimization @ Berkeley: EECS 127/227AT/227BT/227C.
6 Decision Theory; Generative and Discriminative Models
DECISION THEORY aka Risk Minimization
[Today I’m going to talk about a style of classifier very different from SVMs. The classifiers we’ll cover in the next few weeks are based on probability.]
[One aspect of probabilistic data is that sometimes a point in feature space doesn’t have just one class. Suppose one borrower with income $30,000 and debt $15,000 defaults.
another ” ” ” ” ” ” ” doesn’t default.
So in your feature space, you have two feature vectors at the same point with different classes. Obviously, in that case, you can’t draw a decision boundary that classifies all points with 100% accuracy.]
Multiple sample points with different classes could lie at same point: we want a probabilistic classifier.
Suppose 10% of population has cancer, 90% doesn’t. Probability distributions for calorie intake, P(X|Y):
[caps here mean random variables, not matrices.]
                       calories (X)
                   < 1,200   1,200–1,600   > 1,600
cancer (Y = 1):      20%         50%         30%
no cancer (Y = −1):   1%         10%         89%
[I made these numbers up. Please don't take them as medical advice.]
You meet guy eating x = 1,400 calories/day. Guess whether he has cancer?
[If you're in a hurry, you might see that 50% of people with cancer eat 1,400 calories, but only 10% of people with no cancer do, and conclude that someone who eats 1,400 calories probably has cancer. But that would be wrong, because that reasoning fails to take the prior probabilities into account.]
Bayes' Theorem:   [left-hand sides are posterior probabilities; P(Y = 1), P(Y = −1) are the prior probs; the numbers are for 1,200 ≤ X ≤ 1,600]
    P(Y = 1|X) = P(X|Y = 1) P(Y = 1) / P(X) = 0.05 / 0.14
    P(Y = −1|X) = P(X|Y = −1) P(Y = −1) / P(X) = 0.09 / 0.14   [These two probs always sum to 1.]
Recall: P(X) = P(X|Y = 1) P(Y = 1) + P(X|Y = −1) P(Y = −1)
P(1,200 ≤ X ≤ 1,600) = 0.5 × 0.1 + 0.1 × 0.9 = 0.14
P(cancer | 1,200 ≤ X ≤ 1,600 cals) = 5/14 ≈ 36%.
[So we probably shouldn’t diagnose cancer.]
[BUT . . . we’re assuming that we want to maximize the chance of a correct prediction. But that’s not always the right assumption. If you’re developing a cheap screening test for cancer, you’d rather have more false positives and fewer false negatives. A false negative might mean somebody misses an early diagnosis and dies of a cancer that could have been treated if caught early. A false positive just means that you spend more money on more accurate tests.]
A loss function L(z, y) specifies badness if classifier predicts z, true class is y.
E.g., L(z, y) = 1 if z = 1, y = −1,   [false positive is bad]
               5 if z = −1, y = 1,   [false negative is BAAAAAD]
               0 if z = y.
A 36% probability of loss 5 is worse than a 64% prob. of loss 1, so we recommend further cancer screening.
Defs: loss fn above is asymmetrical.
The 0-1 loss function is 1 for incorrect predictions, [symmetrical]
0 for correct.
[Another example where you want a very asymmetrical loss function is for spam detection. Putting a good email in the spam folder is much worse than putting spam in your inbox.]
Let r : Rd → ±1 be a decision rule, aka classifier:
a fn that maps a feature vector x to 1 (“in class”) or −1 (“not in class”).
The risk for r is the expected loss over all values of x, y:   [Memorize this definition!]
R(r) = E[L(r(X), Y)]
     = Σ_x [ L(r(x), 1) P(Y = 1|X = x) + L(r(x), −1) P(Y = −1|X = x) ] P(X = x)
     = P(Y = 1) Σ_x L(r(x), 1) P(X = x|Y = 1) + P(Y = −1) Σ_x L(r(x), −1) P(X = x|Y = −1)
The Bayes decision rule aka Bayes classifier is the fn r∗ that minimizes functional R(r). Assuming L(z, y) = 0 for z = y:
    r∗(x) = 1 if L(−1, 1) P(Y = 1|X = x) > L(1, −1) P(Y = −1|X = x),
            −1 otherwise
When L is symmetric, [the big, key principle you should memorize is]
    pick the class with the biggest posterior probability.
[But if the loss function is asymmetric, then you must weight the posteriors with the losses.]
In cancer example, r∗(x) = 1 for x ≤ 1,600; r∗(x) = −1 for x > 1,600.
The Bayes risk, aka optimal risk, is the risk of the Bayes classifier.
[In our cancer example, the last expression for risk R gives:]
R(r∗) = 0.1 (5 × 0.3) + 0.9 (1 × 0.01 + 1 × 0.1) = 0.249.  No decision rule gives a lower risk.
[It is interesting that, if we really know all these probabilities, we really can construct an ideal probabilistic classifier. But in real applications, we rarely know these probabilities; the best we can do is use statistical methods to estimate them.]
Deriving/using r∗ is called risk minimization.
[Did you memorize the two boldfaced lines above yet?]
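[The cancer example as plain arithmetic, with the table and the asymmetric loss above: for each calorie range, predict the class with the smaller expected loss, then compute the risk of that rule. This sketch just reproduces the numbers in the text.]

# P(X = x | Y = y) from the table; priors P(Y=1) = 0.1, P(Y=-1) = 0.9.
ranges = ['<1200', '1200-1600', '>1600']
p_x_given_cancer   = {'<1200': 0.20, '1200-1600': 0.50, '>1600': 0.30}
p_x_given_nocancer = {'<1200': 0.01, '1200-1600': 0.10, '>1600': 0.89}
prior = {+1: 0.1, -1: 0.9}
loss = {(+1, -1): 1, (-1, +1): 5, (+1, +1): 0, (-1, -1): 0}   # L(prediction, truth)

def bayes_rule(x):
    # expected loss of predicting z, proportional to sum_y L(z, y) P(X=x|Y=y) P(Y=y)
    def expected_loss(z):
        return (loss[(z, +1)] * p_x_given_cancer[x] * prior[+1]
                + loss[(z, -1)] * p_x_given_nocancer[x] * prior[-1])
    return +1 if expected_loss(+1) < expected_loss(-1) else -1

risk = sum(prior[+1] * loss[(bayes_rule(x), +1)] * p_x_given_cancer[x]
           + prior[-1] * loss[(bayes_rule(x), -1)] * p_x_given_nocancer[x]
           for x in ranges)
print({x: bayes_rule(x) for x in ranges})   # predicts cancer for x <= 1,600
print("Bayes risk:", risk)                  # 0.249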
Continuous Distributions
Suppose X has a continuous probability density fn (PDF).
Review: [Go back to your CS 70 or stats notes if you don’t remember this.]
[Draw this by hand: the PDF f(x), with the area between x1 and x2 shaded.]
integrate.pdf
prob. that random variable X ∈ [x1, x2] = ∫_{x1}^{x2} f(x) dx   [shaded area]
area under whole curve = 1 = ∫_{−∞}^{∞} f(x) dx
mean μ = E[X] = ∫_{−∞}^{∞} x f(x) dx
expected value of g(X): E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx
variance σ² = E[(X − μ)²] = E[X²] − μ²
[Perhaps our cancer statistics look like this:]
Draw this figure by hand (cancerconditional.png)
[Let’s go back to the 0-1 loss function for a moment. In other words, suppose you want a classifier that maximizes the chance of a correct prediction. The wrong answer would be to look where these two curves cross and make that be the decision boundary. As before, it’s wrong because it doesn’t take into account the prior probabilities.]
Suppose P(Y = 1) = 1/3, P(Y = −1) = 2/3, 0-1 loss:
[To maximize the chance you’ll predict correctly whether somebody has cancer, the Bayes decision rule looks up x on this chart and picks the curve with the highest probability. In this example, that means you pick cancer when x is left of the optimal decision boundary, and no cancer when x is to the right.]
[The area under each curve is 1.]
Draw this figure by hand (cancerposterior.png)
Define risk as before, replacing summations with integrals.
R(r) = E[L(r(X), Y)]
     = P(Y = 1) ∫ L(r(x), 1) f(X = x|Y = 1) dx + P(Y = −1) ∫ L(r(x), −1) f(X = x|Y = −1) dx
For Bayes decision rule, Bayes risk is the area under minimum of functions above. [Shade it.]
Assuming L(z, y) = 0 for z = y:
R(r∗) = ∫ min_{y = ±1} L(−y, y) f(X = x|Y = y) P(Y = y) dx
[If you want to use an asymmetrical loss function, just scale the curves vertically in the figure above.]
If L is 0-1 loss, [then the risk has a particularly nice interpretation:] R(r) = P(r(x) is wrong) [which makes sense, because R is the expected loss]
and the Bayes optimal decision boundary is {x : P(Y = 1|X = x) = 0.5}.   [P(Y = 1|X = x) is the decision fn; 0.5 is the isovalue.]
qda3d.pdf, qdacontour.pdf [Two different views of the same 2D Gaussians. Note the Bayes optimal decision boundary, which is white at right.]
[Notice that the accuracy of the probabilities is most important near the decision boundary. Far away from the decision boundary, a bit of error in the probabilities probably wouldn’t change the classification.]
[You can also have multi-class classifiers, choosing among three or more classes. The Bayesian approach is a particularly convenient way to generate multi-class classifiers, because you can simply choose whichever class has the greatest posterior probability. Then the decision boundary lies wherever two or more classes are tied for the highest probability.]
3 WAYS TO BUILD CLASSIFIERS
(1) Generative models (e.g., LDA) [We’ll learn about LDA next lecture.]
– Assume sample points come from probability distributions, different for each class.
– Guess form of distributions
– For each class C, fit distribution parameters to class C points, giving P(X|Y = C)
– For each C, estimate P(Y = C)
– Bayes’ Theorem gives P(Y|X)
– If 0-1 loss, pick class C that maximizes P(Y = C|X = x) [posterior probability]
equivalently, maximizes P(X = x|Y = C) P(Y = C)
(2) Discriminative models (e.g., logistic regression) [We’ll learn about logistic regression in a few weeks.]
– Model P(Y|X) directly
(3) Find decision boundary (e.g., SVM)
– Model r(x) directly (no posterior)
Advantage of (1 & 2): P(Y|X) tells you probability your guess is wrong [This is something SVMs don’t do.]
Advantage of (1): you can diagnose outliers: P(X) is very small
Disadvantages of (1): often hard to estimate distributions accurately;
real distributions rarely match standard ones.
[What I’ve written here doesn’t actually define the phrases “generative model” or “discriminative model.” The proper definitions accord with the way statisticians think about models. A generative model is a full probabilistic model of all variables, whereas a discriminative model provides a model only for the target variables that we want to predict.]
[It’s important to remember that we rarely know precisely the value of any of these probabilities. There is usually error in all of these probabilities. In practice, generative models are most popular when you have phenomena that are well approximated by the normal distribution, and you have a lot of sample points, so you can approximate the shape of the distribution well.]
7 Gaussian Discriminant Analysis, including QDA and LDA
GAUSSIAN DISCRIMINANT ANALYSIS
Fundamental assumption: each class comes from normal distribution (Gaussian).
X ∼ N(μ, σ²):   f(x) = 1/((√(2π) σ)^d) exp(−∥x − μ∥²/(2σ²))   [μ & x = vectors; σ = scalar; d = dimension]
For each class C, suppose we estimate mean μC, variance σ²C, and prior πC = P(Y = C).
Given x, Bayes decision rule r∗(x) returns class C that maximizes f(X = x|Y = C) πC.
ln ω is monotonically increasing for ω > 0, so it is equivalent to maximize
    QC(x) = ln((√(2π))^d fC(x) πC) = −∥x − μC∥²/(2σ²C) − d ln σC + ln πC
            ↑ normal PDF, estimates f(X = x|Y = C)      ↑ quadratic in x.
[In a 2-class problem, you can also incorporate an asymmetrical loss function the same way we incorporate the prior πC. In a multi-class problem, it gets more difficult, because the penalty for guessing wrong might depend not just on the wrong guess, but also on the true class.]
Quadratic Discriminant Analysis (QDA)
Suppose only 2 classes C, D. Then
    r∗(x) = C if QC(x) − QD(x) > 0,
            D otherwise.
[Pick the class with the biggest posterior probability.]
Decision fn is quadratic in x. Bayes decision boundary is QC(x) − QD(x) = 0.
– In 1D, B.d.b. may have 1 or 2 points.   [Solutions to a quadratic equation]
– In d-D, B.d.b. is a quadric.   [In 2D, that's a conic section]
qda3d.pdf, qdacontour.pdf [The same example I showed during the previous lecture.]
[You’re probably familiar with the Gaussian distribution where x and μ are scalars, but as I’ve written it, it applies equally well to a multi-dimensional feature space with isotropic Gaussians. Then x and μ are vectors, but the variance σ is still a scalar. Next lecture we’ll look at anisotropic Gaussian distributions where the variance is different along different directions.]
[QDA works very naturally with more than 2 classes.]
[The feature space gets partitioned into regions. In two or more dimen- sions, you typically wind up with multiple decision boundaries that adjoin each other at joints. It looks like a sort of Voronoi diagram. In fact, it’s a special kind of Voronoi diagram called a multiplicatively, additively weighted Voronoi diagram.]
[You might not be satisfied with just knowing how each point is classified. One of the great things about QDA is that you can also determine the probability that your classification is correct. Let’s work that out.]
multiplicative.pdf
To recover posterior probabilities in 2-class case, use Bayes:
    P(Y = C|X) = f(X|Y = C) πC / (f(X|Y = C) πC + f(X|Y = D) πD)
Recall eQC(x) = (√(2π))^d fC(x) πC   [by definition of QC]
    P(Y = C|X = x) = eQC(x) / (eQC(x) + eQD(x)) = 1 / (1 + eQD(x)−QC(x)) = s(QC(x) − QD(x)),
where s(γ) = 1/(1 + e^(−γ))   ⇐ logistic fn aka sigmoid fn   [recall QC − QD is the decision fn]
logistic.pdf
[The logistic function s(x). Write beside it:] s(0) = 1/2, s(∞) → 1, s(−∞) → 0, monotonically increasing.
[We interpret s(0) = 1/2 as saying that on the decision boundary, there's a 50% chance of class C and a 50% chance of class D.]
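[A small sketch of the isotropic quadratic discriminants QC, QD and the posterior s(QC − QD), written out in numpy; the means, variances, priors, and test point are made up for illustration.]

import numpy as np

def Q(x, mu, sigma, prior):
    """Isotropic quadratic discriminant Q_C(x) = -||x - mu||^2 / (2 sigma^2) - d ln sigma + ln prior."""
    d = len(mu)
    return -np.sum((x - mu) ** 2) / (2 * sigma ** 2) - d * np.log(sigma) + np.log(prior)

def logistic(gamma):
    return 1.0 / (1.0 + np.exp(-gamma))

# made-up class parameters
mu_C, sigma_C, pi_C = np.array([1.0, 1.0]), 0.8, 0.3
mu_D, sigma_D, pi_D = np.array([-1.0, 0.0]), 1.5, 0.7

x = np.array([0.2, 0.5])
qc, qd = Q(x, mu_C, sigma_C, pi_C), Q(x, mu_D, sigma_D, pi_D)
print("predict C" if qc > qd else "predict D",
      " P(Y = C | X = x) =", logistic(qc - qd))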
Linear Discriminant Analysis (LDA)
[LDA is a variant of QDA with linear decision boundaries. It’s less likely to overfit than QDA.]
Fundamental assumption: all the Gaussians have same variance σ. [The equations simplify nicely in this case.]
    QC(x) − QD(x) = (μC − μD) · x / σ² − (∥μC∥² − ∥μD∥²)/(2σ²) + ln πC − ln πD
                    [the first term is w · x; the remaining constant terms are α]
[The quadratic terms in QC and QD canceled each other out!]
Now it's a linear classifier! Choose C that maximizes linear discriminant fn
    μC · x / σ² − ∥μC∥²/(2σ²) + ln πC   [this works for any number of classes]
In 2-class case: decision boundary is w · x + α = 0
                 posterior is P(Y = C|X = x) = s(w · x + α)
[The effect of "w · x + α" is to scale and translate the logistic fn in x-space. It's a linear transformation.]
[Two Gaussians (red) and the logistic function (black). The logistic function is the right Gaussian divided by the sum of the Gaussians. Observe that even when the Gaussians are 2D, the logistic function still looks 1D.]
lda1d.pdf
[When you have many classes, their LDA decision boundaries form a classical Voronoi diagram if the priors πC are equal. All the Gaussians have the same width.]
If πC = πD = 1/2 ⇒ (μC − μD) · x − (μC − μD) · (μC + μD)/2 = 0
This is the centroid method!
voronoi.pdf
MAXIMUM LIKELIHOOD ESTIMATION OF PARAMETERS (Ronald Fisher, circa 1912)
[To use Gaussian discriminant analysis, we must first fit Gaussians to the sample points and estimate the class prior probabilities. We’ll do priors first—they’re easier, because they involve a discrete distribution. Then we’ll fit the Gaussians—they’re less intuitive, because they’re continuous distributions.]
Let’s flip biased coins! Heads with probability p; tails w/prob. 1 − p.
10 flips, 8 heads, 2 tails. [Let me ask you a weird question.] What is the most likely value of p?
Binomial distribution: X ∼ B(n, p)
P[X = x] = (n choose x) p^x (1 − p)^(n−x)   [this is the probability of getting exactly x heads in n coin flips]
Our example: n = 10. Probability of 8 heads in 10 flips:
    P[X = 8] = 45 p^8 (1 − p)² =def L(p)
written as a fn L(p) of distribution parameter(s), this is the likelihood fn.
Maximum likelihood estimation (MLE): A method of estimating the parameters of a statistical model by picking the params that maximize [the likelihood function] L.
. . . is one method of density estimation: estimating a PDF [probability density function] from data.
[Let's phrase it as an optimization problem.]
Find p that maximizes L(p).
binomlikelihood.pdf
[Graph of L(p) for this example.]
Solve this example by setting derivative = 0:
    dL/dp = 360 p^7 (1 − p)² − 90 p^8 (1 − p) = 0
    ⇒ 4(1 − p) − p = 0 ⇒ p = 0.8
[It shouldn’t seem surprising that a coin that comes up heads 80% of the time is the coin most likely to produce 8 heads in 10 flips.]
[Note: d²L/dp² ≈ −18.9 < 0 at p = 0.8, confirming it's a maximum.]
[Here’s how this applies to prior probabilities. Suppose our data set is 10 sample points, and 8 of them are of class C and 2 are not. Then our estimated prior for class C will be πC = 0.8.]
Likelihood of a Gaussian
Given sample points X1, X2, . . . , Xn, find best-fit Gaussian.
[Now we want a normal distribution instead of a binomial distribution. If you generate a random point from
a normal distribution, what is the probability that it will be exactly at the mean of the Gaussian?]
[Zero. So it might seem like we have a problem here. With a continuous distribution, the probability of generating any particular point is zero. But we’re just going to ignore that and do “likelihood” anyway.]
Likelihood of generating these points is L(μ,σ;X1,...,Xn) = f(X1) f(X2)··· f(Xn).
The log likelihood l(·) is the ln of the likelihood L(·).
[How do we maximize this?]
Maximizing likelihood ⇔ maximizing log likelihood.
    l(μ, σ; X1, ..., Xn) = ln f(X1) + ln f(X2) + ... + ln f(Xn)
                         = Σ_{i=1}^n ( −∥Xi − μ∥²/(2σ²) − d ln √(2π) − d ln σ )   ← ln of normal PDF
Want to set ∇μ l = 0, ∂l/∂σ = 0   [The hats ˆ mean "estimated"]
    ∇μ l = Σ_{i=1}^n (Xi − μ)/σ² = 0   ⇒   μ̂ = (1/n) Σ_{i=1}^n Xi
    ∂l/∂σ = Σ_{i=1}^n (∥Xi − μ∥² − dσ²)/σ³ = 0   ⇒   σ̂² = (1/(dn)) Σ_{i=1}^n ∥Xi − μ∥²
We don’t know μ exactly, so substitute μˆ for μ to compute σˆ .
I.e., we use mean & variance of points in class C to estimate mean & variance of Gaussian for class C.
For QDA: estimate conditional mean μ̂C & conditional variance σ̂²C of each class C separately [as above]
         & estimate the priors:  π̂C = nC / Σ_D nD   ⇐ the denominator is the total sample points in all classes   [π̂C is the coin flip parameter]
For LDA: same means & priors; one variance for all classes:
         σ̂² = (1/(dn)) Σ_C Σ_{i: yi=C} ∥Xi − μ̂C∥²   ⇐ pooled within-class variance
[Notice that although LDA is computing one variance for all the data, each sample point contributes with respect to its own class’s mean. This gives a very different result than if you simply use the global mean! It’s usually smaller than the global variance. We say “within-class” because we use each point’s distance from its class’s mean, but “pooled” because we then pool all the classes together.]
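[The estimates above as a few lines of numpy: per-class means, per-class variances for QDA, priors, and the pooled within-class variance for LDA. The random data are just for illustration.]

import numpy as np

def fit_isotropic(X, y):
    """X: n x d sample points, y: class labels. Returns MLE parameters for isotropic QDA and LDA."""
    n, d = X.shape
    params, pooled = {}, 0.0
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)                              # mu_hat_C
        var = np.sum((Xc - mu) ** 2) / (d * len(Xc))      # sigma_hat_C^2  (QDA)
        prior = len(Xc) / n                               # pi_hat_C
        pooled += np.sum((Xc - mu) ** 2)
        params[c] = dict(mean=mu, var=var, prior=prior)
    return params, pooled / (d * n)                       # pooled within-class variance (LDA)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (80, 2)), rng.normal(3, 2, (20, 2))])
y = np.array([0] * 80 + [1] * 20)
params, pooled_var = fit_isotropic(X, y)
print(params[1]['prior'], pooled_var)    # prior of class 1 is 0.2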
8 Eigenvectors and the Anisotropic Multivariate Normal Distribution
EIGENVECTORS
[I don’t know if you were properly taught about eigenvectors here at Berkeley, but I sure don’t like the way they’re taught in most linear algebra books. So I’ll start with a review. You all know the definition of an eigenvector:]
Given square matrix A, if Av = λv for some vector v 0, scalar λ, then v is an eigenvector of A and λ is the eigenvalue of A associated w/v.
[But what does that mean? It means that v is a magical vector that, after being multiplied by A, still points in the same direction, or in exactly the opposite direction.]
[For most matrices, most vectors don’t have this property. So the ones that do are special, and we call them eigenvectors.]
[Clearly, when you scale an eigenvector, it’s still an eigenvector. Only the direction matters, not the length. Let’s look at a few consequences.]
Theorem: if v is eigenvector of A w/eigenvalue λ, then v is eigenvector of Ak w/eigenvalue λk.
[k is a +ve integer; we will use this Theorem later]
Proof: A²v = A(λv) = λ²v, etc.
Draw this figure by hand (eigenvectors.png)
Theorem: moreover, if A is invertible, then v is eigenvector of A⁻¹ w/eigenvalue 1/λ.
Proof: A⁻¹v = A⁻¹((1/λ) Av) = (1/λ) v   [look at the figures above, but go from right to left.]
[Stated simply: When you invert a matrix, the eigenvectors don’t change, but the eigenvalues get inverted. When you square a matrix, the eigenvectors don’t change, but the eigenvalues get squared.]
[Those theorems are pretty obvious. The next theorem is not obvious at all.]
Spectral Theorem: every real, symmetric n × n matrix has real eigenvalues and
n eigenvectors that are mutually orthogonal, i.e., vi⊤ vj = 0 for all i ≠ j
[This takes about a page of math to prove.
One minor detail is that a matrix can have more than n eigenvector directions. If two eigenvectors happen to have the same eigenvalue, then every linear combination of those eigenvectors is also an eigenvector. Then you have infinitely many eigenvector directions, but they all span the same plane. So you just arbitrarily pick two vectors in that plane that are orthogonal to each other. By contrast, the set of eigenvalues is always uniquely determined by a matrix, including the multiplicity of the eigenvalues.]
We can use them as a basis for Rn.
Quadratic Forms
[My favorite way to visualize a symmetric matrix is to graph something called the quadratic form, which shows how applying the matrix affects the length of a vector. The following example uses the same two eigenvectors and eigenvalues as above.]
∥z∥² = z⊤z   ⇐ quadratic; isotropic; isosurfaces are spheres
∥A⁻¹x∥² = x⊤A⁻²x   ⇐ quadratic form of the matrix A⁻² (A symmetric); anisotropic; isosurfaces are ellipsoids
    A = [3/4 5/4; 5/4 3/4],   x = Az
circlebowl.pdf, ellipsebowl.pdf, circles.pdf, ellipses.pdf
[Both figures at left are plots of ∥z∥², and both figures at right are plots of ∥A⁻¹x∥². The map x = Az takes z-space to x-space. (Draw the stretch direction (1, 1) with eigenvalue 2 and the shrink direction (1, −1) with eigenvalue −1/2 on the ellipses at bottom right.)]
[The matrix A maps the circles on the left to the ellipses on the right. They’re stretching along the direction
with eigenvalue 2, and shrinking along the direction with eigenvalue −1/2. Let's prove that.]
∥A⁻¹x∥² = 1 is an ellipsoid with axes v1, v2, ..., vn and radii λ1, λ2, ..., λn,
because if vi has length λi, ∥A⁻¹vi∥² = ∥(1/λi) vi∥² = 1 ⇒ vi lies on the ellipsoid.
Special case: A is diagonal ⇔ eigenvectors are coordinate axes
⇔ ellipsoids are axis-aligned [Draw axis-aligned isocontours for a diagonal metric.]
A symmetric matrix M is
    positive definite      if w⊤Mw > 0 for all w ≠ 0   ⇔ all eigenvalues positive
    positive semidefinite  if w⊤Mw ≥ 0 for all w       ⇔ all eigenvalues nonnegative
    indefinite             if it has a +ve eigenvalue & a −ve eigenvalue
    invertible             if it has no zero eigenvalue
posdef.pdf, possemi.pdf, indef.pdf
[Examples of quadratic forms for positive definite, positive semidefinite, and indefinite ma- trices. Positive eigenvalues correspond to axes where the curvature goes up; negative eigen- values correspond to axes where the curvature goes down. (Draw the eigenvector directions, and draw the flat trough in the positive semidefinite bowl.)]
What does this tell us about x⊤ A−2 x?
[We’ve been visualizing the quadratic form of a matrix A−2. The eigenvalues of A−2 are the inverse squares of the eigenvalues of A, so A−2 cannot have a negative eigenvalue. Moreover, A−2 cannot have a zero eigenvalue, because A cannot have an infinite eigenvalue (and A−2 does not exist if A has a zero eigenvalue.) Therefore, A−2 is positive definite (if it exists).]
What about the isosurfaces of x⊤Mx for a +ve definite M?
[If M is positive definite, the contour plot of M’s quadratic form has ellipsoidal isosurfaces whose radii are determined by the eigenvalues of M−1/2, which are the inverse square roots of the eigenvalues of M. We will use these ideas to define Gaussian distributions, and for that, we’ll need a strictly positive definite matrix.]
[If M is only positive semidefinite, but not positive definite, the isosurfaces are cylinders instead of ellipsoids. These cylinders have ellipsoidal cross sections spanning the directions with nonzero eigenvalues, but they run in straight lines along the directions with zero eigenvalues.]
Building a Quadratic
[There are a lot of applications where you’re given a matrix, and you want to extract the eigenvectors and eigenvalues. But when you’re learning the math, I think it’s more intuitive to go in the opposite direction. Suppose you have an ellipsoid isosurface in mind. Suppose you pick the ellipsoid axes and the radius along each axis, and you want to create the matrix whose quadratic form will have isosurfaces matching the ellipsoid of your dreams.]
Choose n mutually orthogonal unit n-vectors v1, . . . , vn [so they specify an orthonormal coordinate system]
Let V = [v1 v2 . . . vn] ⇐ n × n matrix Observe: V⊤V = I
[off-diagonal 0’s because the vectors are orthogonal] [diagonal 1’s because they’re unit vectors]
⇒ V⊤=V−1 ⇒ VV⊤=I
V is orthonormal matrix: acts like rotation (or reflection)
Choose some radii λi:
Let Λ = diag(λ1, λ2, ..., λn), the n × n diagonal matrix with λ1, ..., λn on its diagonal.   [diagonal matrix of eigenvalues]
Defn. of “eigenvector”: AV = VΛ
[This is the same definition of eigenvector I gave you at the start of the lecture, but this is how we express it in matrix form, so we can cover all the eigenvectors in one statement.]
⇒ AVV⊤ = VΛV⊤ [which leads us to …]
Theorem: A = VΛV⊤ = Σ_{i=1}^n λi vi vi⊤ has the chosen eigenvectors/values.
    [each vi vi⊤ is an outer product: an n × n matrix of rank 1]
This is a matrix factorization called the eigendecomposition. [every real, symmetric matrix has one]
Λ is the diagonalized version of A.
V⊤ rotates the ellipsoid to be axis-aligned.
Paraboloid with specified axes and radii: ∥A−1 x∥ = 1 or x⊤ A−2 x = 1.
[This completes our task of specifying a paraboloid whose isosurfaces are ellipsoids with specified axes and radii.]
Observe: A2 = VΛV⊤VΛV⊤ = VΛ2V⊤ A−2 = VΛ−2V⊤
[This is another way to see that squaring a matrix squares its eigenvalues without changing its eigenvectors. It also suggests a way to define a matrix square root.]
Given a symmetric PSD matrix Σ, we can find a symmetric square root A = Σ1/2:
compute eigenvectors/values of Σ
take square roots of Σ’s eigenvalues
reassemble matrix A [with the same eigenvectors as Σ but changed eigenvalues]
[The first step of this algorithm—computing the eigenvectors and eigenvalues of a matrix—is much harder than the remaining two steps.]
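[A sketch of both directions: assemble A = VΛV⊤ from chosen orthonormal eigenvectors and eigenvalues, and compute a symmetric square root of a symmetric PSD matrix by square-rooting its eigenvalues; numpy's eigh does the (hard) eigenvector step. The matrices are the examples above plus a made-up Σ.]

import numpy as np

# Choose orthonormal eigenvectors (columns of V) and eigenvalues, then assemble A = V Lambda V^T.
V = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
Lam = np.diag([2.0, -0.5])
A = V @ Lam @ V.T
print(A)                                     # [[3/4, 5/4], [5/4, 3/4]], the A from the quadratic-form example

def symmetric_sqrt(Sigma):
    """Symmetric square root of a symmetric PSD matrix Sigma."""
    eigenvalues, eigenvectors = np.linalg.eigh(Sigma)    # the hard step
    return eigenvectors @ np.diag(np.sqrt(eigenvalues)) @ eigenvectors.T

Sigma = np.array([[2.0, 1.0], [1.0, 2.0]])   # made-up symmetric PSD matrix
A_root = symmetric_sqrt(Sigma)
print(np.allclose(A_root @ A_root, Sigma))   # True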
ANISOTROPIC GAUSSIANS
[Let’s revisit the multivariate Gaussian distribution, with different variances along different directions.] X ∼ N (μ, Σ) [X and μ are d-vectors. X is random variable with mean μ.]
    f(x) = 1/((√(2π))^d √|Σ|) exp(−½ (x − μ)⊤ Σ⁻¹ (x − μ))   [|Σ| is the determinant of Σ]
Σ is the d × d SPD covariance matrix.
Σ⁻¹ is the d × d SPD precision matrix.
Write f(x) = n(q(x)), where q(x) = (x − μ)⊤ Σ⁻¹ (x − μ).
[n : R → R is an exponential; q : Rd → R is quadratic.]
[Now q(x) is a function we understand—it’s just a quadratic bowl centered at μ, the quadratic form of the precision matrix Σ−1. The other function n(·) is a simple, monotonic, convex function, an exponential of the negation of its argument. This mapping n(·) does not change the isosurfaces.]
Principle: given monotonic n : R → R, isosurfaces of n(q(x)) are same as q(x) (different isovalues).
ellipsebowl.pdf, ellipses.pdf, exp.pdf, gauss3d.pdf, gausscontour.pdf
[(Show this figure on a separate "whiteboard" for easy reuse next lecture.) A paraboloid q(x) (left) becomes a bivariate Gaussian f(x) = n(q(x)) (right) after you compose it with the univariate Gaussian n(·) (center).]
[One of the main ideas is that if you understand the isosurfaces of a quadratic function, then you understand the isosurfaces of a Gaussian, because they’re the same. The differences are in the isovalues—in particular, the Gaussian achieves its maximum at the mean, and decreases to zero as you move infinitely far away from the mean.]
q(x) is the squared distance from Σ−1/2x to Σ−1/2μ. Consider the metric
    d(x, μ) = ∥Σ^(−1/2) x − Σ^(−1/2) μ∥ = √((x − μ)⊤ Σ⁻¹ (x − μ)) = √(q(x)).
[So we think of the precision matrix as a “metric tensor” which defines a metric, a sort of warped distance from x to the mean μ.]
covariance:
Let R, S be random variables—column vectors or scalars.
    Cov(R, S) = E[(R − E[R])(S − E[S])⊤] = E[RS⊤] − μR μS⊤
    Var(R) = Cov(R, R)
If R is a vector, covariance matrix for R is
             ⎡ Var(R1)       Cov(R1, R2)  ...  Cov(R1, Rd) ⎤
    Var(R) = ⎢ Cov(R2, R1)   Var(R2)      ...  Cov(R2, Rd) ⎥   [symmetric; each Ri is scalar]
             ⎢   ...             ...      ...      ...     ⎥
             ⎣ Cov(Rd, R1)   Cov(Rd, R2)  ...  Var(Rd)     ⎦
For a Gaussian R ∼ N (μ, Σ), one can show Var(R) = Σ.
[. . . by integrating the expectation in anisotropic spherical coordinates. It’s a painful integral.]
[An important point is that statisticians didn’t just arbitrarily decide to call Σ a covariance matrix. Rather, statisticians discovered that if you find the covariance of the normal distribution by integration, it turns out that the covariance is Σ. This is a happy fact; it’s rather elegant.]
Ri, Rj independent ⇒ Cov(Ri, Rj) = 0   [the reverse implication is not generally true, but . . . ]
Cov(Ri, Rj) = 0 AND multivariate normal dist. ⇒ Ri, Rj independent
all features pairwise independent ⇒ Var(R) is diagonal   [the reverse is not generally true, but . . . ]
Var(R) is diagonal AND joint normal
    ⇔ axis-aligned Gaussian; squared radii on diagonal of Σ = Var(R)
    ⇔ f(x) = f(x1) f(x2) ··· f(xd)   [the multivariate PDF is a product of univariate Gaussians]
[So when the features are independent, you can write the multivariate Gaussian PDF as a product of uni- variate Gaussian PDFs. When they aren’t, you can do a change of coordinates to the eigenvector coordinate system, and write it as a product of univariate Gaussian PDFs in eigenvector coordinates.]
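[A sketch of evaluating the anisotropic normal PDF as f(x) = n(q(x)), with q the quadratic form of the precision matrix; Σ, μ, and x are made up, and scipy's multivariate_normal is used only as a check.]

import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """f(x) = n(q(x)) with q(x) = (x - mu)^T Sigma^{-1} (x - mu)."""
    d = len(mu)
    precision = np.linalg.inv(Sigma)                 # the precision matrix (metric tensor)
    q = (x - mu) @ precision @ (x - mu)              # quadratic form
    normalizer = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * q) / normalizer

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])           # made-up SPD covariance matrix
x = np.array([0.5, 0.0])
print(gaussian_pdf(x, mu, Sigma),
      multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # the two values should agree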
9 Anisotropic Gaussians, Maximum Likelihood Estimation, QDA, and LDA
ANISOTROPIC GAUSSIANS
[Recall from our last lecture the probability density function of the multivariate normal distribution in its full generality.]
Normal PDF: f(x) = 1/((√(2π))^d √|Σ|) exp(−½ (x − μ)⊤ Σ⁻¹ (x − μ))   [|Σ| = determinant of Σ; x, μ are d-vectors]
Write f(x) = n(q(x)), where q(x) = (x − μ)⊤ Σ⁻¹ (x − μ).
[n : R → R is an exponential; q : Rd → R is quadratic.]
[The covariance matrix Σ and its symmetric square root and its inverse all play roles in our intuition about the multivariate normal distribution.]
Σ = VΓV⊤        covariance matrix
    [↑ eigenvalues of Σ are variances along the eigenvectors, Γii = σi²]
Σ^(1/2) = VΓ^(1/2)V⊤   maps spheres to ellipsoids   (Σ^(1/2) was A in last lecture)
    [↑ eigenvalues of Σ^(1/2) are Gaussian widths / ellipsoid radii / standard deviations, √Γii = σi]
Σ⁻¹ = VΓ⁻¹V⊤    precision matrix (metric tensor)
    [↑ quadratic form of Σ⁻¹ defines contours]
q(x) = (x − μ)⊤ Σ⁻¹ (x − μ)   ←− quadratic form
f(x) = n(q(x))
[Recall from last lecture that the isocontours of the multivariate normal distribution are the same as the isocontours of the quadratic form of the precision matrix Σ⁻¹.]
Maximum Likelihood Estimation for Anisotropic Gaussians
Given sample points X1, . . . , Xn and classes y1, . . . , yn, find best-fit Gaussians.
Xi is a column vector. [To fit our definition of the Gaussian distribution f (x).]
[Once again, we want to fit the Gaussian that maximizes the likelihood of generating the sample points in a specified class. This time I won’t derive the maximum-likelihood Gaussian; I’ll just tell you the answer.]
For QDA:
    Σ̂C = (1/nC) Σ_{i: yi=C} (Xi − μ̂C)(Xi − μ̂C)⊤   ⇐ conditional covariance for pts in class C
    [each (Xi − μ̂C)(Xi − μ̂C)⊤ is an outer product matrix; nC is the number of points in class C.]
Prior π̂C, mean μ̂C: same as before
[π̂C is number of points in class C ÷ total sample points; μ̂C is mean of sample points in class C.]
maxlike.jpg
[Maximum likelihood estimation takes these points and outputs this Gaussian.]
Σ̂C is positive semidefinite, but not always definite!
[If there are some zero eigenvalues, the standard version of QDA just doesn't work. We can try to fix it by eliminating the zero-variance dimensions (eigenvectors). Homeworks 2 and 3 suggest two ways to do that.]
For LDA:
    Σ̂ = (1/n) Σ_C Σ_{i: yi=C} (Xi − μ̂C)(Xi − μ̂C)⊤   ⇐ pooled within-class covariance matrix
[Let’s revisit QDA and LDA and see what has changed now that we know anisotropic Gaussians. The short answer is “not much has changed, but the graphs look cooler.” By the way, capital X once again represents a random variable.]
QDA
Choosing C that maximizes f(X = x|Y = C) πC is equivalent to maximizing the quadratic discriminant fn
    QC(x) = ln((√(2π))^d fC(x) πC) = −½ (x − μC)⊤ ΣC⁻¹ (x − μC) − ½ ln |ΣC| + ln πC
                      ↑ Gaussian PDF for C
[This works for any number of classes. In a multi-class problem, you just pick the class with the greatest quadratic discriminant for x.]
2 classes:
Decision fn QC(x) − QD(x) is quadratic, but may be indefinite
⇒ Bayes decision boundary is a quadric.
Posterior is P(Y = C|X = x) = s(QC(x) − QD(x)) where s(·) is logistic fn
qdaaniso3d.pdf, qdaanisocontour.pdf, qdaanisodiff3d.pdf, qdaanisodiffcontour.pdf,
logistic.pdf, qdaanisoposterior3d.pdf, qdaanisoposteriorcontour.pdf
[(Show this figure on a separate “whiteboard.”) An example where the decision boundary is a hyperbola—which is not possible with isotropic Gaussians. At left, two anisotropic Gaussians. Center left, the difference QC − QD. After applying the logistic function to this difference we obtain the posterior probabilities at right, which tells us the probability our prediction is correct. Observe that we can see the decision boundary in both contour plots: itisQC−QD =0ands(QC−QD)=0.5.Wedon’tneedtoapplythelogisticfunctiontofind the decision boundary, but we do need to apply it if we want the posterior probabilities.]
aniso.pdf [When you have many classes, their QDA decision boundaries form an anisotropic Voronoi diagram. Interestingly, a cell of this diagram might not be connected.]

LDA

One Σˆ for all classes.
[Once again, the quadratic terms cancel each other out so the decision function is linear and the decision boundary is a hyperplane.]

    QC(x) − QD(x) = (μC − μD)⊤ Σ−1 x − (μC⊤ Σ−1 μC − μD⊤ Σ−1 μD)/2 + ln πC − ln πD

Choose class C that maximizes the linear discriminant fn
    μC⊤ Σ−1 x − (1/2) μC⊤ Σ−1 μC + ln πC          [works for any # of classes]

2 classes:  Decision boundary is w⊤x + α = 0
            Posterior is P(Y = C|X = x) = s(w⊤x + α)
[Note that we use a linear solver to efficiently compute μC⊤ Σ−1 just once, so the classifier can evaluate test points quickly.]
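[Not from the lecture: a minimal NumPy sketch of evaluating these discriminant functions, assuming we already have the estimates πˆ, μˆ, Σˆ from the previous page. All names are illustrative.]

    import numpy as np

    def qda_discriminant(x, mu, Sigma, prior):
        # Q_C(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - 1/2 ln|Sigma| + ln pi
        diff = x - mu
        sign, logdet = np.linalg.slogdet(Sigma)
        return -0.5 * diff @ np.linalg.solve(Sigma, diff) - 0.5 * logdet + np.log(prior)

    def lda_weights(mu_C, mu_D, Sigma, prior_C, prior_D):
        # 2-class LDA decision fn w.x + alpha; use a linear solver, never invert Sigma.
        w = np.linalg.solve(Sigma, mu_C - mu_D)
        alpha = (-0.5 * (mu_C @ np.linalg.solve(Sigma, mu_C)
                         - mu_D @ np.linalg.solve(Sigma, mu_D))
                 + np.log(prior_C) - np.log(prior_D))
        return w, alpha

[To classify a test point z with QDA, pick the class whose discriminant is largest; with 2-class LDA, check the sign of w · z + α, and the posterior is s(w · z + α).]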
ldaaniso3d.pdf, ldaanisocontour.pdf, ldaanisodiff3d.pdf, ldaanisodiffcontour.pdf, logistic.pdf, ldaanisoposterior3d.pdf, ldaanisoposteriorcontour.pdf
[Panels, left to right: fC(x) & fD(x); QC − QD; the logistic function s(x); s(QC − QD).]
[(Show this figure on a separate “whiteboard.”) In LDA, the decision boundary is always a hyperplane. Note that Mathematica messed up the top left plot a bit; there should be no red in the left corner, nor blue in the right corner.]
LDAdata.pdf (ESL, Figure 4.11) [An example of LDA with messy data. The points are not sampled from perfect Gaussians, but LDA still works reasonably well.]
Notes:
– LDA often interpreted as projecting points onto normal w; cutting the line in half.
  project.png
– For 2 classes,
  – LDA has d + 1 parameters (w, α);
  – QDA has d(d+3)/2 + 1 params;
  – QDA more likely to overfit. [The danger is much bigger when the dimension d is large.]
ldaqda.pdf (ISL, Figure 4.9) [In these examples, the Bayes optimal decision boundary is purple (and dashed), the QDA decision boundary is green, the LDA decision boundary is black (and dotted). When the optimal boundary is linear, as at left, LDA gives a more stable fit whereas QDA may overfit. When the optimal boundary is curved, as at right, QDA often gives you a better fit.]
– With features, LDA can give nonlinear boundaries; QDA nonquadratic.
– We don’t get true optimum Bayes classifier
– estimate distributions from finite data
– real-world data not perfectly Gaussian
– Changing priors or loss = adding constants to discriminant fns
[So it’s very easy. In the 2-class case, it’s equivalent to changing the isovalue . . . ]
– Posterior gives decision boundaries for 10% probability, 50%, 90%, etc.
– choosing isovalue = probability p is the same as choosing an asymmetrical loss of p for a false positive, 1 − p for a false negative; OR as choosing πC = 1 − p, πD = p.
[LDA & QDA are among the best methods in practice for many applications. In the STATLOG project, either LDA or QDA was in the top three classifiers for 10 out of 22 datasets. But it's not because all those datasets are Gaussian. LDA & QDA work well when the data can only support simple decision boundaries such as linear or quadratic, because Gaussian models provide stable estimates. See ESL, Section 4.3.]
Some Terms
Let X be the n × d design matrix of sample pts. Each row i of X is a sample pt Xi⊤.
[Now I'm using capital X as a matrix instead of a random variable vector. I'm treating Xi as a column vector to match the standard convention for multivariate distributions like the Gaussian, but Xi⊤ is a row of X.]
centering X: subtracting μ⊤ from each row of X. X → Ẋ
[μ⊤ is the mean of all the rows of X. Now the mean of all the rows of Ẋ is zero.]
Let R be the uniform distribution on sample pts. The sample covariance matrix is Var(R) = (1/n) Ẋ⊤Ẋ.
[This is the simplest way to remember how to compute a covariance matrix for QDA. Imagine you have a design matrix XC that contains only the sample points of class C; then you have ΣˆC = (1/nC) ẊC⊤ẊC.]
[When we have points from an anisotropic Gaussian distribution, sometimes it’s useful to perform a linear
transformation that maps them to an axis-aligned distribution, or maybe even to an isotropic distribution.]
decorrelating Ẋ: applying rotation Z = ẊV, where Var(R) = VΛV⊤
[rotates the sample points to the eigenvector coordinate system]
Then Var(Z) = Λ. [Z has diagonal covariance. If Xi ∼ N(μ, Σ), then approximately, Zi ∼ N(0, Λ).]
[Proof: Var(Z) = (1/n) V⊤Ẋ⊤ẊV = V⊤Var(R)V = V⊤VΛV⊤V = Λ.]
sphering Ẋ: applying transform W = Ẋ Var(R)−1/2 [Recall that Σ−1/2 maps ellipsoids to spheres.]
whitening X: centering + sphering, X → W
Then W has covariance matrix I. [If Xi ∼ N(μ, Σ), then approximately, Wi ∼ N(0, I).]
[Whitening input data is often used with other machine learning algorithms, like SVMs and neural networks. The idea is that some features may be much bigger than others—for instance, because they’re measured in different units. SVMs penalize violations by large features more heavily than they penalize small features. Whitening the data before you run an SVM puts the features on an equal basis.]
[One nice thing about discriminant analysis is that whitening is built in.]
[Incidentally, what we’ve done here—computing a sample covariance matrix and its eigenvectors/values— is about 75% of an important unsupervised learning method called principal components analysis, or PCA, which we’ll learn later in the semester.]
whiten.jpg
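[Not from the lecture: a minimal NumPy sketch of whitening, assuming the sample covariance has no zero eigenvalues (otherwise Var(R)−1/2 doesn't exist).]

    import numpy as np

    def whiten(X):
        mu = X.mean(axis=0)
        Xdot = X - mu                                   # centering
        cov = Xdot.T @ Xdot / len(X)                    # sample covariance Var(R)
        Lam, V = np.linalg.eigh(cov)                    # Var(R) = V Lambda V^T
        Z = Xdot @ V                                    # decorrelating: Var(Z) = Lambda
        W = Xdot @ (V @ np.diag(Lam ** -0.5) @ V.T)     # sphering: W = Xdot Var(R)^{-1/2}
        return W                                        # W has covariance matrix I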
10 Regression, including Least-Squares Linear and Logistic Regression

REGRESSION aka Fitting Curves to Data

Classification: given point x, predict class (often binary)
Regression: given point x, predict a numerical value
[Classification gives a discrete prediction, whereas regression gives us a quantitative prediction, usually on a continuous scale.]
[We’ve already seen an example of regression in Gaussian discriminant analysis. QDA and LDA don’t just give us a classifier; they also give us the probability that a particular class label is correct. So QDA and LDA implicitly do regression on probability values.]
– Choose form of regression fn h(x; p) with parameters p (h = hypothesis)
  – like decision fn in classification [e.g., linear, quadratic, logistic in x]
– Choose a cost fn (objective fn) to optimize
– usually based on a loss fn; e.g., risk fn = expected loss
Some regression fns:
(1) linear:h(x;w,α)=w·x+α
(2) polynomial [equivalent to linear regression with added polynomial features]
(3) logistic: h(x; w, α) = s(w · x + α)     recall: logistic fn s(γ) = 1/(1 + e−γ)
[The last choice is interesting. You’ll recall that LDA produces a posterior probability function with this expression. So the logistic function seems to be a natural form for modeling certain probabilities. If we want to model posterior probabilities, sometimes we use LDA; but alternatively, we could skip fitting Gaussians to points, and instead just try to directly fit a logistic function to a set of probabilities.]
Some loss fns: let z be prediction h(x); y be true label
(A) L(z, y) = (z − y)2                              squared error
(B) L(z, y) = |z − y|                               absolute error
(C) L(z, y) = −y ln z − (1 − y) ln(1 − z)           logistic loss, aka cross-entropy: y ∈ [0, 1], z ∈ (0, 1)

Some cost fns to minimize:
(a) J(h) = (1/n) Σ_{i=1}^n L(h(Xi), yi)             mean loss [you can leave out the “1/n”]
(b) J(h) = max_{i=1}^n L(h(Xi), yi)                 maximum loss
(c) J(h) = Σ_{i=1}^n ωi L(h(Xi), yi)                weighted sum [some points are more important than others]
(d) J(h) = (1/n) Σ_{i=1}^n L(h(Xi), yi) + λ∥w∥2     l2 penalized/regularized
(e) J(h) = (1/n) Σ_{i=1}^n L(h(Xi), yi) + λ∥w∥l1    l1 penalized/regularized
Some famous regression methods:
    Least-squares linear regr.:  (1) + (A) + (a)    quadratic cost; minimize w/calculus
    Weighted least-squ. linear:  (1) + (A) + (c)    quadratic cost; minimize w/calculus
    Ridge regression:            (1) + (A) + (d)    quadratic cost; minimize w/calculus
    Lasso:                       (1) + (A) + (e)    quadratic program
    Logistic regr.:              (3) + (C) + (a)    convex cost; minimize w/gradient descent
    Least absolute deviations:   (1) + (B) + (a)    linear program
    Chebyshev criterion:         (1) + (B) + (b)    linear program
[I have given you several choices of regression function form, several choices of loss function, and several choices of objective function. These are interchangeable parts where you can snap one part out and replace it with a different one. But the optimization algorithm and its speed depend crucially on which parts you pick. Let's consider some examples.]
LEAST-SQUARES LINEAR REGRESSION (Gauss, 1801)
Linear regression fn (1) + squared loss fn (A) + cost fn (a).
Find w, α that minimizes Σ_{i=1}^n (Xi · w + α − yi)2
linregress.pdf (ISL, Figure 3.4) [An example of linear regression.]

X is n × d design matrix of sample pts; y is n-vector of scalar labels.
Convention:
        [ X11 X12 . . . X1j . . . X1d ]                 [ y1 ]
        [ X21 X22 . . . X2j . . . X2d ]                 [ y2 ]
        [  .                       .  ]                 [  . ]
    X = [ Xi1 Xi2 . . . Xij . . . Xid ]  ← point Xi⊤   y = [ yi ]
        [  .                       .  ]                 [  . ]
        [ Xn1 Xn2 . . . Xnj . . . Xnd ]                 [ yn ]
                        ↑
                 feature column X∗j
Usually n > d. [But not always.]
Recall fictitious dimension trick [from Lecture 3]: rewrite h(x) = x · w + α as
    h(x) = [x1 x2 1] · [w1 w2 α]⊤.
Now X is an n × (d + 1) matrix; w is a (d + 1)-vector. [We've added a column of all-1's to the end of X.]
[We rewrite the optimization problem above:]
Find w that minimizes ∥Xw − y∥2 = RSS(w), for residual sum of squares
Optimize by calculus:
minimize RSS(w) = w⊤X⊤Xw − 2y⊤Xw + y⊤y
∇RSS = 2X⊤Xw − 2X⊤y = 0
⇒ X⊤X w = X⊤y      ⇐ the normal equations [w unknown; X & y known]
   [X⊤X is (d + 1) × (d + 1); w and X⊤y are (d + 1)-vectors.]
If X⊤X is singular, problem is underconstrained
[because the sample points all lie on a common hyperplane. Notice that X⊤X is always positive semidefinite.]
We use a linear solver to find w = (X⊤X)−1X⊤ y [never actually invert the matrix!]
   [The matrix X+ = (X⊤X)−1X⊤, which is (d + 1) × n, is the pseudoinverse of X.]
[We never compute X+ directly, but we are interested in the fact that w is a linear transformation of y.]
[X is usually not square, so X can't have an inverse. However, every X has a pseudoinverse X+, and if X⊤X is invertible, then X+ is a “left inverse.”]
Observe: X+X = (X⊤X)−1X⊤X = I   ⇐ (d + 1) × (d + 1)   [which explains the name “left inverse”]
Observe: the predicted values of y are yˆi = w · Xi  ⇒  yˆ = Xw = XX+y = Hy
where H = XX+ is the n × n hat matrix, so called because it puts the hat on y.
[Ideally, H would be the identity matrix and we'd have a perfect fit, but if n > d + 1, then H is singular.]
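[Not from the lecture: a minimal NumPy sketch of solving the normal equations with a linear solver, assuming X is an n × d array and y an n-vector.]

    import numpy as np

    def fit_least_squares(X, y):
        X1 = np.hstack([X, np.ones((len(X), 1))])      # append the fictitious all-1's column
        # solve X1^T X1 w = X1^T y; never invert the matrix explicitly
        return np.linalg.solve(X1.T @ X1, X1.T @ y)    # last component of w is alpha

    def predict(X, w):
        X1 = np.hstack([X, np.ones((len(X), 1))])
        return X1 @ w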
Interpretation as a projection:
– yˆ = Xw ∈ Rn is a linear combination of columns of X (one column per feature)
– For fixed X, varying w, Xw is subspace of Rn spanned by columns
  [Draw figure in n-dimensional space (1 dim/sample pt), NOT d-dimensional feature space: the label vector y, the subspace spanned by X's columns (at most d + 1 dimensions), and its projection yˆ = Xw onto that subspace.]
– Minimizing ∥yˆ − y∥2 finds point yˆ nearest y on subspace ⇒ projects y orthogonally onto subspace
[the vertical line is the direction of projection and the error vector]
– Error is smallest when line is perpendicular to subspace: X⊤(Xw − y) = 0
⇒ the normal equations!
– Hat matrix H does the projecting. [H is sometimes called the projection matrix.]
Advantages:
– Easy to compute; just solve a linear system.
– Unique, stable solution. [. . . except when the problem is underconstrained.]
Disadvantages:
– Very sensitive to outliers, because errors are squared!
– Fails if X⊤X is singular. [Which means the problem is underconstrained, has multiple solutions.]
[Apparently, least-squares linear regression was first posed and solved in 1801 by the great mathematician Carl Friedrich Gauss, who used least-squares regression to predict the trajectory of the planetoid Ceres. A paper he wrote on the topic is regarded as the birth of modern linear algebra.]
LOGISTIC REGRESSION (David Cox, 1958)
Logistic regression fn (3) + logistic loss fn (C) + cost fn (a). Fits “probabilities” in range (0, 1).
Usually used for classification. The input yi’s can be probabilities, but in most applications they’re all 0 or 1.
QDA, LDA: generative models
logistic regression: discriminative model
[We’ve learned from LDA that in classification, the posterior probabilities are often modeled well by a logistic function. So why not just try to fit a logistic function directly to the data, skipping the Gaussians?]
With X and w including the fictitious dimension; α is w’s last component . . .
Find w that minimizes
    J = − Σ_{i=1}^n [ yi ln s(Xi · w) + (1 − yi) ln(1 − s(Xi · w)) ]

logloss0.pdf, loglosspt7.pdf [Plots of the loss L(z, y) for y = 0 (left) and y = 0.7 (right). As you might guess, the left function is minimized at z = 0, and the right function is minimized at z = 0.7. These loss functions are always convex.]

J(w) is convex! Solve by gradient descent.
[To do gradient descent, we'll need to compute some derivatives.]

    s′(γ) = d/dγ [1/(1 + e−γ)] = e−γ/(1 + e−γ)2 = s(γ) (1 − s(γ))

logistic.pdf, dlogistic.pdf [Plots of s(γ) (left) and s′(γ) (right).]
Let si = s(Xi · w).

    ∇w J = − Σ_i [ (yi/si) ∇si − ((1 − yi)/(1 − si)) ∇si ]
         = − Σ_i [ yi/si − (1 − yi)/(1 − si) ] si (1 − si) Xi
         = − Σ_i (yi − si) Xi
         = −X⊤ (y − s(Xw))      where s(Xw) = [s1 s2 . . . sn]⊤ [applies s component-wise to Xw]

Gradient descent rule: w ← w + ε X⊤(y − s(Xw))

Stochastic gradient descent: w ← w + ε (yi − s(Xi · w)) Xi
Works best if we shuffle points in random order, process one by one. For very large n, sometimes converges before we visit all points!
[This looks a lot like the perceptron learning rule. The only difference is that the “−si” part is new.]
Starting from w = 0 works well in practice.

problogistic.png, by “mwascom” of Stack Overflow, http://stackoverflow.com/questions/28256058/plotting-decision-boundary-of-logistic-regression [An example of logistic regression.]
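[Not from the lecture: a minimal NumPy sketch of both update rules, assuming X already includes the fictitious all-1's column and the step size and iteration counts are arbitrary.]

    import numpy as np

    def s(gamma):
        return 1.0 / (1.0 + np.exp(-gamma))             # logistic function

    def logistic_gd(X, y, eps=0.1, iters=1000):
        w = np.zeros(X.shape[1])                        # starting from w = 0 works well
        for _ in range(iters):
            w += eps * X.T @ (y - s(X @ w))             # batch rule: w <- w + eps X^T (y - s(Xw))
        return w

    def logistic_sgd(X, y, eps=0.1, epochs=10):
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            for i in np.random.permutation(n):          # shuffle points, process one by one
                w += eps * (y[i] - s(X[i] @ w)) * X[i]
        return w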
11 More Regression; Newton's Method; ROC Curves

LEAST-SQUARES POLYNOMIAL REGRESSION
Replace each Xi with feature vector Φ(Xi) with all terms of degree 0 . . . p
e.g., Φ(Xi) = [Xi1²  Xi1Xi2  Xi2²  Xi1  Xi2  1]⊤
[Notice that we’ve added the fictitious dimension “1” here, so we don’t need to add it again to do linear or logistic regression. This basis covers all polynomials quadratic in Xi1 and Xi2.]
Can also use non-polynomial features (e.g., edge detectors). Otherwise just like linear or logistic regression.
Log. reg. + quadratic features = same form of posteriors as QDA. Very easy to overfit!
[Here are some examples of polynomial overfitting, to show the importance of choosing the polynomial degree very carefully. At left, we have sampled points from a degree-3 curve (black) with added noise. We show best-fit polynomials of degrees 2, 4, 6, and 8 found by regression of the black points. The degree-4 curve (green) fits the true curve (black) well, whereas the degree-2 curve (red) underfits and the degree-6 and 8 curves (blue, yellow) overfit the noise and oscillate. The oscillations in the yellow degree-8 curve are a characteristic problem of polynomial interpolation.]
[At upper right, a degree-20 curve shows just how insane high-degree polynomial oscillations can get. It takes a great deal of densely spaced data to tame the oscillations in a high degree curve, and there isn’t nearly enough data here.]
[At lower right, somebody has regressed a degree-4 curve to U.S. census population numbers. The curve doesn’t oscillate, but can you nevertheless see a flaw? This shows the difficulty of extrapolation outside the range of the data. As a general rule, extrapolation is much harder than interpolation. The k-nearest neighbor classifier is one of the few that does extrapolation decently without occasionally returning crazy values.]
overunder.png, degree20.png, UScensusquartic.png
order10extrap.pdf [From Mehta, Wang, Day, Richardson, Bukov, Fisher, and Schwab, “A High-Bias, Low-Variance Introduction to Machine Learning for Physicists.”]
[This example shows that a fitted degree-10 polynomial (green) can be tamed by using a very large amount of training data (right), even if the training data is noisy. The training data was generated from a different degree-10 polynomial, with noise added. On the left, we see that it even does decent extrapolation for a short distance, albeit only because the original data was also from a degree-10 polynomial.]

WEIGHTED LEAST-SQUARES REGRESSION

Linear regression fn (1) + squared loss fn (A) + cost fn (c).
[The idea of weighted least-squares is that some sample points might be more trusted than others, or there might be certain points you want to fit particularly well. So you assign those more trusted points a higher weight. If you suspect some points of being outliers, you can assign them a lower weight.]
Assign each sample pt a weight ωi; collect them in n × n diagonal matrix Ω.
Greater ωi → work harder to minimize |yˆi − yi|2      recall: yˆ = Xw [yˆi is predicted label for Xi]

Find w that minimizes (Xw − y)⊤Ω(Xw − y) = Σ_{i=1}^n ωi (Xi · w − yi)2

[As with ordinary least-squares regression, we find the minimum by setting the gradient to zero, which leads us to the normal equations.]
Solve for w in normal equations5: X⊤ΩXw = X⊤Ωy

5Once again, you can interpret this method as a projection of y. Ω1/2yˆ is the point nearest Ω1/2y on the d-dimensional subspace spanned by the columns of Ω1/2X. If you stretch the n-dimensional space by applying the linear transformation Ω1/2, yˆ is an orthogonal projection of y onto the stretched subspace.
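[Not from the lecture: a one-function NumPy sketch of the weighted normal equations, assuming omega is a length-n vector holding the diagonal of Ω.]

    import numpy as np

    def weighted_least_squares(X, y, omega):
        XtO = X.T * omega                    # equals X^T @ diag(omega) without forming Omega
        return np.linalg.solve(XtO @ X, XtO @ y)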
NEWTON’S METHOD
Iterative optimization method for smooth fn J(w).
Often much faster than gradient descent. [We’ll use Newton’s method for logistic regression.]
Idea: You’re at point v. Approximate J(w) near v by quadratic fn. Jump to its unique critical pt. Repeat until bored.
newton1.pdf, newton2.pdf, newton3.pdf [Three iterations of Newton’s method in one- dimensional space. We seek the minimum of the blue curve, J. Each brown curve is a local quadratic approximation to J. Each iteration, we jump to the bottom of the brown parabola.]
newton2D.png [Steps taken by Newton's method in two-dimensional space.]

Taylor series about v:
    ∇J(w) = ∇J(v) + (∇2 J(v)) (w − v) + O(∥w − v∥2)
where ∇2 J(v) is the Hessian matrix of J at v.
Find critical pt w by setting ∇J(w) = 0:
    w = v − (∇2 J(v))−1 ∇J(v)
[This is an iterative update rule you can repeat until it converges to a solution. As usual, we probably don't want to compute a matrix inverse directly. It is faster to solve a linear system of equations, typically by Cholesky factorization or the conjugate gradient method.]

Newton's method:
    pick starting point w
    repeat until convergence
        e ← solution to linear system (∇2 J(w)) e = −∇J(w)
        w ← w + e
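[Not from the lecture: a minimal NumPy sketch of this iteration, assuming you supply callables for the gradient and Hessian of a smooth cost J.]

    import numpy as np

    def newtons_method(grad, hess, w0, iters=20, tol=1e-10):
        w = np.array(w0, dtype=float)
        for _ in range(iters):
            e = np.linalg.solve(hess(w), -grad(w))   # solve (grad^2 J(w)) e = -grad J(w)
            w += e
            if np.linalg.norm(e) < tol:              # declare convergence when steps get tiny
                break
        return w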
Warning: Doesn't know difference between minima, maxima, saddle pts.
Starting pt must be “close enough” to desired critical pt.
[If the objective function J is actually quadratic, Newton’s method needs only one step to find the exact
solution. The closer J is to quadratic, the faster Newton’s method tends to converge.]
[Newton’s method is superior to blind gradient descent for some optimization problems for several reasons. First, it tries to find the right step length to reach the minimum, rather than just walking an arbitrary distance downhill. Second, rather than follow the direction of steepest descent, it tries to choose a better descent direction.]
[Nevertheless, it has some major disadvantages. The biggest one is that computing the Hessian can be quite expensive, and it has to be recomputed every iteration. It can work well for low-dimensional weight spaces, but you would never use it for a neural network, because there are too many weights. Newton’s method also doesn’t work for most nonsmooth functions. It particularly fails for the perceptron risk function, whose Hessian is zero, except where the Hessian is not even defined.]
LOGISTIC REGRESSION (continued)
[Let’s use Newton’s method to solve logistic regression faster.]
Recall: s′(γ) = s(γ) (1 − s(γ)),   si = s(Xi · w),   s = [s1 s2 . . . sn]⊤

    ∇w J = − Σ_{i=1}^n (yi − si) Xi = −X⊤(y − s)

[Now let's derive the Hessian too, so we can use Newton's method.]

    ∇2w J(w) = Σ_{i=1}^n si (1 − si) Xi Xi⊤ = X⊤ΩX      where Ω = diag(s1(1 − s1), s2(1 − s2), . . . , sn(1 − sn))

Ω is +ve definite ∀w ⇒ X⊤ΩX is +ve semidefinite ∀w ⇒ J(w) is convex.
[The logistic regression cost function is convex, so Newton's method finds a globally optimal point if it converges at all.]

Newton's method:
    w ← 0
    repeat until convergence
        e ← solution to normal equations (X⊤ΩX) e = X⊤(y − s)
        w ← w + e
    Recall: Ω, s are fns of w

[Notice that this looks a lot like weighted least squares, but the weight matrix Ω and the right-hand-side vector y − s change every iteration. So we call it . . . ]
An example of iteratively reweighted least squares.
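[Not from the lecture: a minimal NumPy sketch of this iteratively reweighted least squares loop, assuming X includes the fictitious all-1's column and a fixed small number of iterations suffices.]

    import numpy as np

    def logistic_newton(X, y, iters=20):
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            si = 1.0 / (1.0 + np.exp(-X @ w))          # s_i = s(X_i . w), a function of w
            omega = si * (1.0 - si)                    # diagonal of Omega, also a function of w
            e = np.linalg.solve((X.T * omega) @ X,     # solve (X^T Omega X) e = X^T (y - s)
                                X.T @ (y - si))
            w += e
        return w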
[We need to be very careful with the analogy, though. The weights don’t have the same meaning they had when we learned weighted least-squares regression, because there is no Ω on the right-hand side of (X⊤ΩX) e = X⊤(y − s). Contrary to what you’d expect, a small weight in Ω causes the Newton iteration to put more emphasis on a point when it computes e.]
If point i is misclassified and far from decision boundary (so si(1 − si) is close to zero), point i has large influence! [That is, it contributes a lot to the step e. Correctly classified points far from the decision boundary contribute little to e because they’re happy where they are.]
[Here’s one more idea for speeding up logistic regression.] Idea: If n very large, save time by using a random subsample of the pts per iteration. Increase sample size as you go.
[The principle is that the first iteration isn’t going to take you all the way to the optimal point, so why waste time looking at all the sample points? Whereas the last iteration should be the most accurate one.]
LDA vs. Logistic Regression
Advantages of LDA:
– For well-separated classes, LDA stable; log. reg. surprisingly unstable
– > 2 classes easy & elegant; log. reg. needs modifying (softmax regression)
– LDA slightly more accurate when classes nearly normal, especially if n is small
Advantages of log. reg.:
– More emphasis on decision boundary
[Correctly classified points far from the decision boundary have a small effect on logistic regression, whereas misclassified points and points near the decision boundary have more say. By contrast, LDA gives all the sample points equal weight when fitting Gaussians to them.]
logregvsLDAuni.pdf [Logistic regression vs. LDA for a linearly separable data set with a very narrow margin. Logistic regression (center) always succeeds in separating linearly separable classes, because the cost function approaches zero for a correct linear separator. In this example, LDA (right) misclassifies some of the training points.]
– Less sensitive to misclassified points very far from decision boundary [whereas LDA and SVMs both get more and more distorted as one crazy misclassified point moves farther and farther from the decision boundary.]
– More robust on some non-Gaussian distributions (e.g., dists. w/large skew)
– Naturally fits labels between 0 and 1 [usually probabilities]
[When you use logistic regression with quadratic features, you get a quadric decision boundary, just as you do with QDA. Based on what I’ve said here, do you think logistic regression with quadratic features gives you exactly the same classifier as QDA?]
ROC CURVES (for test sets)
ROC.pdf
[This is a ROC curve. That stands for receiver operating characteristics, which is an awful name but we’re stuck with it for historical reasons.
A ROC curve is a way to evaluate your classifier after it is trained.
It is made by running a classifier on the test set or validation set.
It shows the rate of false positives vs. true positives for a range of settings.
We assume there is a knob we can turn to trade off these two types of error. For our purposes, that knob is the posterior probability threshold for Gaussian discriminant analysis or logistic regression.
However, neither axis of this plot is that knob.]
x-axis: “false positive rate = % of −ve classified as +ve”
y-axis: “true positive rate = % of +ve classified as +ve aka sensitivity”
“false negative rate”: vertical distance from curve to top [1− sensitivity]
“specificity”: horizontal distance from curve to right [1− false positive rate; “true negative rate”]
[You generate this curve by trying every probability threshold; for each threshold, measure the false positive & true positive rates and plot a point.]
upper right corner: “always classify +ve (Pr ≥ 0)”
lower left corner: “always classify −ve (Pr > 1)”
diagonal: “random classifiers”
[A rough measure of a classifier’s effectiveness is the area under the curve. For a classifier that is always correct, the area under the curve is one. For the random classifier, the area under the curve is 1/2, so you’d better do better than that.]
[IMPORTANT: In practice, the trade-off between false negatives and false positives is usually negotiated by choosing a point on this plot, based on real test data, and NOT by taking the choice of threshold that’s best in theory.]
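[Not from the lecture: a minimal NumPy sketch of generating the curve by sweeping the threshold, assuming posteriors are the classifier's predicted probabilities on a test/validation set and labels are 0 or 1.]

    import numpy as np

    def roc_curve(posteriors, labels):
        pos, neg = labels == 1, labels == 0
        fpr, tpr = [], []
        for t in np.linspace(0.0, 1.0, 101):           # try many probability thresholds
            predict_pos = posteriors >= t
            tpr.append(np.mean(predict_pos[pos]))      # % of +ve classified as +ve
            fpr.append(np.mean(predict_pos[neg]))      # % of -ve classified as +ve
        return np.array(fpr), np.array(tpr)            # plot tpr against fpr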
12 Statistical Justifications; the Bias-Variance Decomposition
STATISTICAL JUSTIFICATIONS FOR REGRESSION
[So far, I’ve talked about regression as a way to fit curves to points. Recall that early in the semester I divided machine learning into 4 levels: the application, the model, the optimization problem, and the optimization algorithm. My last two lectures about regression were at the bottom two levels: optimization. The cost functions that we optimize seem somewhat arbitrary. Today, let’s take a step up to the second level, the model. I will describe some models, how they lead to those optimization problems, and how they contribute to underfitting or overfitting.]
Typical model of reality:
– sample points come from unknown prob. distribution: Xi ∼ D
– y-values are sum of unknown, non-random fn + random noise:
    ∀Xi, yi = g(Xi) + εi,   εi ∼ D′,   D′ has mean zero
[We are positing that reality is described by a function g. We don’t know g, but g is not a random variable; it represents a consistent relationship between X and y that we can estimate. We add to g a random variable ε, which represents measurement errors and all the other sources of statistical error when we measure real- world phenomena. Notice that the noise is independent of X. That’s a pretty big assumption, and often it does not apply in practice, but that’s all we’ll have time to deal with this semester. Also notice that this model leaves out systematic errors, like when your measuring device adds one to every measurement, because we usually can’t diagnose systematic errors from data alone.]
Goal of regression: find h that estimates g.
Ideal approach: choose h(x) = EY [Y|X = x] = g(x) + E[ε] = g(x)
[If this expectation exists at all, it partly justifies our model of reality. We can retroactively define g to be
this expectation.]
[Draw figure showing example g, distribution for a fixed x.]
Least-Squares Regression from Maximum Likelihood
Suppose εi ∼ N(0, σ2); then yi ∼ N(g(Xi), σ2).
Recall that the log likelihood of the normal PDF is

    ln f(yi) = −(yi − μ)2/(2σ2) − constant      ⇐ μ = g(Xi)

    l(g; X, y) = ln (f(y1) f(y2) · · · f(yn)) = ln f(y1) + . . . + ln f(yn) = −(1/(2σ2)) Σi (yi − g(Xi))2 − constant

Takeaway: Max likelihood on “parameter” g ⇒ estimate g by least-squares regression

[We treat g as a “distribution parameter.” MLE tells us to choose a g that minimizes Σi (yi − g(Xi))2.]
[So if the noise is normally distributed, maximum likelihood justifies using the least-squares cost function.]
[However, I've told you in previous lectures that least-squares is very sensitive to outliers. If the error is truly normally distributed, that's not a big deal, especially when you have a lot of sample points. But in the real world, the distribution of outliers often isn't normal. Outliers might come from wrongly measured measurements, data entry errors, anomalous events, or just not having a normal distribution. When you have a heavy-tailed distribution, for example, least-squares isn't a good choice.]
Empirical Risk
The risk for hypothesis h is expected loss R(h) = E[L] over all x ∈ Rd, y ∈ R.
Discriminative model: we don’t know X’s dist. D. How can we minimize risk?
[If we have a generative model, we can estimate the joint probability distribution for X and Y and derive the expected loss. That’s what we did for Gaussian discriminant analysis. But today I’m assuming we don’t have a generative model, so we don’t know those probabilities. Instead, we’re going to approximate the distribution in a very crude way: we pretend that the sample points are the distribution.]
Empirical distribution: the discrete uniform distribution over the sample pts
Empirical risk: expected loss under empirical distribution

    Rˆ(h) = (1/n) Σ_{i=1}^n L(h(Xi), yi)
[The hat on the R indicates it’s only a cheap approximation of the true, unknown statistical risk we really want to minimize. Often, this is the best we can do. For many but not all distributions, it converges to the true risk in the limit as n → ∞. Choosing h that minimizes Rˆ is called empirical risk minimization.]
Takeaway: this is why we [usually] minimize the sum of loss fns.
Logistic Loss from Maximum Likelihood
What cost fn should we use for probabilities?
Actual probability pt Xi is in the class is yi; predicted prob. is h(Xi).
Imagine β duplicate copies of Xi: yi β are in the class, (1 − yi) β are not.
[The size of β isn’t very important, but imagine that yi β and (1 − yi) β are both integers for all i.]
[If we use maximum likelihood estimation to choose the weights most likely to generate this sequence of samples and labels, we get the following likelihood.]
Likelihood is L(h; X, y) = Π_{i=1}^n h(Xi)^{yi β} (1 − h(Xi))^{(1−yi) β}

Log likelihood is l(h) = ln L(h)
                       = β Σi [ yi ln h(Xi) + (1 − yi) ln(1 − h(Xi)) ]
                       = −β Σi L(h(Xi), yi)      ← logistic loss fn

Max log likelihood ⇒ minimize Σi L(h(Xi), yi).
[So the principle of maximum likelihood explains where the weird logistic loss function comes from.]
THE BIAS-VARIANCE DECOMPOSITION
There are 2 sources of error in a hypothesis h:
bias:     error due to inability of hypothesis h to fit g perfectly
          e.g., fitting quadratic g with a linear h
variance: error due to fitting random noise in data
          e.g., we fit linear g with a linear h, yet h ≠ g.
Model:  Xi ∼ D, εi ∼ D′, yi = g(Xi) + εi   [remember that D′ has mean zero]
        fit hypothesis h to X, y
        Now h is a random variable; i.e., its weights are random
Consider arbitrary pt z ∈ Rd (not necessarily a sample pt!) & γ = g(z) + ε, ε ∼ D′
[So z is arbitrary, whereas γ is random.]
Note: E[γ] = g(z); Var(γ) = Var(ε) [the mean comes from g, and the variance comes from ε]
Risk fn when loss = squared error: R(h) = E[L(h(z), γ)]
↑ take expectation over possible training sets X, y & values of γ
[Stop and take a close look at this expectation. Remember that the hypothesis h is a random variable. We are taking a mean over the probability distribution of hypotheses. That seems pretty weird if you’ve never seen it before. But remember, the training data X and y come from probability distributions. We use the training data to choose weights, so the weights that define h also come from some probability distribution. It might be hard to work out what that distribution is, but it exists. This “E[·]” is integrating the loss over all possible values of the weights.]
= E[(h(z) − γ)2]
= E[h(z)2] + E[γ2] − 2 E[γ h(z)] [Observe that γ and h(z) are independent]
= Var(h(z)) + E[h(z)]2 + Var(γ) + E[γ]2 − 2E[γ] E[h(z)]
= (E[h(z)] − E[γ])2 + Var(h(z)) + Var(γ)
= (E[h(z)] − g(z))2 + Var(h(z)) + Var(ε)
    ↑ bias2 of method    ↑ variance of method    ↑ irreducible error
[This is called the bias-variance decomposition of the risk function. Let’s look at an intuitive interpretation of these three parts.]
Bias, Variance, Noise
[In this example, we’re trying to fit a sine wave with lines, which obviously aren’t going to be accurate. At left, we have generated 50 different hypotheses (lines). At upper right, the red line is the expected hypothesis—an average over infinitely many hypotheses. The black curve illustrates test points on the true function g. We see that most test points have a large bias (difference between the black and red curves), because lines don’t fit sine waves well. However, some of the test points happen to have a small bias—where the sine wave crosses the red line. At center right, the variance is the expected squared difference between a random black line and the red line. At lower right, the irreducible error is the expected squared difference between a random test point and the sine wave.]
This is pointwise version [of the bias-variance decomposition.]
Mean version: let z ∼ D be random variable; take mean over D of bias2, variance.
bvn.pdf
[So you can decompose one test point’s error into these three components, or you can decompose the error of the hypothesis over its entire range into three components, which tells you roughly how big they’ll be on a large test set.]
[Now I will write down a list of consequences of what we've just learned.]
– Underfitting = too much bias
– Most overfitting caused by too much variance
– Training error reflects bias but not variance; test error reflects both
  [which is why low training error can fool you when you've overfitted]
– For many distributions, variance → 0 as n → ∞
– If h can fit g exactly, for many distributions bias → 0 as n → ∞
– If h cannot fit g well, bias is large at “most” points
– Adding a good feature reduces bias; adding a bad feature rarely increases it
– Adding a feature usually increases variance [don't add a feature unless it reduces bias more]
– Can't reduce irreducible error [hence its name]
– Noise in test set affects only Var(ε); noise in training set affects only bias & Var(h)
– For real-world data, g is rarely knowable [& noise model might be wrong]
  [so we can't actually put numbers to the bias-variance decomposition on real-world data]
– But we can test learning algs by choosing g & making synthetic data [a small simulation sketch follows]
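[Not from the lecture: a minimal NumPy sketch of such a synthetic test. Here g is a sine wave (an assumption, chosen only for illustration) and h is a least-squares line; we estimate the bias2 and variance of h(z) at one test point z by Monte Carlo over many random training sets.]

    import numpy as np

    rng = np.random.default_rng(0)
    g = np.sin
    z, sigma, n, trials = 1.0, 0.3, 20, 1000

    predictions = []
    for _ in range(trials):
        X = rng.uniform(0, 2 * np.pi, n)
        y = g(X) + rng.normal(0, sigma, n)              # labels = g(X) + noise
        A = np.column_stack([X, np.ones(n)])            # design matrix with fictitious dimension
        w = np.linalg.lstsq(A, y, rcond=None)[0]        # least-squares fit
        predictions.append(w[0] * z + w[1])             # h(z) for this random training set

    predictions = np.array(predictions)
    bias_sq  = (predictions.mean() - g(z)) ** 2         # (E[h(z)] - g(z))^2
    variance = predictions.var()                        # Var(h(z))
    print(bias_sq, variance, sigma ** 2)                # third term is the irreducible error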
splinefit.pdf, biasvarspline.pdf (ISL, Figures 2.9 and 2.12)
[At left, a data set is fit with splines having various degrees of freedom. The synthetic data is taken from the black curve with added noise. At center, we plot training error (gray) and test error (red) as a function of the number of degrees of freedom. At right, we plot the squared test error as a sum of squared bias (blue) and variance (orange). As the number of degrees of freedom increases, the training and test errors both decrease up to degree 6 because the bias decreases, but for higher degrees the test error increases because the variance increases.]
Example: Least-Squares Linear Reg.
For simplicity, no fictitious dimension.
[This implies that our linear regression function has to be zero at the origin.]
Model: g(z) = v⊤z (reality is linear)
[So we could fit g perfectly with a linear h if not for the noise in the training set.] Let e be noise n-vector, ei ∼ N(0, σ2)
Training labels: y = Xv + e
[X & y are the inputs to linear regression. We don’t know v or e.]
Lin. reg. computes weights
    w = X+y = X+(Xv + e) = v + X+e      [X+e is the noise in the weights]
[We want w = v, but the noise in y becomes noise in w.]
BIAS is E[h(z)] − g(z) = E[w⊤z] − v⊤z = E[z⊤X+e] = z⊤E[X+]E[e] = 0
Warning: This does not mean h(z) − g(z) is always 0!
Sometimes +ve, sometimes −ve, mean over training sets is 0. [Those deviations from the mean are captured in the variance.]
[When the bias is zero, a perfect fit is possible. But when a perfect fit is possible, not all learning methods give you a bias of zero; here it’s a benefit of the squared error loss function. With a different loss function, we might have a nonzero bias even fitting a linear h to a linear g.]
VARIANCE is Var(h(z)) = Var(w⊤z) = Var(z⊤v + z⊤X+e) = Var(z⊤X+e)
[This is the dot product of a vector zT X+ with an isotropic, normally distributed vector e. The dot product reduces it to a one-dimensional Gaussian along the direction zT X+, so this variance is just the variance of the 1D Gaussian times the squared length of the vector zT X+.]
    = σ2 ∥z⊤X+∥2 = σ2 z⊤(X⊤X)−1X⊤ X(X⊤X)−1 z
    = σ2 z⊤(X⊤X)−1z
If we choose coordinate system so E[X] = 0, then X⊤X → n Cov(D) as n → ∞, so one can show that for z ∼ D,
    Var(h(z)) ≈ σ2 d/n
[where d is the dimension—the number of features per sample point.]
Takeaways: Bias can be zero when hypothesis function can fit the real one! [This is a nice property of the squared error loss function.]
Variance portion of RSS (overfitting) decreases as 1/n (sample points), increases as d (features)
or O(dp) if you use degree-p polynomials.
[I’ve used linear regression because it’s a relatively simple example. But the bias-variance trade-off applies to many learning algorithms, including classification as well as regression. But for most learning algorithms, the math gets a lot more complicated than this, if you can do it at all. Sometimes there are multiple competing bias-variance models and no consensus on which is the right one.]
13 Shrinkage: Ridge Regression, Subset Selection, and Lasso
RIDGE REGRESSION aka Tikhonov Regularization
(1) + (A) + l2 penalized mean loss (d).

Find w that minimizes ∥Xw − y∥2 + λ ∥w′∥2 = J(w)
where w′ is w with component α replaced by 0.
X has fictitious dimension but we DON'T penalize α.
Adds a regularization term, aka a penalty term, for shrinkage: to encourage small ∥w′∥. Why?
– Guarantees positive definite normal eq’ns; always unique solution.
[Standard least-squares linear regression yields singular normal equations when the sample points lie on a common hyperplane in feature space.] E.g., when d > n.
[The cost function J(w) with and without regularization.]
[At left, we see a quadratic form for a positive semidefinite cost function associated with least-squares regression. This cost function has many minima, and the regression problem is said to be ill-posed. By adding a small penalty term, we obtain a positive definite quadratic form (right), which has one unique minimum. The term “regularization” implies that we are turning an ill-posed problem into a well-posed problem.]
[That was the original motivation, but the next has become more important in machine learning . . . ]
– Reduces overfitting by reducing variance. Why?
Imagine: 500 x1 − 500 x2 is best fit for well-separated points with yi ∈ [0, 1]. Small change in x ⇒ big change in y!
[Given that all the y values in the data are small and the x values are not, it’s a sure sign of overfitting if tiny changes in x cause huge changes in y.]
So we penalize large weights.
[This use of regularization is closely related to the first one. When you have large variance and a lot of overfitting, it implies that your problem is close to being ill-posed, even though technically it might be well-posed.]
ridgequad.png
ridgeterms.pdf (ISL, Figure 6.7) [In this plot of weight space, βˆ (read as “wˆ”) is the least-squares solution. The red ellipses are the isocontours of ∥Xw − y∥2. The isocontours of ∥w′∥ are circles centered at the origin (blue). The solution lies where a red isocontour just touches a blue isocontour tangentially. As λ increases, the solution will occur at a more outer red isocontour and a more inner blue isocontour. This helps to reduce overfitting.]

Setting ∇J = 0 gives normal eq'ns (X⊤X + λI′) w = X⊤y
where I′ is identity matrix w/bottom right set to zero. [Don't penalize the bias term α.]
Algorithm: Solve for w. Return h(z) = w⊤z.
Increasing λ ⇒ more regularization; smaller ∥w′∥
Recall [from the previous lecture] our data model y = Xv + e, where e is noise.
Variance of ridge regr. is Var(z⊤(X⊤X + λI′)−1X⊤e).
As λ → ∞, variance → 0, but bias increases.

ridgebiasvar.pdf (ISL, Figure 6.5) [Plot of bias2 & variance as λ increases.]

[So, as usual for the bias-variance trade-off, the test error as a function of λ is a U-shaped curve. We find the bottom by validation.]
λ is a hyperparameter; tune by (cross-)validation.
Ideally, features should be “normalized” to have same variance.
Alternative: use asymmetric penalty by replacing I′ w/other diagonal matrix. [For example, if you use polynomial features, you could use different penalties for monomials of different degrees.]
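[Not from the lecture: a minimal NumPy sketch of the ridge normal equations, assuming X includes the fictitious all-1's column as its last column.]

    import numpy as np

    def fit_ridge(X, y, lam):
        d1 = X.shape[1]
        Iprime = np.eye(d1)
        Iprime[-1, -1] = 0.0                  # don't penalize the bias term alpha
        # solve (X^T X + lambda I') w = X^T y; always positive definite for lambda > 0
        return np.linalg.solve(X.T @ X + lam * Iprime, X.T @ y)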
Bayesian Justification for Ridge Reg.
Assign a prior probability on w′: w′ ∼ N(0, σ2). Apply MLE to the posterior prob.
[This prior probability says that we think weights close to zero are more likely to be correct.]

Bayes' Theorem: posterior f(w|X, y) = f(y|X, w) × prior f(w′) / f(y|X) = L(w) f(w′) / f(y|X)

Maximize log posterior = ln L(w) + ln f(w′) − const
                       = −const ∥Xw − y∥2 − const ∥w′∥2 − const
⇒ Minimize ∥Xw − y∥2 + λ ∥w′∥2
This method (using likelihood, but maximizing posterior) is called maximum a posteriori (MAP). [A prior probability on the weights is another way to understand regularizing ill-posed problems.]
FEATURE SUBSET SELECTION
[Some of you may have noticed as early as Homework 1 that you can sometimes get better performance on a spam classifier simply by dropping some useless features.]
All features increase variance, but not all features reduce bias.
Idea: Identify poorly predictive features, ignore them (weight zero).
Less overfitting, smaller test errors.
2nd motivation: Inference. Simpler models convey interpretable wisdom. Useful in all classification & regression methods.
Sometimes it’s hard: Different features can partly encode same information.
Combinatorially hard to choose best feature subset.
Alg: Best subset selection. Try all 2d − 1 nonempty subsets of features. [Train one classifier per subset.] Choose the best classifier by (cross-)validation. Slow.
[Obviously, best subset selection isn’t feasible if we have a lot of features. But it gives us an “ideal” algorithm to compare practical algorithms with. If d is large, there is no algorithm that’s guaranteed to find the best subset and that runs in acceptable time. But heuristics often work well.]
Heuristic 1: Forward stepwise selection.
Start with null model (0 features); repeatedly add best feature until validation errors start increasing (due to overfitting) instead of decreasing. At each outer iteration, inner loop tries every feature & chooses the best by validation. Requires training O(d2) models instead of O(2d).
Not perfect: e.g., won’t find the best 2-feature model if neither of those
features yields the best 1-feature model. [That’s why it’s a heuristic.]
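[Not from the lecture: a minimal Python sketch of forward stepwise selection. The helper train_and_validate(subset) is hypothetical; it is assumed to train a model on that feature subset and return its validation error.]

    def forward_stepwise(features, train_and_validate):
        chosen, best_err = [], float("inf")
        remaining = list(features)
        while remaining:
            # inner loop: try adding each remaining feature, keep the best by validation
            err, f = min((train_and_validate(chosen + [f]), f) for f in remaining)
            if err >= best_err:              # validation error stopped improving; stop
                break
            chosen.append(f)
            remaining.remove(f)
            best_err = err
        return chosen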
Heuristic 2: Backward stepwise selection.
Start with all d features; repeatedly remove feature whose removal gives best reduction in validation error.
Also trains O(d2) models.
Additional heuristic: Only try to remove features with small weights. Q: small relative to what?
Recall: variance of least-squ. regr. is proportional to σ2(X⊤X)−1
z-score of weight wi is zi = wi / (σ √vi) where vi is the ith diagonal entry of (X⊤X)−1.
Small z-score hints “true” wi could be zero.
[Forward stepwise is a better choice when you suspect only a few features will be good predictors; e.g., spam. Backward stepwise is better when most features are important. If you’re lucky, you’ll stop early.]
LASSO (Robert Tibshirani, 1996)
Regression w/regularization: (1) + (A) + l1 penalized mean loss (e).
“Least absolute shrinkage and selection operator”
[This is a regularized regression method similar to ridge regression, but it has the advantage that it often naturally sets some of the weights to zero.]
Find w that minimizes ∥Xw − y∥2 + λ ∥w′∥1      where ∥w′∥1 = Σ_{i=1}^d |wi|      (Don't penalize α.)

Recall ridge regr.: isosurfaces of ∥w′∥2 are hyperspheres.
The isosurfaces of ∥w′∥1 are cross-polytopes.
The unit cross-polytope is the convex hull of all the positive & negative unit coordinate vectors.
crosspolys.png [Draw this figure by hand. You get larger and smaller cross-polytope isosurfaces by scaling these.]

lassoridge.pdf [Isocontours of the terms of the objective function for the Lasso appear at left. Compare with the ridge regression isocontours at right.]
[The red ellipses are the isocontours of ∥Xw − y∥2, and the least-squares solution lies at their center. The isocontours of ∥w′∥1 are diamonds centered at the origin (blue). The solution lies where a red isocontour just touches a blue diamond. What's interesting here is that in this example, the red isocontour touches just the tip of the diamond. So the weight w1 gets set to zero. That's what we want to happen to weights that don't have enough influence. This doesn't always happen—for instance, the red isosurface could touch a side of the diamond instead of a tip of the diamond.]
[When you go to higher dimensions, you might have several weights set to zero. For example, in 3D, if the red isosurface touches a sharp vertex of the cross-polytope, two of the three weights get set to zero. If it touches a sharp edge of the cross-polytope, one weight gets set to zero. If it touches a flat side of the cross-polytope, no weight is zero.]
lassoweights.pdf (ISL, Figure 6.6) [Weights as a function of λ.]
[This shows the weights for a typical linear regression problem with about 10 variables. You can see that as lambda increases, more and more of the weights become zero. Only four of the weights are really useful for prediction; they're in color. Statisticians used to choose λ by looking at a chart like this and trying to eyeball a spot where there aren't too many predictors and the weights aren't changing too fast. But nowadays they prefer validation.]

Sometimes sets some weights to zero, especially for large λ.
Algs: subgradient descent, least-angle regression (LARS), forward stagewise
[Lasso can be reformulated as a quadratic program, but it's a quadratic program with 2^d constraints, because a d-dimensional cross-polytope has 2^d facets. In practice, special-purpose optimization methods have been developed for Lasso. I'm not going to teach you one, but if you need one, look up the last two of these algorithms. LARS is built into the R Programming Language for statistics.]
[As with ridge regression, you should probably normalize the features first before applying Lasso.]
14 Decision Trees

DECISION TREES

Nonlinear method for classification and regression.
Uses tree with 2 node types:
– internal nodes test feature values (usually just one) & branch accordingly
– leaf nodes specify class h(x)
dectree.pdf [Draw this by hand. Deciding whether to go out for a picnic. The root tests the Outlook (x1): if sunny, test the Humidity (x2) and go out (yes) only if x2 ≤ 75%; if overcast, yes; if rain, test the Wind (x3) and go out only if x3 ≤ 20. At right, the corresponding partition of (x1, x2)-space into rectangular cells.]

– Cuts x-space into rectangular cells
– Works well with both categorical and quantitative features
– Interpretable result (inference)
– Decision boundary can be arbitrarily complicated

treelinearcompare.pdf (ISL, Figure 8.7) [Comparison of linear classifiers (left) vs. decision trees (right) on 2 examples.]
Consider classification first. Greedy, top-down learning heuristic:
[This algorithm is more or less obvious, and has been rediscovered many times. It’s naturally recursive. I’ll show how it works for classification first; later I’ll talk about how it works for regression.]
Let S ⊆ {1, 2, . . . , n} be set of sample point indices. Top-level call: S = {1, 2, . . . , n}.

GrowTree(S)
    if (yi = C for all i ∈ S and some class C) then {
        return new leaf(C)                          [We say the leaves are pure]
    } else {
        choose best splitting feature j and splitting value β   (*)
        Sl = {i : Xij < β}                          [Or you could use ≤ and >]
        Sr = {i : Xij ≥ β}
        return new node(j, β, GrowTree(Sl), GrowTree(Sr))
    }
(*) How to choose best split?
– Try all splits. [All features, and all splits within a feature.] – ForasetS,letJ(S)bethecostofS.
– Choose the split that minimizes J(Sl) + J(Sr); or,
  the split that minimizes weighted average (|Sl| J(Sl) + |Sr| J(Sr)) / (|Sl| + |Sr|).
[Here, I’m using the vertical brackets | · | to denote set cardinality.]
How to choose cost J(S )?
[I’m going to start by suggesting a mediocre cost function, so you can see why it’s mediocre.]
Idea 1 (bad):
    Label S with the class C that labels the most points in S.
    J(S) ← # of points in S not in class C.

badcost.pdf [Draw this by hand. A set S with 20 C's and 10 D's, so J(S) = 10. Left split (on x1): Sl has 10 C, 9 D, J(Sl) = 9; Sr has 10 C, 1 D, J(Sr) = 1. Right split (on x2): Sl has 10 C, 5 D, J(Sl) = 5; Sr has 10 C, 5 D, J(Sr) = 5.]

Problem: J(Sl) + J(Sr) = 10 for both splits, but left split much better. Weighted avg prefers right split!
[There are many different splits that all have the same total cost. We want a cost function that better distinguishes between them.]
Idea 2 (good): Measure the entropy. [An idea from information theory.]

Let Y be a random class variable, and suppose P(Y = C) = pC.
The surprise of Y being class C is − log2 pC.       [Always nonnegative.]
– event w/prob. 1 gives us zero surprise.
– event w/prob. 0 gives us infinite surprise!
[In information theory, the surprise is equal to the expected number of bits of information we need to transmit which events happened to a recipient who knows the probabilities of the events. Often this means using fractional bits, which may sound crazy, but it makes sense when you're compiling lots of events into a single message; e.g., a sequence of biased coin flips.]

The entropy of an index set S is the average surprise
    H(S) = − ΣC pC log2 pC,   where pC = |{i ∈ S : yi = C}| / |S|.   [The proportion of points in S that are in class C.]

If all points in S belong to same class? H(S) = −1 log2 1 = 0.
Half class C, half class D? H(S) = −0.5 log2 0.5 − 0.5 log2 0.5 = 1.
n points, all different classes? H(S) = −n · (1/n) log2 (1/n) = log2 n.

[The entropy is the expected number of bits of information we need to transmit to identify the class of a sample point in S chosen uniformly at random. It makes sense that it takes 1 bit to specify C or D when each class is equally likely. And it makes sense that it takes log2 n bits to specify one of n classes when each class is equally likely.]
entropy.pdf [Plot of the entropy H(pC) when there are only two classes. The probability of the second class is pD = 1 − pC, so we can plot the entropy with just one dependent variable. If you have > 2 classes, you would need a multidimensional chart to graph the entropy, but the entropy is still strictly concave.]
78 Jonathan Richard Shewchuk WeightedavgentropyaftersplitisHafter = |Sl|H(Sl)+|Sr|H(Sr).
|Sl|+|Sr| Choose split that maximizes information gain H(S ) − Hafter.
x1
H(Sl)=−10 lg10 − 9 lg 9 0.998 19 19 19 19
Hafter = 0.793
[Draw this by hand. ]
[Which is just the same as minimizing Hafter.] H(S) = −20 lg 20 − 10 lg 10 0.918
30 30 30 30
10 C 9 D
10 C 1 D
H(Sr)=−10 lg10 − 1 lg 1 0.439 11 11 11 11
info gain = 0.125 Info gain always positive except when one child is empty or
for all C, P(yi = C|i ∈ Sl) = P(yi = C|i ∈ Sr). [Recall the graph of the entropy.]
[Which is the case for the second split we considered.]
H(pC) 1
0.5
0
0
entropy: strictly concave
% misclassified: concave, not strict 50%
%(parent) = %after
infogain.pdf
H(parent) }info gain
Hafter
pC 0% [Draw this by hand on entropy.pdf.
pC
0.2 0.4 0.6 0.8 1
0
0.2 0.4 0.6 0.8 1
]
20 C 10 D
[Suppose we pick two points on the entropy curve, then draw a line segment connecting them. Because the entropy curve is strictly concave, the interior of the line segment is strictly below the curve. Any point on that segment represents a weighted average of the two entropies for suitable weights. If you unite the two sets into one parent set, the parent set’s value pC is the weighted average of the children’s pC’s. Therefore, the point directly above that point on the curve represents the parent’s entropy. The information gain is the vertical distance between them. So the information gain is positive unless the two child sets both have exactly the same pC and lie at the same point on the curve.]
[On the other hand, for the graph on the right, plotting the % misclassified, if we draw a line segment connecting two points on the curve, the segment might lie entirely on the curve. In that case, uniting the two child sets into one, or splitting the parent set into two, changes neither the total misclassified sample points nor the weighted average of the % misclassified. The bigger problem, though, is that many different splits will get the same weighted average cost; this test doesn’t distinguish the quality of different splits well.]
[By the way, the entropy is not the only function that works well. Many concave functions work fine, including the simple polynomial p(1 − p).]
concave.png
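[Not from the lecture: a small Python sketch of the entropy, the information gain, and a simple best-split search over one quantitative feature. This plain version recomputes the entropy from scratch at each split; a real implementation would update the counts in O(1) per point, as discussed below.]

    import numpy as np
    from collections import Counter

    def entropy(labels):
        # H(S) = -sum_C p_C log2 p_C over the classes present in `labels`
        n = len(labels)
        return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

    def info_gain(left, right):
        parent = np.concatenate([left, right])
        n_l, n_r = len(left), len(right)
        h_after = (n_l * entropy(left) + n_r * entropy(right)) / (n_l + n_r)
        return entropy(parent) - h_after

    def best_split(x, y):
        # try splitting between each pair of unequal consecutive values of feature x
        order = np.argsort(x)
        x, y = x[order], y[order]
        best = (-1.0, 0.0)
        for i in range(1, len(x)):
            if x[i] == x[i - 1]:
                continue
            beta = (x[i - 1] + x[i]) / 2
            best = max(best, (info_gain(y[:i], y[i:]), beta))
        return best                  # (info gain, splitting value beta)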
More on choosing a split:
– For binary feature xi: children are xi = 0 & xi = 1.
– If xi has 3+ discrete values: split depends on application.
[Sometimes it makes sense to use multiway splits; sometimes binary splits.]
– If xi is quantitative: sort xi values in S ; try splitting between each pair of unequal consecutive values.
[We can radix sort the points in linear time, and if n is huge we should.]
Clever bit: As you scan sorted list from left to right, you can update entropy in O(1) time per point!6
[This is important for obtaining a fast tree-building time.]
[Draw a row of C's and X's; show how we update the # of C's and # of X's in each of Sl and Sr as we scan from left to right.]

scan.pdf [Running counts of C's and X's in Sl and Sr at each potential split of the sorted row X C X X C.]
Algs & running times:
– Classify test point: Walk down tree until leaf. Return its label.
Worst-case time is O(tree depth).
For binary features, that’s ≤ d. [Quantitative features may go deeper.] Usually (not always) ≤ O(log n).
– Training: For binary features, try O(d) splits at each node.
For quantitative features, try O(n′d) splits; n′ = points in node
Either way ⇒ O(n′d) time at this node
[Quantitative features are asymptotically just as fast as binary features because of our clever way of computing the entropy for each split.]
Each point participates in O(depth) nodes, costs O(d) time in each node.
Running time ≤ O(nd depth).
[As nd is the size of the design matrix X, and the depth is often logarithmic, this is a surprisingly reasonable running time.]
⁶ Let C be the number of class C sample points to the left of a potential split and c be the number to the right of the split. Let D be the number of class not-C points to the left of the split and d be the number to the right of the split. Update C, c, D, and d at each split (in O(1) time per split) as you move from left to right. At each potential split, calculate the entropy of the left set as −(C/(C+D)) log₂(C/(C+D)) − (D/(C+D)) log₂(D/(C+D)) and the entropy of the right set as −(c/(c+d)) log₂(c/(c+d)) − (d/(c+d)) log₂(d/(c+d)). Note: log 0 is undefined, but this formula works if we use the convention 0 log 0 = 0.
It follows that the weighted average of the two entropies is −(1/n′) (C log₂(C/(C+D)) + D log₂(D/(C+D)) + c log₂(c/(c+d)) + d log₂(d/(c+d))), where n′ = C + D + c + d is the total number of sample points stored in this treenode. Choose the split that minimizes this weighted average.
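[Below is a small Python sketch, not from the lecture, of this scanning trick for one quantitative feature; it assumes two classes, and names like best_split are mine.]

import numpy as np

def entropy(c, d):
    # entropy of a set with c points of class C and d points of the other class
    h = 0.0
    for k in (c, d):
        if k > 0:                                  # convention: 0 log 0 = 0
            p = k / (c + d)
            h -= p * np.log2(p)
    return h

def best_split(x, y):
    # x: one quantitative feature; y: booleans (True = class C)
    order = np.argsort(x)                          # radix sort would give linear time for huge n
    x, y = x[order], y[order]
    n = len(x)
    C, D = 0, 0                                    # class counts to the left of the split
    c, d = int(y.sum()), int(n - y.sum())          # class counts to the right of the split
    best_t, best_h = None, np.inf
    for i in range(n - 1):                         # move point i from the right set to the left set
        if y[i]:
            C += 1; c -= 1
        else:
            D += 1; d -= 1
        if x[i] == x[i + 1]:
            continue                               # split only between unequal consecutive values
        h_after = ((C + D) * entropy(C, D) + (c + d) * entropy(c, d)) / n
        if h_after < best_h:
            best_t, best_h = (x[i] + x[i + 1]) / 2, h_after
    return best_t, best_h                          # threshold and its weighted average entropy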
15 More Decision Trees, Ensemble Learning, and Random Forests
DECISION TREES (continued)
[Last lecture, I taught you the vanilla algorithms for building decision trees and using them to classify test points. There are many variations on this basic algorithm; I’ll discuss a few now.]
Multivariate Splits
Find non-axis-aligned splits with other classification algs or by generating them randomly.
[An example where an ordinary decision tree needs many splits to ap- proximate a diagonal linear decision boundary, but a single multivariate split takes care of it.]
[Here you can use other classification algorithms such as SVMs, logistic regression, and Gaussian discrim- inant analysis. Decision trees permit these algorithms to find more complicated decision boundaries by making them hierarchical.]
May gain better classifier at cost of worse interpretability or speed.
[Standard decision trees are very fast because that they check only one feature at each treenode. But if there are hundreds of features, and you have to check all of them at every level of the tree to classify a point, it slows down classification a lot. So it sometimes pays to consider methods like forward stepwise selection when you’re learning so that when you classify, you only need to check a few features at each treenode.] Can limit # of features per split: forward stepwise selection, Lasso.
multivariate.pdf
Decision Tree Regression
Creates a piecewise constant regression fn.
regresstree.pdf, regresstreefn.pdf (ISL, Figure 8.3) [Decision tree regression: splits of the form X1 ≤ t1, X2 ≤ t2, X1 ≤ t3, X2 ≤ t4 partition the (X1, X2) plane into regions R1–R5, and the regression fn is constant on each region.]
Cost J(S) = (1/|S|) Σ_{i∈S} (yi − μS)², where μS is the mean label yi for sample pts i ∈ S.
[So if all the points in a node have the same y-value, then the cost is zero.]
[We choose the split that minimizes the weighted average of the costs of the children after the split.]
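[A tiny Python sketch, mine rather than the lecture's, of this cost and the weighted cost of a split; the split minimizing weighted_split_cost over candidate features and thresholds is chosen.]

import numpy as np

def cost(y):
    # J(S) = (1/|S|) sum over i in S of (y_i - mu_S)^2; zero if all labels in the node are equal
    return float(np.mean((y - np.mean(y)) ** 2)) if len(y) else 0.0

def weighted_split_cost(y_left, y_right):
    n = len(y_left) + len(y_right)
    return (len(y_left) * cost(y_left) + len(y_right) * cost(y_right)) / n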
Stopping Early
[The basic version of the decision tree algorithm keeps subdividing treenodes until every leaf is pure. We don’t have to do that; sometimes we prefer to stop subdividing treenodes earlier.]
Why?
– Limit tree depth (for speed)
– Limit tree size (big data sets)
– Complete tree may overfit
– Given noise or overlapping distributions, purity of leaves is counterproductive; better to estimate
posterior probs
[When you have overlapping class distributions, refining the tree down to one sample point per leaf is absolutely guaranteed to overfit. It’s better to stop early, then classify each leaf node by taking a vote of its sample points. Alternatively, you can use the points to estimate a posterior probability for each leaf, and return that. If there are many points in each leaf, the posterior probabilities might be reasonably accurate.]
[In the decision tree at left, each leaf has multiple classes. Instead of returning the majority class, each leaf could return a posterior probability histogram, as illustrated at right.]
How? Select stopping condition(s):
– Next split doesn't reduce entropy/error enough (dangerous; pruning is better)
– Most of node's points (e.g., > 95%) have same class [to deal with outliers]
– Node contains few sample points (e.g., < 10)
– Cell’s edges are all tiny
– Depth too great [risky if there are still many points in the cell]
– Use validation to compare
[The last is the slowest but most effective way to know when to stop: use validation to decide whether splitting the node is a win on the validation data. But if your goal is to avoid overfitting, it’s generally even more effective to grow the tree a little too large and then use validation to prune it back. We’ll talk about that next.]
Leaves with multiple points return
– a majority vote or class posterior probs (classification) or
– an average (regression).
Pruning
Grow tree too large; greedily remove each split whose removal improves validation performance. More reliable than stopping early.
[We have to do validation once for each split that we’re considering removing. But you can do that pretty cheaply. What you don’t do is reclassify every sample point from scratch. Instead, you keep track of which points in the validation set end up at which leaf. When you are deciding whether to remove a split, you just look at the validation points in the two leaves you’re thinking of removing, and see how they will be reclassified and how that will change the error rate. You can do this very quickly.]
[The reason why pruning often works better than stopping early is because often a split that doesn’t seem to make much progress is followed by a split that makes a lot of progress. If you stop early, you’ll never find out. Pruning is a simple idea, but it’s highly recommended when you have enough time to build and prune the tree.]
leaf.pdf
[Figure detail: the pruned tree's splits use Years (threshold 4.5) and Hits (threshold 117.5).]
[At left, a plot of decision tree size vs. errors for baseball hitter data. At right, the best decision tree has three leaves.
Players’ salaries: R1 = $165,174, R2 = $402,834, R3 = $845,346.]
prunehitters.pdf, prunedhitters.pdf (ISL, Figures 8.5 & 8.2)
[In this example, a 10-node decision tree was constructed to predict the salaries of baseball players, based on their years in the league and average hits per season. Then the tree was pruned by validation. The best decision tree on the validation data turned out to have just three leaves.]
ENSEMBLE LEARNING
Decision trees are fast, simple, interpretable, easy to explain, invariant under scaling/translation, robust to irrelevant features.
But not the best at prediction. [Compared to previous methods we’ve seen.] High variance.
[For example, suppose we take a training data set, split it into two halves, and train two decision trees, one on each half of the data. It’s not uncommon for the two trees to turn out very different. In particular, if the two trees pick different features for the very first split at the top of the tree, then it’s quite common for the trees to be completely different. So decision trees tend to have high variance.]
[So let’s think about how to fix this. As an analogy, imagine that you are generating random numbers from some distribution. If you generate just one number, it might have high variance. But if you generate n numbers and take their average, then the variance of that average is n times smaller. So you might ask yourself, can we reduce the variance of decision trees by taking an average answer of a bunch of decision trees? Yes we can.]
[James Surowiecki’s book “The Wisdom of Crowds” and Pene- lope the cow. Surowiecki tells us this story . . . ]
[A 1906 county fair in Plymouth, England had a contest to guess the weight of an ox. A scientist named Francis Galton was there, and he did an experiment. He calculated the median of everyone’s guesses. The median guess was 1,207 pounds, and the true weight was 1,198 pounds, so the error was less than 1%. Even the cattle experts present didn’t estimate it that accurately.]
[NPR repeated the experiment in 2015 with a cow named Penelope whose photo they published online. They got 17,000 guesses, and the average guess was 1,287 pounds. Penelope’s actual weight was 1,355 pounds, so the crowd got it to within 5 percent.]
[The main idea is that sometimes the average opinion of a bunch of idiots is better than the opinion of one expert. And so it is with learning algorithms. We call a learning algorithm a weak learner if it does better than guessing randomly. And we combine a bunch of weak learners to get a strong one.]
[Incidentally, James Surowiecki, the author of the book, guessed 725 pounds for Penelope. So he was off by 87%. He’s like a bad decision tree who wrote a book about how to benefit from bad decision trees.]
We can take average of output of
– different learning algs
– same learning alg on many training sets [if we have tons of data]
– bagging: same learning alg on many random subsamples of one training set
– random forests: randomized decision trees on random subsamples
[These last two are the most common ways to use averaging, because usually we don’t have enough training data to use fresh data for every learner.]
[Averaging is not specific to decision trees; it can work with many different learning algorithms. But it works particularly well with decision trees.]
Regression algs: take median or mean output
Classification algs: take majority vote OR average posterior probs
[Apology to readers: I show some videos in this lecture, which cannot be included in this report.]
[Show averageaxis.mov] [Here’s a simple classifier that takes an average of “stumps,” trees of depth 1. Observe how good the posterior probabilities look.]
[Show averageaxistree.mov] [Here’s a 4-class classifier with depth-2 trees.]
wisdom.jpg, penelope.jpg
[The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings. It ran for three years and ended in 2009. The winners used an extreme ensemble method that took an average of many different learning algorithms. In fact, a couple of top teams combined into one team so they could combine their methods. They said, “Let’s average our models and split the money,” and that’s what happened.]
Use learners with low bias (e.g., deep decision trees).
High variance & some overfitting are okay. Averaging reduces the variance! [Each learner may overfit, but each overfits in its own unique way.] Averaging sometimes reduces the bias & increases flexibility;
e.g., creating nonlinear decision boundary from linear classifiers.
Hyperparameter settings usually different than 1 learner.
[Because averaging learners reduces their variance. But averaging rarely reduces bias as much as it reduces variance, so you want to get the bias nice and small before you average.]
# of trees is another hyperparameter.
Bagging = Bootstrap AGGregatING (Leo Breiman, 1994)
[Leo Breiman was a statistics professor right here at Berkeley. He did his best work after he retired in 1993. The bagging algorithm was published the following year, and then he went on to co-invent random forests as well. Unfortunately, he died in 2005.]
[Leo Breiman]
[Bagging is a randomized method for creating many different learners from the same data set. It works well with many different learning algorithms. One exception seems to be k-nearest neighbors; bagging mildly degrades it.]
Given n-point training sample, generate random subsample of size n′ by sampling with replacement. Some points chosen multiple times; some not chosen.
breiman.gif
1 3 4 6 8 9
     ↙          ↘
6 3 6 1 1 9     8 8 4 9 1 8
If n′ = n, ∼ 63.2% are chosen. [On average; this fraction varies randomly.]
Build learner. Points chosen j times have greater weight:
[If a point is chosen j times, we want to treat it the same way we would treat j different points all bunched up infinitesimally close together.]
– Decision trees: j-time point has j × weight in entropy.
– SVMs: j-time point incurs j × penalty to violate margin.
– Regression: j-time point incurs j × loss.
Repeat until T learners. Metalearner takes test point, feeds it into all T learners, returns average/majority output.
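[A minimal Python sketch of bagging, mine rather than the lecture's; it assumes an sklearn-style base learner with fit/predict and small nonnegative integer class labels.]

import numpy as np

def bag(X, y, base_learner, T, seed=0):
    # train T learners, each on a size-n subsample drawn with replacement
    rng = np.random.default_rng(seed)
    n = len(y)
    return [base_learner().fit(X[idx], y[idx])
            for idx in (rng.integers(0, n, size=n) for _ in range(T))]

def predict_majority(learners, Z):
    votes = np.stack([h.predict(Z) for h in learners]).astype(int)   # T x (# of test points)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)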
Random Forests
Random sampling isn’t random enough!
[With bagging, often the decision trees look very similar. Why is that?]
One really strong predictor → same feature split at top of every tree.
[For example, if you’re building decision trees to identify spam, the first split might always be “viagra.” Random sampling might not change that. If the trees are very similar, then taking their average doesn’t reduce the variance much.]
Idea:
At each treenode, take random sample of m features (out of d). Choose best split from m features.
[We’re not allowed to split on the other d − m features!] Different random sample for each treenode.
m ≈ √d works well for classification; m ≈ d/3 for regression.
[So if you have a 100-dimensional feature space, you randomly choose 10 features and pick the one of those 10 that gives the best split. But m is a hyperparameter, and you might get better results by tuning it for your particular application. These values of m are good starting guesses.]
Smaller m → more randomness, less tree correlation, more bias
[One reason this works is if there’s a really strong predictor, only a fraction of the trees can choose that pre- dictor as the first split. That fraction is m/d. So the split tends to “decorrelate” the trees. And that means that when you take the average of the trees, you’ll have less variance.]
[You have to be careful, though, because you don’t want to dumb down the trees too much in your quest for decorrelation. Averaging works best when you have very strong learners that are also diverse. But it’s hard to create a lot of learners that are very different yet all very smart. The Netflix Prize winners did it, but it was a huge amount of work.]
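[A small Python sketch, mine, of the per-treenode feature subsampling; for brevity it scores splits with the concave impurity p(1 − p) mentioned earlier instead of the entropy, and assumes binary labels.]

import numpy as np

def impurity(y):
    # the concave impurity p(1 - p); y is a boolean vector
    p = np.mean(y) if len(y) else 0.0
    return p * (1 - p)

def node_split(X, y, m, rng):
    # choose the best split among m randomly chosen features (out of d)
    n, d = X.shape
    best = (np.inf, None, None)                       # (weighted impurity, feature, threshold)
    for j in rng.choice(d, size=m, replace=False):
        for t in np.unique(X[:, j])[:-1]:             # candidate thresholds
            left = X[:, j] <= t
            w = (left.sum() * impurity(y[left]) + (~left).sum() * impurity(y[~left])) / n
            if w < best[0]:
                best = (w, j, t)
    return best

# e.g., m = round(np.sqrt(X.shape[1])) for classification; draw a fresh sample at every treenode.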
Sometimes test error reduction up to 100s or even 1,000s of decision trees! Disadvantage: loses interpretability/inference.
[But the compensation is it’s more accurate than a single decision tree.]
Variation: generate m random multivariate splits (oblique lines, quadrics); choose best split.
[You have to be a bit clever about how you generate random decision boundaries; I’m not going to discuss that today. I’ll just show lots of examples.]
[Show treesidesdeep.mov] [Lots of good-enough conic random decision trees.] [Show averageline.mov]
[Show averageconic.mov]
[Show square.mov] [Depth 2; look how good the posterior probabilities look.] [Show squaresmall.mov] [Depth 2; see the uncertainty away from the center.] [Show spiral2.mov] [Doesn’t look like a decision tree at all, does it?]
[Show overlapdepth14.mov] [Overlapping classes. This example overfits!] [Show overlapdepth5.mov] [Better fit.]
[Random forest classifiers for 4-class spiral data. Each forest takes the average of 400 trees. The top row uses trees of depth 4. The bottom row uses trees of depth 12. From left to right, we have axis-aligned splits, splits with lines with arbitrary rotations, and splits with conic sections. Each split is chosen to be the best of 500 random choices.]
[Random forest classifiers for the same data. Each forest takes the average of 400 trees. In these examples, all the splits are axis-aligned. The top row uses trees of depth 4. The bottom row uses trees of depth 12. From left to right, we choose each split from 1, 5, or 50 random choices. The more choices, the less bias and the better the classifier.]
500.pdf
randomness.pdf
16 The Kernel Trick
KERNELS
Recall: with d input features, degree-p polynomials blow up to O(d^p) features.
[When d is large, this gets computationally intractable really fast.
As I said in Lecture 4, if you have 100 features per feature vector and you want to use degree-4 decision functions, then each lifted feature vector has a length of roughly 4 million.]
Today, magically, we use those features without computing them!
Observation: In many learning algs,
– the weights can be written as a linear combo of sample points, &
– we can use inner products of Φ(x)’s only ⇒ don’t need to compute Φ(x)!
Suppose w = X⊤a = Σ_{i=1}^n ai Xi for some a ∈ R^n.
Substitute this identity into alg. and optimize n dual weights a (aka dual parameters) instead of d + 1 (or d^p) primal weights w.
Kernel Ridge Regression
Center X and y so their means are zero: Xi ← Xi − μX, yi ← yi − μy.
This lets us replace I′ with I in the normal equations:
(X⊤X + λI)w = X⊤y
[To dualize ridge regression, we need the weights to be a linear combination of the sample points. Unfortu- nately, that only happens if we penalize the bias term wd+1 = α, as these normal equations do. Fortunately, when we center X and y, the “expected” value of the bias is zero. The actual bias won’t usually be exactly zero, but it will often be close enough that we won’t do much harm by penalizing the bias.]
Suppose a is a solution to (XX⊤ + λI)a = y.
Then X⊤y = X⊤XX⊤a + λX⊤a = (X⊤X + λI)X⊤a
Therefore, w = X⊤a is a solution to the normal equations, and w is a linear combo of sample points!
a is a dual solution; solves the dual form of ridge regression:
Find a that minimizes ∥XX⊤a − y∥² + λ∥X⊤a∥²
[We obtain this dual form by substituting w = X⊤a into the original ridge regression cost function.]
Training: Solve (XX⊤ + λI)a = y for a.
Testing: Regression fn is
h(z) = w⊤z = a⊤Xz = Σ_{i=1}^n ai (Xi⊤z)   ⇐ weighted sum of inner products
Let k(x, z) = x⊤z be kernel fn.
[Later, we’ll replace x and z with Φ(x) and Φ(z), and that’s where the magic will happen.]
Let K = XX⊤ be the n × n kernel matrix. Note Kij = k(Xi, Xj).
K is singular if n > d + 1. [And sometimes even if it's not.] In that case, probably no solution if λ = 0. [Then we must choose a positive λ. But that's okay.]
Dual ridge reg. alg:
∀i, j, Kij ← k(Xi, Xj)             ⇐ O(n²d) time
Solve (K + λI)a = y for a          ⇐ O(n³) time
for each test pt z
  h(z) ← Σ_{i=1}^n ai k(Xi, z)     ⇐ O(nd) time
Does not use Xi directly! Only k. [This will become important soon.]
Dual: solve n × n linear system, O(n³ + n²d) time
Primal: solve d × d linear system, O(d³ + d²n) time
We prefer dual when d > n. [Moreover, if we add polynomial terms as new features, the d in the primal running time increases, but we will see that the d in the kernelized dual running time does not increase.]
[Important: dual ridge regression produces the same predictions as primal ridge regression!]
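[A short Python sketch of the dual algorithm above, mine rather than the notes'; it assumes X and y have already been centered and that k is any kernel function.]

import numpy as np

def dual_ridge_fit(X, y, k, lam):
    n = len(y)
    K = np.array([[k(Xi, Xj) for Xj in X] for Xi in X])   # O(n^2 d) time
    return np.linalg.solve(K + lam * np.eye(n), y)        # O(n^3) time

def dual_ridge_predict(X, a, k, z):
    return sum(ai * k(Xi, z) for ai, Xi in zip(a, X))     # O(nd) time per test point

# linear kernel (same predictions as primal ridge regression):  k = lambda x, z: x @ z
# polynomial kernel of degree p:                                k = lambda x, z: (x @ z + 1) ** p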
The Kernel Trick (aka Kernelization)
[Here’s the magic part. We will see that we can compute a polynomial kernel that involves many monomial terms without actually computing those terms.]
The polynomial kernel of degree p is k(x, z) = (x⊤z + 1)^p
Theorem: (x⊤z + 1)^p = Φ(x)⊤Φ(z) where Φ(x) contains every monomial in x of degree 0 . . . p.
Example for d = 2, p = 2:
(x⊤z + 1)² = x1²z1² + x2²z2² + 2 x1z1 x2z2 + 2 x1z1 + 2 x2z2 + 1
           = [x1²  x2²  √2 x1x2  √2 x1  √2 x2  1] [z1²  z2²  √2 z1z2  √2 z1  √2 z2  1]⊤   [This is how we're defining Φ.]
           = Φ(x)⊤Φ(z)
[Notice the factors of √2. If you try a higher polynomial degree p, you'll see a wider variety of these constants. We have no control of the constants that appear in Φ(x), but they don't matter excessively much, because the primal weights w will scale themselves to compensate. Even though we won't be directly computing the primal weights . . . they still implicitly exist.]
Key win: compute Φ(x)⊤Φ(z) in O(d) time instead of O(d^p), even though Φ(x) has length O(d^p).
Kernel ridge regression replaces Xi with Φ(Xi):
Let k(x, z) = Φ(x)⊤Φ(z). But don't compute Φ(x) or Φ(z); compute k(x, z) = (x⊤z + 1)^p !
[I think what we've done here is pretty mind-blowing: we can now do polynomial regression with an exponentially long, high-order polynomial in less time than it would take even to write out the final polynomial. The running time is sublinear, actually much smaller than linear, in the length of the Φ vectors.]
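[A quick numerical check, mine, of the d = 2, p = 2 identity above.]

import numpy as np

def phi(x):                              # the lifting defined above, sqrt(2) factors included
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

x, z = np.array([3.0, -1.0]), np.array([0.5, 2.0])
print((x @ z + 1) ** 2)                  # kernel: O(d) time
print(phi(x) @ phi(z))                   # same value (0.25), though Phi has length O(d^p)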
Kernel Perceptrons
Featurized perceptron alg:
w ← y1 Φ(X1)   [starting point is arbitrary, but can't be zero]
while some yi Φ(Xi) · w < 0
  w ← w + ε yi Φ(Xi)
for each test pt z
  h(z) ← w · Φ(z)
Let Φ(X) be n × D matrix with rows Φ(Xi)⊤; D = length of Φ(·); K = Φ(X) Φ(X)⊤.
Dualize with w = Φ(X)⊤a. Then the code "ai ← ai + ε yi" has same effect as "w ← w + ε yi Φ(Xi)".
[So the dual weight ai records what multiple of sample point i the perceptron algorithm has added to w.]
Φ(Xi) · w = (Φ(X)w)i = (Φ(X) Φ(X)⊤a)i = (Ka)i
Dual perceptron alg:
a ← [y1 0 . . . 0]⊤   [starting point is arbitrary, but can't be zero]
∀i, j, Kij ← k(Xi, Xj)             ⇐ O(n²d) time (kernel trick)
while some yi (Ka)i < 0
  ai ← ai + ε yi                   ⇐ O(1) time; update Ka in O(n) time
for each test pt z
  h(z) ← Σ_{j=1}^n aj k(Xj, z)     ⇐ O(nd) time [kernel trick]
[A big deal is that the running times depend on the original dimension d, not on the length D of Φ(·)!]
OR we can compute w = Φ(X)⊤a once in O(nD) time & evaluate test pts in O(D) time/pt
[. . . which is a win if the numbers of training points and test points both exceed D/d.]
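[A Python sketch of the dual perceptron above, mine rather than the notes'; labels are ±1, k is any kernel function, and the misclassification check is a naive scan.]

import numpy as np

def dual_perceptron(X, y, k, eps=1.0, max_iters=1000):
    n = len(y)
    K = np.array([[k(Xi, Xj) for Xj in X] for Xi in X])   # O(n^2 d) time, via the kernel trick
    a = np.zeros(n)
    a[0] = y[0]                                           # arbitrary nonzero starting point
    Ka = K @ a
    for _ in range(max_iters):
        wrong = np.where(y * Ka < 0)[0]
        if len(wrong) == 0:
            break
        i = wrong[0]
        a[i] += eps * y[i]                                # O(1) update to a ...
        Ka += eps * y[i] * K[:, i]                        # ... and O(n) update to Ka
    return a

def h(z, X, a, k):
    return sum(aj * k(Xj, z) for aj, Xj in zip(a, X))     # O(nd) time per test point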
Kernel Logistic Regression
[The stochastic gradient descent step for logistic regression is just a small modification of the step for perceptrons. But recall that we’re no longer looking for misclassified sample points. Instead, we apply the gradient descent rule to sample points in a stochastic, random order—or, alternatively, to all the points at once. Also recall that our starting point is zero.]
Stochastic gradient descent step:
ai ← ai + ε (yi − s((Ka)i)) [where s is the logistic function]
[Just like with perceptrons, every time you update one dual weight ai, you can update Ka in O(n) time so you don’t have to compute it from scratch on the next iteration. If you prefer batch gradient descent . . . ]
Batch gradient descent step: a ← a + ε (y − s(Ka))   ⇐ applying s component-wise to vector Ka
For each test pt z:
  h(z) ← s(Σ_{j=1}^n aj k(Xj, z))
[If you're using logistic regression as a classifier and you don't care about the posterior probabilities, you can skip the logistic function and just compute the summation, like in the perceptron algorithm.]
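[A short sketch, mine, of the batch step; labels yi are in {0, 1} and K is the kernel matrix as above.]

import numpy as np

def s(gamma):
    return 1.0 / (1.0 + np.exp(-gamma))      # logistic fn, applied component-wise

def kernel_logreg(K, y, eps=0.1, iters=1000):
    a = np.zeros(len(y))                     # here the starting point may be zero
    for _ in range(iters):
        a = a + eps * (y - s(K @ a))         # batch step: a <- a + eps (y - s(Ka))
    return a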
The Gaussian Kernel
[Mind-blowing as the polynomial kernel is, I think our next trick is even more mind-blowing. Since we can now do fast computations in spaces with exponentially large dimensions, why don’t we go all the way and generate feature vectors in infinite-dimensional space?]
Gaussian kernel, aka radial basis fn kernel: there exists a Φ(x) such that
k(x, z) = exp(−∥x − z∥² / (2σ²))   [This kernel takes O(d) time to compute.]
[In case you’re curious, here’s the feature vector that gives you this kernel, for the case where you have only
one input feature per sample point.] e.g., for d = 1,
Φ(x) = exp(−x²/(2σ²)) [1, x/(σ√1!), x²/(σ²√2!), x³/(σ³√3!), . . .]⊤
[This is an infinite vector, and Φ(x) · Φ(z) is a series that converges to k(x, z). Nobody actually uses this value of Φ(x) directly, or even cares about it; they just use the kernel function k(·, ·).]
[At this point, it’s best not to think of points in a high-dimensional space. It’s no longer a useful intuition. Instead, think of the kernel k as a measure of how similar or close together two points are to each other.]
Key observation: hypothesis h(z) = Σ_{j=1}^n aj k(Xj, z) is a linear combo of Gaussians centered at sample pts.
[The dual weights are the coefficients of the linear combination.] [The Gaussians are a basis for the hypothesis.]
[A hypothesis h that is a linear combination of Gaussians centered at four sample points, two with positive weights and two with negative weights. If you use ridge regression with a Gaussian kernel, your “linear” regression will look something like this.]
gausskernel.pdf
Very popular in practice! Why?
– Gives very smooth h [In fact, h is infinitely differentiable; it’s C∞-continuous.]
– Behaves somewhat like k-nearest neighbors, but smoother
– Oscillates less than polynomials (depending on σ)
– k(x, z) interpreted as a similarity measure. Maximum when z = x; goes to 0 as distance increases.
– Sample points "vote" for value at z, but closer points get weightier vote.
[The “standard” kernel x · z assigns more weight to sample point vectors that point in roughly the same direction as z. By contrast, the Gaussian kernel assigns more weight to sample points near z.]
Choose σ by (cross-)validation. σ trades off bias vs. variance:
larger σ → wider Gaussians & smoother h → more bias & less variance
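[A small sketch, mine, of the Gaussian kernel and the resulting hypothesis; the dual weights a could come from the dual ridge regression sketch earlier.]

import numpy as np

def gaussian_kernel(x, z, sigma):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))    # O(d) time

def h(z, X, a, sigma):
    # a linear combination of Gaussians centered at the sample points
    return sum(aj * gaussian_kernel(Xj, z, sigma) for aj, Xj in zip(a, X))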
[Figure annotations: Training Error: 0.160; Test Error: 0.218; Bayes Error: 0.210.]
gausskernelsvm.pdf (ESL, Figure 12.3)
[The decision boundary (solid black) of a soft- margin SVM with a Gaussian kernel. Observe that in this example, it comes reasonably close to the Bayes optimal decision boundary (dashed purple). The dashed black curves are the boundaries of the margin. The small black disks are the support vectors that lie on the
margin boundary.]
[By the way, there are many other kernels that, like the Gaussian kernel, are defined directly as kernel functions without worrying about Φ. But not every function can be a kernel function. A function is qualified only if it always generates a positive semidefinite kernel matrix, for every sample. There is an elaborate theory about how to construct valid kernel functions. However, you probably won’t need it. The polynomial and Gaussian kernels are the two most popular by far.]
17 Neural Networks
NEURAL NETWORKS
Can do both classification & regression.
[They tie together several ideas from the course: perceptrons, logistic regression, ensembles of learners, and stochastic gradient descent. They also tie in the idea of lifting sample points to a higher-dimensional feature space, but with a new twist: neural nets can learn features themselves.]
[I want to begin by reminding you of the story I told you at the beginning of the semester, about Frank Rosenblatt’s invention of perceptrons in 1957. Remember that he held a press conference where he predicted that perceptrons would be “the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.”]
[Perceptron research continued until something monumental happened in 1969. Marvin Minsky, one of the founding fathers of AI, and Seymour Papert published a book called “Perceptrons.” Sounds good, right? Well, part of the book was devoted to things perceptrons can’t do. And one of those things is XOR.]
XOR truth table (rows indexed by x2, columns by x1):
         x1 = 0   x1 = 1
x2 = 0      0        1
x2 = 1      1        0
[Think of the four outputs here as sample points in two-dimensional space. Two of them are in class 1, and two of them are in class 0. We want to find a linear classifier that separates the 1’s from the 0’s. Can we do it? No.]
[So Minsky and Papert were basically saying, “Frank. You’re telling us this machine is going to be conscious of its own existence but it can’t do XOR?”] []
[The book had a devastating effect on the field. After its publication, almost no research was done on neural net-like ideas for a decade, a time we now call the “AI Winter.” Shortly after the book was published, Frank Rosenblatt died.] []
[One thing I don’t understand is why the book was so fatal when there are several almost obvious ways to get around the XOR problem. Here’s the easiest.]
If you add one new quadratic feature, x1x2, XOR is linearly separable in 3D.
[Draw this by hand. ]
[Now we can find a plane that cuts through the cube obliquely and separates the 0’s from the 1’s.]
xorcube.pdf
[However, there’s an even more powerful way to do XOR. The idea is to design linear classifiers whose output is the input to other linear classifiers. That way, you should be able to emulate arbitrarily logical circuits. Suppose I put together some linear decision functions like this.]
A linear combo of a linear combo is a linear combo . . . only works for linearly separable points.
[We need one more idea to make neural nets. We need to add some sort of nonlinearity between the linear combinations. Let’s call these boxes that compute linear combinations “neurons.” If a neuron runs the linear combination it computes through some nonlinear function before sending it on to other neurons, then the neurons can act somewhat like logic gates. The nonlinearity could be as simple as clamping the output so it can’t go below zero. And that’s what people usually use in practice these days.]
[The traditional choice has been to use the logistic function. The logistic function can’t go below zero or above one, which is nice because it can’t ever get huge and oversaturate the other neurons it’s sending information to. The logistic function is also smooth, which means it has well-defined gradients and Hessians we can use in gradient descent.]
[With logistic functions between the linear combinations, here’s a two-level perceptron that computes the XOR function.]
[Draw this by hand. ] [If I interpret the output as 1 if z is positive or 0 if z is negative, can I do XOR with this?]
[The network: hidden units v = s(20x + 20y − 10) (OR) and w = s(30 − 20x − 20y) (NAND); the output unit computes s(20v + 20w − 30) (AND of v and w), which is x ⊕ y.]
[Draw this by hand. ] [Note that the logistic function at the output is optional; we could just take the sign of the output instead.]
lincombo.pdf
xorgates.pdf
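[A quick check, mine, that the weights in the figure really compute XOR.]

import numpy as np

def s(gamma):
    return 1.0 / (1.0 + np.exp(-gamma))

def xor_net(x, y):
    v = s(20 * x + 20 * y - 10)        # OR-like hidden unit
    w = s(30 - 20 * x - 20 * y)        # NAND-like hidden unit
    return s(20 * v + 20 * w - 30)     # AND of v and w, i.e., x XOR y

for x in (0, 1):
    for y in (0, 1):
        print(x, y, round(xor_net(x, y)))   # prints 0, 1, 1, 0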
Network with 1 Hidden Layer
Layer 1 weights: m × (d + 1) matrix V; Vi⊤ is row i.
Layer 2 weights: k × (m + 1) matrix W; Wi⊤ is row i.
Recall [logistic function] s(γ) = 1/(1 + e^{−γ}). Other nonlinear fns can be used.
For vector v, s(v) = [s(v1), s(v2), . . .]⊤. [We apply s to a vector component-wise.]
Input layer: x1, . . . , xd; xd+1 = 1
Hidden units: h1, . . . , hm; hm+1 = 1
Output layer: z1, . . . , zk
[Draw this by hand.]
h = s(Vx) . . . that is, hi = s(Σ_{j=1}^{d+1} Vij xj)
z = s(Wh) = s(W s1(Vx))   [where s1(Vx) denotes s(Vx) with the fixed feature hm+1 = 1 appended]
[We might have more than one output so that we can build multiple classifiers that share hidden units. One of the interesting advantages of neural nets is that if you train multiple classifiers simultaneously, sometimes some of them come out better because they can take advantage of particularly useful hidden units that first emerged to support one of the other classifiers.]
[We can add more hidden layers, and for image recognition tasks it’s common to have 12 to 40 hidden layers. There are many variations you can experiment with—for instance, you can have connections that go forward more than one layer.]
1hiddenlayer.pdf
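[A small Python sketch, mine, of this forward pass with the bias features appended.]

import numpy as np

def s(gamma):
    return 1.0 / (1.0 + np.exp(-gamma))

def forward(x, V, W):
    # V is m x (d+1), W is k x (m+1); a 1 is appended for the bias features
    x1 = np.append(x, 1.0)        # x_{d+1} = 1
    h = s(V @ x1)                 # hidden units, h = s(Vx)
    h1 = np.append(h, 1.0)        # h_{m+1} = 1
    z = s(W @ h1)                 # outputs, z = s(Wh)
    return h, z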
Training
Usually stochastic or batch gradient descent.
Pick loss fn L(z, y), e.g., L(z, y) = ∥z − y∥²   [z: predictions; y: true labels (could be vectors)]
Cost fn is J(h) = (1/n) Σ_{i=1}^n L(h(Xi), Yi)
[I'm using a capital Y here because now Y is a matrix with one row for each sample point and one column for each output of the neural net. Sometimes there is just one output, but many neural net applications have more.]
[Now we want to find the weights matrices V and W that minimize J.]
Usually there are many local minima!
[The cost function for a neural net is, generally, not even close to convex. For that reason, it’s possible to wind up in a bad minimum. We’ll talk later about some clever ways to coax neural nets into better minima.]
[Now let me ask you this. Suppose we start by setting all the weights to zero, and then we do gradient descent on the weights. What will go wrong?]
[This neural network has a symmetry: there’s really no difference between one hidden unit and any other hidden unit. The gradient descent algorithm has no way to break the symmetry between hidden units. You can get stuck in a situation where all the weights out of an input unit have the same value, and all the weights into an output unit have the same value, and they have no way to become different from each other. To avoid this problem, and in the hopes of finding a better local minimum, we start with random weights.]
Let w be a vector containing all the weights in V & W. Batch gradient descent:
w ← vector of random weights
repeat
  w ← w − ε ∇J(w)
[We’ve just rewritten all the weights as a vector for notational convenience. When you actually write the
code, for the sake of speed, you should probably operate directly on the weight matrices V and W.]
[It’s important to make sure the random weights aren’t too big, because if a unit’s output gets too close to zero or one, it can get “stuck,” meaning that a modest change in the input values causes barely any change in the output value. Stuck units tend to stay stuck because in that operating range, the gradient s′(·) of the logistic function is close to zero.]
[Instead of batch gradient descent, we can use stochastic gradient descent, which means we use the gradient of one sample point’s loss function at each step. Typically, we shuffle the points in a random order, or just pick one randomly at each step.]
[The hard part of this algorithm is computing the gradient. If you simply derive one derivative for each weight, you’ll find that for a network with multiple layers of hidden units, it takes time linear in the number of edges in the neural network to compute a derivative for one weight. Multiply that by the number of weights. We’re going to spend the rest of this lecture learning to improve the running time to linear in the number of edges.]
Naive gradient computation: O(edges2) time Backpropagation: O(edges) time
Computing Gradients for Arithmetic Expressions
[Let’s see what it takes to compute the gradient of an arithmetic expression. It turns into repeated applica- tions of the chain rule from calculus.]
[Draw this by hand. Draw the black diagram first. Then the goal (upper right). Then the green and red expressions, from left to right, leaving out the green arrows. Then the green arrows, starting at the right side of the page and moving left. Lastly, write the text at the bottom. (Use the same procedure for the next two figures.)]
gradients.pdf
[What if a unit’s output goes to more than one unit? Then we need to understand a more complicated version
of the chain rule. Let’s try it with an expression that’s similar to what you’ll encounter in a neural net.]
[Draw this by hand. ] [Here we’re using a standard rule of multivariate calculus:]
∂/∂τ L(z1(τ), z2(τ)) = (∂L/∂z1)(∂z1/∂τ) + (∂L/∂z2)(∂z2/∂τ) = ∇z L · ∂z/∂τ
[Observe that we’re doing dynamic programming here. We’re computing the solutions of subproblems, then using each solution to compute the solutions of several bigger problems.]
gradientspartial.pdf
The Backpropagation Alg.
[Backpropagation is a dynamic programming algorithm for computing the gradients we need to do neural net stochastic gradient descent in time linear in the number of weights.]
Vi⊤ is row i of weight matrix V [and likewise for rows of W]. Recall s′(γ) = s(γ) (1 − s(γ)).
hi = s(Vi · x), so ∇Vi hi = s′(Vi · x) x = hi (1 − hi) x
zj = s(Wj · h), so ∇Wj zj = s′(Wj · h) h = zj (1 − zj) h   and   ∇h zj = zj (1 − zj) Wj
[Here is the arithmetic expression for the same neural network I drew for you three illustrations ago. It looks very different when you depict it like this, but don’t be fooled; it’s exactly the same network I started with. But now we treat the weights V and W as the inputs, rather than the point x.]
[Draw this by hand. ]
gradbackprop2.pdf
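[A Python sketch, mine rather than the notes', of backpropagation for this one-hidden-layer network with the squared-error loss L(z, y) = ∥z − y∥² and sigmoid units throughout.]

import numpy as np

def backprop(x, y, V, W):
    # forward pass (as in the earlier sketch)
    x1 = np.append(x, 1.0)
    h = 1 / (1 + np.exp(-(V @ x1)))
    h1 = np.append(h, 1.0)
    z = 1 / (1 + np.exp(-(W @ h1)))
    # backward pass: each unit's output gives s'(.) = s (1 - s) for free
    delta_out = 2 * (z - y) * z * (1 - z)      # dL/d(output layer's linear combos)
    grad_W = np.outer(delta_out, h1)           # gradient of L w.r.t. W
    dL_dh = W[:, :-1].T @ delta_out            # chain rule back through W (drop the bias column)
    delta_hid = dL_dh * h * (1 - h)
    grad_V = np.outer(delta_hid, x1)           # gradient of L w.r.t. V
    return grad_V, grad_W

# one stochastic gradient descent step:  V -= eps * grad_V;  W -= eps * grad_W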
18 Neurobiology; Variations on Neural Networks
NEUROBIOLOGY
[The field of artificial intelligence started with some wrong premises. The early AI researchers attacked problems like chess and theorem proving, because they thought those exemplified the essence of intelligence. They didn’t pay much attention at first to problems like vision and speech understanding. Any four-year-old can do those things, and so researchers underestimated their difficulty.]
[Today, we know better. Computers can effortlessly beat four-year-olds at chess, but they still can’t play with toys well. We’ve come to realize that rule-based symbol manipulation is not the primary defining mark of intelligence. Even rats do computations that we’re hard pressed to match with our computers. We’ve also come to realize that these are different classes of problems that require very different styles of computation. Brains and computers have very different strengths and weaknesses, which reflect their different computing styles.]
[Neural networks are partly inspired by the workings of actual brains. Let’s take a look at a few things we know about biological neurons, and contrast them with both neural nets and traditional computation.]
– CPUs: largely sequential, nanosecond gates, fragile if gate fails superior for arithmetic, logical rules, perfect key-based memory
– Brains: very parallel, millisecond neurons, fault-tolerant
[Neurons are continually dying. You’ve probably lost a few since this lecture started. But you probably didn’t notice. And that’s interesting, because it points out that our memories are stored in our brains in a diffuse representation. There is no one neuron whose death will make you forget that 2 + 2 = 4. Artificial neural nets often share that resilience. Brains and neural nets seem to superpose memories on top of each other, all stored together in the same weights, sort of like a hologram.]
[In the 1920’s, the psychologist Karl Lashley conducted experiments to identify where in the brain memories are stored. He trained rats to run a maze, and then made lesions in different parts of the cerebral cortex, trying to erase the memory trace. Lashley failed; his rats could still find their way through the maze, no matter where he put lesions. He concluded that memories are not stored in any one area of the brain, but are distributed throughout it. Neural networks, properly trained, can duplicate this property.]
superior for vision, speech, associative memory
[By “associative memory,” I mean noticing connections between things. One thing our brains are very good at is retrieving a pattern if we specify only a portion of the pattern.]
[It’s impressive that even though a neuron needs a few milliseconds to transmit information to the next neurons downstream, we can perform very complex tasks like interpreting a visual scene in a tenth of a second. This is possible because neurons run in parallel, but also because of their computation style.]
[Neural nets try to emulate the parallel, associative thinking style of brains, and they are among the best techniques we have for many fuzzy problems, including some problems in vision and speech. Not co- incidentally, neural nets are also inferior at many traditional computer tasks such as multiplying 10-digit numbers or compiling source code.]
neurons.pdf
– Neuron: A cell in brain/nervous system for thinking/communication
– Action potential or spike: An electrochemical impulse fired by a neuron to communicate w/other
neurons
– Axon: The limb(s) along which the action potential propagates; “output”
[Most axons branch out eventually, sometimes profusely near their ends.]
[It turns out that giant squids have a very large axon they use for fast water jet propulsion. The mathematics of action potentials was first characterized in these giant squid axons, and that work won a Nobel Prize in Physiology in 1963.]
– Dendrite: Smaller limbs by which neuron receives info; “input”
– Synapse: Connection from one neuron’s axon to another’s dendrite
[Some synapses connect axons to muscles or glands.]
– Neurotransmitter: Chemical released by axon terminal to stimulate dendrite
[When an action potential reaches an axon terminal, it causes tiny containers of neurotransmitter, called vesicles, to empty their contents into the space where the axon terminal meets another neuron’s dendrite. That space is called the synaptic cleft. The neurotransmitters bind to receptors on the dendrite and influence the next neuron’s body voltage. This sounds incredibly slow, but it all happens in 1 to 5 milliseconds.]
You have about 10^11 neurons, each with about 10^4 synapses.
Analogies: [between artificial neural networks and brains]
– Output of unit ↔ firing rate of neuron
[An action potential is “all or nothing”—all action potentials have the same shape and size. The output of a neuron is not signified by voltage like the output of a transistor. The output of a neuron is the frequency at which it fires. Some neurons can fire at nearly 1,000 times a second, which you might think of as a strong “1” output. Conversely, some types of neurons can go for minutes without firing. But some types of neurons never stop firing, and for those you might interpret a firing rate of 10 times per second as a “0”.]
– Weight of connection ↔ synapse strength
– Positive weight ↔ excitatory neurotransmitter (e.g., glutamine)
– Negative weight ↔ inhibitory neurotransmitter (e.g., GABA, glycine) [Gamma aminobutyric acid.]
[A typical neuron is either excitatory at all its axon terminals, or inhibitory at all its terminals. It can’t
switch from one to the other. Artificial neural nets have an advantage here.]
– Linear combo of inputs ↔ summation
[A neuron fires when the sum of its inputs, integrated over time, reaches a high enough voltage. However, the neuron body voltage also decays slowly with time, so if the action potentials are coming in slowly enough, the neuron might not fire at all.]
– Logistic/sigmoid fn ↔ firing rate saturation
[A neuron can’t fire more than 1,000 times a second, nor less than zero times a second. This limits its ability to overpower downstream neurons. We accomplish the same thing with the sigmoid function.]
– Weight change/learning ↔ synaptic plasticity
[Donald] Hebb’s rule (1949): “Cells that fire together, wire together.”
[This doesn’t mean that the cells have to fire at exactly the same time. But if one cell’s firing tends to make another cell fire more often, their excitatory synaptic connection tends to grow stronger. There’s a reverse rule for inhibitory connections. And there are ways for neurons that aren’t even connected to grow connections.]
[There are simple computer learning algorithms based on Hebb’s rule. They can work, but they’re generally not nearly as fast or effective as backpropagation.]
[Backpropagation is one part of artificial neural networks for which there is no analogy in the brain. Brains probably do not do backpropagation.]
[The brain is very modular.]
[(The following items are all spoken, not written . . . )
• The part of our brain we think of as most characteristically human is the cerebral cortex, the seat of self-awareness, language, and abstract thinking.
But the brain has a lot of other parts that take the load off the cortex.
• Our brain stem regulates functions like heartbeat, breathing, and sleep.
• Our cerebellum governs fine coordination of motor skills. When we talk about “muscle memory,” much of that is in the cerebellum, and it saves us from having to consciously think about how to walk or talk or brush our teeth, so the cortex can focus on where to walk and what to say. [ ]
• Our limbic system is the center of emotion and motivation, and as such, it makes a lot of the big decisions. I sometimes think that 90% of the job of our cerebral cortex is to rationalize decisions that have already been made by the limbic system. [ ]
• Our visual cortex (in the occipital lobe) performs a lot of processing on the input from your eyes to change it into a more useful form. Neuroscientists and computer scientists are particularly interested in the visual cortex for several reasons. Vision is an important problem for computers. The visual cortex is one of the easier parts of the brain to study in lab animals. The visual cortex is largely a feedforward network with few neurons going backward, so it’s easier for us to train computers to behave like the visual cortex.]
brain.png
[Although the brain has lots of specialized modules, one thing that’s interesting about the frontal lobe is that it seems to be made of general-purpose neural tissue that looks more or less the same everywhere, at least before it’s trained. If you experience damage to part of the frontal lobe early enough in life, while your brain is still growing, the functions will just relocate to a different part of the frontal lobe, and you’ll probably never notice the difference.]
[As computer scientists, our primary motivation for studying neurology is to try to get clues about how we can get computers to do tasks that humans are good at. But neurologists and psychologists have also been part of the study of neural nets from the very beginning. Their motivations are scientific: they’re curious how humans think, and how we can do the things we do.]
NEURAL NET VARIATIONS
[I want to show you a few basic variations on the standard neural network I showed you last class, and how some of these variations change backpropagation.]
Regression: usually linear output unit(s)—omit sigmoid fn.
[If you make that change, the gradient changes too, and you have to change the derivation of backprop. The derivation gets simpler, so I’ll leave it as an exercise.]
Classification: to choose from k ≥ 3 classes, use softmax fn. [Here we deploy k separate output units.] Let t = Wh be k-vector of linear combos in final layer.
Softmax output is z_j(t) = e^{t_j} / Σ_{i=1}^k e^{t_i}.
∂z_j/∂t_j = z_j (1 − z_j),   ∂z_j/∂t_i = −z_j z_i for i ≠ j,   ∇_h z_j = z_j (W_j − W^⊤z),
where W_j denotes row j of W (written as a column vector).
Each zj ∈ (0,1); their sum is 1.
[Interpret zj as the probability of the input belonging to class j. For example, in the digit recognition problem, we might have 10 softmax output units, one for each digit class.]
[If you have only 2 classes, just use one sigmoid output; it’s equivalent to 2-way softmax.]
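[A minimal NumPy sketch of the softmax computation; the input vector is just an illustrative placeholder. Subtracting the maximum entry before exponentiating is a standard trick that doesn't change the output but avoids overflow.]

    import numpy as np

    def softmax(t):
        # Subtracting max(t) doesn't change the output but avoids overflow.
        e = np.exp(t - np.max(t))
        return e / e.sum()

    z = softmax(np.array([2.0, 1.0, 0.1]))
    print(z, z.sum())   # each z_j is in (0, 1) and the entries sum to 1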
Sigmoid Unit Saturation
Problem: When unit output s is close to 0 or 1 for most training points, s′ = s(1 − s) ≈ 0, so gradient descent changes s very slowly. Unit is “stuck.” Slow training & bad local minima.
[Draw flat spots, “linear” region, & maximum curvature points (where s ≈ 0.21 and s ≈ 0.79) of the sigmoid function. Ideally, we would stay away from the flat spots.]
logistic.pdf
[Wikipedia calls this the “vanishing gradient problem.”]
[The more layers your network has, the more problematic this problem becomes. Most of the early attempts to train deep, many-layered neural nets failed.]
Mitigation: [None of these are complete cures.]
(1) Initial weight of edge into unit with fan-in η: random with mean zero, std. dev. 1/√η.
[The bigger the fan-in of a unit, the easier it is to saturate it. So we choose smaller random initial weights for units with bigger fan-in.]
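[A quick NumPy sketch of rule (1); the layer sizes are hypothetical, not from the lecture.]

    import numpy as np

    def init_weights(fan_in, fan_out, rng=np.random.default_rng(0)):
        # Mean zero, standard deviation 1/sqrt(fan_in): units with larger
        # fan-in get smaller initial weights, so they start less saturated.
        return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))

    V = init_weights(fan_in=784, fan_out=200)   # hypothetical layer sizes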
(2) Set target values to 0.85 & 0.15 instead of 1 & 0.
[Recall that the sigmoid function can never be 0 or 1; it can only come close. If your target values are 1 & 0, you are pushing the output units into the flat spots! The numbers 0.15 and 0.85 are reasonable because the sigmoid function achieves its greatest curvature when its output is near 0.21 or 0.79. But experiment to find the best values.]
[Option (2) helps to avoid stuck output units, but not stuck hidden units. So . . . ]
(3) Modify backprop to add small constant (typically ∼ 0.1) to s′.
[This hacks the gradient so a unit can’t get stuck. We’re not doing steepest descent any more, because we’re not using the real gradient. But often we’re finding a better descent direction that will get us to a minimum faster. This hack originates with Scott Fahlman’s Quickprop algorithm.]
(4) Cross-entropy loss fn (aka logistic loss) instead of squared error.
For k-class softmax output, cross-entropy is L(z, y) = −Σ_{i=1}^k y_i ln z_i, where z is the prediction vector and y is the vector of true labels.
Strongly recommended: choose labels so Σ_{j=1}^k y_j = 1.
[Typically, people choose one label to be 1 and the others to be 0. But by idea (2) above, it might be wiser to use less extreme values. From here on, we will assume that the target labels sum to 1.]
For sigmoid outputs, L(z, y) = −y ln z − (1 − y) ln(1 − z).
Derivatives for k-class softmax backprop:
∂L/∂z_i = −y_i / z_i
∇_{W_i} L = Σ_{j=1}^k (∂L/∂z_j)(∂z_j/∂t_i) ∇_{W_i} t_i = ( −(y_i/z_i) z_i + Σ_{j=1}^k (y_j/z_j) z_j z_i ) h = (z_i − y_i) h,
hence ∇_W L = (z − y) h^⊤
∇_h L = Σ_{j=1}^k (∂L/∂z_j) ∇_h z_j = −Σ_{j=1}^k (y_j/z_j) z_j (W_j − W^⊤z) = Σ_{j=1}^k y_j W^⊤z − Σ_{j=1}^k y_j W_j = W^⊤(z − y)
(using the assumption that Σ_{j=1}^k y_j = 1).
[Notice that the denominator of ∂L/∂zi cancels out the numerator zi in the softmax derivatives. This saves the unit from getting stuck when the softmax derivatives are small. It is related to the fact that the logistic loss goes to infinity as the predicted value zi approaches zero or one. The vanishing gradient of the sigmoid is compensated for by the huge gradient of the logistic loss.]
For the sigmoid output, we also have ∇_W L = (z − y) h^⊤ and ∇_h L = W^⊤(z − y).
[Like option (2), cross-entropy loss helps to avoid stuck output units, but not stuck hidden units.]
[Cross-entropy losses are only for sigmoid and softmax outputs. By contrast, for regression we typically use linear outputs, which don’t saturate, so the squared error loss is better for them.]
[Now I will show you how to derive the backprop equations for a softmax output, the cross-entropy loss function, and l2 regularization—which helps to reduce overfitting, just like in ridge regression. Observe that because of the simplifications we made by combining derivatives, we don’t compute ∇z L explicitly, but we still have to backpropagate the value of z itself.]
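[A small NumPy sketch of the simplified gradients above for a single softmax output layer; the vectors h, W, and y here are random placeholders, and regularization is omitted.]

    import numpy as np

    def softmax(t):
        e = np.exp(t - np.max(t))
        return e / e.sum()

    rng = np.random.default_rng(0)
    h = rng.normal(size=5)            # hidden-unit values (placeholder)
    W = rng.normal(size=(3, 5))       # output-layer weights, k = 3 classes
    y = np.array([0.0, 1.0, 0.0])     # target labels summing to 1

    z = softmax(W @ h)                # network output
    grad_W = np.outer(z - y, h)       # nabla_W L = (z - y) h^T
    grad_h = W.T @ (z - y)            # nabla_h L = W^T (z - y)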
(5) Replace sigmoids with ReLUs: rectified linear units.
ramp fn aka hinge fn: r(γ) = max{0, γ}
r′(γ) = 1 for γ ≥ 0; 0 for γ < 0
[Draw the graph of the ramp function r(γ) by hand.]
[The derivative is not defined at zero, but we just pretend it is.]
Popular for many-layer networks with large training sets.
[One nice thing about ramp functions is that they and their gradients are very fast to compute. Com- puters compute exponentials slowly. Even though ReLUs are linear in each half of their range, they’re still nonlinear enough to easily compute functions like XOR.]
[Obviously, the gradient is sometimes zero, so you might wonder if ReLUs can get stuck too. Fortu- nately, it’s rare for a ReLU’s gradient to be zero for all the training data; it’s usually zero for just some sample points. But yes, ReLUs sometimes get stuck too; just not as often as sigmoids.]
[The output of a ReLU can be arbitrarily large, creating the danger that it might overwhelm units downstream. This is called the “exploding gradient problem,” and it is not a big problem in shallow networks, but it becomes a big problem in deep or recurrent networks.]
[Note that option (5) makes options (2)–(4) irrelevant.]
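[A tiny NumPy sketch of the ramp function and the derivative we pretend it has.]

    import numpy as np

    def relu(gamma):
        # ramp function r(gamma) = max{0, gamma}
        return np.maximum(0.0, gamma)

    def relu_prime(gamma):
        # 1 for gamma >= 0, 0 for gamma < 0 (we pretend the derivative exists at 0)
        return (gamma >= 0).astype(float)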
gradsoftmax5.pdf
19 Better Neural Network Training; Convolutional Neural Networks
[I’m going to talk about a bunch of heuristics that make gradient descent faster, or make it find better local minima, or prevent it from overfitting. I suggest you implement vanilla stochastic backprop first, and experiment with the other heuristics only after you get that working.]
Heuristics for Faster Training
[A big disadvantage of neural nets is that they take a long, long time to train compared to other classification methods we’ve studied. Here are some ways to speed them up. Unfortunately, you usually have to experiment with techniques and hyperparameters to find which ones will help with your particular application.]
– (1)–(5) above. [Fix the unit saturation problem.]
– Stochastic gradient descent: faster than batch on large, redundant data sets.
[Whereas batch gradient descent walks downhill on one cost function, stochastic descent takes a very short step downhill on one point’s loss function and then another short step on another point’s loss function. The cost function is the sum of the loss functions over all the sample points, so one batch step behaves similarly to n stochastic steps and takes roughly the same amount of time. But if you have many different examples of the digit “9”, they contain much redundant information, and stochastic
gradient descent learns the redundant information more quickly.]
batchvsstoch.pdf (LeCun et al., “Efficient BackProp”) [Left: a simple neural net with only three weights, and its 2D training data. Center: batch gradient descent makes only a little progress each epoch. (Epochs alternate between red and blue.) Right: stochastic descent decreases the error much faster than batch descent.]
One epoch presents every training example once. Training usually takes many epochs, but if sample is huge [and carries lots of redundant information] it can take less than one.
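[A rough Python sketch of one epoch of stochastic gradient descent; point_gradient is a hypothetical placeholder for backprop on a single sample point, and epsilon is the learning rate.]

    import numpy as np

    def sgd_epoch(w, X, y, point_gradient, epsilon, rng=np.random.default_rng(0)):
        # One epoch: a short step downhill on each sample point's loss, in random order.
        # point_gradient(w, x_i, y_i) stands in for backprop on one point.
        for i in rng.permutation(len(X)):
            w = w - epsilon * point_gradient(w, X[i], y[i])
        return w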
– Normalizing the data.
– Center each feature so mean is zero.
– Then scale each feature so variance ≈ 1.
[The first step seems to make it easier for hidden units to get into a good operating region of the sigmoid or ReLU. The second step makes the objective function better conditioned, so gradient descent converges faster.]
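[A short NumPy sketch of this normalization step; leaving constant features unscaled is my own assumption, not from the lecture.]

    import numpy as np

    def normalize(X):
        # Center each feature (column) so its mean is zero, then scale so its
        # variance is approximately 1; constant features are left unscaled.
        X = X - X.mean(axis=0)
        std = X.std(axis=0)
        std[std == 0] = 1.0
        return X / std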
normalize.jpg [A 2D example of data normalization.]
illcondition.pdf [Skewed data often leads to an objective function with an ill-conditioned (highly eccentric) Hessian. Gradient descent in these functions can be painfully slow, as this figure shows. Normalizing helps by reducing the eccentricity. Whitening reduces the eccentricity even more, but it’s much more expensive. Another thing that helps with elongated troughs like this is momentum, which we’ll discuss shortly. It eventually gets us moving fast along the long axis.]
– “Centering” the hidden units helps too.
Replace sigmoids with tanh γ = (e^γ − e^{−γ}) / (e^γ + e^{−γ}) = 2 s(2γ) − 1.
[This function ranges from −1 to 1 instead of from 0 to 1.]
[If you use tanh units, don’t forget that you also need to change backprop to replace s′ with the derivative of tanh. Also, good output target values change to roughly 0.7 and −0.7.]
– Different learning rate for each layer of weights.
Earlier layers have smaller gradients, need larger learning rate.
[In this illustration, the inputs are at the bottom, and the outputs at the top. The derivatives tend to be smaller at the earlier layers.]
– Emphasizing schemes.
[Neural networks learn the most redundant examples quickly, and the most rare examples slowly. So we try to emphasize the uncommon examples.]
– Present examples from rare classes more often, or w/ bigger ε.
– Same for misclassified examples.
[Be forewarned that emphasizing schemes can backfire if you have really bad outliers.]
– Second-order optimization.
[Unfortunately, Newton’s method is completely impractical, because the Hessian is too large and expensive to compute. There have been a lot of attempts to incorporate curvature information into neural net learning in cheaper ways, but none of them are popular yet.]
– Nonlinear conjugate gradient: works well for small nets + small data + regression. Batch descent only! → Too slow with redundant data.
– Stochastic Levenberg Marquardt: approximates a diagonal Hessian.
[The authors claim convergence is typically three times faster than well-tuned stochastic gradient descent. The algorithm is complicated.]
– Acceleration schemes: RMSprop, Adam, AMSGrad.
[These are quite popular. Look them up online if you’re curious.]
curvaturelayers.pdf
Heuristics for Avoiding Bad Local Minima
– (1)–(5) above. [Fix the unit saturation problem.]
– Stochastic gradient descent. “Random” motion gets you out of shallow local minima.
[Even if you’re at a local minimum of the cost function, each sample point’s loss function will be trying to push you in a different direction. Stochastic gradient descent looks like a random walk or Brownian motion, which can shake you out of a shallow minimum.]
– Momentum. Gradient descent changes “velocity” ∆w slowly. Carries us through shallow local minima to deeper ones.
∆w ← −ε ∇J(w) repeat
w ← w + ∆w
∆w ← −ε ∇J(w) + β ∆w
Good for both batch & stochastic. Choose hyperparameter β < 1.
[The hyperparameter β specifies how much momentum persists from iteration to iteration.]
[I’ve seen conflicting advice on β. Some researchers set it to 0.9; some set it close to zero. Geoff Hinton suggests starting at 0.5 and slowly increasing it to 0.9 or higher as the gradients get small.] [If β is large, you should usually choose ε small to compensate, but you might still use a large ε in the first line so the initial velocity is reasonable.]
[A problem with momentum is that once it gets close to a good minimum, it oscillates around the minimum. But it’s more likely to get close to a good minimum in the first place.]
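[A direct transcription of the momentum pseudocode above into Python; grad_J is a placeholder for a function returning the gradient of the cost function, and w is assumed to be a NumPy array.]

    def descend_with_momentum(w, grad_J, epsilon, beta, iterations):
        # The "velocity" delta_w changes slowly, carrying us through
        # shallow local minima to (we hope) deeper ones.  beta < 1.
        delta_w = -epsilon * grad_J(w)
        for _ in range(iterations):
            w = w + delta_w
            delta_w = -epsilon * grad_J(w) + beta * delta_w
        return w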
Heuristics to Avoid Overfitting
– Ensemble of neural nets. Random initial weights + bagging.
[We saw how well ensemble learning works for decision trees. It works well for neural nets too. The combination of random initial weights and bagging helps ensure that each neural net comes out differently. Obviously, ensembles of neural nets are slow to train.]
– l2 regularization, aka weight decay.
Add λ ∥w∥2 to the cost/loss fn, where w is vector of all weights.
[w includes all the weights in matrices V and W, rewritten as a vector.]
[We do this for the same reason we do it in ridge regression: penalizing large weights reduces overfitting by reducing the variance of the method.]
[With a neural network, it’s not clear whether penalizing the bias terms is bad or good. If you penalize the bias terms, regularization has the effect of drawing each ReLU or sigmoid unit closer to the center of its operating region. I suggest trying both ways and using validation to decide whether you should penalize the bias terms or not.]
Effect: the gradient term −ε ∂J/∂w_i has an extra term −2ελ w_i.
Weight decays by factor 1 − 2ελ if not reinforced by training.
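[A minimal sketch of one gradient step with weight decay; grad_loss is a placeholder for the unregularized gradient, and whether w includes the bias terms is up to you, as discussed above.]

    def l2_regularized_step(w, grad_loss, epsilon, lambda_):
        # One step on loss + lambda * ||w||^2.  The extra term -2*epsilon*lambda_*w
        # shrinks each weight by a factor of (1 - 2*epsilon*lambda_) when the weight
        # is not reinforced by training.
        return w - epsilon * (grad_loss(w) + 2.0 * lambda_ * w)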
Write “10 hidden units + softmax + cross-entropy loss”.
weightdecayoff.pdf, weightdecayon.pdf (ESL, Figure 11.4) [Examples of 2D classification without (left, “Neural Network - 10 Units, No Weight Decay”) and with (right, “Neural Network - 10 Units, Weight Decay=0.02”) weight decay. Training/test errors: 0.100/0.259 without weight decay and 0.160/0.223 with weight decay; the Bayes error is 0.210. Observe that the second example better approximates the Bayes optimal boundary (dashed purple curve).]
– Dropout emulates an ensemble in one network.
dropout.pdf
[During training, we temporarily disable a random subset of the units, along with all the edges in and out of those units. It seems to work well to disable each hidden unit with probability 0.5, and to disable input units with a smaller probability. We do stochastic gradient descent and we frequently change which random subset of units is disabled. The authors claim that their method gives even better generalization than l2 regularization. It gives some of the advantages of an ensemble, but it’s faster to train.]
– Fewer hidden units.
[The number of hidden units is a hyperparameter you can use to adjust the bias-variance tradeoff. If there are too few, you can’t learn well, but if there are too many, you may overfit. l2 regularization and dropout make it safer to have too many hidden units, so it’s less critical to find just the right number.]
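[A minimal NumPy sketch of the dropout heuristic described above; dividing by 1 − p_drop (“inverted dropout”) is one common convention, assumed here so that no rescaling is needed at test time.]

    import numpy as np

    def dropout(h, p_drop=0.5, rng=np.random.default_rng(0)):
        # During training, disable each hidden unit with probability p_drop.
        # Scaling by 1/(1 - p_drop) keeps the expected activation unchanged.
        mask = rng.random(h.shape) >= p_drop
        return h * mask / (1.0 - p_drop)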
CONVOLUTIONAL NEURAL NETWORKS (ConvNets; CNNs)
[Convolutional neural nets have caused a big resurgence of interest in neural nets in the last decade. Often you’ll hear the buzzword deep learning, which refers to neural nets with many layers. Many of the most successful deep networks have been convolutional. Just last year, the ACM announced that the 2018 Alan M. Turing Award was awarded to Geoff Hinton, Yann LeCun, and Yoshua Bengio for their work on deep neural networks.]
Vision: inputs are large images. 200 × 200 image = 40,000 pixels.
If we connect them all to 40,000 hidden units → 1.6 billion connections. Neural nets are often overparametrized: too many weights, too little data.
[As a rule of thumb, if you have hugely many weights, you want a huge amount of data to train them. A bigger problem with having billions of weights is that the network becomes very slow to train or even to use.]
[Researchers have addressed these problems by taking inspiration from the neurology of the visual system. Remember that early in the semester, I told you that you can get better performance on the handwriting recognition task by using edge detectors. Edge detectors have two interesting properties. First, each edge detector looks at just one small part of the image. Second, the edge detection computation is the same no matter which part of the image you apply it to. So let’s apply these two properties to neural net design.]
ConvNet ideas:
(1) Local connectivity: A hidden unit (in early layer) connects only to small patch of units in previous layer.
[This improves the overparametrization problem, and speeds up both the forward pass and the training considerably.]
(2) Shared weights: Groups of hidden units share same set of input weights, called a mask aka filter aka kernel. [No relation to the kernels of Lecture 16.] We learn several masks.
[Each mask operates on every patch of image.]
Masks × patches = hidden units in first hidden layer.
If net learns to detect edges on one patch, every patch has an edge detector.
[Because the mask that detects edges is applied to every patch.]
ConvNets exploit repeated structure in images, audio.
Convolution: the same linear transformation applied to different parts of the input by shifting.
[Shared weights improve the overparametrization problem even more, because shared weights means fewer weights. It’s a kind of regularization.]
[But shared weights have another big advantage. Suppose that gradient descent starts to develop an edge detector. That edge detector is being trained on every part of every image, not just on one spot. And that’s good, because edges appear at different locations in different images. The location no longer matters; the edge detector can learn from edges in every part of the image.]
[In a neural net, you can think of hidden units as features that we learn, as opposed to features that you code up yourself. Convolutional neural nets take them to the next level by learning features from multiple patches simultaneously and then applying those features everywhere, not just in the patches where they were originally learned.]
[By the way, although local connectivity is inspired by our visual systems, shared weights obviously don’t happen in biology.]
[Show slides on computing in the visual cortex and ConvNets, available from the CS 189 web page at https://people.eecs.berkeley.edu/~jrs/189/lec/cnn.pdf . Sorry, readers, there are too many images to include here. The narration is below.]
[Neurologists can stick needles into individual neurons in animal brains. After a few hours the neuron dies, but until then they can record its action potentials. In this way, biologists quickly learned how some of the neurons in the retina, called retinal ganglion cells, respond to light. They have interesting receptive fields, illustrated in the slides, which show that each ganglion cell receives excitatory stimulation from receptors in a small patch of the retina but inhibitory stimulation from other receptors around it.]
[The signals from these cells propagate to the V1 visual cortex in the occipital lobe at the back of your skull. The V1 cells proved much harder to understand. David Hubel and Torsten Wiesel of the Johns Hopkins University put probes into the V1 visual cortex of cats, but they had a very hard time getting any neurons to fire there. However, a lucky accident unlocked the secret and ultimately won them the 1981 Nobel Prize in Physiology or Medicine.]
[Show video HubelWiesel.mp4, taken from YouTube: https://www.youtube.com/watch?v=IOHayh06LJ4 ] [The glass slide happened to be at the particular orientation the neuron was sensitive to. The neuron doesn’t
respond to other orientations; just that one. So they were pretty lucky to catch that.]
[The simple cells act as line detectors and/or edge detectors by taking a linear combination of inputs from retinal ganglion cells.]
[The complex cells act as location-independent line detectors by taking inputs from many simple cells, which are location dependent.]
[Later researchers showed that local connectivity runs through the V1 cortex by projecting certain images onto the retina and using radioactive tracers in the cortex to mark which neurons had been firing. Those images show that the neural mapping from the retina to V1 is retinotopic, i.e., locality preserving. This is a big part of the inspiration for convolutional neural networks!]
[Unfortunately, as we go deeper into the visual system, layers V2 and V3 and so on, we know less and less about what processing the visual cortex does.]
Architecture of LeNet5.
LeNet5.png
[ConvNets were first popularized by the success of Yann LeCun’s “LeNet 5” handwritten digit recognition software. LeNet 5 has six hidden layers! Hidden layers 1 and 3 are convolutional layers in which groups of units share weights. Layers 2 and 4 are pooling layers that make the image smaller. These are just hardcoded max-functions with no weights and nothing to train. Layers 5 and 6 are just regular layers of hidden units with no shared weights. A great deal of experimentation went into figuring out the number of layers and their sizes. At its peak, LeNet 5 was responsible for reading the zip codes on 10% of US Mail. Another Yann LeCun system was deployed in ATMs and check reading machines and was reading 10 to 20% of all the checks in the US by the late 90’s. LeCun is one of the Turing Award winners I told you about earlier.]
[Show Yann LeCun’s video LeNet5.mov, illustrating LeNet 5.]
[When ConvNets were first applied to image analysis, researchers found that some of the learned masks are edge detectors or line detectors, similar to the ones that Hubel and Wiesel discovered! This created a lot of excitement in both the computer learning community and the neuroscience community. The fact that a neural net can naturally learn the same features as the mammalian visual cortex is impressive.]
[I told you two lectures ago that neural nets research was popular in the 60’s, but the 1969 book Perceptrons killed interest in them throughout the 70’s. They came back in the 80’s, but interest was partly killed off a second time in the 00’s by . . . guess what? By support vector machines. SVMs work well for a lot of tasks, they’re much faster to train, and they more or less have only one hyperparameter, whereas neural nets take a lot of work to tune.]
[Neural nets are now in their third wave of popularity. The single biggest factor in bringing them back is probably big data. Thanks to the internet, we now have absolutely huge collections of images to train neural nets with, and researchers have discovered that neural nets often give better performance than competing algorithms when you have huge amounts of data to train them with. In particular, convolutional neural nets are now learning better features than hand-tuned features. That’s a recent change.]
[One event that brought attention back to neural nets was the ImageNet Image Classification Challenge in 2012. The winner of that competition was a neural net, and it won by a huge margin, about 10%. It’s called AlexNet, and it’s surprisingly similar to LeNet 5 in terms of how its layers are structured. However, there are some new innovations that led to its prize-winning performance, in addition to the fact that the training set had 1.4 million images: they used ReLUs, dropout, and GPUs for training.]
Architecture of AlexNet.
alexnet.pdf [Figure 2 of the AlexNet paper: an illustration of the architecture of the CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264–4096–4096–1000.]
[If you want to learn more about deep neural networks, there’s a whole undergraduate class at Berkeley just for you: CS 182.]
20 Unsupervised Learning and Principal Components Analysis
UNSUPERVISED LEARNING
We have sample points, but no labels!
No classes, no y-values, nothing to predict. Goal: Discover structure in the data.
Examples:
– Clustering: partition data into groups of similar/nearby points.
– Dimensionality reduction: data often lies near a low-dimensional subspace (or manifold) in feature
space; matrices have low-rank approximations.
[Whereas clustering is about grouping similar sample points, dimensionality reduction is more about identifying a continuous variation from sample point to sample point.]
– Density estimation: fit a continuous distribution to discrete data.
[When we use maximum likelihood estimation to fit Gaussians to sample points, that’s density estimation, but we can also fit functions more complicated than Gaussians.]
PRINCIPAL COMPONENTS ANALYSIS (PCA) (Karl Pearson, 1901)
Goal: Given sample points in Rd, find k directions that capture most of the variation. (Dimensionality reduction.)
3dpca.pdf [Example of 3D points projected to 2D by PCA. The axes are the first and second principal components.]
pcadigits.pdf (Laurens van der Maaten and Geoffrey Hinton, JMLR 2008) [The (high-dimensional) MNIST digits projected to 2D (from 784D). Two dimensions are not enough to fully separate the digits, but observe that the digits 3 (red) and 1 (orange) are well on their way to being separated.]
Why?
– Find a small basis for representing variations in complex things, e.g., faces, genes.
– Reducing # of dimensions makes some computations cheaper, e.g., regression.
– Remove irrelevant dimensions to reduce overfitting in learning algs.
Like subset selection, but the “features” aren’t axis-aligned; they’re linear combos of input features.
[Sometimes PCA is used as a preprocess before regression or classification for the last two reasons.]
Let X be n × d design matrix. [No fictitious dimension.]
From now on, assume X is centered: mean Xi is zero.
[As usual, we can center the data by computing the mean x-value, then subtracting the mean from each sample point.]
[Let’s start by seeing what happens if we pick just one principal direction.] Let w be a unit vector.
The orthogonal projection of point x onto vector w is x̃ = (x · w) w.
If w is not a unit vector, x̃ = ((x · w) / ∥w∥²) w.
[Draw a point x, a vector w, and the projection x̃ of x onto the line through w.]
[The idea is that we’re going to pick the best direction w, then project all the data down onto w so we can analyze it in a one-dimensional space. Of course, we lose a lot of information when we project down from d dimensions to just one. So, suppose we pick several directions. Those directions span a subspace, and we want to project points orthogonally onto the subspace. This is easy if the directions are orthogonal to each other.]
Given orthonormal directions v_1, . . . , v_k, x̃ = Σ_{i=1}^k (x · v_i) v_i.
[The word “orthonormal” implies they’re all mutually orthogonal and all have length 1.]
Often we want just the k principal coordinates x · vi in principal component space. [Often we don’t actually want the projected point in Rd.]
X⊤X is square, symmetric, positive semidefinite, d × d matrix. [As it’s symmetric, its eigenvalues are real.] Let 0 ≤ λ1 ≤ λ2 ≤ ... ≤ λd be its eigenvalues. [sorted]
Let v1, v2, . . . , vd be corresponding orthogonal unit eigenvectors. These are the principal components.
[. . . and the most important principal components will be the ones with the greatest eigenvalues. I will show you this in three different ways.]
PCA derivation 1: Fit a Gaussian to data with maximum likelihood estimation. Choose k Gaussian axes of greatest variance.
[A Gaussian fitted to sample points.]
Recall that MLE estimates a covariance matrix Σ̂ = (1/n) X⊤X. [Presuming X is centered.]
PCA Alg:
– Center X.
– Optional: Normalize X. Units of measurement different?
– Yes: Normalize.
[Bad for principal components to depend on arbitrary choice of scaling.]
– No: Usually don’t.
[If several features have the same unit of measurement, but some of them have smaller variance than others, that difference is usually meaningful.]
– Compute unit eigenvectors/values of X⊤X.
gaussfitpca.png
– Optional: choose k based on the eigenvalue sizes.
– For the best k-dimensional subspace, pick eigenvectors v_{d−k+1}, . . . , v_d.
– Compute the principal coordinates x · v_i of training/test data.
[When we do this projection, we have two choices: we can un-center the input training data before projecting it, OR we can translate the test data by the same vector we used to translate the training data when we centered it.]
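[A compact NumPy sketch of the PCA algorithm above, using the eigendecomposition of X⊤X; np.linalg.eigh returns eigenvalues in increasing order, so the last k columns are v_{d−k+1}, . . . , v_d. Normalization is left out here.]

    import numpy as np

    def pca(X, k):
        # X is an n-by-d design matrix.  Returns the k unit eigenvectors of X^T X
        # with the largest eigenvalues (as columns) and the principal coordinates.
        X = X - X.mean(axis=0)                       # center X
        eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigenvalues in increasing order
        V = eigvecs[:, -k:]                          # v_{d-k+1}, ..., v_d
        return V, X @ V                              # principal coordinates x . v_i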
[Projection of 4D data onto a 2D subspace. Each point represents one metropolitan area. Normalized data at left; unnormalized data at right. The arrows show the four original axes projected on the two principal components. When the data are not normalized, rare occurrences like murder have little influence on the principal directions. Which is better? It depends on whether you think that low-frequency events like murder and rape should have a disproportionate influence.]
% of variability = (Σ_{i=1}^k λ_{d+1−i}) / (Σ_{i=1}^d λ_i)
[Plot of # of eigenvectors vs. percentage of sample variance captured for a 17D data set. In this example, just 3 eigenvectors capture 70% of the variance.]
normalize.pdf
variance.pdf
[If you are using PCA as a preprocess for a supervised learning algorithm, there’s a more effective way to choose k: (cross-)validation.]
PCA derivation 2: Find direction w that maximizes sample variance of projected data
[In other words, when we project the data down, we don’t want it all to bunch up; we want to keep it as spread out as possible.]
project.jpg [Points projected on a line. We wish to choose the orientation of the green line to maximize the sample variance of the blue points.]
Find w that maximizes Var({X̃_1, X̃_2, . . . , X̃_n}) = (1/n) Σ_{i=1}^n (X_i · w/∥w∥)² = (1/n) ∥Xw∥² / ∥w∥² = (1/n) (w⊤X⊤Xw) / (w⊤w).
[This fraction is a well-known construction called the Rayleigh quotient of X⊤X and w. When you see it, you should smell eigenvectors nearby. How do we maximize this?]
If w is an eigenvector vi, Ray. quo. = λi
→ of all eigenvectors, vd achieves maximum variance λd/n.
One can show vd beats every other vector too.
[Because every vector w is a linear combination of eigenvectors, and so its Rayleigh quotient will be a convex combination of eigenvalues. It’s easy to prove this, but I don’t have the time. For the proof, look up “Rayleigh quotient” in Wikipedia.]
[So the top eigenvector gives us the best direction. But we typically want k directions. After we’ve picked one direction, then we have to pick a direction that’s orthogonal to the best direction. But subject to that constraint, we again pick the direction that maximizes the sample variance.]
What if we constrain w to be orthogonal to vd? Then vd−1 is optimal.
[And if we need a third direction orthogonal to vd and vd−1, the optimal choice is vd−2. And so on.]
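[A tiny numerical check of this claim: on random placeholder data, the Rayleigh quotient of the top eigenvector beats that of an arbitrary direction.]

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    X = X - X.mean(axis=0)

    def rayleigh(w):
        return (w @ X.T @ X @ w) / (w @ w)

    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    v_top = eigvecs[:, -1]                  # eigenvector with the largest eigenvalue
    w = rng.normal(size=3)                  # an arbitrary competing direction
    print(rayleigh(v_top) >= rayleigh(w))   # True: v_d maximizes the Rayleigh quotient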
PCA derivation 3: Find direction w that minimizes “projection error”
[This is an animated GIF; unfortunately, the animation doesn’t work in the PDF lecture notes. Find the direction of the black line for which the sum of squares of the lengths of the red lines is smallest.]
[You can think of this as a sort of least-squares linear regression, with one subtle but important change. Instead of measuring the error in a fixed vertical direction, we’re measuring the error in a direction orthogonal to the principal component direction we choose.]
[Least-squares linear regression vs. PCA. In linear regression, the projection direction is always vertical; whereas in PCA, the projection direction is orthogonal to the projection hyperplane. In both methods, however, we minimize the sum of
the squares of the projection distances.]
Minimizing projection error = maximizing variance.
[From this point, we carry on with the same reasoning as derivation 2.]
PCAanimation.gif
projlsq.png, projpca.png
Find w that minimizes Σ_{i=1}^n ∥X_i − X̃_i∥² = Σ_{i=1}^n ∥X_i − ((X_i · w)/∥w∥²) w∥² = Σ_{i=1}^n ( ∥X_i∥² − (X_i · w/∥w∥)² )
= constant − n (variance from derivation 2).
[Illustration of the first two prin- cipal components of the single nucleotide polymorphism (SNP) matrix for the genes of var- ious Europeans. The input matrix has 2,541 people from these locations in Europe (right), and 309,790 SNPs. Each SNP is binary, so think of it as 309,790 dimensions of zero or one. The output (left) shows spots on the first two principal components where there was a high density of projected people from a particular national type. What’s amazing about this is
how closely the projected genotypes resemble the geography of Europe.]
Eigenfaces
X contains n images of faces, d pixels each.
[If we have a 200 × 200 image of a face, we represent it as a vector of length 40,000, the same way we represent the MNIST digit data.]
Face recognition: Given a query face, compare it to all training faces; find nearest neighbor in Rd.
[This works best if you have several training photos of each person you want to recognize, with different lighting and different facial expressions.]
Problem: Each query takes Θ(nd) time.
Solution: Run PCA on faces. Reduce to much smaller dimension d′.
Now nearest neighbor takes O(nd′) time.
[Possibly even less. We’ll talk about speeding up nearest-neighbor search at the end of the semester. If the dimension is small enough, you can sometimes do better than linear time.]
[If you have 500 stored faces with 40,000 pixels each, and you reduce them to 40 principal components, then each query face requires you to read 20,000 stored principal coordinates instead of 20 million pixels.]
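[A sketch of that pipeline, assuming the training faces are already vectorized as the rows of a matrix; the helper names are hypothetical.]

```python
import numpy as np

def pca_basis(faces, d_prime):
    """Mean face and the top d' principal directions (as columns), via the SVD."""
    mu = faces.mean(axis=0)
    _, _, Vt = np.linalg.svd(faces - mu, full_matrices=False)
    return mu, Vt[:d_prime].T                  # d x d' matrix of principal directions

def recognize(query, faces, labels, d_prime=40):
    """1-nearest-neighbor face recognition in the d'-dimensional PCA subspace."""
    mu, V = pca_basis(faces, d_prime)
    coords = (faces - mu) @ V                  # n x d' principal coordinates
    q = (query - mu) @ V
    return labels[np.argmin(((coords - q) ** 2).sum(axis=1))]
```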
europegenetics.pdf (Lao et al., Current Biology, 2008.)
facerecaverage.jpg, facereceigen0.jpg, facereceigen119.jpg, facereceigen.jpg
[Images of the eigenfaces. The “average face” is the mean used to center the data.]
[Images of a face (left) projected onto the first 4 and 50 eigenvectors, with the average face added back. The last image is blurry but good enough for face recognition.]
eigenfaceproject.pdf
For best results, equalize the intensity distributions first.
[Image equalization.]
[Eigenfaces are not perfect. They encode both face shape and lighting. Ideally, we would have some way to factor out lighting and analyze face shape only, but that’s harder. Some people say that the first 3 eigenfaces are usually all about lighting, and you sometimes get better facial recognition by dropping the first 3 eigenfaces.]
[Show Blanz–Vetter face morphing video (morphmod.mpg).]
[Blanz and Vetter use PCA in a more sophisticated way for 3D face modeling. They take 3D scans of people’s faces and find correspondences between peoples’ faces and an idealized model. For instance, they identify the tip of your nose, the corners of your mouth, and other facial features, which is something the original eigenface work did not do. Instead of feeding an array of pixels into PCA, they feed the 3D locations of various points on your face into PCA. This works more reliably.]
facerecequalize.jpg
21 The Singular Value Decomposition; Clustering
The Singular Value Decomposition (SVD) [and its Application to PCA]
Problems: Computing X⊤X takes Θ(nd²) time.
X⊤X is poorly conditioned → numerically inaccurate eigenvectors. [The SVD improves both these problems.]
[Earlier this semester, we learned about the eigendecomposition of a square, symmetric matrix. Unfortu- nately, nonsymmetric matrices don’t eigendecompose nearly as nicely, and non-square matrices don’t have eigenvectors at all. Happily, there is a similar decomposition that works for all matrices, even if they’re not symmetric and not square.]
Fact: If n ≥ d, we can find a singular value decomposition X = UDV⊤
X = U D V⊤ = ∑_{i=1}^{d} δi ui vi⊤   [a sum of rank-1 outer product matrices]
where U is n × d with orthonormal columns (U⊤U = I); the ui's are the left singular vectors of X;
D is d × d and diagonal, with diagonal entries δ1, δ2, ..., δd;
V is d × d with orthonormal columns (V⊤V = I); the vi's are the right singular vectors of X.
[Draw this by hand; write summation at the right last. ] Diagonal entries δ1, . . . , δd of D are nonnegative singular values of X.
[Some of the singular values might be zero. The number of nonzero singular values is equal to the rank of X. If X is a centered design matrix for sample points that all lie on a line, there is only one nonzero singular value. If the centered sample points span a subspace of dimension r, there are r nonzero singular values.]
[If n < d, an SVD still exists, but now U is square and V is not.]
Fact: vi is an eigenvector of X⊤X w/eigenvalue δi². Proof: X⊤X = VDU⊤UDV⊤ = VD²V⊤
which is an eigendecomposition of X⊤X.
[The columns of V are the eigenvectors of X⊤X, which is what we need for PCA. The SVD also tells us their eigenvalues, which are the squares of the singular values. By the way, that’s related to why the SVD is more numerically stable: the ratios between singular values are smaller than the ratios between eigenvalues. If n < d, V will omit some of the eigenvectors that have eigenvalue zero, but those are useless for PCA.]
svd.pdf
Fact:
We can find the k greatest singular values & corresponding vectors in O(ndk) time.
[So we can save time by computing some of the singular vectors without computing all of them.] [There are approximate, randomized algorithms that are even faster, producing an approximate SVD in O(nd log k) time. These are starting to become popular in algorithms for very big data.] [ https://code.google.com/archive/p/redsvd/ ]
Important: Row i of UD gives the principal coordinates of sample point Xi (i.e., ∀j, Xi · vj = δj Uij). [So we don't need to explicitly compute the inner products Xi · vj; the SVD has already done it for us.] [Proof: XV = UDV⊤V = UD.]
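[A short NumPy sketch of PCA done this way, skipping X⊤X entirely; the names are illustrative.]

```python
import numpy as np

def pca_via_svd(X, k):
    """Top-k principal coordinates and directions without ever forming X.T @ X."""
    Xc = X - X.mean(axis=0)                          # center the design matrix
    U, D, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = U[:, :k] * D[:k]                        # row i = principal coordinates of Xc[i]
    return coords, Vt[:k].T                          # n x k coordinates, d x k directions
```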
CLUSTERING
Partition data into clusters so points in a cluster are more similar than across clusters. Why?
– Discovery: Find songs similar to songs you like; determine market segments
– Hierarchy: Find good taxonomy of species from genes
– Quantization: Compress a data set by reducing choices
– Graph partitioning: Image segmentation; find groups in social networks
zito.pdf (from a talk by Michael Pane)
[k-means clusters that classify Barry Zito's baseball pitches, plotted by start speed, side spin, and back spin. The five clusters correspond to his 4-seam fastball, 2-seam fastball, changeup, slider, and curveball. Here we discover that there really are distinct classes of baseball pitches.]
k-Means Clustering aka Lloyd's Algorithm (Stuart Lloyd, 1957)
Goal:
Partition n points into k disjoint clusters.
Assign each input point Xi a cluster label yi ∈ [1, k].
Cluster i's mean is μi = (1/ni) ∑_{yj = i} Xj, given ni points in cluster i.
Find y that minimizes ∑_{i=1}^{k} ∑_{yj = i} ∥Xj − μi∥²
[Sum of the squared distances from points to their cluster means.]
NP-hard. Solvable in O(nk^n) time. [Try every partition.]
k-means heuristic: Alternate between
(1) yj’s are fixed; update μi’s
(2) μi’s are fixed; update yj’s
Halt when step (2) changes no assignments.
[So, we have an assignment of points to clusters. We compute the cluster means. Then we reconsider the assignment. A point might change clusters if some other’s cluster’s mean is closer than its own cluster’s mean. Then repeat.]
Step (1): One can show (calculus) the optimal μi is the mean of the points in cluster i.
[This is easy calculus, so I leave it as a short exercise.]
Step (2): The optimal y assigns each point Xj to the closest center μi.
[This should be even more obvious than step (1).]
[If there’s a tie, and one of the choices is for X j to stay in the same cluster as the previous iteration, always take that choice.]
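[A compact NumPy sketch of Lloyd's algorithm with Forgy initialization, written for clarity rather than speed; it does not implement the tie-breaking rule above.]

```python
import numpy as np

def kmeans(X, k, rng=None):
    """Lloyd's algorithm: alternate between updating assignments and means."""
    rng = np.random.default_rng(0) if rng is None else rng
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # Forgy initialization
    y = None
    while True:
        # Step (2): assign each point to its closest mean.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        y_new = dists.argmin(axis=1)
        if y is not None and np.array_equal(y_new, y):
            return y, mu                              # no assignment changed: halt
        y = y_new
        # Step (1): recompute each cluster's mean (skip clusters that emptied out).
        for i in range(k):
            if np.any(y == i):
                mu[i] = X[y == i].mean(axis=0)
```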
[An example of 2-means. Odd-numbered steps reassign the data points. Even-numbered steps compute new means.]
2means.png
[This is an animated GIF of 4-means with many points. Unfortu- nately, the animation doesn’t work in the PDF lecture notes.]
Both steps decrease objective fn unless they change nothing.
[Therefore, the algorithm never returns to a previous assignment.]
Hence alg. must terminate. [As there are only finitely many assignments.]
[This argument says that Lloyd's algorithm never loops forever. But it doesn't say anything optimistic about the running time, because we might see O(k^n) different assignments before we halt. In theory, one can actually construct point sets in the plane that take an exponential number of iterations, but those don't come up in practice.]
Usually very fast in practice. Finds a local minimum, often not global.
[. . . which is not surprising, as this problem is NP-hard.]
[An example where 4-means clustering fails.]
Getting started:
– Forgy method: choose k random sample points to be initial μi’s; go to (2).
– Random partition: randomly assign each sample point to a cluster; go to (1).
– k-means++: like Forgy, but biased distribution. [Each center is chosen with a preference for points
far from previous centers.]
[k-means++ is a little more work, but it works well in practice and theory. Forgy seems to be better than random partition, but Wikipedia mentions some variants of k-means for which random partition is better.]
4meansanimation.gif
4meansbad.png
For best results, run k-means multiple times with random starts.
kmeans6times.pdf (ISL, Figure 10.7) [Clusters found by running 3-means 6 times on the same sample points, each time starting with a different random partition. The algorithm finds three different local minima; the objective function values of the six runs are 320.9, 235.8, 235.8, 235.8, 235.8, and 310.9.]
[Why did we choose that particular objective function to minimize? Partly because it is equivalent to mini- mizing the following function.]
Equivalent objective fn: the within-cluster variation
Find y that minimizes ∑_{i=1}^{k} (1/ni) ∑_{yj = i} ∑_{ym = i} ∥Xj − Xm∥²
[At the minimizer, this objective function is equal to twice the previous one. It’s a worthwhile exercise to show that—it’s harder than it looks. The nice thing about this expression is that it doesn’t include the means; it’s a function purely of the input points and the clusters we assign them to. So it’s more convincing.]
Normalize the data? [before applying k-means]
Same advice as for PCA. Sometimes yes, sometimes no.
[If some features are much larger than others, they will tend to dominate the Euclidean distance. So if you have features in different units of measurement, you probably should normalize them. If you have features in the same unit of measurement, you usually shouldn’t, but it depends on context.]
k-Medoids Clustering
Generalizes k-means beyond Euclidean distance. [Means aren’t optimal for other distance metrics.] Specify a distance fn d(x, y) between points x, y, aka dissimilarity.
Can be arbitrary; ideally satisfies triangle inequality d(x, y) ≤ d(x, z) + d(z, y).
[Sometimes people use the l1 norm or the l∞ norm. Sometimes people specify a matrix of pairwise distances between the input points.]
[Suppose you have a database that tells you how many of each product each customer bought. You’d like to cluster together customers who buy similar products for market analysis. But if you cluster customers by Euclidean distance, you’ll get a big cluster of all the customers who have only ever bought one thing. So Euclidean distance is not a good measure of dissimilarity. Instead, it makes more sense to treat each customer as a vector and measure the angle between two customers. If there’s a large angle between customers, they’re dissimilar.]
Replace mean with medoid, the sample point that minimizes total distance to other points in same cluster. [So the medoid of a cluster is always one of the input points.]
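[A tiny sketch of the medoid update for an arbitrary dissimilarity function d; the angular dissimilarity shown is just one plausible choice for the customer example above.]

```python
import numpy as np

def medoid(cluster, d):
    """The point in `cluster` minimizing total dissimilarity to the other points."""
    totals = [sum(d(x, y) for y in cluster) for x in cluster]
    return cluster[int(np.argmin(totals))]

def angular(x, y):
    """Angle between two customer vectors (one possible dissimilarity)."""
    c = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(c, -1.0, 1.0))
```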
[One difficulty with k-means is that you have to choose the number k of clusters before you start, and there isn’t any reliable way to guess how many clusters will best fit the data. The next method, hierarchical clustering, has the advantage in that respect. By the way, there is a whole Wikipedia article on “Determining the number of clusters in a data set.”]
Hierarchical Clustering
Creates a tree; every subtree is a cluster. [So some clusters contain smaller clusters.]
Bottom-up, aka agglomerative clustering:
start with each point a cluster; repeatedly fuse pairs.
[Draw figure of points in the plane; pair clusters together until all points are in one top-level cluster.]
Top-down, aka divisive clustering:
start with all pts in one cluster; repeatedly split it.
[Draw figure of points in the plane; divide points into subsets hierarchically until each point is in its own subset.]
[When the input is a point set, agglomerative clustering is used much more in practice than divisive cluster- ing. But when the input is a graph, it’s the other way around: divisive clustering is more common.]
We need a distance fn for clusters A, B:
complete linkage: d(A, B) = max{d(w, x) : w ∈ A, x ∈ B}
single linkage: d(A, B) = min{d(w, x) : w ∈ A, x ∈ B}
average linkage: d(A, B) = (1 / (|A| |B|)) ∑_{w∈A} ∑_{x∈B} d(w, x)
centroid linkage: d(A, B) = d(μA, μB), where μS is the mean of S
[The first three of these linkages work for any distance function, even if the input is just a matrix of distances between all pairs of points. The centroid linkage only really makes sense if we're using the Euclidean distance. But there's a variation of the centroid linkage that uses the medoids instead of the means, and medoids are defined for any distance function. Moreover, medoids are more robust to outliers than means.]
Greedy agglomerative alg.:
Repeatedly fuse the two clusters that minimize d(A, B)
Naively takes O(n³) time.
[But for complete and single linkage, there are more sophisticated algorithms called CLINK and SLINK, which run in O(n²) time. A package called ELKI has publicly available implementations.]
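[A naive sketch of the greedy agglomerative algorithm for points in Rd; pass `max` for complete linkage or `min` for single linkage. It recomputes linkage distances from scratch each round, unlike CLINK or SLINK.]

```python
import numpy as np

def agglomerative(points, linkage=max):
    """Greedy agglomerative clustering; returns the sequence of fusions (a dendrogram)."""
    d = lambda w, x: np.linalg.norm(points[w] - points[x])   # Euclidean point distance
    clusters = [[i] for i in range(len(points))]             # each point starts as a cluster
    fusions = []
    while len(clusters) > 1:
        # Fuse the two clusters with the smallest linkage distance.
        dist, i, j = min((linkage(d(w, x) for w in A for x in B), i, j)
                         for i, A in enumerate(clusters)
                         for j, B in enumerate(clusters) if i < j)
        fusions.append((clusters[i], clusters[j], dist))      # record the fusion height
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return fusions
```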
Dendrogram: Illustration of the cluster hierarchy (tree) in which the vertical axis encodes all the linkage distances.
[Example of a dendrogram cut into 1, 2, or 3 clusters.]
Cut dendrogram into clusters by horizontal line according to your choice of # of clusters OR intercluster distance.
[It’s important to be aware that the horizontal axis of a dendrogram has no meaning. You could swap some treenode’s left subtree and right subtree and it would still be the same dendrogram. It doesn’t mean anything that two leaves happen to be next to each other.]
dendrogram.pdf (ISL, Figure 10.9)
linkages.pdf (ISL, Figure 10.12) [Comparison of average, complete (max), and single (min) linkages. Observe that the complete linkage gives the best-balanced dendrogram, whereas the single linkage gives a very unbalanced dendrogram that is sensitive to outliers (especially near the top of the dendrogram).]
[Probably the worst of these is the single linkage, because it’s very sensitive to outliers. Notice that if you cut this example into three clusters, two of them have only one sample point. It also tends to give you a very unbalanced tree.]
[The complete linkage tends to be the best balanced, because when a cluster gets large, the furthest point in the cluster is always far away. So large clusters are more resistant to growth than small ones. If balanced clusters are your goal, this is your best choice.]
[In most applications you probably want the average or complete linkage.]
Warning: centroid linkage can cause inversions where a parent cluster is fused at a lower height than its children.
[So statisticians don’t like it, but nevertheless, centroid linkage is popular in genomics.]
[As a final note, all the clustering algorithms we’ve studied so far are unstable, in the sense that deleting a few input points can sometimes give you very different results. But these unstable heuristics are still the most commonly used clustering algorithms. And it’s not clear to me whether a truly stable clustering algorithm is even possible.]
22 Spectral Graph Clustering
SPECTRAL GRAPH CLUSTERING
Input: Weighted, undirected graph G = (V, E). No self-edges. wij = weight of edge (i, j) = (j, i); zero if (i, j) ∉ E.
[Think of the edge weights as a similarity measure. A big weight means that the two vertices want to be in the same cluster. So the circumstances are the opposite of the last lecture on clustering. Then, we had a distance or dissimilarity function, so small numbers meant that points wanted to stay together. Today, big numbers mean that vertices want to stay together.]
Goal:
Cut G into 2 (or more) pieces Gi of similar sizes,
but don’t cut too much edge weight.
[That’s a vague goal. There are many ways to make this precise.
Here’s a typical goal, which we’ll solve approximately.]
e.g., Minimize the sparsity Cut(G1, G2) / (Mass(G1) Mass(G2)), aka cut ratio
where Cut(G1, G2) = total weight of cut edges
Mass(G1) = # of vertices in G1 OR assign masses to vertices
[The denominator “Mass(G1) Mass(G2)” penalizes imbalanced cuts.]
[Four cuts. All edges have weight 1.
Upper left: the minimum bisection; a bisection is perfectly balanced.
Upper right: the minimum cut. Usually very unbalanced; not what we want. Lower left: the sparsest cut, which is good for many applications.
Lower right: the maximum cut; in this case also the maximum bisection.]
Sparsest cut, min bisection, max cut all NP-hard.
[Today we will look for an approximate solution to the sparsest cut problem.]
[We will turn this combinatorial graph cutting problem into algebra.]
graph.pdf
Let n = |V|. Let y ∈ Rn be an indicator vector:
yi = 1 if vertex i ∈ G1;  yi = −1 if vertex i ∈ G2.
Then wij (yi − yj)²/4 = wij if (i, j) is cut;  0 if (i, j) is not cut.
Cut(G1, G2) = ∑_{(i,j)∈E} wij (yi − yj)²/4
[This is quadratic, so let's try to write it with a matrix.]
= (1/4) ∑_{(i,j)∈E} (wij yi² − 2 wij yi yj + wij yj²)
= (1/4) ( ∑_{(i,j)∈E} −2 wij yi yj   [off-diagonal terms]   +   ∑_{i=1}^{n} yi² ∑_{k≠i} wik   [diagonal terms] )
= y⊤Ly / 4,
where Lij = −wij for i ≠ j,  and  Lii = ∑_{k≠i} wik.
L is symmetric, n × n Laplacian matrix for G.
[Draw this by hand ]
[L is effectively a matrix representation of G. For the purpose of partitioning a graph, there is no need to distinguish edges of weight zero from edges that are not in the graph.]
[We see that minimizing the weight of the cut is equivalent to minimizing the Laplacian quadratic form y⊤Ly. This lets us turn graph partitioning into a problem in matrix algebra.]
[Usually we assume there are no negative weights, in which case Cut(G1,G2) can never be negative, so it follows that L is positive semidefinite.]
Define 1 = [1 1 ... 1]⊤; then L1 = 0, so 1 is an eigenvector of L with eigenvalue 0. [It's easy to check that each row of L sums to zero.]
[If G is a connected graph and all the edge weights are positive, then this is the only zero eigenvalue. But if G is not connected, L has one zero eigenvalue for each connected component of G. It’s easy to prove, but time prevents me.]
graphexample.png
Bisection: exactly n/2 vertices in G1, n/2 in G2. Write 1⊤y = 0.
[So we have reduced graph bisection to this constrained optimization problem.]
Minimum bisection:
Find y that minimizes y⊤Ly
subject to ∀i, yi = 1 or yi = −1   ← binary constraint
and 1⊤y = 0   ← balance constraint
Also NP-hard. We relax the binary constraint. → fractional vertices!
[A very common approach in combinatorial optimization algorithms is to relax some of the constraints so a discrete problem becomes a continuous problem. Intuitively, this means that you can put 1/3 of vertex 7 in graph G1 and the other 2/3 of vertex 7 in graph G2. You can even put −1/2 of vertex 7 in graph G1 and 3/2 of vertex 7 in graph G2. This sounds crazy, but the continuous problem is much easier to solve than the combinatorial problem. After we solve it, we will round the vertex values to +1/−1, and we'll hope that our solution is still close to optimal.]
[But we can't just drop the binary constraint. We still need some constraint to rule out the solution y = 0.]
New constraint: y must lie on hypersphere of radius √n.
circle.pdf
[Draw this by hand.] [Instead of constraining y to lie at a vertex of the hypercube, we constrain y to lie on the hypersphere through those vertices.]
Relaxed problem:
Minimize y⊤Ly subject to y⊤y = n and 1⊤y = 0
= Minimize (y⊤Ly) / (y⊤y) = Rayleigh quotient of L & y   [subject to 1⊤y = 0]
cylinder.pdf
[The isosurfaces of y⊤Ly are elliptical cylinders. The gray cross-section is
the hyperplane 1⊤y = 0. We seek the point that minimizes y⊤Ly, subject to the constraints that it lies on the gray cross-section and that it lies on a sphere centered at the origin.]
[The same isosurfaces restricted to the hyperplane 1⊤y = 0. The solution is
constrained to lie on the outer circle.]
[You should remember this Rayleigh quotient from the lecture on PCA. As I said then, when you see a Rayleigh quotient, you should smell eigenvectors nearby. The y that minimizes this Rayleigh quotient is the eigenvector with the smallest eigenvalue. We already know what that eigenvector is: it’s 1. But that violates our balance constraint. As you should recall from PCA, when you’ve used the most extreme eigenvector and you need an orthogonal one, the next-best optimizer of the Rayleigh quotient is the next eigenvector.]
Let λ2 = second-smallest eigenvalue of L.
Eigenvector v2 is the Fiedler vector.
[It would be wonderful if every component of the Fiedler vector was 1 or −1, but that happens more or less never. So we round v2. The simplest way is to round all positive entries to 1 and all negative entries to −1. But in both theory and practice, it’s better to choose the threshold as follows.]
Spectral partitioning alg:
– Compute Fiedler vector v2 of L
– Round v2 with a sweep cut:
  = Sort components of v2.
  = Try the n − 1 cuts between successive components. Choose min-sparsity cut.
[If we’re clever about updating the sparsity, we can try all these cuts in time linear in the number of edges in G.]
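[A NumPy sketch of this algorithm for unit vertex masses, assuming G is connected so there is only one zero eigenvalue; it returns the vertex set G1 and the sparsity of the best sweep cut. The quadratic-time sparsity recomputation is for clarity, not the linear-time update just mentioned.]

```python
import numpy as np

def spectral_bipartition(W):
    """Sweep-cut spectral partitioning of a graph given as a symmetric weight matrix W."""
    n = len(W)
    L = np.diag(W.sum(axis=1)) - W                 # Laplacian matrix
    _, vecs = np.linalg.eigh(L)                    # eigenvalues in ascending order
    v2 = vecs[:, 1]                                # Fiedler vector
    order = np.argsort(v2)                         # sort components of v2
    best, best_sparsity = None, np.inf
    for split in range(1, n):                      # try the n - 1 sweep cuts
        g1 = order[:split]
        in_g1 = np.zeros(n, dtype=bool)
        in_g1[g1] = True
        cut = W[in_g1][:, ~in_g1].sum()            # total weight of cut edges
        sparsity = cut / (split * (n - split))     # Mass = number of vertices
        if sparsity < best_sparsity:
            best, best_sparsity = g1, sparsity
    return best, best_sparsity
```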
endview.pdf
[Left: example of a graph partitioned by the sweep cut. Right: what the un-rounded Fiedler vector looks like.]
specgraph.pdf, specvector.pdf
[One consequence of relaxing the binary constraint is that the balance constraint no longer forces an exact bisection. But that’s okay; we’re cool with a slightly off-balance constraint if it means we cut fewer edges. Even though our discrete problem was the minimum bisection problem, our relaxed, continuous problem will be an approximation of the sparsest cut problem. This is a bit counterintuitive.]
Vertex Masses
lopsided.pdf
[A graph for which an off-balance cut (left) is sparser than a balanced one (right).]
[Sometimes you want the notion of balance to accord more prominence to some vertices than others. We can assign masses to vertices.]
Let M be diagonal matrix with vertex masses on diagonal.
New balance constraint: 1⊤ My = 0.
[This new balance constraint says that G1 and G2 should each have the same total mass. It turns out that this new balance constraint is easier to satisfy if we also revise the sphere constraint a little bit.]
New ellipsoid constraint: y⊤My = Mass(G) = ∑i Mii.
[Instead of a sphere, now we constrain y to lie on an axis-aligned ellipsoid.]
[Draw this by hand. ] [The constraint ellipsoid passes through the points of the
hypercube.]
Now solution is Fiedler vector of generalized eigensystem Lv = λMv.
[Most algorithms for computing eigenvectors and eigenvalues of symmetric matrices can easily be adapted to compute eigenvectors and eigenvalues of symmetric generalized eigensystems too.]
[For the grad students, here’s the most important theorem in spectral graph partitioning.]
Fact: Sweep cut finds a cut w/sparsity ≤ √(2 λ2 maxi (Lii / Mii)): Cheeger's inequality.
The optimal cut has sparsity ≥ λ2/2.
[So the spectral partitioning algorithm is an approximation algorithm, albeit not one with a constant factor of approximation. Cheeger’s inequality is a very famous result in spectral graph theory, because it’s one of the most important cases where you can relax a combinatorial optimization problem to a continuous opti- mization problem, round the solution, and still have a provably decent solution to the original combinatorial problem.]
ellipse.pdf
Vibration Analogy
vibrate.pdf
[For intuition about spectral partitioning, think of the eigenvectors as vibrational modes in a physical system of springs and masses. Each vertex models a point mass that is constrained to move freely along a vertical rod. Each edge models a vertical spring with rest length zero and stiffness proportional to its weight, pulling two point masses together. The masses are free to oscillate sinusoidally on their rods. The eigenvectors of the generalized eigensystem Lv = λMv are the vibrational modes of this physical system, and their eigenvalues are proportional to their frequencies.]
v1 v2 v3 v4
grids.pdf [Vibrational modes in a path graph and a grid graph.]
[These illustrations show the first four eigenvectors for two simple graphs. On the left, we see that the first eigenvector is the eigenvector of all 1’s, which represents a vertical translation of all the masses in unison. That’s not really a vibration, which is why the eigenvalue is zero. The second eigenvector is the Fiedler vector, which represents the vibrational mode with the lowest frequency. Each component indicates the amplitude with which the corresponding point mass oscillates. At any point in time as the masses vibrate, roughly half the mass is moving up while half is moving down. So it makes sense to cut between the positive components and the negative components. The third eigenvector also gives us a nice bisection of the grid graph, entirely different from the Fiedler vector. Some more sophisticated graph clustering algorithms use multiple eigenvectors.]
[I want to emphasize that spectral partitioning takes a global view of a graph. It looks at the whole gestalt of the graph and finds a good cut. By comparison, the clustering algorithms we saw last lecture were much more local in nature, so they’re easier to fool.]
Greedy Divisive Clustering
Partition G into 2 subgraphs; recursively cluster them.
[The sparsity is a good criterion for graph clustering. Use G’s sparsest cut to divide it into two subgraphs, then recursively cut them. You can stop when you have the right number of clusters, or you could keep going until each subgraph is a single vertex and create a dendrogram.]
Can form a dendrogram, but it may have inversions.
[There’s no reason to expect that the sparsity of a subgraph is smaller than the sparsity of the parent graph, so the dendrogram can have inversions. But the hierarchy is still useful for getting an arbitrary number of clusters on demand.]
The Normalized Cut
Set vertex i’s mass Mii = Lii. [Sum of edge weights adjoining vertex i.]
[That is how we define a normalized cut, which turns out to be a good choice for many different applica- tions.]
Popular for image segmentation.
[Image segmentation is the problem of looking at a photograph and separating it into different objects. To do that, we define a graph on the pixels.]
For pixels with coordinate pi, brightness bi, use graph weights
wij = exp(−∥pi − pj∥²/α − |bi − bj|²/β), or zero if ∥pi − pj∥ is large.
[We choose a distance threshold, typically less than 4 to 10 pixels apart. Pixels that are far from each other aren’t connected. α and β are empirically chosen constants. It often makes sense to choose β proportional to the variance of the brightness values.]
[A segmentation of a photo of a scene from a baseball game (upper left). The other figures show segments of the image extracted by recursive spectral partitioning.]
baseballsegment.pdf (Shi and Malik, “Normalized Cut and Image Segmentation”)
[Eigenvectors 2–9 from the baseball image.] Invented by [our own] Prof. Jitendra Malik and his student Jianbo Shi.
baseballvectors.pdf (Shi and Malik)
23 Multiple Eigenvectors; Random Projection; Applications
Clustering w/Multiple Eigenvectors
[When we use the Fiedler vector for spectral graph clustering, it tells us how to divide a graph into two graphs. If we want more than two clusters, we can use divisive clustering: we repeatedly cut the subgraphs into smaller subgraphs by computing their Fiedler vectors. However, there are several other methods to subdivide a graph into k clusters in one shot that use multiple eigenvectors rather than just the Fiedler vector v2. These methods sometimes give better results. They use k eigenvectors in a natural way to cluster a graph into k subgraphs.]
For k clusters, compute first k eigenvectors v1 = 1, v2, . . . , vk of generalized eigensystem Lv = λMv.
Scale them so that vi⊤Mvi = 1. E.g., v1 = (1/√(∑i Mii)) 1. Now V⊤MV = I. [The eigenvectors are M-orthogonal.]
Let V be the n × k matrix with columns v1, . . . , vk and rows V1, . . . , Vn.
[V's columns are the eigenvectors with the k smallest eigenvalues.]
[Yes, we do include the all-1's vector v1 as one of the columns of V.]
[Draw this by hand.]
Row Vi is spectral vector [my name] for vertex i. [The rows are vectors in a k-dimensional space I’ll call the “spectral space.” When we were using just one eigenvector, it made sense to cluster vertices together if their components were close together. When we use more than one eigenvector, it turns out that it makes sense to cluster vertices together if their spectral vectors point in similar directions.]
Normalize each row Vi to unit length.
[Now you can think of the spectral vectors as points on a unit sphere centered at the origin.]
[Draw this by hand ] [A 2D example showing two clusters on a circle. If the graph has k components, the points in each cluster will have identical spectral vectors that are exactly orthogonal to all the other components’ spectral vectors (left). If we modify the graph by connecting these components with small-weight edges, we get vectors more like those at right—not exactly orthogonal, but still tending toward distinct clusters.]
k-means cluster these vectors.
[Because all the spectral vectors lie on the sphere, k-means clustering will cluster together vectors that are separated by small angles.]
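[A sketch of the whole pipeline under normalized-cut masses Mii = Lii, assuming every vertex has at least one edge; `kmeans` is any k-means routine, e.g., the Lloyd's algorithm sketch earlier. The symmetrization step is a standard substitution for diagonal M, not something specific to these notes.]

```python
import numpy as np

def spectral_cluster(W, k, kmeans):
    """Cluster a weighted graph into k pieces using its first k generalized eigenvectors."""
    L = np.diag(W.sum(axis=1)) - W                  # Laplacian matrix
    M = np.diag(L)                                  # vertex masses Mii = Lii
    # Solve L v = lambda M v: substitute v = M^(-1/2) u to get a symmetric eigenproblem.
    s = 1.0 / np.sqrt(M)
    _, U = np.linalg.eigh(s[:, None] * L * s[None, :])
    V = s[:, None] * U[:, :k]                       # n x k matrix of spectral vectors (rows)
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # normalize each row to unit length
    labels, _ = kmeans(V, k)                        # k-means cluster the spectral vectors
    return labels
```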
eigenvectors.pdf
vectorclusters.png
[Comparison of point sets clustered by k-means— just k-means by itself, that is—vs. a spectral method. To create a graph for the spectral method, we use an exponentially decaying function to assign weights to pairs of points, like
we used for image segmentation but without the brightnesses.]
Invented by [our own] Prof. Michael Jordan, Andrew Ng [when he was still a student at Berkeley], Yair Weiss.
[This wasn’t the first algorithm to use multiple eigenvectors for spectral clustering, but it has become one of the most popular.]
compkmeans.png, compspectral.png
RANDOM PROJECTION
A cheap alternative to PCA as preprocess for clustering, classification, regression. Approximately preserves distances between points!
[We project onto a random subspace instead of the “best” subspace, but take a fraction of the time of PCA. It works best when you project a very high-dimensional space to a medium-dimensional space. Because we roughly preserve the distances, algorithms like k-means clustering and nearest neighbor classifiers will give similar results to what they would give in high dimensions, but they run much faster.]
Pick a small ε, a small δ, and a random subspace S ⊂ Rd of dimension k, where k = 2 ln(1/δ) / (ε²/2 − ε³/3).
For any pt q, let q̂ be the orthogonal projection of q onto S, multiplied by √(d/k).
[The multiplication by √(d/k) helps preserve the distances between points after you project.]
Johnson–Lindenstrauss Lemma (modified):
For any two pts q, w ∈ Rd, (1 − ε) ∥q − w∥² ≤ ∥q̂ − ŵ∥² ≤ (1 + ε) ∥q − w∥² with probability ≥ 1 − 2δ.
Typical values: ε ∈ [0.02, 0.5], δ ∈ [1/n³, 0.05]. [You choose ε and δ according to your needs.]
[With these ranges, the distance between two points after projecting might change by 2% to 50%. In practice, you can experiment with k to find the best speed-accuracy tradeoff. If you want most inter-point distances to be accurate, you should set δ smaller than 1/n², so you need a subspace of dimension Θ(log n). Reducing δ doesn't cost much, but reducing ε costs more. You can bring 1,000,000 sample points down to a 10,000-dimensional space with a 10% error in the distances.]
Data: 20-newsgroups, from 100,000 features to 1,000 (1%)
[What is remarkable about this result is that the dimension d of the input points doesn’t matter!]
100000to1000.pdf
[Comparison of inter-point distances before and after projecting points in 100,000-dimensional space down to 1,000 dimensions.]
MATLAB implementation: a one-liner of the form randn(k, N) * X / sqrt(k), applied to sample points stored as the columns of X.
[Why does this work? A random projection of a vector is like taking a random vector and selecting k com-
ponents. The mean of the squares of those k components approximates the mean for the whole population.]
[How do you get a uniformly distributed random projection direction? You can choose each component from a univariate Gaussian distribution, then normalize the vector to unit length. How do you get a random subspace? You can choose k random vectors, then use Gram–Schmidt orthogonalization to make them mutually orthonormal. Interestingly, Indyk and Motwani show that if you skip the expensive normalization and Gram–Schmidt steps, random projection still works almost as well, because random vectors in a high- dimensional space are nearly equal in length and nearly orthogonal to each other with high probability.]
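[A sketch of the cheap Gaussian version in NumPy, assuming the sample points are the rows of X; the formula for k is the one above.]

```python
import numpy as np

def random_projection(X, eps=0.1, delta=1e-3, rng=None):
    """Project the rows of X into k dimensions, roughly preserving pairwise distances."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    k = int(np.ceil(2 * np.log(1 / delta) / (eps ** 2 / 2 - eps ** 3 / 3)))
    R = rng.normal(size=(d, k)) / np.sqrt(k)   # the 1/sqrt(k) scaling preserves expected squared norms
    return X @ R                               # n x k projected points
```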
THE GEOMETRY OF HIGH-DIMENSIONAL SPACES
Consider shell between spheres of radii r & r − ε.
[Draw this by hand ] [Concentric balls. In high dimensions, almost every point chosen uniformly at random in the outer ball lies outside the inner ball.]
Volume of outer ball ∝ r^d
Volume of inner ball ∝ (r − ε)^d
Ratio of inner ball volume to outer = (r − ε)^d / r^d = (1 − ε/r)^d ≈ exp(−εd/r), which is small for large d.
E.g., if ε/r = 0.1 & d = 100, inner ball has 0.9^100 ≈ 0.0027% of volume.
Random points from uniform distribution in ball: nearly all are in outer shell.
Random points from Gaussian distribution: nearly all are in some shell.
[If the dimension is very high, the majority of the random points generated from an isotropic Gaussian dis- tribution are approximately at the same distance from the center. So they lie in a thin shell. Why? Consider a d-dimensional normal distribution with mean zero. By Pythagoras’ Theorem, the squared distance from a random point p to the mean is]
∥p∥² = p1² + p2² + · · · + pd².
[Each component pi is sampled independently from a univariate normal distribution with mean zero. When you add d independent random numbers, you scale the mean by d but you scale the standard deviation by only √d.]
E[∥p∥²] = d E[p1²],   std(∥p∥²) = √d std(p1²).
concentric.png
[So when d is large, the distance from p to the mean is concentrated in a narrow shell whose radius is proportional to √d with a standard deviation proportional to ⁴√d.]
[The principle here is that when you take the mean of a very large sample, you get a very accurate estimate of the population mean. When you sample one point from a high-dimensional normal distribution, it’s like sampling d different scalars from one-dimensional normal distributions. Notice the similarity to the coordinate-sampling discussion for random projections.]
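[A short empirical check of this concentration effect; the sample sizes are arbitrary.]

```python
import numpy as np

rng = np.random.default_rng(1)
for d in (2, 100, 10_000):
    p = rng.normal(size=(1000, d))       # 1000 points from an isotropic Gaussian
    r = np.linalg.norm(p, axis=1)        # distances from the mean
    print(d, r.mean(), r.std())          # the mean grows like sqrt(d); the spread barely grows
```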
Lessons:
– In high dimensions, sometimes nearest neighbor and 1,000th-nearest neighbor don't differ much.
– k-means clustering and nearest neighbor classifiers are less effective for large d.
[Former CS 189/289A head TA, Marc Khoury, has a nice short essay entitled “Counterintuitive Properties of High Dimensional Space”, which you can read at https://marckhoury.github.io/counterintuitive-properties-of-high-dimensional-space/ ]
APPLICATIONS
Predicting COVID-19 Severity
Jiang et al. (2020): goals are to predict which COVID-19 patients will develop acute respiratory distress syndrome (ARDS) & identify the clinical signs that predict it.
[Note that both prediction and inference are goals here. They want to develop machine learning tools that predict which patients are likely to enter a life-threatening disease state and might need a ventilator. They also want to identify which symptoms are most predictive of that outcome.]
Subjects: 53 hospitalized patients with confirmed COVID-19 admitted to Wenzhou Central Hospital and Cangnan People’s Hospital in Wenzhou, China.
[So all the subjects were tested and confirmed to have COVID-19, and had it bad enough to be admitted to the hospital. There were 33 men and 20 women. They were a surprisingly young bunch, with a median age of 43 years. Out of those patients, only 5 developed ARDS, and all of them were men. So this is far from a conclusive study; there isn’t a lot of data. Fortunately, all 53 survived and were discharged.]
Features were evaluated 2 ways: forward stepwise selection (with 10-fold cross-validation) and chi-squared statistics for each feature.
[Unfortunately, this part of the paper is not well-written. They give us a rank ordering of the most predictive features, but I can’t tell whether the ranking comes from forward selection or chi-squares tests or some combination.]
[The big surprise in this study is that the following features are not good predictors of disease progression.]
NOT highly predictive (surprise!):
– fever
– cough
– ground glass opacities in lung images [computed tomography]
– lymphopenia [reduced lymphocytes in bloodstream]
– dyspnea [difficulty breathing]
jiang.pdf
[These symptoms are the hallmarks of COVID-19, but they did not distinguish mild cases from cases that progressed to severity, in part because most of the patients had the first four symptoms. The best predictors are the following symptoms.]
1. mildly elevated alanine aminotransferase (ALT) [a liver enzyme, measured in the bloodstream]
2. myalgias [body aches]
3. elevated hemoglobin [red blood cells]
4. sex (male)
[I think this list is the best thing to come out of this study. It’s valuable to know that the symptoms that are best for predicting whether your COVID-19 will become severe are very different from the symptoms that predict whether you have COVID-19 at all. The complete list has 11 items, but most of those are weak predictors. Age is on the list, surprisingly only the tenth most predictive feature; but the oldest subject was only 67.]
[The number one predictor, alanine aminotransferase in the bloodstream, is generally considered a sign of liver damage. The authors note that none of the five patients who developed ARDS had any pre-existing liver disease, so the elevated ALT was probably a sign of COVID-19 doing damage beyond the respiratory system.]
[The researchers also trained some classifiers and reported their performance. Unfortunately, this part of the paper is mostly just a lesson on how not to write a research paper. Fewer than 10% of the patients got ARDS—five out of 53—so you would think they could achieve an accuracy rate better than 90%, right? Inexplicably, their best classifiers have an 80% accuracy rate. The paper doesn’t separate false negatives from false positives, which should be particularly important when the class of patients who contracted ARDS is so much smaller than the class that didn’t. For what it’s worth, here are the reported accuracies.]
80% accuracy: 5-nearest neighbors
80%: SVM [probably soft-margin, but they didn't say]
70%: decision tree
70%: random forest
50%: logistic regression
“A decision tree based on the one feature ALT reached a 70% accuracy.”
Predicting Personality from Faces
hu.pdf
Hu et al. (2017).
Big Five (BF) model of personality:
– E: extraversion
– A: agreeableness
– C: conscientiousness
– N: neuroticism
– O: openness
[Researchers have found that these five personality factors are approximately orthogonal to each other. They are highly heritable and highly stable during adulthood.]
Can we predict these traits from 3D faces?
[Studies have shown that people looking at photographs of static faces with neutral expressions can iden- tify the traits better than chance, especially for conscientiousness, extraversion, and agreeableness. This experiment asks whether machine learning can do the same with 3D reconstructions of faces. The subjects were 834 Han Chinese volunteers in Shanghai, China. We don’t know whether any of these results might generalize to people who are not Han Chinese.]
[The faces were scanned in high-resolution 3D and a non-rigid face registration system was used to fit a grid of 32,251 vertices to each face in a manner that maps each vertex to an appropriate landmark on the face. (They call this “anatomical homology.”) So the design matrix X was 834 × 100,053, representing 834 subjects with 32,251 3D features each.]
[Subject personalities were evaluated with a self-questionnaire, namely our own Berkeley Personality Lab’s Big Five Inventory, translated into Chinese. The authors treated men and women separately.]
Uses partial least squares (PLS) to find associations between personality & faces.
[Everything from here to the end is spoken, not written.]
Partial least squares (PLS) is like a supervised version of PCA. It takes in two matrices X and Y with the same number of rows. In our example, X is the face data and Y is the personality data for the 834 subjects. Like PCA, PLS finds a set of vectors in face space that we think of as the most important components. But whereas PCA looks for the directions of maximum variation in X, PLS looks for the directions in X that maximize the correlation with the personality traits in matrix Y.
The researchers found the top 20 or so PLS components and used cross-validation to decide which compo- nents have predictive power for each personality trait. They found that the top two components for extraver- sion in women were predictive, but no components for the other four traits in women were predictive. Men are easier to analyze: they found two or three components were predictive for each of extraversion, agree- ableness, conscientiousness, and neuroticism in men. However, the correlations were statistically significant only for agreeableness and conscientiousness.
[The relationship between male faces, agreeableness, and conscientiousness. The large, colored faces are the mean faces; colors indicate the values in the most predictive PLS component vector.]
More agreeable men correlate with much wider mouths that look a bit smiley even when neutral; stronger, forward jaws; wider noses; and shorter faces, especially shorter in the forehead, compared to less agreeable men. More conscientious men tend to have higher, wider eyebrows; wider, opened eyes; a withdrawn upper lip with more mouth tension; and taller faces with more pronounced brow ridges (the bone protuberance above the eyes). The authors note that men with low A and C scores look both more relaxed and more indifferent.
male.pdf
[The relationship between female faces and extraversion. The large, colored faces are the mean faces; colors indicate the values in the most predictive PLS component vector.]
More extraverted women correlate with rounder faces, especially in profile, with a more protruding nose and lips but a recessed chin, whereas the introverts have more flat, square-shaped faces. To my eyes, the extraverts also have more expressive mouths.
It's interesting that physiognomy, the art of judging character from facial shape, used to be considered a pseudoscience, but it's been making a comeback in recent years with the help of machine learning. One reason it fell into disrepute is that, historically, it was sometimes applied across races in fallacious and insulting ways. But if you want to train classifiers that guess people's personalities with some accuracy, you probably need a different classifier for each race. This is a classifier trained exclusively for one race, Han Chinese, which is probably part of why it works as well as it does. If you tried to train one classifier on many different races, I suspect its performance would be much worse.
Another thing that's notable is that although the authors were able to find statistically significant correlations for some personality traits, the majority of traits defeated them. So while physiognomy is real, it's still pretty weak. It's an open question whether machine learning will ever be able to predict personality substantially better than this. Adding a time dimension and incorporating people's movements and dynamic facial expressions seems like a promising way to improve personality predictions.
Tools like this raise some ethical issues. The one that concerns me the most is that, if tools like this are emerging now, many governments probably already had similar tools ten years ago, and have probably been using them to profile us.
One student asked whether these methods might be used by employers to screen prospective employees. I think that tools like this are inferior to simply giving an interviewee a personality test. Such tests are legal, so long as their questions are not found to violate an employee’s right to privacy and the results are not used to discriminate against legally protected groups. The most troubling part of using physiognomy to screen employees would not be that personality testing is unlawful. (It isn’t, and quite a few companies do it.) It would be that physiognomy isn’t nearly accurate enough. An employer who uses a poorly designed or unvalidated personality test to make personnel decisions might run a higher risk that a court might rule that the test could have a discriminatory effect, violating Title VII of the Civil Rights Act of 1964. Also, they probably won’t make good decisions. But perhaps in the future, better measurements, better statistical procedures, and better algorithms might overcome these problems.
female.pdf
24 Boosting; Nearest Neighbor Classification
ADABOOST (Yoav Freund and Robert Schapire, 1997)
[We’re done with unsupervised learning. This week I’m going back to classifiers.]
AdaBoost (“adaptive boosting”) is an ensemble method for classification (or regression) that – trains multiple learners on weighted sample points [like bagging];
– uses different weights for each learner;
– increases weights of misclassified sample points;
– gives bigger votes to more accurate learners.
Input: n × d design matrix X, vector of labels y ∈ Rn with yi = ±1.
Ideas:
– Train T classifiers G1, . . . , GT. [“T” stands for “trees”]
– Weight for sample point Xi in Gt grows according to how many of G1, . . . , Gt−1 misclassified it.
[Moreover, if Xi is misclassified by very accurate learners, its weight grows even more.]
[And, the weight shrinks every time Xi is correctly classified.]
– Train Gt to try harder to correctly classify sample pts with larger weights.
– Metalearner is a linear combination of learners. For test point z, M(z) = ∑_{t=1}^{T} βt Gt(z).
Each Gt is ±1, but M is continuous. Return sign of M(z).
[In the previous lecture on ensemble methods, I talked briefly about how to assign different weights to sample points. It varies for different learning algorithms. For example, in regression we usually modify the risk function by multiplying each point’s loss function by its weight. In a soft-margin support vector machine, we modify the objective function by multiplying each point’s slack by its weight.]
[Boosting works with most learning algorithms, but it was originally developed for decision trees, and boosted decision trees are still very popular and successful. For decision trees, we use a weighted entropy where instead of computing the proportion of points in each class, we compute the proportion of weight in each class.]
In iteration T , what classifier GT and coefficient βT should we choose? Pick a loss fn L(prediction, label).
Find GT & βT that minimize
Risk = (1/n) ∑_{i=1}^{n} L(M(Xi), yi),   where M(Xi) = ∑_{t=1}^{T} βt Gt(Xi).
AdaBoost metalearner uses the exponential loss function
L(ρ, l) = e^{−ρl} = { e^{−ρ} if l = +1;  e^{ρ} if l = −1 }.
[This loss function is for the metalearner only. The individual learners Gt usually use other loss functions, if they use a loss function at all.]
Important: label l is binary, Gt is binary, but ρ = M(Xi) is continuous!
[The exponential loss function has the advantage that it pushes hard against badly misclassified points. That’s one reason why it’s usually better than the squared error loss function for classification in a met- alearner. It’s similar to why in neural networks we prefer the cross-entropy loss function to the squared error.]
n · Risk = ∑_{i=1}^{n} L(M(Xi), yi) = ∑_{i=1}^{n} e^{−yi M(Xi)}
= ∑_{i=1}^{n} exp(−yi ∑_{t=1}^{T} βt Gt(Xi)) = ∑_{i=1}^{n} ∏_{t=1}^{T} e^{−βt yi Gt(Xi)}   ⇐ yi Gt(Xi) = ±1; −1 → Gt misclassifies Xi
= ∑_{i=1}^{n} wi(T) e^{−βT yi GT(Xi)},   where wi(T) = ∏_{t=1}^{T−1} e^{−βt yi Gt(Xi)}
= e^{−βT} ∑_{yi = GT(Xi)} wi(T) + e^{βT} ∑_{yi ≠ GT(Xi)} wi(T)   [correctly classified and misclassified]
= e^{−βT} ∑_{i=1}^{n} wi(T) + (e^{βT} − e^{−βT}) ∑_{yi ≠ GT(Xi)} wi(T).
What GT minimizes the risk? The learner GT that minimizes the sum of wi(T) over all misclassified pts Xi!
[This is interesting. By manipulating the formula for the risk, we've discovered what weight we should assign to each sample point. If we want to minimize the risk, we should find the classifier that minimizes the total weight of the misclassified points for this weight function wi(T). It's a complicated function, but we can compute it. A useful observation is that each learner's weights are related to the previous learner's weights:]
Recursive definition of weights:
wi(T+1) = wi(T) e^{−βT yi GT(Xi)} = { wi(T) e^{−βT} if yi = GT(Xi);  wi(T) e^{βT} if yi ≠ GT(Xi) }.
[This recursive formulation is a nice benefit of choosing the exponential loss function. Notice that a weight shrinks if the point was classified correctly by learner T , and grows if the point was misclassified.]
[Now, you might wonder if we should just pick a learner that classifies all the training points correctly. But that’s not always possible. If we’re using a linear classifier on data that’s not linearly separable, some points must be classified wrongly. Moreover, it’s NP-hard to find the optimal linear classifier, so in practice GT will be an approximate best learner, not the true minimizer of training error. But that’s okay.]
[You might ask, if we use decision trees, can’t we get 100% training accuracy? Usually we can. But interestingly, boosting is usually used with short, imperfect decision trees instead of tall, pure decision trees, for reasons I’ll explain later.]
[Now, let’s derive the optimal value of βT .]
To choose βT, set d Risk / dβT = 0:
0 = −e^{−βT} ∑_{i=1}^{n} wi(T) + (e^{βT} + e^{−βT}) ∑_{yi ≠ GT(Xi)} wi(T);   [now divide both sides by the first term]
0 = −1 + (e^{2βT} + 1) errT,   where errT = (∑_{yi ≠ GT(Xi)} wi(T)) / (∑_{i=1}^{n} wi(T));   [GT's weighted error rate]
βT = (1/2) ln((1 − errT) / errT).
[So now we have derived the optimal metalearner!]
– If errT = 0, βT = ∞. [So a perfect learner gets an infinite vote.]
– If errT = 1/2, βT = 0. [So a learner with 50% weighted training accuracy gets no vote at all.]
[More accurate learners get bigger votes in the metalearner. Interestingly, a learner with worse than 50% training accuracy gets a negative vote. A learner with 40% accuracy is just as useful as a learner with 60% accuracy; the metalearner just reverses the sign of its votes.]
[Now we can state the AdaBoost algorithm.]
AdaBoost alg:
1. Initialize weights wi ← 1/n, ∀i ∈ [1, n].
2. for t ← 1 to T
   a. Train Gt with weights wi
   b. Compute weighted error rate err ← (∑_{misclassified} wi) / (∑_{all} wi); coefficient βt ← (1/2) ln((1 − err)/err).
   c. Reweight pts: wi ← wi · e^{βt} if Gt misclassifies Xi, wi ← wi · e^{−βt} otherwise
      [i.e., multiply by √((1 − err)/err) or by √(err/(1 − err))].
3. return metalearner h(z) = sign(∑_{t=1}^{T} βt Gt(z)).
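[A self-contained sketch of AdaBoost with depth-one decision trees (stumps), assuming labels yi = ±1; the stump trainer is a plain exhaustive search, not an optimized one.]

```python
import numpy as np

def train_stump(X, y, w):
    """Weighted decision stump: pick the (feature, threshold, sign) with least weighted error."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] > t, s, -s)
                err = w[pred != y].sum() / w.sum()
                if err < best_err:
                    best_err, best = err, (j, t, s)
    j, t, s = best
    return (lambda Z: np.where(Z[:, j] > t, s, -s)), best_err

def adaboost(X, y, T=50):
    """Returns the continuous metalearner M; classify a point z by the sign of M(z)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                          # 1. initialize weights uniformly
    learners = []
    for _ in range(T):                               # 2. train T learners
        G, err = train_stump(X, y, w)                #    a. train G_t with weights w
        err = np.clip(err, 1e-12, 1 - 1e-12)
        beta = 0.5 * np.log((1 - err) / err)         #    b. coefficient beta_t
        w = w * np.exp(-beta * y * G(X))             #    c. reweight the sample points
        learners.append((beta, G))
    return lambda Z: sum(b * G(Z) for b, G in learners)   # 3. metalearner M
```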
boost0.png, boost2.png, boost4.png
[At left, all the training points have equal weight. Af- ter choosing a first linear classifier, we increase the weights of the misclassified points and decrease the weights of the correctly classified points (center). We train a second classifier with these weighted points, then again adjust the weights of the points according to whether
they are misclassified by the second classifier.]
Why boost decision trees? [As opposed to other learning algorithms?] Why short trees?
– Fast. [We’re training many learners, and running many learners at classification time too. Short
decision trees that only look at a few features are very fast at both training and testing.]
– No hyperparameter search needed. [Unlike SVMs, neural nets, etc.] [UC Berkeley’s Leo Breiman
called AdaBoost with decision trees “the best off-the-shelf classifier in the world.”]
– Easy to make a tree beat 55% training accuracy [or other threshold] consistently.
– Easy bias-variance control. Boosting can overfit. AdaBoost trees are usually short, to reduce overfit-
ting.
[As you train more learners, the bias decreases. The variance is more complicated: it often decreases at first, because successive trees focus on different features, but often it later increases. Sometimes boosting overfits after many iterations, and sometimes it doesn’t; it’s hard to predict when it will and when it won’t.]
– AdaBoost + short trees is a form of subset selection.
[Features that don’t improve the metalearner’s predictive power enough aren’t used at all. This helps reduce overfitting and running time, especially if there are a lot of irrelevant features.]
– Linear decision boundaries don’t boost well.
[Boosting linear classifiers gives you an approximately linear classifier, so SVMs aren’t a great choice. Methods with nonlinear decision boundaries benefit more from boosting, because they allow boosting to reduce the bias faster. Sometimes you’ll see examples where people do AdaBoost with depth-one decision trees with just one decision each. But that’s not ideal, because depth-one decision trees are linear. Even depth-two decision trees boost substantially better.]
More about AdaBoost:
– Posterior prob. can be approximated: P(Y = 1|x) ≈ 1/(1 + e^{−2M(x)}).
– Exponential loss is vulnerable to outliers; for corrupted data, use other loss.
[Better loss functions have been derived for dealing with outliers. Unfortunately, they have more
complicated weight computations.]
– If every learner beats training accuracy μ for μ > 50%, metalearner training accuracy will eventually
be 100%. [You will prove this in Homework 7.]
– [The AdaBoost paper and its authors, Freund and Schapire, won the 2003 Gödel Prize, a prize for
outstanding papers in theoretical computer science.]
trainboost.pdf, testboost.pdf (ESL, Figures 10.2, 10.3)
[Training and testing errors for AdaBoost with stumps, depth-one decision trees that make only one decision each. At left, observe that the training error (misclassification rate) eventually drops to zero, and even after that the average exponential loss (which is continuous, not binary) continues to decay exponentially. At right, the test error drops to 5.8% after 400 iterations, even though each learner has an error rate of about 46%. AdaBoost with more than 25 stumps outperforms a single 244-node decision tree. In this example no overfitting is observed, but there are other datasets for which overfitting is a problem.]
NEAREST NEIGHBOR CLASSIFICATION
[I saved the simplest classifier for the end of the semester.]
Idea: Given query point q, find the k sample pts nearest q.
Distance metric of your choice.
Regression: Return average label of the k pts.
Classification: Return class with the most votes from the k pts OR
return histogram of class probabilities.
[The histogram of class probabilities tries to estimate the posterior probabilities of the classes. Obviously, the histogram has limited precision. If k = 3, then the only probabilities you’ll ever return are 0, 1/3, 2/3, or 1. You can improve the precision by making k larger, but you might underfit. The histogram works best when you have a huge amount of data.]
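[A tiny Python sketch of these three return options, assuming we already have the class labels of the k nearest points; finding those points efficiently is the subject of the next lecture.]

import numpy as np
from collections import Counter

def knn_regress(neighbor_labels):
    # Regression: return the average label of the k nearest points.
    return np.mean(neighbor_labels)

def knn_classify(neighbor_classes):
    # Classification: return the class with the most votes.
    return Counter(neighbor_classes).most_common(1)[0][0]

def knn_posterior(neighbor_classes):
    # Histogram of class probabilities: estimates P(Y = c | x) for each class c,
    # with precision limited to multiples of 1/k.
    k = len(neighbor_classes)
    return {c: count / k for c, count in Counter(neighbor_classes).items()}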
allnn.pdf (ISL, Figures 2.15, 2.16) [Examples of 1-NN, 10-NN, and 100-NN (panels labeled KNN: K=1, KNN: K=10, KNN: K=100). A larger k smooths out the boundary. In this example, the 1-NN classifier is badly overfitting the data, and the 100-NN classifier is badly underfitting. The 10-NN classifier does well: it’s reasonably close to the Bayes decision boundary (purple). Generally, the ideal k depends on how dense your data is. As your data gets denser, the best k increases.]
[There are theorems showing that if you have a lot of data, nearest neighbors can work quite well.]
Theorem (Cover & Hart, 1967):
As n → ∞, the 1-NN error rate is < 2B − B², where B = Bayes risk.
if only 2 classes, ≤ 2B − 2B²
[There are a few technical requirements of this theorem. The most important is that the training points and the test points all have to be drawn independently from the same probability distribution. The theorem ap- plies to any separable metric space, so it’s not just for the Euclidean metric.]
Theorem (Fix & Hodges, 1951):
As n → ∞, k → ∞, k/n → 0, k-NN error rate converges to B. [Which means Bayes optimal.]
25 Nearest Neighbor Algorithms: Voronoi Diagrams and k-d Trees
NEAREST NEIGHBOR ALGORITHMS
Exhaustive k-NN Alg.
Given query point q:
– Scan through all n sample pts, computing (squared) distances to q.
– Maintain a max-heap with the k shortest distances seen so far.
[Whenever you encounter a sample point closer to q than the point at the top of the heap, you remove the heap-top point and insert the better point. Obviously you don’t need a heap if k = 1 or even 5, but if k = 101 a heap will substantially speed up keeping track of the kth-shortest distance.]
Time to train classifier: 0 [This is the only O(0)-time algorithm we’ll learn this semester.]
Query time: O(nd + n log k)
expected O(nd + k log n log k) if random pt order
[It’s a cute theoretical observation that you can slightly improve the expected running time by randomizing the point order so that only expected O(k log n) heap operations occur. But in practice I can’t recommend it; you’ll probably lose more from cache misses than you’ll gain from fewer heap operations.]
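[A Python sketch of the exhaustive algorithm. Python’s heapq is a min-heap, so we store negated squared distances to simulate the max-heap holding the k shortest distances seen so far.]

import heapq
import numpy as np

def exhaustive_knn(X, q, k):
    # Scan all n sample points, keeping the k nearest to q in a max-heap.
    # Returns (distance, index) pairs sorted from nearest to farthest.
    heap = []                                    # holds (-squared distance, index)
    for i, x in enumerate(X):
        d2 = np.sum((x - q) ** 2)                # squared distance to q
        if len(heap) < k:
            heapq.heappush(heap, (-d2, i))
        elif d2 < -heap[0][0]:                   # closer than the current kth-nearest
            heapq.heapreplace(heap, (-d2, i))    # evict the heap-top point, insert the better one
    return sorted((np.sqrt(-nd2), i) for nd2, i in heap)

# Example: exhaustive_knn(np.random.rand(1000, 5), np.random.rand(5), k=10)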
Can we preprocess training pts to obtain sublinear query time?
2–5 dimensions: Voronoi diagrams
Medium dim (up to ∼ 30): k-d trees
Large dim: exhaustive k-NN, but can use PCA or random projection
locality sensitive hashing [still researchy, not widely adopted]
Voronoi Diagrams
Let X be a point set. The Voronoi cell of w ∈ X is Vor w = {p ∈ R^d : ∥p − w∥ ≤ ∥p − v∥ ∀v ∈ X}
[A Voronoi cell is always a convex polyhedron or polytope.] The Voronoi diagram of X is the set of X’s Voronoi cells.
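[If SciPy is installed, you can compute a Voronoi diagram of a small 2D point set directly. This is only an illustration of the structure; the point-location machinery discussed below is separate.]

import numpy as np
from scipy.spatial import Voronoi

rng = np.random.default_rng(0)
X = rng.random((20, 2))                    # 20 random points in the unit square
vor = Voronoi(X)

# vor.vertices lists the Voronoi vertices; vor.regions lists each cell as a
# sequence of vertex indices (-1 marks an unbounded cell); vor.point_region
# maps each input point to its cell.
print(len(vor.vertices), "Voronoi vertices")
print(vor.regions[vor.point_region[0]])    # the Voronoi cell of X[0]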
voro.pdf, vormcdonalds.jpg, voronoiGregorEichinger.jpg, saltflat-1.jpg
[Voronoi diagrams sometimes arise in nature (salt flats, giraffe, crystallography).]
[Believe it or not, the first published Voronoi diagram dates back to 1644, in the book “Principia Philosophiae” by the famous mathematician and philosopher René Descartes. He claimed that the solar system consists of vortices. In each region, matter is revolving around one of the fixed stars (vortex.pdf). His physics was wrong, but his idea of dividing space into polyhedral regions has survived.]
Size (e.g., # of vertices) ∈ O(n^⌈d/2⌉)
[This upper bound is tight when d is a small constant. As d grows, the tightest asymptotic upper bound is somewhat smaller than this, but the complexity still grows exponentially with d.]
. . . but often in practice it is O(n).
[Here I’m leaving out a constant that may grow exponentially with d.]
giraffe-1.jpg, perovskite.jpg, vortex.pdf
Point location: Given query point q ∈ Rd, find the point w ∈ X for which q ∈ Vor w.
[We need a second data structure that can perform this search on a Voronoi diagram efficiently.] 2D: O(n log n) time to compute V.d. and a trapezoidal map for pt location
O(log n) query time [because of the trapezoidal map]
[That’s a pretty great running time compared to the linear query time of exhaustive search.]
dD: Use binary space partition tree (BSP tree) for pt location
[Unfortunately, it’s difficult to characterize the running time of this strategy, although it is likely to be reasonably fast in 3–5 dimensions.]
1-NN only!
[A standard Voronoi diagram supports only 1-nearest neighbor queries. If you want the k nearest neighbors, there is something called an order-k Voronoi diagram that has a cell for each possible k nearest neighbors. But nobody uses those, for two reasons. First, the size of an order-k Voronoi diagram is O(k2n) in 2D, and worse in higher dimensions. Second, there’s no software available to compute one.]
[There are also Voronoi diagrams for other distance metrics, like the L1 and L∞ norms.]
[Voronoi diagrams are good for 1-nearest neighbor in 2 or 3 dimensions, maybe 4 or 5, but k-d trees are
much simpler and probably faster in 6 or more dimensions.]
k-d Trees
“Decision trees” for NN search. Differences: [compared to decision trees]
– Choose splitting feature w/greatest width: feature i in max_{i,j,k} (X_{ji} − X_{ki}).
[With nearest neighbor search, we don’t care about the entropy. Instead, what we want is that if we draw a sphere around the query point, it won’t intersect very many boxes of the decision tree. So it helps if the boxes are nearly cubical, rather than long and thin.]
Cheap alternative: rotate through the features. [We split on the first feature at depth 1, the second feature at depth 2, and so on. This builds the tree faster, by a factor of O(d).]
Median guarantees ⌊log2 n⌋ tree depth; O(nd log n) tree-building time.
[. . . or just O(n log n) time if you rotate through the features. An alternative to the median is splitting at the box center, which improves the aspect ratios of the boxes, but it could unbalance your tree. A compromise strategy is to alternate between medians at odd depths and centers at even depths, which also guarantees an O(log n) depth.]
– Each internal node stores a sample point. [. . . that lies in the node’s box. Usually the splitting point.] [Some k-d tree implementations have points only at the leaves, but it’s better to have points in internal nodes too, so when we search the tree, we often stop searching earlier.]
– Choose splitting value: median point for feature i, or (X_{ji} + X_{ki})/2.
[Draw this by hand. kdtreestructure.pdf] [A k-d tree on 11 sample points and the rectangular subdivision of the plane it represents: the root represents all of R², one child represents the right halfplane, one grandchild represents the lower right quarter plane, and so on.]
Goal: given query pt q, find a sample pt w such that ∥q − w∥ ≤ (1 + ε) ∥q − s∥, where s is the closest sample pt.
ε = 0 ⇒ exact NN;  ε > 0 ⇒ approximate NN.
Query alg. maintains:
– Nearest neighbor found so far (or k nearest).
– Binary min-heap of unexplored subtrees, keyed by distance from q.
[Draw this by hand.] [A query in progress: the query point q and the nearest sample point found so far.]
[Each subtree represents an axis-aligned box. The query tries to avoid searching most of the boxes/subtrees by searching the boxes close to q first. We measure the distance from q to a box and use it as a key for the subtree in the heap. The search stops when the distance from q to the kth-nearest neighbor found so far ≤ the distance from q to the nearest unexplored box (times 1 + ε). For example, in the figure above, the query never visits the box at far lower right, because it doesn’t intersect the circle.]
Alg. for 1-NN query:
   Q ← heap containing root node with key zero
   r ← ∞
   while Q not empty and (1 + ε) · minkey(Q) < r
       B ← removemin(Q)
       w ← B’s sample point
       r ← min{r, dist(q, w)}
       B′, B′′ ← child boxes of B
       if (1 + ε) · dist(q, B′) < r then insert(Q, B′, dist(q, B′))     [The key for B′ is dist(q, B′)]
       if (1 + ε) · dist(q, B′′) < r then insert(Q, B′′, dist(q, B′′))
   return point that determined r
[For speed, store square of r instead.]
kdtreequery.pdf
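[A compact Python sketch of the construction rules and the query algorithm above. It is illustrative only: each node stores its box explicitly so dist(q, B) is easy to compute, and no attention is paid to speed.]

import heapq
import numpy as np

class Node:
    def __init__(self, point, feature, left, right, lo, hi):
        self.point = point                  # sample point stored at this node
        self.feature = feature              # splitting feature
        self.left, self.right = left, right
        self.lo, self.hi = lo, hi           # opposite corners of this node's box

def build(X, lo, hi):
    # Split on the feature of greatest width, at the median point.
    if len(X) == 0:
        return None
    i = int(np.argmax(X.max(axis=0) - X.min(axis=0)))
    order = np.argsort(X[:, i])
    m = len(X) // 2
    pivot = X[order[m]]                     # median point for feature i
    hi_left, lo_right = hi.copy(), lo.copy()
    hi_left[i] = pivot[i]                   # left box: feature i <= pivot[i]
    lo_right[i] = pivot[i]                  # right box: feature i >= pivot[i]
    return Node(pivot, i,
                build(X[order[:m]], lo.copy(), hi_left),
                build(X[order[m + 1:]], lo_right, hi.copy()),
                lo, hi)

def dist_to_box(q, node):
    # Distance from q to the axis-aligned box represented by node.
    return np.linalg.norm(q - np.clip(q, node.lo, node.hi))

def query_1nn(root, q, eps=0.0):
    # eps = 0 gives the exact nearest neighbor; eps > 0 an approximate one.
    heap = [(0.0, 0, root)]                 # min-heap keyed by dist(q, box);
    count = 1                               # the integer just breaks ties
    r, best = np.inf, None
    while heap and (1 + eps) * heap[0][0] < r:
        _, _, B = heapq.heappop(heap)
        d = np.linalg.norm(q - B.point)
        if d < r:
            r, best = d, B.point            # nearest sample point found so far
        for child in (B.left, B.right):
            if child is not None and (1 + eps) * dist_to_box(q, child) < r:
                heapq.heappush(heap, (dist_to_box(q, child), count, child))
                count += 1
    return best, r

# Example: X = np.random.rand(1000, 3)
#          root = build(X, X.min(axis=0), X.max(axis=0))
#          query_1nn(root, np.random.rand(3), eps=0.1)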
For k-NN, replace “r” with a max-heap holding the k nearest neighbors [. . . just like in the exhaustive search algorithm.]
Works with any Lp norm for p ∈ [1, ∞]. [k-d trees are not limited to the Euclidean (L2) norm.]
Why ε-approximate NN?
[Draw this by hand.] [A worst-case exact NN query for a query point q.] kdtreeproblem.pdf
[In the worst case, we may have to visit every node in the k-d tree to find the exact nearest neighbor. In that case, the k-d tree is slower than simple exhaustive search. This is an example where an approximate nearest neighbor search can be much faster. In practice, settling for an approximate nearest neighbor sometimes improves the speed by a factor of 10 or even 100, because you don’t need to look at most of the tree to do a query. This is especially true in high dimensions—remember that in high-dimensional space, the nearest point often isn’t much closer than a lot of other points.]
Software: ANN (U. Maryland), FLANN (U. British Columbia), GeRaF (U. Athens) [random forests!]
Example: im2gps
[I want to emphasize the fact that exhaustive nearest neighbor search really is one of the first classifiers you should try in practice, even if it seems too simple. So here’s an example of a modern research paper that uses 1-NN and 120-NN search to solve a problem.]
Paper by James Hays and [our own] Prof. Alexei Efros.
[Goal: given a query photograph, determine where on the planet the photo was taken. Called geolocalization. They evaluated both 1-NN and 120-NN. What they did not do, however, is treat each photograph as one long vector. That’s okay for tiny digits, but too expensive for millions of travel photographs. Instead, they reduced each photo to a small descriptor made up of a variety of features that extract the essence of each photo.] [Show slides (im2gps.pdf). Sorry, images not included here. http://graphics.cs.cmu.edu/projects/im2gps/]
[Bottom line: With 120-NN, their most sophisticated implementation came within 64 km of the correct location about 50% of the time.]
RELATED CLASSES [if you like machine learning, consider taking these courses]
CS 182 (spring): Deep Neural Networks
CS C281A (fall): Statistical Learning Theory [C281A is the most direct continuation of CS 189/289A.]
EECS 127 (both), EE 227BT, EE C227C: Numerical Optimization [a core part of ML]
[It’s hard to overemphasize the importance of numerical optimization to machine learning, as well as other CS fields like graphics, theory, and scientific computing.]
EECS 126 (both): Random Processes [Markov chains, expectation maximization, PageRank]
EE C106A/B (fall/spring): Intro to Robotics [dynamics, control, sensing]
Math 110: Linear Algebra [but the real gold is in Math 221]
Math 221: Matrix Computations [how to solve linear systems, compute SVDs, eigenvectors, etc.]
CS C267 (spring): Scientific Computing [parallelization, practical matrix algebra, some graph partitioning]
CS C280 (spring): Computer Vision
CS 285 (fall): Deep Reinforcement Learning (Levine)
CS 287 (fall): Advanced Robotics (Abbeel, Canny)
CS 287H (spring): Algorithmic Human-Robot Interaction (Dragan)
CS 288 (spring): Natural Language Processing (DeNero, Klein)
CS 294-82 (fall): Machine Learning on Multimedia Data (Friedland)
CS 294-150 (spring): ML & Biology (Listgarten)
CS 294-162 (fall): ML Systems (Gonzalez)
CS 294-166 (spring): Beneficial AI (Russell)
CS 294-173 (fall): Learning for 3D Vision (Kanazawa)
VS 265 (fall): Neural Computation
A Bonus Lecture: Learning Theory
LEARNING THEORY: WHAT IS GENERALIZATION?
[One thing humans do well is generalize. When you were a young child, you only had to see a few examples of cows before you learned to recognize cows, including cows you had never seen before. You didn’t have to see every cow. You didn’t even have to see log n of the cows.]
[Learning theory tries to explain how machine learning algorithms generalize, so they can classify data they’ve never seen before. It also tries to derive mathematically how much training data we need to general- ize well. Learning theory starts with the observation that if we want to generalize, we must constrain what hypotheses we allow our learner to consider.]
A range space (aka set system) is a pair (P, H), where
P is set of all possible test/training points (can be infinite)
H is hypothesis class, a set of hypotheses (aka ranges, aka classifiers):
each hypothesis is a subset h ⊆ P that specifies which points h predicts are in class C. [So each hypothesis h is a 2-class classifier, and H is a set of sets of points.]
Examples:
1. Power set classifier: P is a set of k numbers; H is the power set of P, containing all 2k subsets of P.
e.g., P = {1,2},H = {∅,{1},{2},{1,2}}
2. Linear classifier: P = Rd; H is the set of all halfspaces; each halfspace has the form {x : w · x ≥ −α}.
[In this example, both P and H are infinite. In particular, H contains every possible halfspace—that is, every possible linear classifier in d dimensions.]
[The power set classifier sounds very powerful, because it can learn every possible hypothesis. But the reality is that it can’t generalize at all. Imagine we have three training points and three test points in a row.]
? C C ? N ?
[The power set classifier can classify these three test points any way you like. Unfortunately, that means it has learned nothing about the test points from the training points. By contrast, the linear classifier can learn only two hypotheses that fit this training data. The leftmost test point must be classified class C, and the rightmost test point must be classified class Not-C. Only the test point in the middle can swing either way. So the linear classifier has a big advantage: it can generalize from a few training points. That’s also a big disadvantage if the data isn’t close to linearly separable, but that’s another story.]
[Now we will investigate how well the training error predicts the test error, and how that differs for these two classifiers.]
Suppose all training pts & test pts are drawn independently from same prob. distribution D defined on domain P. [D also determines each point’s label. Classes C and Not-C may have overlapping distributions.]
Let h ∈ H be a hypothesis [a classifier]. h predicts a pt x is in class C if x ∈ h.
The risk aka generalization error R(h) of h is the probability that h misclassifies a random pt x drawn from D—i.e., the prob. that x ∈ C but x ∉ h, or vice versa.
[Risk is almost the same as the test error. To be precise, the risk is the average test error for test points drawn randomly from D. For a particular test set, sometimes the test error is higher, sometimes lower, but on average it is R(h). If you had an infinite amount of test data, the risk and the test error would be the same.]
Let X ⊆ P be a set of n training pts drawn from D
The empirical risk aka training error R̂(h) is % of X misclassified by h.
[This matches the definition of empirical risk I gave you in Lecture 12, if you use the 0-1 loss function.]
h misclassifies each training pt w/prob. R(h), so total misclassified has a binomial distribution. As n → ∞, R̂(h) better approximates R(h).
binom20.pdf, binom500.pdf
[Consider a hypothesis whose risk of misclassification is 25%. Plotted are distributions of the number of misclassified training points for 20 points and 500 points, respectively. For 20 points, the training error is not a reliable estimate of the risk: the hypothesis might get “lucky” with misleadingly low training error.]
[If we had infinite training data, this distribution would become infinitely narrow and the training error would always be equal to the risk. But we can’t have infinite training data. So, how well does the training error approximate the risk?]
Hoeffding’s inequality tells us prob. of bad estimate: Pr(|R̂(h) − R(h)| > ε) ≤ 2e^{−2ε²n}.
[Hoeffding’s inequality is a standard result about how likely it is that a number drawn from a binomial distribution will be far from its mean. If n is big enough, then it’s very unlikely.]
hoeffding.pdf
[Hoeffding’s bound (the probability of a bad estimate as a function of the number of training points) for the unambitious ε = 0.1. It takes at least 200 training points to have high confidence of attaining that error bound.]
[One reason this matters is because we will try to choose the best hypothesis. If the training error is a bad estimate of the test error, we might choose a hypothesis we think is good but really isn’t. So we are happy to see that the likelihood of that decays exponentially in the amount of training data.]
Idea for learning alg: choose ĥ ∈ H that minimizes R̂(ĥ)! Empirical risk minimization.
[None of the classification algorithms we’ve studied actually do this, but only because it’s computationally infeasible to pick the best hypothesis. Support vector machines can find a linear classifier with zero training error when the training data is linearly separable. But when it isn’t, SVMs try to find a linear classifier with low training error, but they don’t generally find the one with minimum training error. That’s NP-hard.] [Nevertheless, for the sake of understanding learning theory, we’re going to pretend that we have the com- putational power to try every hypothesis and pick the one with the lowest training error.]
Problem: if too many hypotheses, some h with high R(h) will get lucky and have very low R̂(h)!
[This brings us to a central idea of learning theory. You might think that the ideal learning algorithm would have the largest class of hypotheses, so it could find the perfect one to fit the data. But the reality is that you can have so many hypotheses that some of them just get lucky and score far lower training error than their actual risk. That’s another way to understand what “overfitting” is.]
[More precisely, the problem isn’t too many hypotheses. Usually we have infinitely many hypotheses, and that’s okay. The problem is too many dichotomies.]
Dichotomies
A dichotomy of X is X ∩ h, where h ∈ H.
[A dichotomy picks out the training points that h predicts are in class C. Think of each dichotomy as a function assigning each training point to class C or class Not-C.]
[Draw this by hand.] [Three examples of dichotomies (C C N, C N N, C C C) for three points in a hypothesis class of linear classifiers, and one example (C N C, at right) that is not a dichotomy.]
[For n training points, there could be up to 2^n dichotomies. The more dichotomies there are, the more likely it is that one of them will get lucky and have misleadingly low empirical risk.]
Extreme case: if H allows all 2^n possible dichotomies, R̂(ĥ) = 0 even if every h ∈ H has high risk.
[If our hypothesis class permits all 2^n possible assignments of the n training points to classes, then one of them will have zero training error. But that’s true even if all of the hypotheses are terrible and have a large risk. Because the hypothesis class imposes no structure, we overfit the training points.]
Given Π dichotomies, Pr(at least one dichotomy has |R̂ − R| > ε) ≤ δ, where δ = 2Π e^{−2ε²n}.
[Let’s fix a value of δ and solve for ε.] Hence with prob. ≥ 1 − δ, for every h ∈ H,
|R̂(h) − R(h)| ≤ ε = √((1/(2n)) ln(2Π/δ)).
[This tells us that the smaller we make Π, the number of possible dichotomies, and the larger we make n, the number of training points, the more accurately the training error will approximate how well the classifier performs on test data.]
smaller Π or larger n ⇒ training error probably closer to true risk (& test error).
dichotomies.pdf
[Smaller Π means we’re less likely to overfit. We have less variance, but more bias. This doesn’t necessarily mean the risk will be small. If our hypothesis class H doesn’t fit the data well, both the training error and the test error will be large. In an ideal world, we want a hypothesis class that fits the data well, yet doesn’t have many hypotheses.]
Let h∗ ∈ H minimize R(h∗); “best” classifier.
[Remember we picked the classifier ĥ that minimizes the empirical risk. We really want the classifier h∗ that minimizes the actual risk, but we can’t know what h∗ is. But if Π is small and n is large, the hypothesis ĥ we have chosen is probably nearly as good as h∗.]
With prob. ≥ 1 − δ, our chosen ĥ has nearly optimal risk:
R(ĥ) ≤ R̂(ĥ) + ε ≤ R̂(h∗) + ε ≤ R(h∗) + 2ε,   where ε = √((1/(2n)) ln(2Π/δ)).
[This is excellent news! It means that with enough training data and a limit on the number of dichotomies, empirical risk minimization usually chooses a classifier close to the best one in the hypothesis class.]
Choose a δ and an ε.
The sample complexity is the # of training pts needed to achieve this ε with prob. ≥ 1 − δ:
n ≥ (1/(2ε²)) ln(2Π/δ).
[If Π is small, we won’t need too many samples to choose a good classifier. Unfortunately, if Π = 2n we lose, because this inequality says that n has to be bigger than n. So the power set classifier can’t learn much or generalize at all. We need to severely reduce Π, the number of possible dichotomies. One way to do that is to use a linear classifier.]
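[These bounds are easy to evaluate numerically; the values of ε, δ, and Π below are placeholders chosen only for illustration.]

import math

def hoeffding_bound(eps, n):
    # Pr(|training error - risk| > eps) for a single hypothesis.
    return 2 * math.exp(-2 * eps**2 * n)

def sample_complexity(eps, delta, Pi):
    # Training points needed so that, with prob. >= 1 - delta, all Pi
    # dichotomies have training error within eps of their risk.
    return math.ceil(math.log(2 * Pi / delta) / (2 * eps**2))

print(hoeffding_bound(0.1, 200))            # about 0.037 for one hypothesis
print(sample_complexity(0.1, 0.1, 1000))    # 496 points for 1,000 dichotomies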
The Shatter Function & Linear Classifiers
[How many ways can you divide n points into two classes with a hyperplane?]
# of dichotomies: Π_H(X) = |{X ∩ h : h ∈ H}| ∈ [1, 2^n], where n = |X|
shatter function: Π_H(n) = max_{|X|=n, X⊆P} Π_H(X)   [The most dichotomies of any point set of size n]
Example: Linear classifiers in plane. H = set of all halfplanes. Π_H(3) = 8:
[Draw this by hand. shatter.pdf] [Linear classifiers can induce all eight dichotomies of these three points. The other four dichotomies are the complements of the four shown.]
Π_H(4) = 14:
[Instead of showing you all 14 dichotomies, let me show you dichotomies that halfplanes cannot learn, which illustrate why no four points have 16 dichotomies.]
[Draw this by hand. unshatter.pdf] [Examples of dichotomies of four points in the plane that no linear classifier can induce.]
[This isn’t a proof that 14 is the maximum, because we have to show that 16 is not possible for any four points in the plane. The standard proof uses a result called Radon’s Theorem.]
Fact: for all range spaces, either Π_H(n) is polynomial in n, or Π_H(n) = 2^n ∀n ≥ 0.
[This is a surprising fact with deep implications. Imagine that you have m points, some of them training points and some of them test points. Either a range space permits every possible dichotomy of the points, and the training points don’t help you classify the test points at all; or the range space permits only a polynomial subset of the 2^m possible dichotomies, so once you have labeled the training points, you have usually cut down the number of ways you can classify the test points dramatically. No shatter function ever occupies the no-man’s-land between polynomial and 2^m.]
[For linear classifiers, we know exactly how many dichotomies there can be.]
Cover’s Theorem [1965]: linear classifiers in R^d allow up to Π_H(n) = 2 Σ_{i=0}^{d} (n−1 choose i) dichotomies of n pts.
For n ≤ d + 1, Π_H(n) = 2^n.
For n ≥ d + 1, Π_H(n) ≤ 2 (e(n−1)/d)^d, [Observe that this is polynomial in n! With exponent d.]
and the sample complexity needed to achieve R(ĥ) ≤ R̂(ĥ) + ε ≤ R(h∗) + 2ε with prob. ≥ 1 − δ satisfies
n ≥ (1/(2ε²)) (d ln((n−1)/d) + d + ln(4/δ)).   [Observe that the logarithm turned the exponent d into a factor!]
Corollary: linear classifiers need only n ∈ O(d) training pts
for training error to accurately predict risk/test error.
[In a d-dimensional feature space, we need more than d training points to train an accurate linear classifier. But it’s reassuring to know that the number we need is linear in d. By contrast, if we have a classifier that permits all 2n possible dichotomies however large n is, then no amount of training data will guarantee that the training error of the hypothesis we choose approximates the true risk.]
[The constant hidden in that big-O notation can be quite large. For example, if you choose ε = 0.1 and δ = 0.1, then setting n = 550d will always suffice. (For very large d, n = 342d will do.) If you want a lot of confidence that you’ve chosen one of the best hypotheses, you have to pay for it with a large sample size.]
[This sample complexity applies even if you add polynomial features or other features, but you have to count the extra features in d. So the number of training points you need increases with the number of polynomial terms.]
VC Dimension
The Vapnik–Chervonenkis dimension of (P, H) is
VC(H) = max{n : Π_H(n) = 2^n}. ⇐ Can be ∞.
Say that H shatters a set X of n pts if Π_H(X) = 2^n.
VC(H) is size of largest X that H can shatter.
[This means that X is a point set for which all 2^n dichotomies are possible.]
[I told you earlier that if the shatter function isn’t 2n for all n, then it’s a polynomial in n. The VC dimension is motivated by an observation that sometimes makes it easy to bound that polynomial.]
Theorem: Π_H(n) ≤ Σ_{i=0}^{VC(H)} (n choose i).
Hence for n ≥ VC(H), Π_H(n) ≤ (en/VC(H))^{VC(H)}.
[So the VC dimension is an upper bound on the exponent of the polynomial. This theorem is useful because often we can find an easy upper bound on the VC dimension. You just need to show that for some number n, no set of n points can have all 2^n dichotomies.]
Corollary: O(VC(H)) training pts suffice for accuracy. [Again, the hidden constant is big.]
[If the VC dimension is finite, it tells us how the sample complexity grows with the number of features. If
the VC dimension is infinite, no amount of training data will make the classifier generalize well.]
Example: Linear classifiers in plane.
Recall ΠH(3) = 8: there exist 3 pts shattered by halfplanes. But ΠH(4) = 14: no 4 pts are shattered.
Hence:
– VC(H) = 3 [The VC dimension of halfplanes is 3.]
– Π_H(n) ≤ (e³/27) n³ [The shatter function is polynomial.]
– O(1) sample complexity.
[The VC dimension doesn’t always give us the tightest bound. In this example, the VC dimension promises that the number of ways halfplanes can classify the points is at worst cubic in n; but in reality, it’s quadratic in n. In general, linear classifiers in d dimensions have VC dimension d + 1, which is one dimension looser than the exponent Thomas Cover proved. That’s not a big deal, though, as the sample complexity and the accuracy bound are both based on the logarithm of the shatter function. So if we get the exponent wrong, it only changes a constant in the sample complexity.]
[The important thing is simply to show that there is some polynomial bound on the shatter function at all. VC dimension is not the only way to do that, but often it’s the easiest.]
[The main point you should take from this lecture is that if you want to have generalization, you need to limit the expressiveness of your hypothesis class so that you limit the number of possible dichotomies of a point set. This may or may not increase the bias, but if you don’t limit the number of dichotomies at all, the overfitting will be very bad. If you limit the hypothesis class, your artificial child will only need to look at O(d) cows to learn the concept of cows. If you don’t, your artificial child will need to look at every cow in the world, and every non-cow too.]
B Bonus Mini-Lecture: Latent Factor Analysis
LATENT FACTOR ANALYSIS [aka Latent Semantic Indexing]
[You can think of this as dimensionality reduction for matrices.]
Suppose X is a term-document matrix: [aka bag-of-words model]
row i represents document i; column j represents term j. [Term = word.] [Term-document matrices are usually sparse, meaning most entries are zero.] X_ij = occurrences of term j in doc i
better: log (1+ occurrences) [So frequent words don’t dominate.]
[Better still is to weight the entries so rare words give big entries and common words like “the” give small entries. To do that, you need to know how frequently each word occurs in general. I’ll omit the details, but this is the common practice.]
Recall SVD X = UDV^⊤ = Σ_{i=1}^{d} δ_i u_i v_i^⊤. Suppose δ_i ≤ δ_j for i ≥ j.
Unlike PCA, we usually don’t center X.
For large δi, ui and vi represent a cluster of documents & terms.
– Large components in u_i mark docs using similar/related terms, i.e., a genre.
– Large components in v_i mark frequent terms in that genre.
– E.g., u_1 might have large components for the romance novels,
–        v_1 might have large components for terms “passion,” “ravish,” “bodice” . . .
[. . . and δ1 would give us an idea how much bigger the romance novel market is than the markets for every other genre of books.]
[v1 and u1 tell us that there is a large subset of books that tend to use the same large subset of words. We can read off the words by looking at the larger components of v1, and we can read off the books by looking at the larger components of u1.]
[The property of being a romance novel is an example of a latent factor. So is the property of being the sort of word used in romance novels. There’s nothing in X that tells you explicitly that romance novels exist, but the similar vocabulary is a hidden connection between them that gives them a large singular value. The vector u1 reveals which books have that genre, and v1 reveals which words are emphasized in that genre.]
Like clustering, but clusters overlap: if u1 picks out romances & u2 picks out histories, they both pick out historical romances.
[So you can think of latent factor analysis as a sort of clustering that permits clusters to overlap. Another way in which it differs from traditional clustering is that the u-vectors contain real numbers, and so some points have stronger cluster membership than others. One book might be just a bit romance, another a lot.]
Application in market research:
identifying consumer types (hipster, suburban mom) & items bought together.
[For applications like this, the first few singular vectors are the most useful. Most of the singular vectors are mostly noise, and they have small singular values to tell you so. This motivates approximating a matrix by using only some of its singular vectors.]
Truncated SVD X′ = Σ_{i=1}^{r} δ_i u_i v_i^⊤ is a low-rank approximation of X, of rank r. [Assuming δ_r > 0.]
[We choose the singular vectors with the largest singular values, because they carry the most information.]
[Draw this by hand. truncate.pdf] [X′ (n × d) drawn as the product of an n × r matrix whose columns are u_1, . . . , u_r, an r × r diagonal matrix with diagonal entries δ_1, . . . , δ_r, and an r × d matrix whose rows are v_1^⊤, . . . , v_r^⊤.]
X′ is the rank-r matrix that minimizes the [squared] Frobenius norm
∥X − X′∥²_F = Σ_{i,j} (X_ij − X′_ij)²
Applications:
– Fuzzy search. [Suppose you want to find a document about gasoline prices, but the document you want doesn’t have the word “gasoline”; it has the word “petrol.” One cool thing about the reduced- rank matrix X′ is that it will probably associate that document with “gasoline,” because the SVD tends to group synonyms together.]
– Denoising. [The idea is to assume that X is a noisy measurement of some unknown matrix that probably has low rank. If that assumption is partly true, then the reduced-rank matrix X′ might be better than the input X.]
– Matrix compression. [As you can see above, if we use a low-rank approximation with a small rank r, we can express the approximate matrix as an SVD that takes up much less space than the original matrix. Often this low-rank approximation supports faster matrix computations.]
– Collaborative filtering: fills in unknown values, e.g., user ratings.
[Suppose the rows of X represents Netflix users and the columns represent movies. The entry Xi j is the review score that user i gave to movie j. But most users haven’t reviewed most movies. We want to fill in the missing values. Just as the rank reduction will associate “petrol” with “gasoline,” it will tend to associate users with similar tastes in movies, so the reduced-rank matrix X′ can predict ratings for users who didn’t supply any.]
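[A short numpy sketch of the rank-r truncation; the toy term-document matrix and the choice r = 10 are placeholders.]

import numpy as np

# Toy term-document matrix: rows are documents, columns are terms,
# entries are log(1 + occurrences).
rng = np.random.default_rng(0)
counts = rng.poisson(0.3, size=(100, 500))        # 100 docs, 500 terms (fake data)
X = np.log1p(counts)

U, S, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(S) Vt

r = 10                                            # keep the 10 largest singular values
X_approx = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]  # rank-r truncated SVD

# X_approx is the rank-r matrix closest to X in Frobenius norm.
print(np.linalg.norm(X - X_approx, "fro"))

# U[:, 0] scores each document on the first latent factor (a "genre");
# Vt[0, :] scores each term on that factor.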