COMS 4771 Introduction to Machine Learning
Nakul Verma
Machine learning: what?
Study of making machines learn a concept without having to explicitly program it.
• Constructing algorithms that can:
• learn from input data, and be able to make predictions.
• find interesting patterns in data.
• Analyzing these algorithms to understand the limits of ‘learning’
Machine learning: why?
We are smart programmers; why can’t we just write some code with a set of rules to solve a particular problem?
Try writing down a set of rules, as code, to distinguish two particular faces.
What if we don’t even know the explicit task we want to solve?
Machine learning: problems in the real world
• Recommendation systems (Netflix, Amazon, Overstock)
• Stock prediction (Goldman Sachs, Morgan Stanley)
• Risk analysis (Credit card, Insurance)
• Face and object recognition (Cameras, Facebook, Microsoft)
• Speech recognition (Siri, Cortana, Alexa, Dragon)
• Search engines and content filtering (Google, Yahoo, Bing)
Machine learning: how?
So… how do we do it?
This is what we will focus on in this class!
This course
We will learn:
• How to study a prediction problem in an abstract manner and come up with a solution that applies to many problems simultaneously.
• Different types of paradigms and algorithms that have been
successful in prediction tasks.
• How to systematically analyze how good an algorithm is for a prediction task.
Prerequisites
Mathematical prerequisites
• Basics of probability and statistics
• Linear algebra
• Calculus
Computational prerequisites
• Basics of algorithm and data structure design
• Ability to program in a high-level language.
Administrivia
Website: http://www.cs.columbia.edu/~verma/classes/ml/
The team:
Instructor: Nakul Verma (me)
TAs
Students: you!
Evaluation:
• Homeworks (40%)
• Exam 1 (30%)
• Exam 2 (30%)
Policies
Homeworks:
• No late homework
• Must type your homework (no handwritten homework)
• Please include your name and UNI
• Submit a PDF copy of the assignment via Gradescope
• We encourage discussing the problems (Piazza / study groups / etc.), but please don’t copy.
Announcement!
• Visit the course website
• Review the basics (prerequisites)
• HW0 is out!
• Sign up on Piazza & Gradescope
Let’s get started!
Machine Learning: the basics…
A closer look at some prediction problems…
• Handwritten character recognition: input is an image of a handwritten digit, output is a label in { 0, 1, 2, …, 9 }
• Spam filtering: input is an email, output is a label in { spam, not spam }
• Object recognition: input is an image, output is a label in { building, tree, car, road, sky, … }
Machine Learning: the basics…
Commonalities in a prediction problem:
• Input: an observation $x$ (e.g., an image of a handwritten digit)
• Output: a prediction $y$ from a fixed set of labels, e.g., $y = 5 \in \{ 0, 1, 2, \dots, 9 \}$
• To learn: a mapping from inputs to outputs
Machine Learning: the basics…
Data: labeled examples $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$
Assumption: there is a (relatively simple) function $f^*$ such that $f^*(x_i) = y_i$ for most $i$
Learning task: given $n$ examples from the data, find an approximation $\hat{f} \approx f^*$
Goal: $\hat{f}$ gives mostly correct predictions on unseen examples
This setting is called supervised learning.
Training phase: labeled training data ($n$ examples from the data) → Learning Algorithm → ‘classifier’ $\hat{f}$
Testing phase: unlabeled test data (unseen / future data) → ‘classifier’ $\hat{f}$ → prediction
Machine Learning: the basics…
Data: unlabeled examples $x_1, x_2, \dots, x_n$
Assumption: there is an underlying structure in the $x_i$
Learning task: discover that structure given the $n$ examples from the data
Goal: come up with a summary of the data using the discovered structure
This setting is called unsupervised learning. More on it later in the course…
Supervised Machine Learning
Statistical modeling approach:
Labeled training data ($n$ examples $(x_1, y_1), \dots, (x_n, y_n)$), drawn independently from a fixed underlying distribution (this is the i.i.d. assumption), is given to a learning algorithm, which outputs a ‘classifier’ $f$.
The classifier $f$ is selected from a pool of models $\mathcal{F}$ so as to maximize label agreement with the training data.
How to select $f$?
• Maximum likelihood estimation (best fits the data)
• Maximum a posteriori estimation (best fits the data, but incorporates prior assumptions)
• Optimization of a ‘loss’ criterion (best discriminates the labels)
• …
Maximum Likelihood Estimation (MLE)
Given some data $x_1, \dots, x_n$ drawn i.i.d. from an unknown distribution (let’s forget about the labels for now).
Say we have a model class $\mathcal{P} = \{ p_\theta : \theta \in \Theta \}$, i.e., each model $p_\theta$ can be described by a set of parameters $\theta$.
Goal: find the parameter setting $\theta$ that best fits the data.
If each $p_\theta$ is a probability model, then we can find the best fitting model via likelihood estimation:
Likelihood: $L(\theta \mid x_1, \dots, x_n) := p_\theta(x_1, \dots, x_n) \overset{\text{i.i.d.}}{=} \prod_{i=1}^n p_\theta(x_i)$
Interpretation: how probable (or how likely) is the data given the model $p_\theta$?
MLE: the parameter setting $\theta_{\text{MLE}}$ that maximizes $L(\theta \mid x_1, \dots, x_n)$.
MLE Example
Fitting a statistical probability model to heights of females.
Height data (in inches): $x_1, x_2, \dots, x_n$ = 60, 62, 53, 58, … $\in \mathbb{R}$
Model class: Gaussian models in $\mathbb{R}$, $p_{\mu, \sigma^2}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
$\mu$ = mean parameter, $\sigma^2$ = variance parameter (> 0)
So, what is the MLE for the given data $X$?
MLE Example (contd.)
Height data (in inches): $x_1, \dots, x_n \in \mathbb{R}$; model class: Gaussian models in $\mathbb{R}$.
MLE: find $\theta = \{\mu, \sigma^2\}$ maximizing $\prod_{i=1}^n p_\theta(x_i)$. Maximizing this product directly? Good luck!
Trick #1: maximize the “log” likelihood instead: $\log$ is monotonic, so $\arg\max_\theta \prod_i p_\theta(x_i) = \arg\max_\theta \sum_i \log p_\theta(x_i)$.
Trick #2: finding the max (or other extreme values) of a function amounts to analyzing the ‘stationary points’ of the function, that is, the values at which the derivative of the function is zero.
MLE Example (contd. 2)
Let’s calculate the best fitting $\theta = \{\mu, \sigma^2\}$.
“Log” likelihood (using i.i.d.): $\sum_{i=1}^n \log p_\theta(x_i) = \sum_{i=1}^n \left[ -\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2} \right]$
Maximizing over $\mu$ (set the derivative to zero): $\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i$
Maximizing over $\sigma^2$ (set the derivative to zero): $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu})^2$
MLE Example
Female height data: 60, 62, 53, 58, … $\in \mathbb{R}$
The best fitting Gaussian model is the one with parameters $\hat{\mu} = \frac{1}{n}\sum_i x_i$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_i (x_i - \hat{\mu})^2$.
What about other model classes?
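To make the result concrete, here is a minimal NumPy sketch (not from the slides; the height values beyond the first four are made up) that computes the Gaussian MLE for a small sample:

```python
import numpy as np

# Illustrative height sample (inches); the last two values are made up.
heights = np.array([60.0, 62.0, 53.0, 58.0, 64.0, 61.0])

# Gaussian MLE: the sample mean and the (biased, 1/n) sample variance,
# exactly the stationary-point solutions derived above.
mu_hat = heights.mean()
sigma2_hat = ((heights - mu_hat) ** 2).mean()   # note: 1/n, not 1/(n-1)

print(f"MLE mean     = {mu_hat:.2f}")
print(f"MLE variance = {sigma2_hat:.2f}")
```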
Other popular probability models
• Bernoulli model (coin tosses): scalar valued
• Multinomial model (dice rolls): scalar valued
• Poisson model (rare counting events): scalar valued
• Gaussian model (most common phenomenon): scalar valued
But most machine learning data is vector valued!
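For reference, the standard forms of these scalar-valued models (not reproduced from the slides) are:

```latex
% Standard scalar-valued probability models
\begin{align*}
\text{Bernoulli}(\theta):\quad & p_\theta(x) = \theta^{x}(1-\theta)^{1-x}, \quad x \in \{0, 1\} \\
\text{Multinomial (single roll / categorical)}:\quad & p_\theta(x = j) = \theta_j, \quad \textstyle\sum_{j=1}^{k} \theta_j = 1 \\
\text{Poisson}(\lambda):\quad & p_\lambda(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}, \quad x \in \{0, 1, 2, \dots\} \\
\text{Gaussian}(\mu, \sigma^2):\quad & p_{\mu,\sigma^2}(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2 / 2\sigma^2}
\end{align*}
```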
Multivariate Gaussian Model
Multivariate versions of the other scalar-valued models are also available.
Univariate Gaussian (in $\mathbb{R}$): $\mu$ = mean parameter, $\sigma^2$ = variance parameter (> 0)
Multivariate Gaussian (in $\mathbb{R}^d$): $\mu$ = mean vector, $\Sigma$ = covariance matrix (positive definite)
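For completeness, the standard multivariate Gaussian density in $\mathbb{R}^d$ is:

```latex
% Multivariate Gaussian density in R^d
p_{\mu,\Sigma}(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}
  \exp\!\left( -\tfrac{1}{2}\, (x-\mu)^\top \Sigma^{-1} (x-\mu) \right)
```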
From MLE to Classification
MLE sounds great; how do we use it to do classification with labelled data?
Idea: predict the label with the largest posterior probability, $f(x) = \arg\max_y P[Y = y \mid X = x]$. (Why this particular choice? More later…)
Bayes rule: $P[Y = y \mid X = x] = \frac{P[X = x \mid Y = y]\, P[Y = y]}{P[X = x]}$, and the denominator $P[X = x]$ is independent of $y$, so $f(x) = \arg\max_y P[X = x \mid Y = y]\, P[Y = y]$.
Class prior $P[Y = y]$: simply the probability of a data sample occurring from category $y$.
Class conditional $P[X = x \mid Y = y]$: a separate probability model for each individual category / class type; we can find the appropriate parameters for each model using MLE!
Classification via MLE Example
Task: learn a classifier to distinguish males from females based on, say, height and weight measurements.
Classifier: $f(x) = \arg\max_{y \in \{\text{male}, \text{female}\}} P[X = x \mid Y = y]\, P[Y = y]$
Using labelled training data, learn all the parameters:
Learning class priors:
• $P[Y = \text{male}]$ = fraction of training data labelled as male
• $P[Y = \text{female}]$ = fraction of training data labelled as female
Learning class conditionals:
• $\theta^{(\text{male})}$ = MLE using only the male data
• $\theta^{(\text{female})}$ = MLE using only the female data
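A minimal Python sketch of this classifier, assuming 2-D Gaussian class conditionals over (height, weight); the training values and helper names are illustrative, not from the course:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative (made-up) training data: rows are [height, weight].
male   = np.array([[70, 180], [68, 170], [72, 195], [69, 175]], dtype=float)
female = np.array([[63, 120], [65, 135], [61, 115], [64, 130]], dtype=float)

# Class priors: fraction of training data with each label.
n = len(male) + len(female)
prior = {"male": len(male) / n, "female": len(female) / n}

# Class conditionals: a Gaussian fit by MLE to each class's data only
# (sample mean and 1/n covariance).
def mle_gaussian(data):
    mu = data.mean(axis=0)
    sigma = np.cov(data, rowvar=False, bias=True)  # bias=True -> 1/n (MLE)
    return multivariate_normal(mean=mu, cov=sigma)

cond = {"male": mle_gaussian(male), "female": mle_gaussian(female)}

# Classifier: f(x) = argmax_y P[X = x | Y = y] * P[Y = y]
def classify(x):
    return max(prior, key=lambda y: cond[y].pdf(x) * prior[y])

print(classify([66, 150]))  # predicts whichever class gives the larger score
```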
What are we doing geometrically?
Data geometry: [figure] scatter plot of the male and female training data in the height/weight plane (Height on one axis, Weight on the other).
What are we doing geometrically?
Data geometry: [figure] the same height/weight scatter plot, now with the contours of the MLE Gaussian (male) and the MLE Gaussian (female) overlaid; a new point $x$ is classified by comparing the two fitted models $p_\theta$.
Classification via MLE Example
We just made our first predictor!
But why the particular choice $f(x) = \arg\max_y P[Y = y \mid X = x]$?
Accuracy of a classifier $f$: $\mathrm{acc}(f) := P_{(X, Y)}[f(X) = Y]$
Assume binary classification (for simplicity), with labels $y \in \{0, 1\}$.
Let $f(x) = \arg\max_y P[Y = y \mid X = x]$ (the Bayes classifier) and let $h$ be any classifier.
Theorem: $P[f(X) = Y] \geq P[h(X) = Y]$
!!! The Bayes classifier is optimal !!!
Optimality of Bayes classifier
Theorem: for the Bayes classifier $f$ and any classifier $h$, $P[f(X) = Y] \geq P[h(X) = Y]$.
Observation: for any classifier $h$,
$P[h(X) = Y \mid X = x] = P[Y = 1 \mid X = x]\,\mathbf{1}[h(x) = 1] + P[Y = 0 \mid X = x]\,\mathbf{1}[h(x) = 0]$
So, by the choice of $f$ (it always picks the label with the larger conditional probability):
$P[f(X) = Y \mid X = x] \geq P[h(X) = Y \mid X = x]$ for every $x$
Integrate over $X$ to remove the conditioning: $P[f(X) = Y] \geq P[h(X) = Y]$.
So… is classification a solved problem?
We know that Bayes classifier is optimal.
So have we solved all classification problems?
Not even close! Why? To use the Bayes classifier we must estimate $P[Y \mid X]$ (equivalently, $P[X \mid Y]$ and $P[Y]$) from data:
• How good is the model class?
• The quality of the estimate degrades as the dimension of $X$ increases!
• This is an active area of research!
Classification via Prob. Models: Variation
Naïve Bayes classifier:
$P[X = x \mid Y = y] = \prod_{j=1}^{d} P[X^{(j)} = x^{(j)} \mid Y = y]$
Naïve Bayes assumption: the individual features/measurements are independent given the class label.
Advantages: a computationally very simple model; quick to code.
Disadvantages: does not properly capture the interdependence between features, which can give bad estimates.
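A minimal sketch of the idea, assuming binary (0/1) features with Bernoulli class conditionals and Laplace smoothing (one possible instantiation of Naïve Bayes; the data is made up):

```python
import numpy as np

# Illustrative binary feature matrix (rows = examples, columns = features) and labels.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 0]])
y = np.array([1, 1, 0, 0, 0])

classes = np.unique(y)
priors = {c: np.mean(y == c) for c in classes}
# Per-class, per-feature Bernoulli parameters, with Laplace smoothing to
# avoid zero probabilities for feature values unseen in a class.
theta = {c: (X[y == c].sum(axis=0) + 1) / (np.sum(y == c) + 2) for c in classes}

def predict(x):
    """Naive Bayes: argmax_y  log P[Y=y] + sum_j log P[X_j = x_j | Y=y]."""
    def log_score(c):
        p = theta[c]
        return np.log(priors[c]) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return max(classes, key=log_score)

print(predict(np.array([1, 0, 1])))  # predicts class 1 on this toy data
```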
How to evaluate the quality of a classifier?
Your friend claims: “My classifier is better than yours.” How can you evaluate this statement?
Given a classifier $f$, we essentially need to compute its accuracy $P[f(X) = Y]$.
But… we don’t know the underlying distribution. Can we use the training data to estimate it, i.e., compute $\frac{1}{n}\sum_{i=1}^n \mathbf{1}[f(x_i) = y_i]$?
No: the training data was already used to construct $f$, so it is NOT an unbiased estimator of the accuracy of $f$; it severely overestimates the accuracy!
How to evaluate the quality of a classifier?
General strategy (see the sketch below):
• Divide the labelled data into training and test sets FIRST
• Only use the training data for learning $f$
• Then the test data can be used as an unbiased estimate of the predictive accuracy of $f$
Training phase: labeled training data ($n$ examples from the data) → Learning Algorithm → ‘classifier’ $f$
Testing phase: unlabeled test data (unseen / future data) → ‘classifier’ $f$ → prediction
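A minimal sketch of this strategy, using scikit-learn's GaussianNB as a stand-in classifier and synthetic data (both are assumptions for illustration, not part of the course materials):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Illustrative synthetic data: two Gaussian blobs with labels 0 and 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(3, 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# 1) Split the labelled data into training and test sets FIRST.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 2) Only use the training data for learning f.
f = GaussianNB().fit(X_train, y_train)

# 3) The held-out test data gives an unbiased estimate of predictive accuracy.
train_acc = np.mean(f.predict(X_train) == y_train)  # optimistic
test_acc = np.mean(f.predict(X_test) == y_test)     # honest estimate
print(f"train accuracy = {train_acc:.3f}, test accuracy = {test_acc:.3f}")
```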
What we learned…
• Why machine learning
• Basics of Supervised Learning
• Maximum Likelihood Estimation
• Learning a classifier via probabilistic modelling
• Optimality of Bayes classifier
• Naïve Bayes classifier
• How to evaluate the quality of a classifier
Questions?
Next time…
Direct ways of finding the discrimination boundary
Remember
• Visit the course website http://www.cs.columbia.edu/~verma/classes/ml/
• Review the basics (prerequisites)
• HW0 is out
• Sign up on Piazza & Gradescope