What is Machine Learning
Jan 25, 2022
Machine Learning Problems
Typically our goal is to solve a prediction problem of the form: given an input x, predict an output y.
We’ll start with a few canonical examples.
Example: Spam Detection
Input: Incoming email
Output: “SPAM” or “NOT SPAM”
This is a binary classification problem: there are two possible outputs.
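As an illustration beyond the slides, here is a minimal sketch of a binary spam classifier using scikit-learn; the toy emails and labels are invented for demonstration.

```python
# Minimal sketch of binary spam classification with scikit-learn.
# The toy emails and labels below are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize now",        # SPAM
    "meeting rescheduled to 3pm",  # NOT SPAM
    "claim your free reward",      # SPAM
    "lecture notes attached",      # NOT SPAM
]
labels = ["SPAM", "NOT SPAM", "SPAM", "NOT SPAM"]

# Bag-of-words features + logistic regression = a simple binary classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(emails, labels)

print(model.predict(["free prize waiting"]))       # likely ["SPAM"]
print(model.predict(["see you at the lecture"]))   # likely ["NOT SPAM"]
```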
Example: Medical Diagnosis
Input: Symptoms (fever, cough, fast breathing, shaking, nausea, …)
Output: Diagnosis (pneumonia, flu, common cold, bronchitis, …)
A multiclass classification problem: choosing an output out of a discrete set of possible outputs.
How do we express uncertainty about the output? Probabilistic (or soft) classification:
P(pneumonia) = 0.7, P(flu) = 0.2, …
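One way to think about soft classification in code (a hypothetical sketch, not from the slides): the prediction is a probability distribution over the discrete output set rather than a single label. The two probabilities from the slide are padded with made-up values so the distribution sums to 1.

```python
import numpy as np

# A soft classifier outputs a probability distribution over diagnoses.
diagnoses = ["pneumonia", "flu", "common cold", "bronchitis"]
probs = np.array([0.7, 0.2, 0.07, 0.03])   # last two values assumed for illustration

assert np.isclose(probs.sum(), 1.0)

# A hard prediction can be recovered by taking the most likely class...
print(diagnoses[np.argmax(probs)])   # "pneumonia"
# ...but the full distribution also tells us how uncertain we are.
print(probs.max())                   # 0.7
```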
Example: Predicting a Stock Price
Input: History of the stock’s prices
Output: The price of the stock at the close of the next day
This is called a regression problem (for historical reasons): the output is continuous.
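A minimal regression sketch (assumptions: synthetic random-walk prices and simple lagged-price features; real stock prediction is far harder than this):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic "price history": a random walk standing in for real closing prices.
prices = 100 + np.cumsum(rng.normal(0, 1, size=200))

# Features: the previous 5 closing prices; target: the next day's close.
window = 5
X = np.array([prices[i:i + window] for i in range(len(prices) - window)])
y = prices[window:]

model = LinearRegression().fit(X, y)

# Predict tomorrow's close from the most recent 5 days (a continuous output).
print(model.predict(prices[-window:].reshape(1, -1)))
```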
Comparison to Rule-Based Approaches (Expert Systems)
Consider the problem of medical diagnosis.
1 Talk to experts (in this case, medical doctors).
2 Understand how the experts come up with a diagnosis.
3 Implement this process as an algorithm (a rule-based system): e.g., a set of symptoms → a particular diagnosis (a minimal sketch follows this list).
4 Potentially use logical deduction to infer new rules from the rules stored in the knowledge base.
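A rule-based diagnosis system might look roughly like the hypothetical sketch below; the symptom-to-diagnosis rules are invented purely for illustration, not medical advice.

```python
# Hypothetical rule-based diagnosis: hand-written rules mapping
# sets of symptoms to a diagnosis (purely illustrative).
def diagnose(symptoms: set[str]) -> str:
    if {"fever", "cough", "fast breathing"} <= symptoms:
        return "pneumonia"
    if {"fever", "cough", "shaking"} <= symptoms:
        return "flu"
    if {"cough"} <= symptoms:
        return "common cold"
    return "unknown"   # the rules don't cover this input combination

print(diagnose({"fever", "cough", "fast breathing"}))  # pneumonia
print(diagnose({"nausea"}))                            # unknown: poor generalization
```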
Rule-Based Approach
Fig 1-1 from Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron (2017).
Advantages of Rule-Based Approaches
Leverage existing domain expertise.
Generally interpretable: we can describe the rule to another human.
Produce reliable answers for the scenarios that are included in the knowledge bases.
Limitations of Rule-Based Systems
Labor intensive to build: experts’ time is expensive.
Rules work very well for areas they cover, but often do not generalize to unanticipated input combinations.
Don’t naturally handle uncertainty.
The Machine Learning Approach
Instead of explicitly engineering the process that a human expert would use to make the decision…
We have the machine learn on its own from inputs and outputs (decisions).
We provide training data: many examples of (input x, output y) pairs, e.g.
A set of videos, and whether or not each has a cat in it.
A set of emails, and whether or not each one should go to the spam folder.
Learning from training data of this form (inputs and outputs) is called supervised learning.
Machine Learning Algorithm
A machine learning algorithm learns from the training data:
Input: Training Data (e.g., emails x and their labels y)
Output: A prediction function that produces output y given input x.
The goal of machine learning is to find the “best” (to be defined) prediction function automatically, based on the training data.
The success of ML depends on
The availability of large amounts of data;
Generalization to unseen samples (the test set): just memorizing the training set will not be useful.
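A sketch of this pipeline (assuming scikit-learn and a synthetic dataset): the learning algorithm takes labeled training data and returns a prediction function, which we then judge on held-out test data rather than on the examples it could have memorized.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic labeled data standing in for (input x, output y) pairs.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out a test set to measure generalization to unseen samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# The learning algorithm: training data in, prediction function out.
predict_fn = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("train accuracy:", predict_fn.score(X_train, y_train))
print("test accuracy: ", predict_fn.score(X_test, y_test))
```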
Machine Learning Approach
Fig 1-2 from Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron (2017).
Key concepts
The most common ML problem types:
Classification (binary and multiclass)
Regression
Prediction function: predicts output y (e.g. spam or not?) given input x (e.g. email)
Training data: a set of (input x, output y) pairs
Supervised learning algorithm: takes training data and produces a prediction function
Beyond prediction
Unsupervised learning: finding structures in data, e.g. clustering
Reinforcement learning: optimizing a long-term objective, e.g. playing Go
Representation learning: learning good features of real-world objects, e.g. text
Core Questions in Machine Learning
Given any task, the following questions need to be answered:
Modeling: What class of prediction functions are we considering?
Learning: How do we learn the “best” prediction function in this class from our training data?
Inference: How do we compute the output of the prediction function for a new input?
Statistical Learning Theory
Jan 25, 2022
Decision theory
In data science problems, we generally need to:
Make a decision:
Move email to spam folder?
Take an action:
In a self-driving car, make a right turn
Reject the hypothesis that θ = 0 (classical statistics)
Produce some output:
Whose face is it in the image?
The Hindi translation of a Japanese input sentence
Predicting where a storm will be in an hour (what forms of output are possible here?)
An action is the generic term for what is produced by our system.
We make our decision based on context:
Inputs [ML]
Covariates [Statistics]
Examples of inputs
The location of the storm in the last 24 hours, other weather-related measurements
A search query
Inputs are often paired with outputs or labels.
Examples of outcomes/outputs/labels
Whether or not the picture actually contains an animal
The storm’s location one hour after the query
Which, if any, of the suggested URLs were selected
Evaluation Criterion
Decision theory is about finding “optimal” actions, under various definitions of optimality.
Examples of Evaluation Criteria
Is the classification correct?
Does the transcription exactly match the spoken words?
Should we give partial credit (for getting only some of the words right)? How?
How far is the storm from the predicted location? (If we’re producing a point estimate)
How likely is the storm’s actual location under the predicted distribution? (If we’re doing density prediction)
Typical Sequence of Events
Many problem domains can be formalized as follows:
1 Observe input x.
2 Take action a.
3 Observe outcome y.
4 Evaluate action in relation to the outcome.
Three spaces:
Input space: X
Action space: A
Outcome space: Y
Formalization
Prediction Function
A prediction function (or decision function) gets input x ∈ X and produces an action a ∈ A:
f : X → A
Loss Function
A loss function evaluates an action in the context of the outcome y:
l : A × Y → R
(a, y) → l(a, y)
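In code, a loss function is just a function of an action and an outcome. Here is a small sketch of two standard choices (the 0-1 loss used later for classification and the square loss for regression):

```python
# A loss function maps (action a, outcome y) to a real number.
def zero_one_loss(a, y):
    """0-1 loss: 1 if the action (predicted label) is wrong, 0 if correct."""
    return 0.0 if a == y else 1.0

def square_loss(a, y):
    """Square loss: squared difference between action and outcome."""
    return (a - y) ** 2

print(zero_one_loss("SPAM", "NOT SPAM"))  # 1.0
print(square_loss(3.0, 2.5))              # 0.25
```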
Evaluating a Prediction Function
Goal: Find the optimal prediction function.
Intuition: If we can evaluate how good a prediction function is, we can turn this into an optimization problem.
The loss function l evaluates a single action.
How do we evaluate the prediction function as a whole?
We will use the standard statistical learning theory framework.
Statistical Learning Theory
Define a space where the prediction function is applicable.
Assume there is a data-generating distribution PX×Y.
All input/output pairs (x,y) are generated i.i.d. from PX×Y.
One common desideratum is to have a prediction function f (x) that “does well on average”: l(f(x),y) is usually small, in some sense
How can we formalize this?
Definition
The risk of a prediction function f : X → A is
R(f) = E(x,y)∼PX×Y [l(f(x), y)].
In words, it’s the expected loss of f over PX×Y.
We can’t actually compute the risk function:
Since we don’t know PX×Y, we cannot compute the expectation. But we can estimate it.
The Bayes Prediction Function
Definition
A Bayes prediction function f ∗ : X → A is a function that achieves the minimal risk among all possible functions:
f∗ ∈ argmin_f R(f),
where the minimum is taken over all functions from X to A.
The risk of a Bayes prediction function is called the Bayes risk.
A Bayes prediction function is often called the “target function”, since it’s the best prediction function we can possibly produce.
Example: Multiclass Classification
Spaces: A = Y = {1, …, k}
0-1 loss:
l(a, y) = 1(a ≠ y) := 1 if a ≠ y, 0 otherwise.
The risk of f is then
E[1(f(x) ≠ y)] = 0 · P(f(x) = y) + 1 · P(f(x) ≠ y)
             = P(f(x) ≠ y),
which is just the misclassification error rate.
The Bayes prediction function returns the most likely class:
f∗(x) ∈ argmax_{1 ≤ c ≤ k} P(y = c | x)
(conditioning on x, the error probability is 1 − P(y = f(x) | x), which is minimized by choosing the most probable class).
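A small numeric sketch (toy numbers assumed): with a finite input space and a known conditional distribution P(y | x), the Bayes prediction function is just the per-x argmax, and its risk is the Bayes risk.

```python
import numpy as np

# Toy setup: 3 possible inputs (rows), 4 classes (columns), values invented.
# P_x is the marginal over inputs; P_y_given_x[i, c] = P(y = c | x = i).
P_x = np.array([0.5, 0.3, 0.2])
P_y_given_x = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.60, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])

# Bayes prediction function: the most likely class for each input.
f_star = P_y_given_x.argmax(axis=1)

# Under 0-1 loss, the conditional error at x is 1 - max_c P(y = c | x),
# so the Bayes risk is its average over P_x.
bayes_risk = np.sum(P_x * (1 - P_y_given_x.max(axis=1)))
print(f_star)       # [0 1 0]
print(bayes_risk)   # 0.5*0.3 + 0.3*0.4 + 0.2*0.75 = 0.42
```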
But we can’t compute the risk!
Can’t compute R(f ) = E[l(f (x),y)] because we don’t know PX×Y.
One thing we can do in ML/statistics/data science is estimate it:
Assume we have sample data:
Let Dn = ((x1,y1),…,(xn,yn)) be drawn i.i.d. from PX×Y.
We draw inspiration from the strong law of large numbers:
If z1, …, zn are i.i.d. with expected value Ez, then
lim_{n→∞} (1/n) Σ_{i=1}^n zi = Ez
with probability 1.
The Empirical Risk
Let Dn = ((x1, y1), …, (xn, yn)) be drawn i.i.d. from PX×Y.
Definition
The empirical risk of f : X → A with respect to Dn is
Rˆn(f) = (1/n) Σ_{i=1}^n l(f(xi), yi).
By the strong law of large numbers,
lim_{n→∞} Rˆn(f) = R(f)
almost surely.
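A quick simulation (assuming a toy distribution chosen here so the true risk is known) showing the empirical risk of a fixed prediction function approaching its risk as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy distribution: x ~ Uniform[0, 1], y = 1 with probability 0.8, else 0.
def sample(n):
    x = rng.uniform(0, 1, size=n)
    y = (rng.uniform(0, 1, size=n) < 0.8).astype(int)
    return x, y

# A fixed prediction function: always predict 1.
f = lambda x: np.ones_like(x, dtype=int)

# 0-1 loss; the true risk of f is P(y != 1) = 0.2.
for n in [10, 100, 10_000, 1_000_000]:
    x, y = sample(n)
    empirical_risk = np.mean(f(x) != y)
    print(n, empirical_risk)   # converges to 0.2 as n grows
```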
Empirical Risk Minimization
Definition
A function fˆ is an empirical risk minimizer if
fˆ ∈ argmin_f Rˆn(f),
where the minimum is taken over all functions f : X → A.
In an ideal world we’d want to find the risk minimizer. Is the empirical risk minimizer close enough?
In practice, we only ever have a finite sample…
Empirical Risk Minimization
PX = Uniform[0, 1], Y ≡ 1 (i.e. Y is always 1).
[Plot of PX×Y: all of the probability mass lies on the line y = 1, for x ∈ [0, 1].]
Empirical Risk Minimization
PX = Uniform[0, 1], Y ≡ 1 (i.e. Y is always 1).
[Plot: a sample of size 3 from PX×Y, shown as three points on the line y = 1.]
Empirical Risk Minimization
PX = Uniform[0, 1], Y ≡ 1 (i.e. Y is always 1).
A proposed prediction function:
fˆ(x) = 1(x ∈ {0.25, 0.5, 0.75}) = 1 if x ∈ {0.25, 0.5, 0.75}, 0 otherwise
[Plot: fˆ is 0 everywhere except at the three sampled points, where it equals 1.]
Empirical Risk Minimization
PX = Uniform[0, 1], Y ≡ 1 (i.e. Y is always 1).
Under either the square loss or the 0/1 loss, fˆ has Empirical Risk = 0 and Risk = 1.
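The same toy example in code (a sketch reproducing the slides’ setup): the memorizing predictor gets empirical risk 0 on the three sampled points but risk 1, since a fresh x almost surely misses the memorized set.

```python
import numpy as np

rng = np.random.default_rng(0)

# P_X = Uniform[0, 1], Y = 1 always.
x_train = rng.uniform(0, 1, size=3)
y_train = np.ones(3)

# The memorizing predictor: 1 exactly at the training inputs, 0 elsewhere.
def f_hat(x):
    return np.isin(x, x_train).astype(float)

# Empirical risk (0-1 loss) on the training sample: 0.
print(np.mean(f_hat(x_train) != y_train))   # 0.0

# Estimate of the true risk on fresh samples: essentially 1,
# because a new x falls outside the memorized set with probability 1.
x_new = rng.uniform(0, 1, size=100_000)
print(np.mean(f_hat(x_new) != 1.0))         # ~1.0
```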
Empirical Risk Minimization
In this case, ERM led to a function f that just memorized the data.
How can we improve generalization from the training inputs to new inputs?
We need to smooth things out somehow!
A lot of modeling is about spreading and extrapolating information from one part of the input space X into unobserved parts of the space.
One approach is constrained ERM:
Instead of minimizing empirical risk over all prediction functions,
We constrain our search to a particular subset of the space of functions, called a hypothesis space.
Hypothesis Spaces
Definition
A hypothesis space F is a set of prediction functions X → A that we consider when applying ERM.
Desirable properties of a hypothesis space:
Includes only those functions that have the desired “regularity”, e.g. smoothness, simplicity
Easy to work with (e.g., we have efficient algorithms to find the best function within the space)
Most applied work is about designing good hypothesis spaces for specific tasks.
Constrained Empirical Risk Minimization
Given a hypothesis space F, a set of prediction functions mapping X → A:
An empirical risk minimizer (ERM) in F is a function fˆn such that
fˆn ∈ argmin_{f∈F} (1/n) Σ_{i=1}^n l(f(xi), yi).
A risk minimizer in F is a function fF∗ ∈ F such that
fF∗ ∈ argmin_{f∈F} E[l(f(x), y)].
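A sketch of constrained ERM on the same toy problem (the hypothesis space here is my own choice for illustration): restrict F to the two constant functions x ↦ 0 and x ↦ 1 and pick the one with the smaller empirical risk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy problem: x ~ Uniform[0, 1], y = 1 always.
x_train = rng.uniform(0, 1, size=3)
y_train = np.ones(3)

# A tiny hypothesis space F: the two constant functions.
hypothesis_space = {
    "always_0": lambda x: np.zeros_like(x),
    "always_1": lambda x: np.ones_like(x),
}

# Constrained ERM: pick the function in F with the smallest empirical risk (0-1 loss).
def empirical_risk(f):
    return np.mean(f(x_train) != y_train)

best_name = min(hypothesis_space, key=lambda name: empirical_risk(hypothesis_space[name]))
print(best_name)  # "always_1", which also has true risk 0 on this distribution
```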
Excess Risk Decomposition
Definitions:
f∗ ∈ argmin_f E[l(f(x), y)]
fF ∈ argmin_{f∈F} E[l(f(x), y)]
fˆn ∈ argmin_{f∈F} (1/n) Σ_{i=1}^n l(f(xi), yi)
Approximation error (of F) = R(fF) − R(f∗)
Estimation error (of fˆn in F) = R(fˆn) − R(fF)
Excess Risk Decomposition for ERM
Definition
The excess risk compares the risk of f to the Bayes optimal f∗:
Excess Risk(f) = R(f) − R(f∗)
Can excess risk ever be negative?
The excess risk of the ERM fˆn can be decomposed:
Excess Risk(fˆn) = R(fˆn) − R(f∗)
                = [R(fˆn) − R(fF)] + [R(fF) − R(f∗)],
where the first bracket is the estimation error and the second is the approximation error.
There is a tradeoff between estimation error and approximation error.
Approximation Error
Approximation error R(fF)−R(f∗) is
a property of the class F
the penalty for restricting to F (rather than considering all possible functions)
Bigger F means smaller approximation error.
Concept check: Is approximation error a random or non-random variable?
Estimation Error
Estimation error R(fˆn) − R(fF)
is the performance hit for choosing f using finite training data
is the performance hit for minimizing empirical risk rather than true risk
With smaller F we expect smaller estimation error.
Under typical conditions: “With infinite training data, estimation error goes to zero.”
Concept check: Is estimation error a random or non-random variable?
ERM in Practice
What have we been glossing over by writing “argmin”?
In practice, we need a method to find fˆ ∈ F: this can be very difficult!
For nice choices of loss functions and classes F, we can get arbitrarily close to the exact minimizer
But that takes time – is it always worth it?
For some hypothesis spaces (e.g. neural networks), we don’t know how to find fˆ ∈ F.
Optimization Error
In practice, we don’t find the ERM fˆn ∈ F.
We find f̃n ∈ F that we hope is good enough.
Optimization error: If f̃n is the function our optimization method returns, and fˆn is the empirical risk minimizer, then
Optimization Error = R(f̃n) − R(fˆn).
Error Decomposition in Practice
Excess risk decomposition for the function f̃n returned by an optimization algorithm in practice:
Excess Risk(f̃n) = R(f̃n) − R(f∗)
                = [R(f̃n) − R(fˆn)] + [R(fˆn) − R(fF)] + [R(fF) − R(f∗)],
where the three brackets are the optimization error, the estimation error, and the approximation error.
It would be nice to observe the error decomposition for a practical f̃n!
How would we address each type of error?
Why is this usually impossible?
But we could construct an artificial example, where we know PX×Y and f∗ and fF…
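One way such an artificial example could look (entirely made up, with toy numbers): a small discrete distribution where R can be computed exactly, a restricted hypothesis space F, a finite training sample, and a deliberately lazy "optimizer", so every term in the decomposition is observable.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Artificial setup with everything known: X = {0, 1}, Y = {0, 1}, 0-1 loss.
P_x = np.array([0.5, 0.5])
P_y1_given_x = np.array([0.9, 0.4])   # P(y = 1 | x), toy values

def risk(f):
    """Exact risk under 0-1 loss; f is a length-2 array with f[x] in {0, 1}."""
    p_wrong = np.where(f == 1, 1 - P_y1_given_x, P_y1_given_x)
    return float(np.sum(P_x * p_wrong))

# All functions X -> Y (4 of them) vs. a restricted F: constant functions only.
all_functions = [np.array(f) for f in product([0, 1], repeat=2)]
F = [np.array([0, 0]), np.array([1, 1])]

f_star = min(all_functions, key=risk)   # Bayes prediction function
f_F = min(F, key=risk)                  # risk minimizer in F

# A finite training sample and the empirical risk minimizer in F.
n = 20
x = rng.choice([0, 1], size=n, p=P_x)
y = (rng.uniform(size=n) < P_y1_given_x[x]).astype(int)

def empirical_risk(f):
    return float(np.mean(f[x] != y))

f_hat = min(F, key=empirical_risk)      # ERM in F
f_tilde = F[0]                          # a deliberately lazy "optimizer"

print("approximation error:", risk(f_F) - risk(f_star))
print("estimation error:   ", risk(f_hat) - risk(f_F))
print("optimization error: ", risk(f_tilde) - risk(f_hat))
```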
ERM Overview
Given a loss function l : A × Y → R:
Choose a hypothesis space F.
Use an optimization method to find an empirical risk minimizer fˆn ∈ F:
fˆn = argmin_{f∈F} (1/n) Σ_{i=1}^n l(f(xi), yi).
(Or find an f̃n that comes close to fˆn.)
The data scientist’s job:
Choose F that balances approximation and estimation error.
As we get more training data, we can use a bigger F.