CSCI 567: Machine Learning
Copyright By PowCoder代写 加微信 powcoder
Lecture 11, Nov 17
Administrivia
• Quiz 2 is on December 1
• More info will be posted on Ed later
• Today’s plan:
• Density estimation & Naïve Bayes
• Multi-armed bandits
• Responsible ML
Density estimation
Density estimation
• Introduction
• Parametric methods
• Non-parametric methods
With clustering using GMMs, our high-level goal was the following:
Given a training set x1, . . . ,xn, estimate a density function p that could have generated this
dataset (via xi
This is a special case of the general problem of density estimation, an important unsupervised learn-
ing problem.
Density estimation is useful for many downstream applications
• we have seen clustering already, will see more today
• these applications also provide a way to measure quality of the density estimator
Introduction
Density estimation
• Introduction
• Parametric methods
• Non-parametric methods
Parametric estimation assumes a generative model parametrized by !:
p(x) = p(x ;!)
• GMM: p(x ;!) =
j=1 !jN(x | µj ,!j) where ! = {!j ,µj ,!j}
• Multinomial: a discrete variable with values in {1, 2, . . . , k} s.t.
p(x = j ;!) = “j
where ! is a distribution over the k elements.
Size of ! is independent of the size of the training set, so it’s parametric.
Parametric methods: generative models
As usual, we can apply MLE to learn the parameters !:
ln p(xi ;!)
For some cases this is intractable and we can use algorithms such as EM to approximately solve the
MLE problem (e.g. GMMs).
For some other cases this admits a simple closed-form solution (e.g. multinomial).
Parametric methods: estimation
MLE for multinomials
The log-likelihood is
ln p(x = xi ;!) =
1(xi = j) ln !j
where zj = |{i : xi = j}| is the number of examples with value j.
The solution is simply
i.e. the fraction of examples with value j. (See HW4 Q2.1)
Density estimation
• Introduction
• Parametric methods
• Non-parametric methods
Can we estimate without assuming a fixed generative model?
Kernel density estimation (KDE) provides a solution.
• the approach is nonparametric: it keeps the entire training set
• we focus on the one-dimensional (continuous) case
Nonparametric methods
• Construct something similar to a histogram:
• For each data point, create a “bump” (via a Kernel)
• Sum up or average all the bumps
High-level idea
Kernel needs to satisfy:
• symmetry: K(u) = K(!u)
K(u)du = 1, makes sure p is a density function.
KDE with a kernel K: R ! R:
e.g. K(u) = 1!
2 , the standard Gaussian density
Different kernels
I[|u| ” 1]
max{1# u2, 0}
If K(u) is a kernel, then for any h > 0
(stretching the kernel)
can be used as a kernel too (verify the two properties yourself)
So general KDE is determined by both the kernel K and the bandwidth h
Kh (x! xi) =
• xi controls the center of each bump
• h controls the width/variance of the bumps
Gray curve is ground-truth
• Red: h = 0.05
• Black: h = 0.337
• Green: h = 2
Larger ℎ means larger variance and smoother density
Naive Bayes
Image source
A simplistic taxonomy of ML
Supervised learning:
Aim to predict
outputs of future
datapoints
Unsupervised
Aim to discover
hidden patterns and
explore data
Reinforcement
Aim to make
sequential decisions
Naïve Bayes
• Motivation & setup
• Prediction with Naïve Bayes, and some connections
Suppose (x, y) is drawn from a joint distribution p. The Bayes optimal classifier is
f!(x) = argmax
i.e. predict the class with the largest conditional probability.
p is of course unknown, but we can estimate it, which is exactly a density estimation problem!
Bayes optimal classifier
How to estimate a joint distribution? Observe we always have
p(x, y) = p(y)p(x | y)
We know how to estimate p(y) by now.
To estimate p(x | y = c) for some c ! [C], we are doing density estimation using data {xi : yi = c}.
This is not a 1D problem in general.
Estimation
Naive Bayes assumption: conditioning on a label, features are independent, which means
p(x | y = c) =
p(xj | y = c)
Now for each j and c we have a simple 1D density estimation problem!
Is this a reasonable assumption? Sometimes yes, e.g.
• use x = (Height, Vocabulary) to predict y = Age
• Height and Vocabulary are dependent
• but conditioned on Age, they are independent!
More often this assumption is unrealistic and “naive”, but still Naive Bayes can work very well
even if the assumption is wrong.
A naïve assumption
Height: !3’, 3’-4’, 4’-5’, 5’-6’, “6’
Vocabulary: !5K, 5K-10K, 10K-15K, 15K-20K, “20K
Age: !5, 5-10, 10-15, 15-20, 20-25, “25
MLE estimation: e.g.
p(Age = 10-15) =
#examples with age 10-15
p(Height = 5’-6’ | Age = 10-15)
#examples with height 5’-6’ and age 10-15
#examples with age 10-15
Example: Discrete features
For a label c ! [C],
p(y = c) =
|{i : yi = c}|
For each possible value ! of a discrete feature j,
p(xj = ! | y = c) =
|{i : xi,j = !, yi = c}|
|{i : yi = c}|
Discrete features: More formally
If the feature is continuous, we can do
• parametric estimation, e.g. via a Gaussian
p(xj = x | y = c) =
(x” µc,j)2
where µc,j and ”
c,j are the empirical mean and variance of feature j among all examples with
• or nonparametric estimation, e.g. via a Kernel K and bandwidth h:
p(xj = x | y = c) =
|{i : yi = c}|
Kh(x” xi,j)
Continuous features
Naïve Bayes
• Motivation & setup
• Prediction with Naïve Bayes, and some connections
After learning the model
p(x, y) = p(y)
the prediction for a new example x is
p(y = c | x) = argmax
p(x, y = c)
p(xj | y = c)
#ln p(y = c) +
ln p(xj | y = c)
Naïve Bayes: Prediction
For discrete features, plugging in previous MLE estimation gives
p(y = c | x)
“ln p(y = c) +
ln p(xj | y = c)
“ln |{i : yi = c}|+
|{i : xi,j = xj , yi = c}|
|{i : yi = c}|
Example: Discrete features
Observe again for the case of continuous features with a Gaussian model, if we fix the variance for
each feature to be ! (i.e. not a parameter of the model any more), then the prediction becomes
p(y = c | x)
“ln |{i : yi = c}|!
(xj ! µc,j)
“ln |{i : yi = c}|!
‘ = argmax
cx (linear classifier!)
where we denote wc0 = ln |{i : yi = c}|!
What is Naïve Bayes learning?
Moreover, by a similar calculation you can verify
p(y = c | x) ! ew
This the softmax function, the same model we used for the probabilistic interpretation of logistic
regression!
So what is different then? They learn the parameters in different ways:
• both via MLE, one on p(y = c | x), the other on p(x, y)
• solutions are different: logistic regression has no closed-form, naive Bayes admits a simple
closed-form
Connection to logistic regression
Logistic regression Naive Bayes
Model conditional p(y | x) joint p(x, y)
Learning MLE (can also be viewed as minimizing logistic loss) MLE
Accuracy usually better for large n usually better for small n
Connection to logistic regression
Multiarmed bandits
A simplistic taxonomy of ML
Supervised learning:
Aim to predict
outputs of future
datapoints
Unsupervised
Aim to discover
hidden patterns and
explore data
Reinforcement
Aim to make
sequential decisions
• Motivation & setup
• Exploration vs. Exploitation
Multi-armed bandits
Problems we have discussed so far:
• start with a fixed training dataset
• learn a predictor from the data or discover some patterns in the data
But many real-life problems are about learning continuously:
• make a prediction/decision
• receive some feedback
Broadly, these are called online decision making problems.
Decision making
Amazon/Netflix/MSN recommendation systems:
• a user visits the website
• the system recommends some products/movies/news stories
• the system observes whether the user clicks on the recommendation
Playing games (Go/Atari/StarCraft/…) or controlling robots:
• make a move
• receive some reward (e.g. score a point) or loss (e.g. fall down)
• make another move…
Imagine going to a casino to play a slot machine
• it robs you, like a “bandit” with a single arm
Of course there are many slot machines in the casino
• like a bandit with multiple arms (hence the name)
• if I can play 10 times, which machines should I play?
Multiarmed bandits: Motivation
This simple model and its variants capture many real-life applications:
• recommendation systems, each product/movie/news story is an arm
(Microsoft MSN employs a variant of bandit algorithm)
• game playing, each possible move is an arm
(AlphaGo has a bandit algorithm as one of the components)
Applications
Applications
From a thesis defense yesterday…
There are K arms (actions/choices/…)
The problem proceeds in rounds between the environment and a learner: for each time t = 1, . . . , T
• the environment decides the reward for each arm rt,1, . . . , rt,K
• the learner picks an arm at ! [K]
• the learner observes the reward for arm at, i.e., rt,at
Importantly, learner does not observe rewards for arms not selected!
This kind of limited feedback is usually referred to as bandit feedback
Formal setup
Evaluating performance
What should be the goal here?
Maximizing total rewards
t=1 rt,at seems natural.
But the absolute value of rewards is not meaningful, instead we should compare it to some bench-
mark. A classic benchmark is
i.e. the largest reward one can achieve by always playing a fixed arm
So we want to minimize
This is called the regret: how much I regret not sticking with the best fixed arm in hindsight?
How are the rewards generated by the environments?
• they could be generated via some fixed distribution
• they could be generated via some changing distribution
• they could be generated even completely arbitrarily/adversarially
We focus on a simple setting:
• rewards of arm a are i.i.d. samples of Ber(µa), that is, rt,a is 1 with prob. µa, and 0 with prob.
1! µa, independent of anything else.
• each arm has a different mean (µ1, . . . , µK); the problem is essentially about finding the best
arm argmaxa µa as quickly as possible
Environments
Let µ̂t,a be the empirical mean of arm a up to time t:
I[a! == a]
is the number of times we have picked arm a.
Concentration: µ̂t,a should be close to µa if nt,a is large
Empirical means
Multi-armed bandits
• Motivation & setup
• Exploration vs. Exploitation
Pick each arm once for the first K rounds.
For t = K + 1, . . . , T , pick at = argmaxa µ̂t!1,a.
What’s wrong with this greedy algorithm?
Consider the following example:
• K = 2, µ1 = 0.6, µ2 = 0.5 (so arm 1 is the best)
• suppose the algorithm first picks arm 1 and sees reward 0, then picks arm 2 and sees reward 1
(this happens with decent probability)
• the algorithm will never pick arm 1 again!
Exploitation only
All bandit problems face the same dilemma:
Exploitation vs. Exploration trade-off
• on one hand we want to exploit the arms that we think are good
• on the other hand we need to explore all arms often enough in order to figure out which one
• so each time we need to ask: do I explore or exploit? and how?
We next discuss three ways to trade off exploration and exploitation for our simple multi-armed
bandit setting.
The key challenge
Explore–then–Exploit:
Input: a parameter T0 ! [T ]
Exploration phase: for the first T0 rounds, pick each arm for T0/K
Exploitation phase: for the remaining T ” T0 rounds, stick with the
empirically best arm argmaxa µ̂T0,a
Parameter T0 clearly controls the exploration/exploitation trade-off
A natural first attempt
It’s pretty reasonable, but the disadvantages are also clear:
• not clear how to tune the hyperparameter T0
• in the exploration phase, even if an arm is clearly worse than others based on a few pulls, it’s
still pulled T0/K times
• clearly it won’t work if the environment is changing
Explore-then-Exploit: Issues
!-Greedy Pick each arm once for the first K rounds.
For t = K + 1, . . . , T ,
• with probability !, explore: pick an arm uniformly at random
• with probability 1! !, exploit: pick at = argmaxa µ̂t!1,a
A slightly better algorithm
• always exploring and exploiting
• applicable to many other problems
• first thing to try usually
• need to tune !
• same uniform exploration
Is there a more adaptive way to explore?
A simple modification of “Greedy” leads to the well-known:
Upper Confidence Bound (UCB) algorithm
For t = 1, . . . , T , pick at = argmaxa UCBt,a where
UCBt,a := µ̂t!1,a + 2
• the first term in UCBt,a represents exploitation, while the second (bonus) term represents
exploration
• the bonus term is large if the arm is not pulled often enough, which encourages exploration
(adaptive due to the first term)
• a parameter-free algorithm, and it enjoys optimal regret!
More adaptive exploration
Upper confidence bound
Why is it called upper confidence bound?
One can prove that with high probability,
µa ! UCBt,a
so UCBt,a is indeed an upper bound on the true mean.
Another way to interpret UCB, “optimism in face of uncertainty”:
• true environment (best mean) is unknown due to randomness (uncertainty)
• have an upper bound (optimistic guess) on the expected reward of each environment, and pick
best one according to upper bound (optimism)
This principle is useful for many other bandit problems.
Multi-armed bandit is among the simplest decision making problems with limited feedback.
Often, it can be too simple to capture real-life problems. One important aspect it fails to capture is
the “state” of the learning agent, which has impacts on the reward of each action.
• e.g. for Atari games, after making one move, the agent moves to a different state, with possible
different rewards for each action
There are many other techniques and models in reinforcement learning (RL) which can deal with
this issue.
Limitations of multi-armed bandits
Responsible ML
Machine Learning can be brittle
The Blind Men and the Elephant
It was six men of Indostan
To learning much inclined,
Who went to see the Elephant
(Though all of them were blind),
That each by observation
Might satisfy his mind.
The First approached the Elephant,
And happening to fall
Against his broad and sturdy side,
At once began to bawl:
“God bless me! but the Elephant
Is very like a WALL!”
Responsible ML
• ML models can be very sensitive to changes in the data distribution
• ML models can be biased against certain sub-populations
• Conclusion
ML models can be very sensitive to changes in
the data distribution
You saw a small example of this in the HW3 Bonus question:
ML models can latch onto
spurious features to make predictions
LandbirdWaterbird vs.
Consider the following task:
Images courtesy of: University of Hawaii at Manoa, (Flickr)
LandbirdsWaterbirds vs.
Most images of waterbirds are in water,
and landbirds are on land
ML models can latch onto
spurious features to make predictions
LandbirdsWaterbirds vs.
But this isn’t always true!
ML models can latch onto
spurious features to make predictions
LandbirdsWaterbirds vs.
This is known as failure to distributional shifts
ML models can latch onto
spurious features to make predictions
Source: Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing
radiologists, Rajpurkar et al. 2018
A real-world example
CNN models have obtained impressive results for diagnosing X-rays
E.g. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization
of Common Thorax Diseases, Wang et a;. 2017
Source: Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-
sectional study, Zech et al. 2018
CNN to predict hospital system detects both general and specific image features.
(A) We obtained activation heatmaps from our trained model and averaged over a sample of images to reveal which subregions
tended to contribute to a hospital system classification decision. Many different subregions strongly predicted the correct
hospital system, with especially strong contributions from image corners. (B-C) On individual images, which have been
normalized to highlight only the most influential regions and not all those that contributed to a positive classification, we note that
the CNN has learned to detect a metal token that radiology technicians place on the patient in the corner of the image field of
view at the time they capture the image. When these strong features are correlated with disease prevalence, models can
leverage them to indirectly predict disease.
But the models may not generalize as well to data from new hospitals because they can
learn to pickup on spurious correlations such as the type of scanner and marks used by
technicians in specific hospitals!
Responsible ML
• ML models can be very sensitive to changes in the data distribution
• ML models can be biased against certain sub-populations
• Conclusion
ML models can be biased against certain sub-
populations
You saw a small example of this in the HW4 word embedding question:
The Gendershades Project
How well do facial recognition tools
perform on various demographics?
http://gendershades.org/
Ans: Not very well
Ans: Not very well
The COMPAS software
https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
Link to article Link to article
Link to article Link to article
Responsible ML
• ML models can be very sensitive to changes in the data distribution
• ML models can be biased against certain sub-populations
• Conclusion
Going back to the beginning of Lecture 1..
1. Examine your data
2. Examine your model
ML/AI can be very powerful, but should be used responsibly
Lots of great resources to
learn more about best
AI/ML practices.
https://ai.google/responsibil
ities/responsible-ai-
practices/
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com