
CSCI 567: Machine Learning


Lecture 11, Nov 17

Administrivia

• Quiz 2 is on December 1
• More info will be posted on Ed later

• Today’s plan:
• Density estimation & Naïve Bayes
• Multi-armed bandits
• Responsible ML

Density estimation

Density estimation

• Introduction

• Parametric methods

• Non-parametric methods

With clustering using GMMs, our high-level goal was the following:

Given a training set x_1, . . . , x_n, estimate a density function p that could have generated this dataset (via x_i ~ p, i.i.d.).

This is a special case of the general problem of density estimation, an important unsupervised learning problem.

Density estimation is useful for many downstream applications

• we have seen clustering already, will see more today

• these applications also provide a way to measure quality of the density estimator

Introduction

Density estimation

• Introduction

• Parametric methods

• Non-parametric methods

Parametric estimation assumes a generative model parametrized by θ:

p(x) = p(x; θ)

• GMM: p(x; θ) = Σ_{j=1}^k ω_j N(x | μ_j, Σ_j), where θ = {ω_j, μ_j, Σ_j}

• Multinomial: a discrete variable with values in {1, 2, . . . , k} s.t.

p(x = j; θ) = θ_j

where θ is a distribution over the k elements.

The size of θ is independent of the size of the training set, so the method is parametric.

Parametric methods: generative models

As usual, we can apply MLE to learn the parameters θ:

argmax_θ Σ_{i=1}^n ln p(x_i; θ)

In some cases this is intractable, and we can use algorithms such as EM to approximately solve the MLE problem (e.g. GMMs).

In other cases it admits a simple closed-form solution (e.g. the multinomial).

Parametric methods: estimation

MLE for multinomials

The log-likelihood is

Σ_{i=1}^n ln p(x = x_i; θ) = Σ_{i=1}^n Σ_{j=1}^k 1(x_i = j) ln θ_j = Σ_{j=1}^k z_j ln θ_j

where z_j = |{i : x_i = j}| is the number of examples with value j.

The solution is simply

θ_j = z_j / n

i.e. the fraction of examples with value j. (See HW4 Q2.1)
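As a quick illustration, here is a minimal NumPy sketch of this counts-to-frequencies computation (the function and variable names are mine, not from the course):

```python
import numpy as np

def multinomial_mle(x, k):
    """MLE for a multinomial: theta_j = (# examples with value j) / n."""
    x = np.asarray(x)                   # values encoded as {0, 1, ..., k-1}
    z = np.bincount(x, minlength=k)     # z_j = |{i : x_i = j}|
    return z / len(x)                   # fraction of examples with value j

theta_hat = multinomial_mle([0, 2, 2, 1, 2], k=3)
print(theta_hat)                        # [0.2 0.2 0.6]
```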

Density estimation

• Introduction

• Parametric methods

• Non-parametric methods

Can we estimate without assuming a fixed generative model?

Kernel density estimation (KDE) provides a solution.

• the approach is nonparametric: it keeps the entire training set

• we focus on the one-dimensional (continuous) case

Nonparametric methods

• Construct something similar to a histogram:
• For each data point, create a “bump” (via a Kernel)
• Sum up or average all the bumps

High-level idea

The kernel K: ℝ → ℝ needs to satisfy two properties:

• symmetry: K(u) = K(−u)

• non-negativity and normalization: K(u) ≥ 0 and ∫ K(u) du = 1, which makes sure p is a density function.

KDE with a kernel K:

p(x) = (1/n) Σ_{i=1}^n K(x − x_i)

e.g. K(u) = (1/√(2π)) e^{−u²/2}, the standard Gaussian density

Different kernels:

• Gaussian: K(u) = (1/√(2π)) e^{−u²/2}

• Boxcar: K(u) = (1/2) I[|u| ≤ 1]

• Epanechnikov: K(u) = (3/4) max{1 − u², 0}

If K(u) is a kernel, then for any h > 0,

K_h(u) = (1/h) K(u/h)   (stretching the kernel)

can be used as a kernel too (verify the two properties yourself).

So general KDE is determined by both the kernel K and the bandwidth h:

p(x) = (1/n) Σ_{i=1}^n K_h(x − x_i) = (1/(nh)) Σ_{i=1}^n K((x − x_i)/h)

• x_i controls the center of each bump

• h controls the width/variance of the bumps

(Figure: KDE estimates with different bandwidths. The gray curve is the ground-truth density; red is h = 0.05, black is h = 0.337, green is h = 2.)

Larger h means larger variance and a smoother density.
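A minimal sketch of 1D Gaussian KDE, matching the formula above (names are mine; the bandwidth value is just the one from the figure):

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian density: symmetric, non-negative, integrates to 1."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x, data, h):
    """p(x) = (1/(n*h)) * sum_i K((x - x_i)/h): an average of bumps."""
    u = (x - data[:, None]) / h           # shape (n, len(x)) by broadcasting
    return gaussian_kernel(u).mean(axis=0) / h

data = np.random.randn(100)               # toy 1D training set
xs = np.linspace(-4, 4, 200)
density = kde(xs, data, h=0.337)
```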

Naive Bayes


A simplistic taxonomy of ML

• Supervised learning: aim to predict outputs of future datapoints

• Unsupervised learning: aim to discover hidden patterns and explore data

• Reinforcement learning: aim to make sequential decisions

Naïve Bayes

• Motivation & setup

• Prediction with Naïve Bayes, and some connections

Suppose (x, y) is drawn from a joint distribution p. The Bayes optimal classifier is

f*(x) = argmax_{c ∈ [C]} p(y = c | x)

i.e. predict the class with the largest conditional probability.

p is of course unknown, but we can estimate it, which is exactly a density estimation problem!

Bayes optimal classifier

How to estimate a joint distribution? Observe we always have

p(x, y) = p(y)p(x | y)

We know how to estimate p(y) by now.

To estimate p(x | y = c) for some c ∈ [C], we do density estimation using the data {x_i : y_i = c}.

This is not a 1D problem in general.

Estimation

Naive Bayes assumption: conditioned on the label, the features are independent, which means

p(x | y = c) = ∏_{j=1}^d p(x_j | y = c)

Now for each feature j and class c we have a simple 1D density estimation problem!

Is this a reasonable assumption? Sometimes yes, e.g.

• use x = (Height, Vocabulary) to predict y = Age

• Height and Vocabulary are dependent

• but conditioned on Age, they are independent!

More often this assumption is unrealistic and "naive", but Naive Bayes can still work very well even when the assumption is wrong.

A naïve assumption

Height: ≤3', 3'-4', 4'-5', 5'-6', ≥6'

Vocabulary: ≤5K, 5K-10K, 10K-15K, 15K-20K, ≥20K

Age: ≤5, 5-10, 10-15, 15-20, 20-25, ≥25

MLE estimation, e.g.:

p(Age = 10-15) = (#examples with age 10-15) / n

p(Height = 5'-6' | Age = 10-15) = (#examples with height 5'-6' and age 10-15) / (#examples with age 10-15)

Example: Discrete features

For a label c ∈ [C],

p(y = c) = |{i : y_i = c}| / n

For each possible value ℓ of a discrete feature j,

p(x_j = ℓ | y = c) = |{i : x_{i,j} = ℓ, y_i = c}| / |{i : y_i = c}|

Discrete features: More formally
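A minimal sketch of these counting estimates, assuming discrete features encoded as integers (all names are mine):

```python
import numpy as np
from collections import defaultdict

def fit_discrete_nb(X, y):
    """Count-based MLE for p(y = c) and p(x_j = v | y = c)."""
    n, d = X.shape
    prior = {c: np.mean(y == c) for c in np.unique(y)}  # |{i: y_i = c}| / n
    cond = defaultdict(dict)                            # cond[c][j][v]
    for c in prior:
        Xc = X[y == c]                                  # examples with label c
        for j in range(d):
            vals, counts = np.unique(Xc[:, j], return_counts=True)
            cond[c][j] = dict(zip(vals, counts / len(Xc)))
    return prior, cond
```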

If the feature is continuous, we can do

• parametric estimation, e.g. via a Gaussian:

p(x_j = x | y = c) = (1/√(2πσ²_{c,j})) exp(−(x − μ_{c,j})² / (2σ²_{c,j}))

where μ_{c,j} and σ²_{c,j} are the empirical mean and variance of feature j among all examples with label c;

• or nonparametric estimation, e.g. via a kernel K and bandwidth h:

p(x_j = x | y = c) = (1/|{i : y_i = c}|) Σ_{i : y_i = c} K_h(x − x_{i,j})
Continuous features
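A sketch of the Gaussian variant, fitting per-class, per-feature empirical means and variances (hypothetical helpers, not code from the course):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class prior plus per-class, per-feature mean and variance."""
    classes = np.unique(y)
    prior = {c: np.mean(y == c) for c in classes}
    mu = {c: X[y == c].mean(axis=0) for c in classes}   # mu_{c,j}
    var = {c: X[y == c].var(axis=0) for c in classes}   # sigma^2_{c,j}
    return prior, mu, var

def log_gaussian(x, mu, var):
    """ln N(x | mu, var), elementwise over features."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
```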

Naïve Bayes

• Motivation & setup

• Prediction with Naïve Bayes, and some connections

After learning the model

p(x, y) = p(y) ∏_{j=1}^d p(x_j | y)

the prediction for a new example x is

argmax_c p(y = c | x) = argmax_c p(x, y = c)

= argmax_c p(y = c) ∏_{j=1}^d p(x_j | y = c)

= argmax_c ( ln p(y = c) + Σ_{j=1}^d ln p(x_j | y = c) )
Naïve Bayes: Prediction

For discrete features, plugging in the previous MLE estimates gives

argmax_c p(y = c | x) = argmax_c ( ln p(y = c) + Σ_j ln p(x_j | y = c) )

= argmax_c ( ln |{i : y_i = c}| + Σ_j ln ( |{i : x_{i,j} = x_j, y_i = c}| / |{i : y_i = c}| ) )

(the ln n term from the prior is dropped since it is the same for every class c)
Example: Discrete features
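A sketch of prediction in log space using the earlier fit_discrete_nb estimates (eps is a hypothetical floor for feature values never seen with class c, where the plain MLE would assign probability zero):

```python
import numpy as np

def predict_nb(x, prior, cond, eps=1e-12):
    """argmax_c [ ln p(y = c) + sum_j ln p(x_j | y = c) ]."""
    best_c, best_score = None, -np.inf
    for c in prior:
        score = np.log(prior[c])
        for j, v in enumerate(x):
            score += np.log(cond[c][j].get(v, eps))  # floor for unseen values
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```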

Observe again, for the case of continuous features with a Gaussian model: if we fix the variance of each feature to be σ² (i.e. it is not a parameter of the model any more), then the prediction becomes

argmax_c p(y = c | x) = argmax_c ( ln |{i : y_i = c}| − Σ_j (x_j − μ_{c,j})² / (2σ²) )

= argmax_c ( w_{c0} + w_c^T x )   (a linear classifier!)

where we denote w_{c0} = ln |{i : y_i = c}| − Σ_j μ²_{c,j} / (2σ²) and w_{c,j} = μ_{c,j} / σ² (the x_j² term is dropped since it does not depend on c).

What is Naïve Bayes learning?

Moreover, by a similar calculation you can verify that

p(y = c | x) ∝ e^{w_{c0} + w_c^T x}

This is the softmax function, the same model we used for the probabilistic interpretation of logistic regression!

So what is different then? They learn the parameters in different ways:

• both via MLE, one on p(y = c | x), the other on p(x, y)

• solutions are different: logistic regression has no closed-form, naive Bayes admits a simple
closed-form

Connection to logistic regression

           | Logistic regression                                  | Naive Bayes
Model      | conditional p(y | x)                                 | joint p(x, y)
Learning   | MLE (can also be viewed as minimizing logistic loss) | MLE
Accuracy   | usually better for large n                           | usually better for small n

Connection to logistic regression

Multi-armed bandits

A simplistic taxonomy of ML

• Supervised learning: aim to predict outputs of future datapoints

• Unsupervised learning: aim to discover hidden patterns and explore data

• Reinforcement learning: aim to make sequential decisions

• Motivation & setup

• Exploration vs. Exploitation

Multi-armed bandits

Problems we have discussed so far:

• start with a fixed training dataset

• learn a predictor from the data or discover some patterns in the data

But many real-life problems are about learning continuously:

• make a prediction/decision

• receive some feedback

Broadly, these are called online decision making problems.

Decision making

Amazon/Netflix/MSN recommendation systems:

• a user visits the website

• the system recommends some products/movies/news stories

• the system observes whether the user clicks on the recommendation

Playing games (Go/Atari/StarCraft/…) or controlling robots:

• make a move

• receive some reward (e.g. score a point) or loss (e.g. fall down)

• make another move…

Imagine going to a casino to play a slot machine

• it robs you, like a “bandit” with a single arm

Of course there are many slot machines in the casino

• like a bandit with multiple arms (hence the name)

• if I can play 10 times, which machines should I play?

Multi-armed bandits: Motivation

This simple model and its variants capture many real-life applications:

• recommendation systems, each product/movie/news story is an arm
(Microsoft MSN employs a variant of bandit algorithm)

• game playing, each possible move is an arm
(AlphaGo has a bandit algorithm as one of the components)

Applications


There are K arms (actions/choices/…)

The problem proceeds in rounds between the environment and a learner: for each time t = 1, . . . , T

• the environment decides the reward of each arm: r_{t,1}, . . . , r_{t,K}

• the learner picks an arm a_t ∈ [K]

• the learner observes the reward of arm a_t, i.e., r_{t,a_t}

Importantly, the learner does not observe the rewards of arms not selected!

This kind of limited feedback is usually referred to as bandit feedback

Formal setup

Evaluating performance

What should be the goal here?

Maximizing the total reward Σ_{t=1}^T r_{t,a_t} seems natural.

But the absolute value of the reward is not meaningful; instead we should compare it to some benchmark. A classic benchmark is

max_{a ∈ [K]} Σ_{t=1}^T r_{t,a}

i.e. the largest reward one can achieve by always playing a fixed arm.

So we want to minimize

max_{a ∈ [K]} Σ_{t=1}^T r_{t,a} − Σ_{t=1}^T r_{t,a_t}

This is called the regret: how much do I regret not sticking with the best fixed arm in hindsight?
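To make the definition concrete, here is a tiny sketch that computes the regret of a sequence of pulls from a full reward matrix (only possible in simulation, since a real learner never observes unchosen rewards):

```python
import numpy as np

def regret(rewards, actions):
    """rewards: (T, K) matrix of r_{t,a}; actions: length-T array of a_t."""
    T = len(actions)
    best_fixed = rewards.sum(axis=0).max()         # max_a sum_t r_{t,a}
    earned = rewards[np.arange(T), actions].sum()  # sum_t r_{t,a_t}
    return best_fixed - earned
```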

How are the rewards generated by the environment?

• they could be generated via some fixed distribution

• they could be generated via some changing distribution

• they could even be generated completely arbitrarily/adversarially

We focus on a simple setting:

• the rewards of arm a are i.i.d. samples of Ber(μ_a); that is, r_{t,a} is 1 with probability μ_a and 0 with probability 1 − μ_a, independent of everything else

• each arm has a different mean (μ_1, . . . , μ_K); the problem is essentially about finding the best arm argmax_a μ_a as quickly as possible

Environments

Let μ̂_{t,a} be the empirical mean of arm a up to time t:

μ̂_{t,a} = (1/n_{t,a}) Σ_{τ ≤ t : a_τ = a} r_{τ,a},   where n_{t,a} = Σ_{τ=1}^t I[a_τ = a]

is the number of times we have picked arm a so far.

Concentration: μ̂_{t,a} should be close to μ_a if n_{t,a} is large.
Empirical means

Multi-armed bandits

• Motivation & setup

• Exploration vs. Exploitation

Pick each arm once for the first K rounds.

For t = K + 1, . . . , T , pick a_t = argmax_a μ̂_{t−1,a}.

What’s wrong with this greedy algorithm?

Consider the following example:

• K = 2, µ1 = 0.6, µ2 = 0.5 (so arm 1 is the best)

• suppose the algorithm first picks arm 1 and sees reward 0, then picks arm 2 and sees reward 1
(this happens with decent probability)

• the algorithm will never pick arm 1 again!

Exploitation only

All bandit problems face the same dilemma:

Exploitation vs. Exploration trade-off

• on one hand we want to exploit the arms that we think are good

• on the other hand we need to explore all arms often enough in order to figure out which one is the best

• so each time we need to ask: do I explore or exploit? and how?

We next discuss three ways to trade off exploration and exploitation for our simple multi-armed
bandit setting.

The key challenge

Explore-then-Exploit:

Input: a parameter T_0 ∈ [T ]

Exploration phase: for the first T_0 rounds, pick each arm T_0/K times

Exploitation phase: for the remaining T − T_0 rounds, stick with the empirically best arm argmax_a μ̂_{T_0,a}

Parameter T0 clearly controls the exploration/exploitation trade-off

A natural first attempt

It’s pretty reasonable, but the disadvantages are also clear:

• not clear how to tune the hyperparameter T0

• in the exploration phase, even if an arm is clearly worse than others based on a few pulls, it’s
still pulled T0/K times

• clearly it won’t work if the environment is changing

Explore-then-Exploit: Issues

ε-Greedy: Pick each arm once for the first K rounds.

For t = K + 1, . . . , T ,

• with probability ε, explore: pick an arm uniformly at random

• with probability 1 − ε, exploit: pick a_t = argmax_a μ̂_{t−1,a}

A slightly better algorithm

Pros:
• always both exploring and exploiting
• applicable to many other problems
• usually the first thing to try

Cons:
• need to tune ε
• exploration is still uniform across arms
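A minimal simulation sketch of ε-Greedy on Bernoulli arms (the names and toy means are mine, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.6, 0.5])                  # true (unknown) Bernoulli means
K, T, eps = len(mu), 10_000, 0.1
counts, sums = np.zeros(K), np.zeros(K)    # n_{t,a} and running reward sums

for t in range(T):
    if t < K:                              # pick each arm once first
        a = t
    elif rng.random() < eps:               # explore: uniformly random arm
        a = rng.integers(K)
    else:                                  # exploit: empirically best arm
        a = int(np.argmax(sums / counts))
    r = float(rng.random() < mu[a])        # Bernoulli(mu_a) reward
    counts[a] += 1
    sums[a] += r

print(sums / counts)                       # empirical means, close to mu
```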

Is there a more adaptive way to explore?

A simple modification of Greedy leads to the well-known

Upper Confidence Bound (UCB) algorithm:

For t = 1, . . . , T , pick a_t = argmax_a UCB_{t,a}, where

UCB_{t,a} := μ̂_{t−1,a} + 2√(ln T / n_{t−1,a})

• the first term in UCB_{t,a} represents exploitation, while the second (bonus) term represents exploration

• the bonus term is large if the arm has not been pulled often, which encourages exploration (adaptive due to the first term)

• it is a parameter-free algorithm, and it enjoys optimal regret!

More adaptive exploration
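A matching simulation sketch of UCB (the exact form of the bonus follows the expression above; treat the constant as an assumption of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.6, 0.5])                  # true (unknown) Bernoulli means
K, T = len(mu), 10_000
counts, sums = np.zeros(K), np.zeros(K)

for t in range(T):
    if t < K:                              # pull each arm once first
        a = t
    else:                                  # optimism in face of uncertainty
        ucb = sums / counts + 2 * np.sqrt(np.log(T) / counts)
        a = int(np.argmax(ucb))
    r = float(rng.random() < mu[a])        # Bernoulli(mu_a) reward
    counts[a] += 1
    sums[a] += r

print(counts)                              # most pulls go to the best arm
```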

Upper confidence bound
Why is it called an upper confidence bound?

One can prove that, with high probability,

μ_a ≤ UCB_{t,a}

so UCB_{t,a} is indeed an upper bound on the true mean.

Another way to interpret UCB, “optimism in face of uncertainty”:

• true environment (best mean) is unknown due to randomness (uncertainty)

• maintain an upper bound (an optimistic guess) on the expected reward in each environment, and pick the best one according to this upper bound (optimism)

This principle is useful for many other bandit problems.

Multi-armed bandit is among the simplest decision making problems with limited feedback.

Often, it can be too simple to capture real-life problems. One important aspect it fails to capture is the "state" of the learning agent, which affects the reward of each action.

• e.g. for Atari games, after making one move, the agent moves to a different state, with possibly different rewards for each action

There are many other techniques and models in reinforcement learning (RL) which can deal with
this issue.

Limitations of multi-armed bandits

Responsible ML

Machine Learning can be brittle

The Blind Men and the Elephant

It was six men of Indostan
To learning much inclined,

Who went to see the Elephant
(Though all of them were blind),

That each by observation
Might satisfy his mind.

The First approached the Elephant,
And happening to fall

Against his broad and sturdy side,
At once began to bawl:

“God bless me! but the Elephant
Is very like a WALL!”

Responsible ML

• ML models can be very sensitive to changes in the data distribution

• ML models can be biased against certain sub-populations

• Conclusion

ML models can be very sensitive to changes in
the data distribution

You saw a small example of this in the HW3 Bonus question:

ML models can latch onto spurious features to make predictions

Consider the following task: Waterbirds vs. Landbirds

(Images courtesy of: University of Hawaii at Manoa, Flickr)

Most images of waterbirds are in water, and landbirds are on land.

But this isn't always true!

This is known as failure under distribution shift.

A real-world example: CNN models have obtained impressive results for diagnosing X-rays

(E.g. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases, Wang et al. 2017; and Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists, Rajpurkar et al. 2018)

Source: Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-
sectional study, Zech et al. 2018

CNN to predict hospital system detects both general and specific image features.
(A) We obtained activation heatmaps from our trained model and averaged over a sample of images to reveal which subregions
tended to contribute to a hospital system classification decision. Many different subregions strongly predicted the correct
hospital system, with especially strong contributions from image corners. (B-C) On individual images, which have been
normalized to highlight only the most influential regions and not all those that contributed to a positive classification, we note that
the CNN has learned to detect a metal token that radiology technicians place on the patient in the corner of the image field of
view at the time they capture the image. When these strong features are correlated with disease prevalence, models can
leverage them to indirectly predict disease.

But the models may not generalize as well to data from new hospitals, because they can learn to pick up on spurious correlations such as the type of scanner and the marks used by technicians in specific hospitals!

Responsible ML

• ML models can be very sensitive to changes in the data distribution

• ML models can be biased against certain sub-populations

• Conclusion

ML models can be biased against certain sub-
populations

You saw a small example of this in the HW4 word embedding question:

The Gendershades Project

How well do facial recognition tools
perform on various demographics?

http://gendershades.org/

Ans: Not very well

The COMPAS software

https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing


Responsible ML

• ML models can be very sensitive to changes in the data distribution

• ML models can be biased against certain sub-populations

• Conclusion

Going back to the beginning of Lecture 1...

1. Examine your data
2. Examine your model

ML/AI can be very powerful, but should be used responsibly

Lots of great resources are available to learn more about best AI/ML practices, e.g.:

https://ai.google/responsibilities/responsible-ai-practices/
