
Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
January 26, 2015
Today:
• Bayes Classifiers
• Conditional Independence
• Naïve Bayes
Readings: Mitchell:
“Naïve Bayes and Logistic Regression”
(available on class website)

Two Principles for Estimating Parameters
• Maximum Likelihood Estimate (MLE): choose θ that maximizes probability of observed data
• Maximum a Posteriori (MAP) estimate: choose θ that is most probable given prior probability and the data
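Written as equations (using D for the observed data):
θ_MLE = arg max_θ P(D | θ)
θ_MAP = arg max_θ P(θ | D) = arg max_θ P(D | θ) P(θ)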

Maximum Likelihood Estimate
X ∈ {1, 0} with P(X=1) = θ, P(X=0) = 1 − θ (Bernoulli)
Given α1 observations with X=1 and α0 observations with X=0, the MLE is
θ̂_MLE = α1 / (α1 + α0)

Maximum A Posteriori (MAP) Estimate
X ∈ {1, 0} as before. With a Beta(β1, β0) prior over θ and the same counts α1, α0, the MAP estimate is
θ̂_MAP = (α1 + β1 − 1) / (α1 + β1 − 1 + α0 + β0 − 1)
i.e., the prior behaves like β1 − 1 additional “imaginary” observations of X=1 and β0 − 1 of X=0
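A minimal Python sketch of both estimators for a Bernoulli θ, assuming we simply count outcomes (the function names and the Beta(3,3) prior are illustrative, not from the lecture):

def mle_theta(n_ones, n_zeros):
    # MLE: fraction of observed outcomes that came up 1
    return n_ones / (n_ones + n_zeros)

def map_theta(n_ones, n_zeros, beta1=3, beta0=3):
    # MAP with a Beta(beta1, beta0) prior: the prior acts like
    # (beta1 - 1) imaginary ones and (beta0 - 1) imaginary zeros
    return (n_ones + beta1 - 1) / (n_ones + beta1 - 1 + n_zeros + beta0 - 1)

print(mle_theta(3, 1))   # 0.75  -- MLE can be extreme with little data
print(map_theta(3, 1))   # 0.625 -- prior pulls the estimate toward 0.5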

Let’s learn classifiers by learning P(Y|X)
Consider Y = Wealth, X = <Gender, HrsWorked>
Gender   HrsWorked   P(rich | G,HW)   P(poor | G,HW)
F        <40.5       .09              .91
F        >40.5       .21              .79
M        <40.5       .23              .77
M        >40.5       .38              .62

How many parameters must we estimate?
Suppose X = <X1, …, Xn> where the Xi and Y are boolean RV’s
To estimate P(Y | X1, X2, … Xn) we need one parameter for each of the 2^n possible settings of <X1, …, Xn> (since P(Y=0 | x) = 1 − P(Y=1 | x), one number per setting suffices): 2^n parameters in all
If we have 30 boolean Xi’s: P(Y | X1, X2, … X30) requires 2^30 ≈ 10^9 parameters

Bayes Rule
P(Y | X) = P(X | Y) P(Y) / P(X)
Which is shorthand for:
P(Y = yi | X = xj) = P(X = xj | Y = yi) P(Y = yi) / P(X = xj)
Equivalently:
P(Y = yi | X = xj) = P(X = xj | Y = yi) P(Y = yi) / Σ_k P(X = xj | Y = yk) P(Y = yk)
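A tiny numeric illustration of the rule in Python (the probabilities are made up, just to show the mechanics):

# Prior and class-conditional likelihoods (illustrative values)
p_y = {True: 0.3, False: 0.7}            # P(Y)
p_x_given_y = {True: 0.8, False: 0.1}    # P(X=1 | Y)

# Bayes rule: P(Y=True | X=1) = P(X=1|Y=True) P(Y=True) / sum_k P(X=1|Y=k) P(Y=k)
evidence = sum(p_x_given_y[y] * p_y[y] for y in p_y)
posterior_true = p_x_given_y[True] * p_y[True] / evidence
print(posterior_true)   # 0.24 / 0.31 ≈ 0.774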

Can we reduce params using Bayes Rule?
Suppose X = <X1, …, Xn> where the Xi and Y are boolean RV’s
How many parameters to define P(X1,… Xn | Y)?
How many parameters to define P(Y)?

Can we reduce params using Bayes Rule?
Suppose X = <X1, …, Xn> where the Xi and Y are boolean RV’s
To define P(X1, … Xn | Y) we need 2(2^n − 1) parameters (2^n − 1 for each value of Y); to define P(Y) we need just 1
So rewriting P(Y | X) via Bayes rule does not, by itself, reduce the number of parameters — for that we need the Naïve Bayes assumption (next slide)

Naïve Bayes
Naïve Bayes assumes
P(X1, …, Xn | Y) = Π_i P(Xi | Y)
i.e., that Xi and Xj are conditionally independent given Y, for all i ≠ j

Conditional Independence
Definition: X is conditionally independent of Y given Z, if the probability distribution governing X is independent of the value of Y, given the value of Z
Which we often write
P(X | Y, Z) = P(X | Z)
E.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Naïve Bayes uses the assumption that the Xi are conditionally independent, given Y.
E.g., P(X1, X2 | Y) = P(X1 | X2, Y) P(X2 | Y) = P(X1 | Y) P(X2 | Y)
Given this assumption, then in general:
P(X1, …, Xn | Y) = Π_i P(Xi | Y)
How many parameters to describe P(X1…Xn | Y)? P(Y)?
• Without conditional independence assumption: 2(2^n − 1) for P(X1…Xn | Y), plus 1 for P(Y)
• With conditional independence assumption: 2n for P(X1…Xn | Y), plus 1 for P(Y)
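A concrete comparison for n = 30 boolean features:
• Without the conditional independence assumption: 2(2^30 − 1) ≈ 2.1 × 10^9 parameters
• With the assumption: 2 · 30 = 60 parameters (plus 1 for P(Y) in both cases)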

Naïve Bayes in a Nutshell
Bayes rule:
P(Y = yk | X1, …, Xn) = P(Y = yk) P(X1, …, Xn | Y = yk) / Σ_j P(Y = yj) P(X1, …, Xn | Y = yj)
Assuming conditional independence among the Xi’s:
P(Y = yk | X1, …, Xn) = P(Y = yk) Π_i P(Xi | Y = yk) / Σ_j P(Y = yj) Π_i P(Xi | Y = yj)
So, to pick the most probable Y for Xnew = < X1, …, Xn >:
Ynew ← arg max_yk P(Y = yk) Π_i P(Xi_new | Y = yk)

Naïve Bayes Algorithm – discrete Xi
• Train Naïve Bayes (examples):
  for each* value yk
    estimate π_k ≡ P(Y = yk)
  for each* value xij of each attribute Xi
    estimate θ_ijk ≡ P(Xi = xij | Y = yk)
• Classify (Xnew):
  Ynew ← arg max_yk P(Y = yk) Π_i P(Xi_new | Y = yk)
* probabilities must sum to 1, so need estimate only n−1 of these…
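A minimal Python sketch of this train/classify loop for discrete features, using plain MLE counts (the helper names and toy data are illustrative, not from the lecture):

from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (x, y) pairs where x is a tuple of discrete feature values."""
    class_counts = Counter(y for _, y in examples)
    feat_counts = defaultdict(Counter)              # (i, yk) -> counts of values of Xi
    for x, y in examples:
        for i, xi in enumerate(x):
            feat_counts[(i, y)][xi] += 1
    n = len(examples)
    prior = {y: c / n for y, c in class_counts.items()}                 # pi_k = P(Y=yk)
    cond = {key: {v: c / sum(cnt.values()) for v, c in cnt.items()}     # theta_ijk = P(Xi=xij | Y=yk)
            for key, cnt in feat_counts.items()}
    return prior, cond

def classify_nb(prior, cond, x):
    # pick yk maximizing P(Y=yk) * prod_i P(Xi=xi | Y=yk); unseen values get probability 0 under MLE
    def score(y):
        p = prior[y]
        for i, xi in enumerate(x):
            p *= cond[(i, y)].get(xi, 0.0)
        return p
    return max(prior, key=score)

data = [((1, 0), 'rich'), ((1, 1), 'rich'), ((0, 0), 'poor'), ((0, 1), 'poor')]
prior, cond = train_nb(data)
print(classify_nb(prior, cond, (1, 1)))   # -> 'rich'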

Estimating Parameters: Y, Xi discrete-valued
Maximum likelihood estimates (MLE’s):
π̂_k = P̂(Y = yk) = #D{Y = yk} / |D|
θ̂_ijk = P̂(Xi = xij | Y = yk) = #D{Xi = xij AND Y = yk} / #D{Y = yk}
where #D{Y = yk} denotes the number of items in dataset D for which Y = yk

Example: Live in Sq Hill? P(S|G,D,M)
• S=1 iff live in Squirrel Hill
• D=1 iff Drive to CMU
• G=1 iff shop at SH Giant Eagle
• M=1 iff Rachel Maddow fan
What probability parameters must we estimate?

Example: Live in Sq Hill? P(S|G,D,M)
• S=1 iff live in Squirrel Hill
• D=1 iff Drive to CMU
• G=1 iff shop at SH Giant Eagle
• M=1 iff Rachel Maddow fan
P(S=1) :            P(S=0) :
P(D=1 | S=1) :      P(D=0 | S=1) :
P(D=1 | S=0) :      P(D=0 | S=0) :
P(G=1 | S=1) :      P(G=0 | S=1) :
P(G=1 | S=0) :      P(G=0 | S=0) :
P(M=1 | S=1) :      P(M=0 | S=1) :
P(M=1 | S=0) :      P(M=0 | S=0) :
(the right-hand column is redundant: each entry is 1 minus the corresponding left-hand entry, so only 7 parameters need be estimated)

Example: Live in Sq Hill? P(S|G,D,B)
• S=1 iff live in Squirrel Hill
• D=1 iff Drive or carpool to CMU
• G=1 iff shop at SH Giant Eagle
• B=1 iff Birthday is before July 1
What probability parameters must we estimate?

Example: Live in Sq Hill? P(S|G,D,B)
• S=1 iff live in Squirrel Hill
• D=1 iff Drive or Carpool to CMU
• G=1 iff shop at SH Giant Eagle
• B=1 iff Birthday is before July 1
P(S=1) :            P(S=0) :
P(D=1 | S=1) :      P(D=0 | S=1) :
P(D=1 | S=0) :      P(D=0 | S=0) :
P(G=1 | S=1) :      P(G=0 | S=1) :
P(G=1 | S=0) :      P(G=0 | S=0) :
P(B=1 | S=1) :      P(B=0 | S=1) :
P(B=1 | S=0) :      P(B=0 | S=0) :

Naïve Bayes: Subtlety #1
Often the Xi are not really conditionally independent
• We use Naïve Bayes in many cases anyway, and it often works pretty well
– often gives the right classification, even when not the right probability (see [Domingos & Pazzani, 1996])
• What is the effect on the estimated P(Y|X)?
  – Extreme case: what if we add two copies: Xi = Xk?

Extreme case: what if we add two copies: Xi = Xk?
Then P(Xi | Y) enters the product twice, so the classifier effectively uses P(Xi | Y)^2: the estimated posterior P(Y|X) is pushed toward 0 or 1 (over-confident), although the arg max — and hence the classification — is often unchanged.
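A small numeric sketch of that effect in Python (illustrative probabilities):

# One binary feature with P(X=1 | Y=1) = 0.8, P(X=1 | Y=0) = 0.3, uniform prior
p1, p0, prior = 0.8, 0.3, 0.5

def posterior(times_counted):
    # P(Y=1 | X=1) when the same feature is (mistakenly) multiplied in `times_counted` times
    num = prior * p1 ** times_counted
    den = num + prior * p0 ** times_counted
    return num / den

print(posterior(1))   # ≈ 0.727  -- correct Naive Bayes posterior
print(posterior(2))   # ≈ 0.877  -- duplicated feature: same decision, over-confident posterior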

Naïve Bayes: Subtlety #2
If unlucky, our MLE estimate for P(Xi | Y) might be zero (e.g., Xi = Birthday_Is_January_30_1992, and that value never occurs in the training data for some class)
• Why worry about just one parameter out of many? Because a single zero estimate makes the entire product Π_i P(Xi | Y=yk) zero, so class yk can never be chosen no matter how strongly the other features support it
• What can be done to address this? Smooth the estimates — use MAP estimates with a prior (equivalently, add “imaginary” examples to the counts) so that no estimate is exactly zero

Estimating Parameters
• Maximum Likelihood Estimate (MLE): choose θ that maximizes probability of observed data
• Maximum a Posteriori (MAP) estimate: choose θ that is most probable given prior probability and the data

Estimating Parameters: Y, Xi discrete-valued
Maximum likelihood estimates:
π̂_k = #D{Y = yk} / |D|
θ̂_ijk = #D{Xi = xij AND Y = yk} / #D{Y = yk}
MAP estimates (Beta, Dirichlet priors):
π̂_k = (#D{Y = yk} + (βk − 1)) / (|D| + Σ_m (βm − 1))
θ̂_ijk = (#D{Xi = xij AND Y = yk} + (βij − 1)) / (#D{Y = yk} + Σ_j' (βij' − 1))
Only difference: the “imaginary” examples supplied by the prior are added to the observed counts
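A quick Python sketch of the difference, treating the prior as pseudo-counts (“imaginary” examples); the default pseudo=1 per value corresponds to add-one (Laplace) smoothing and is only an illustrative choice:

def mle(count_xy, count_y):
    # MLE: can be exactly zero if a value never co-occurs with class yk
    return count_xy / count_y

def map_est(count_xy, count_y, n_values, pseudo=1):
    # MAP / smoothed estimate: add `pseudo` imaginary examples per feature value
    return (count_xy + pseudo) / (count_y + pseudo * n_values)

print(mle(0, 50))          # 0.0    -- kills the whole Naive Bayes product
print(map_est(0, 50, 2))   # ≈ 0.019 -- small but nonzero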

Learning to classify text documents
• Classify which emails are spam?
• Classify which emails promise an attachment?
• Classify which web pages are student home pages?
How shall we represent text documents for Naïve Bayes?

Baseline: Bag of Words Approach
aardvark 0
about 2
all 2
Africa 1
apple 0
anxious 0
…
gas 1
…
oil 1
…
Zaire 0

Learning to classify document: P(Y|X) the “Bag of Words” model
• Y discrete valued. e.g., Spam or not
• X = <X1, …, Xn> = document
• Xi is a random variable describing the word at position i in the document
• possible values for Xi : any word wk in English
• Document = bag of words: the vector of counts for all wk’s
– like #heads, #tails, but we have many more than 2 values
– assume word probabilities are position independent (i.i.d. rolls of a 50,000-sided die)
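A minimal Python sketch of building such a count vector (the toy vocabulary is illustrative):

from collections import Counter

vocab = ["aardvark", "about", "all", "africa", "gas", "oil", "zaire"]   # toy vocabulary

def bag_of_words(document, vocab=vocab):
    # count each vocabulary word, ignoring position (the "bag" assumption)
    counts = Counter(document.lower().split())
    return [counts[w] for w in vocab]

print(bag_of_words("Oil and gas exports from Zaire ... all about Africa"))
# -> [0, 1, 1, 1, 1, 1, 1]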

Naïve Bayes Algorithm – discrete Xi
• Train Naïve Bayes (examples):
  for each value yk
    estimate π_k ≡ P(Y = yk)
  for each value xj of each attribute Xi
    estimate θ_jk ≡ P(Xi = xj | Y = yk) — the probability that word xj appears in position i, given Y = yk
• Classify (Xnew):
  Ynew ← arg max_yk π_k Π_i P(Xi_new | Y = yk)
* Additional assumption: word probabilities are position independent, so the same θ_jk is used for every position i

MAP estimates for bag of words
MAP estimate for the multinomial word distribution (Dirichlet prior with parameters βj):
θ̂_jk = P̂(word wj | Y = yk) = (#{occurrences of wj in documents of class yk} + (βj − 1)) / (#{total words in documents of class yk} + Σ_m (βm − 1))
What β’s should we choose? A common choice is βj = 2 for every word — i.e., hallucinate one extra occurrence of each word (“add-one” / Laplace smoothing)

For code and data, see
www.cs.cmu.edu/~tom/mlbook.html
click on “Software and Data”

What you should know:
• Training and using classifiers based on Bayes rule
• Conditional independence – What it is
– Why it’s important
• Naïve Bayes – What it is
– Why we use it so much
– Training using MLE, MAP estimates
– Discrete variables and continuous (Gaussian)

Questions:
• How can we extend Naïve Bayes if just 2 of the Xi’s are dependent?
• What does the decision surface of a Naïve Bayes classifier look like?
• What error will the classifier achieve if the Naïve Bayes assumption is satisfied and we have infinite training data?
• Can you use Naïve Bayes for a combination of discrete and real-valued Xi?

What if we have continuous Xi? E.g., image classification: Xi is the ith pixel

What if we have continuous Xi? Image classification: Xi is the ith pixel, Y = mental state
Still have: P(Y = yk | X1, …, Xn) ∝ P(Y = yk) Π_i P(Xi | Y = yk)
Just need to decide how to represent P(Xi | Y)

What if we have continuous Xi? E.g., image classification: Xi is the ith pixel
Gaussian Naïve Bayes (GNB): assume
P(Xi = x | Y = yk) = N(x; μik, σik) = (1 / (σik √(2π))) exp( −(x − μik)² / (2 σik²) )
Sometimes assume the variance σik
• is independent of Y (i.e., σi),
• or independent of Xi (i.e., σk),
• or both (i.e., σ)

Gaussian Naïve Bayes Algorithm – continuous Xi (but still discrete Y)
• Train Naïve Bayes (examples):
  for each value yk
    estimate* π_k ≡ P(Y = yk)
  for each attribute Xi estimate
    the class conditional mean μik and variance σik
• Classify (Xnew):
  Ynew ← arg max_yk π_k Π_i N(Xi_new; μik, σik)
* probabilities must sum to 1, so need estimate only n−1 parameters…
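A minimal Python sketch of Gaussian Naïve Bayes training and classification (toy data and names are illustrative, not from the lecture):

import math
from collections import defaultdict

def train_gnb(examples):
    """examples: list of (x, y) with x a tuple of real-valued features."""
    by_class = defaultdict(list)
    for x, y in examples:
        by_class[y].append(x)
    n = len(examples)
    prior, mean, var = {}, {}, {}
    for y, xs in by_class.items():
        prior[y] = len(xs) / n                      # pi_k = P(Y=yk)
        for i in range(len(xs[0])):
            vals = [x[i] for x in xs]
            mu = sum(vals) / len(vals)              # class-conditional mean mu_ik
            s2 = sum((v - mu) ** 2 for v in vals) / len(vals)   # class-conditional variance
            mean[(i, y)] = mu
            var[(i, y)] = s2 if s2 > 0 else 1e-9    # avoid a zero variance
    return prior, mean, var

def gaussian(x, mu, s2):
    return math.exp(-(x - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

def classify_gnb(prior, mean, var, x):
    def score(y):
        p = prior[y]
        for i, xi in enumerate(x):
            p *= gaussian(xi, mean[(i, y)], var[(i, y)])
        return p
    return max(prior, key=score)

data = [((1.0, 2.1), 'A'), ((1.2, 1.9), 'A'), ((3.0, 0.2), 'B'), ((3.2, 0.4), 'B')]
prior, mean, var = train_gnb(data)
print(classify_gnb(prior, mean, var, (1.1, 2.0)))   # -> 'A'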

Estimating Parameters: Y discrete, Xi continuous
Maximum likelihood estimates:
μ̂ik = ( Σ_j Xi^j δ(Y^j = yk) ) / ( Σ_j δ(Y^j = yk) )
σ̂ik² = ( Σ_j (Xi^j − μ̂ik)² δ(Y^j = yk) ) / ( Σ_j δ(Y^j = yk) )
where j indexes the jth training example, i the ith feature, k the kth class, and δ(z) = 1 if z is true, else 0

GNB Example: Classify a person’s cognitive activity, based on brain image
• are they reading a sentence or viewing a picture?
• reading the word “Hammer” or “Apartment”?
• viewing a vertical or horizontal line?
• answering the question, or getting confused?

Stimuli for our study: 60 distinct word/picture exemplars (e.g., “ant”), presented 6 times each

fMRI voxel means for “bottle”: the per-voxel means defining P(Xi | Y=“bottle”)
(figure: mean fMRI activation over all stimuli, and “bottle” minus mean activation, showing regions of above- and below-average activation)

Rank Accuracy Distinguishing among 60 words

Tools vs Buildings: where does brain encode their word meanings?
Accuracies of cubical 27-voxel Naïve Bayes classifiers centered at each voxel [0.7-0.8]

Expected values
Given a discrete random variable X, the expected value of X, written E[X], is
E[X] = Σ_x x P(X = x)
We can also talk about the expected value of functions of X:
E[f(X)] = Σ_x f(x) P(X = x)
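For example, for a Bernoulli variable with P(X=1) = θ: E[X] = 1·θ + 0·(1 − θ) = θ.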

Covariance
Given two random variables X and Y, we define the covariance of X and Y as
Cov(X, Y) ≡ E[ (X − E[X]) (Y − E[Y]) ]
e.g., X=gender, Y=playsFootball or X=gender, Y=leftHanded
Remember: E[X] = Σ_x x P(X = x)