
CSCI 544, lecture 2: Naïve Bayes
Ron Artstein
2022-01-12
These notes are not comprehensive, and do not cover the entire lecture. They are provided as an aid to students, but are not a replacement for watching the lecture video, taking notes, and participating in class discussions. Any distribution, posting or publication of these notes outside of class (for example, on a public web site) requires my prior approval.
Administrative notes
Please log in to Zoom using USC credentials
Due January 18: Quiz
  Reading for current and next class
  Textbook chapters
Due January 20: Written Assignment
Due January 27: Coding Assignment 1
  Instructions on website now
  Data on Blackboard now
  Submission on Vocareum soon
Probability review
P(A,B) = P(A) · P(B|A) = P(B) · P(A|B)
Joint probability = prior probability · conditional probability

Bayes’ Theorem:
P(A|B) = P(A) · P(B|A) / P(B)
Posterior probability = prior probability · conditional probability / P(B)

When P(B) is constant:
P(A|B) ∝ P(A) · P(B|A)
The importance of priors
Sometimes it is easy to determine conditional probabilities:
Infected → 1% false negative (99% sensitivity)
Not infected → 1% false positive (99% specificity)
Suppose a person tests positive. How likely are they infected?
It depends on the incidence! Suppose 1% of population infected.
Counts out of 10,000 people:

               Positive   Negative
Infected           99          1
Not infected       99       9801

Posterior is 99/198 = 50%: higher than the prior of 1%.
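As a sanity check, the same posterior can be computed directly with Bayes’ theorem. A minimal Python sketch, using only the numbers on this slide, reproduces the 50% figure:

    # Posterior probability of infection given a positive test (Bayes' theorem).
    # Numbers from the slide: 1% prevalence, 99% sensitivity, 99% specificity.
    prevalence  = 0.01   # P(infected), the prior
    sensitivity = 0.99   # P(positive | infected)
    specificity = 0.99   # P(negative | not infected)

    p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    posterior = sensitivity * prevalence / p_positive
    print(f"P(infected | positive) = {posterior:.3f}")   # 0.500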
Classification
Classification task: assign labels (classes) to items.
Two kinds of squirrels in California:
Ground squirrel and Tree squirrel
Make observations about individual squirrels:
Tail: bushy or thin?
Ears: pointy or round?
Location: in a tree or on the ground?
For each individual squirrel, identify most likely class:
argmax_class P(class|observations)
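In code, the decision rule is just an argmax over the classes. A tiny Python sketch, where the classes and posterior values are placeholders for illustration:

    # Decision rule: pick the class with the highest posterior probability.
    # The classes and posterior values here are placeholders, not learned values.
    posteriors = {"Ground": 0.6, "Tree": 0.4}
    predicted = max(posteriors, key=posteriors.get)
    print(predicted)   # Ground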
Naïve Bayes model (Bayes part)
Often, easier to model the observations given the class (e.g., How likely is a ground squirrel to have a bushy tail?)
P(observations|class)
Use Bayes’ theorem to turn one probability into the other!
P(class|observations) ∝ P(class) · P(observations|class)
This is a generative model. Need to know class priors.
Don’t need observation priors: P(observations) is constant given the observations.
Naïve Bayes model (naïve part)
Modeling the probability of observations:
Observations represented as features: f1, f2, f3, . . . (e.g., Tail, Ears, Location, . . . )
P(observations|class)
= P(f1, f2, f3, . . . |class)
= P(f1|class) · P(f2|class, f1) · P(f3|class, f1, f2) · · ·
Naïve assumption:
observations are conditionally independent, given the class:
For all fx,fy,fz,…: P(fx|class,fy,fz,…) = P(fx|class)
P(observations|class)
= P(f1, f2, f3, . . . |class)
= P(f1|class) · P(f2|class, f1) · P(f3|class, f1, f2) · · ·
≈ P(f1|class) · P(f2|class) · P(f3|class) · · ·
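A sketch of how the naïve assumption is used in practice: the per-feature conditional probabilities are multiplied together, usually summed in log space to avoid numeric underflow. The prior and cond tables below are hypothetical lookups that would come from training:

    import math

    # Naive Bayes score for one class: log P(class) + sum of log P(feature | class).
    # Assumes all probabilities are nonzero (see smoothing later in the lecture).
    # `prior` and `cond` are hypothetical probability tables learned from data.
    def nb_log_score(cls, features, prior, cond):
        score = math.log(prior[cls])
        for f in features:
            score += math.log(cond[cls][f])
        return score

    # Placeholder usage with made-up tables:
    print(nb_log_score("Tree", ["bushy"], {"Tree": 0.4}, {"Tree": {"bushy": 1.0}}))  # ≈ -0.916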
Naïve Bayes model: parameter estimation
We have a model with many parameters:
For each class, a prior probability
For each feature × class, a conditional probability
How do we estimate the parameters of our model?
Maximum Likelihood Estimate: choose the model that gives the highest probability to the observed data
Suppose I flip a coin, and it falls 6 times on heads, 4 times on tails. What’s the most likely bias of the coin?
If the coin is 50%–50%: P = 0.20508
If the coin is 60%–40%: P = 0.25082  ← maximum
If the coin is 70%–30%: P = 0.20012
MLE of a probability distribution = observed frequencies
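The coin numbers above can be checked with a few lines of Python (binomial likelihood of 6 heads and 4 tails in 10 flips):

    from math import comb

    # Likelihood of observing 6 heads and 4 tails in 10 flips,
    # for several hypothesized biases of the coin.
    heads, tails = 6, 4
    for p in (0.5, 0.6, 0.7):
        likelihood = comb(heads + tails, heads) * p**heads * (1 - p)**tails
        print(f"P(data | bias={p}) = {likelihood:.5f}")
    # bias=0.6 gives the maximum: the MLE equals the observed frequency 6/10.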
Naïve Bayes model: five squirrels
We found 5 squirrels (training data)
Class      Ground   Ground   Ground   Tree     Tree
Tail       bushy    bushy    thin     bushy    bushy
Ears       pointy   round    round    pointy   round
Location   ground   ground   ground   ground   tree

Estimated probabilities (relative frequencies):

                  Ground   Tree
Class priors       3/5      2/5
Bushy tail         2/3      2/2
Thin tail          1/3      0/2
Pointy ears        1/3      1/2
Round ears         2/3      1/2
On the ground      3/3      1/2
In a tree          0/3      1/2
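A minimal sketch of how these maximum likelihood estimates could be computed from the five training squirrels (the variable names and data encoding are my own, not part of the assignment):

    from collections import Counter, defaultdict

    # The five training squirrels: (class, tail, ears, location).
    data = [
        ("Ground", "bushy", "pointy", "ground"),
        ("Ground", "bushy", "round",  "ground"),
        ("Ground", "thin",  "round",  "ground"),
        ("Tree",   "bushy", "pointy", "ground"),
        ("Tree",   "bushy", "round",  "tree"),
    ]

    # Class priors: relative frequency of each class in the training data.
    class_counts = Counter(cls for cls, *_ in data)
    prior = {cls: n / len(data) for cls, n in class_counts.items()}

    # Conditional probabilities: P(feature value | class), counted per feature slot.
    feature_counts = defaultdict(Counter)
    for cls, *features in data:
        for slot, value in enumerate(features):
            feature_counts[cls][(slot, value)] += 1
    cond = {cls: {fv: n / class_counts[cls] for fv, n in counts.items()}
            for cls, counts in feature_counts.items()}

    print(prior)                          # {'Ground': 0.6, 'Tree': 0.4}
    print(cond["Ground"][(0, "bushy")])   # 2/3 ≈ 0.667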
Naïve Bayes model: new squirrel
New squirrel: Thin tail, round ears, in a tree. What does the model predict?
P(Ground|observations) ∝ 3/5 · 1/3 · 2/3 · 0/3 = 0
P(Tree|observations) ∝ 2/5 · 0/2 · 1/2 · 1/2 = 0
Zero probability of just one feature → product = zero!
Smoothing to get rid of zero probabilities
Naïve Bayes model: add-one smoothing
Add-one smoothing: add 1 to each feature count (and adjust the denominator accordingly) before computing the conditional probabilities.

Smoothed probabilities:

                  Ground   Tree
Class priors       3/5      2/5
Bushy tail         3/5      3/4
Thin tail          2/5      1/4
Pointy ears        2/5      2/4
Round ears         3/5      2/4
On the ground      4/5      2/4
In a tree          1/5      2/4

New squirrel: Thin tail, round ears, in a tree. What does the model predict?
P(Ground|observations) ∝ 3/5 · 2/5 · 3/5 · 1/5 = 0.0288
P(Tree|observations)   ∝ 2/5 · 1/4 · 2/4 · 2/4 = 0.0250

What’s the most likely squirrel?
P(Ground, observations) = 3/5 · 3/5 · 3/5 · 4/5 = 0.1728
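A small, self-contained sketch of the smoothed computation (counts taken from the five training squirrels above; each feature has two possible values, hence the +2 in the denominator):

    # Add-one smoothing for the squirrel example. Counts are raw training counts.
    counts = {
        "Ground": {"n": 3, "thin": 1, "round": 2, "tree": 0},
        "Tree":   {"n": 2, "thin": 0, "round": 1, "tree": 1},
    }
    prior = {"Ground": 3 / 5, "Tree": 2 / 5}

    def smoothed(cls, feature, num_values=2):
        return (counts[cls][feature] + 1) / (counts[cls]["n"] + num_values)

    # New squirrel: thin tail, round ears, in a tree.
    for cls in ("Ground", "Tree"):
        score = prior[cls]
        for feature in ("thin", "round", "tree"):
            score *= smoothed(cls, feature)
        print(cls, round(score, 4))   # Ground 0.0288, Tree 0.025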
Data splits for machine learning
Training data: what our program learns from
  Labeled (supervised learning)
  Unlabeled (unsupervised learning)
Test data: used for evaluating the program
  Provides correct answers
Development data: testing during development
  Avoid “contaminating” the test data
  Try different models
  Tune parameters
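One way to hold out a development set in Python (a random 80/20 split, shown purely for illustration; the coding assignment instead reserves a fixed fold, as noted below). The example reviews and labels are placeholders:

    import random

    # Hold out 20% of the labeled data as a development set (placeholder data).
    examples = [("review text 1", "positive"), ("review text 2", "negative"),
                ("review text 3", "positive"), ("review text 4", "negative"),
                ("review text 5", "positive")]

    random.seed(0)
    random.shuffle(examples)
    cut = int(0.8 * len(examples))
    train_set, dev_set = examples[:cut], examples[cut:]
    print(len(train_set), "training /", len(dev_set), "development")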
Coding Assignment 1
Write a Naive Bayes classifier from scratch
Classify hotel reviews on two dimensions:
  Positive / Negative
  Truthful / Deceptive
Words as features
Graded on performance
Programming in Python
Coding Assignment 1: directory structure
[Directory tree diagram: the training data directories each contain fold1–fold4 subdirectories.]
Coding Assignment 1: programs
Training data → Learning program → Model file(s)
Model file(s) + Test data → Classification program → Output file
Output file + Answer key → Scoring program → Score
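A rough sketch of how the two programs communicate through files (the file names and model format here are illustrative, not the assignment’s specification):

    import json

    # Learning program: estimate parameters and write them to a model file.
    model = {"prior": {"Ground": 0.6, "Tree": 0.4}}   # placeholder parameters
    with open("model.json", "w") as f:
        json.dump(model, f)

    # Classification program (run separately): load the model file, label the
    # test data, and write one prediction per line to an output file.
    with open("model.json") as f:
        model = json.load(f)
    with open("output.txt", "w") as f:
        f.write("Ground\n")   # placeholder prediction for a single test item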
Coding Assignment 1: notes
Reserve data for development
Put training and development data in separate directories
Vocareum uses fold1 for development
Problem formulation
Two binary classifiers
One 4-class classifier
Reference solution uses add-one smoothing
Reference solution ignores unknown test tokens
Tokenization
Experiment to see what works best!
What is a word?
One word or two?
handwriting ; hand writing
; Forty niner
San Francisco-based ; San Francisco–based
he’s ; it’ll
David’s book ; David’s happy
Tokenization: splitting the text into units for processing
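A minimal starting point for tokenization in Python (one possible choice, not the reference solution; how to treat case, punctuation, and apostrophes is exactly what to experiment with):

    import re

    # A minimal tokenizer: lowercase the text and keep runs of letters and digits,
    # optionally with an internal apostrophe (so "David's" stays one token).
    def tokenize(text):
        return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

    print(tokenize("David's book is in San Francisco."))
    # ["david's", 'book', 'is', 'in', 'san', 'francisco']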
Token, stem, lemma
Token: atomic unit for processing
Anywhere on the scale character ←→ word
Stem: wordform stripped of some characters
cat, cats → cat
defense, defensible → defens
Lemma: the base (or citation) form of a word
cat, cats → cat
brought → bring
is, were → be
Lemmatizer = function: {wordform|token} → lemma
Information lost (or gained)
Tokenization can be lossy:
Where were you born? → where be you bear
(the same lemmatized tokens could also come from: Where is your bear?)
Is it important to be able to reconstruct the original?
Depends on the application
Byte-pair encoding (BPE): use lossless compression
Not necessarily linguistically meaningful
May be useful for some applications
Building a lemmatizer
Rules • Dictionary • Learn from corpus (learn what?)
What are some problems with using a dictionary?
Word form not in the dictionary (OOV):
Kardashians → Kardashian
Word form ambiguous:
I saw the chicken lay an egg → lay
I saw the chicken lay in the sun → lie
OOV, Disambiguation: major issues in all of NLP
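A toy dictionary-based lemmatizer that makes both problems concrete (the tiny dictionary and the part-of-speech tags are illustrative only):

    # Toy dictionary-based lemmatizer. OOV wordforms fall back to themselves;
    # ambiguous wordforms need context (here a part-of-speech tag) to pick a lemma.
    LEMMA_DICT = {
        ("lay", "VB"):   "lay",   # base verb: ...lay an egg
        ("lay", "VBD"):  "lie",   # past tense of "lie": ...lay in the sun
        ("were", "VBD"): "be",
        ("cats", "NNS"): "cat",
    }

    def lemmatize(wordform, pos):
        return LEMMA_DICT.get((wordform.lower(), pos), wordform)

    print(lemmatize("lay", "VB"))            # lay  (lay an egg)
    print(lemmatize("lay", "VBD"))           # lie  (lay in the sun)
    print(lemmatize("Kardashians", "NNPS"))  # Kardashians  (OOV: unchanged)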