L18 – Learning from Examples
EECS 391
Intro to AI
Learning from Examples
L16 Thu Nov 8
Classifying Uncertain Data
Predicting credit risk

• Consider the credit risk data again (J = "<2 years at current job?", M = "missed payments?", D = "defaulted?"):

        J   M   D
    1   N   N   N
    2   Y   N   Y
    3   N   N   N
    4   N   N   N
    5   N   Y   Y
    6   Y   N   N
    7   N   Y   N
    8   N   Y   Y
    9   Y   N   N
   10   Y   N   N

• Suppose now we want to learn the best classification P(D | J, M) from the data.
• Instead of a yes-or-no answer, we want some estimate of how strongly we believe a loan applicant is a credit risk.
• This might be useful if we want some flexibility in adjusting our decision criteria.
  - E.g., suppose we're willing to take more risk if times are good.
  - Or, we may want to examine cases we believe are higher risks more carefully.

Pick your poison: Mushrooms

• Or suppose we wanted to know how likely a mushroom is to be safe to eat.
• One approach is to consult a guide and go by the listed criteria, but does that allow us to place any certainty on the decision?
• (btw: never eat wild mushrooms without an expert guide, it really is a serious risk)

Mushroom data (pictured: the "Death Cap"):

        EDIBLE?     CAP-SHAPE   CAP-SURFACE   ...
    1   edible      flat        fibrous       ...
    2   poisonous   convex      smooth        ...
    3   edible      flat        fibrous       ...
    4   edible      convex      scaly         ...
    5   poisonous   convex      smooth        ...
    6   edible      convex      fibrous       ...
    7   poisonous   flat        scaly         ...
    8   poisonous   flat        scaly         ...
    9   poisonous   convex      fibrous       ...
   10   poisonous   convex      fibrous       ...
   11   poisonous   flat        smooth        ...
   12   edible      convex      smooth        ...
   13   poisonous   knobbed     scaly         ...
   14   poisonous   flat        smooth        ...
   15   poisonous   flat        fibrous       ...
   ...  (more rows and attribute columns not shown)

Bayesian classification for more complex models

• Recall the class conditional probability:

  p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{p(x)} = \frac{p(x \mid C_k)\, p(C_k)}{\sum_k p(x \mid C_k)\, p(C_k)}

• How do we define the data likelihood p(x | C_k), i.e. the probability of x given class C_k?

Defining a probabilistic classification model

• How would we define the credit risk problem?
  - Class: C_1 = "defaulted", C_2 = "didn't default"
  - Data: x = { "<2 years", "missed payments" }
  - Prior (from data): p(C_1) = 3/10, p(C_2) = 7/10
  - Likelihood: p(x_1, x_2 | C_1) = ?  p(x_1, x_2 | C_2) = ?
  - How would we determine these?

Defining a probabilistic model by counting

• The "prior" is obtained by counting the number of records of each class in the data:

  p(C_k = k) = \frac{\text{Count}(C_k = k)}{\#\,\text{records}}

• The likelihood is obtained the same way:

  p(x = v \mid C_k = k) = \frac{\text{Count}(x = v \wedge C_k = k)}{\text{Count}(C_k = k)}

  p(x_1 = v_1, \ldots, x_N = v_N \mid C_k = k) = \frac{\text{Count}(x_1 = v_1 \wedge \cdots \wedge x_N = v_N \wedge C_k = k)}{\text{Count}(C_k = k)}

• This is the maximum likelihood estimate (MLE) of the probabilities.
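To make the counting concrete, here is a minimal Python sketch (not from the slides; the data is the 10-record credit table above, the function and variable names are illustrative) that computes the MLE prior and a single-feature likelihood by counting:

```python
from collections import Counter

# The 10 credit records from the slide: (<2 years at job?, missed payments?, defaulted?)
records = [
    ("N", "N", "N"), ("Y", "N", "Y"), ("N", "N", "N"), ("N", "N", "N"), ("N", "Y", "Y"),
    ("Y", "N", "N"), ("N", "Y", "N"), ("N", "Y", "Y"), ("Y", "N", "N"), ("Y", "N", "N"),
]

# MLE prior: p(C = c) = Count(C = c) / #records
class_counts = Counter(defaulted for _, _, defaulted in records)
prior = {c: n / len(records) for c, n in class_counts.items()}
print(prior)  # {'N': 0.7, 'Y': 0.3}, i.e. p(C_2) = 7/10 and p(C_1) = 3/10

# MLE likelihood for a single feature: p(x = v | C = c) = Count(x = v and C = c) / Count(C = c)
def likelihood(feature_index, value, c):
    joint = sum(1 for r in records if r[feature_index] == value and r[2] == c)
    return joint / class_counts[c]

print(likelihood(0, "Y", "Y"))  # p(<2 years = Y | defaulted) = 1/3
```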
Defining a probabilistic classification model

• Determining the likelihood: p(x_1, x_2 | C_1) = ?  p(x_1, x_2 | C_2) = ?
• Simple approach: look at counts in the data. For each combination (x_1, x_2), count how many of those records fall in each class:

   x_1 (<2 years at current job?)   x_2 (missed payments?)   C_1 (did default)   C_2 (did not default)
               N                            N                       0/3                 3/3
               N                            Y                       2/3                 1/3
               Y                            N                       1/4                 3/4
               Y                            Y                       0/0                 0/0

• What do we do about the 0/0 entries?

Being (proper) Bayesians: Recall our coin-flipping example

• In Bernoulli trials, each sample is either 1 (e.g. heads) with probability θ, or 0 (tails) with probability 1 − θ.
• The binomial distribution specifies the probability of the total number of heads, y, out of n trials:

  p(y \mid \theta, n) = \binom{n}{y} \theta^{y} (1 - \theta)^{n - y}

  [Plot: p(y | θ = 0.25, n = 10) as a function of y.]

Applying Bayes' rule

• Given n trials with y heads, what do we know about θ?
• We can apply Bayes' rule to see how our knowledge changes as we acquire new observations:

  p(\theta \mid y, n) = \frac{p(y \mid \theta, n)\, p(\theta \mid n)}{p(y \mid n)}

  (posterior = likelihood × prior / normalizing constant, where p(y \mid n) = \int p(y \mid \theta, n)\, p(\theta \mid n)\, d\theta)

• We know the likelihood; what about the prior?
• A uniform prior on [0, 1] is a reasonable assumption, i.e. "we don't know anything".
• What is the form of the posterior? In this case, the posterior is just proportional to the likelihood:

  p(\theta \mid y, n) \propto \binom{n}{y} \theta^{y} (1 - \theta)^{n - y}

Evaluating the posterior: coin tossing

• What do we know initially, before observing any trials?  [Plot: p(θ | y = 0, n = 0)]
• What is our belief about θ after observing one "tail"?  [Plot: p(θ | y = 0, n = 1)]
• Now after two trials we observe 1 head and 1 tail.  [Plot: p(θ | y = 1, n = 2)]
• 3 trials: 1 head and 2 tails.  [Plot: p(θ | y = 1, n = 3)]
• 4 trials: 1 head and 3 tails.  [Plot: p(θ | y = 1, n = 4)]
• 5 trials: 1 head and 4 tails.  [Plot: p(θ | y = 1, n = 5)]

Evaluating the normalizing constant

• To get a proper probability density function, we need to evaluate p(y | n):

  p(\theta \mid y, n) = \frac{p(y \mid \theta, n)\, p(\theta \mid n)}{p(y \mid n)}

• Bayes, in his original 1763 paper, showed that:

  p(y \mid n) = \int_0^1 p(y \mid \theta, n)\, p(\theta \mid n)\, d\theta = \frac{1}{n + 1}
  \;\Rightarrow\; p(\theta \mid y, n) = \binom{n}{y} \theta^{y} (1 - \theta)^{n - y}\, (n + 1)
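A small numerical sketch (mine, not from the slides) of the result above: it evaluates p(θ | y, n) = (n + 1) C(n, y) θ^y (1 − θ)^(n − y) on a grid for the coin-tossing sequence shown, checks that it integrates to about 1, and compares the numerical posterior mean to (y + 1)/(n + 2):

```python
from math import comb
import numpy as np

def posterior(theta, y, n):
    """p(theta | y, n) under a uniform prior: (n + 1) * C(n, y) * theta^y * (1 - theta)^(n - y)."""
    return (n + 1) * comb(n, y) * theta**y * (1.0 - theta)**(n - y)

theta = np.linspace(0.0, 1.0, 1001)
dtheta = theta[1] - theta[0]

# The coin-tossing sequence from the slides: (heads y, trials n)
for y, n in [(0, 0), (0, 1), (1, 2), (1, 3), (1, 4), (1, 5)]:
    p = posterior(theta, y, n)
    area = np.sum(p) * dtheta          # ~1: the (n + 1) factor makes the density proper
    mean = np.sum(theta * p) * dtheta  # numerical E(theta | y, n), approximate on this grid
    print(f"y={y} n={n}: area~{area:.3f}  mean~{mean:.3f}  (y+1)/(n+2)={(y + 1) / (n + 2):.3f}")
```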
n y " θy(1 − θ)n−y(n + 1) The ratio estimate • What about after just one trial: 0 heads and 1 tail? 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 θ p( θ | y= 0, n =1 ) ● MAP and ratio estimate would say 0. y/n = 0 * Does this make sense? ● What would a better estimate be? MAP estimate = 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 θ p( θ | y= 0, n =1 ) The expected value estimate • The expected value of a pdf is: 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 θ p( θ | y= 0, n =1 ) E(θ|y, n) = ! 1 0 θp(θ|y, n)dθ = y + 1 n + 2 What happens for zero trials? E(θ|y = 0, n = 1) = 1 3 This is called “smoothing” or “regularization” On to the mushrooms! EDIBLE? CAP-SHAPE CAP-SURFACE CAP-COLOR ODOR STALK-SHAPE POPULATION HABITAT • • • 1 edible flat fibrous red none tapering several woods • • • 2 poisonous convex smooth red foul tapering several paths • • • 3 edible flat fibrous brown none tapering abundant grasses • • • 4 edible convex scaly gray none tapering several woods • • • 5 poisonous convex smooth red foul tapering several woods • • • 6 edible convex fibrous gray none tapering several woods • • • 7 poisonous flat scaly brown fishy tapering several leaves • • • 8 poisonous flat scaly brown spicy tapering several leaves • • • 9 poisonous convex fibrous yellow foul enlarging several paths • • • 10 poisonous convex fibrous yellow foul enlarging several woods • • • 11 poisonous flat smooth brown spicy tapering several woods • • • 12 edible convex smooth yellow anise tapering several woods • • • 13 poisonous knobbed scaly red foul tapering several leaves • • • 14 poisonous flat smooth brown foul tapering several leaves • • • 15 poisonous flat fibrous gray foul enlarging several woods • • • 16 edible sunken fibrous brown none enlarging solitary urban • • • 17 poisonous flat smooth brown foul tapering several woods • • • 18 poisonous convex smooth white foul tapering scattered urban • • • 19 poisonous flat scaly yellow foul enlarging solitary paths • • • 20 edible convex fibrous gray none tapering several woods • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • The scaling problem p(x = v|Ck) = Count(x = v ∧ Ck = k) Count(Ck = k) p(x1 = v1, . . . , xN = vN |Ck = k) = Count(x1 = v1, . . . ∧ xN = vN ,∧Ck = k) Count(Ck = k) • The prior is easy enough. • But for the likelihood, the table is huge! 
On to the mushrooms!

        EDIBLE?     CAP-SHAPE   CAP-SURFACE   CAP-COLOR   ODOR    STALK-SHAPE   POPULATION   HABITAT   ...
    1   edible      flat        fibrous       red         none    tapering      several      woods     ...
    2   poisonous   convex      smooth        red         foul    tapering      several      paths     ...
    3   edible      flat        fibrous       brown       none    tapering      abundant     grasses   ...
    4   edible      convex      scaly         gray        none    tapering      several      woods     ...
    5   poisonous   convex      smooth        red         foul    tapering      several      woods     ...
    6   edible      convex      fibrous       gray        none    tapering      several      woods     ...
    7   poisonous   flat        scaly         brown       fishy   tapering      several      leaves    ...
    8   poisonous   flat        scaly         brown       spicy   tapering      several      leaves    ...
    9   poisonous   convex      fibrous       yellow      foul    enlarging     several      paths     ...
   10   poisonous   convex      fibrous       yellow      foul    enlarging     several      woods     ...
   11   poisonous   flat        smooth        brown       spicy   tapering      several      woods     ...
   12   edible      convex      smooth        yellow      anise   tapering      several      woods     ...
   13   poisonous   knobbed     scaly         red         foul    tapering      several      leaves    ...
   14   poisonous   flat        smooth        brown       foul    tapering      several      leaves    ...
   15   poisonous   flat        fibrous       gray        foul    enlarging     several      woods     ...
   16   edible      sunken      fibrous       brown       none    enlarging     solitary     urban     ...
   17   poisonous   flat        smooth        brown       foul    tapering      several      woods     ...
   18   poisonous   convex      smooth        white       foul    tapering      scattered    urban     ...
   19   poisonous   flat        scaly         yellow      foul    enlarging     solitary     paths     ...
   20   edible      convex      fibrous       gray        none    tapering      several      woods     ...
   ...  (more rows and attribute columns not shown)

The scaling problem

  p(x = v \mid C_k = k) = \frac{\text{Count}(x = v \wedge C_k = k)}{\text{Count}(C_k = k)}

  p(x_1 = v_1, \ldots, x_N = v_N \mid C_k = k) = \frac{\text{Count}(x_1 = v_1 \wedge \cdots \wedge x_N = v_N \wedge C_k = k)}{\text{Count}(C_k = k)}

• The prior is easy enough.
• But for the likelihood, the table is huge!

Mushroom attributes and values

   attribute                  values                                                                     # values
   EDIBLE                     edible poisonous                                                               2
   CAP-SHAPE                  bell conical convex flat knobbed sunken                                        6
   CAP-SURFACE                fibrous grooves scaly smooth                                                   4
   CAP-COLOR                  brown buff cinnamon gray green pink purple red white yellow                   10
   BRUISES                    bruises no                                                                     2
   ODOR                       almond anise creosote fishy foul musty none pungent spicy                      9
   GILL-ATTACHMENT            attached free                                                                  2
   GILL-SPACING               close crowded                                                                  2
   GILL-SIZE                  broad narrow                                                                   2
   GILL-COLOR                 black brown buff chocolate gray green orange pink purple red white yellow     12
   STALK-SHAPE                enlarging tapering                                                             2
   STALK-ROOT                 bulbous club equal rooted                                                      4
   STALK-SURFACE-ABOVE-RING   fibrous scaly silky smooth                                                     4
   STALK-SURFACE-BELOW-RING   fibrous scaly silky smooth                                                     4
   STALK-COLOR-ABOVE-RING     brown buff cinnamon gray orange pink red white yellow                          9
   STALK-COLOR-BELOW-RING     brown buff cinnamon gray orange pink red white yellow                          9
   VEIL-TYPE                  partial universal                                                              2
   VEIL-COLOR                 brown orange white yellow                                                      4
   RING-NUMBER                none one two                                                                   3
   RING-TYPE                  evanescent flaring large none pendant                                          5
   SPORE-PRINT-COLOR          black brown buff chocolate green orange purple white yellow                    9
   POPULATION                 abundant clustered numerous scattered several solitary                         6
   HABITAT                    grasses leaves meadows paths urban waste woods                                 7

• 22 attributes with an average of 5 values!

Simplifying with "Naïve" Bayes

• What if we assume the features are independent given the class?
• We know that's not precisely true, but it might make a good approximation:

  p(x \mid C_k) = p(x_1, \ldots, x_N \mid C_k) = \prod_{n=1}^{N} p(x_n \mid C_k)

• Now we only need to specify N different likelihoods:

  p(x_i = v_i \mid C_k = k) = \frac{\text{Count}(x_i = v_i \wedge C_k = k)}{\text{Count}(C_k = k)}

• Huge savings in the number of parameters.

Inference with Naïve Bayes

• Inference is just like before, but with the independence approximation:

  p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{p(x)} = \frac{p(C_k) \prod_n p(x_n \mid C_k)}{p(x)} = \frac{p(C_k) \prod_n p(x_n \mid C_k)}{\sum_k p(C_k) \prod_n p(x_n \mid C_k)}

• Classification performance is often surprisingly good.
• Easy to implement.

Implementation issues

• If you implement Naïve Bayes naïvely, you'll run into trouble. Why?
• It's never good to compute the product of a long list of probabilities: it quickly underflows to zero at machine precision, even using doubles (64 bit).
• Strategy: compute log probabilities:

  \log p(C_k \mid x) = \log p(C_k) + \sum_n \log p(x_n \mid C_k) - \log\!\left[ \sum_k p(C_k) \prod_n p(x_n \mid C_k) \right]
                     = \log p(C_k) + \sum_n \log p(x_n \mid C_k) - \text{constant}

• What about that constant? It still contains a product.

Converting back to probabilities

• The only requirement on the denominator is that it normalize the numerator to yield a valid probability distribution.
• We used a log transformation: g_i = \log p_i + \text{constant}.
• The form of the probability is the same for any constant c:

  \frac{p_i}{\sum_i p_i} = \frac{e^{g_i}}{\sum_i e^{g_i}} = \frac{e^{c} e^{g_i}}{\sum_i e^{c} e^{g_i}} = \frac{e^{g_i + c}}{\sum_i e^{g_i + c}}

• A common choice: pick c so that the log probabilities are shifted to (at most) zero:

  c = -\max_i g_i

Text classification with the bag of words model

• Each row of the data matrix (Figure 1.9, below) is a document represented as a bag-of-words vector.
• The different classes are different newsgroups.
• The differences in word frequencies between classes are readily apparent.
• We can use mixture models and naïve Bayes to classify the documents:

  p(C_k \mid x) = \frac{p(C_k) \prod_n p(x_n \mid C_k)}{\sum_k p(C_k) \prod_n p(x_n \mid C_k)}

• We only replace the data likelihood with our bag-of-words model.
• This is a common way to build a spam filter or classify web pages.
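To tie the last few slides together, here is a small sketch (mine, not from the course materials) of naïve Bayes over bag-of-words bit vectors, working entirely in log space and using the c = −max_i g_i shift to convert back to probabilities. The tiny vocabulary, documents, and parameter values are made up purely for illustration:

```python
import math

# Toy model: 2 classes, binary word-presence features, made-up parameters.
vocab = ["cash", "winner", "meeting", "project"]
log_prior = {"spam": math.log(0.4), "ham": math.log(0.6)}
# p(word present | class), in practice estimated by (smoothed) counting
p_word = {
    "spam": {"cash": 0.7, "winner": 0.6, "meeting": 0.1, "project": 0.1},
    "ham":  {"cash": 0.1, "winner": 0.05, "meeting": 0.6, "project": 0.7},
}

def classify(doc_words):
    """Return p(C_k | x) for a bag-of-words document, computed via log probabilities."""
    x = {w: (1 if w in doc_words else 0) for w in vocab}
    # g_k = log p(C_k) + sum_n log p(x_n | C_k)
    g = {}
    for k in log_prior:
        g[k] = log_prior[k]
        for w in vocab:
            p = p_word[k][w]
            g[k] += math.log(p if x[w] else 1.0 - p)
    # Shift by c = -max_k g_k before exponentiating: the largest term becomes exp(0) = 1,
    # so nothing underflows to zero, and the normalization is unchanged.
    shift = max(g.values())
    unnorm = {k: math.exp(v - shift) for k, v in g.items()}
    z = sum(unnorm.values())
    return {k: u / z for k, u in unnorm.items()}

print(classify({"cash", "winner"}))      # mostly "spam"
print(classify({"project", "meeting"}))  # mostly "ham"
```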
[Figure 1.9 (from the source textbook): 20-newsgroups data, of size 16242 x 100 (also known as 20-news-w100); only 1000 rows shown for clarity. Each row is a document (represented as a bag-of-words bit vector), each column is a word. The red lines separate the 4 classes, which are (in descending order) comp, rec, sci, talk (the titles of USENET groups). There are subsets of words whose presence or absence is indicative of the class.]

Excerpt from the accompanying textbook draft (Kevin P. Murphy), on document classification and email spam filtering:

  "Classification is probably the most widely used form of machine learning, and has been used to solve many interesting and often difficult real-world problems. In document classification, the goal is to classify a document, such as a web page or email message, into one of C classes, that is, to compute p(y = c | x, D), where x is some representation of the text. A special case of this is email spam filtering, where the classes are spam (y = 1) or ham (y = 0).

  Most classifiers assume that the input vector x has a fixed size. A common way to represent variable-length documents in feature-vector format is to use a bag of words representation. The basic idea is very simple: just count how many times each word occurs, and ignore any word order. This gives a histogram representation of a document. We can further simplify matters by thresholding this histogram at 1, so we set x_ij = 1 if word j occurs one or more times in document i, and otherwise set x_ij = 0. Although this loses some information [MN98], it simplifies the problem somewhat. For example, consider the following fragment of a famous nursery rhyme:

    Mary had a little lamb, little lamb, little lamb,
    Mary had a little lamb, whose fleece was white as snow.

  Now imagine our vocabulary consists of the following words:

    mary lamb little big fleece white black snow rain unk
      1    2      3   4      5     6     7    8    9  10

  Here unk stands for unknown, and represents all other words that do not appear elsewhere on the list. To encode each line of the nursery rhyme, we first strip off punctuation, and remove any stop words such as 'a', 'as', 'the', etc. We can also perform stemming, which means reducing words to their base form, such as stripping off the final s in plural words, or the ing from verbs (e.g., running becomes run). In this example, no words need stemming. Finally, we replace each word by its index into the vocabulary to get:

    1 10 3 2 3 2 3 2
    1 10 3 2 10 5 10 6 8"
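A short sketch (mine) of the encoding described in the excerpt: map each line of the rhyme to vocabulary indices, then to a binary bag-of-words vector. The stop-word set is an assumption on my part; as in the excerpt, any word outside the vocabulary maps to unk:

```python
import string

# Vocabulary from the excerpt: indices 1..10, where 10 = "unk" (any out-of-vocabulary word)
vocab = ["mary", "lamb", "little", "big", "fleece", "white", "black", "snow", "rain", "unk"]
index = {w: i + 1 for i, w in enumerate(vocab)}
stop_words = {"a", "as", "the"}  # assumed, following the excerpt's examples

def encode(line):
    """Strip punctuation, drop stop words, map each remaining word to its vocabulary index (unk = 10)."""
    words = line.lower().translate(str.maketrans("", "", string.punctuation)).split()
    return [index.get(w, index["unk"]) for w in words if w not in stop_words]

def bag_of_words(indices):
    """Binary bag-of-words vector: x[j] = 1 if word j+1 occurs at least once."""
    present = set(indices)
    return [1 if (j + 1) in present else 0 for j in range(len(vocab))]

for line in ["Mary had a little lamb, little lamb, little lamb,",
             "Mary had a little lamb, whose fleece was white as snow."]:
    ids = encode(line)
    print(ids, bag_of_words(ids))
# Index sequences match the excerpt: [1, 10, 3, 2, 3, 2, 3, 2] and [1, 10, 3, 2, 10, 5, 10, 6, 8]
```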