Computational Linguistics
CSC 485 Summer 2020
10. Maximum Entropy Models
Gerald Penn
Department of Computer Science, University of Toronto
(slides borrowed from Chris Manning and Dan Klein)
Copyright © 2017 Gerald Penn. All rights reserved.
Introduction
Much of what we’ve looked at has been “generative”
PCFGs, Naive Bayes for WSD
In recent years there has been extensive use of conditional or discriminative probabilistic models in NLP, IR, Speech (and ML generally)
Because:
They give high accuracy performance
They make it easy to incorporate lots of linguistically important features
They allow automatic building of language independent, retargetable NLP modules
Joint vs. Conditional Models
We have some data {(d, c)} of paired observations d and hidden classes c.
Joint (generative) models place probabilities over both observed data and the hidden stuff (generate the observed data from hidden stuff): P(c,d)
  All the best known StatNLP models: n-gram models, Naive Bayes classifiers, hidden Markov models, probabilistic context-free grammars
Discriminative (conditional) models take the data as given, and put a probability over hidden structure given the data: P(c|d)
  Logistic regression, conditional log-linear or maximum entropy models, conditional random fields, (SVMs, …)
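As a minimal sketch (not from the slides; toy counts only), the same paired data can feed both kinds of estimate: the joint model assigns P(c,d) directly from counts, while the conditional model only commits to P(c|d) = P(c,d)/P(d).

```python
from collections import Counter

# Toy paired observations (d = observed datum, c = hidden class);
# the data are made up purely for illustration.
pairs = [("bank", "MONEY"), ("bank", "RIVER"), ("bank", "MONEY"),
         ("shore", "RIVER"), ("loan", "MONEY")]

N = len(pairs)
joint_counts = Counter(pairs)            # counts of (d, c) pairs
d_counts = Counter(d for d, _ in pairs)  # counts of d alone

def p_joint(d, c):
    """Joint (generative) estimate P(c, d)."""
    return joint_counts[(d, c)] / N

def p_cond(d, c):
    """Conditional (discriminative) estimate P(c | d) = P(c, d) / P(d)."""
    return joint_counts[(d, c)] / d_counts[d]

print(p_joint("bank", "MONEY"))  # 2/5 = 0.4
print(p_cond("bank", "MONEY"))   # 2/3 ≈ 0.667
```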
Bayes Net/Graphical Models
Bayes net diagrams draw circles for random variables, and lines for direct dependencies
Some variables are observed; some are hidden
Each node is a little classifier (conditional probability table) based on incoming arcs
[Figure: two graphical models over a class node c and observations d1, d2, d3: Naive Bayes (generative) vs. Logistic Regression (discriminative)]
Conditional models work well: Word Sense Disambiguation
Even with exactly the same features, changing from joint to conditional estimation increases performance
That is, we use the same smoothing, and the same word-class features, we just change the numbers (parameters)
Training Set
  Objective     Accuracy
  Joint Like.   86.8
  Cond. Like.   98.5

Test Set
  Objective     Accuracy
  Joint Like.   73.6
  Cond. Like.   76.1
(Klein and Manning 2002, using Senseval-1 Data)
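A hedged sketch of the same kind of comparison in scikit-learn (toy sentences, not the Senseval-1 data): both classifiers see identical bag-of-words features; Naive Bayes fits the joint likelihood and logistic regression the conditional likelihood, so any difference comes from the training objective, not the features.

```python
# Sketch only: toy data, not the Senseval-1 experiment cited on the slide.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

contexts = ["deposit money in the bank", "the river bank was muddy",
            "bank approved the loan", "fishing from the bank"]
senses = ["MONEY", "RIVER", "MONEY", "RIVER"]

X = CountVectorizer().fit_transform(contexts)     # identical features for both models

joint_model = MultinomialNB().fit(X, senses)      # maximizes joint likelihood P(c, d)
cond_model = LogisticRegression().fit(X, senses)  # maximizes conditional likelihood P(c | d)

print(joint_model.score(X, senses), cond_model.score(X, senses))
```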
Features
In these slides and most MaxEnt work: features are elementary pieces of evidence that link aspects of what we observe d with a category c that we want to predict.
A feature has a (bounded) real value: f: C × D → ℝ
Usually features specify an indicator function of properties of the input and a particular class (every one we present is). They pick out a subset.
fi(c, d) ≡ [Φ(d) ∧ c = cj]   [Value is 0 or 1]
We will freely say that Φ(d) is a feature of the data d, when, for each cj, the conjunction Φ(d) ∧ c = cj is a feature of the data-class pair (c, d).
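A minimal sketch of such an indicator feature in Python (the property and class names are illustrative assumptions): the feature is the conjunction of a property Φ(d) of the datum with one particular class cj, and takes value 0 or 1.

```python
# Sketch: indicator features f_i(c, d) = 1 if Phi(d) and c == c_j, else 0.
# The property and classes below are illustrative assumptions.

def make_feature(phi, c_j):
    """Build f(c, d) = [Phi(d) ∧ c = c_j], returning 0 or 1."""
    return lambda c, d: 1 if phi(d) and c == c_j else 0

# Phi(d): the word is lower-case and ends in "d"
phi = lambda d: d.islower() and d.endswith("d")

f_NN = make_feature(phi, "NN")
print(f_NN("NN", "hand"))   # 1: property holds and class matches
print(f_NN("VB", "hand"))   # 0: property holds but class differs
```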
Features
For example:
  f1(c, w, t) ≡ [c = "NN" ∧ islower(w0) ∧ ends(w0, "d")]
  f2(c, w, t) ≡ [c = "NN" ∧ w-1 = "to" ∧ t-1 = "TO"]
  f3(c, w, t) ≡ [c = "VB" ∧ islower(w0)]
  [Example contexts: "in bed" (IN NN), "to aid" (TO NN), "to aid" (TO VB), "in blue" (IN JJ)]
Models will assign each feature a weight
Empirical count (expectation) of a feature:
  E_empirical(fi) = Σ_{(c,d) ∈ observed(C,D)} fi(c,d)
Model expectation of a feature:
  E(fi) = Σ_{(c,d) ∈ (C,D)} P(c,d) fi(c,d)
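A short sketch of the two expectations above, on made-up observations and an assumed joint distribution P(c,d):

```python
# Sketch: empirical vs. model expectation of a feature f_i(c, d).
# The observations and model probabilities below are made-up illustrations.

observed = [("NN", "hand"), ("VB", "hold"), ("NN", "bird")]   # observed (c, d) pairs

def f_i(c, d):
    """Indicator feature: class is NN and the word ends in 'd'."""
    return 1 if c == "NN" and d.endswith("d") else 0

# Empirical expectation: sum of f_i over the observed pairs (here a raw count).
empirical_E = sum(f_i(c, d) for c, d in observed)

# Model expectation: sum over all (c, d) of P(c, d) * f_i(c, d),
# using an assumed joint distribution P(c, d).
P = {("NN", "hand"): 0.3, ("VB", "hand"): 0.1,
     ("NN", "hold"): 0.1, ("VB", "hold"): 0.3,
     ("NN", "bird"): 0.15, ("VB", "bird"): 0.05}
model_E = sum(p * f_i(c, d) for (c, d), p in P.items())

print(empirical_E, model_E)   # 2 and 0.3 + 0.1 + 0.15 = 0.55
```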
Feature-Based Models
The decision about a data point is based only on the features active at that point.

POS Tagging
  Data: DT JJ NN … The previous fall …
  Label: NN
  Features: {W=fall, PT=JJ, PW=previous}

Text Categorization
  Data: BUSINESS: Stocks hit a yearly low …
  Label: BUSINESS
  Features: {…, stocks, hit, a, yearly, low, …}

Word-Sense Disambiguation
  Data: … to restructure bank:MONEY debt.
  Label: MONEY
  Features: {…, P=restructure, N=debt, L=12, …}
Example: Text Categorization
(Zhang and Oles 2001)
Features are conjunctions of a word in the document and a class (they do feature selection to use reliable indicators)
Tests on classic Reuters data set (and others)
Naïve Bayes: 77.0% F1
Linear regression: 86.0%
Logistic regression: 86.4%
Support vector machine: 86.5%
Emphasizes the importance of regularization (smoothing) for successful use of discriminative methods (not used in most early NLP/IR work)
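A hedged sketch of a regularized logistic-regression text classifier in scikit-learn (toy documents and labels, not the Reuters setup of Zhang and Oles); the C parameter is the inverse regularization strength, i.e. the smoothing the slide refers to.

```python
# Sketch: bag-of-words logistic regression with L2 regularization (toy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["stocks hit a yearly low", "central bank raises rates",
        "team wins championship game", "player traded before deadline"]
labels = ["BUSINESS", "BUSINESS", "SPORTS", "SPORTS"]

# C is the inverse regularization strength; smaller C means stronger smoothing,
# which the slide notes is important for discriminative models.
clf = make_pipeline(CountVectorizer(), LogisticRegression(C=1.0))
clf.fit(docs, labels)

print(clf.predict(["stocks fell before the game"]))
```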
Example: POS Tagging
Features can include:
Current, previous, next words in isolation or together. Previous (or next) one, two, three tags.
Word-internal features: word types, suffixes, dashes, etc.
Local Context
  Position:  -3    -2    -1    0      +1
  Tag:       DT    NNP   VBD   ???    ???
  Word:      The   Dow   fell  22.6   %

Decision Point Features
  W0          22.6
  W+1         %
  W-1         fell
  T-1         VBD
  T-1-T-2     NNP-VBD
  hasDigit?   true
  …           …
(Ratnaparkhi 1996; Toutanova et al. 2003, etc.)
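A minimal sketch (assumed representation) of extracting the decision-point features in the table above from the local context "The Dow fell 22.6 %":

```python
# Sketch: extract decision-point features for position i in a tagged context.
# Tags to the right of the decision point are not yet known ("???" on the slide).
words = ["The", "Dow", "fell", "22.6", "%"]
tags  = ["DT", "NNP", "VBD", None, None]

def extract_features(words, tags, i):
    feats = {
        "W0": words[i],
        "W+1": words[i + 1] if i + 1 < len(words) else "<END>",
        "W-1": words[i - 1] if i > 0 else "<START>",
        "T-1": tags[i - 1] if i > 0 else "<START>",
        "T-1-T-2": f"{tags[i - 2]}-{tags[i - 1]}" if i > 1 else "<START>",
        "hasDigit?": any(ch.isdigit() for ch in words[i]),
    }
    return feats

print(extract_features(words, tags, 3))
# {'W0': '22.6', 'W+1': '%', 'W-1': 'fell', 'T-1': 'VBD',
#  'T-1-T-2': 'NNP-VBD', 'hasDigit?': True}
```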
Example: NER Interaction
Previous-state and current-signature features have interactions, e.g. P=PERS-C=Xx indicates C=PERS much more strongly than C=Xx and P=PERS independently.
This feature type allows the model to capture this interaction.
Local Context

Feature Weights
  Feature Type       Feature   PERS    LOC
  Previous word      at        -0.73   0.94
  Current word       Grace     0.03    0.00
  Beginning bigram
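A brief sketch (illustrative feature names, not the actual weights above) of how an explicit conjunction feature over the previous state and the current word signature lets the model weight the interaction separately from the two independent properties:

```python
# Sketch: independent indicator features vs. an explicit conjunction feature.
# Feature names and the example context are illustrative assumptions.

def f_sig(c, d):
    """C=Xx: current word has capitalized-then-lowercase shape, class is PERS."""
    return 1 if d["sig"] == "Xx" and c == "PERS" else 0

def f_prev(c, d):
    """P=PERS: previous state is PERS, class is PERS."""
    return 1 if d["prev_state"] == "PERS" and c == "PERS" else 0

def f_prev_and_sig(c, d):
    """P=PERS ∧ C=Xx: fires only when both properties hold, so the model can
    give this combination its own weight toward c = PERS."""
    return 1 if d["prev_state"] == "PERS" and d["sig"] == "Xx" and c == "PERS" else 0

d = {"word": "Grace", "sig": "Xx", "prev_state": "PERS"}
print(f_sig("PERS", d), f_prev("PERS", d), f_prev_and_sig("PERS", d))   # 1 1 1
```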