
Computational

Linguistics

Copyright © 2017 Gerald

Penn. All rights reserved.


10. Maximum Entropy Models
Gerald Penn

Department of Computer Science, University of Toronto

(slides borrowed from Chris Manning and Dan Klein)

CSC 2501 / 485

Fall 2018

Introduction

 Much of what we’ve looked at has been “generative”
 PCFGs, Naive Bayes for WSD

 In recent years there has been extensive use of
conditional or discriminative probabilistic
models in NLP, IR, Speech (and ML generally)

 Because:
 They give high accuracy performance
 They make it easy to incorporate lots of

linguistically important features
 They allow automatic building of language

independent, retargetable NLP modules

Joint vs. Conditional Models

 We have some data {(d, c)} of paired observations d
and hidden classes c.

 Joint (generative) models place probabilities over both
observed data and the hidden stuff (generate the
observed data from hidden stuff):
 All the best known StatNLP models:

 n-gram models, Naive Bayes classifiers, hidden Markov
models, probabilistic context-free grammars

 Discriminative (conditional) models take the data as
given, and put a probability over hidden structure
given the data:

 Logistic regression, conditional log-linear or maximum
entropy models, conditional random fields, (SVMs, …)

P(c,d)

P(c|d)

Bayes Net/Graphical Models

 Bayes net diagrams draw circles for random
variables, and lines for direct dependencies

 Some variables are observed; some are hidden
 Each node is a little classifier (conditional probability

table) based on incoming arcs

(Diagrams:
 Naive Bayes (Generative): class c has arrows out to observations d1, d2, d3.
 Logistic Regression (Discriminative): observations d1, d2, d3 have arrows into class c.)

Conditional models work well:
Word Sense Disambiguation

 Even with exactly the same
features, changing from
joint to conditional
estimation increases
performance

That is, we use the same smoothing and the same
word-class features; we just change the numbers
(parameters)

Objective      Training Set Accuracy    Test Set Accuracy
Joint Like.    86.8                     73.6
Cond. Like.    98.5                     76.1

(Klein and Manning 2002, using Senseval-1 Data)

Features

 In these slides and most MaxEnt work: features are
elementary pieces of evidence that link aspects of
what we observe d with a category c that we want to
predict.

A feature has a (bounded) real value: f: C × D → ℝ
Usually features specify an indicator function of
properties of the input and a particular class (every
one we present is). They pick out a subset.

fi(c, d) ≡ [Φ(d) ∧ c = cj]   [Value is 0 or 1]
We will freely say that Φ(d) is a feature of the data d,
when, for each cj, the conjunction Φ(d) ∧ c = cj is a
feature of the data-class pair (c, d).

Features
 For example:

f1(c, wi, ti) ≡ [c = "NN" ∧ islower(w0) ∧ ends(w0, "d")]
f2(c, wi, ti) ≡ [c = "NN" ∧ w−1 = "to" ∧ t−1 = "TO"]
f3(c, wi, ti) ≡ [c = "VB" ∧ islower(w0)]

 Models will assign each feature a weight
Empirical count (expectation) of a feature:

   E_empirical[fi] = ∑(c,d)∈observed(C,D) fi(c,d)

Model expectation of a feature:

   E[fi] = ∑(c,d)∈(C,D) P(c,d) fi(c,d)

Example contexts: "to aid" (TO NN), "in blue" (IN JJ), "to aid" (TO VB), "in bed" (IN NN)
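To make this concrete, here is a minimal Python sketch (my own illustration, not from the slides: the observed pairs and the uniform joint P(c,d) are invented) that implements two of the indicator features above and computes their empirical and model expectations.

```python
# Hypothetical toy data: (class, current word, previous tag) triples loosely
# modelled on the "to aid" / "in bed" contexts above.
observed = [("NN", "aid", "TO"), ("JJ", "blue", "IN"),
            ("VB", "aid", "TO"), ("NN", "bed", "IN")]

def f1(c, w, prev_tag):
    # [c = "NN" AND islower(w0) AND ends(w0, "d")]
    return 1 if c == "NN" and w.islower() and w.endswith("d") else 0

def f3(c, w, prev_tag):
    # [c = "VB" AND islower(w0)]
    return 1 if c == "VB" and w.islower() else 0

def empirical_expectation(f, pairs):
    # empirical E[f] = sum of f(c, d) over the observed (c, d) pairs
    return sum(f(*pair) for pair in pairs)

# An invented model joint P(c, d), uniform over the four pairs, for illustration.
model_joint = {pair: 0.25 for pair in observed}

def model_expectation(f, joint):
    # E[f] = sum over (c, d) of P(c, d) * f(c, d)
    return sum(p * f(*pair) for pair, p in joint.items())

print(empirical_expectation(f1, observed), model_expectation(f1, model_joint))  # 2 0.5
print(empirical_expectation(f3, observed), model_expectation(f3, model_joint))  # 1 0.25
```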

Feature-Based Models

 The decision about a data point is based only on
the features active at that point.

Example tasks, as (Data, Features, Label) triples:

Text Categorization
   Data:     BUSINESS: Stocks hit a yearly low …
   Features: {…, stocks, hit, a, yearly, low, …}
   Label:    BUSINESS

Word-Sense Disambiguation
   Data:     … to restructure bank:MONEY debt.
   Features: {…, P=restructure, N=debt, L=12, …}
   Label:    MONEY

POS Tagging
   Data:     DT JJ NN … / The previous fall …
   Features: {W=fall, PT=JJ, PW=previous}
   Label:    NN

Example: Text Categorization

(Zhang and Oles 2001)

 Features are a word in document and class (they do
feature selection to use reliable indicators)

 Tests on classic Reuters data set (and others)
 Naïve Bayes: 77.0% F1
 Linear regression: 86.0%
 Logistic regression: 86.4%
 Support vector machine: 86.5%

 Emphasizes the importance of regularization
(smoothing) for successful use of discriminative
methods (not used in most early NLP/IR work)

Example: POS Tagging

 Features can include:
 Current, previous, next words in isolation or together.
 Previous (or next) one, two, three tags.
 Word-internal features: word types, suffixes, dashes, etc.

Local Context (decision point at position 0):

   position:  -3    -2    -1     0      +1
   tag:       DT    NNP   VBD    ???    ???
   word:      The   Dow   fell   22.6   %

Features at the decision point:

   W0        22.6
   W+1       %
   W-1       fell
   T-1       VBD
   T-1-T-2   NNP-VBD
   hasDigit? true
   …         …

(Ratnaparkhi 1996; Toutanova et al. 2003, etc.)
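Here is a small Python sketch of extracting this kind of local-context feature set; the feature-name strings and the exact template set are my own illustration, not Ratnaparkhi's or Toutanova et al.'s actual templates.

```python
def local_features(words, tags, i):
    # Extract string-valued indicator features for the decision point i.
    feats = []
    feats.append("W0=" + words[i])
    if i + 1 < len(words):
        feats.append("W+1=" + words[i + 1])
    if i >= 1:
        feats.append("W-1=" + words[i - 1])
        feats.append("T-1=" + tags[i - 1])
    if i >= 2:
        feats.append("T-1-T-2=" + tags[i - 2] + "-" + tags[i - 1])
    feats.append("hasDigit?=" + str(any(ch.isdigit() for ch in words[i])))
    return feats

words = ["The", "Dow", "fell", "22.6", "%"]
tags  = ["DT", "NNP", "VBD", None, None]
print(local_features(words, tags, 3))
# ['W0=22.6', 'W+1=%', 'W-1=fell', 'T-1=VBD', 'T-1-T-2=NNP-VBD', 'hasDigit?=True']
```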

Other MaxEnt Examples

 Sentence boundary detection (Mikheev 2000)
 Is period end of sentence or abbreviation?

 PP attachment (Ratnaparkhi 1998)
 Features of head noun, preposition, etc.

 Language models (Rosenfeld 1996)
 P(w0|w-n,…,w-1). Features are word n-gram

features, and trigger features which model
repetitions of the same word.

 Parsing (Ratnaparkhi 1997; Johnson et al. 1999, etc.)
 Either: Local classifications decide parser actions

or feature counts choose a parse.

Conditional vs. Joint Likelihood

 A joint model gives probabilities P(c,d) and tries to
maximize this joint likelihood.
 It turns out to be trivial to choose weights: just

relative frequencies.
 A conditional model gives probabilities P(c|d). It takes

the data as given and models only the conditional
probability of the class.
 We seek to maximize conditional likelihood.
 Harder to do (as we’ll see…)
 More closely related to classification error.

Feature-Based Classifiers

 “Linear” classifiers:
 Classify from feature sets {fi} to classes {c}.

Assign a weight λi to each feature fi.
For a pair (c,d), features vote with their weights:

   vote(c) = ∑i λi fi(c,d)

Choose the class c which maximizes vote(c).
There are many ways to choose weights:

Perceptron: find a currently misclassified example, and
nudge weights in the direction of a correct classification

Example ("to aid", previous tag TO, weights 1.2, –1.8, 0.3):
   the NN features fire with weights 1.2 and –1.8 (vote –0.6),
   the VB feature fires with weight 0.3 (vote 0.3), so we choose VB.

Feature-Based Classifiers
Exponential (log-linear, maxent, logistic, Gibbs) models:

Use the linear combination ∑i λi fi(c,d) to produce a
probabilistic model:

   P(c|d, λ) = exp(∑i λi fi(c,d)) / ∑c' exp(∑i λi fi(c',d))

   (exp makes everything smooth and positive – but see also below;
    the denominator normalizes the votes)

P(NN|to, aid, TO) = e^1.2 e^–1.8 / (e^1.2 e^–1.8 + e^0.3) = 0.29
P(VB|to, aid, TO) = e^0.3 / (e^1.2 e^–1.8 + e^0.3) = 0.71

The weights are the parameters of the probability model,
combined via a "soft max" function

Given this model form, we will choose parameters {λi} that
maximize the conditional likelihood of the data according
to this model.
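A quick numerical check (a sketch of my own, reusing the example weights above) that the softmax turns the summed votes into the 0.29 / 0.71 probabilities:

```python
from math import exp

# Summed votes lambda_i * f_i(c, d) for the active features of each class.
votes = {"NN": 1.2 + (-1.8),   # both NN features fire
         "VB": 0.3}            # the single VB feature fires

Z = sum(exp(v) for v in votes.values())              # normalizer
probs = {c: exp(v) / Z for c, v in votes.items()}    # "soft max" of the votes

print(round(probs["NN"], 2), round(probs["VB"], 2))  # 0.29 0.71
```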

Other Feature-Based Classifiers

 The exponential model approach is one way of
deciding how to weight features, given data.

 It constructs not only classifications, but probability
distributions over classifications.

 There are other (good!) ways of discriminating
classes: SVMs, boosting, even perceptrons – though
these methods are not as trivial to interpret as
distributions over classes.

Comparison to Naïve-Bayes
 Naïve-Bayes is another tool for classification:

 We have a bunch of random variables (data
features) which we would like to use to
predict another variable (the class):

 The Naïve-Bayes likelihood over classes is:

   (Diagram: class c with arrows out to features φ1, φ2, φ3.)

   P(c|d) = P(c) ∏i P(φi|c) / ∑c' P(c') ∏i P(φi|c')

          = exp[ log P(c) + ∑i log P(φi|c) ] / ∑c' exp[ log P(c') + ∑i log P(φi|c') ]

          = exp[ ∑i λic fic(d,c) ] / ∑c' exp[ ∑i λic' fic'(d,c') ]

Naïve-Bayes is just an exponential model.
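As a sanity check, here is a small Python sketch (my own; the probability values anticipate the sensor example a few slides below) showing that the Naïve-Bayes posterior and the exponential-model form with λc = log P(c) and λic = log P(φi|c) give the same answer.

```python
from math import log, exp, prod

priors = {"r": 0.5, "s": 0.5}                    # P(c)
likes  = {("+", "r"): 0.75, ("+", "s"): 0.25}    # P(phi = + | c)
obs    = ["+", "+"]                              # two observed "+" features

def nb_posterior(c):
    # Classic Naive Bayes: prior times product of likelihoods, renormalized.
    joint = {c2: priors[c2] * prod(likes[(o, c2)] for o in obs) for c2 in priors}
    return joint[c] / sum(joint.values())

def expmodel_posterior(c):
    # The same model written in exponential form with log-probability weights.
    score = {c2: exp(log(priors[c2]) + sum(log(likes[(o, c2)]) for o in obs))
             for c2 in priors}
    return score[c] / sum(score.values())

print(nb_posterior("r"), expmodel_posterior("r"))   # both 0.9
```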

Comparison to Naïve-Bayes

 The primary differences between Naïve-Bayes
and maxent models are:

   Naïve-Bayes: Features assumed to supply independent evidence.
   Maxent:      Feature weights take feature dependence into account.

   Naïve-Bayes: Feature weights can be set independently.
   Maxent:      Feature weights must be mutually estimated.

   Naïve-Bayes: Features must be of the conjunctive Φ(d) ∧ c = ci form.
   Maxent:      Features need not be of this conjunctive form (but usually are).

   Naïve-Bayes: Trained to maximize joint likelihood of data and classes.
   Maxent:      Trained to maximize the conditional likelihood of classes.

Example: Sensors

NB FACTORS:
   P(s) = 1/2
   P(+|s) = 1/4
   P(+|r) = 3/4

Reality (joint over the two sensors and Raining/Sunny):
   P(+,+,r) = 3/8    P(-,-,r) = 1/8
   P(+,+,s) = 1/8    P(-,-,s) = 3/8

NB Model (Raining? with arrows to M1, M2) PREDICTIONS:
   P(r,+,+) = (1/2)(3/4)(3/4)
   P(s,+,+) = (1/2)(1/4)(1/4)
   P(r|+,+) = 9/10
   P(s|+,+) = 1/10
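A quick check in Python (my own addition) that the NB prediction above overshoots the posterior you get directly from the joint "Reality" table:

```python
# Joint "Reality" table over (sensor1, sensor2, weather).
reality = {("+", "+", "r"): 3/8, ("-", "-", "r"): 1/8,
           ("+", "+", "s"): 1/8, ("-", "-", "s"): 3/8}

# True posterior P(r | +,+) straight from the joint table.
p_r_true = reality[("+", "+", "r")] / (reality[("+", "+", "r")] + reality[("+", "+", "s")])

# Naive Bayes prediction with P(r)=1/2, P(+|r)=3/4, P(+|s)=1/4 over two sensors.
nb_r = 0.5 * 0.75 * 0.75
nb_s = 0.5 * 0.25 * 0.25
p_r_nb = nb_r / (nb_r + nb_s)

print(p_r_true)   # 0.75 : the two sensors are perfectly correlated
print(p_r_nb)     # 0.9  : NB counts the same evidence twice
```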


Example: Sensors
 Problem: NB multi-counts the evidence.

 Maxent behavior:
 Take a model over (M1,…Mn,R) with features:

fri: Mi=+, R=r   weight: λri
fsi: Mi=+, R=s   weight: λsi

exp(λri − λsi) is the factor analogous to P(+|r)/P(+|s)
… but instead of being 3, it will be 3^(1/n)

… because if it were 3, E[fri] would be far higher than
the target of 3/8!

 NLP problem: we often have overlapping features….

P r∣. . .
P s∣. . .

=
P  r 
P s 

P ∣r 
P ∣s 

. . .
P ∣r 
P ∣s 

Example: Stoplights

Reality:
   Lights Working: P(g,r,w) = 3/7    P(r,g,w) = 3/7
   Lights Broken:  P(r,r,b) = 1/7

NB Model: Working? with arrows to the NS and EW lights.

NB FACTORS:
   P(w) = 6/7     P(b) = 1/7
   P(r|w) = 1/2   P(r|b) = 1
   P(g|w) = 1/2   P(g|b) = 0

Example: Stoplights

 What does the model say when both lights are red?
 P(b,r,r) = (1/7)(1)(1) = 1/7 = 4/28
P(w,r,r) = (6/7)(1/2)(1/2) = 6/28
 P(w|r,r) = 6/10!

 We’ll guess that (r,r) indicates lights are working!

 Imagine if P(b) were boosted higher, to 1/2:
 P(b,r,r) = (1/2)(1)(1) = 1/2 = 4/8
P(w,r,r) = (1/2)(1/2)(1/2) = 1/8
 P(w|r,r) = 1/5!

 Changing the parameters bought conditional accuracy
at the expense of data likelihood!

Exponential Model Likelihood

 Maximum Likelihood (Conditional) Models :
 Given a model form, choose values of parameters

to maximize the (conditional) likelihood of the
data.

Exponential model form, for a data set (C,D):

   log P(C|D, λ) = ∑(c,d)∈(C,D) log P(c|d, λ)
                 = ∑(c,d)∈(C,D) log [ exp(∑i λi fi(c,d)) / ∑c' exp(∑i λi fi(c',d)) ]

Building a Maxent Model

 Define features (indicator functions) over data points.
 Features represent sets of data points which are distinctive

enough to deserve model parameters.
 Words, but also “word contains number”, “word ends with ing”

 Usually features are added incrementally to “target”
errors.

 For any given feature weights, we want to be able to
calculate:
 Data (conditional) likelihood
 Derivative of the likelihood wrt each feature weight

 Use expectations of each feature according to the model

 Find the optimum feature weights (MaxEnt).

Digression: Lagrange’s Method

Task: find the highest yellow point.

This is “constrained optimization.”

Digression: Lagrange’s Method

F(x,y): height of (x,y) on surface.
G(x,y): color of (x,y) on surface.
Maximize F(x,y) subject to constraint: G(x,y)=k.

Digression: Lagrange’s Method

Suppose G(x,y)-k=0 is given by an implicit function
y=f(x).
(We’re allowed to change coordinate systems, too.)
So we really want to maximize u(x)=F(x,f(x)).

Digression: Lagrange’s Method

Maximize F(x,f(x)), so we want du/dx = 0:

   du/dx = 0 = ∂F/∂x + (∂F/∂y)(df/dx)

We also know G(x,f(x)) − k = 0:

   ∂G/∂x + (∂G/∂y)(df/dx) = 0,   so   df/dx = − (∂G/∂x) / (∂G/∂y)

So:

   du/dx = [ (∂F/∂x)(∂G/∂y) − (∂F/∂y)(∂G/∂x) ] / (∂G/∂y) = 0

Let:

   −λ := (∂F/∂x) / (∂G/∂x) = (∂F/∂y) / (∂G/∂y)

Lagrange Multipliers

These constants are called Lagrange Multipliers.
They allow us to convert constrained optimization
problems into unconstrained optimization problems:

   −λ := (∂F/∂x) / (∂G/∂x) = (∂F/∂y) / (∂G/∂y)

   Λ(x, y; λ) = F(x,y) + λ G(x,y)

We don't actually care about Λ – we want its
derivatives to be 0:

   0 = ∂F/∂xi + λ ∂G/∂xi   for all i
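As a tiny worked example (the functions F and G here are my own choice, not from the slides), the same recipe can be run symbolically with sympy: maximize F(x,y) = −(x² + y²) subject to G(x,y) = x + y − 1 = 0.

```python
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
F = -(x**2 + y**2)          # objective (a hypothetical choice)
G = x + y - 1               # constraint G(x, y) = 0

Lagrangian = F + lam * G
# Set all partial derivatives of the Lagrangian to zero and solve.
solution = sp.solve([sp.diff(Lagrangian, v) for v in (x, y, lam)],
                    [x, y, lam], dict=True)
print(solution)   # [{x: 1/2, y: 1/2, lam: 1}]
```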

So what is/are G?

This generalizes to having multiple constraints –
use one Lagrange multiplier for each.

We’ll be searching over probability distributions
p instead of (x,y).
But what should our constraints be? Answer:

Up to the sensitivity of our feature representation, p
acts like what we see in our training data.

   Λ(x, y; λ) = F(x,y) + ∑j λj Gj(x,y)

   Gj:   E_p[fj] − E_p̃[fj] = 0

So what is F?

This generalizes to having multiple constraints –
use one Lagrange multiplier for each.

We’ll be searching over probability distributions
p instead of (x,y).

But what should we maximize as a function of p?
Answer…

   Λ(x, y; λ) = F(x,y) + ∑j λj Gj(x,y)

Maximize Entropy!

 Entropy: the uncertainty of a distribution.
 Quantifying uncertainty (“surprise”):

 Event x
 Probability px
 “Surprise” log(1/px)

Entropy: expected surprise (over p):

   H(p) = E_p[ log2 (1/px) ] = − ∑x px log2 px

(Plot: for a coin flip, H as a function of pHEADS is maximized at
 pHEADS = 1/2; a coin flip is most uncertain for a fair coin.)
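A minimal Python sketch (mine) of the entropy formula, confirming the fair coin is the most uncertain:

```python
from math import log2

def entropy(p):
    # H(p) = -sum_x p_x log2 p_x   (terms with p_x = 0 contribute 0)
    return -sum(px * log2(px) for px in p if px > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: the fair coin, maximal uncertainty
print(entropy([0.3, 0.7]))   # ~0.881 bits
print(entropy([0.9, 0.1]))   # ~0.469 bits
```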

Maximum Entropy Models

 Lots of distributions out there, most of them
very spiked, specific, overfit.

 We want a distribution which is uniform except
in specific ways we require.

 Uniformity means high entropy – we can search
for distributions which have properties we
desire, but also have high entropy.

Ignorance is preferable to error and he is less remote from
the truth who believes nothing than he who believes what

is wrong – Thomas Jefferson (1781)

Maxent Examples I

 What do we want from a distribution?
 Minimize commitment = maximize entropy.
 Resemble some reference distribution (data).

 Solution: maximize entropy H, subject to feature-
based constraints:

 Adding constraints (features):
 Lowers maximum entropy
 Raises maximum likelihood of data
 Brings the distribution further from uniform
 Brings the distribution closer to data

   E_p[fi] = E_p̃[fi]

(Coin plots: unconstrained, the entropy maximum is at pHEADS = 0.5;
 with the constraint pHEADS = 0.3, the feasible maximum moves to 0.3.)

Maxent Examples II
(Plots: the entropy surface H(pH, pT); the constraint pH + pT = 1;
 the constraint pH = 0.3; and the curve −x log x, which peaks at x = 1/e.)

Maxent Examples III

Let's say we have the following event space:

   NN    NNS   NNP   NNPS   VBZ   VBD

… and the following empirical data (counts):

   3     5     11    13     3     1

Maximize H: with no constraints at all, each cell goes to 1/e

   1/e   1/e   1/e   1/e    1/e   1/e

… but we want probabilities, so constrain E[NN+NNS+NNP+NNPS+VBZ+VBD] = 1:

   1/6   1/6   1/6   1/6    1/6   1/6

Maxent Examples IV
Too uniform!
N* are more common than V*, so we add the feature fN = {NN,
NNS, NNP, NNPS}, with E[fN] = 32/36:

   NN     NNS    NNP     NNPS    VBZ    VBD
   8/36   8/36   8/36    8/36    2/36   2/36

… and proper nouns are more frequent than common nouns, so
we add fP = {NNP, NNPS}, with E[fP] = 24/36:

   4/36   4/36   12/36   12/36   2/36   2/36

… we could keep refining the models, e.g. by adding a feature to
distinguish singular vs. plural nouns, or verb types.
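The final row can be verified numerically. The following sketch (my own check, using scipy's generic SLSQP solver rather than a maxent-specific algorithm) maximizes entropy subject to the two feature constraints and recovers approximately 4/36, 4/36, 12/36, 12/36, 2/36, 2/36.

```python
import numpy as np
from scipy.optimize import minimize

# Event space: NN, NNS, NNP, NNPS, VBZ, VBD
f_N = np.array([1, 1, 1, 1, 0, 0])   # noun feature
f_P = np.array([0, 0, 1, 1, 0, 0])   # proper-noun feature

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)       # avoid log(0) at the bounds
    return float(np.sum(p * np.log(p)))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},       # probabilities sum to 1
    {"type": "eq", "fun": lambda p: f_N @ p - 32 / 36},     # E[f_N] = 32/36
    {"type": "eq", "fun": lambda p: f_P @ p - 24 / 36},     # E[f_P] = 24/36
]

result = minimize(neg_entropy, np.full(6, 1 / 6), method="SLSQP",
                  bounds=[(0.0, 1.0)] * 6, constraints=constraints)
print(np.round(result.x * 36, 2))    # approximately [ 4.  4. 12. 12.  2.  2.]
```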

Digression: Jensen’s Inequality

“Convex”:     f(∑i wi xi) ≥ ∑i wi f(xi)   where ∑i wi = 1

Non-Convex:   the inequality can fail.

Convexity guarantees a single, global maximum because
any higher points are greedily reachable.

Convexity

Constrained H(p) = − ∑ x log x is
convex:
 − x log x is convex
 − ∑ x log x is convex (sum of
convex functions is convex).
 The feasible region of constrained

H is a linear subspace (which is
convex)

 The constrained entropy surface
is therefore convex.

 The maximum likelihood
exponential model (dual)
formulation is also convex.

The Kuhn-Tucker Theorem

When the components of this are convex, we can
find the optimal p and λ by first calculating

   p_λ = argmax_p Λ(p; λ)    with λ held constant, where

   Λ(p; λ) = H(p) + ∑j λj ( E_p[fj] − E_p̃[fj] ),

then solving the "dual:"

   λ* = argmax_λ Λ(p_λ, λ)

The optimal p is then p_λ*.

The Kuhn-Tucker Theorem

For us, there is an analytic solution to the first part:

   Λ(p; λ) = H(p) + ∑j λj ( E_p[fj] − E_p̃[fj] )

   p_λ(c|d) = exp(∑i λi fi(c,d)) / ∑c' exp(∑i λi fi(c',d))

So the only thing we have to do is find λ, given this.

Digression: Log-Likelihoods
The (log) conditional likelihood is a function of the iid data (C,D)
and the parameters λ:

   log P(C|D, λ) = log ∏(c,d)∈(C,D) P(c|d, λ) = ∑(c,d)∈(C,D) log P(c|d, λ)

If there aren't many values of c, it's easy to calculate:

   log P(C|D, λ) = ∑(c,d)∈(C,D) log [ exp(∑i λi fi(c,d)) / ∑c' exp(∑i λi fi(c',d)) ]

We can separate this into two components:

   log P(C|D, λ) = ∑(c,d)∈(C,D) log exp(∑i λi fi(c,d))  −  ∑(c,d)∈(C,D) log ∑c' exp(∑i λi fi(c',d))

   log P(C|D, λ) = N(λ) − M(λ)

The derivative is the difference between the derivatives of each component

LL Derivative I: Numerator

   ∂N(λ)/∂λi = ∂ [ ∑(c,d)∈(C,D) log exp(∑i λi fi(c,d)) ] / ∂λi

             = ∑(c,d)∈(C,D) ∂ [ ∑i λi fi(c,d) ] / ∂λi

             = ∑(c,d)∈(C,D) fi(c,d)

Derivative of the numerator is: the empirical count(fi, c)

LL Derivative II: Denominator

   ∂M(λ)/∂λi = ∂ [ ∑(c,d)∈(C,D) log ∑c' exp(∑i λi fi(c',d)) ] / ∂λi

   = ∑(c,d)∈(C,D)  [ 1 / ∑c'' exp(∑i λi fi(c'',d)) ]  ·  ∂ [ ∑c' exp(∑i λi fi(c',d)) ] / ∂λi

   = ∑(c,d)∈(C,D) ∑c'  [ 1 / ∑c'' exp(∑i λi fi(c'',d)) ]  ·  exp(∑i λi fi(c',d))  ·  ∂ [ ∑i λi fi(c',d) ] / ∂λi

   = ∑(c,d)∈(C,D) ∑c'  [ exp(∑i λi fi(c',d)) / ∑c'' exp(∑i λi fi(c'',d)) ]  ·  fi(c',d)

   = ∑(c,d)∈(C,D) ∑c' P(c'|d, λ) fi(c',d)  =  predicted count(fi, λ)

LL Derivative III

Our choice of constraint is vindicated: with our choice of p_λ,
these correspond to the stable equilibrium points of
the log conditional likelihood with respect to λ:

   ∂ log P(C|D, λ) / ∂λi = E_p̃[fi] − E_p_λ[fi]

The optimum distribution is:
 Always unique (but parameters may not be unique)
 Always exists (if feature counts are from actual data).

Fitting the Model

To find the parameters λ1, λ2, λ3, …,
write out the conditional log-likelihood of the
training data and maximize it:

   CLogLik(D) = ∑i=1..n log P(ci|di)

The log-likelihood is concave and has a single
maximum; use your favorite numerical
optimization package

Good large scale techniques: conjugate
gradient or limited memory quasi-Newton

Fitting the Model
Generalized Iterative Scaling

 A simple optimization algorithm which works
when the features are non-negative

We need to define a slack feature to make the
features sum to a constant over all considered
pairs from D × C.

Define

   M = max(i,c) ∑j=1..m fj(di, c)

Add new feature

   fm+1(d, c) = M − ∑j=1..m fj(d, c)

Generalized Iterative Scaling

Compute empirical expectation for all features:

   E_p̃[fj] = (1/N) ∑i=1..n fj(di, ci)

Initialize λj = 0,  j = 1 … m+1
Repeat:

   Compute feature expectations according to current model:

     E_p(t)[fj] = (1/N) ∑i=1..N ∑k=1..K P(ck|di) fj(di, ck)

   Update parameters:

     λj(t+1) = λj(t) + (1/M) log( E_p̃[fj] / E_p(t)[fj] )

Until converged
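A sketch of the procedure in Python (my own simplification: the data layout, the fixed iteration count in place of a convergence test, and the assumption that every feature fires at least once in the gold data are mine):

```python
import numpy as np

def gis(feats, gold, iters=100):
    """feats: (N, K, m) array of non-negative feature values f_j(d_i, c_k);
    gold: (N,) array of gold class indices."""
    N, K, m = feats.shape
    # Slack feature so every (d, c) pair has the same total feature count M.
    M = feats.sum(axis=2).max()
    slack = M - feats.sum(axis=2, keepdims=True)
    feats = np.concatenate([feats, slack], axis=2)            # now m+1 features

    # Empirical expectations; assumes every feature fires in the gold data.
    emp = feats[np.arange(N), gold].sum(axis=0) / N
    lam = np.zeros(m + 1)
    for _ in range(iters):
        scores = feats @ lam                                  # (N, K) summed votes
        probs = np.exp(scores - scores.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)             # P(c_k | d_i)
        model = (probs[:, :, None] * feats).sum(axis=(0, 1)) / N   # E_p(t)[f_j]
        lam += np.log(emp / model) / M                        # GIS update
    return lam
```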

Feature Overlap
 Maxent models handle overlapping features well.
 Unlike a NB model, there is no double counting!

Empirical counts:        Model, "All = 1":        Model, "A = 2/3":
      A   a                    A     a                  A     a
  B   2   1               B   1/4   1/4            B   1/3   1/6
  b   2   1               b   1/4   1/4            b   1/3   1/6

Adding the same feature "A" a second time changes nothing; the model
under the constraint A = 2/3 is still:

      A     a
  B   1/3   1/6
  b   1/3   1/6

Parameters: with one copy of the feature, the A-column cells carry weight λA;
with two copies they carry λ'A + λ''A, which just splits the same total weight.

Example: NER Overlap

   Feature Type       Feature   PERS    LOC
   Previous word      at        -0.73   0.94
   Current word       Grace     0.03    0.00
   Beginning bigram