The Basics of Logistic Regression
ANLP: Week 7, Unit 2
Shay Cohen
Based on slides from ANLP 2019
Building a classifier for next actions
We said:
Probabilistic parser assumes we also have a model that tells us
P (action|configuration). Where does that come from?
Training data
Our goal is:
Given (features from) the current configuration, predict the next action.
Our corpus contains annotated sentences such as:
[Dependency tree for “A hearing on the issue is scheduled today”, with arcs labelled SBJ, ROOT, PC, ATT, VC, and TMP.]
Is this sufficient to train a classifier to achieve our goal?
Classification for action prediction
We’ve seen text classification:
Given (features from) text document, predict the class it
belongs to.
Generalized classification task:
Given features from observed data, predict one of a set of classes (labels).
Here, actions are the labels to predict:
Given (features from) the current configuration, predict the next action.
Creating the right training data
Well, not quite. What we need is a sequence of the correct (configuration, action) pairs.
Problem: some sentences may have more than one possible sequence that yields the correct parse. (see tutorial exercise)
Solution: JM3 describes rules to convert each annotated sentence to a unique sequence of (configuration, action) pairs.1
OK, finally! So what kind of model will we train?
1This algorithm is called the training oracle. An oracle is a fortune-teller, and in NLP it refers to an algorithm that always provides the correct answer. Oracles can also be useful for evaluating certain aspects of NLP systems, and we may say a bit more about them later.
Logistic regression
Actually, we could use any kind of classifier (Naive Bayes, SVM, neural net…)
Logistic regression is a standard approach that illustrates a different type of model: a discriminative probabilistic model.
So far, all our models have been generative.
Even if you have seen it before, the formulation often used in
NLP is slightly different from what you might be used to.
Generative probabilistic models
Model the joint probability P(⃗x,⃗y)
⃗x: the observed variables (what we’ll see at test time).
⃗y: the latent variables (not seen at test time; must predict).
Model: ⃗x (observed), ⃗y (latent)
Naive Bayes: features, classes
HMM: words, tags
PCFG: words, tree
Generative models have a “generative story”
a probabilistic process that describes how the data were created
Multiplying probabilities of each step gives us P(⃗x,⃗y).
Naive Bayes: For each item i to be classified (e.g., a document),
Generate its class ci (e.g., sport)
Generate its features fi1 . . . fin conditioned on ci (e.g., ball, goal, Tuesday)
Result: P(⃗c, f⃗) = ∏i P(ci) ∏j P(fij | ci)
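As a concrete illustration (not from the slides), a minimal Python sketch of this generative story with made-up probability tables; it multiplies P(c) by the conditional feature probabilities (in log space for stability):

```python
import math

# Toy probability tables (illustrative values only).
p_class = {"sport": 0.3, "politics": 0.7}
p_feat_given_class = {
    "sport":    {"ball": 0.4, "goal": 0.3, "tuesday": 0.3},
    "politics": {"ball": 0.1, "goal": 0.1, "tuesday": 0.8},
}

def naive_bayes_joint(c, feats):
    """P(c, f_1..f_n) = P(c) * prod_j P(f_j | c), computed in log space."""
    logp = math.log(p_class[c])
    for f in feats:
        logp += math.log(p_feat_given_class[c][f])
    return math.exp(logp)

print(naive_bayes_joint("sport", ["ball", "goal", "tuesday"]))
```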
Other generative stories
HMM: For each position i in the sentence,
Generate its tag ti conditioned on the previous tag ti−1
Generate its word wi conditioned on ti
PCFG:
Starting from the S node, recursively generate children for each phrasal category ci conditioned on ci, until all unexpanded nodes are pre-terminals (tags).
For each pre-terminal ti, generate a word wi conditioned on ti.
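A corresponding sketch for the HMM story, again with made-up transition and emission tables; each step multiplies in a transition P(ti | ti−1) and an emission P(wi | ti):

```python
import math

# Toy transition and emission tables (illustrative values only).
trans = {("<s>", "DT"): 0.6, ("DT", "NN"): 0.7, ("NN", "VB"): 0.4}
emit = {("DT", "the"): 0.5, ("NN", "dog"): 0.01, ("VB", "barks"): 0.02}

def hmm_joint(words, tags):
    """P(words, tags) = prod_i P(t_i | t_{i-1}) * P(w_i | t_i)."""
    logp = 0.0
    prev = "<s>"
    for w, t in zip(words, tags):
        logp += math.log(trans[(prev, t)]) + math.log(emit[(t, w)])
        prev = t
    return math.exp(logp)

print(hmm_joint(["the", "dog", "barks"], ["DT", "NN", "VB"]))
```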
Discriminative probabilistic models
Model P(⃗y|⃗x) directly
No model of P(⃗x,⃗y), no generative story, no Bayes’ rule.
One big advantage: we can use arbitrary features and don’t have to make strong independence assumptions.
But: unlike generative models, we can’t get P(⃗x) = Σ⃗y P(⃗x,⃗y).
Inference in generative models
At test time, given only ⃗x, infer ⃗y using Bayes’ rule:
P(⃗y|⃗x) = P(⃗x|⃗y)P(⃗y) / P(⃗x)
So, notice we actually model P(⃗x,⃗y) as P(⃗x|⃗y)P(⃗y). You can confirm this for each of the previous models.
Discriminative models more broadly
Trained to discriminate right vs. wrong values of ⃗y, given input ⃗x.
Need not be probabilistic.
Examples: support vector machines, (some) neural networks, decision trees, nearest neighbor methods.
Here, we consider only multinomial logistic regression models, which are probabilistic.
multinomial means more than two possible classes
otherwise (or if lazy) just logistic regression
In NLP, also known as Maximum Entropy (or MaxEnt)
models.
Example task: word sense disambiguation
Remember, logistic regression can be used for any classification task.
The following slides use an example from lexical semantics:
Given a word with different meanings (senses), can we classify which sense is intended?
I visited the Ford plant yesterday.
The farmers plant soybeans in spring.
This plant produced three kilos of berries.
Defining a MaxEnt model: intuition
Start by defining a set of features that we think are likely to help discriminate the classes. E.g.,
the POS of the target word
the words immediately preceding and following it
other words that occur in the document
During training, the model will learn how much each feature contributes to the final decision.
WSD as example classification task
Disambiguate three senses of the target word plant
⃗x are the words and POS tags in the document the target word occurs in
y is the latent sense. Assume three possibilities:
y = 1: Noun: a member of the plant kingdom
y = 2: Verb: to place in the ground
y = 3: Noun: a factory
We want to build a model of P(y|⃗x).
Defining a MaxEnt model
Features fi(⃗x,y) depend on both observed and latent variables. E.g., if tgt is the target word:
f1: POS(tgt)=NN&y=1
f2: POS(tgt)=NN&y=2
f3: preceding word(tgt) = ‘chemical’ & y = 3
f4: doc contains(‘animal’) & y = 1
Each feature fi has a real-valued weight wi (learned in training).
P(y|⃗x) is a monotonic function of w⃗·f⃗ (that is, Σi wi fi(⃗x,y)).
Example of features and weights
Let’s look at just two features from the plant disambiguation example:
f1: POS(tgt)=NN & y=1
f2: POS(tgt)=NN & y=2
Our classes are:
{1: member of plant kingdom; 2: put in ground; 3: factory}
Our example doc (⃗x):
[… animal/NN … chemical/JJ plant/NN …]
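To make the feature definitions concrete, here is a small sketch (purely illustrative; the dictionary-based representation of ⃗x is an assumption, not from the slides) that computes f1–f4 for the example document:

```python
def features(x, y):
    """Indicator features f1..f4 from the slides; x describes the context, y is a sense (1-3)."""
    return {
        "f1": int(x["pos_tgt"] == "NN" and y == 1),
        "f2": int(x["pos_tgt"] == "NN" and y == 2),
        "f3": int(x["prev_word"] == "chemical" and y == 3),
        "f4": int("animal" in x["doc_words"] and y == 1),
    }

# The example doc: [... animal/NN ... chemical/JJ plant/NN ...]
x = {"pos_tgt": "NN", "prev_word": "chemical", "doc_words": {"animal", "chemical", "plant"}}
print(features(x, 1))   # {'f1': 1, 'f2': 0, 'f3': 0, 'f4': 1}
print(features(x, 2))   # {'f1': 0, 'f2': 1, 'f3': 0, 'f4': 0}
```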
Defining a MaxEnt model
Each feature fi has a real-valued weight wi (learned in training).
P(y|⃗x) is a monotonic function of w⃗·f⃗ (that is, Σi wi fi(⃗x,y)).
To make P(y|⃗x) large, we need weights that make w⃗ ·f⃗ large.
Two cases to consider
Computing P(y = 1|⃗x):
Here, f1 = 1 and f2 = 0.
We would expect the probability to be relatively high.
Can be achieved by having a positive value for w1.
Since f2 = 0, its weight has no effect on the final probability.
Computing P(y = 2|⃗x):
Here, f1 = 0 and f2 = 1.
We would expect the probability to be close to zero, because
sense 2 is a verb sense, and here we have a noun.
Can be achieved by having a large negative value for w2.
By doing so, f2 says: “If I am active, do not choose sense 2!”.
Which features are active?
Example doc:
[… animal/NN … chemical/JJ plant/NN …]
P(y = 1|⃗x) will have f1, f4 = 1 and f2, f3 = 0
P(y = 2|⃗x) will have f2 = 1 and f1, f3, f4 = 0
P(y = 3|⃗x) will have f3 = 1 and f1, f2, f4 = 0
Notice that zero-valued features have no effect on the final probability
The active features are multiplied by their weights, summed, then exponentiated.
Classification with MaxEnt
Choose the class that has highest probability according to
P(y|⃗x) = (1/Z) exp(Σi wi fi(⃗x,y))
where
exp(x) = e^x (the monotonic function)
Σi wi fi(⃗x,y) is the dot product of w⃗ and f⃗, also written w⃗ · f⃗
the normalization constant Z = Σy′ exp(Σi wi fi(⃗x,y′))
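A minimal sketch of this classification rule, reusing the plant features f1–f4 (repeated here so the block is self-contained) and some hand-picked weights; both are illustrative assumptions:

```python
import math

def features(x, y):
    """Indicator features f1..f4 from the plant example (x: context dict, y: sense 1-3)."""
    return {
        "f1": int(x["pos_tgt"] == "NN" and y == 1),
        "f2": int(x["pos_tgt"] == "NN" and y == 2),
        "f3": int(x["prev_word"] == "chemical" and y == 3),
        "f4": int("animal" in x["doc_words"] and y == 1),
    }

def maxent_probs(x, classes, weights):
    """P(y|x) = (1/Z) exp(sum_i w_i f_i(x,y)),  Z = sum_y' exp(sum_i w_i f_i(x,y'))."""
    scores = {y: sum(weights[name] * val for name, val in features(x, y).items())
              for y in classes}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

# Hand-picked weights, purely illustrative.
weights = {"f1": 1.5, "f2": -5.0, "f3": 0.5, "f4": 1.0}
x = {"pos_tgt": "NN", "prev_word": "chemical", "doc_words": {"animal", "plant"}}
probs = maxent_probs(x, [1, 2, 3], weights)
print(probs, max(probs, key=probs.get))  # with these weights, sense 1 gets the highest probability
```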
Feature templates
In practice, features are usually defined using templates
POS(tgt)=t & y
preceding word(tgt)=w & y
doc contains(w) & y
instantiate with all possible POSs t or words w and classes y
usually filter out features occurring very few times
templates can also define real-valued or integer-valued features
NLP tasks often have a few templates, but 1000s or 10000s of features
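A sketch of instantiating one template over a toy training set and filtering rare instantiations; the helper name, data format, and threshold are hypothetical:

```python
from collections import Counter

def instantiate_pos_template(training_data, min_count=2):
    """Instantiate the template 'POS(tgt)=t & y' from the (context, class) pairs
    seen in training, keeping only instantiations that fire at least min_count times."""
    counts = Counter()
    for x, y in training_data:               # x: context dict, y: gold class
        counts["POS(tgt)=%s & y=%d" % (x["pos_tgt"], y)] += 1
    return [name for name, c in counts.items() if c >= min_count]

# Tiny illustrative training set.
data = [({"pos_tgt": "NN"}, 1), ({"pos_tgt": "NN"}, 1),
        ({"pos_tgt": "VB"}, 2), ({"pos_tgt": "NN"}, 3)]
print(instantiate_pos_template(data))        # ['POS(tgt)=NN & y=1']
```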
Features for dependency parsing
We want the model to tell us P(action|configuration).
So y is the action, and ⃗x is the configuration.
Features are various combinations of words/tags from the stack and input buffer.
The Logistic Regression Model
Decide on some features that associate certain ⃗x with certain y
Uses exp(z) to make all values of dot-product between weights and features positive
P(y|⃗x) = exp(Σi wi fi(⃗x,y)) / Σy′ exp(Σi wi fi(⃗x,y′))
We divide by the sum of the exp-dot-product values over all y′ so that Σy P(y|⃗x) = 1
Summary
We’ve discussed
Dependency annotations, heads, and constituency → dependency conversion.
Transition-based dependency parsing.
Training data and evaluation for probabilistic dependency
parsing.
On Thursday, details of the model for P(action|configuration).
MaxEnt for n-best re-ranking
We can also use MaxEnt for reranking an n-best list. Example scenario (Charniak and Johnson, 2005)
Use a generative parsing model M with beam search to produce a list of the top n parses for each sentence.
(= most probable according to M)
Use a MaxEnt model M′ to re-rank those n parses, then pick the most probable according to M′.
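A sketch of the second stage under these assumptions: each candidate parse arrives with a feature dictionary (including its log probability under a hypothetical first-stage model M), and the re-ranker simply picks the candidate with the highest w⃗ · f⃗ score:

```python
def rerank(nbest, weights):
    """Pick the candidate with the highest w . f(x, y) score among the n-best parses.
    For choosing a single best parse, the normalisation constant Z is never needed."""
    def score(candidate):
        return sum(weights.get(name, 0.0) * val
                   for name, val in candidate["features"].items())
    return max(nbest, key=score)

# Hypothetical 2-best list: each candidate carries a few features, including
# its log probability under the first-stage generative model M.
nbest = [
    {"tree": "(NP (JJ healthy) (NP dogs and cats))",
     "features": {"logprob_M": -10.2, "coord_len_diff": 0.0}},
    {"tree": "(NP (NP (JJ healthy) dogs) (CC and) (NP cats))",
     "features": {"logprob_M": -9.8, "coord_len_diff": 4.0}},
]
weights = {"logprob_M": 1.0, "coord_len_diff": -0.2}   # illustrative weights
print(rerank(nbest, weights)["tree"])
```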
MaxEnt for n-best re-ranking
So far, we’ve used logistic regression for classification. Fixed set of classes, same for all inputs.
Word sense disambiguation:
Input: word in doc1 → possible outputs: sense 1, sense 2, sense 3
Input: word in doc2 → possible outputs: sense 1, sense 2, sense 3
Dependency parsing:
Input: parser config1 → possible outputs: action 1, . . . , action n
Input: parser config2 → possible outputs: action 1, . . . , action n
Why do it this way?
Why two stages?
Generative models typically faster to train and run, but can’t use arbitrary features.
In NLP, MaxEnt models may have so many features that extracting them from each example can be time-consuming, and training is even worse (see next lecture).
Why are the features a function of both inputs and outputs?
Because for re-ranking this matters: the set of possible outputs is not pre-defined, so features must be computed from the input together with each candidate output.
MaxEnt for n-best re-ranking
In reranking scenario, the options depend on the input. E.g., parsing, with n = 2:
Input: healthy dogs and cats
Possible outputs:
[Two candidate parse trees for the NP “healthy dogs and cats”: one where JJ “healthy” modifies the coordinated NP “dogs and cats”, and one where “healthy” modifies only “dogs”, which is then coordinated with NP “cats”.]
MaxEnt for constituency parsing
Now we have y = parse tree, x = sentence.
Features can mirror parent-annotated/lexicalized PCFG:
counts of each CFG rule used in y
pairs of words in head-head dependency relations in y
each word in x with its parent and grandparent categories in y. Note these are no longer binary features.
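A sketch of extracting the first kind of feature (rule counts), assuming trees are represented as nested (label, children...) tuples rather than any particular parser's data structure:

```python
from collections import Counter

def rule_counts(tree):
    """Count the CFG rules used in a tree given as nested tuples
    (label, child, child, ...), where the child of a pre-terminal is a string."""
    label, *children = tree
    counts = Counter()
    if all(isinstance(c, str) for c in children):
        counts[label + " -> " + " ".join(children)] += 1   # pre-terminal rule
        return counts
    counts[label + " -> " + " ".join(c[0] for c in children)] += 1
    for child in children:
        counts.update(rule_counts(child))
    return counts

tree = ("NP", ("JJ", "healthy"), ("NP", ("NNS", "dogs"), ("CC", "and"), ("NNS", "cats")))
print(rule_counts(tree))
# e.g. Counter({'NP -> JJ NP': 1, 'JJ -> healthy': 1, 'NP -> NNS CC NNS': 1, ...})
```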
MaxEnt for n-best re-ranking
In reranking scenario, the options depend on the input. E.g., parsing, with n = 2:
Input: ate pizza with cheese
Possible outputs:
[Two candidate parse trees for the VP “ate pizza with cheese”: one where the PP “with cheese” attaches to the NP “pizza”, and one where it attaches to the VP.]
Global features
Features can also capture global structure. E.g., from Charniak and Johnson (2005):
length difference of coordinated conjuncts (NP vs NP)
[The two candidate parse trees of “healthy dogs and cats” from earlier, whose coordinated NPs differ in length.]
Features for parsing
Altogether, Charniak and Johnson (2005) use 13 feature templates
with a total of 1,148,697 features
and that is after removing features occurring fewer than five times
One important feature not mentioned earlier: the log prob of the parse under the generative model!
So, how does it do?
Parser performance
F1-measure (from precision/recall on constituents) on WSJ test:
standard PCFG: ∼80%1
lexicalized PCFG (Charniak, 2000): 89.7%
re-ranked LPCFG (Charniak and Johnson, 2005): 91.0%
A recent WSJ parser reaches 93.8%, combining neural networks with ideas from parsing and language modelling (Choe et al., 2016).
But as discussed earlier, other languages/domains are still much worse.
1Figure from Charniak (1996): assumes POS tags as input.
Evaluating during development
Whenever we have a multistep system, worth asking: where should I put my effort to improve the system?
If my first stage (generative model) is terrible, then n needs to be very large to ensure it includes the correct parse.
Worst case: if computation is limited (n is small), maybe the correct parse isn’t there at all.
Then it doesn’t matter how good my second stage is, I won’t get the right answer.
Another use of oracles
Can be useful to compute oracle performance on the first stage.
Oracle always chooses the correct parse if it is available.
Difference between oracle and real system = how much better it could get by improving the 2nd stage model.
If oracle performance is very low, need to increase n or improve the first stage model.
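A sketch of an exact-match oracle over the first stage; a fuller evaluation might instead score the best available candidate by F1 rather than requiring exact equality:

```python
def oracle_accuracy(nbest_lists, gold_parses):
    """Fraction of sentences whose gold parse appears anywhere in the n-best list.
    An oracle second stage would get exactly these sentences right."""
    hits = sum(1 for nbest, gold in zip(nbest_lists, gold_parses) if gold in nbest)
    return hits / len(gold_parses)

# Toy example: parses represented as bracketed strings.
nbest_lists = [["(S a)", "(S b)"], ["(S c)", "(S d)"], ["(S e)"]]
gold_parses = ["(S b)", "(S x)", "(S e)"]
print(oracle_accuracy(nbest_lists, gold_parses))   # 2/3 ≈ 0.667
```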
Training the model
Two ways to think about training:
What is the goal of training (training objective)?
How do we achieve that goal (training algorithm)?
How do we use the weights in practice?
(A question asked in a previous class.) The following (over-)simplified procedure chooses the best y according to the model, assuming a small set of labels.
Given an x:
For each y and each i, calculate fi(⃗x,y)
For each y, calculate Σi wi fi(⃗x,y)
Choose the y with the highest score:
y∗ = argmaxy Σi wi fi(⃗x,y)
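A sketch of that procedure; since exp is monotonic and Z is the same for every y, the unnormalised score is enough for prediction:

```python
def predict(x, classes, weights, feat_fn):
    """y* = argmax_y sum_i w_i f_i(x, y).
    exp() is monotonic and Z is the same for every y, so neither affects the argmax."""
    def score(y):
        return sum(weights.get(name, 0.0) * val for name, val in feat_fn(x, y).items())
    return max(classes, key=score)

# Tiny illustration with one (hypothetical) feature per class.
feat_fn = lambda x, y: {"bias&y=%d" % y: 1.0}
weights = {"bias&y=1": 0.2, "bias&y=2": 1.3, "bias&y=3": -0.4}
print(predict(None, [1, 2, 3], weights, feat_fn))   # 2
```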
Training generative models
Easy to think in terms of how: counts/smoothing.
But don’t forget the what:
What: maximize the likelihood; other objectives2
How: take raw counts and normalize; use smoothed counts
2Historically, smoothing methods were originally introduced purely as how: that is, without any particular justification as optimizing some objective function. However, as alluded to earlier, it was later discovered that many of these smoothing methods correspond to optimizing Bayesian objectives. So the what was discovered after the how.
Training logistic regression
Possible training objective:
Given annotated data, choose weights that make the labels most probable under the model.
That is, given items x(1) . . . x(N) with labels y(1) . . . y(N),
choose
wˆ = argmaxw⃗ Σj log P(y(j)|x(j))
This is conditional maximum likelihood estimation
(CMLE).
Optimizing (regularized) cond. likelihood
Unlike generative models, we can’t simply count and normalize.
Instead, we use gradient-based methods, which iteratively update the weights.
Our objective is a function whose value depends on the weights.
So, compute the gradient (derivative) of the function with respect to the weights.
Update the weights to move toward the optimum of the objective function.
Regularization
Like MLE for generative models, CMLE can overfit training data.
For example, if some particular feature combination is only active for a single training example.
So, add a regularization term to the equation
encourages weights closer to 0 unless lots of evidence otherwise.
various methods; see JM3 or ML texts for details (optional).
In practice it may require some experimentation (dev set!) to choose which method and how strongly to penalize large weights.
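A minimal sketch of CMLE with an L2 regularizer, trained by batch gradient ascent in numpy; the dense feature representation, learning rate, and penalty strength are all illustrative choices, not the specific setup described in JM3:

```python
import numpy as np

def train_maxent(data, classes, feat, dim, lr=0.1, reg=0.01, iters=200):
    """Batch gradient ascent on the L2-regularised conditional log-likelihood:
       J(w) = sum_j [ w.f(x_j,y_j) - log sum_y' exp(w.f(x_j,y')) ] - (reg/2)||w||^2
    feat(x, y) returns a length-`dim` numpy vector."""
    w = np.zeros(dim)
    for _ in range(iters):
        grad = -reg * w                       # gradient of the regulariser
        for x, y in data:
            feats = {c: feat(x, c) for c in classes}
            scores = np.array([w @ feats[c] for c in classes])
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()              # P(y'|x) for each class
            expected = sum(p * feats[c] for p, c in zip(probs, classes))
            grad += feats[y] - expected       # observed minus expected features
        w += lr * grad                        # move uphill on the objective
    return w

# Toy problem: x is a 2-d point, features conjoin x with the class (illustrative).
def feat(x, y):
    v = np.zeros(4)
    v[2 * y: 2 * y + 2] = x
    return v

data = [(np.array([1.0, 0.0]), 0), (np.array([0.0, 1.0]), 1)]
print(train_maxent(data, classes=[0, 1], feat=feat, dim=4))
```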
Visual intuition
Changing w⃗ changes the value of the objective function.3
Follow the gradients to optimize the objective (“hill-climbing”).
3Here, we are maximizing an objective such as log prob. Using an objective such as negative log prob would require minimizing; in this case the objective function is also called a loss function.
But what if…?
If there are multiple local optima, we won’t be guaranteed to find the global optimum.
Guarantees
Luckily, (supervised) logistic regression does not have this problem.
With or without standard regularization, the objective has a single global optimum.
Good: results more reproducible, don’t depend on initialization.
But it is worth worrying about in general!
Unsupervised learning often has this problem (e.g. for HMMs, PCFGs, and logistic regression); so do neural networks.
Bad: results may depend on initialization, can vary from run to run.
Logistic regression: summary
model P(y|x) only, have no generative process
can use arbitrary local/global features, including correlated ones
can use for classification, or choosing from n-best list.
training involves iteratively updating the weights, so typically slower than for generative models (especially if very many features, or if time-consuming to extract).
training objective has a single global optimum.
Similar ideas can be used for more complex models, e.g. sequence models for taggers that use spelling features.
Extension to neural network
Logistic regression can be viewed as a building block of neural networks (a perceptron).
Pictorially: [figure: logistic regression drawn as a single unit that combines weighted inputs and applies a non-linearity.]
Extension to neural network
Adding a fully-connected layer creates one of the simplest types of neural network: a multi-layer perceptron (MLP).
Key features of MLP
Contains one or more hidden layers
A non-linear classifier: more powerful than logistic regression.
Also trained using gradient-based methods, but vulnerable to local optima, so can be more difficult to train.
Like other NNet architectures, really just a complex function computed by multiplying weights by inputs and passing through non-linearities. Not magic.
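A sketch of the forward pass of a one-hidden-layer MLP in numpy (random weights, ReLU as the example non-linearity), just to underline that it is weighted sums and non-linearities:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One hidden layer: h = relu(W1 x + b1) is a learned representation of x;
    the output layer is essentially multinomial logistic regression on h."""
    h = np.maximum(0.0, W1 @ x + b1)          # hidden layer with ReLU non-linearity
    scores = W2 @ h + b2                      # one score per class
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                    # softmax: P(y | x)

rng = np.random.default_rng(0)
x = rng.normal(size=5)                         # 5 input features
W1, b1 = rng.normal(size=(8, 5)), np.zeros(8)  # 8 hidden units
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)  # 3 output classes
print(mlp_forward(x, W1, b1, W2, b2))
```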
Key features of MLP
Contains one or more hidden layers
Each node applies a non-linear function to the sum of its inputs
Hidden layers can be viewed as learned representations
(embeddings) of the input
Recall that word embeddings represent words as vectors,
such that similar words have similar vectors.
(Actually, even basic logistic regression can produce word embeddings: see next week.)
Summary
Logistic regression: set features, set weights, compute dot product, exponentiate, normalise
Discussed what to do when labels are not fixed
Training is done using gradient descent techniques
Logistic regression is a simple case of a neural network (the perceptron)