COMP5046
Natural Language Processing
Lecture 9: Named Entity Recognition and Coreference Resolution
Dr. Caren Han
Semester 1, 2021
School of Computer Science, University of Sydney
LECTURE PLAN
Lecture 9: Named Entity Recognition and Coreference Resolution
1. Information Extraction
2. Named Entity Recognition (NER) and Evaluation
3. Traditional NER
4. Sequence Model for NER
5. Coreference Resolution
6. Coreference Model
7. Coreference Evaluation
8. Preview
Information Extraction
“The task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents”
Here are some questions:
• How to allow computation to be done on the unstructured data
• How to extract clear, factual information
• How to put in a semantically precise form that allows further inferences to be made by computer algorithms
Information Extraction
How to extract structured, clear, factual information
• Find and understand limited relevant parts of texts
• Gather information from many pieces of text
• Produce a structured representation of relevant information: relations (in the database sense) or a knowledge base
“5W1H”: who, what, where, when, why, how
Information Extraction
How to extract structured, clear, factual information
Subject     Relation     Object
Sydney      IS-A         Capital of New South Wales
Sydney      IS-A         One of Australia's largest cities
Sydney      KNOWN FOR    Sydney Opera House
…           …            …

Extracting: a textual abstract is a summary for humans; structured information is a summary for machines.
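To make the idea of a structured representation concrete, here is a minimal sketch (illustrative only; the entities and relations mirror the table above, and the query helper is an assumption, not part of the lecture) that stores extracted facts as (subject, relation, object) triples:

```python
# Minimal sketch: storing extracted facts as (subject, relation, object) triples.
# The triples mirror the table above; the query helper is illustrative only.
triples = [
    ("Sydney", "IS-A", "Capital of New South Wales"),
    ("Sydney", "IS-A", "One of Australia's largest cities"),
    ("Sydney", "KNOWN-FOR", "Sydney Opera House"),
]

def query(subject, relation):
    """Return all objects linked to `subject` by `relation`."""
    return [o for s, r, o in triples if s == subject and r == relation]

print(query("Sydney", "IS-A"))
# ['Capital of New South Wales', "One of Australia's largest cities"]
```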
Information Extraction Information Extraction Pipeline with NLP
[Pipeline figure: the NLP stack applied bottom-up (Tokenisation → Stemming → PoS Tagging → Entity Extraction → Coreference Resolution → Sentiment Analysis), feeding parsing and understanding. Example: “I love my cats” is tokenised ([I] [love] [my] [cats]), stemmed ([I] [love] [my] [cat]), tagged ([I/JJ] [love/VBP] [my/PRP] [cats/NNS]) and scored for sentiment (positive: 90.10%, neutral: 4.70%, negative: 5.10%); “Caren loves cats, and she likes playing with them” is used for entity extraction and coreference resolution.]
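As a rough illustration of such a pipeline, the sketch below uses spaCy (not part of the lecture; the model name en_core_web_sm is an assumption) to run tokenisation, PoS tagging and entity extraction in one call:

```python
# Illustrative sketch of an NLP pipeline using spaCy (assumed installed:
# pip install spacy && python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")          # tokeniser, tagger, parser, NER in one pipeline
doc = nlp("Caren loves cats, and she likes playing with them.")

for token in doc:
    print(token.text, token.pos_, token.lemma_)   # tokenisation, PoS tagging, lemmatisation

for ent in doc.ents:
    print(ent.text, ent.label_)                   # entity extraction, e.g. Caren -> PERSON
```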
Named Entity Recognition (NER) What is Named Entity Recognition?
“The subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.”
Why recognise Named Entities?
• Named entities can be indexed, linked off, etc.
• Sentiment can be attributed to companies or products
• A lot of relations are associations between named entities
• For question answering, answers are often named entities.
Named Entity Recognition (NER)
How to recognize Named Entities?
Identify and classify names in text
• The University of Sydney (informally USYD, Sydney, Sydney Uni) is an Australian public research university in Sydney, Australia. Founded in 1850, it was Australia’s first university and is regarded as one of the world’s leading universities. (Wikipedia, University of Sydney)
Different types of named entity classes
Type Classes
3 class Location, Person, Organization
4 class Location, Person, Organization, Misc
7 class Location, Person, Organization, Money, Percent, Date, Time
*Classes can differ depending on the annotated dataset
Named Entity Recognition (NER) How to recognize Named Entities?
Identify and classify names in text
Upenn CogComp-NLP http://macniece.seas.upenn.edu:4004/
Stanford CoreNLP 3.9.2 http://nlp.stanford.edu:8080/corenlp/process
Named Entity Recognition (NER) How to evaluate the NER performance?
The goal: predicting entities in a text *Standard evaluation is per entity, not per token
           Caren  Soyeon  Han  is  working  at  Google  at  Sydney  Australia
gold       PER    PER     PER  O   O        O   ORG     O   LOC     LOC
predicted  O      O       O    O   O        O   ORG     O   LOC     LOC
Named Entity Recognition (NER)
How to evaluate the NER performance? Precision and recall
[Figure: Venn diagram of PERSON vs NOT PERSON entities, with the region detected as ‘PERSON’.]

Precision = (detected as ‘PERSON’ correctly) / (total number detected as ‘PERSON’)
Recall    = (detected as ‘PERSON’ correctly) / (total number of actual ‘PERSON’ entities)
Named Entity Recognition (NER)
How to evaluate the NER performance? Precision and recall
True positives: The ‘PERSON’s that the model detected as ‘PERSON’
False positives: The NOT ‘PERSON’s that the model detected as ‘PERSON’
False negatives: The ‘PERSON’s that the model detected as NOT ‘PERSON’
True negatives: The ‘NOT PERSON’s that the model detected as NOT ‘PERSON’
Named Entity Recognition (NER) How to evaluate the NER performance?
The goal: predicting entities in a text *Standard evaluation is per entity, not per token
           Caren  Soyeon  Han  is  working  at  Google  at  Sydney  Australia
gold       PER    PER     PER  O   O        O   ORG     O   LOC     LOC
predicted  O      O       O    O   O        O   ORG     O   LOC     LOC

                correct              not correct
selected        2 (True Positive)    0 (False Positive)
not selected    1 (False Negative)   0 (True Negative)
Named Entity Recognition (NER) How to evaluate the NER performance?
The goal: predicting entities in a text *Standard evaluation is per entity, not per token
           Caren  Soyeon  Han  is  working  at  Google  at  Sydney  Australia
gold       PER    PER     PER  O   O        O   ORG     O   LOC     LOC
predicted  O      O       O    O   O        O   ORG     O   LOC     LOC
Precision and Recall are straightforward for text categorization or web search, where there is only one grain size (documents)
Named Entity Recognition (NER) Quick Exercise: F measure Calculation
Let’s calculate Precision, Recall, and F-measure together!
P = ??    R = ??    F1 = ??          F1 = 2 · (P · R) / (P + R)

                correct    not correct
selected        2 (TP)     0 (FP)
not selected    1 (FN)     0 (TN)
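A quick check of the arithmetic for this exercise (counts taken from the table above):

```python
# Quick check of the exercise: TP = 2, FP = 0, FN = 1 (per-entity counts from above).
tp, fp, fn = 2, 0, 1

precision = tp / (tp + fp)          # 2/2 = 1.0
recall    = tp / (tp + fn)          # 2/3 ≈ 0.667
f1        = 2 * precision * recall / (precision + recall)   # 0.8

print(precision, recall, round(f1, 3))
```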
Named Entity Recognition (NER)
Data for learning named entity
• Training counts joint frequencies in a corpus
• The more training data, the better
• Annotated corpora are small and expensive
Corpora    Source                 Size          Class Type
muc-7      New York Times         164k tokens   per, org, loc, dates, times, money, percent
                                                https://aclweb.org/aclwiki/MUC-7_(State_of_the_art)
conll-03   Reuters                301k tokens   per, org, loc, misc
bbn        Wall Street Journal    1174k tokens  https://catalog.ldc.upenn.edu/docs/LDC2005T33/BBN-Types-Subtypes.html
Named Entity Recognition (NER)
Data for learning named entity
• Models trained on one corpus perform poorly on others
train \ test   muc    conll   bbn   (F-score)
muc            82.3   54.9    69.3
conll          69.9   86.9    60.2
bbn            80.2   58.0    88.0
Named Entity Recognition (NER) CoNLL 2003 NER dataset
• Performance measure: F = 2 * Precision * Recall / (Recall + Precision)
https://paperswithcode.com/sota/named-entity-recognition-ner-on-conll-2003
Named Entity Recognition (NER) Datasets for NER in English
The linked resources below list datasets for English named entity recognition.
https://github.com/juand-r/entity-recognition-datasets
https://paperswithcode.com/task/named-entity-recognition-ner
DUA: Data Use Agreement
LDC: Linguistic Data Consortium
CC-BY 4.0: Creative Commons Attribution 4.0
Traditional NER
Three standard approaches to NER
• Rule-based NER
• Classifier-based NER
• Sequence Model for NER
Traditional Approaches
Traditional NER Rule-based NER
• Entity references have internal and external language cues:
  Mr. [per Scott Morrison] flew to [loc Beijing]
• Can recognise names using lists (or gazetteers):
– Personal titles: Mr, Miss, Dr, President
– Given names: Scott, David, James
– Corporate suffixes: & Co., Corp., Ltd.
– Organisations: Microsoft, IBM, Telstra
• and rules:
– personal title X ⇒ per
– X, location ⇒ loc or org
– travel verb to X ⇒ loc
• Effectively regular expressions plus a PoS tagger (see the sketch below)
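A toy illustration of such rules (the gazetteers and patterns here are hypothetical stand-ins, not the lecture's exact rules):

```python
# Toy rule-based tagger: a personal-title rule and a travel-verb rule,
# with tiny hypothetical gazetteers. Real systems use much larger lists.
import re

TITLES = r"(?:Mr|Mrs|Ms|Dr|President)\.?"
TRAVEL_VERBS = r"(?:flew|travelled|moved|went)"

def rule_based_ner(text):
    entities = []
    # personal title X => per
    for m in re.finditer(rf"{TITLES}\s+([A-Z]\w+(?:\s+[A-Z]\w+)*)", text):
        entities.append((m.group(1), "per"))
    # travel verb to X => loc
    for m in re.finditer(rf"{TRAVEL_VERBS}\s+to\s+([A-Z]\w+)", text):
        entities.append((m.group(1), "loc"))
    return entities

print(rule_based_ner("Mr. Scott Morrison flew to Beijing"))
# [('Scott Morrison', 'per'), ('Beijing', 'loc')]
```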
Traditional NER Rule-based NER
• Determining which person holds what office in what organization
– [person] , [office] of [org]
• Michael Spence, the vice-chancellor and principal of the University of Sydney
– [org] (named, appointed, etc.) [person] Prep [office]
  • WHO appointed Tedros Adhanom as Director-General
• Determining where an organization is located
– [org] in [loc]
• Google headquarters in California
– [org] [loc] (division, branch, headquarters, etc.)
  • Google London headquarters
Traditional NER
Statistical approaches are more portable
• Learn NER from annotated text
– weights (≈ rules) calculated from the corpus
– same machine learner, different language or domain
• Token-by-token classification (with any machine learning)
• Each token may be:
– not part of an entity (tag o)
– beginning an entity (tag b-per, b-org, etc.)
– continuing an entity (tag i-per, i-org, etc.)
• What about an N-gram model?
Traditional NER
Various features for statistical NER
Unigram                    Mr.   Scott   Morrison   flew   to    Beijing
Lowercase unigram          mr.   scott   morrison   flew   to    beijing
POS tag                    nnp   nnp     nnp        vbd    to    nnp
Length                     3     5       8          4      2     7
In first-name gazetteer    no    yes     no         no     no    no
In location gazetteer      no    no      no         no     no    yes
3-letter suffix            Mr.   ott     son        lew    –     ing
2-letter suffix            r.    tt      on         ew     to    ng
1-letter suffix            .     t       n          w      o     g

Tag predictions            O     B-per   I-per      O      O     B-loc
Predictive Model:
Mr.  Scott  Morrison  lives  in  Sydney
O    B-PER  I-PER     O      O   B-LOC
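A minimal sketch of per-token feature extraction along the lines of the table above (the gazetteers are tiny illustrative stand-ins; the resulting dictionaries could be fed to any off-the-shelf classifier):

```python
# Sketch of the feature extraction behind the table above.
# The gazetteers are tiny illustrative stand-ins for real lists.
FIRST_NAMES = {"scott", "david", "james"}
LOCATIONS = {"beijing", "sydney", "london"}

def token_features(tokens, pos_tags, i):
    w = tokens[i]
    return {
        "unigram": w,
        "lower": w.lower(),
        "pos": pos_tags[i],
        "length": len(w),
        "in_first_name_gazetteer": w.lower() in FIRST_NAMES,
        "in_location_gazetteer": w.lower() in LOCATIONS,
        "suffix3": w[-3:],
        "suffix2": w[-2:],
        "suffix1": w[-1:],
        "prev_word": tokens[i - 1] if i > 0 else "<S>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "</S>",
    }

tokens = ["Mr.", "Scott", "Morrison", "flew", "to", "Beijing"]
pos = ["NNP", "NNP", "NNP", "VBD", "TO", "NNP"]
print(token_features(tokens, pos, 1))   # features for "Scott"
```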
Traditional NER
Traditional NER Approaches – Pros and Cons
Rule-based approaches
• Can be high-performing and efficient
• Require experts to make rules
• Rely heavily on gazetteers that are always incomplete
• Are not robust to new domains and languages
Statistical approaches
• Require (expert-)annotated training data
• May identify unforeseen patterns
• Can still make use of gazetteers
• Are robust for experimentation with new features
• Are largely portable to new languages and domains
Sequence Model for NER Sequence Model (N to N)
Input (text):             How   is     the   weather   today
Output (part of speech):  ADV   VERB   DET   NOUN      NOUN

(N-to-N sequence-to-sequence learning)
Sequence Model for NER Sequence Model
Input (text):     Scott   Morrison   is   a   prime   minister   of   Australia
Output (NE tag):  PER     PER        O    O   O       O          O    LOC

(entity class or other (O); N-to-N sequence-to-sequence learning)
Sequence Model for NER Encoding classes for sequence labeling
The IOB (short for inside, outside, beginning) is a common tagging format
• I- prefix before a tag indicates that the tag is inside a chunk.
• B- prefix before a tag indicates that the tag is the beginning of a chunk.
• An O tag indicates that a token belongs to no chunk (outside).
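A small sketch of turning labelled entity spans into IOB tags (the span format, token offsets with an exclusive end, is an assumption made for illustration):

```python
# Sketch: converting entity spans (start, end, type) into IOB tags.
# The span format (token offsets, end exclusive) is assumed for illustration.
def to_iob(tokens, spans):
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

tokens = ["Scott", "Morrison", "is", "a", "prime", "minister", "of", "Australia"]
spans = [(0, 2, "PER"), (7, 8, "LOC")]
print(to_iob(tokens, spans))
# ['B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'B-LOC']
```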
Sequence Model for NER
Encoding classes for sequence labeling
The IO and IOB (inside, outside, beginning) schemes are both common tagging formats

               Josiah   tells   Caren   John    Smith   is   a   student
IO encoding    PER      O       PER     PER     PER     O    O   O          (n+1 labels for n classes)
IOB encoding   B-PER    O       B-PER   B-PER   I-PER   O    O   O          (2n+1 labels for n classes)

With plain IO tagging, PER PER PER cannot show whether this is one mention (Caren John Smith) or two (Caren + John Smith); IOB can (B-PER B-PER I-PER vs B-PER I-PER I-PER).

IO encoding vs IOB encoding:
• Computation time?
• Efficiency?
Sequence Model for NER Features for sequence labeling
Words
• Current word (essentially like a learned dictionary)
• Previous/next word (context)
Other kinds of inferred linguistic classification
• Part-of-speech tags
Label context
• Previous (and perhaps next) label
Sequence Model for NER N to N Sequence model
• There are different NLP tasks that use an N-to-N sequence model:
  – POS tagging
  – Named Entity Recognition
  – Word Segmentation
Sequence Model for NER Sequence Model (MEMM, CRF)
[Figure: graphical-model structures of HMM, MEMM, and CRF.]
Sequence Model for NER Sequence Inference for NER
• For a Maximum Entropy Markov Model (MEMM), the classifier makes a single decision at a time, conditioned on evidence from observations and previous decisions
Example (current position 0 = “in”, context window -3 … +1):
  Scott   Morrison   lives   in   Australia
  NN      NN         VBZ     IN   NN

Features:
  W0 = in
  W+1 = Australia
  W-1 = lives
  POS-1 = VBZ
  POS-2-POS-1 = NN-VBZ
  hasDigit? = 0
  …
(Toutanova et al. 2003, etc.)
Sequence Model for NER Sequence Inference for NER
[Figure: local classification. At the local level, features (W0, W+1, W-1, POS-1, …) are extracted from the local data around the current position (e.g. “in” in “Scott Morrison lives in Australia”), and a classifier (e.g. MEMM, CRF, or RNN) trained with gradient-based optimization predicts the label (here “O”); at the sequence level these local decisions are chained across positions.]
Sequence Model for NER Named Entity Recognition
The goal: predicting named entity mentions in unstructured text into pre-defined categories such as person names, organizations, and locations
Upenn CogComp-NLP http://macniece.seas.upenn.edu:4004/
           Caren  Soyeon  Han  is  working  at  Google  at  Sydney  Australia
gold       PER    PER     PER  O   O        O   ORG     O   LOC     LOC
predicted  O      O       O    O   O        O   ORG     O   LOC     LOC
Sequence Model for NER
Named Entity Recognition with Bi-LSTM
We can easily apply a Bi-LSTM (N-to-N seq2seq) model to predict named entities
*The model can clearly contain incorrect predictions:
an ‘I-’ tag cannot appear in the label of the first word, I-Per can only appear after B-Per, and I-Org can only appear after B-Org.

What if we teach the model the dependencies between predicted entity labels?
Sequence Model for NER
Wait, what about HMMs?
Hidden Markov Models (HMMs) are a class of probabilistic graphical model that allow us to predict a sequence of unknown (hidden) variables from a set of observed variables.
[Figure: an HMM with hidden states X1, X2, X3 emitting observations Y1, Y2, Y3; x = states, y = possible observations, a = state transition probabilities, b = output (emission) probabilities.]
• States are hidden
• Observable outcomes are linked to states
• Each state has observation probabilities that determine the observable event
Sequence Model for NER Advanced HMM (MEMM or CRF)
• The CRF model addresses the label bias issue of the MEMM and eliminates unreasonable independence assumptions made by the HMM.
• The MEMM adopts local (per-state) normalization, while the CRF adopts global (whole-sequence) normalization.

[Figure: graphical structures of the HMM, the Maximum-Entropy Markov Model (MEMM), and the Conditional Random Field (CRF).]
Sequence Model for NER
Named Entity Recognition with Bi-LSTM with CRF
What if we put a CRF on top of the Bi-LSTM model? By adding a CRF layer, the model can capture the dependencies between predicted entity labels.
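A minimal PyTorch-style sketch of this architecture (the hyperparameters and the third-party pytorch-crf package are assumptions, not the lecture's implementation):

```python
# Sketch of a Bi-LSTM + CRF tagger in PyTorch.
# Assumes the third-party `pytorch-crf` package (pip install pytorch-crf);
# sizes and names are illustrative, not the lecture's exact model.
import torch
import torch.nn as nn
from torchcrf import CRF

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim // 2, bidirectional=True, batch_first=True)
        self.emissions = nn.Linear(hidden_dim, num_tags)   # per-token tag scores
        self.crf = CRF(num_tags, batch_first=True)         # learns tag-to-tag transitions

    def loss(self, token_ids, tags, mask):
        h, _ = self.lstm(self.embed(token_ids))
        return -self.crf(self.emissions(h), tags, mask=mask)   # negative log-likelihood

    def decode(self, token_ids, mask):
        h, _ = self.lstm(self.embed(token_ids))
        return self.crf.decode(self.emissions(h), mask=mask)   # best tag sequence (Viterbi)
```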
Sequence Model for NER Remember?
Sequence Model for NER Greedy Inference
• Greedy inference:
– We just start at the left, and use our classifier at each position to assign a label
– The classifier can depend on previous labeling decisions as well as observed data
• Advantages:
– Fast, no extra memory requirements
– Very easy to implement
– With rich features including observations to the right, it may perform quite well
• Disadvantage:
– Greedy: we may commit errors that we cannot recover from
Scott Morrison lives in Australia
Sequence Model for NER Beam Inference
• Beam inference:
– At each position keep the top k complete sequences.
– Extend each sequence in each local way.
– The extensions compete for the k slots at the next position.
• Advantages:
– Fast; beam sizes of 3–5 are almost as good as exact inference in many cases.
– Easy to implement (no dynamic programming required).
• Disadvantage:
– Inexact: the globally best sequence can fall off the beam.
Scott Morrison lives in Australia
Sequence Model for NER Viterbi Inference
• Viterbi inference:
– Dynamic programming or memoisation.
– Requires small window of state influence (e.g., past two states are relevant).
• Advantage:
– Exact: the global best sequence is returned.
• Disadvantage:
– Harder to implement long-distance state-state interactions
Scott Morrison lives in Australia
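A compact sketch of Viterbi decoding over per-position tag scores and tag-transition scores (log-space scores and the simple dictionary representation are assumptions made for illustration):

```python
# Sketch: Viterbi decoding in log space.
# emissions[t][tag]  : score of `tag` at position t
# transitions[a][b]  : score of moving from tag `a` to tag `b`
def viterbi(emissions, transitions, tags):
    n = len(emissions)
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = [{}]
    for t in range(1, n):
        best.append({})
        back.append({})
        for tag in tags:
            # best previous tag for each current tag (dynamic programming step)
            prev = max(tags, key=lambda p: best[t - 1][p] + transitions[p][tag])
            best[t][tag] = best[t - 1][prev] + transitions[prev][tag] + emissions[t][tag]
            back[t][tag] = prev
    # follow backpointers from the best final tag
    last = max(tags, key=lambda tag: best[n - 1][tag])
    path = [last]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```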
Probabilistic Approaches: Viterbi Algorithm
[Worked example omitted: a Viterbi trellis for tagging “John will pin Will” with hidden states N (noun), M (modal) and V (verb); transition and emission probabilities are multiplied along each path (intermediate values such as 1/6, 1/486, 1/1152 and 1/2592 in the figure), and only the best-scoring path into each state is kept at every step.]
https://web.stanford.edu/~jurafsky/slp3/8.pdf
Sequence Model for NER
What if there is a language that does not have any annotation?
NER in Low Resource Language
Current state-of-the-art model: Han et al. (2019) from the USyd NLP research group
Coreference Resolution NER and Coreference Resolution
NER only produces a list of entities in a text.
• “I voted for Scott because he was most aligned with my values”
Then, how do we trace it?
Coreference Resolution is the task of finding all expressions that
refer to the same entity in a text
• “I voted for Scott because he was most aligned with my values”
  – Scott ↔ he
  – I ↔ my
Coreference Resolution What is Coreference Resolution?
Finding all mentions that refer to the same entity
Donald Trump said he considered nominating Ivanka Trump to be president of the World Bank because “she is very good with numbers,” according to a new interview.
Coreference Resolution
How to conduct Coreference Resolution?
1. Detect the mentions
* Mention: a span of text referring to some entity
• Pronouns
e.g. I, your, it, she, him, etc.
• Named entities
e.g. people, places, organisation etc.
• Noun phrases
e.g. a cat, a big fat dog, etc.
Coreference Resolution
The difficulty in coreference resolution
1. Detect the mentions
* Mention: a span of text referring to some entity
Tricky mentions…
• It was very interesting
• No staff
• The best university in Australia
How to handle these tricky mentions? Classifiers!
Coreference Resolution
How to conduct Coreference Resolution?
1. Detect the mentions
Donald Trump said he considered nominating Ivanka Trump to be president of the World Bank because “she is very good with numbers,”
2. Cluster the mentions
Donald Trump said he considered nominating Ivanka Trump to be president of the World Bank because “she is very good with numbers,”
Coreference Resolution
How to cluster the mentions and find the coreference
Coreference
It occurs when two or more expressions in a text refer to the same person or thing.
• “Donald Trump is the president of the United States. Trump was born and raised in the New York City borough of Queens”
Anaphora
The use of a word referring back to a word used earlier in a text or conversation. Mostly noun phrases
• a word (anaphor) refers to another word (antecedent)
• “Donald Trump is the president of the United States. Before entering politics, he was a businessman and television personality”
  (antecedent: Donald Trump; anaphor: he)
Coreference Resolution Coreference vs Anaphora
Coreference: Donald Trump ↔ Trump
Anaphora: Donald Trump ← he
Coreference Resolution
Not all anaphoric relations are coreferential
1. Not all noun phrases have reference
• Every student likes his speech
• No student likes his speech
2. Not all anaphoric relations are co-referential (bridging anaphora)
• I attended the meeting yesterday. The presentation was awesome!
Coreference: multiple expressions refer to the same person or thing
Anaphora: pronominal anaphora, adjectival anaphora, bridging anaphora
Cataphora: the referring expression comes before what it refers to, e.g. “I almost stepped on it. It was a big snake…”
Coreference Model How to Cluster Mentions?
After detecting all the mentions in a text, we need to cluster them!

“Ivanka was happy that Donald said he considered nominating her because she is very good with numbers”
(mentions: Ivanka, Donald, he, her, she)
Coreference Model How to Cluster Mentions?
After detecting all the mentions in a text, we need to cluster them!

“Ivanka was happy that Donald said he considered nominating her because she is very good with numbers”
(mentions: Ivanka, Donald, he, her, she; gold cluster 1 = {Ivanka, her, she}, gold cluster 2 = {Donald, he})
Coreference Model How to Cluster Mentions?
• Train a binary classifier that assigns every pair of mentions a probability of being coreferent: p(m_i, m_j)
  p(m_i, m_j) ranges from 0 (absolute negative) to 1 (absolute positive)

“Ivanka was happy that Donald said he considered nominating her because she is very good with numbers”
(score all mention pairs among Ivanka, Donald, he, her, she)
Coreference Model Mention Pair Training
• N mentions in a document
• y_ij = 1 if mentions m_i and m_j are coreferent, -1 otherwise
• Just train with regular cross-entropy loss (it looks a bit different because it is binary classification):
  coreferent mention pairs should get high probability, others should get low probability
• Iterate through mentions; for each one, iterate through candidate antecedents (previously occurring mentions)
Coreference Model Mention Pair Testing
• Coreference resolution is a clustering task, but we are only scoring pairs of mentions… what to do?
• Pick some threshold (e.g., 0.5) and add coreference links between mention pairs where p(mi, mj) is above the threshold
[Figure: pairwise probabilities p(m_i, m_j) computed between the mentions Ivanka, Donald, he, her, she in “Ivanka was happy that Donald said he considered nominating her because she is very good with numbers”.]
Coreference Model Mention Pair Testing
• Pick some threshold (e.g., 0.5) and add coreference links between mention pairs where p(m_i, m_j) is above the threshold
• Take the transitive closure to get the clustering
[Figure: predicted links among the mentions Ivanka, Donald, he, her, she; the transitive closure merges linked mentions into clusters.]
Even though the model did not predict this coreference link, Ivanka and her are coreferent due to transitivity
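A small sketch of this test-time procedure: threshold the pairwise scores, then take the transitive closure with union-find (the pairwise probabilities below are made-up toy numbers, not model outputs):

```python
# Sketch: thresholding pairwise scores and taking the transitive closure (union-find).
# The pairwise probabilities here are made-up toy numbers.
mentions = ["Ivanka", "Donald", "he", "her", "she"]
pair_prob = {("Ivanka", "her"): 0.4, ("her", "she"): 0.8, ("Ivanka", "she"): 0.9,
             ("Donald", "he"): 0.85}

parent = {m: m for m in mentions}
def find(x):
    while parent[x] != x:
        x = parent[x]
    return x

for (a, b), p in pair_prob.items():
    if p > 0.5:                       # add a coreference link above the threshold
        parent[find(a)] = find(b)

clusters = {}
for m in mentions:
    clusters.setdefault(find(m), []).append(m)
print(list(clusters.values()))
# [['Ivanka', 'her', 'she'], ['Donald', 'he']] -- Ivanka and her end up in the
# same cluster via transitivity even though p(Ivanka, her) is below the threshold
```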
Coreference Model Mention Pair Testing: Issue
• Assume that we have a long document with the following mentions
• Michael… he … his … him …
• … won the game because he …
Many mentions only have one clear antecedent, but we are asking the model to predict all of them
Alternative solution: instead train the model to predict only one antecedent for each mention
Mention Ranking
Coreference Model Coreference Models: Training
• The current mention m_i should be linked to any one of the candidate antecedents it is coreferent with.
• Mathematically, maximize this probability:
  Σ_{j=1}^{i-1} 1(y_ij = 1) · p(m_i, m_j)
  (iterate through the candidate antecedents, i.e. previously occurring mentions; for the ones that are coreferent, we want the model to assign a high probability)
The model could produce 0.9 probability for one of the correct antecedents and low probability for everything else
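A sketch of this objective for one mention (illustrative only; the probabilities and the marginal-log-likelihood formulation are assumptions in the spirit of the slide, not the lecture's exact loss):

```python
# Sketch: mention-ranking objective for one mention m_i.
# `probs[j]` plays the role of p(m_i, m_j) over candidate antecedents j < i,
# `gold[j]` marks the antecedents that are actually coreferent with m_i.
import math

def mention_ranking_loss(probs, gold):
    # maximize the summed probability mass on correct antecedents,
    # i.e. minimize the negative log of that sum (marginal log-likelihood)
    mass = sum(p for p, g in zip(probs, gold) if g == 1)
    return -math.log(mass)

probs = [0.05, 0.9, 0.03, 0.02]   # model's distribution over candidate antecedents
gold  = [1,    1,   0,    0   ]   # two of them are coreferent with m_i
print(mention_ranking_loss(probs, gold))   # low loss: most mass on a correct antecedent
```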
Coreference Model
Mention Ranking Models: Test Time
• Similar to the mention-pair model, except each mention is assigned only one antecedent (possibly the dummy antecedent NA, i.e. no antecedent)
How do we compute the probabilities?
• Non-neural statistical classifier
• Simple neural network
• More advanced model using LSTMs, attention
[Figure: each mention among Ivanka, Donald, he, her, she selects a single antecedent (or NA).]
Coreference Model
How do we compute the probabilities?
End to End Model (Lee et al., 2017)
• Current state-of-the-art model for coreference resolution (before 2019)
• Mention ranking model
• Improvements over a simple feed-forward NN:
  – Uses an LSTM
  – Uses attention (we will learn about this in Lecture 10)
  – Does mention detection and coreference end-to-end; no separate mention detection step
Coreference Model
End to End Model (Lee et al., 2017)
• First, embed the words in the document using a word embedding matrix and a character-level embedding
• Then run a bidirectional LSTM over the document
Coreference Model
End to End Model (Lee et al., 2017)
• Next, represent each span of text i going from START(i) to END(i) as a vector
  – General, General Electric, General Electric said, …, Electric, Electric said, … will all get their own vector representations
  – The span representation (e.g. for “the postal service”) concatenates:
    • the Bi-LSTM hidden states for the span's start and end
    • an attention-based representation of the span (Lecture 10)
    • additional features
Coreference Model
End to End Model (Lee et al., 2017)
• Finally, score each pair of spans: are spans i and j coreferent mentions? Is i a mention? Is j a mention? Do they look coreferent?
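A rough sketch of the span representation and pairwise scoring (the dimensions, feature vector, and scorer shapes are assumptions; see Lee et al. (2017) for the real model):

```python
# Sketch of the span representation g_i = [h_start, h_end, x_hat, phi] and the
# pairwise score s(i, j) = s_m(i) + s_m(j) + s_a(i, j), after Lee et al. (2017).
# Dimensions and the feed-forward scorers are illustrative assumptions.
import torch
import torch.nn as nn

hidden = 128   # Bi-LSTM output size per token (assumed)
feat = 20      # extra feature vector size (assumed)
g_dim = 2 * hidden + hidden + feat

def span_representation(lstm_out, start, end, attn_weights, features):
    # boundary hidden states + attention-weighted average of the span + extra features
    x_hat = (attn_weights.unsqueeze(1) * lstm_out[start:end + 1]).sum(dim=0)
    return torch.cat([lstm_out[start], lstm_out[end], x_hat, features])

mention_score = nn.Sequential(nn.Linear(g_dim, 150), nn.ReLU(), nn.Linear(150, 1))
pair_score = nn.Sequential(nn.Linear(3 * g_dim, 150), nn.ReLU(), nn.Linear(150, 1))

def coreference_score(g_i, g_j):
    # s(i, j): is i a mention? is j a mention? do they look coreferent?
    s_a = pair_score(torch.cat([g_i, g_j, g_i * g_j]))
    return mention_score(g_i) + mention_score(g_j) + s_a
```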
Coreference Evaluation How to evaluate coreference?
There are different metrics available for evaluating coreference, such as B-CUBED, MUC, CEAF, LEA, and BLANC; papers often report the average over a few different metrics.
[Figure: two predicted clusters compared against two gold (actual) clusters over the mentions Donald Trump, Trump, Donald, he, his, him (gold cluster 1) and Hillary Clinton, She, her (gold cluster 2); each predicted cluster mixes mentions from both gold clusters.]
Coreference Evaluation How to evaluate coreference?
Let’s evaluate with B-CUBED metrics
• Compute Precision and Recall for each mention.
For each mention, precision is the fraction of its predicted cluster that belongs to its gold cluster, and recall is the fraction of its gold cluster that is in its predicted cluster.

[Figure: per-mention values. Predicted cluster 1: P = 4/5, R = 4/6 for its Donald Trump mentions and P = 1/5, R = 1/3 for the Hillary Clinton mention it contains; predicted cluster 2: P = 2/4, R = 2/3 for its Hillary Clinton mentions and P = 2/4, R = 2/6 for its Donald Trump mentions.]
Coreference Evaluation How to evaluate coreference?
Let’s evaluate with B-CUBED metrics
• Compute precision and recall for each mention.
• Average the individual Ps and Rs
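A small sketch of B-CUBED computed this way (per-mention precision and recall, then averaged). The clusters below are chosen to reproduce the per-mention values in the figure, but the exact assignment of individual mentions is an assumption:

```python
# Sketch of B-CUBED: per-mention precision/recall, then averaged.
# Cluster memberships are assumed to match the figure's per-mention values.
def b_cubed(predicted, gold):
    pred_of = {m: c for c in predicted for m in c}
    gold_of = {m: c for c in gold for m in c}
    precisions, recalls = [], []
    for m in gold_of:
        overlap = len(set(pred_of[m]) & set(gold_of[m]))
        precisions.append(overlap / len(pred_of[m]))   # share of its predicted cluster that is correct
        recalls.append(overlap / len(gold_of[m]))      # share of its gold cluster that was recovered
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    return p, r, 2 * p * r / (p + r)

gold = [["Donald Trump", "Trump", "Donald", "he", "his", "him"], ["Hillary Clinton", "She", "her"]]
predicted = [["Donald Trump", "Trump", "Donald", "he", "her"], ["Hillary Clinton", "She", "his", "him"]]
print(b_cubed(predicted, gold))   # averaged precision, recall, F1
```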
Coreference Evaluation Performance Comparison
OntoNotes dataset: ~3000 documents labeled by humans • English and Chinese data
Model                   Approach                                            English   Chinese
Lee et al. (2010)       Rule-based system                                   ~55       ~50
Chen & Ng (2012)        Non-neural machine learning models                  54.5      57.6
  [CoNLL 2012 Chinese winner]
Fernandes (2012)        Non-neural machine learning models                  60.7      51.6
  [CoNLL 2012 English winner]
Wiseman et al. (2015)   Neural mention ranker                               63.3      —
Lee et al. (2017)       Neural mention ranker (end-to-end style)            67.2      —
UsydNLP (2019)          Neural mention ranker with lemma cross validation   74.87     —
Preview: Week 10
Attention and Reading Comprehension
[Figure: a sequence-to-sequence model with attention. Encoder (one-hot vectors → embedding layer → recurrent layer) reads “How are you ?”; decoder (one-hot vectors → embedding layer → recurrent layer) generates “I am fine”; attention scores are passed through a softmax to give the attention distribution and the attention output.]
Preview: Week 11
Transformer and Machine Translation
Reference
Reference for this lecture
• Deng, L., & Liu, Y. (Eds.). (2018). Deep Learning in Natural Language Processing. Springer.
• Rao, D., & McMahan, B. (2019). Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning. O’Reilly Media, Inc.
• Manning, C. D., Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. MIT press.
• Manning, C 2018, Natural Language Processing with Deep Learning, lecture notes, Stanford University
• Li, J., Galley, M., Brockett, C., Gao, J., & Dolan, B. (2015). A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.
• Jiang, S., & de Rijke, M. (2018). Why are Sequence-to-Sequence Models So Dull? Understanding the Low- Diversity Problem of Chatbots. arXiv preprint arXiv:1809.01941.
• Liu, C. W., Lowe, R., Serban, I. V., Noseworthy, M., Charlin, L., & Pineau, J. (2016). How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.