COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
COMP90042
Natural Language Processing
Lecture 21
Semester 1 2021 Week 11
Jey Han Lau
Summarisation
Summarisation
• Distill the most important information from a text to
produce a shortened or abridged version
• Examples
‣ outlines of a document
‣ abstracts of a scientific article
‣ headlines of a news article
‣ snippets of search results
What to Summarise?
• Single-document summarisation
‣ Input: a single document
‣ Output: summary that characterises the content
• Multi-document summarisation
‣ Input: multiple documents
‣ Output: summary that captures the gist of all
documents
‣ E.g. summarise a news event from multiple
sources or perspectives
How to Summarise?
• Extractive summarisation
‣ Summarise by selecting representative
sentences from documents
• Abstractive summarisation
‣ Summarise the content in your own words
‣ Summaries will often be paraphrases of the
original content
Goal of Summarisation?
• Generic summarisation
‣ Summary gives important information in the
document(s)
• Query-focused summarisation
‣ Summary responds to a user query
‣ Similar to question answering
‣ But answer is much longer (not just a phrase)
Query-Focused Summarisation
[example figure]
Outline
• Extractive summarisation
‣ Single-document
‣ Multi-document
• Abstractive summarisation
‣ Single-document (deep learning models!)
• Evaluation
Extractive: Single-Doc
Summarisation System
• Content selection: select what sentences to
extract from the document
• Information ordering: decide how to order
extracted sentences
• Sentence realisation: cleanup to make sure
combined sentences are fluent
Summarisation System
• We will focus on content selection
• For single-document summarisation, information
ordering is not necessary
‣ present extracted sentences in their original order
• Sentence realisation is also not necessary if the
sentences are presented as dot points
Content Selection
• Not much data with ground truth extractive
sentences
• Mostly unsupervised methods
• Goal: Find sentences that are important or salient
Method 1: TF-IDF
• Frequent words in a doc → salient
• But some generic words are very frequent but
uninformative
‣ function words
‣ stop words
• Weigh each word w in document d by its inverse
document frequency:
‣ weight(w) = tf_{d,w} × idf_w
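As a minimal sketch of this weighting (the toy corpus, tokenisation, and function names below are illustrative, not part of the lecture):

```python
import math
from collections import Counter

def tfidf_weights(document, corpus):
    """Weight each word in `document` by tf_{d,w} x idf_w,
    where idf_w = log(N / df_w) over `corpus` (a list of tokenised docs)."""
    tf = Counter(document)
    df = Counter()
    for doc in corpus:
        for word in set(doc):
            df[word] += 1
    n_docs = len(corpus)
    # words never seen in the corpus are skipped here for simplicity
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf if df[w] > 0}

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "cat", "ran"]]
weights = tfidf_weights(["the", "cat", "sat", "sat"], corpus)
# "sat" is frequent in the document and rare in the corpus -> high weight;
# "the" appears in most documents -> low idf, low weight
```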
Method 2: Log Likelihood Ratio
• Intuition: a word is salient if its probability in the input corpus is
very different to its probability in a background corpus
• weight(w) = 1 if −2 log λ(w) > 10, and 0 otherwise
• λ(w) is the ratio between:
‣ the likelihood of observing w in I and observing w in B, assuming P(w|I) = P(w|B) = p
‣ the likelihood of observing w in I and observing w in B, assuming P(w|I) = p_I and P(w|B) = p_B
• λ(w) = [C(N_I, x) p^x (1−p)^{N_I−x} × C(N_B, y) p^y (1−p)^{N_B−y}]
  / [C(N_I, x) p_I^x (1−p_I)^{N_I−x} × C(N_B, y) p_B^y (1−p_B)^{N_B−y}]
‣ where x and y are the counts of w in the input corpus I and the
background corpus B, N_I and N_B are the corpus sizes, and the
maximum-likelihood estimates are p = (x+y)/(N_I+N_B), p_I = x/N_I, p_B = y/N_B
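The ratio is easiest to compute in log space, and the binomial coefficients cancel between numerator and denominator, so they can be dropped. A sketch (function names and the toy counts are illustrative):

```python
import math

def log_binom_likelihood(k, n, p):
    """log[p^k (1-p)^(n-k)]; the binomial coefficients are identical in
    the numerator and denominator of lambda(w), so they are omitted."""
    if p in (0.0, 1.0):                      # guard against log(0)
        return 0.0 if k == n * p else float("-inf")
    return k * math.log(p) + (n - k) * math.log(1 - p)

def llr_weight(x, n_i, y, n_b, threshold=10.0):
    """weight(w) = 1 if -2 log lambda(w) > threshold, else 0.
    x, y: counts of w in input I and background B; n_i, n_b: corpus sizes."""
    p = (x + y) / (n_i + n_b)     # shared MLE, assuming P(w|I) = P(w|B)
    p_i, p_b = x / n_i, y / n_b   # separate MLEs
    log_lambda = (log_binom_likelihood(x, n_i, p)
                  + log_binom_likelihood(y, n_b, p)
                  - log_binom_likelihood(x, n_i, p_i)
                  - log_binom_likelihood(y, n_b, p_b))
    return 1 if -2 * log_lambda > threshold else 0

salient = llr_weight(50, 1000, 10, 10000)      # 50x more frequent in I -> 1
not_salient = llr_weight(5, 1000, 50, 10000)   # same relative rate -> 0
```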
Saliency of a Sentence?
• weight(s) = (1/|S|) Σ_{w∈S} weight(w)
• Only consider non-stop words in S
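A sketch of the sentence-level score, averaging word weights over the non-stop words (the stop list and weights below are toy values):

```python
STOPWORDS = {"the", "a", "is", "of", "in", "has"}  # toy stop list

def sentence_weight(sentence, word_weight):
    """Average weight(w) over the non-stop words of the sentence."""
    content = [w for w in sentence if w not in STOPWORDS]
    if not content:
        return 0.0
    return sum(word_weight.get(w, 0.0) for w in content) / len(content)

word_weight = {"mars": 3.0, "weather": 2.0, "frigid": 1.0}
score = sentence_weight(["mars", "has", "frigid", "weather"], word_weight)
# "has" is dropped; (3.0 + 1.0 + 2.0) / 3 = 2.0
```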
Method 3: Sentence Centrality
• Alternative approach to ranking sentences
• Measure distance between sentences, and
choose sentences that are closer to other
sentences
• Use tf-idf BOW to represent sentence
• Use cosine similarity to measure distance
• centrality(s) = (1/#sent) Σ_{s′} cos_tfidf(s, s′)
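Centrality can be sketched with sparse tf-idf vectors represented as dicts (the vectors below are toy values; here each score averages over the other sentences):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse tf-idf vectors (word -> weight)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def centrality(sentences):
    """Score each sentence by its average cosine similarity to the
    other sentences; higher = more central."""
    scores = []
    for i, s in enumerate(sentences):
        sims = [cosine(s, t) for j, t in enumerate(sentences) if j != i]
        scores.append(sum(sims) / len(sims) if sims else 0.0)
    return scores

scores = centrality([
    {"mars": 1.0},                  # overlaps only with the 2nd sentence
    {"mars": 1.0, "weather": 1.0},  # overlaps with both -> most central
    {"weather": 1.0},
])
```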
Final Extracted Summary
• Use top-ranked sentences as extracted summary
‣ Saliency (tf-idf or log likelihood ratio)
‣ Centrality
Method 4: RST Parsing
With its distant orbit – 50 percent farther from the sun
than Earth – and slim atmospheric blanket, Mars
experiences frigid weather conditions. Surface
temperatures typically average about -70 degrees
Fahrenheit at the equator, and can dip to -123 degrees C
near the poles. Only the midday sun at tropical latitudes
is warm enough to thaw ice on occasion, but any liquid
water formed in this way would evaporate almost
instantly because of the low atmospheric pressure.
Although the atmosphere holds a small amount of water,
and water-ice clouds sometimes develop, most Martian
weather involves blowing dust or carbon dioxide.
Method 4: RST Parsing
• Rhetorical structure theory (L12, Discourse):
explain how clauses are connected
• Define the types of relations between a nucleus
(main clause) and a satellite (supporting clause)
Method 4: RST Parsing
• Nucleus more important than satellite
• A sentence that functions as a nucleus to more sentences = more
salient
Which sentence is the best summary sentence?
(PollEv.com/jeyhanlau569)
Extractive: Multi-Doc
Summarisation System
• Similar to single-document extractive
summarisation system
• Challenges:
‣ Redundancy in terms of information
‣ Sentence ordering
Content Selection
• We can use the same unsupervised content
selection methods (tf-idf, log likelihood ratio,
centrality) to select salient sentences
• But ignore sentences that are redundant
Maximum Marginal Relevance
• Iteratively select the best sentence to add to
summary
• Sentences to be added must be novel
• Penalise a candidate sentence if it’s similar to
extracted sentences:
‣ MMR-penalty(s) = λ max_{s_i∈S} sim(s, s_i)
• Stop when a desired number of sentences are
added
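A sketch of the MMR selection loop; the saliency scores and similarity function below are toy stand-ins for the tf-idf/LLR scores and cosine similarity described above:

```python
def mmr_select(candidates, saliency, sim, n, lam=0.5):
    """Iteratively pick the sentence maximising
    saliency(s) - lam * max similarity to already-selected sentences."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < n:
        def score(s):
            penalty = lam * max((sim(s, t) for t in selected), default=0.0)
            return saliency[s] - penalty
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

saliency = {"s1": 1.0, "s2": 0.9, "s3": 0.5}
def sim(a, b):  # s1 and s2 are near-duplicates
    return 0.95 if {a, b} == {"s1", "s2"} else 0.0

summary = mmr_select(["s1", "s2", "s3"], saliency, sim, n=2, lam=1.0)
# s2 is heavily penalised once s1 is selected, so s3 is picked instead
```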
Information Ordering
• Chronological ordering:
‣ Order by document dates
• Coherence:
‣ Order in a way that makes adjacent sentences similar
‣ Order based on how entities are organised (centering
theory, L12)
Sentence Realisation
• Make sure entities are referred to coherently
‣ Full name at first mention
‣ Last name at subsequent mentions
• Apply coreference methods to first extract names
• Write rules to clean up
Abstractive: Single-Doc
Example
• Paraphrase
• A very difficult task
• Can we train a neural network to generate
summary?
Document: a detained iranian-american academic accused of acting against
national security has been released from a tehran prison after a
hefty bail was posted, a top judiciary official said tuesday
Summary: iranian-american academic held in tehran released on bail
Encoder-Decoder?
[Figure: an encoder RNN reads the source sentence "牛 吃 草"; a decoder
RNN generates the target sentence "cow eats grass"]
• What if we treat:
‣ Source sentence = “document”
‣ Target sentence = “summary”
Encoder-Decoder?
Document: a detained iranian-american academic accused of acting against
national security has been released from a tehran prison after a
hefty bail was posted, a top judiciary official said tuesday
Summary: iranian-american academic held in tehran released on bail
[Figure: the encoder RNN reads the document as the source sentence; the
decoder RNN generates the summary as the target sentence]
Data
• News headlines
• Document: First sentence of article
• Summary: News headline/title
• Technically more like a “headline generation task”
And It Kind of Works…
Rush et al. (2015): A Neural Attention Model for Abstractive Sentence Summarization
More Summarisation Data
• But headline generation isn’t really exciting…
• Other summarisation data:
‣ CNN/Dailymail: 300K articles, summary in bullets
‣ Newsroom: 1.3M articles, summary by authors
– Diverse; 38 major publications
‣ XSum: 200K BBC articles
– Summary is more abstractive than other
datasets
Improvements
• Attention mechanism
• Richer word features: POS tags, NER tags, tf-idf
• Hierarchical encoders
‣ One LSTM for words
‣ Another LSTM for sentences
[Figure: a flat encoder (input → hidden → output) vs. a hierarchical
encoder (input → word-level hidden → sentence-level hidden → output)]
Nallapati et al. (2016): Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond
Potential issues of an attention encoder-decoder summarisation system?
(PollEv.com/jeyhanlau569)
• Has the potential to generate new details not in the
source document
• Unable to handle unseen words in the source document
• Information bottleneck: a vector is used to represent the
source document
• Can only generate one summary
Encoder-decoder with Attention
[Figure from See et al. (2017): Get To The Point: Summarization with Pointer-Generator Networks]
Encoder-decoder with Attention + Copying
See et al. (2017): Get To The Point: Summarization with Pointer-Generator Networks
P(Argentina) = (1 − p_gen) × P_attn(Argentina) + p_gen × P_voc(Argentina)
‣ p_gen is a scalar (e.g. 0.8) giving the probability of generating from
the vocabulary; (1 − p_gen) is the probability of "copying" from the
source via attention
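The mixing step alone (not the full network) can be sketched as follows; the distributions are toy values, and word strings stand in for vocabulary entries and source positions:

```python
def final_distribution(p_gen, p_vocab, p_attn):
    """Pointer-generator output distribution:
    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * P_attn(w),
    where P_attn is the attention distribution over source words."""
    words = set(p_vocab) | set(p_attn)
    return {w: p_gen * p_vocab.get(w, 0.0) + (1 - p_gen) * p_attn.get(w, 0.0)
            for w in words}

# "smergle" is out-of-vocabulary: P_vocab(smergle) = 0, so its
# probability comes entirely from the copy (attention) term
p = final_distribution(0.8,
                       {"released": 0.6, "held": 0.4},     # P_vocab
                       {"smergle": 0.5, "released": 0.5})  # P_attn
```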
Copy Mechanism
• Generate summaries that reproduce details in the
document
• Can produce out-of-vocabulary words in the summary
by copying them from the document
‣ e.g. smergle = out of vocabulary
‣ P(smergle) = attention probability + generation
probability = attention probability, since the generation
probability of an out-of-vocabulary word is 0
Latest Development
• State-of-the-art models use transformers instead
of RNNs
• Lots of pre-training
• Note: BERT not directly applicable because we
need a unidirectional decoder (BERT is only an
encoder)
Evaluation
ROUGE
(Recall-Oriented Understudy for Gisting Evaluation)
• Similar to BLEU, evaluates the degree of word
overlap between generated summary and
reference/human summary
• But recall oriented
• Measures overlap in N-grams separately (e.g.
from 1 to 3)
• ROUGE-2: calculates the percentage of bigrams
from the reference that are in the generated
summary
ROUGE-2: Example
• Ref 1: Water spinach is a green leafy vegetable grown in the tropics.
• Ref 2: Water spinach is a commonly eaten leaf vegetable of Asia.
• Generated summary: Water spinach is a leaf vegetable commonly
eaten in tropical areas of Asia.
• ROUGE-2 = (3 + 6) / (10 + 9) ≈ 0.47
‣ 3 and 6 are the bigrams of Ref 1 and Ref 2 that appear in the
generated summary; 10 and 9 are the total bigrams in Ref 1 and Ref 2
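The example can be checked with a few lines of code; this is a bare-bones recall computation, not the full ROUGE toolkit (no stemming or stopword handling):

```python
from collections import Counter

def bigrams(tokens):
    return Counter(zip(tokens, tokens[1:]))

def rouge2(references, generated):
    """Overlapping bigrams with each reference, divided by the total
    number of bigrams across all references (recall-oriented)."""
    gen = bigrams(generated)
    overlap = total = 0
    for ref in references:
        ref_bigrams = bigrams(ref)
        overlap += sum(min(c, gen[b]) for b, c in ref_bigrams.items())
        total += sum(ref_bigrams.values())
    return overlap / total

ref1 = "water spinach is a green leafy vegetable grown in the tropics".split()
ref2 = "water spinach is a commonly eaten leaf vegetable of asia".split()
gen = ("water spinach is a leaf vegetable commonly eaten "
       "in tropical areas of asia").split()
score = rouge2([ref1, ref2], gen)  # (3 + 6) / (10 + 9)
```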
A Final Word
• Research focus on single-document abstractive
summarisation
‣ Mostly news data
• But many types of data for summarisation:
‣ Images, videos
‣ Graphs
‣ Structured data: e.g. patient records, tables
• Multi-document abstractive summarisation