Summarisation
COMP90042 Natural Language Processing
Lecture 21, Semester 1 2021, Week 11
Jey Han Lau
Copyright 2021, The University of Melbourne
Summarisation
• Distill the most important information from a text to produce a shortened or abridged version
• Examples:
‣ outlines of a document
‣ abstracts of a scientific article
‣ headlines of a news article
‣ snippets of a search result
What to Summarise?
• Single-document summarisation
‣ Input: a single document
‣ Output: a summary that characterises the content
• Multi-document summarisation
‣ Input: multiple documents
‣ Output: a summary that captures the gist of all documents
‣ E.g. summarise a news event from multiple sources or perspectives
How to Summarise?
• Extractive summarisation
‣ Summarise by selecting representative sentences from documents
• Abstractive summarisation
‣ Summarise the content in your own words
‣ Summaries will often be paraphrases of the original content
Goal of Summarisation?
• Generic summarisation
‣ Summary gives important information in the document(s)
• Query-focused summarisation
‣ Summary responds to a user query
‣ Similar to question answering
‣ But the answer is much longer (not just a phrase)
Query-Focused Summarisation
Outline
• Extractive summarisation
‣ Single-document
‣ Multi-document
• Abstractive summarisation
‣ Single-document (deep learning models!)
• Evaluation
Extractive: Single-Doc
Summarisation System
• Content selection: select what sentences to extract from the document
• Information ordering: decide how to order extracted sentences
• Sentence realisation: cleanup to make sure combined sentences are fluent
Summarisation System
• We will focus on content selection
• For single-document summarisation, information ordering is not necessary
‣ present extracted sentences in their original order
• Sentence realisation is also not necessary if sentences are presented as dot points
Content Selection
• Not much data with ground truth extractive sentences
• Mostly unsupervised methods
• Goal: find sentences that are important or salient
Method 1: TF-IDF
• Frequent words in a document → salient
• But some generic words are very frequent yet uninformative
‣ function words
‣ stop words
• Weigh each word w in document d by its frequency in d and its inverse document frequency (see the sketch below):
  weight(w) = tf_{d,w} × idf_w
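A minimal sketch of this weighting in plain Python, using a made-up toy document collection and whitespace tokenisation:

```python
# Toy tf-idf word weighting for content selection.
# The documents and tokenisation here are purely illustrative.
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the dogs and cats are pets".split(),
]

def idf(word, docs):
    # Inverse document frequency: words that occur in few documents get higher weight.
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df) if df else 0.0

def tfidf_weights(doc, docs):
    tf = Counter(doc)
    return {w: tf[w] * idf(w, docs) for w in tf}  # weight(w) = tf_{d,w} x idf_w

print(tfidf_weights(docs[0], docs))  # "the" gets weight 0; content words score higher
```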
Method 2: Log Likelihood Ratio
• Intuition: a word is salient if its probability in the input corpus I is very different to its probability in a background corpus B:
  weight(w) = 1 if −2 log λ(w) > 10, and 0 otherwise
• λ(w) is the ratio between two likelihoods of observing w (x times in I and y times in B):
‣ assuming P(w|I) = P(w|B) = p = (x + y) / (N_I + N_B):
  C(N_I, x) p^x (1 − p)^(N_I − x) × C(N_B, y) p^y (1 − p)^(N_B − y)
‣ assuming P(w|I) = p_I and P(w|B) = p_B:
  C(N_I, x) p_I^x (1 − p_I)^(N_I − x) × C(N_B, y) p_B^y (1 − p_B)^(N_B − y)
‣ x and y are the counts of w in I and B; N_I and N_B are the total word counts of I and B; C(N, k) is the binomial coefficient (see the sketch below)
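A minimal sketch of the test in plain Python; the binomial coefficients cancel in the ratio, so only the likelihood terms are computed. The counts in the example are made up and assumed non-zero:

```python
import math

def log_likelihood(k, n, p):
    # log of p^k (1 - p)^(n - k); the C(n, k) terms cancel when taking the ratio.
    return k * math.log(p) + (n - k) * math.log(1 - p)

def neg2_log_lambda(x, n_i, y, n_b):
    # x, y: counts of the word in input corpus I and background corpus B;
    # n_i, n_b: total word counts of I and B.
    p = (x + y) / (n_i + n_b)        # shared probability under the null hypothesis
    p_i, p_b = x / n_i, y / n_b      # separate probabilities under the alternative
    null = log_likelihood(x, n_i, p) + log_likelihood(y, n_b, p)
    alt = log_likelihood(x, n_i, p_i) + log_likelihood(y, n_b, p_b)
    return -2 * (null - alt)

def weight(x, n_i, y, n_b):
    return 1 if neg2_log_lambda(x, n_i, y, n_b) > 10 else 0

# A word seen 50 times in a 10,000-word input but only 20 times
# in a 1,000,000-word background corpus is clearly salient:
print(weight(50, 10_000, 20, 1_000_000))  # 1
```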
Saliency of a Sentence?
• weight(s) = (1/|S|) Σ_{w ∈ S} weight(w)
• Only consider non-stop words in S
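A minimal sketch of this sentence score; word_weight and stopwords are assumed to come from the word-weighting methods above:

```python
def sentence_weight(sentence_tokens, word_weight, stopwords):
    # Average the weights of the non-stopword tokens in the sentence.
    content = [w for w in sentence_tokens if w not in stopwords]
    if not content:
        return 0.0
    return sum(word_weight.get(w, 0.0) for w in content) / len(content)
```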
Method 3: Sentence Centrality
• Alternative approach to ranking sentences
• Measure distance between sentences, and choose sentences that are closer to other sentences
• Use tf-idf BOW to represent a sentence
• Use cosine similarity to measure distance:
  centrality(s) = (1/#sent) Σ_{s′} cos_tfidf(s, s′)
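A minimal sketch of sentence centrality, assuming scikit-learn is available; the example sentences are shortened from the Mars passage below:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Mars experiences frigid weather conditions.",
    "Surface temperatures typically average about -70 degrees Fahrenheit at the equator.",
    "Most Martian weather involves blowing dust or carbon dioxide.",
]

tfidf = TfidfVectorizer().fit_transform(sentences)  # tf-idf bag-of-words per sentence
sim = cosine_similarity(tfidf)                       # pairwise cosine similarities
centrality = sim.mean(axis=1)                        # average similarity to all sentences
print(sentences[centrality.argmax()])                # most central sentence
```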
Final Extracted Summary
• Use top-ranked sentences as the extracted summary
‣ Saliency (tf-idf or log likelihood ratio)
‣ Centrality
Method 4: RST Parsing
With its distant orbit – 50 percent farther from the sun than Earth – and slim atmospheric blanket, Mars experiences frigid weather conditions. Surface temperatures typically average about -70 degrees Fahrenheit at the equator, and can dip to -123 degrees C near the poles. Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion, but any liquid water formed in this way would evaporate almost instantly because of the low atmospheric pressure. Although the atmosphere holds a small amount of water, and water-ice clouds sometimes develop, most Martian weather involves blowing dust or carbon dioxide.
Method 4: RST Parsing
• Rhetorical structure theory (L12, Discourse): explain how clauses are connected
• Define the types of relations between a nucleus (main clause) and a satellite (supporting clause)
Method 4: RST Parsing
• Nucleus more important than satellite
• A sentence that functions as a nucleus to more sentences = more salient
Which sentence is the best summary sentence?
PollEv.com/jeyhanlau569
Extractive: Multi-Doc
Summarisation System
• Similar to single-document extractive summarisation system
• Challenges:
‣ Redundancy in terms of information
‣ Sentence ordering
Content Selection
• We can use the same unsupervised content selection methods (tf-idf, log likelihood ratio, centrality) to select salient sentences
• But ignore sentences that are redundant
Maximum Marginal Relevance
• Iteratively select the best sentence to add to the summary
• Sentences to be added must be novel
• Penalise a candidate sentence if it is similar to already extracted sentences (see the sketch below):
  MMR-penalty(s) = λ · max_{s_i ∈ 𝒮} sim(s, s_i)
  where 𝒮 is the set of sentences extracted so far
• Stop when a desired number of sentences are added
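A minimal sketch of MMR-style selection in plain Python; salience, sim, lam and the summary length k are illustrative placeholders for the scores and similarity defined earlier:

```python
def mmr_select(sentences, salience, sim, k=3, lam=0.5):
    # Greedily pick the most salient sentence, penalised by its maximum
    # similarity to the sentences already selected.
    selected, candidates = [], list(sentences)
    while candidates and len(selected) < k:
        def score(s):
            penalty = max((sim(s, t) for t in selected), default=0.0)
            return salience[s] - lam * penalty
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```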
Information Ordering
• Chronological ordering:
‣ Order by document dates
• Coherence:
‣ Order in a way that makes adjacent sentences similar
‣ Order based on how entities are organised (centering theory, L12)
Sentence Realisation
• Make sure entities are referred to coherently
‣ Full name at first mention
‣ Last name at subsequent mentions
• Apply coreference methods to first extract names
• Write rules to clean up
Abstractive: Single-Doc
Example
Document: a detained iranian-american academic accused of acting against national security has been released from a tehran prison after a hefty bail was posted, a top judiciary official said tuesday
Summary: iranian-american academic held in tehran released on bail
• Paraphrase
• A very difficult task
• Can we train a neural network to generate the summary?
Encoder-Decoder?
[Figure: an encoder RNN reads the source sentence (牛吃草) and a decoder RNN generates the target sentence "cow eats grass", as in machine translation.]
• What if we treat:
‣ Source sentence = "document"
‣ Target sentence = "summary"
Encoder-Decoder?
[Figure: the encoder RNN reads the source document word by word ("a detained … tuesday") and the decoder RNN generates the summary ("iranian-american academic … bail").]
Source: a detained iranian-american academic accused of acting against national security has been released from a tehran prison after a hefty bail was posted, a top judiciary official said tuesday
Target: iranian-american academic held in tehran released on bail
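A minimal sketch of this document-to-summary setup, assuming PyTorch; the class name and dimensions are illustrative, not the exact architecture of the papers below:

```python
import torch
import torch.nn as nn

class Seq2SeqSummariser(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, doc_ids, summary_ids):
        # Encode the source document; keep only the final (h, c) state.
        _, state = self.encoder(self.embed(doc_ids))
        # Teacher-forced decoding of the target summary from that state.
        dec_out, _ = self.decoder(self.embed(summary_ids), state)
        return self.out(dec_out)  # logits over the vocabulary at each step

# Usage with random token ids of shape (batch, sequence length):
model = Seq2SeqSummariser(vocab_size=50_000)
logits = model(torch.randint(0, 50_000, (2, 40)), torch.randint(0, 50_000, (2, 12)))
```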
Data
• News headlines
• Document: first sentence of article
• Summary: news headline/title
• Technically more like a "headline generation task"
And It Kind of Works…
Rush et al. (2015): A Neural Attention Model for Abstractive Sentence Summarisation
More Summarisation Data
• But headline generation isn't really exciting…
• Other summarisation data:
‣ CNN/Dailymail: 300K articles, summary in bullets
‣ Newsroom: 1.3M articles, summary by authors
  – Diverse; 38 major publications
‣ XSum: 200K BBC articles
  – Summary is more abstractive than other datasets
Improvements
• Attention mechanism
• Richer word features: POS tags, NER tags, tf-idf
• Hierarchical encoders
‣ One LSTM for words
‣ Another LSTM for sentences
[Figure: word-level inputs feed a word LSTM; its hidden states feed a sentence-level LSTM, which produces the output.]
Nallapati et al. (2016): Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond
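A minimal sketch of the hierarchical encoder idea, assuming PyTorch; the shapes and the choice of the last word state as the sentence vector are illustrative rather than Nallapati et al.'s exact design:

```python
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)    # runs over words
        self.sent_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)  # runs over sentences

    def forward(self, doc_ids):
        # doc_ids: (num_sentences, max_words) token ids for one document.
        word_out, _ = self.word_lstm(self.embed(doc_ids))
        sent_vecs = word_out[:, -1, :]                        # one vector per sentence
        sent_out, _ = self.sent_lstm(sent_vecs.unsqueeze(0))  # (1, num_sentences, hidden)
        return sent_out
```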
Potential issues of an attention encoder-decoder summarisation system?
• Has the potential to generate new details not in the source document
• Unable to handle unseen words in the source document
• Information bottleneck: a vector is used to represent the source document
• Can only generate one summary
PollEv.com/jeyhanlau569
Encoder-decoder with Attention
See et al. (2017): Get To The Point: Summarization with Pointer-Generator Networks

Encoder-decoder with Attention + Copying
• P(Argentina) = (1 − p_gen) × P_attn(Argentina) + p_gen × P_voc(Argentina)
‣ (1 − p_gen) weights the attention distribution: the probability of "copying" the word from the source
‣ p_gen is a scalar, e.g. 0.8
See et al. (2017): Get To The Point: Summarization with Pointer-Generator Networks
Copy Mechanism
• Generate summaries that reproduce details in the document
• Can produce out-of-vocab words in the summary by copying them from the document (see the sketch below)
‣ e.g. smergle = out of vocabulary
‣ p(smergle) = attention probability + generation probability = attention probability (its generation probability is zero)
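A minimal sketch of this mixture, assuming PyTorch; the function name, shapes and extended-vocabulary handling are illustrative, following the general pointer-generator formulation rather than See et al.'s exact implementation:

```python
import torch

def final_distribution(p_gen, vocab_probs, attn_weights, src_ids, extended_vocab_size):
    # p_gen: (batch, 1) generation gate; vocab_probs: (batch, vocab_size);
    # attn_weights: (batch, src_len); src_ids: (batch, src_len) ids of source words
    # in an extended vocabulary that also covers source-only (out-of-vocab) words.
    batch, vocab_size = vocab_probs.shape
    dist = torch.zeros(batch, extended_vocab_size)
    dist[:, :vocab_size] = p_gen * vocab_probs   # probability mass from generating
    copy_mass = (1 - p_gen) * attn_weights       # probability mass from copying
    dist.scatter_add_(1, src_ids, copy_mass)     # add copy mass onto source words
    return dist  # P(w) = p_gen * P_voc(w) + (1 - p_gen) * sum of attention on w
```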
Latest Development
• State-of-the-art models use transformers instead of RNNs
• Lots of pre-training
• Note: BERT not directly applicable because we need a unidirectional decoder (BERT is only an encoder)
Evaluation
ROUGE
(Recall Oriented Understudy for Gisting Evaluation)
• Similar to BLEU: evaluates the degree of word overlap between the generated summary and the reference/human summary
• But recall oriented
• Measures overlap in N-grams separately (e.g. from 1 to 3)
• ROUGE-2: calculates the percentage of bigrams from the reference that are in the generated summary
ROUGE-2: Example
• Ref 1: Water spinach is a green leafy vegetable grown in the tropics.
• Ref 2: Water spinach is a commonly eaten leaf vegetable of Asia.
• Generated summary: Water spinach is a leaf vegetable commonly eaten in tropical areas of Asia.
• ROUGE-2 = (3 + 6) / (10 + 9) ≈ 0.47
‣ 3 and 6 are the numbers of reference bigrams (from Ref 1 and Ref 2) that appear in the generated summary; 10 and 9 are the total numbers of bigrams in Ref 1 and Ref 2 (see the sketch below)
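A minimal sketch of ROUGE-2 recall with naive whitespace tokenisation, reproducing the example above:

```python
def bigrams(text):
    toks = text.lower().replace(".", "").split()
    return [tuple(toks[i:i + 2]) for i in range(len(toks) - 1)]

def rouge2(references, generated):
    gen_bigrams = set(bigrams(generated))
    matched = total = 0
    for ref in references:
        ref_bigrams = bigrams(ref)
        total += len(ref_bigrams)                             # 10 + 9 reference bigrams
        matched += sum(b in gen_bigrams for b in ref_bigrams)  # 3 + 6 matches
    return matched / total

refs = ["Water spinach is a green leafy vegetable grown in the tropics.",
        "Water spinach is a commonly eaten leaf vegetable of Asia."]
gen = "Water spinach is a leaf vegetable commonly eaten in tropical areas of Asia."
print(rouge2(refs, gen))  # (3 + 6) / (10 + 9) ≈ 0.47
```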
A Final Word
• Research focus on single-document abstractive summarisation
‣ Mostly news data
• But many types of data for summarisation:
‣ Images, videos
‣ Graphs
‣ Structured data: e.g. patient records, tables
• Multi-document abstractive summarisation