
COMP5046
Natural Language Processing
Lecture 8: Language Model and Natural Language Generation
Dr. Caren Han
Semester 1, 2021
School of Computer Science, University of Sydney

0
Announcements
Thank you for all your hard work!
• We know Assignment 1 is tough!
• We really appreciate the effort you’re putting into this course!

0
LECTURE PLAN
I know your brain is working only on Assignment 1 right now, so please sit back and enjoy today's easy-going lecture! Prepare your popcorn and a drink.

0
The course topics
What will you learn in this course?
Week 1: Introduction to Natural Language Processing (NLP)
Week 2: Word Embeddings (Word Vector for Meaning)
Week 3: Word Classification with Machine Learning I
Week 4: Word Classification with Machine Learning II
NLP and Machine Learning
Week 5: Language Fundamental
Week 6: Part of Speech Tagging
Week 7: Dependency Parsing
Week 8: Language Model and Natural Language Generation
NLP Techniques
Week 9: Information Extraction: Named Entity Recognition
Advanced Topic
Week 10: Advanced NLP: Attention and Reading Comprehension
Week 11: Advanced NLP: Transformer and Machine Translation
Week 12: Advanced NLP: Pretrained Model in NLP
Week 13: Future of NLP and Exam Review

0
LECTURE PLAN
Lecture 8: Language Model and Natural Language Generation
1. Language Model
2. Traditional Language Model
3. Neural Language Model
4. Natural Language Generation
5. NLG Tasks
6. Language Model and NLG Evaluation

1
Language Model
What is Language Model
• is the task of predicting what word comes next based on the given words.
• is a probabilistic model which predicts the probability that a sequence of tokens belongs to a language.
"Can you come ___?"  (candidate next words: here, there, close, hear, their, further, …)
Given a sequence of words x1, x2, x3 (Can, you, come), compute the probability distribution of the next word (which can be any word in the vocabulary).

1
Language Model
What is Language Model
• is the task of predicting what word comes next based on the given words.
• is a probabilistic model which predicts the probability that a sequence of
tokens belongs to a language.
"Can you come here"  vs  "Can you come there"
P(Can, you, come, here)  vs  P(Can, you, come, there)
P(here | Can, you, come)  vs  P(there | Can, you, come)

1
Language Model
What is Language Model
• is the task of predicting what word comes next based on the given words.
• is a probabilistic model which predicts the probability that a sequence of
tokens belongs to a language.
"Can you come here"  vs  "Can you come there"
P(Can, you, come, here)  vs  P(Can, you, come, there)
P(here | Can, you, come)  vs  P(there | Can, you, come)  ← Conditional Probability

1
Language Model
Language Modeling in NLP
• The probabilities returned by a language model are mostly useful to compare the likelihood that different sentences are "good sentences". Useful in many practical tasks, for example:
Spell correction / Automatic Speech Recognition
• "I would like to read that bookt" → closest words = [book, boog, boat, …]
Natural Language Generation
• Dialogue (chit chat and task-based)
• Abstractive Summarisation
• Machine Translation
• Creative Writing: Story Telling, …

1
Language Model
Do we use Language Models?

1
Language Model
Yes, but they sometimes fail…

1
Language Model
Language Model in Dialog System
Without anti-language model With anti-language model

1
Language Model
Language Modeling in Natural Language Generation
Conditional Language Modeling: the task of predicting the next word, given the words so far, and also some other input x.
Natural Language Generation Tasks
• Dialogue (chit chat and task-based) x=dialogue history, y=next utterance
• Abstractive Summarisation x=input text, y=summarized text
• Machine Translation (in later week) x=source sentence, y=target sentence

1
Language Models
Tips for using a Language Model (you already knew this!)
It is extremely important to collect the corpus and train the model on documents from the domain in which your system/application will be used.
Medical Documents
Financial Documents

0
LECTURE PLAN
Lecture 8: Language Model and Natural Language Generation
1. Language Model
2. Traditional Language Model
3. Neural Language Model
4. Natural Language Generation
5. NLG Tasks
6. Language Model and NLG Evaluation

2
Traditional Language Models
Statistical Language Model (SLM)
• Statistical language modelling: assign a probability to a whole word sequence by decomposing it with the chain rule of probability:
“An adorable little boy is spreading smiles”
P(An, adorable, little, boy, is, spreading, smiles)
= P(An) × P(adorable|An) × P(little|An adorable) × P(boy|An adorable little) × P(is|An adorable little boy) × P(spreading|An adorable little boy is) × P(smiles|An adorable little boy is spreading)

2
Traditional Language Models
Statistical Language Model (SLM)
• Conditional Language Modeling: the task of predicting the next word, given the words so far, and also some other input x:
"An adorable little boy is" → what is P(is | An adorable little boy)?
Simplest method: count occurrences in the trained corpus
P(is | An adorable little boy) = Count(An adorable little boy is) / Count(An adorable little boy) = 30/100 = 0.3
Trained corpus (excerpts): "… An adorable little boy is …", "… An adorable little boy is …", "… An adorable little boy laughed …"
Q: What if there is no 'An adorable little boy is' phrase in the corpus?

2
Traditional Language Models
N-gram Language Models
• An N-gram is a sequence of N words.
• An N-gram model predicts the probability of a given N-gram within
any sequence of words in the language.
“An adorable little boy is spreading smiles”
An n-gram is a chunk of n consecutive words.
• unigrams: an, adorable, little, boy, is, spreading, smiles
• bigrams: an adorable, adorable little, little boy, boy is, is spreading, spreading smiles
• trigrams: an adorable little, adorable little boy, little boy is, boy is spreading, is spreading smiles
• 4-grams: an adorable little boy, adorable little boy is, little boy is spreading, boy is spreading smiles
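To make the chunking concrete, here is a minimal sketch (not part of the lecture code; the `ngrams` helper name is made up) that extracts these n-grams:

```python
# Hypothetical helper: slide a window of size n over the token list.
def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "an adorable little boy is spreading smiles".split()
print(ngrams(sentence, 1))  # unigrams: ('an',), ('adorable',), ...
print(ngrams(sentence, 2))  # bigrams: ('an', 'adorable'), ('adorable', 'little'), ...
print(ngrams(sentence, 3))  # trigrams: ('an', 'adorable', 'little'), ...
```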

2
Traditional Language Models
N-gram Language Models: Exercise
• Assume that we learn a trigram language model
"An adorable little boy is spreading ___?"
Only the last n-1 = (3-1) words are used as the context:
P(w | is spreading) = Count(is spreading w) / Count(is spreading)
Trained corpus (excerpts): "… boy is spreading smiles …", "… boy is spreading rumours …", "… An adorable little boy is spreading …"
P(rumours | is spreading) = Count(is spreading rumours) / Count(is spreading) = 500/1000 = 0.5
P(smiles | is spreading) = Count(is spreading smiles) / Count(is spreading) = 200/1000 = 0.2
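The counting estimate from this exercise is easy to reproduce. The sketch below is illustrative only: the toy corpus and the resulting probabilities are made up and do not match the 500/1000 counts on the slide.

```python
from collections import Counter

# Toy corpus for illustration; real corpora are of course much larger.
corpus = [
    "boy is spreading smiles",
    "boy is spreading rumours",
    "an adorable little boy is spreading rumours",
]
trigram_counts, bigram_counts = Counter(), Counter()
for line in corpus:
    tokens = line.split()
    trigram_counts.update(zip(tokens, tokens[1:], tokens[2:]))
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_trigram(w, context):
    """P(w | context) = Count(context + w) / Count(context), with context = (w1, w2)."""
    return trigram_counts[context + (w,)] / bigram_counts[context]

print(p_trigram("rumours", ("is", "spreading")))  # 2/3 on this toy corpus
print(p_trigram("smiles", ("is", "spreading")))   # 1/3 on this toy corpus
```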

2
Traditional Language Models
N-gram Language Models: Beautiful Formula ☺
• Simplifying assumption: the next word x_t depends only on the preceding n-1 words:
P(x_t | x_{t-1}, …, x_1) ≈ P(x_t | x_{t-1}, …, x_{t-n+1}) = Count(x_{t-n+1} … x_{t-1} x_t) / Count(x_{t-n+1} … x_{t-1})
• How do we get these n-gram and (n-1)-gram probabilities? By counting them!

2
Traditional Language Models
N-gram Language Model Limitation: Trade-off Issue
• Mostly, n=2 works better than n=1 in an n-gram language model
• We learned a trigram language model: "An adorable little boy is spreading ___?"
Only the last n-1 = (3-1) words are used: P(w | is spreading) = Count(is spreading w) / Count(is spreading)
• Finding the optimal n is important! (OOV issue vs. model size issue)
• Need to store counts for all n-grams that you saw in the corpus.
• If you increase n or the corpus size, the model size increases!

2
Traditional Language Models
N-gram Language Model Limitation: Zero Count Issue
P(w|is spreading) =
Count(is spreading w) / Count(is spreading)
1. What if the ‘is spreading w’ phrase never occurred in the corpus? The probability will be 0.
• Alternative solution: Smoothing (Add small 𝛿 to the count
for every w in the corpus)
2. What if the ‘is spreading’ phrase never occurred in the corpus? It is impossible to calculate the probability for any w.
• Alternative solution: Backoff (Just condition on “spreading”
instead)
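A minimal sketch of both fixes, continuing the toy counting example above (the δ value and the backoff scheme shown here are simplified assumptions; real systems use more careful smoothing):

```python
# Continuing the toy counts above; unigram counts are needed for backoff.
unigram_counts = Counter()
for line in corpus:
    unigram_counts.update((w,) for w in line.split())

DELTA = 0.1  # small additive-smoothing constant

def p_smoothed(w, context, vocab_size):
    """Add-delta smoothing: every trigram gets a small pseudo-count, so P is never 0."""
    num = trigram_counts[context + (w,)] + DELTA
    den = bigram_counts[context] + DELTA * vocab_size
    return num / den

def p_backoff(w, context):
    """Backoff: if the context bigram was never seen, condition on the shorter context."""
    if bigram_counts[context] > 0:
        return trigram_counts[context + (w,)] / bigram_counts[context]
    shorter = context[1:]  # e.g. condition on "spreading" instead of "is spreading"
    return bigram_counts[shorter + (w,)] / unigram_counts[shorter]
```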

2
Traditional Language Models
N-gram Language Model Demo
Generating text with an n-gram language model

Playing with n-grams

2
Traditional Language Models
Try some Language Model – Word Prediction
Generating text with an n-gram
(Yu et al. ACL 2018) https://www.aclweb.org/anthology/C18-2028.pdf

0
LECTURE PLAN
Lecture 8: Language Model and Natural Language Generation
1. Language Model
2. Traditional Language Model
3. Neural Language Model
4. Natural Language Generation
5. NLG Tasks
6. Language Model and NLG Evaluation

3
Neural Language Model
Traditional Neural Language Model
"An adorable little boy is spreading ___?"
Fixed window (window size = 3): one-hot vectors (boy, is, spreading) → word embeddings (optional) → hidden layer (tanh) → softmax → output distribution over the vocabulary (e.g. smiles, butter, …)
Pros
• No trade-off issue
Cons
• Window size selection issue (increasing the window size enlarges W)
• Input vectors are multiplied by completely different weights in W (no symmetry in how the inputs are processed)

3
Neural Language Model
Recap: RNN (Recurrent Neural Network)
Neural Network + Memory = Recurrent Neural Network
[Diagram: input layer → hidden layer (memory) → output layer; the hidden state is fed back as memory at the next step]
NOTICE: the same function and the same set of parameters W are used at every time step

3
Neural Language Model
Recap: RNN (Recurrent Neural Network)
Neural Network + Memory = Recurrent Neural Network
NOTICE: the same function and the same set of parameters W are used at every time step
h_t = f_W(h_{t-1}, x_t)
new hidden state = a function f with parameters W, applied to the previous state h_{t-1} and the current input x_t

3
Neural Language Model
RNN-based Language Model
"An adorable little boy is spreading ___?"
Architecture: one-hot vectors (boy, is, spreading) → word embeddings e_t (optional) → recurrent hidden layer (tanh) → softmax output distribution
h_t = tanh(W_h h_{t-1} + W_e e_t + b)
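A minimal PyTorch-style sketch of one step of this recurrent computation (all dimensions and weight names are made up for illustration; the same weights are reused at every time step):

```python
import torch

V, d_emb, d_hid = 10000, 100, 128   # vocab size, embedding size, hidden size (made up)
E   = torch.randn(V, d_emb)         # word embedding matrix
W_e = torch.randn(d_hid, d_emb)     # input-to-hidden weights
W_h = torch.randn(d_hid, d_hid)     # hidden-to-hidden weights
b   = torch.zeros(d_hid)
U   = torch.randn(V, d_hid)         # hidden-to-vocabulary output weights

def rnn_lm_step(word_id, h_prev):
    """One step: h_t = tanh(W_h h_{t-1} + W_e e_t + b), then a softmax over the vocab."""
    e_t = E[word_id]                               # embedding of the current input word
    h_t = torch.tanh(W_h @ h_prev + W_e @ e_t + b)
    probs = torch.softmax(U @ h_t, dim=-1)         # output distribution over the vocabulary
    return probs, h_t
```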

3
Neural Language Model
RNN-based Language Model
"An adorable little boy is spreading ___?"
Pros
• Can process input of any length
• Can use information from many steps back
• Model size does not increase with a longer input
• Same weights are applied at every time step (symmetry)
Cons
• Slow computation
• Difficult to access information from many steps back (remember what we learned in Lecture 4?)

3
Neural Language Model
Training a RNN-based Language Model
"An adorable little boy is spreading ___?"
[Diagram: at each time step, the one-hot input word (boy, is, spreading, …) is embedded (optional), passed through the tanh recurrent layer and a softmax, and the predicted distribution is compared against the true next word to compute the loss.]

3
Neural Language Model
RNN-based Language Model
"An adorable little boy is spreading ___?"
You can use an RNN language model to generate text by repeated sampling: sample a word (e.g. is, spreading, smiles, …) from the output distribution, and feed the sampled output back in as the next step's input.
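A minimal sketch of generation by repeated sampling, reusing the `rnn_lm_step` sketch above (the end-of-sentence handling is an assumption):

```python
def generate_by_sampling(first_word_id, max_len=20, eos_id=None):
    """Sample a word at each step and feed it back in as the next step's input."""
    h, word_id, output = torch.zeros(d_hid), first_word_id, []
    for _ in range(max_len):
        probs, h = rnn_lm_step(word_id, h)
        word_id = torch.multinomial(probs, num_samples=1).item()  # sample the next word
        if eos_id is not None and word_id == eos_id:              # stop at <END>, if defined
            break
        output.append(word_id)
    return output
```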

3
Neural Language Model
Recap: Character-based RNN Language Model
[Example: one-hot input characters "s", "p", "a", "c"; target characters "p", "a", "c", "e"]

3
Neural Language Model
RNN with trained language model
During training, we feed the gold (aka reference) target, regardless of what each cell predicts. This training method is called Teacher Forcing.
Without Teacher Forcing With Teacher Forcing
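A minimal sketch of teacher forcing for the RNN language model above, reusing the `rnn_lm_step` sketch (loss bookkeeping and the optimiser are omitted; the interface is an assumption):

```python
def teacher_forcing_loss(gold_ids):
    """At step t, feed the gold word gold_ids[t], regardless of what the model predicted."""
    h, loss = torch.zeros(d_hid), 0.0
    for t in range(len(gold_ids) - 1):
        probs, h = rnn_lm_step(gold_ids[t], h)            # input: the gold word at step t
        loss = loss - torch.log(probs[gold_ids[t + 1]])   # target: the gold word at step t+1
    return loss / (len(gold_ids) - 1)                     # average negative log-likelihood
```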

3
Neural Language Model
Seq2Seq Model with trained language model
During training, we feed the gold (aka reference) target sentence into the decoder, regardless of what the decoder predicts. This training method is called Teacher Forcing.
[Diagram: the encoder (one-hot vectors → embedding layer → recurrent layer) reads "How are you ?"; the decoder (embedding layer → recurrent layer → output layer) is fed the gold target sentence "I am fine" step by step, regardless of what it predicts.]

0
LECTURE PLAN
Lecture 8: Language Model and Natural Language Generation
1. Language Model
2. Traditional Language Model
3. Neural Language Model
4. Natural Language Generation
5. NLG Tasks
6. Language Model and NLG Evaluation

4
Natural Language Generation
Decoding Algorithm
Now we have trained the conditional neural language model!
How do we use the language model to generate text?
1) Greedy Decoding or 2) Beam Search

4
Natural Language Generation
Decoding Algorithm 1: Greedy Decoding
• Generate/decode the sentence by taking the argmax at each step of the decoder
• Take the most probable word at each step
• Use that as the next word, and feed it as input at the next step
• Keep going until you produce the <END> token
[Diagram: the decoder produces "I am fine <END>" by repeated argmax]
Issue
• Greedy decoding has no way to undo decisions (no backtracking)! The output can be ungrammatical or unnatural.
• How to fix this issue?
Exhaustive search decoding: we could try computing all possible sequences
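A minimal sketch of greedy decoding using the `rnn_lm_step` sketch from earlier (in a real seq2seq model the decoder would also condition on the encoder output; that part is omitted here):

```python
def greedy_decode(start_id, eos_id, max_len=20):
    h, word_id, output = torch.zeros(d_hid), start_id, []
    for _ in range(max_len):
        probs, h = rnn_lm_step(word_id, h)
        word_id = int(torch.argmax(probs))   # take the single most probable word
        if word_id == eos_id:                # stop once <END> is produced
            break
        output.append(word_id)
    return output                            # no backtracking: early mistakes cannot be undone
```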

4
Natural Language Generation
Decoding Algorithm: Beam Search
[Figure: a standard beam search over an alphabet {ε, a, b} with a beam size of 3.]

4
Natural Language Generation
Decoding Algorithm: Beam Search
• A search algorithm which aims to find a high-probability sequence (not necessarily the optimal sequence, though) by tracking multiple possible sequences at once.
• On each step of the decoder, keep track of the k most probable partial sequences (which we call hypotheses)
• k is the beam size (in practice around 5 to 10)
• After you reach some stopping criterion, choose the sequence with the highest probability (factoring in some adjustment for length)
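A minimal beam-search sketch. It assumes a function `next_log_probs(prefix)` returning a dictionary {word: log P(word | prefix)}; that interface, the token names and the length normalisation are illustrative assumptions, not the lecture's exact formulation:

```python
def beam_search(next_log_probs, k=5, max_len=20, eos="<END>"):
    beams = [(["<START>"], 0.0)]          # (partial sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for word, logp in next_log_probs(seq).items():
                candidates.append((seq + [word], score + logp))
        # keep only the k highest-scoring hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        finished += [b for b in beams if b[0][-1] == eos]
        beams = [b for b in beams if b[0][-1] != eos]
        if not beams:
            break
    # choose the best hypothesis, adjusting the score for length
    pool = finished or beams
    return max(pool, key=lambda c: c[1] / len(c[0]))
```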

4
Natural Language Generation
Decoding Algorithm: Beam Search
Assume that k (beam size) = 2
Take the top k words and compute scores:
he: -0.5 = log P_LM(he | <START>)
I:  -0.1 = log P_LM(I | <START>)

4
Natural Language Generation
Decoding Algorithm: Beam Search
Assume that k (beam size) = 2
For each hypothesis, find the top k next words and calculate scores:
he was: -1.5 = log P_LM(was | <START>, he) + (-0.5)
he is:  -2.5 = log P_LM(is | <START>, he) + (-0.5)
I am:   -1.1 = log P_LM(am | <START>, I) + (-0.1)
I was:  -2.1 = log P_LM(was | <START>, I) + (-0.1)

4
Natural Language Generation
Decoding Algorithm: Beam Search
Assume that k (beam size) = 2
Of these k² hypotheses, keep only the k with the highest scores:
I am: -1.1 (kept)    he was: -1.5 (kept)    I was: -2.1 (pruned)    he is: -2.5 (pruned)

4
Natural Language Generation
Decoding Algorithm: Beam Search
Assume that k (beam size) = 2
For each surviving hypothesis, find the top k next words and calculate scores:
he was great: -3.1 = log P_LM(great | <START>, he, was) + (-1.5)
he was nice:  -3.5 = log P_LM(nice | <START>, he, was) + (-1.5)
I am fine:    -2.1 = log P_LM(fine | <START>, I, am) + (-1.1)
I am sad:     -2.8 = log P_LM(sad | <START>, I, am) + (-1.1)

4
Natural Language Generation
Decoding Algorithm: Beam Search
Assume that k (beam size) = 2
Of these k² hypotheses, keep only the k with the highest scores:
I am fine: -2.1 (kept)    I am sad: -2.8 (kept)    he was great: -3.1 (pruned)    he was nice: -3.5 (pruned)


4
Natural Language Generation
Decoding Algorithm: Beam Search
Assume that k (beam size) = 2
Once the stopping criterion is reached, select the top-scoring hypothesis: "I am fine" (score -2.1).

4
Natural Language Generation
The effect of beam size k
• Small k has similar problems to greedy decoding (k=1). Why?
• Large k means you consider more hypotheses
• Solves the issues of greedy decoding
• But produces other issues:
• Computationally expensive
• In open-ended tasks like chit-chat dialogue, large k can make the output more generic

4
Natural Language Generation
The effect of beam size k in chit chatbot
Human: "I mostly eat a fresh and raw diet, so I save on groceries"

Beam size   Machine answer
1           I love to eat healthy and eat healthy
2           That is a good thing to have
3           I am a nurse so I do not eat raw food
4           I am a nurse so I am a nurse
5           Do you have any hobbies?
6           What do you do for a living?
7           What do you do for a living?
8           What do you do for a living?

Lower beam size: more on topic but non-sensical
Higher beam size: converges to a safe, "correct" response, but it's generic and less relevant

4
Natural Language Generation
The effect of beam size k in chit chatbot
Beam size=10 Beam size=10 and anti-language model

4
Natural Language Generation
Sampling-based decoding
Pure sampling
• On each step t, randomly sample from the probability distribution P_t to obtain your next word.
• Like greedy decoding, but using a sample instead of the argmax
Top-n sampling*
• On each step t, randomly sample from P_t, restricted to just the top-n most probable words
• Like pure sampling, but truncate the probability distribution
• n=1 is greedy search, n=V is pure sampling
• Increase n to get more diverse/risky output
• Decrease n to get more generic/safe output
*Usually called top-k sampling, but here we're avoiding confusion with beam size k
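A minimal sketch of top-n sampling over a next-word probability vector (the value of n is something you tune; 40 here is just a placeholder):

```python
import torch

def top_n_sample(probs, n=40):
    """Sample from the distribution truncated to the n most probable words."""
    top_probs, top_ids = torch.topk(probs, n)   # keep the top-n words
    top_probs = top_probs / top_probs.sum()     # renormalise the truncated distribution
    choice = torch.multinomial(top_probs, 1)    # sample within the truncated distribution
    return int(top_ids[choice])                 # n=1 behaves like greedy; n=V is pure sampling
```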

4
Natural Language Generation
Natural Language Generation
Dialog Tree from Westworld

4
Natural Language Generation
Language Model

0
LECTURE PLAN
Lecture 8: Language Model and Natural Language Generation
1. Language Model
2. Traditional Language Model
3. Neural Language Model
4. Natural Language Generation
5. NLG Tasks
6. Language Model and NLG Evaluation

5
NLG Tasks
Language Modeling in Natural Language Generation
Natural Language Generation Tasks
• Dialogue (chit chat and goal-oriented conversational agent) x=dialogue history, y=next utterance
• Abstractive Summarisation x=input text, y=summarized text

5
NLG Tasks
Dialog: Conversational Agent
A conversational agent is a software program which interprets and responds to statements made by users in ordinary natural language. It integrates computational linguistics techniques with communication over the internet
Conversational Agent Framework
(axes: open-domain vs. closed-domain conversations; retrieval-based vs. generative-based responses)
• Open domain + retrieval-based responses → Impossible
• Open domain + generative-based responses → General AI [Hardest]
• Closed domain + retrieval-based responses → Rule-based [Easiest]
• Closed domain + generative-based responses → Smart Machine [Hard]

5
NLG Tasks
Conversational Agent
A conversational agent is a software program which interprets and responds to statements made by users in ordinary natural language. It integrates computational linguistics techniques with communication over the internet
Goal-oriented Conversational Agent
Designed for a particular task, utilizing short conversations to get information from the user to help complete this task
Chatbots (Chat-oriented Conversational Agent)
Designed to handle full conversations, mimicking the unstructured flow of a human to human conversation

5
NLG Tasks
Goal-oriented Conversational Agent
Designed for a particular task, utilizing short conversations to get information from the user to help complete this task
Apple Siri
IBM Watson BankBot

5
NLG Tasks
Goal-oriented Conversational Agent
Frame-based Approach
• Based on a "domain ontology": a knowledge structure representing user intentions
• One or more frames, each a collection of slots, each slot having a value
• A set of slots, to be filled with information of a given type, each associated with a question to the user

Slot        Type   Question
ORIGIN      city   What city are you leaving from?
DEST        city   Where are you going?
DEPT DATE   date   What day would you like to leave?
DEPT TIME   time   What time would you like to leave?
AIRLINE     line   What is your preferred airline?

5
NLG Tasks
Goal-oriented Conversational Agent
Dialogue is structured as a sequence of predetermined utterances:
• Ask the user for a departure city
• Ask for a destination city
• Ask for a time
• Ask whether the trip is round-trip or not

5
NLG Tasks
Goal-oriented Conversational Agent
• System completely controls the conversation with the user.
• It asks the user a series of questions
• Ignoring (or misinterpreting) anything the user says that is not a direct
answer to the system’s questions

5
NLG Tasks
Dialogue Initiative
Systems that control the conversation like this are called system-initiative or single-initiative systems
Initiative: who has control of conversation
In normal human to human dialogue, initiative shifts back and forth between participants

5
NLG Tasks
System Initiative
System completely controls the conversation
• Simple to build
• User always knows what they can say next
• System always knows what the user can say next
• Good for very simple tasks (entering a credit card number, booking a flight)
• Too limited: it does not generate any new text, it just picks a response from a fixed set
• A lot of hard-coded rules have to be written, so not very intelligent

5
NLG Tasks
System Initiative: Issue
“Hi, I’d like to fly from Sydney Tuesday morning; I want a flight from Melbourne to Perth one way leaving after 5 p.m. on Wednesday.”
– The user answers more than one question in a single sentence

5
NLG Tasks
Mixed Initiative
Conversational initiative can shift between system and user
“Hi, I’d like to fly from Sydney Tuesday morning; I want a flight from Melbourne to Perth one way leaving after 5 p.m. on Wednesday.”
A kind of mixed initiative
• Use the structure of the frame to guide the dialogue
• The system asks questions of the user, filling any slots that the user specifies
• When the frame is filled, do the database query
• If the user answers 3 questions at once, the system can fill 3 slots and not ask these questions again!

5
NLG Tasks
Mixed Initiative
• There are many ways to represent the meaning of sentences
• For speech dialogue systems, most common approach is
“Frame and slot semantics”.
“Show me morning flights from Sydney to Perth on Tuesday.”
DOMAIN:       AIR-TRAVEL
INTENT:       SHOW-FLIGHTS
ORIGIN-CITY:  Sydney
ORIGIN-DATE:  Tuesday
ORIGIN-TIME:  morning
DEST-CITY:    Perth

5
NLG Tasks
Condition-Action Rules
Active Ontology: Relational network of concepts
• Data structures: a meeting has:
• a date and time,
• a location,
• a topic
• a list of attendees
• Rule sets that perform actions for concepts
• e.g. the date concept turns the string "Monday at 2pm" into a date object date(DAY, MONTH, YEAR, HOURS, MINUTES)
Rule: Condition + Action

5
NLG Tasks
Improvements to the Rule-based Approach
Machine Learning classifiers to map words to semantic frame-fillers
Given a set of labeled sentences
• “I want to fly to Sydney on Tuesday”
• Destination: Sydney
• Depart-date: Tuesday
Build a classifier to map from one to the other
Requirements: Lots of Labeled Data

5
NLG Tasks
Conversational Agent
A conversational agent is a software program which interprets and responds to statements made by users in ordinary natural language. It integrates computational linguistics techniques with communication over the internet
Chatbot
Designed to handle full conversations, mimicking the unstructured flow of a human to human conversation

5
NLG Tasks
Chatbot
Designed to handle full conversations, mimicking the unstructured flow of a human to human conversation
Rule-based
• Pattern-Action Rules (Eliza)
• Pattern-Action Rules + A mental model (Parry)
Corpus-based (from large chat corpus)
• Information Retrieval
• Deep Neural Networks

5
NLG Tasks
Chatbot: Eliza (1966)
Try Eliza
http://psych.fullerton.edu/mbirnbaum/psych101/Eliza.htm https://playclassic.games/game/play-eliza-online/play/

5
NLG Tasks
Chatbot: Eliza (1966)
Domain: Rogerian Psychology Interview
• Draw the patient out by reflecting patient’s statements back at them
• Rare type of conversation in which one can “assume the pose of
knowing almost nothing of the real world”
Patient: "I went for a long boat ride"
Psychiatrist: "Tell me about boats"
• You don’t assume she didn’t know what a boat is
• You assume she had some conversational goal

5
NLG Tasks
Chatbot: Eliza (1966)
Pattern matching
if the input matches
(first bunch of words) "you" (second bunch of words) "me"
respond with
"What makes you think I" (second bunch of words) "you?"
if the input matches
"You are" (bunch of words)
respond with
"So, I'm" (bunch of words) ", am I?"
Very basic reconstruction rules
"me" → "you", "my" → "your", etc.
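A minimal sketch of Eliza-style pattern-action rules and reconstruction rules (the regular expressions and word swaps below are simplified illustrations, not Eliza's actual rule set):

```python
import random
import re

SWAPS = {"me": "you", "my": "your", "i": "you", "you": "I"}

def reflect(text):
    """Very basic reconstruction rules: 'me' -> 'you', 'my' -> 'your', etc."""
    return " ".join(SWAPS.get(w, w) for w in text.lower().split())

RULES = [
    (re.compile(r"(.*) you (.*) me", re.I), "What makes you think I {1} you?"),
    (re.compile(r"you are (.*)", re.I),     "So, I'm {0}, am I?"),
    (re.compile(r".*\bmother\b.*", re.I),   "Don't you talk about my mother"),
]
STOCK = ["Tell me more", "Fascinating", "I see"]   # when all else fails

def eliza_reply(utterance):
    for pattern, template in RULES:
        match = pattern.match(utterance)
        if match:
            return template.format(*(reflect(g) for g in match.groups()))
    return random.choice(STOCK)

print(eliza_reply("I believe you hate me"))  # -> "What makes you think I hate you?"
```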

5
NLG Tasks
Chatbot: Eliza (1966)
Some programmed responses to special keywords
if the word “mother” appears anywhere, reply with “Don’t you talk about my mother”
Randomisation to avoid getting stuck in a rut
When all else fails, some stock responses:
"Tell me more", "Fascinating", "I see"

5
NLG Tasks
Chatbot: Parry (1972)
Same pattern-response structure as Eliza
Persona
• 28-year-old single man, post office clerk
• No siblings and lives alone
• Sensitive about his physical appearance, his family, his religion, his education and the topic of sex
• Hobbies are movies and gambling on horse racing
• Recently attacked a bookie, claiming the bookie did not pay off on a bet
• Afterwards worried about possible underworld retaliation
• Eager to tell his story to non-threatening listeners

5
NLG Tasks
Chatbot: Parry (1972)

5
NLG Tasks
Chatbot
Rule-based
• Pattern-Action Rules (Eliza)
• Pattern-Action Rules + A mental model (Parry)
Corpus-based (from large chat corpus)
• Information Retrieval
• Deep Neural Networks

5
NLG Tasks
Information Retrieval (IR) based Chatbot
• Mine conversations of human chats or human-machine chats
• Microblogs: Twitter etc.
• Movie Dialogs
• With large corpus
Cleverbot Microsoft Xiaoice
Microsoft Tay

https://www.cleverbot.com/
https://arxiv.org/pdf/1812.08989.pdf

5
NLG Tasks
Information Retrieval (IR) based Chatbot
Xiaoice

5
NLG Tasks
Information Retrieval (IR) based Chatbot
1. Return the response to the most similar turn
• Take the user's turn q and find a (tf-idf) similar turn t in the corpus C
q = "do you like Doctor Who"    t = "do you like Doctor Strangelove"
• Grab whatever the response was to t → "Yes, love it!"
2. Return the most similar turn itself → "Do you like Doctor Strangelove?"
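A minimal sketch of the tf-idf retrieval step with scikit-learn (the two-turn corpus and the responses are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

turns = ["do you like Doctor Strangelove", "what is your favourite food"]  # corpus turns t
responses = ["Yes, love it!", "Anything with noodles."]                    # responses to t

vectorizer = TfidfVectorizer()
turn_vectors = vectorizer.fit_transform(turns)

def reply(user_turn, return_response=True):
    """Find the (tf-idf cosine) most similar stored turn t to the user's turn q."""
    q = vectorizer.transform([user_turn])
    best = cosine_similarity(q, turn_vectors).argmax()
    # Strategy 1: return the response to the most similar turn.
    # Strategy 2: return the most similar turn itself.
    return responses[best] if return_response else turns[best]

print(reply("do you like Doctor Who"))                         # -> "Yes, love it!"
print(reply("do you like Doctor Who", return_response=False))  # -> "do you like Doctor Strangelove"
```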

5
NLG Tasks
Information Retrieval (IR) based Chatbot
1. Also fine to use other features like user features, or prior turns
2. Or non-dialogue text
• COBOT chatbot (Isbell et al., 2000)
• Sentences from the Unabomber Manifesto by Theodore Kaczynski, articles on alien abduction, the scripts of "The Big Lebowski" and "Planet of the Apes"
3. Wikipedia text

5
NLG Tasks
Deep-learning Chatbots
• Think of response generation as a task of transducing from the user’s prior turn to the system’s turn.
• Train on:
• Movie Dialogs
• Twitter Conversations
• Train a deep neural network
• Map from user 1 turn to user 2 response

5
NLG Tasks
Seq2seq model architecture

5
NLG Tasks
Seq2seq model architecture

5
NLG Tasks
Deep learning chatbots
Trained on 127M Twitter context-message-response triples

5
NLG Tasks
Neural based NLG in Dialog: Issue
• Problem: it became apparent that a naïve application of standard seq2seq methods has serious, pervasive deficiencies for (chit-chat) dialogue:
• Either because the response is generic (e.g. "I don't know")
• Or because it changes the subject to something unrelated
• Boring response
• Repetition problem
• Lack of consistent persona problem
What else do we have?

NLG Tasks
Template-based generation
• The most common approach in spoken natural language generation.
• In its simplest form, words fill in slots:
“Flights from ORIGIN to DEST on DEPT_DATE DEPT_TIME. Just one moment please”
• Most common NLG used in commercial systems
• Used in conjunction with concatenative TTS (text-to-speech) to make
natural sounding output
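A minimal sketch of template-based slot filling (the slot names are assumptions; the template text comes from the slide):

```python
TEMPLATE = ("Flights from {ORIGIN} to {DEST} on {DEPT_DATE} {DEPT_TIME}. "
            "Just one moment please")

def realise(frame):
    """Fill the template's slots with the values collected in the frame."""
    return TEMPLATE.format(**frame)

print(realise({"ORIGIN": "Sydney", "DEST": "Perth",
               "DEPT_DATE": "Tuesday", "DEPT_TIME": "morning"}))
# -> Flights from Sydney to Perth on Tuesday morning. Just one moment please
```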
5

5
NLG Tasks
Template-based generation
Pros
• Conceptually Simple: No specialized knowledge required to develop
• Tailored to the domain, so often good quality
Cons
• Lacks generality: Repeatedly encode linguistic rules (e.g. subject-verb agreement)
• Little variation in style
• Difficult to grow/maintain: Each utterance must be manually added
Improvement?
• Need deeper utterance representations
• Linguistic rules to manipulate them

5
NLG Tasks
Rule-based Generation
What to say / How to say
Content Planning
• What information must be communicated?
• Content selection and ordering
Sentence Planning
• What words and syntactic constructions will be used to describe the content?
• Aggregation: What elements can be grouped together for more natural-sounding, succinct output?
• Lexicalisation: What words are used to express the various entities?
Realisation
• How is it all combined into a sentence that is syntactically and morphologically correct?
Pipeline: Content Planner → Sentence Planner → Surface Realiser

5
NLG Tasks
Rule-based Generation
Assume that the dialogue system needs to tell the user about the restaurant
Content Planning
• Select information and ordering
• has(sushitrain, cuisine(bad))
• has(sushitrain, decor(good))
Sentence Planning
• Choose syntactic templates
• Choose lexicon
• bad → awful; cuisine → food quality
• good → excellent; decor → décor
• Generate expressions
• Entity → this restaurant
Realisation
• Choose the correct verb form: HAVE → has
• No article needed for feature names
[Syntactic frame: HAVE(Subj: ENTITY, Obj: FEATURE + MODIFIER)]
“This restaurant has awful food quality but excellent décor”

5
NLG Tasks
Summary
Goal‐oriented Conversational Agent:
• Ontology + hand-written rules for slot fillers
• Machine learning classifiers to fill slots
Chatbots:
• Simple rule-based systems
• IR‐based: mine datasets of conversations.
• Neural net models with more data
The future…
• Need to acquire that data
• Integrate goal‐based and chatbot-based systems

5
NLG Tasks
Summarisation: two strategies
Extractive Summarisation
• Select parts (typically sentences) of the original text to form a summary.
Abstractive Summarisation
• Generate new text using natural language generation techniques.

5
NLG Tasks
Summarisation: two strategies
[Figure: types of summaries — single document vs. multiple documents; extracts vs. abstracts; indicative vs. informative (background vs. just the news); length from a very brief headline (10%) through brief (50%) to long (100%).]

5
NLG Tasks
Summarisation: two strategies
[Figure: extracts vs. abstracts.]

5
NLG Tasks
Other NLG Tasks: Visual StoryTelling (Kim et al., NAACL 2018)

0
LECTURE PLAN
Lecture 8: Language Model and Natural Language Generation
1. Language Model
2. Traditional Language Model
3. Neural Language Model
4. Natural Language Generation
5. NLG Tasks
6. Language Model and NLG Evaluation

6
Language Model and NLG Evaluation
How to evaluate the Language Model?
The standard evaluation metric for language models is perplexity.
So, lower perplexity is better!

6
Language Model and NLG Evaluation
How to evaluate the Language Model?
The standard evaluation metric for language models is perplexity:
the inverse probability of the corpus according to the language model, normalised by the number of words.
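A minimal sketch of the computation: perplexity is the inverse probability of the corpus normalised by the number of words, i.e. the exponential of the average negative log-probability per token (the probability values below are made up):

```python
import math

def perplexity(token_probs):
    """PP = exp( -(1/N) * sum_i log P(w_i | w_1..w_{i-1}) )."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

print(perplexity([0.2, 0.1, 0.25, 0.2]))  # lower perplexity = better language model
```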

6
Language Model and NLG Evaluation
How to evaluate the Language Model?
Language model performance evaluated by Facebook Research
Can we use perplexity for NLG evaluation?
No. It captures how powerful your LM is, but it doesn't tell you anything about generation.
[Table 2: comparison of n-gram and neural models on the 1B Word benchmark in perplexity (lower is better). Note that Jozefowicz et al. use 32 GPUs for training; Facebook's model uses only 1 GPU.]
https://research.fb.com/building-an-efficient-neural-language-model-over-a-billion-words/

6
Language Model and NLG Evaluation
How to evaluate the Natural Language Generation?
Unfortunately, no automatic metric adequately captures overall quality.
There are some metrics to capture particular aspects of generated text:
• Fluency (compute probability – well-trained Language model)
• Correct style (Language Model trained on target corpus)
• Diversity (rare word usage, uniqueness of n-grams)
• Relevance to input (semantic similarity measures)
• Simple things like length and repetition
• Task-specific metrics e.g. compression rate for summarization
• Though these don’t measure overall quality, they can help us track some
important qualities that we care about.

6
Language Model and NLG Evaluation
How to evaluate the Natural Language Generation?
Human Evaluation
• Human judgments are regarded as the gold standard
• Of course, we know that human eval is slow and expensive
• Supposing you do have access to human evaluation: Does human
evaluation solve all of your problems?
Humans …
• are inconsistent
• can be illogical
• lose concentration
• misinterpret your question
• can’t always explain why they feel the way they do

6
Language Model and NLG Evaluation
Natural Language Generation: Long way to go

/
Reference
Reference for this lecture
• Deng, L., & Liu, Y. (Eds.). (2018). Deep Learning in Natural Language Processing. Springer.
• Rao, D., & McMahan, B. (2019). Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning. O'Reilly Media, Inc.
• Manning, C. D., Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. MIT press.
• Manning, C 2018, Natural Language Processing with Deep Learning, lecture notes, Stanford University
• Li, J., Galley, M., Brockett, C., Gao, J., & Dolan, B. (2015). A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.
• Jiang, S., & de Rijke, M. (2018). Why are Sequence-to-Sequence Models So Dull? Understanding the Low- Diversity Problem of Chatbots. arXiv preprint arXiv:1809.01941.
• Liu, C. W., Lowe, R., Serban, I. V., Noseworthy, M., Charlin, L., & Pineau, J. (2016). How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.