COMP5046
Natural Language Processing
Lecture 10: Attention and Question Answering (Reading Comprehension)
Dr. Caren Han
Semester 1, 2021
School of Computer Science, University of Sydney
LECTURE PLAN
Lecture 10: Attention and Question Answering (Reading Comprehension)
1. Question Answering
2. Knowledge-based Question Answering
3. IR-based Question Answering (Reading Comprehension)
4. Attention
5. Reading Comprehension with Attention
6. Visual Question Answering
Question Answering
Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language.
Different types of questions:
• General questions, with Yes/No answers (e.g. Are you a student?)
• Wh- questions, starting with who, what, where, when, why, how, how many
(e.g. When did you get to this lecture? What is the weather like in London?)
Question Answering
Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language.
Different types of questions:
Choice Questions, where you have some options inside the question
Factoid questions, where the complete answer can be found inside a text. The answer to such questions consists of one or several consecutive words.
Question Answering
Three Questions for building a QA System
• What do the answers look like?
• Where can I get the answers from?
• What does my training data look like?
Question Answering: Research Areas in Question Answering
Knowledge-based QA (Semantic Parsing)
• Answer is a logical form, possibly executed against a Knowledge Base
• Context is a Knowledge Base

Information Retrieval-based QA (Answer sentence selection, Reading Comprehension)
• Answer is a document, paragraph, or sentence
• Context is a corpus of documents or a specific document

Visual QA
• Answer is simple and factual
• Context is one or multiple image(s)

Library Reference
• Answer is another question
• Context is the structured knowledge available in the library and the librarian's view of it
Knowledge-based Question Answering: Semantic Parsing
Answering a natural language question by mapping it to a query over a structured database (formal representation of its meaning).
Question → Logical Form → KB Query → Knowledge base → Answer
Example: "When was Justin Bieber born?" → birth-year(Justin Bieber, x) → "Justin Bieber was born in 1994"
Knowledge-based Question Answering: Semantic Parsing
Answering a natural language question by mapping it to a query over a structured database (formal representation of its meaning).
Question → Logical Form → KB Query → Knowledge base → Answer
Mapping from a text string to a logical form:
• How to map? Map either to some version of predicate calculus or to a query language like SQL or SPARQL (e.g. https://query.wikidata.org/); see the sketch after the examples below
• Use expensive supervised data? Manual annotation requires experts….
Examples (Question → Logical Form):
• When was Justin Bieber born? → birth-year(Justin Bieber, x)
• What is the largest state? → argmax(λx.state(x), λx.size(x))
Answer: "Justin Bieber was born in 1994"
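To make the "KB Query" step concrete, here is a minimal sketch (not from the lecture; the property wdt:P569 for "date of birth" and the label-based entity lookup are assumptions for illustration) that answers the Justin Bieber example by sending a SPARQL query to the public Wikidata endpoint at https://query.wikidata.org/.

```python
# Minimal sketch: a semantic-parsing output executed as a SPARQL query against Wikidata.
# Assumes wdt:P569 ("date of birth") and an English-label lookup for the entity.
import requests

SPARQL = """
SELECT ?dob WHERE {
  ?person rdfs:label "Justin Bieber"@en ;
          wdt:P569 ?dob .
} LIMIT 1
"""

def run_query(query: str) -> dict:
    # The public endpoint returns JSON when asked for it explicitly.
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "comp5046-qa-demo/0.1"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    bindings = run_query(SPARQL)["results"]["bindings"]
    # Template-based answer generation from the KB result.
    if bindings:
        year = bindings[0]["dob"]["value"][:4]
        print(f"Justin Bieber was born in {year}")
    else:
        print("No answer found")
```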
Knowledge-based Question Answering: Seq2Seq Model for Semantic Parsing
How to transfer the text to the logical form?
A basic deep learning approach to semantic parsing
1. Encode the input sentence (e.g. "Age of Justin Bieber") with a sequence model [Encoder]
2. Decode the logical form (output tokens such as λx.age(x), person(x), "Justin Bieber") with a standard seq2seq model, trained with teacher forcing [Decoder]; a sketch follows below
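As a rough illustration of step 2, here is a minimal PyTorch sketch (not the lecture's code; vocabulary sizes, dimensions, and token ids are made up) of a seq2seq semantic parser trained with teacher forcing: the decoder is fed the gold logical-form tokens and learns to predict the next one.

```python
import torch
import torch.nn as nn

class Seq2SeqParser(nn.Module):
    """Encode a question, decode a logical form token by token."""
    def __init__(self, src_vocab, tgt_vocab, emb=64, hid=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hid, batch_first=True)
        self.decoder = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.src_emb(src_ids))        # encode the question
        # Teacher forcing: feed the gold logical-form tokens (shifted right) to the decoder.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids[:, :-1]), state)
        return self.out(dec_out)                               # logits for tokens 1..T

model = Seq2SeqParser(src_vocab=1000, tgt_vocab=200)
src = torch.randint(0, 1000, (2, 6))    # toy ids, e.g. "Age of Justin Bieber"
tgt = torch.randint(0, 200, (2, 8))     # toy ids, e.g. "<s> λx.age(x) ... </s>"
logits = model(src, tgt)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 200), tgt[:, 1:].reshape(-1))
loss.backward()
```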
Knowledge-based Question Answering: Semantic Parsing
Answering a natural language question by mapping it to a query over a structured database (formal representation of its meaning).
Question → Logical Form → KB Query → Knowledge base → Answer
Answer questions that ask about one of the missing arguments in a triple:
"When was Justin Bieber born?" → birth-year(Justin Bieber, x)

Subject | Predicate (relation) | Object
Justin Bieber | birth-year | 1994
Frédéric Chopin | birth-year | 1810
……… | |

Knowledge bases: DBPedia, Freebase

How to produce the answer?
• Seq2seq
• Template-based generation
Answer: "Justin Bieber was born in 1994"
Knowledge-based Question Answering: Pros and Cons of Knowledge-based QA
Pros:
• A logical form instead of a (direct) answer makes the system robust
• The answer is independent of the question and the parsing mechanism
Cons:
• Constrained to questions that can be queried against the database schema
• Well-structured training datasets are difficult to find
Question Answering
Research Areas in Question Answering
Knowledge-based QA (Semantic Parsing)
• Answer is a logical form, possibly executed against a Knowledge Base
• Context is a Knowledge Base

Information Retrieval-based QA (Answer sentence selection, Reading Comprehension)
• Answer is a document, paragraph, or sentence
• Context is a corpus of documents or a specific document

Visual QA
• Answer is simple and factual
• Context is one or multiple image(s)

Library Reference
• Answer is another question
• Context is the structured knowledge available in the library and the librarian's view of it
Information Retrieval-based Question Answering
Answering a user's question by finding short text segments, sentences, or documents on the web or in a collection of documents.
[Pipeline: Question → Q&A Service → IR System (e.g. Wiki/Google search) → Get Passage → Extract answer from passage (Machine Reading Comprehension) → Answer]
Reading Comprehension
Information Retrieval-based Question Answering
Answering a user's question by finding short text segments, sentences, or documents on the web or in a collection of documents.
• Reading Comprehension and Answer Sentence Selection:
– Finding an answer in a paragraph or a document
– Picking a suitable sentence from a corpus that can be used to answer a question
Reading Comprehension
To answer these questions, you need to first gather information by collecting answer-related sentences from the article.
Can we teach this to machine?
Yes, we can!
Machine Comprehension of Text
(Burges 2013)
Reading Comprehension
To answer these questions, you need to first gather information by collecting answer-related sentences from the article.
Can we teach this to machine?
A machine comprehends a passage of text if, for any question regarding that text that can be answered correctly by a majority of native speakers, the machine can provide a string which those speakers would agree both answers that question and does not contain information irrelevant to that question.
Reading Comprehension
To answer these questions, you need to first gather information by collecting answer-related sentences from the article.
Why do we need to teach this?
The ability to comprehend text will lead to better search and help solve many NLP problems!
Reading Comprehension: Corpora for Reading Comprehension
Dataset | Answer Type | Domain
MCTest (Richardson et al. 2013) | Multiple choice | Children's stories
CNN/Daily Mail (Hermann et al. 2015) | Spans | News
Children's Book Test (Hill et al. 2016) | Multiple choice | Children's stories
SQuAD (Rajpurkar et al., 2016) | Spans | Wikipedia
MS MARCO (Nguyen et al., 2016) | Free-form text, Unanswerable | Web search
NewsQA (Trischler et al., 2017) | Spans | News
SearchQA (Dunn et al., 2017) | Spans | Jeopardy
TriviaQA (Joshi et al., 2017) | Spans | Trivia
RACE (Lai et al., 2017) | Multiple choice | Mid/High school exams
NarrativeQA (Kocisky et al., 2018) | Free-form text | Movie scripts, literature
SQuAD 2.0 (Rajpurkar et al., 2018) | Spans, Unanswerable | Wikipedia
Reading Comprehension
TriviaQA: A Large Scale Dataset for Reading Comprehension
http://nlp.cs.washington.edu/triviaqa/sample.html
Our USydNLP group achieved No. 1 on the TriviaQA leaderboard (Web setting)!
Reading Comprehension
SQuAD: Stanford Question Answering Dataset
https://rajpurkar.github.io/SQuAD-explorer/
Reading Comprehension
A Generic Neural Model for Reading Comprehension
Step1: For both documents and questions, convert words to word vectors
Document (D): "A partly submerged glacier cave on Perito Moreno Glacier. The ice facade is approximately 60 m high. Ice formations in the Titlis glacier cave. A glacier cave is a cave formed within the ice of a glacier. Glacier caves are often called ice caves, but the latter term is properly used to describe bedrock caves that contain year-round ice."
Question (Q): "How are glacier caves formed?"
Reading Comprehension
A Generic Neural Model for Reading Comprehension
Step2: Encode context (documents) and question with sequence models
[Figure: sequence models encode the Document (D) and the Question (Q); same example document and question as above]
Reading Comprehension
A Generic Neural Model for Reading Comprehension
Step 3: Combine the context (document) and the question with attention
[Figure: an attention layer connects the document and question encodings; same example as above]
Reading Comprehension
A Generic Neural Model for Reading Comprehension
Step 4: Select the answer from the attention map, using a classifier or a generative setup
[Figure: the answer is predicted from the attention output over the document; same example as above]
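Putting the four steps together, the following is a minimal PyTorch sketch (an illustration with assumed dimensions and a very simple question summary, not the exact model from the slides): both texts are embedded and encoded with Bi-LSTMs, a bilinear attention compares each document position with the question, and softmax layers over document positions select the start and end of the answer span.

```python
import torch
import torch.nn as nn

class GenericReader(nn.Module):
    """Sketch of the four steps: embed, encode, attend, select the answer span."""
    def __init__(self, vocab_size, emb=100, hid=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)                 # Step 1: word vectors
        self.doc_enc = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
        self.q_enc = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
        self.w_start = nn.Linear(2 * hid, 2 * hid, bias=False)     # bilinear attention weights
        self.w_end = nn.Linear(2 * hid, 2 * hid, bias=False)

    def forward(self, doc_ids, q_ids):
        D, _ = self.doc_enc(self.embed(doc_ids))   # Step 2: (B, Td, 2*hid) document states
        Q, _ = self.q_enc(self.embed(q_ids))       #         (B, Tq, 2*hid) question states
        q = Q.mean(dim=1)                          # simple question summary vector (B, 2*hid)
        # Step 3: bilinear attention between each document position and the question summary
        start_logits = torch.bmm(self.w_start(D), q.unsqueeze(2)).squeeze(2)   # (B, Td)
        end_logits = torch.bmm(self.w_end(D), q.unsqueeze(2)).squeeze(2)
        return start_logits, end_logits            # Step 4: softmax/argmax picks the span

model = GenericReader(vocab_size=5000)
doc = torch.randint(0, 5000, (1, 60))   # toy ids for "A partly submerged glacier cave ..."
q = torch.randint(0, 5000, (1, 6))      # toy ids for "How are glacier caves formed ?"
s_logits, e_logits = model(doc, q)
start, end = s_logits.argmax(-1).item(), e_logits.argmax(-1).item()
print(f"predicted answer span: tokens {start}..{end}")
```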
Reading Comprehension
A Generic Neural Model for Reading Comprehension
What is attention? Why do we need it?
[Figure: the same generic model, with the attention component highlighted]
Attention
Seq2Seq Model: Recap
[Figure: a seq2seq encoder–decoder. The encoder (one-hot inputs → embedding layer → recurrent layer) reads "How are you ?"; its final state is the encoding of the source sentence and initialises the decoder (embedding layer → recurrent layer → output layer), which generates "I am fine".]
Attention
Seq2Seq Model: the bottleneck problem
The encoding of the source sentence is a single vector that needs to capture all information about the source sentence; this is the information bottleneck. RNNs also suffer from the vanishing gradient problem over long sequences.
Solution to this bottleneck problem? Attention!
[Figure: the same seq2seq encoder–decoder ("How are you ?" → "I am fine"), with the single vector passed from encoder to decoder highlighted as the bottleneck.]
Attention
Seq2Seq with Attention
What is attention? On each step of the decoder, use a direct connection to the encoder to focus on a particular part of the input sequence.
[Figure: the same seq2seq model ("How are you ?" → "I am fine") with an attention module connecting each decoder step to all encoder hidden states.]
Attention
Seq2Seq with Attention (step by step)
1. Attention scores: take the dot product between the current decoder hidden state and each encoder hidden state ("How", "are", "you", "?").
2. Attention distribution: take a softmax to turn the scores into a probability distribution. On each decoder step, the distribution shows which encoder hidden states we are focusing on.
3. Attention output: calculate a weighted sum of the encoder hidden states with the attention distribution. The attention output mostly contains information from the hidden states that received high attention.
4. Concatenate the attention output with the decoder hidden state and use it to predict the next word; repeat on every decoder step to generate "I", "am", "fine".
[Figure: the same mechanism applied to a machine translation example ("I love you !").]
Attention
Seq2Seq with Attention (Equations)
• Encoder hidden states: $h_1, \dots, h_N \in \mathbb{R}^h$
• Decoder hidden state on timestep $t$: $s_t \in \mathbb{R}^h$
1. Attention scores for timestep $t$: $e^t = [s_t^\top h_1, \dots, s_t^\top h_N] \in \mathbb{R}^N$
2. Use softmax to get the attention distribution for timestep $t$: $\alpha^t = \mathrm{softmax}(e^t)$ (this is a probability distribution and sums to 1)
3. Attention output: use $\alpha^t$ to take a weighted sum of the encoder hidden states, $a_t = \sum_{i=1}^{N} \alpha^t_i h_i$
4. Then, concatenate the attention output with the decoder hidden state, $[a_t; s_t]$, and proceed as in the non-attention seq2seq model
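A minimal sketch of one decoder step with dot-product attention, following equations 1 to 4 above (toy tensors, illustrative dimensions):

```python
import torch
import torch.nn.functional as F

# Toy tensors: N encoder hidden states h_i and one decoder hidden state s_t.
N, hid = 4, 8                       # e.g. "How", "are", "you", "?"
h = torch.randn(N, hid)             # encoder hidden states h_1..h_N
s_t = torch.randn(hid)              # decoder hidden state at timestep t

e_t = h @ s_t                       # 1. attention scores e^t_i = s_t^T h_i, shape (N,)
alpha_t = F.softmax(e_t, dim=0)     # 2. attention distribution (sums to 1)
a_t = alpha_t @ h                   # 3. attention output: weighted sum of encoder states
combined = torch.cat([a_t, s_t])    # 4. concatenate with the decoder state and predict as usual
print(alpha_t, combined.shape)
```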
Attention
Why do we use attention? The benefits
Improves performance
• Allows the decoder to focus on certain parts of the source
Solves the bottleneck problem
• Allows the decoder to directly look at the source (input)
Reduces the vanishing gradient problem
• Provides a shortcut to faraway states
Provides some interpretability
• Inspect the attention distribution to see what the decoder was focusing on
Attention
Attention is now a general component in Deep Learning NLP
Attention is a great way to improve the sequence-to-sequence model.
You can use attention in many architectures (not just seq2seq) and for many NLP tasks (not just dialogue systems/NLG and translation).
More general definition of attention:
Given a set of vector values, and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query.
For example, in the seq2seq + attention model, each decoder hidden state (query) attends to all the encoder hidden states (values).
Attention
Attention variants
There are several ways to compute the attention score, given the encoder hidden states $h_i$ and the decoder hidden state $s_t$ (on timestep t).

Attention Name | Attention score function | Reference
Content-based | $\mathrm{score}(s_t, h_i) = \cos(s_t, h_i)$ | Graves 2014
Dot-product | $\mathrm{score}(s_t, h_i) = s_t^\top h_i$ | Luong 2015
Scaled dot-product | $\mathrm{score}(s_t, h_i) = \dfrac{s_t^\top h_i}{\sqrt{n}}$ (*NOTE: very similar to dot-product attention except for a scaling factor, where n is the dimension of the source hidden state) | Vaswani 2017
Additive | $\mathrm{score}(s_t, h_i) = v_a^\top \tanh(W_a [s_t; h_i])$ | Vaswani 2017
General | $\mathrm{score}(s_t, h_i) = s_t^\top W_a h_i$ (*NOTE: where $W_a$ is a trainable weight matrix in the attention layer) | Luong 2015
Location-based | $\alpha_{t,i} = \mathrm{softmax}(W_a s_t)$ (*NOTE: this simplifies the softmax alignment to only depend on the target position) | Luong 2015

*The papers (Luong 2015 and Vaswani 2017) can be found on the Canvas content page
Graves 2014 (https://arxiv.org/pdf/1410.5401.pdf)
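The score functions in the table can be written down directly. The sketch below (illustrative shapes and randomly initialised weights, not from the lecture) computes each variant for a single decoder state s_t against all encoder states h_i.

```python
import math
import torch
import torch.nn.functional as F

hid = 8
h = torch.randn(5, hid)              # encoder hidden states h_i
s_t = torch.randn(hid)               # decoder hidden state s_t
W_a = torch.randn(hid, hid)          # trainable weight matrix (general / location-based)
W_add = torch.randn(hid, 2 * hid)    # trainable weights (additive attention)
v_a = torch.randn(hid)

content_based = F.cosine_similarity(h, s_t.expand_as(h), dim=1)     # cosine[s_t, h_i]
dot_product = h @ s_t                                                # s_t^T h_i
scaled_dot = (h @ s_t) / math.sqrt(hid)                              # s_t^T h_i / sqrt(n)
general = h @ (W_a @ s_t)                                            # s_t^T W_a h_i
additive = torch.tanh(torch.cat([s_t.expand_as(h), h], dim=1) @ W_add.T) @ v_a  # v_a^T tanh(W_a[s_t; h_i])
# Location-based: depends only on the target position; in a real model W_a would
# map s_t to a vector of source-position weights (shapes here are only illustrative).
location_based = F.softmax(W_a @ s_t, dim=0)
```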
Attention
Categories of Attention Mechanism
A summary of broader categories of attention mechanisms
Name | Definition | Citation
Global or Local | Global: attending to the entire input state space. Local: attending to part of the input state space (e.g. a patch of the input image). | Luong 2015
Self-Attention | Relating different positions of the same input sequence. Theoretically, self-attention can adopt any attention score function, just replacing the target sequence with the same input sequence. | Cheng 2016

*The papers (Luong 2015 and Cheng 2016) can be found on the Canvas content page
Attention
Categories of Attention Mechanism (1)
Global/Local Attention
• Global: Attending to the entire input state space.
• Local: Attending to part of the input state space (e.g. a patch of the input image).
Attention
Categories of Attention Mechanism (2)
Self-Attention
The Long Short-Term Memory-Networks for Machine Reading paper (Cheng et al., 2016) used self-attention to do machine reading. In the example from the paper, the self-attention mechanism lets the model learn the correlation between the current word and the previous part of the sentence: the current word is shown in red and the size of the blue shade indicates the activation level.
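A small sketch (dot-product scoring, illustrative shapes; a simplification of Cheng et al.'s LSTM-based formulation) of the core idea: every position of the sequence attends to every position of the same sequence.

```python
import torch
import torch.nn.functional as F

# One sentence of T tokens with d-dimensional representations.
T, d = 6, 8
x = torch.randn(T, d)

scores = x @ x.T / d ** 0.5          # every position scored against every other position
weights = F.softmax(scores, dim=-1)  # row i: how much token i attends to each token
context = weights @ x                # new representation of each token, mixed from the whole sentence
print(weights.shape, context.shape)  # (T, T), (T, d)
```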
Reading Comprehension with Attention: A Generic Neural Model for Reading Comprehension
[Figure: the generic model again: sequence models encode the Document (D) and the Question (Q), an attention layer combines them, and the Answer is predicted; same glacier-cave example as above.]
Reading Comprehension with Attention
Bi-LSTM for Reading Comprehension with Attention
[Figure: the question "How are glacier caves formed?" is encoded with a Bi-LSTM into a single summary vector. The document "A partly submerged glacier cave on Perito Moreno Glacier. …" is encoded token by token with a Bi-LSTM into states D1, D2, D3, …. Attention between each document state and the question summary feeds position-wise softmax layers over the document: one predicts the start token of the answer span and another predicts the end token.]
Reading Comprehension with Attention
Bi-Directional Attention Flow (Bi-DAF)
Bi-Directional Attention Flow for Machine Comprehension (Seo et al. 2017)
Reading Comprehension with Attention: Bi-Directional Attention Flow (Bi-DAF)
Attention Flow layer is the core idea!
• Variants and improvements to the Bi-DAF architecture over the years
Attention should flow both ways:
1) the context→the question (C2Q)
2) the question→the context (Q2C)
Both attentions are derived from a shared similarity matrix between the context (H) and the query (U), where $S_{tj}$ indicates the similarity between the t-th context word and the j-th query word.
Reading Comprehension with Attention: Bi-Directional Attention Flow (Bi-DAF)
Attention Flow layer is the core idea!
• Variants and improvements to the Bi-DAF architecture over the years
Attention should flow both ways:
1) the context→the question (C2Q)
2) the question→the context (Q2C)
1. Context-to-Question (C2Q) attention:
• which query words are most relevant to each context word
Reading Comprehension with Attention: Bi-Directional Attention Flow (Bi-DAF)
Attention Flow layer is the core idea!
• Variants and improvements to the Bi-DAF architecture over the years
Attention should flow both ways:
1) the context→the question (C2Q)
2) the question→the context (Q2C)
2. Question-to-Context (Q2C) attention:
• the weighted sum of the most important words in the context with respect to the query (a slight asymmetry is introduced through a max over query words); see the sketch below
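Putting the two directions together, here is a minimal sketch (illustrative shapes, and a plain dot-product similarity instead of BiDAF's trainable trilinear one) of computing C2Q and Q2C attention from a shared similarity matrix S and forming the query-aware context representation.

```python
import torch
import torch.nn.functional as F

T, J, d = 10, 4, 8                 # context length, query length, hidden size
H = torch.randn(T, d)              # context encodings H_t
U = torch.randn(J, d)              # query encodings U_j

S = H @ U.T                        # shared similarity matrix, S[t, j] ~ similarity(H_t, U_j)
                                   # (BiDAF itself uses a trainable trilinear similarity)

# C2Q: for each context word, which query words are most relevant?
a = F.softmax(S, dim=1)            # (T, J), each row sums to 1
U_tilde = a @ U                    # (T, d) attended query vector per context word

# Q2C: which context words matter most for the query? (max over query words, then softmax)
b = F.softmax(S.max(dim=1).values, dim=0)    # (T,)
h_tilde = (b.unsqueeze(1) * H).sum(dim=0)    # (d,) single attended context vector
h_tilde = h_tilde.expand(T, d)               # tiled across context positions, as in BiDAF

G = torch.cat([H, U_tilde, H * U_tilde, H * h_tilde], dim=1)   # query-aware context representation
print(G.shape)                                                 # (T, 4*d)
```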
Reading Comprehension with Attention: Bi-Directional Attention Flow (Bi-DAF)
A "modelling" layer:
• Another deep (2-layer) Bi-LSTM over the passage
Answer span selection is more complex:
• Start: pass the output of the BiDAF layer and the modelling layer, concatenated, to a dense feed-forward layer and then a softmax
• End: put the output of the modelling layer M through another Bi-LSTM to give M2, then concatenate with the BiDAF layer and again pass through a dense feed-forward layer and a softmax
Reading Comprehension with Attention
Dynamic Coattention Networks for Question Answering (Xiong 2017)
Coattention provides a two-way attention between the context and the question.
[Figure: a co-attention encoder combines the Document (D) and the Question (Q), and a dynamic pointer decoder predicts the start token and end token of the answer span; same glacier-cave example as above.]
Reading Comprehension with Attention
Dynamic Coattention Networks for Question Answering (Xiong 2017)
• Coattention layer again provides a two-way attention between the context and the question
• Coattention involves a second-level attention computation: – attending over representations that are themselves attention outputs
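A minimal sketch of the coattention computation (illustrative shapes, following the general recipe of Xiong et al. 2017 rather than their exact implementation); note how the document-side context C_D attends over [Q; C_Q], which already contains attention outputs.

```python
import torch
import torch.nn.functional as F

m, n, d = 12, 5, 8                  # document length, question length, hidden size
D = torch.randn(m, d)               # document encodings
Q = torch.randn(n, d)               # question encodings

L = D @ Q.T                         # affinity matrix, L[i, j] ~ similarity(D_i, Q_j)

A_Q = F.softmax(L, dim=0)           # attention over document words, for each question word
C_Q = A_Q.T @ D                     # (n, d) first-level attention: document summary per question word

A_D = F.softmax(L, dim=1)           # attention over question words, for each document word
# Second-level attention: attend over [Q; C_Q], which itself contains attention outputs.
C_D = A_D @ torch.cat([Q, C_Q], dim=1)      # (m, 2*d) coattention context per document word

fused = torch.cat([D, C_D], dim=1)  # (m, 3*d), fed to a further encoder / dynamic pointer decoder
print(fused.shape)
```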
More…?
More advanced architectures? A preview of the following weeks.
• The Transformer, based solely on attention mechanisms (Vaswani et al., 2017)
• Contextual word representations (Peters et al., 2018; Devlin et al., 2018)
Question Answering
Research Areas in Question Answering
Knowledge-based QA (Semantic Parsing)
• Answer is a logical form, possibly executed against a Knowledge Base
• Context is a Knowledge Base

Information Retrieval-based QA (Answer sentence selection, Reading Comprehension)
• Answer is a document, paragraph, or sentence
• Context is a corpus of documents or a specific document

Visual QA
• Answer is simple and factual
• Context is one or multiple image(s)

Library Reference
• Answer is another question
• Context is the structured knowledge available in the library and the librarian's view of it
Visual Question Answering: Textual Question Answering (Recap)
Answer questions by exploiting pure natural language.
Document/Passage: "Caren watched TV last night. There was a guy playing tennis. Caren did not know who he is. He was wearing white shirts …"
Question: "What was he doing?"
Answer: "Playing tennis"
Visual Question Answering
Visual QA
Several questions require context outside of pure language; here the context is an image rather than a passage.
Question: "What was he doing?"
Answer: "Playing tennis"
Visual Question Answering: Visual QA Datasets
Recently, a number of visual QA datasets have sprung up (some of the more popular ones were shown on the slide).
Visual Question Answering: How does it work?
1. Encode the question sentence with sequence models
2. Encode the context, a single picture, with a convolutional neural network (CNN)
3. Predict a single-word answer
A sketch of these three steps follows below.
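A minimal sketch of these three steps (a toy CNN and made-up dimensions, not a real VQA system): an LSTM encodes the question, a small CNN encodes the image, and the fused features feed a classifier over a single-word answer vocabulary.

```python
import torch
import torch.nn as nn

class TinyVQA(nn.Module):
    def __init__(self, vocab_size=1000, n_answers=500, emb=64, hid=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.q_enc = nn.LSTM(emb, hid, batch_first=True)         # 1. encode the question
        self.cnn = nn.Sequential(                                 # 2. encode the image (toy CNN)
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, hid),
        )
        self.classifier = nn.Linear(hid, n_answers)               # 3. single-word answer

    def forward(self, image, question_ids):
        _, (h, _) = self.q_enc(self.embed(question_ids))
        q_feat = h[-1]                       # (B, hid) final question state
        img_feat = self.cnn(image)           # (B, hid) image features
        fused = q_feat * img_feat            # simple element-wise fusion
        return self.classifier(fused)        # logits over the answer vocabulary

model = TinyVQA()
image = torch.randn(1, 3, 64, 64)
question = torch.randint(0, 1000, (1, 7))    # toy ids, e.g. "What are sitting in the basket ?"
answer_logits = model(image, question)
print(answer_logits.argmax(-1))              # index of the predicted single-word answer
```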
Visual Question Answering: How does it work?
The idea of Visual QA is essentially the same as reading-comprehension-style QA. Why not use attention here as well? (Yang et al. 2015)
Question: What are sitting in the basket on a bicycle?
Yang et al. (2015): Stacked Attention Networks for Image Question Answering
Answer: dogs
Visual Question Answering
Visual QA with Attention
Let's try some examples: VisualQA (http://vqa.cloudcv.org/)
Visual Question Answering
Visual QA with Attention (Usyd NLP Group 2020)
Visual Question Answering
Visual QA with Attention (Usyd NLP Group 2020)
Let’s have a look at the demo!!
Visual Question Answering
Visual QA with Attention (Usyd NLP Group 2020)
Testing Result
Additional QA
There is no reason to limit ourselves to just IR-based or knowledge-based QA.
Using multiple information sources? IBM Watson!
The big picture of NLP
The purpose of Natural Language Processing: Overview
[Figure: the NLP stack, from low-level processing to applications.
• Tokenisation: "How is the weather today" → [How] [is] [the] [weather] [today]
• Stemming: Drinking, Drank, Drunk → Drink
• PoS Tagging: "She sells seashells" → [she/PRP] [sells/VBZ] [seashells/NNS]
• Parsing: "Claudia sat on a stool"
• Entity Extraction: "When Sebastian Thrun …"
• Applications: Understanding, Searching, Dialog, Translation, Search, Sentiment Analysis, Topic Classification, Topic Modelling, ….]
Reference
Reference for this lecture
• Deng, L., & Liu, Y. (Eds.). (2018). Deep Learning in Natural Language Processing. Springer.
• Rao, D., & McMahan, B. (2019). Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning. O'Reilly Media, Inc.
• Manning, C. D., Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. MIT press.
• Manning, C 2018, Natural Language Processing with Deep Learning, lecture notes, Stanford University
• Berant, J., Chou, A., Frostig, R., & Liang, P. (2013). Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1533- 1544).
• Chen, D., Bolton, J., & Manning, C. D. (2016). A thorough examination of the cnn/daily mail reading comprehension task. arXiv preprint arXiv:1606.02858.
• Chen, D., Fisch, A., Weston, J., & Bordes, A. (2017). Reading wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051.
• Seo, M., Kembhavi, A., Farhadi, A., & Hajishirzi, H. (2016). Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.
• Yang, Z., He, X., Gao, J., Deng, L., & Smola, A. (2016). Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 21-29).
• Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
• Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.