
Computational
Linguistics
CSC 485 Summer 2020
11. Question Answering and Textual Inference
Gerald Penn
Department of Computer Science, University of Toronto
(slides borrowed from Nate Chambers, Roxana Girju, Sanda Harabagiu, Chris Manning and Frank Rudzicz)
Copyright © 2017 Gerald Penn. All rights reserved.

Modern QA from text
The common person’s view? [From a novel]
“I like the Internet. Really, I do. Any time I need a piece of shareware or I want to find out the weather in Bogota … I’m the first guy to get the modem humming. But as a source of information, it sucks. You got a billion pieces of data, struggling to be heard and seen and downloaded, and anything I want to know seems to get trampled underfoot in the crowd.”
M. Marshall. The Straw Men. HarperCollins Publishers, 2002.
 An idea originating from the IR community
 With massive collections of full-text documents, simply finding relevant documents is of limited use: we want answers from textbases
 QA: give the user a (short) answer to their question, perhaps supported by evidence.

Question answering (QA)
• Question Answering (QA) usually involves a specific answer to a question.
• E.g., Which woman has won more than 1 Nobel prize? (Marie Curie)

Document retrieval (IR) vs. QA
• One strategy is to turn question answering into information retrieval (IR) and let the human complete the task.

Question answering (QA)

Knowledge-based QA
1. Build a structured semantic representation of the query.
• Extract times, dates, locations, entities using regular expressions.
• Fit to well-known templates.
2. Query databases with these semantics (a small sketch of step 1 appears after this list).
• Ontologies (Wikipedia infoboxes).
• Restaurant review databases.
• Calendars.
• Movie schedules.
• …
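As a rough illustration of step 1, here is a minimal sketch of template fitting with regular expressions; the templates and semantic-frame fields are hypothetical, for illustration only.

```python
# A minimal sketch of knowledge-based question parsing: fit the question
# to regex templates that yield a semantic frame (templates and frame
# fields below are hypothetical, for illustration only).
import re

TEMPLATES = [
    (re.compile(r"when was (?P<ent>.+) born\??", re.I),
     lambda m: {"relation": "birth-date", "entity": m.group("ent")}),
    (re.compile(r"where is (?P<ent>.+?)\??$", re.I),
     lambda m: {"relation": "location", "entity": m.group("ent")}),
]

def parse_question(question):
    """Return the semantic frame of the first matching template."""
    for pattern, build in TEMPLATES:
        m = pattern.fullmatch(question.strip())
        if m:
            return build(m)   # step 2 would query a database with this
    return None               # no template fits: fall back to IR-based QA

print(parse_question("When was Mozart born?"))
# {'relation': 'birth-date', 'entity': 'Mozart'}
```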

IR-based QA

IR-based QA
[Figure contrasting information retrieval with question answering]

Sample TREC questions
1. Who is the author of the book, “The Iron Lady: A Biography of Margaret Thatcher”?
2. What was the monetary value of the Nobel Peace Prize in 1989?
3. What does the Peugeot company manufacture?
4. How much did Mercury spend on advertising in 1993?
5. What is the name of the managing director of Apricot Computer?
6. Why did David Koresh ask the FBI for a word processor?
7. What debts did Qintex group leave?
8. What is the name of the rare neurological disease with symptoms such as: involuntary movements (tics), swearing, and incoherent vocalizations (grunts, shouts, etc.)?

Query types
• Different kinds of questions can be asked.
• Factoid questions, e.g.,
• How often were the peace talks in Ireland delayed or disrupted as a result of acts of violence?
• Narrative (open-ended) questions, e.g.,
• Can you tell me about contemporary interest in the Greek philosophy of stoicism?
• Complex/hybrid questions, e.g.,
• Who was involved in the Schengen agreement to eliminate border controls in Western Europe and what did they hope to accomplish?

People want to ask questions…
Examples from AltaVista query log (late 1990s)
who invented surf music?
how to make stink bombs
where are the snowdens of yesteryear?
which english translation of the bible is used in official catholic liturgies?
how to do clayart
how to copy psx
how tall is the sears tower?
Examples from Excite query log (12/1999)
how can i find someone in texas
where can i find information on puritan religion?
what are the 7 wonders of the world
how can i eliminate stress
What vacuum cleaner does Consumers Guide recommend
Questions made up around 10% of early query logs

 Question answering is not new
 Question answering systems can be found in many areas of NLP
 Natural language database systems
 A lot of early NLP work on these: e.g., LUNAR system
 There’s still Microsoft English Query
 Spoken dialog systems
 Currently very active and commercially relevant

 Focusing on open-domain QA is a newer development
 MURAX (Kupiec 1993): Encyclopedia answers
 Hirschman: Reading comprehension tests
 TREC QA competition: 1999–
 But not really new either: Simmons et al. 1965
 Take an encyclopedia and load it onto a computer.
 Take a question and parse it into a logical form
 Perform simple information retrieval to get relevant texts, parse those into a logical form, match and rank
 What do worms eat? → parse to “Worms eat ???”
 Candidates:
 Worms eat grass (match)
 Grass is eaten by worms (match, once the passive is normalized)
 Birds eat worms (no match: worms fills the wrong argument slot)
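A minimal sketch of this Simmons-style matching step, assuming the texts have already been parsed into subject-verb-object triples; the triples below are hand-written stand-ins for parser output, with passives normalized during parsing.

```python
# Match a question triple with an open slot against candidate triples
# (hand-written stand-ins for the output of a parser).
QUESTION = ("worms", "eat", None)          # What do worms eat?

CANDIDATES = {
    "Worms eat grass": ("worms", "eat", "grass"),
    "Grass is eaten by worms": ("worms", "eat", "grass"),
    "Birds eat worms": ("birds", "eat", "worms"),
}

def matches(question, candidate):
    """Every slot the question specifies must agree; None slots are open."""
    return all(q is None or q == c for q, c in zip(question, candidate))

for sentence, triple in CANDIDATES.items():
    print(sentence, "->", "match" if matches(QUESTION, triple) else "no match")
```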

Online QA Examples
 LCC: http://www.languagecomputer.com/demos/question-answering.html
 AnswerBus is an open-domain question answering system: www.answerbus.com
 EasyAsk, AnswerLogic, AnswerFriend, Start, Quasm, Mulder, Webclopedia, ISI TextMap, etc.
 Google

IBM’s Watson
[Figure: Watson’s game architecture. A game controller mediates between the Jeopardy! game control system (the clue grid, clues, scores, and other game data, plus two human opponents) and Watson’s QA engine: it passes in each clue and category, receives answers and confidences, speaks them via text-to-speech, and makes decisions to buzz and bet through a strategy module.]
Watson’s QA engine ran on 2,880 IBM Power750 compute cores with 15 TB of memory, over content equivalent to roughly 1,000,000 books.
source: A Brief Overview and Thoughts for Healthcare Education and Performance Improvement by the IBM Watson team

IBM’s Watson: search
This man became the 44th President of the United States in 2008

IBM’s Watson: search
• Title-oriented search: In some cases, the solution is in the title of highly-ranked documents.
• E.g.,
This pizza delivery boy celebrated New Year’s at Applied Cryogenics.

IBM’s Watson: selection
• Once candidates have been gathered from various sources and methods, rank them according to various scores (IBM Watson uses >50 scoring metrics).
In cell division, mitosis splits the nucleus & cytokinesis splits this liquid cushioning the nucleus

IBM’s Watson: selection
• One aspect of Jeopardy! is that ‘answers’ are often posed with puns that have to be disambiguated.
Bilbo shouldn’t have played riddles in the dark with this shady character
[Figure: candidate interpretations of “this shady character” drawn from WordNet’s synonym sets]

 Question answering competition at TREC started with answering a set of 500 fact-based questions, e.g., “When was Mozart born?”.
 For the first three years systems were allowed to return 5 ranked answer snippets (50/250 bytes) to each question.
 Mean Reciprocal Rank (MRR) scoring:
 1, 0.5, 0.33, 0.25, 0.2, 0 for 1, 2, 3, 4, 5, 6+ rankings
 Mainly Named Entity answers (person, place, date, …)
 From 2002 the systems were only allowed to return a single exact answer and the notion of confidence was introduced.
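For concreteness, here is a minimal sketch of MRR over a question set under the 5-snippet scheme above.

```python
# Mean Reciprocal Rank: each question contributes 1/rank of its first
# correct answer; ranks beyond the 5 returned snippets score 0.
def mrr(ranks):
    return sum(1.0 / r if r is not None and r <= 5 else 0.0
               for r in ranks) / len(ranks)

# e.g., correct answers at ranks 1 and 3, plus one miss:
print(round(mrr([1, 3, None]), 3))   # (1 + 0.333 + 0) / 3 = 0.444
```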

 Each task features a collection from a domain, e.g., news articles:
 AP newswire, 1998-2000
 New York Times newswire, 1998-2000
 Xinhua News Agency newswire, 1996-2000
 Usually about 1,000,000 documents in the collection. Roughly 3GB of text.
 This was once a lot of text to process entirely using advanced NLP techniques so the systems usually consisted of an initial information retrieval phase followed by more advanced processing.
 Systems were allowed to supplement this text with the web and other knowledge bases.
 See also HotpotQA: https://hotpotqa.github.io

 Best TREC systems score ~60-80% !!!
 Recent task: write a Wikipedia page on topic X.
 Approaches and successes have varied a fair deal
 Knowledge-rich approaches, using a vast array of NLP techniques have been most successful
 Notably Harabagiu, Moldovan et al. – SMU/UTD/LCC
 AskMSR (2001): first wildly successful purely statistical system, stressing how much could be achieved by very simple methods with enough text (and now various copycats)

AskMSR: Shallow approach
 In what year did Abraham Lincoln die?
 Ignore hard documents and find easy ones

AskMSR: Details
[Figure: the numbered steps of the AskMSR pipeline (query rewriting, snippet retrieval, n-gram mining, n-gram filtering, answer tiling), described on the following slides.]

Query rewriting
 Intuition: the answer is often stated in text in a form similar to the question.
 Classify the question into one of seven categories:
 Who is/was/are/were …?
 When is/did/will/are/were …?
 Where is/are/were …?
a. Category-specific transformation rules, e.g. for “Where is the Louvre Museum located?”:
 → “is the Louvre Museum located”
 → “the is Louvre Museum located”
 → “the Louvre is Museum located”
 → “the Louvre Museum is located”
 → “the Louvre Museum located is”
 Nonsense, but who cares? It’s only a few more queries to Google.
b. Expected answer “datatype” (e.g., Date, Person, Location):
 When was the French Revolution? → DATE
c. Hand-crafted classification/rewrite/datatype rules
 Can they be learned? (A small sketch of the rewriting step follows.)
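A minimal sketch of the copula-moving rewrites shown above; it assumes the question’s second word is the verb, which only holds for this question category (hence the category-specific rules).

```python
# Generate all positions of the copula within the question remainder,
# mirroring the "Where is the Louvre Museum located?" example above.
def rewrites(question):
    words = question.rstrip("?").split()
    verb, rest = words[1], words[2:]          # drop the wh-word
    return [" ".join(rest[:i] + [verb] + rest[i:])
            for i in range(len(rest) + 1)]

for r in rewrites("Where is the Louvre Museum located?"):
    print('"%s"' % r)
# "is the Louvre Museum located" ... "the Louvre Museum located is"
```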

Mining N-Grams
 Send query to search engine; use result snippets
 Enumerate all N-grams in all retrieved snippets
 Use hash table and other fancy footwork to make this efficient
 Weight of an n-gram: occurrence count, each weighted by “reliability” (weight) of rewrite that fetched the document
 Example: “Who created the character of Scrooge?”
 Dickens – 117
 Christmas Carol – 78
 Charles Dickens – 75
 Disney – 72
 Carl Banks – 54
 A Christmas – 41
 Christmas Carol – 45
 Uncle – 31
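A minimal sketch of the mining step; the snippets and per-rewrite reliability weights below are hypothetical stand-ins for real search-engine results.

```python
# Count every 1..max_n gram in the snippets, weighting each occurrence
# by the reliability of the rewrite that fetched it.
from collections import Counter

def mine_ngrams(weighted_snippets, max_n=3):
    scores = Counter()
    for snippet, weight in weighted_snippets:
        tokens = snippet.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                scores[" ".join(tokens[i:i + n])] += weight
    return scores

snippets = [("Charles Dickens created the character of Scrooge", 5),
            ("Scrooge McDuck was drawn for Disney comics", 2)]
print(mine_ngrams(snippets).most_common(3))
```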

Filtering N-Grams
 Each question type is associated with one or more “data-type filters” (regular expressions over the expected answer):
 When… → Date
 Where… → Location
 Who… → Person
 What… → (no single type)
 Boost the score of n-grams that do match the regexp
 Lower the score of n-grams that don’t match the regexp
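A minimal sketch of the filtering step, with a deliberately crude DATE regular expression; a real system would use a much richer pattern set.

```python
# Boost n-grams matching the expected answer type's regex, penalize the rest.
import re

FILTERS = {"When": re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")}   # rough year

def filter_ngrams(scores, question_word, boost=2.0, penalty=0.5):
    regex = FILTERS.get(question_word)
    if regex is None:                 # no filter for this question type
        return scores
    return {ng: s * (boost if regex.search(ng) else penalty)
            for ng, s in scores.items()}

print(filter_ngrams({"in 1867": 10.0, "the engine": 10.0}, "When"))
# {'in 1867': 20.0, 'the engine': 5.0}
```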

Step 5: Tiling the Answers
 Example n-gram scores: “Charles Dickens” (20), “Dickens” (15), “Mr Charles” (10)
 Tile the highest-scoring n-gram with each overlapping n-gram: merge them, pool their scores, and discard the old n-grams
 “Mr Charles” + “Charles Dickens” → “Mr Charles Dickens” (score 45)
 Repeat until no more overlap (a sketch follows)
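A minimal sketch of the tiling step: greedily merge the highest-scoring n-gram with any overlapping n-gram, pooling their scores, until nothing overlaps.

```python
def overlap_merge(a, b):
    """Append b to a if some suffix of a equals a prefix of b."""
    aw, bw = a.split(), b.split()
    for k in range(min(len(aw), len(bw)), 0, -1):
        if aw[-k:] == bw[:k]:
            return " ".join(aw + bw[k:])
    return None

def tile(scores):
    items = dict(scores)
    while len(items) > 1:
        best = max(items, key=items.get)
        for other in sorted(set(items) - {best}, key=items.get, reverse=True):
            merged = overlap_merge(best, other) or overlap_merge(other, best)
            if merged:
                s = items.pop(best) + items.pop(other)
                items[merged] = items.get(merged, 0) + s
                break
        else:
            break            # top n-gram overlaps nothing: done
    best = max(items, key=items.get)
    return best, items[best]

print(tile({"Charles Dickens": 20, "Dickens": 15, "Mr Charles": 10}))
# ('Mr Charles Dickens', 45)
```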

Results
 Standard TREC contest test-bed: ~1M documents; 900 questions
 Doesn’t do so well (but in top 9 of ~30 participants!)
 MRR = 0.262
 Right answer ranked about #4–5 on average
 Why? Because it relies on the enormity of the Web!
 Using the Web as a whole, not just TREC’s 1M documents
 MRR = 0.42
 On average, right answer is ranked about #2–3

Limitations
 In many scenarios we only have a small set of documents
 e.g., monitoring an individual’s email…
 Works best (or only) for “Trivial Pursuit”-style fact-based questions
 Limited/brittle repertoire of
 question categories
 answer data types/filters
 query rewriting rules

Full NLP QA: LCC (Harabagiu/Moldovan)
[below is the Architecture of LCC’s QA system circa 2003]
[Architecture diagram, summarized:]
 Factoid questions: question processing (question parse, semantic transformation, recognition of the expected answer type against a WordNet-based answer type hierarchy, keyword extraction) → passage retrieval over a document index of the AQUAINT document collection (single factoid passages) → factoid answer processing (answer extraction with named entity recognition (CICERO LITE), answer justification by a theorem prover over an axiomatic knowledge base, answer reranking) → factoid answer.
 List questions: the same question processing → passage retrieval (multiple list passages) → list answer processing (answer extraction with threshold cutoff) → list answer.
 Definition questions: question processing (question parse, pattern matching, keyword extraction against a pattern repository) → passage retrieval (multiple definition passages) → definition answer processing (answer extraction by pattern matching) → definition answer.

Value from sophisticated NLP Pasca and Harabagiu (2001)
 Good IR is needed: SMART paragraph retrieval
 Large taxonomy of question types and expected answer types is crucial
 Statistical parser used to parse questions and relevant text for answers, and to build KB
 Further value comes from deeper NLP and inferencing

Answer types in State-of-the-art QA systems
[Figure: Question → Answer Type Prediction (against an Answer Type Hierarchy) → IR over the docs → ranked set of passages → Answer Selection (using features and question expansion) → Answer]
 Answer Type Prediction labels questions with an answer type based on a taxonomy:
 Person, location, weight, temperature, year, vehicle, …
 Classifies questions (e.g., by using a maximum entropy model)

Answer Types
 Of course, determining the answer type isn’t that easy…
 Who questions can have organizations as answers
 Who sells the most hybrid cars?
 Which questions can have people as answers
 Which president went to war with Mexico?

Lexical Terms Extraction: Input to Information Retrieval
 Questions approximated by sets of unrelated words (lexical terms)
 Similar to bag-of-words IR models, but choose nominal non-stop words and verbs
Question (from TREC QA track) → Lexical terms
 Q002: What was the monetary value of the Nobel Peace Prize in 1989? → monetary, value, Nobel, Peace, Prize, 1989
 Q003: What does the Peugeot company manufacture? → Peugeot, company, manufacture
 Q004: How much did Mercury spend on advertising in 1993? → Mercury, spend, advertising, 1993

Keyword Selection Algorithm
1. Select all non-stopwords in quotations
2. Select all NNP words in recognized named entities
3. Select all complex nominals with their adjectival modifiers
4. Select all other complex nominals
5. Select all nouns with adjectival modifiers
6. Select all other nouns
7. Select all verbs
8. Select the answer type word
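A minimal sketch of this cascade over pre-annotated tokens; the annotations are hypothetical stand-ins for parser/NER output, and heuristics 3-6 are collapsed into a single nouns rule here.

```python
def select_keywords(tokens, max_keywords=6):
    heuristics = [
        lambda t: t["quoted"] and not t["stop"],                # 1. quoted words
        lambda t: t["pos"] == "NNP" and t["in_ne"],              # 2. NNPs in NEs
        lambda t: t["pos"].startswith("NN") and not t["stop"],  # 3-6. nouns
        lambda t: t["pos"].startswith("VB") and not t["stop"],  # 7. verbs
    ]
    chosen = []
    for rule in heuristics:           # apply heuristics in priority order
        for t in tokens:
            if rule(t) and t["word"] not in chosen:
                chosen.append(t["word"])
                if len(chosen) == max_keywords:
                    return chosen
    return chosen

q = [{"word": w, "pos": p, "quoted": False, "stop": s, "in_ne": ne}
     for w, p, s, ne in [("What", "WP", True, False),
                         ("does", "VBZ", True, False),
                         ("the", "DT", True, False),
                         ("Peugeot", "NNP", False, True),
                         ("company", "NN", False, False),
                         ("manufacture", "VB", False, False)]]
print(select_keywords(q))   # ['Peugeot', 'company', 'manufacture']
```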

Passage Extraction Loop
 Passage Extraction Component
 Extracts passages that contain all selected keywords
 Passage size dynamic
 Start position dynamic
 Passage quality and keyword adjustment
 In the first iteration use the first 6 keyword selection heuristics
 If the number of passages is lower than a threshold ⇒ query is too strict ⇒ drop a keyword
 If the number of passages is higher than a threshold ⇒ query is too relaxed ⇒ add a keyword (a sketch of this loop follows)
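A minimal sketch of the feedback loop; `retrieve` is a hypothetical stand-in for the Boolean passage-retrieval engine, and keywords are assumed ordered by the selection heuristics.

```python
def extract_passages(keywords, retrieve, lo=10, hi=500):
    k = len(keywords)                 # start with all selected keywords
    tried = set()
    while True:
        passages = retrieve(keywords[:k])
        if k in tried:                # revisiting a query size: stop
            break
        tried.add(k)
        if len(passages) < lo and k > 1:
            k -= 1                    # too strict: drop the weakest keyword
        elif len(passages) > hi and k < len(keywords):
            k += 1                    # too relaxed: add a keyword back
        else:
            break
    return passages

# e.g., with a toy engine whose hit count halves per added keyword:
toy = lambda kws: ["passage"] * (1000 // (2 ** len(kws)))
print(len(extract_passages(["nobel", "peace", "prize", "1989"], toy)))  # 62
```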

Passage Scoring
 Passage ordering is performed using a sort that involves three scores:
 The number of words from the question that are recognized in the same sequence in the window
 The number of words that separate the most distant keywords in the window
 The number of unmatched keywords in the window
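A minimal sketch of the three window scores; the same-sequence score is simplified here to counting adjacent matched-keyword pairs that appear in question order.

```python
def score_passage(window, keywords):
    positions = [i for i, w in enumerate(window) if w in keywords]
    matched = [window[i] for i in positions]
    # pairs of matched keywords appearing in the question's order
    same_order = sum(1 for a, b in zip(matched, matched[1:])
                     if keywords.index(a) < keywords.index(b))
    # words separating the most distant keywords in the window
    span = positions[-1] - positions[0] if positions else 0
    unmatched = sum(1 for k in keywords if k not in window)
    return same_order, span, unmatched   # sort passages on these three

kws = ["private", "citizen", "fly", "space"]
window = "the first private citizen to fly in space".split()
print(score_passage(window, kws))   # (3, 5, 0)
```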

Rank candidate answers in retrieved passages
Q066: Name the first private citizen to fly in space.
 Answer type: Person
 Text passage:
“Among them was Christa McAuliffe, the first private citizen to fly in space. Karen Allen, best known for her starring role in “Raiders of the Lost Ark”, plays McAuliffe. Brian Kerwin is featured as shuttle pilot Mike Smith…”
 Best candidate answer: Christa McAuliffe

Extracting Answers for Factoid Questions: NER!
 In TREC 2003 the LCC QA system extracted 289 correct answers for factoid questions
 The Named Entity Recognizer was responsible for 234 of them
 Current QA is largely based on the high accuracy recognition of a large variety of Named Entity types
QUANTITY          55    ORGANIZATION    15    PRICE         3
NUMBER            45    AUTHORED WORK   11    SCIENCE NAME  2
DATE              35    PRODUCT         11    ACRONYM       1
PERSON            31    CONTINENT        5    ADDRESS       1
COUNTRY           21    PROVINCE         5    ALPHABET      1
OTHER LOCATIONS   19    QUOTE            5    URI           1
CITY              19    UNIVERSITY       3

Semantics and Reasoning for QA: Predicate-argument structure
 Q336: When was Microsoft established?
 This question is difficult because Microsoft tends to establish lots of things…
Microsoft plans to establish manufacturing partnerships in Brazil and Mexico in May.
 Need to be able to detect sentences in which ‘Microsoft’ is object of ‘establish’ or close synonym.
 Matching sentence:
Microsoft Corp was founded in the US in 1975, incorporated in 1981, and established in the UK in 1982.
 Requires analysis of sentence syntax/semantics!
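A minimal sketch of that syntactic check using spaCy, assuming its small English model is installed; parse quality will vary, and a real system would consult a full synonym set for ‘establish’.

```python
# Check whether the entity is the thing established: the passive subject
# or direct object of an establish-type verb.
import spacy

nlp = spacy.load("en_core_web_sm")
ESTABLISH = {"establish", "found", "incorporate"}

def is_establishee(sentence, entity):
    for tok in nlp(sentence):
        if tok.lemma_ in ESTABLISH:
            for child in tok.children:
                phrase = " ".join(t.text for t in child.subtree)
                if child.dep_ in ("nsubjpass", "dobj") and entity in phrase:
                    return True
    return False

print(is_establishee("Microsoft plans to establish manufacturing "
                     "partnerships in Brazil and Mexico in May.",
                     "Microsoft"))   # False: Microsoft is the establisher
print(is_establishee("Microsoft Corp was founded in the US in 1975.",
                     "Microsoft"))   # True: Microsoft is what was founded
```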

Semantics and Reasoning for QA: Syntax to Logical Forms
• Syntactic analysis plus semantics ⇒ logical form
• Map the question LF onto potential answer LFs to find the best match

Abductive inference
 System attempts inference to justify an answer (often following lexical chains)
 Their inference is a kind of funny middle ground between logic and pattern matching
 But very effective: 30% improvement
 Q: When was the internal combustion engine invented?
 A: The first internal-combustion engine was built in 1867.
 invent → create_mentally → create → build
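A minimal sketch of chasing such chains upward through WordNet hypernyms with NLTK; it assumes the wordnet corpus has been downloaded, and the exact synset names depend on the WordNet version.

```python
# Do any verb senses of 'invent' and 'build' meet at a common ancestor
# (the slide's chain: invent -> create_mentally -> create <- build)?
from nltk.corpus import wordnet as wn

def up_chain(synset):
    """All synsets reachable by repeatedly taking hypernyms."""
    seen, frontier = set(), [synset]
    while frontier:
        s = frontier.pop()
        if s not in seen:
            seen.add(s)
            frontier.extend(s.hypernyms())
    return seen

invent = set().union(*(up_chain(s) for s in wn.synsets("invent", wn.VERB)))
build = set().union(*(up_chain(s) for s in wn.synsets("build", wn.VERB)))
print(sorted(s.name() for s in invent & build))   # shared ancestors, if any
```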

Question Answering Example
Q: How hot does the inside of an active volcano get?
 get(TEMPERATURE, inside(volcano(active)))
A: “lava fragments belched out of the mountain were as hot as 300 degrees Fahrenheit”
 fragments(X, lava, temperature(degrees(300))), belched(X, mountain)
 volcano ISA mountain
 lava ISPARTOF volcano ⇒ lava inside volcano
 fragments of lava HAVEPROPERTIESOF lava
 The needed semantic information is in WordNet definitions, and was successfully translated into a form that was used for rough ‘proofs’

Answer Validation motivates the Robust Textual Inference Task

The task: Can systems correctly perform ‘local textual inferences’ [individual inference steps]?
 On the assumption that some piece of text (T) is true, does this imply the truth of some other hypothesis text (H)?
 Sydney was the host city of the 2000 Olympics →
 The Olympics have been held in Sydney TRUE
 The format could be used for evaluating extended inferential chains or knowledge
 But, in practice, fairly direct stuff

The textual inference task
 Does text T justify an inference to hypothesis H?
 Emphasis on variability of linguistic expression
 Robust, accurate textual inference would enable:
 Semantic search: H: lobbyists attempting to bribe U.S. legislators
T: The A.P. named two more senators who received contributions engineered by lobbyist Jack Abramoff in return for political favors.
 Question answering: H: Who bought J.D. Edwards?
T: Thanks to its recent acquisition of J.D. Edwards, Oracle will soon be able…
 Customer email response
 Relation extraction (database building)
 Document summarization

Natural Examples: Reading Comprehension
 (CNN Student News) — January 24, 2006
 Answer the following questions about today’s featured news stories. Write your answers in the space provided.
 1. Where is the country of Somalia located? What ocean borders this country?
 2. Why did crew members from the USS Winston S. Churchill recently stop a small vessel off the coast of Somalia? What action did the crew of the Churchill take?

Verification of terms [Dan Roth]
 Non-disclosure Agreement
WHEREAS Recipient is desirous of obtaining said confidential information for purposes of evaluation thereof and as a basis for further discussions with Owner regarding assistance with development of the confidential information for the benefit of Owner or for the mutual benefit of Owner and Recipient;
THEREFORE, Recipient hereby agrees to receive the information in confidence and to treat it as confidential for all purposes. Recipient will not divulge or use in any manner any of said confidential information unless by written consent from Owner, and Recipient will use at least the same efforts it regularly employs for its own confidential information to avoid disclosure to others.
Provided, however, that this obligation to treat information confidentially will not apply to any information already in Recipient’s possession or to any information that is generally available to the public or becomes generally available through no act or influence of Recipient. Recipient will inform Owner of the public nature or Recipient’s possession of the information without delay after Owner’s disclosure thereof or will be stopped from asserting such as defense to remedy under this agreement.
Each party acknowledges that all of the disclosing party’s Confidential Information is owned solely by the disclosing party (or its licensors and/or other vendors) and that the unauthorized disclosure or use of such Confidential Information would cause irreparable harm and significant injury, the degree of which may be difficult to ascertain. Accordingly, each party agrees that the disclosing party will have the right to obtain an immediate injunction enjoining any breach of this Agreement, as well as the right to pursue any and all other rights and remedies available at law or in equity for such a breach.
Recipient will exercise its best efforts to conduct its evaluation within a reasonable time after Owner’s disclosure and will provide Owner with its assessment thereof without delay. Recipient will return all information, including all copies thereof, to Owner upon request. This agreement shall remain in effect for ten years after the date of its execution, and it shall be construed under the laws of the State of Texas.
 Conditions I care about:
 All information discussed is freely shareable unless the other party indicates in advance that it is confidential
 TRUE? FALSE?

Stanford three-stage architecture
[MacCartney et al. 2006]
T: India buys missiles. ⊨ H: India acquires arms.
[Figure: the three stages on this example]
1. Linguistic analysis: parse T and H into typed dependency graphs (buys –nsubj→ India, buys –dobj→ missiles; acquires –nsubj→ India, acquires –dobj→ arms), annotating nodes with POS (NNP, VBZ, …), NER (LOCATION), and IDF (0.027, 0.045, …).
2. Graph alignment: align each node of H to a node of T with a lexical match cost (India → India: 0.00; acquires → buys: −0.53; arms → missiles: −0.75), giving alignment score = −1.28 (alignment: good).
3. Features & classification: extract features fᵢ of the aligned pair (e.g., structure match) with weights wᵢ (e.g., 0.10, 0.30), add them to the alignment score (here −0.88), and compare against a tuned threshold to output yes/no.
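A minimal sketch of stage 3, reusing the illustrative numbers from the figure; the feature names, weights, and threshold here are examples, not the trained model.

```python
def classify(alignment_score, features, weights, threshold):
    """Add weighted feature values to the alignment score; threshold it."""
    score = alignment_score + sum(weights[f] for f in features)
    return ("yes" if score >= threshold else "no"), round(score, 2)

weights = {"structure_match": 0.10, "good_alignment": 0.30}
print(classify(-1.28, ["structure_match", "good_alignment"],
               weights, threshold=-0.90))
# ('yes', -0.88): a tuned threshold turns the score into a decision
```

The weights can be learned by logistic regression over labeled T/H pairs, or hand-tuned, as the later slide on features notes.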

Textual inference as graph alignment
[Haghighi et al. 05, de Salvo Braz et al. 05]

T: CNN reported that thirteen soldiers lost their lives in today’s ambush.
H: Several troops were killed in the ambush.
[Figure: typed dependency graphs. T: reported (nsubj: CNN; ccomp: lost); lost (nsubj: soldiers, with nn: thirteen; dobj: lives, with poss: their; in: ambush, with poss: today’s). H: killed (aux: were; nsubjpass: troops, with amod: several; in: ambush, with det: the). Alignment: soldiers → troops, thirteen → several, lost lives → killed, ambush → ambush.]
 Find least cost alignment of H to part of T, using locally decomposable cost model (lexical and structural costs)
 Assumption: good alignment ⇒ valid inference

Problem: graph embedding isn’t sufficient
 To be tractable, alignment scoring must be local
 But valid inference can hinge on non-local factors:
T1: GM acknowledged that accounting deficiencies had led to errors.
H: Accounting deficiencies led to errors. ⊨
T2: Newsweek retracted its report that GM acknowledged that accounting deficiencies had led to errors.
H: Accounting deficiencies led to errors. ⊭

Features of valid inferences
 After alignment, extract features of inference
 Look for global characteristics of valid and invalid inferences
 Features embody crude semantic theories
 Feature categories: adjuncts, modals, quantifiers, implicatives, antonymy, tenses, structure, explicit numbers & dates
 Alignment score is also an important feature
 Extracted features ⇒ statistical model ⇒ score
 Can learn feature weights using logistic regression
 Or, can use hand-tuned weights
 (Score ≥ threshold) ? ⇒ prediction: yes/no
 Threshold can be tuned

Structural (mis-)match features
T: Ahmadinejad attacked the “threat” to bring the issue of Iran’s nuclear activity to the UN Security Council by the US, France, Britain and Germany.
H: Ahmadinejad attacked the UN Security Council. (FALSE)
 Check the main predicate of the hypothesis and its match in the text to assess compatibility using syntactic grammatical relations:
 Object of attack in hypothesis is not related to object of attack in text

Factives & other implicatives
T: Libya has tried, with limited success, to develop its own indigenous missile, and to extend the range of its aging SCUD force for many years under the Al Fatah and other missile programs.
H: Libya has developed its own domestic missile program. (FALSE)
T: Scientists have discovered that drinking tea protects against heart disease by improving the function of the artery walls.
H: Tea protects from some disease. (TRUE)
 Evaluate governing verbs for implicativity
 Unknown: say, tell, suspect, try, …
 Fact: know, wonderful, …
 True: manage to, …
 False: doubtful, misbelieve, …
 Need to check for negative context
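A minimal sketch of an implicativity lookup using the verb classes above; the polarity-flip rule is a simplification of the real semantics.

```python
IMPLICATIVITY = {
    "say": "unknown", "tell": "unknown", "suspect": "unknown", "try": "unknown",
    "know": "fact", "manage": "true", "doubt": "false", "misbelieve": "false",
}

def complement_truth(verb, negated=False):
    """What the governing verb licenses about its complement's truth."""
    cls = IMPLICATIVITY.get(verb, "unknown")
    if cls == "unknown":
        return "unknown"          # 'try to X' licenses nothing about X
    if cls == "fact":
        return "true"             # factives survive negation
    flip = {"true": "false", "false": "true"}
    return flip[cls] if negated else cls   # implicatives flip under negation

print(complement_truth("try"))            # unknown
print(complement_truth("manage"))         # true
print(complement_truth("manage", True))   # false: "didn't manage to X"
```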

Restrictive adjuncts
 We can check whether adding/dropping restrictive adjuncts is licensed relative to upward and downward entailing contexts
 In all, Zerich bought $422 million worth of oil from Iraq, according to the Volcker committee
 ⊭ Zerich bought oil from Iraq during the embargo
 Zerich didn’t buy any oil from Iraq, according to the Volcker committee
 ⊨ Zerich didn’t buy oil from Iraq during the embargo

QA beyond TREC Collections
 Answers to complex questions that require a longer answer
 What is a PIC Freeze?
 Can I travel with Ameripass in Mexico?
 Soricut and Brill 2006
 Use the web (real FAQ websites)
 Otterbacher et al. 2005
 Random walk model similar to PageRank
 Daume and Marcu 2006
 Formal model for query expansion

Not all problems are solved yet!
 Where do lobsters like to live?
 on a Canadian airline
 Where are zebras most likely found?
 near dumps
 in the dictionary
 Why can’t ostriches fly?
 Because of American economic sanctions
 What’s the population of Mexico?
 Three
 What can trigger an allergic reaction?
 ..something that can trigger an allergic reaction

References
R. F. Simmons. 1970. Natural Language Question-Answering Systems: 1969. Communications of the ACM, Volume 13.
M. Banko, E. Brill, S. Dumais, and J. Lin. 2002. AskMSR: Question Answering Using the Worldwide Web. In Proceedings of the 2002 AAAI Symposium on Mining Answers from Text and Knowledge Bases, March 2002.
S. Dumais, M. Banko, E. Brill, J. Lin, A. Ng. 2002. Web Question Answering: Is More Always Better? SIGIR 2002.
D. Ravichandran and E.H. Hovy. 2002. Learning Surface Patterns for a Question Answering System. ACL conference, July 2002.
M. Pasca and S. Harabagiu. 2001. High Performance Question/Answering. In ACM SIGIR-2001, New Orleans LA, pages 366-374.

L. Hirschman, M. Light, E. Breck and J. Burger. Deep Read: A Reading Comprehension System. In ACL 1999.
M. Light, G. Mann, E. Riloff and E. Breck. Analyses for Elucidating Current Question Answering Technology. Journal of Natural Language Engineering, Vol. 7, No. 4 (2001).
M. M. Soubbotin. Patterns of Potential Answer Expressions as Clues to the Right Answer. TREC 2001.
H. Daume and D. Marcu. Bayesian Query-Focused Summarization. ACL 2006
J. Otterbacher, G. Erkan, and D. Radev. Using random walks for question-focused sentence retrieval. In Proceedings of HLT-EMNLP, 2005
R. Soricut and E. Brill. 2006. Automatic Question Answering Using the Web: Beyond the Factoid. Journal of Information Retrieval – Special Issue on Web Information Retrieval, 9:191-206