Interpretation of Natural Language Rules in
Conversational Machine Reading

Marzieh Saeidi1∗, Max Bartolo1*, Patrick Lewis1*, Sameer Singh1,2,
Tim Rocktäschel3, Mike Sheldon1, Guillaume Bouchard1, and Sebastian Riedel1,3

1Bloomsbury AI
2University of California, Irvine

3University College London

{marzieh.saeidi,maxbartolo,patrick.s.h.lewis}@gmail.com

Abstract

Most work in machine reading focuses on
question answering problems where the an-
swer is directly expressed in the text to read.
However, many real-world question answer-
ing problems require the reading of text not
because it contains the literal answer, but be-
cause it contains a recipe to derive an answer
together with the reader’s background knowl-
edge. One example is the task of interpret-
ing regulations to answer “Can I…?” or “Do
I have to…?” questions such as “I am work-
ing in Canada. Do I have to carry on pay-
ing UK National Insurance?” after reading a
UK government website about this topic. This
task requires both the interpretation of rules
and the application of background knowledge.
It is further complicated due to the fact that,
in practice, most questions are underspecified,
and a human assistant will regularly have to
ask clarification questions such as “How long
have you been working abroad?” when the an-
swer cannot be directly derived from the ques-
tion and text. In this paper, we formalise this
task and develop a crowd-sourcing strategy to
collect 32k task instances based on real-world
rules and crowd-generated questions and sce-
narios. We analyse the challenges of this task
and assess its difficulty by evaluating the per-
formance of rule-based and machine-learning
baselines. We observe promising results when
no background knowledge is necessary, and
substantial room for improvement whenever
background knowledge is needed.

1 Introduction

There has been significant progress in teaching ma-
chines to read text and answer questions when the
answer is directly expressed in the text (Rajpurkar
et al., 2016; Joshi et al., 2017; Welbl et al., 2018;
Hermann et al., 2015). However, in many settings,

∗These three authors contributed equally

[Figure 1: two conversational machine reading utterances (rule text, scenario, question, follow-up, answer) shown alongside a traditional QA instance (support text, question, answer).]

Figure 1: An example of two utterances for rule
interpretation. In the first utterance, a follow-up
question is generated. In the second, the scenario,
history and background knowledge (Canada is not
in the EEA) are used to arrive at the answer “Yes”.

the text contains rules expressed in natural lan-
guage that can be used to infer the answer when
combined with background knowledge, rather than
the literal answer. For example, to answer some-
one’s question “I am working for an employer in
Canada. Do I need to carry on paying National
Insurance?” with “Yes”, one needs to read that
“You’ll carry on paying National Insurance if you’re
working for an employer outside the EEA” and un-
derstand how the rule and question determine the
answer.

Answering questions that require rule interpre-
tation is often further complicated due to missing
information in the question. For example, as il-
lustrated in Figure 1 (Utterance 1), the actual rule
also mentions that National Insurance only needs
to be paid for the first 52 weeks when abroad. This
means that we cannot answer the original question
without knowing how long the user has already
been working abroad. Hence, the correct response
in this conversational context is to issue another
query such as “Have you been working abroad 52
weeks or less?”

To capture the fact that question answering in
the above scenario requires a dialog, we hence con-
sider the following conversational machine read-
ing (CMR) problem as displayed in Figure 1: Given
an input question, a context scenario of the ques-
tion, a snippet of supporting rule text containing a
rule, and a history of previous follow-up questions
and answers, predict the answer to the question
(“Yes” or “No”) or, if needed, generate a follow-up
question whose answer is necessary to answer the
original question. Our goal in this paper is to create
a corpus for this task, understand its challenges,
and develop initial models that can address it.

To collect a dataset for this task, we could give a
textual rule to an annotator and ask them to provide
an input question, scenario, and dialog in one go.
This poses two problems. First, this setup would
give us very little control. For example, users
would decide which follow-up questions become
part of the scenario and which are answered with
“Yes” or “No”. Ultimately, this can lead to bias
because annotators might tend to answer “Yes”,
or focus on the first condition. Second, the more
complex the task, the more likely crowd annotators
are to make mistakes. To mitigate these effects, we
aim to break up the utterance annotation as much
as possible.

We hence develop an annotation protocol in
which annotators collaborate with virtual users—
agents that give system-produced answers to
follow-up questions—to incrementally construct
a dialog based on a snippet of rule text and a sim-
ple underspecified initial question (e.g., “Do I need
to …?”), and then produce a more elaborate ques-
tion based on this dialog (e.g., “I am … Do I need
to…?”). By controlling the answers of the virtual
user, we control the ratio of “Yes” and “No” an-
swers. And by showing only subsets of the dialog
to the annotator that produces the scenario, we can
control what the scenario is capturing. The ques-
tion, rule text and dialogs are then used to produce
utterances of the kind we see in Figure 1. Annota-
tors show substantial agreement when constructing
dialogs with a three-way annotator agreement at a
Fleiss’ Kappa level of 0.71.1 Likewise, we find that

1This is well within the range of what is considered
substantial agreement (Artstein and Poesio, 2008).

our crowd-annotators produce questions that are
coherent with the given dialogs with high accuracy.

In theory, the task could be addressed by an end-
to-end neural network that encodes the question,
history and previous dialog, and then decodes a
Yes/No answer or question. In practice, we test
this hypothesis using a seq2seq model (Sutskever
et al., 2014; Cho et al., 2014), with and without
copy mechanisms (Gu et al., 2016) to reflect how
follow-up questions often use lexical content from
the rule text. We find that despite a training set
size of 21,890 utterances, successful mod-
els for this task need a stronger inductive bias due
to the inherent challenges of the task: interpret-
ing natural language rules, generating questions,
and reasoning with background knowledge. We
develop heuristics that can work better in terms of
identifying what questions to ask, but they still fail
to interpret scenarios correctly. To further motivate
the task, we also show in oracle experiments that a
CMR system can help humans to answer questions
faster and more accurately.

This paper makes the following contributions:
1. We introduce the task of conversational machine reading and provide evaluation metrics.

2. We develop an annotation protocol to collect annotations for conversational machine reading, suitable for use in crowd-sourcing platforms such as Amazon Mechanical Turk.

3. We provide a corpus of over 32k conversational machine reading utterances, from domains such as grant descriptions, traffic laws and benefit programs, and include an analysis of the challenges the corpus poses.

4. We develop and compare several baseline models for the task and subtasks.

2 Task Definition

Figure 1 shows an example of a conversational ma-
chine reading problem. A user has a question that
relates to a specific rule or part of a regulation, such
as “Do I need to carry on paying National Insur-
ance?”. In addition, a natural language description
of the context or scenario, such as “I am working
for an employer in Canada”, is provided. The ques-
tion will need to be answered using a small snippet
of supporting rule text. Akin to machine reading
problems in previous work (Rajpurkar et al., 2016;
Hermann et al., 2015), we assume that this snip-
pet is pre-identified. We generally assume that the
question is underspecified, in the sense that the

question often does not provide enough informa-
tion to be answered directly. However, an agent can
use the supporting rule text to infer what needs to
be asked in order to determine the final answer. In
Figure 1, for example, a reasonable follow-up ques-
tion is “Have you been working abroad 52 weeks
or less?”.

We formalise the above task on a per-utterance
basis. A given dialog corresponds to a sequence of
prediction problems, one for each utterance the
system needs to produce. Let W be a vocabu-
lary. Let q = w1 . . . wnq be an input question and
r = w1 . . . wnr an input support rule text, where
each wi ∈ W. Furthermore, let
h = (f1, a1) . . . (fnh, anh) be a dialog history,
where each fi ∈ W∗ is a follow-up question
and each ai ∈ {YES, NO} is a follow-up answer.
Let s be a scenario describing the context of
the question. We will refer to x = (q, r, h, s) as
the input. Given an input x, our task is to predict
an answer y ∈ {YES, NO, IRRELEVANT} ∪ W ∗
that specifies whether the answer to the input ques-
tion, in the context of the rule text and the previ-
ous follow-up question dialog, is either YES, NO,
IRRELEVANT or another follow-up question in W ∗.
Here IRRELEVANT is the target answer whenever
a rule text is not related to the question q.
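To make this formalisation concrete, the following is a minimal sketch in Python of how one prediction problem could be represented; the field names are illustrative and need not match the released data format.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Utterance:
    # One prediction problem: input x = (q, r, h, s) and target y.
    question: str                      # q: the (possibly underspecified) input question
    rule_text: str                     # r: the supporting rule text snippet
    scenario: str                      # s: free-text context, possibly empty
    history: List[Tuple[str, str]] = field(default_factory=list)  # h: (follow-up, "Yes"/"No") pairs
    answer: str = ""                   # y: "Yes", "No", "Irrelevant", or a follow-up question

example = Utterance(
    question="Do I need to carry on paying UK National Insurance?",
    rule_text=("You'll carry on paying National Insurance for the first 52 weeks "
               "you're abroad if you're working for an employer outside the EEA."),
    scenario="I am working for an employer in Canada.",
    history=[("Have you been working abroad 52 weeks or less?", "Yes")],
    answer="Yes",
)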

3 Annotation Protocol

Our annotation protocol is depicted in Figure 2 and
has four high-level stages: Rule Text Extraction,
Question Generation, Dialog Generation and Sce-
nario Annotation. We present these stages below,
together with discussion of our quality-assurance
mechanisms and method to generate negative data.
For more details, such as annotation interfaces, we
refer the reader to Appendix A.

3.1 Rule Text Extraction Stage

First, we identify the source documents that con-
tain the rules we would like to annotate. Source
documents can be found in Appendix C. We then
convert each document to a set of rule texts using
a heuristic which identifies and groups paragraphs
and bulleted lists. To preserve readability during
the annotation, we also split by a maximum rule
text length and a maximum number of bullets.

3.2 Question Generation Stage

For each rule text we ask annotators to come up
with an input question. Annotators are instructed to

ask questions that cannot be answered directly but
instead require follow-up questions. This means
that the question should a) match the topic of the
support rule text, and b) be underspecified. At
present, this part of the annotation is done by expert
annotators, but in future work we plan to crowd-
source this step as well.

3.3 Dialog Generation Stage
In this stage, we view human annotators as assis-
tants that help users reach the answer to the input
question. Because the question was designed to be
broad and to omit important information, human
annotators will have to ask for this information us-
ing the rule text to figure out which question to
ask. The follow-up question is then sent to a vir-
tual user, i.e., a program that simply generates a
random YES or NO answer. If the input question
can be answered with this new information, the an-
notator should enter the respective answer. If not,
the annotator should provide the next follow-up
question and the process is repeated.

When the virtual user is providing random YES
and NO answers in the dialog generation stage,
we are traversing a specific branch of a decision
tree. We want the corpus to reflect all possible
dialogs for each question and rule text. Hence, we
ask annotators to label additional branches. For
example, if the first annotator received a YES as
the answer to the second follow-up question in
Figure 3, the second annotator (orange) receives a
NO.
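The branch bookkeeping can be illustrated with a small sketch: under the simplifying assumption that every branch of the dialog tree has the same depth, enumerating all Yes/No answer sequences yields one assignment per branch to be routed to an annotator. The helper below is illustrative and not the authors' annotation tooling.

from itertools import product

def branch_assignments(depth):
    # All Yes/No answer sequences of a binary dialog tree of the given depth,
    # so that each branch can be routed to a (possibly different) annotator.
    return [list(answers) for answers in product(["Yes", "No"], repeat=depth)]

# For the two follow-up questions of Figure 3 this gives four branches:
# ['Yes', 'Yes'], ['Yes', 'No'], ['No', 'Yes'], ['No', 'No']
print(branch_assignments(2))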

3.4 Scenario Annotation Stage
In the final stage, we choose parts of the dialogs cre-
ated in the previous stage and present them to an an-
notator. For example, the annotator sees “Are you
working or preparing for work?” and NO. They
are then asked to write a scenario that is consistent
with this dialog such as “I am currently out of work
after being laid off from my last job, but am not
able to look for any yet.”. The number of questions
and answers that the annotator is presented with for
generating a scenario can vary from one to the full
length of a dialog. Users are encouraged to para-
phrase the questions and not to use many words
from the dialog.

In an attempt to make these scenarios closer to
the real-world situations where a user may provide
a lot of unnecessary information to an operator,
not only do we present users with one or more
questions and answers from a specific dialog but
also with one question from a random dialog. The
annotators are asked to come up with a scenario
that fits all the questions and answers.

Figure 2: The different stages of the annotation process (excluding the rule text extraction stage). First
a human annotator generates an underspecified input question (question generation). Then, a virtual
user and a human annotator collaborate to produce a dialog of follow-up questions and answers (dialog
generation). Finally, a scenario is generated from parts of the dialog, and these parts are omitted in the
final result.

Figure 3: We use different annotators (indicated by
different colors) to create the complete dialog tree.

Finally, a dialog is produced by combining the
scenario with the input question and rule text from
the previous stages. In addition, all dialog utter-
ances that were not shown to the final annotator
are included as well as they complement the in-
formation in the scenario. Given a dialog of this
form, we can create utterances that are described
in Section 2.

As a result of this stage of annotation, we create a
corpus of scenarios and questions where the correct
answers (YES, NO or IRRELEVANT) to questions
can be derived from the related scenarios. This
corpus and its challenges will be discussed in Sec-
tion 4.2.2.

3.5 Negative Examples

To facilitate the future application of the models
to large-scale rule-based documents instead of rule

text, we deem it to be imperative for the data to
contain negative examples of both questions and
scenarios.

We define a negative question as a question that
is not relevant to the rule text. In this case, we ex-
pect models to produce the answer IRRELEVANT.
For a given rule text and question pair, a negative
example is generated by sampling a random ques-
tion from the set of all possible questions, exclud-
ing the question itself and questions sourced from
the same document using a methodology similar to
the work of Levy et al. (2017).

The data created so far is biased in the sense
that when a scenario is given, at least one of the
follow-up questions in a dialog can be answered.
In practice, we expect users to also provide back-
ground scenarios that are completely irrelevant to
the input question. Therefore, we sample a nega-
tive scenario for each input question and rule text
pair, (q, r) in our data. We uniformly sample from
the scenarios created in Section 3.4 for all question
and rule text pairs (q′, r′) unequal to (q, r). For
more details, we point the reader to Appendix D.

3.6 Quality Control

We employ a range of quality control measures
throughout the process. In particular, we:

1. Re-annotate pre-terminal nodes in the dia-
log trees if they have identical YES and NO
branches.

2. Ask annotators to validate the previous dialog
in case previous utterances were created by
different annotators.

3. Assess a sample of annotations for each annotator and keep only those annotators with quality scores higher than a certain threshold.

4. Require annotators to pass a qualification
test before selecting them for our tasks. We
also require high approval rates and restrict
location to the UK, US, or Canada.

Further details are provided in Appendix B.

3.7 Cost, Duration and Scalability

The cost of different stages of annotation is as fol-
lows. An annotator was paid $0.15 for an initial
question (948 questions), $0.11 for a dialog part
(3000 dialog parts) and $0.20 for a scenario (6,600
scenarios). It took 2 weeks in total to complete the
annotation process. Considering that all the annota-
tion stages can be done through crowdsourcing and
in a relatively short time period and at a reasonable
cost using established validation procedures, the
dataset can be scaled up without major bottlenecks
or an impact on the quality.

4 ShARC

In this section, we present the Shaping Answers
with Rules through Conversation (ShARC) dataset.2

4.1 Dataset Size and Quality

The dataset is built up from 948 distinct snip-
pets of rule text. Each has an input question and
a “dialog tree”. At each step in the dialog, a
follow-up question is posed and the tree branches
depending on the answer to the follow-up question
(yes/no). The ShARC dataset comprises all
individual “utterances” from every tree, i.e. ev-
ery possible point/node in any dialog tree. There
are 6058 of these utterances. In addition, there
are 6637 scenarios that provide more information,
allowing some questions in the dialog tree to be
“skipped” as the answers can be inferred from the
scenario. Scenarios therefore modify the dialog
trees, which creates new trees. When combined
with scenarios and negative sampled scenarios, the
total number of distinct utterances became 37087.
As a final step, utterances were removed where the
scenario referred to a portion of the dialog tree that
was unreachable for that utterance, leaving a final
dataset size of 32436 utterances.3

2The dataset and its Codalab challenge can be found at
https://sharc-data.github.io.

3One may argue that the size of the dataset is not
sufficient for training end-to-end neural models. While we
believe that the availability of large datasets such as SNLI or
SQuAD has helped drive the state-of-the-art forward on related
tasks, relying solely on large datasets to push the boundaries
of AI cannot be as practical as developing better models for
incorporating common sense and external knowledge, which
we believe ShARC is a good test-bed for. Furthermore, the
proposed annotation protocol and evaluation procedure can be
used to reliably extend the dataset or create datasets for new
domains.
We break these into train, development and test
sets such that each dataset contains approximately
the same proportion of sources from each domain,
targeting a 70%/10%/20% split.

To evaluate the quality of dialog generation
HITs, we sample a subset of 200 rule texts and ques-
tions and allow each HIT to be annotated by three
distinct workers. In terms of deciding whether the
answer is a YES, NO or some follow-up question,
the three annotators reach an answer agreement of
72.3%. We also calculate Cohen’s Kappa, a mea-
sure designed for situations with two annotators.
We randomly select two out of the three annota-
tions and compute the unweighted kappa values,
repeated 100 times and averaged to give a value
of 0.82.

The above metrics measure whether annota-
tors agree in terms of deciding between YES, NO
or some follow-up question, but not whether the
follow-up questions are equivalent. To approxi-
mate this, we calculate BLEU scores between pairs
of annotators when they both predict follow-up
questions. Generally, we find high agreement: An-
notators reach average BLEU scores of 0.71, 0.63,
0.58 and 0.58 for maximum orders of 1, 2, 3 and 4
respectively.

To get an indication of human performance on
the sub-task of classifying whether a response
should be a YES, NO or FOLLOW-UP QUESTION,
we use a methodology similar to Rajpurkar et al.
(2016) by considering the second answer to each
question as the human prediction and taking the
majority vote as ground truth. The resulting human
accuracy is 93.9%.

To evaluate the quality of the scenarios, we sam-
ple 100 scenarios randomly and ask two expert
annotators to validate them. We perform validation
for two cases: 1) scenarios generated by turkers
who did not attempt the qualification test and were
not filtered by our validation process, 2) scenar-
ios that are generated by turkers who have passed
the qualification test and validation process. In the
second case, annotators approved an average of
89 of the 100 scenarios, whereas in the first case they
only approved an average of 38. This shows that
the qualification test and the validation process im-
proved the quality of the generated scenarios, more
than doubling the approval rate. In both cases, the annotators
agreed on the validity of 91-92 of the scenarios.
For further details on dataset quality, the reader is
referred to Appendix B.

4.2 Challenges

We analyse the challenges involved in solving con-
versational machine reading in ShARC. We divide
these into two parts: challenges that arise when
interpreting rules, and challenges that arise when
interpreting scenarios.

4.2.1 Interpreting Rules
When no scenarios are available, the task reduces
to a) identifying the follow-up questions within
the rule text, b) understanding whether a follow-up
question has already been answered in the history,
and c) determining the logical structure of the rule
(e.g. disjunction vs. conjunction vs. conjunction of
disjunctions).

To illustrate the challenges that these sub-tasks
involve, we manually categorise a random sample
of 100 (qi, ri) pairs. We identify 9 phenomena of
interest, and estimate their frequency within the
corpus. Here we briefly highlight some categories
of interest, but full details, including examples, can
be found in Appendix G.

A large fraction of problems involve the
identification of at least two conditions, and
approximately 41% and 27% of the cases involve
logical disjunctions and conjunctions respectively.
These can appear in linguistic coordination
structures as well as bullet points. Often, differ-
entiating between conjunctions and disjunctions
is easy when considering bullets—key phrases
such as “if all of the following hold” can give
this away. However, in 13% of the cases, no
such cues are given and we have to rely on lan-
guage understanding to differentiate. For example:

Q: Do I qualify for Statutory Maternity Leave?
R: You qualify for Statutory Maternity Leave if

– you’re an employee not a “worker”
– you give your employer the correct notice

4.2.2 Interpreting Scenarios
Scenario interpretation can be considered as a
multi-sentence entailment task. Given a sce-
nario (premise) of (usually) several sentences,
and a question (hypothesis), a system should out-

put YES (ENTAILMENT), NO (CONTRADICTION)
or IRRELEVANT (NEUTRAL). In this context,
IRRELEVANT indicates that the answer to the ques-
tion cannot be inferred from the scenario.

Different types of reasoning are required to in-
terpret the scenarios. Examples include numeri-
cal reasoning, temporal reasoning and implication
(common sense and external knowledge). We man-
ually label 100 scenarios with the type of reasoning
required to answer their questions. Table 1 shows
examples of different types of reasoning and their
percentages. Note that these percentages do not add
up to 100% as interpreting a scenario may require
more than one type of reasoning.

5 Experiments

To assess the difficulty of ShARC as a machine
learning problem, we investigate a set of baseline
approaches on the end-to-end task as well as the im-
portant sub-tasks we identified. The baselines are
chosen to assess and demonstrate both feasibility
and difficulty of the tasks.

Metrics For all following classification tasks, we
use micro- and macro- averaged accuracies. For the
follow-up generation task, we compute the BLEU
scores at orders 1, 2, 3 and 4 computed between the
gold follow-up questions yi and the predicted follow-up
questions ŷi = wŷi,1, wŷi,2 . . . wŷi,n for all utterances i
in the evaluation dataset.
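As a concrete reference, a minimal sketch of these metrics in Python follows (using NLTK for BLEU); the official evaluation script may differ in details such as smoothing and tokenisation.

from collections import defaultdict
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def micro_macro_accuracy(gold, pred):
    # Micro accuracy: overall fraction of correct predictions.
    # Macro accuracy: per-class accuracy, averaged over the classes seen in gold.
    correct, per_class = 0, defaultdict(lambda: [0, 0])  # class -> [correct, total]
    for g, p in zip(gold, pred):
        per_class[g][1] += 1
        if g == p:
            correct += 1
            per_class[g][0] += 1
    micro = correct / len(gold)
    macro = sum(c / t for c, t in per_class.values()) / len(per_class)
    return micro, macro

def bleu_n(gold_question, predicted_question, n):
    # BLEU with maximum n-gram order n for a single follow-up question pair.
    weights = tuple([1.0 / n] * n)
    return sentence_bleu([gold_question.split()], predicted_question.split(),
                         weights=weights,
                         smoothing_function=SmoothingFunction().method1)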

5.1 Classification (excluding Scenarios)
On each turn, a CMR system needs to decide, either
explicitly or implicitly, whether the answer is YES
or NO, whether the question is not relevant to the
rule text (IRRELEVANT), or whether a follow-up
question is necessary—an outcome we label as
MORE. In the following experiments, we will test
whether one can learn to make this decision using
the ShARC training data.

When a non-empty scenario is given, this task
also requires an understanding of how scenarios
answer follow-up questions. In order to focus on
the challenges of rule interpretation, here we only
consider empty scenarios.

Formally, for an utterance x = (q, r, h, s), we
require models to predict an answer y where y ∈
{YES, NO, IRRELEVANT, MORE}. Since we con-
sider only the classification task without scenario
influence, we consider the subset of utterances such
that s = NULL. This data subset consists of 4026
train, 431 dev and 1601 test utterances.

Category | Question | Scenario | %
Explicit | Has your wife reached state pension age? Yes | My wife just recently reached the age for state pension. | 25%
Temporal | Did you own it before April 1982? Yes | I purchased the property on June 5, 1980. | 10%
Geographic | Do you normally live in the UK? No | I’m a resident of Germany. | 7%
Numeric | Do you work less than 24 hours a week between you? No | My wife and I work long hours and get between 90 – 110 hours per week between the two of us. | 12%
Paraphrase | Are you working or preparing for work? No | I am currently out of work after being laid off from my last job, but am not able to look for any yet. | 19%
Implication | Are you the baby’s father? No | My girlfriend is having a baby by her ex. | 51%

Table 1: Types of reasoning and their proportions in the dataset based on 100 samples. Implication includes
reasoning beyond what is explicitly stated in the text, including common sense reasoning and external
knowledge.

Model Micro Acc. Macro Acc.

Random 0.254 0.250
Surface LR 0.555 0.511
Heuristic 0.791 0.779
Random Forest 0.808 0.797

CNN 0.677 0.681

Table 2: Selected Results of the baseline models on
the classification sub-task.

Baselines We evaluate various baselines includ-
ing random, a surface logistic regression applied to
a TFIDF representation of the rule text, question
and history, a rule-based heuristic which makes
predictions depending on the number of overlap-
ping words between the rule text and question, de-
tecting conjunctive or disjunctive rules, detecting
negative mismatch between the rule text and the
question and what the answer to the last follow-up
history was, a feature-engineered Random Forest
and a Convolutional Neural Network applied to
the tokenised inputs of the concatenated rule text,
question and history.
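For illustration, a minimal sketch of the surface logistic regression baseline is given below (TF-IDF over the concatenated rule text, question and history, followed by a linear classifier); the feature configuration shown is an assumption, not the exact setup behind Table 2.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def concatenate(rule_text, question, history):
    # Flatten the input (q, r, h) into a single string for the TF-IDF vectoriser.
    flat_history = " ".join(f"{f} {a}" for f, a in history)
    return f"{rule_text} {question} {flat_history}"

def train_surface_lr(examples, labels):
    # examples: list of (rule_text, question, history) triples with empty scenarios;
    # labels: one of YES, NO, IRRELEVANT, MORE.
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    model.fit([concatenate(r, q, h) for r, q, h in examples], labels)
    return model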

Results We find that, for this classification sub-
task, Random Forest slightly outperforms the
heuristic. All learnt models considerably outper-
form the random and majority baselines.

5.2 Follow-up Question Generation without
Scenarios

When the target utterance is a follow-up question,
we still have to determine what that follow-up ques-
tion is. For an utterance x = (q, r, h, s), we re-
quire models to predict an answer y where y is the
next follow-up question, y = wy,1, wy,2 . . . wy,n =
fm+1 if x has history of length m. We there-

fore consider the subset of utterances such that
s = NULL and y ∉ {YES, NO, IRRELEVANT}.
This data subset consists of 1071 train, 112 dev and
424 test utterances.

Baselines We first consider several simple base-
lines to explore the relationship between our evalu-
ation metric and the task. As annotators are encour-
aged to re-use the words from rule text when gen-
erating follow-up questions, a baseline that simply
returns the final sentence of the rule text performs
surprisingly well. We also implement a rule-based
model that uses several heuristics.

If framed as a seq2seq task, a modified Copy-
Net is most promising (Gu et al., 2016). We also
experiment with span extraction/sequence-tagging
approaches to identify relevant spans from the rule
text that correspond to the next follow-up ques-
tions. We find that Bidirectional Attention Flow
(Seo et al., 2017) performed well.4 Further imple-
mentation details can be found in Appendix H.

Results Our results, shown in Table 3 indicate
that systems that return contiguous spans from the
rule text perform better according to our BLEU
metric. We speculate that the logical forms in the
data are challenging for existing models to extract
and manipulate, which may suggest why the ex-
plicit rule-based system performed best. We fur-
ther note that only the rule-based and NMT-Copy
models are capable of generating genuine questions
rather than spans or sentences.

5.3 Scenario Interpretation

Many utterances require the interpretation of the
scenario associated with a question. If the scenario

4We use AllenNLP implementations of BiDAF & DAM

Model BLEU-1 BLEU-2 BLEU-3 BLEU-4

First Sent. 0.221 0.144 0.119 0.106

NMT-Copy 0.339 0.206 0.139 0.102
BiDAF 0.450 0.375 0.338 0.312
Rule-based 0.533 0.437 0.379 0.344

Table 3: Selected Results of the baseline models on
follow-up question generation.

Model Micro Acc. Macro Acc.

Random 0.330 0.326
Surface LR 0.682 0.333

DAM (SNLI) 0.479 0.362
DAM (ShARC) 0.492 0.322

Table 4: Results of entailment models on ShARC.

is understood, certain follow-up questions can be
skipped because they are answered within the sce-
nario. In this section, we investigate how difficult
scenario interpretation is by training models to an-
swer follow-up questions based on scenarios.

Baselines We use a random baseline and also im-
plement a surface logistic regression applied to a
TFIDF representation of the combined scenario and
the question. For neural models, we use Decom-
posed Attention Model (DAM) (Parikh et al., 2016)
trained on each of the SNLI and ShARC corpora, us-
ing ELMo embeddings (Peters et al., 2018).4

Results Table 4 shows the result of our baseline
models on the entailment corpus of ShARC test
set. Results show poor performance especially for
the macro accuracy metric of both simple baselines
and neural state-of-the-art entailment models. This
performance highlights the challenges that the sce-
nario interpretation task of ShARC presents, many
of which are discussed in Section 4.2.2.

5.4 Conversational Machine Reading

The CMR task requires all of the above abilities. To
understand its core challenges, we compare base-
lines that are trained end-to-end vs. baselines that
reuse solutions for the above subtasks.

Baselines We present a Combined Model (CM)
which is a pipeline of the best performing Random
Forest classification model, rule-based follow-up
question generation model and Surface LR entail-
ment model. We first run the classification model
to predict YES, NO, MORE or IRRELEVANT. If
MORE is predicted, the Follow-up Question Gen-

Model Micro Acc Macro Acc BLEU-1 BLEU-4

CM 0.619 0.689 0.544 0.344
NMT 0.448 0.428 0.340 0.078

Table 5: Results of the models on the CMR task.

eration model is used to produce a follow-up ques-
tion, f1. The rule text and produced follow-up
question are then passed as inputs to the Sce-
nario Interpretation model. If the output of this is
IRRELEVANT, then the CM predicts f1, otherwise,
these steps are repeated recursively until the clas-
sification model no longer predicts MORE or the
entailment model predicts IRRELEVANT, in which
case the model produces a final answer. We also in-
vestigate an extension of the NMT-copy model on
the end-to-end task. Input sequences are encoded
as a concatenation of the rule text, question, sce-
nario and history. The model consists of a shared
encoder LSTM, a 4-class classification head with
attention, and a decoder GRU to generate followup
questions. The model was trained by alternating
training the classifier via standard softmax-cross en-
tropy loss and the followup generator via seq2seq.
At test time, the input is first classified, and if the
predicted class is MORE, the follow-up generator
is used to generate a followup question, f1. A sim-
pler model without the separate classification head
failed to produce predictive results.
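The control flow of the combined model can be sketched as below; classify, generate_followup and entails stand in for the Random Forest classifier, the rule-based follow-up generator and the Surface LR entailment model, and their exact interfaces are assumptions made for illustration.

def run_combined_model(question, rule_text, scenario, history,
                       classify, generate_followup, entails, max_turns=10):
    history = list(history)
    for _ in range(max_turns):
        decision = classify(question, rule_text, history)       # YES / NO / IRRELEVANT / MORE
        if decision != "MORE":
            return decision
        follow_up = generate_followup(question, rule_text, history)
        # Does the scenario already answer the generated follow-up question?
        verdict = entails(scenario, follow_up)                   # ENTAILMENT / CONTRADICTION / NEUTRAL
        if verdict == "NEUTRAL":                                 # i.e. the IRRELEVANT case above
            return follow_up                                     # scenario is silent: ask the user
        history.append((follow_up, "Yes" if verdict == "ENTAILMENT" else "No"))
    return "IRRELEVANT"                                          # safeguard against non-termination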

Results We find that the combined model outper-
forms the neural end-to-end model on the CMR
task, however, the fact that the neural model has
learned to classify better than random and also
predict follow-up questions is encouraging for de-
signing more sophisticated neural models for this
task.

User Study In order to evaluate the utility of con-
versational machine reading, we run a user study
that compares CMR to a setting where such an agent is not
available, i.e. the user has to read the rule text and
determine the answer to the question themselves.
With the agent, by contrast, the user does not
read the rule text and instead only responds to follow-
up questions. Our results show that users with the
conversational agent reach conclusions more than
twice as fast as those without it, but more importantly,
they are also much more accurate (93% as com-
pared to 68%). Details of the experiments and the
results are included in Appendix I.

6 Related Work

This work relates to several areas of active research.

Machine Reading In our task, systems answer
questions about units of texts. In this sense, it
is most related to work in Machine Reading (Ra-
jpurkar et al., 2016; Seo et al., 2017; Weissenborn
et al., 2017). The core difference lies in the con-
versational nature of our task: in traditional Ma-
chine Reading the questions can be answered right
away; in our setting, clarification questions are
often needed. The domain of text we consider
is also different (regulatory vs Wikipedia, books,
newswire).

Dialog The task we propose is, at its heart, about
conducting a dialog (Weizenbaum, 1966; Serban
et al., 2018; Bordes and Weston, 2016). Within
this scope, our work is closest to work in dialog-
based QA where complex information needs are
addressed using a series of questions. In this space,
previous approaches have been looking primarily
at QA dialogs about images (Das et al., 2017) and
knowledge graphs (Saha et al., 2018; Iyyer et al.,
2017). In parallel to our work, both Choi et al.
(2018) and Reddy et al. (2018) have begun to
investigate QA dialogs with background text. Our
work not only differs in the domain covered (regula-
tory text vs wikipedia), but also in the fact that our
task requires the interpretation of complex rules,
application of background knowledge, and the for-
mulation of free-form clarification questions. Rao
and Daume III (2018) investigate how to generate
clarification questions but this does not require the
understanding of explicit natural language rules.

Rule Extraction From Text There is a long
line of work in the automatic extraction of rules
from text (Silvestro, 1988; Moulin and Rousseau,
1992; Delisle et al., 1994; Hassanpour et al., 2011;
Moulin and Rousseau, 1992). The work tackles a
similar problem—interpretation of rules and reg-
ulatory text—but frames it as a text-to-structure
task as opposed to end-to-end question-answering.
For example, Delisle et al. (1994) map text to
Horn clauses. This can be very effective, and good
results are reported, but suffers from the general
problem of such approaches: they require careful
ontology building, layers of error-prone linguistic
preprocessing, and are difficult for non-experts to
create annotations for.

Question Generation Our task involves the au-
tomatic generation of natural language questions.
Previous work in question generation has focussed
on producing questions for a given text, such that
the questions can be answered using this text (Van-
derwende, 2008; M. Olney et al., 2012; Rus et al.,
2011). In our case, the questions to generate are
derived from the background text but cannot be
answered by it. Mostafazadeh et al. (2016)
investigate how to generate natural follow-up ques-
tions based on the content of an image. Besides
not working in a visual context, our task is also
different because we see question generation as a
sub-task of question answering.

7 Conclusion

In this paper we present a new task as well as an
annotation protocol, a dataset, and a set of base-
lines. The task is challenging and requires models
to generate language, copy tokens, and make log-
ical inferences. Through the use of an interactive
and dialog-based annotation interface, we achieve
good agreement rates at a low cost. Initial baseline
results suggest that substantial improvements are
possible and require sophisticated integration of
entailment-like reasoning and question generation.

Acknowledgements

This work was supported in part by an Allen Dis-
tinguished Investigator Award and in part by an Allen
Institute for Artificial Intelligence (AI2) award to
UCI.

References
Ron Artstein and Massimo Poesio. 2008. Inter-coder

agreement for computational linguistics. Computa-
tional Linguistics, 34(4):555–596.

Antoine Bordes and Jason Weston. 2016. Learn-
ing end-to-end goal-oriented dialog. CoRR,
abs/1605.07683.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gul-
cehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Schwenk, and Yoshua Bengio. 2014. Learning
phrase representations using RNN encoder-decoder
for statistical machine translation. arXiv preprint
arXiv:1406.1078.

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-
tau Yih, Yejin Choi, Percy Liang, and Luke Zettle-
moyer. 2018. QuAC : Question Answering in Con-
text. In EMNLP. ArXiv: 1808.07036.

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi
Singh, Deshraj Yadav, Jos MF Moura, Devi Parikh,
and Dhruv Batra. 2017. Visual dialog. In Proceed-
ings of the IEEE Conference on Computer Vision
and Pattern Recognition, volume 2.

Sylvain Delisle, Ken Barker, Jean-François Delannoy,
Stan Matwin, and Stan Szpakowicz. 1994. From
text to horn clauses: Combining linguistic analysis
and machine learning. In In 10th Canadian AI Conf,
pages 9–16.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor
O. K. Li. 2016. Incorporating copying mech-
anism in sequence-to-sequence learning. CoRR,
abs/1603.06393.

Saeed Hassanpour, Martin O’Connor, and Amar Das.
2011. A framework for the automatic extraction of
rules from online text.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefen-
stette, Lasse Espeholt, Will Kay, Mustafa Suleyman,
and Phil Blunsom. 2015. Teaching machines to read
and comprehend. In Advances in Neural Informa-
tion Processing Systems, pages 1693–1701.

Mohit Iyyer, Wen-tau Yih, and Ming-Wei Chang. 2017.
Search-based Neural Structured Learning for Se-
quential Question Answering. Proceedings of the
55th Annual Meeting of the Association for Com-
putational Linguistics (Volume 1: Long Papers),
1:1821–1831.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke
Zettlemoyer. 2017. TriviaQA: A large scale distantly
supervised challenge dataset for reading comprehen-
sion. In Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics, Van-
couver, Canada. Association for Computational Lin-
guistics.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke
Zettlemoyer. 2017. Zero-shot relation extrac-
tion via reading comprehension. arXiv preprint
arXiv:1706.04115.

Andrew M. Olney, Arthur Graesser, and Natalie Person.
2012. Question generation from concept maps. 3.

Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Mar-
garet Mitchell, Xiaodong He, and Lucy Vander-
wende. 2016. Generating natural questions about an
image. CoRR, abs/1603.06059.

B. Moulin and D. Rousseau. 1992. Automated knowl-
edge acquisition from regulatory texts. IEEE Expert,
7(5):27–35.

Ankur P Parikh, Oscar Täckström, Dipanjan Das, and
Jakob Uszkoreit. 2016. A decomposable attention
model for natural language inference. arXiv preprint
arXiv:1606.01933.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt
Gardner, Christopher Clark, Kenton Lee, and Luke
Zettlemoyer. 2018. Deep contextualized word repre-
sentations. arXiv preprint arXiv:1802.05365.

P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. 2016.
SQuAD: 100,000+ questions for machine comprehen-
sion of text. In Empirical Methods in Natural Lan-
guage Processing (EMNLP).

Sudha Rao and Hal Daume III. 2018. Learning to Ask
Good Questions: Ranking Clarification Questions
using Neural Expected Value of Perfect Information.
In Proceedings of the 56th Annual Meeting of the As-
sociation for Computational Linguistics (Volume 1:
Long Papers), pages 2737–2746, Melbourne, Aus-
tralia. Association for Computational Linguistics.

Siva Reddy, Danqi Chen, and Christopher D. Man-
ning. 2018. CoQA: A Conversational Question An-
swering Challenge. arXiv preprint arXiv:1808.07042.

Vasile Rus, Paul Piwek, Svetlana Stoyanchev, Brendan
Wyse, Mihai Lintean, and Cristian Moldovan. 2011.
Question generation shared task and evaluation chal-
lenge: Status report. In Proceedings of the 13th Eu-
ropean Workshop on Natural Language Generation,
ENLG ’11, pages 318–320, Stroudsburg, PA, USA.
Association for Computational Linguistics.

Amrita Saha, Vardaan Pahuja, Mitesh Khapra, Karthik
Sankaranarayanan, and Sarath Chandar. 2018. Com-
plex Sequential Question Answering: Towards
Learning to Converse Over Linked Question Answer
Pairs with a Knowledge Graph.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and
Hannaneh Hajishirzi. 2017. Bidirectional attention
flow for machine comprehension. In The Inter-
national Conference on Learning Representations
(ICLR).

Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Lau-
rent Charlin, and Joelle Pineau. 2018. A Survey
of Available Corpora For Building Data-Driven Di-
alogue Systems: The Journal Version. Dialogue &
Discourse, 9(1):1–49.

Kenneth Silvestro. 1988. Using explanations for
knowledge-base acquisition. International Journal
of Man-Machine Studies, 29(2):159 – 169.

Rion Snow, Brendan O’Connor, Daniel Jurafsky, and
Andrew Y Ng. 2008. Cheap and fast—but is it
good?: evaluating non-expert annotations for natu-
ral language tasks. In Proceedings of the conference
on empirical methods in natural language process-
ing, pages 254–263. Association for Computational
Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014.
Sequence to sequence learning with neural networks.
In Advances in neural information processing sys-
tems, pages 3104–3112.

Lucy Vanderwende. 2008. The importance of being
important: Question generation. In In Proceedings
of the Workshop on the Question Generation Shared
Task and Evaluation Challenge.

Dirk Weissenborn, Georg Wiese, and Laura Seiffe.
2017. FastQA: A simple and efficient neu-
ral architecture for question answering. CoRR,
abs/1703.04816.

Joseph Weizenbaum. 1966. ELIZA: a computer pro-
gram for the study of natural language communica-
tion between man and machine. Communications of
the ACM, 9(1):36–45.

Johannes Welbl, Pontus Stenetorp, and Sebastian
Riedel. 2018. Constructing datasets for multi-hop
reading comprehension across documents. Transac-
tions of ACL, abs/1710.06481.

Supplementary Materials for EMNLP 2018 Paper:
Interpretation of Natural Language Rules in Conversational Machine Reading

A Annotation Interfaces

Figure 4 shows the Mechanical-Turk interface we
developed for the dialog generation stage. Note
that the interface also contains a mechanism to
validate previous utterances in case they have been
generated by different annotators.

Figure 4: The dialog-style web interface encour-
ages workers to extract all the rule text-relevant
evidence required to answer the initial question in
the form of YES/NO follow-up questions.

Figure 5 shows the annotation interface for the
scenario generation task, where the first question is
relevant and the second question is not relevant.

B Quality Control

In this section, we present several measures that we
take in order to create a high quality dataset.

Irregularity Detection A convenient property
of the formulation of the reasoning process as a
binary decision tree is class exclusivity at the fi-
nal partitioning of the utterance space. That is,
if the two leaf nodes stemming from the same
FOLLOW-UP QUESTION node have identical YES
or NO values, this is an indication of either a mis-
annotation or a redundant question. We automati-
cally identify these irregularities, trim the subtree
at FOLLOW-UP QUESTION node and re-annotate.
This also means that our protocol effectively guar-
antees a minimum of two annotations per leaf node,
further enhancing data quality.
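A sketch of this check, assuming dialog trees are stored as nested dictionaries (a representation chosen here purely for illustration), could look as follows:

def find_irregular_nodes(node, path=()):
    # Yield paths to follow-up question nodes whose YES and NO children are
    # identical leaves, indicating a mis-annotation or a redundant question.
    # A node is either a leaf answer ("Yes"/"No") or a dict of the form
    # {"question": str, "yes": subtree, "no": subtree}.
    if isinstance(node, str):
        return
    yes, no = node["yes"], node["no"]
    if isinstance(yes, str) and isinstance(no, str) and yes == no:
        yield path + (node["question"],)
    else:
        yield from find_irregular_nodes(yes, path + (node["question"], "Yes"))
        yield from find_irregular_nodes(no, path + (node["question"], "No"))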

Figure 5: Annotators are asked to write a scenario
that fits the given information, i.e. questions and
answers.

Back-validation We implement back-validation
by providing the workers with two options: YES
and proceed with the task, or NO and provide an
invalidation reason to de-incentivize unnecessary
rejections. We found this approach to be valuable
both as a validation mechanism as well as a means
of collecting direct feedback about the task and
the types of incorrect annotations encountered. We
then trim any invalidated subtrees and re-annotate.

Contradiction Detection We can introduce con-
tradictory information by adding random questions
and answers to a dialog part when generating HITs
for scenario generation. Therefore, we first ask
each annotator to identify whether the provided
dialog parts are contradictory. If they are, the anno-
tator will invalidate the HIT.

Validation Sampling We sample a proportion of
each worker’s annotations to validate. Through this
process, each worker is assigned a quality score.
We only allow workers with a score higher than a
certain value to participate in our HITs (Snow et al.,
2008). We also restrict participation to workers
with > 97% approval rate, > 1000 previously com-
pleted HITs and located in the UK, US or Canada.

Qualification Test Amazon Mechanical Turk al-
lows the creation of qualification tests through the
API, which need to be passed by each turker before

attempting any HIT from a specific task. A qual-
ification can contain several questions with each
having a value. The qualification requirement for a
HIT can specify that the total value must be over
a specific threshold for the turker to obtain that
qualification. We set this threshold to 100%.

Possible Sources of Noise Here we detail pos-
sible sources of noise, estimate their effects and
outline the steps taken to mitigate these sources:

a) Noise arising from annotation errors: This has
been discussed in detail above.

b) Noise arising from negative question gener-
ation: Some noise could be introduced due to the
automatic sampling of the negative questions. To
obtain an estimate, 100 negative questions were
assessed by an expert annotator. It was found that
only 8% of negatively sampled questions were er-
roneous.

c) Noise arising from the negative scenario sam-
pling: A further 100 utterances with negatively
sampled scenarios were curated by an expert anno-
tator, and it was found that 5% of the utterances
were erroneous.

d) Errors arising from the application of scenar-
ios to dialog trees: The assumption that the sce-
nario was only relevant to the follow-up questions
it was generated from, and was independent to all
other follow-up questions posed in that dialog tree
is not necessarily true, and could result in noisy di-
alog utterances. 100 utterances from the subset of
the data where this type of error was possible were
assessed by expert annotators, and 12% of these
utterances were found to be erroneous. This type
of error can only affect 80% of utterances, thus the
estimated total effect of this type of noise is 10%.

Despite the relatively low levels of noise, we
asked expert annotators to manually inspect and
curate (if necessary) all the instances in the devel-
opment and the test set that are prone to potential
errors. This leads to an even higher quality of data
in our dataset.

C Further Details on Corpus

We use 264 unique sources from 10 unique do-
mains listed below. For transparency and repro-
ducibility, the source URLs are included in the
corpus for each dialog utterance.

• http://legislature.maine.gov/
• https://esd.wa.gov/
• https://www.benefits.gov/
• https://www.dmv.org/
• https://www.doh.wa.gov/
• https://www.gov.uk/
• https://www.humanservices.gov.au/
• https://www.irs.gov/
• https://www.usa.gov/
• https://www.uscis.gov/

Further, the ShARC dataset composition can be
seen in Table 6.

Set # Utterances # Trees # Scenarios # Sources

All 32436 948 6637 264
Train 21890 628 4611 181
Development 2270 69 547 24
Test 8276 251 1910 59

Table 6: Dataset composition.

D Negative Data

In this section, we provide further details regarding
the generation of the negative examples.

D.1 Negative Questions
Formally, for each unique positive question, rule
text pair, (qi, ri), and defining di as the source
document for (qi, ri), we construct the set Q ⊆
{q1 . . . qn} of questions that are
not sourced from di. We take a random uniform
sample qj from Q to generate the negative utter-
ance (qj , ri, hj , yj) where yj = IRRELEVANT and
hj is an empty history sequence. An example of a
negative question is shown below.

Q. Can I get Working Tax Credit?

R. You must also wear protective headgear if you
are using a learner’s permit or are within 1 year of
obtaining a motorcycle license.
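A minimal sketch of this sampling step, assuming each question is stored together with its source document, follows:

import random

def sample_negative_question(question, source_doc, all_questions):
    # all_questions: list of (question, source_document) pairs. The sampled
    # question is paired with the original rule text and labelled IRRELEVANT.
    candidates = [q for q, d in all_questions if d != source_doc and q != question]
    return random.choice(candidates)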

D.2 Negative Scenarios
We also negatively sample scenarios so that models
can learn to ignore distracting scenario information
that is not relevant to the task. We define a negative
scenario as a scenario that provides no information
to assist answering a given question and, as such,
good models should ignore all details within these
scenarios.

A scenario sx is associated with the (one
or more) dialog question and answer pairs
{(fx,1, ax,1) .. (fx,n, ax,n)} that it was generated
from.

For a given unique question, rule text pair,
(qi, ri), associated with a set of positive scenar-
ios {si,1 . . . si,k}, we uniformly randomly sample
a candidate negative scenario sj from the set of all
possible scenarios. We then build TF-IDF repre-
sentations for the set of all dialog questions asso-
ciated with (qi, ri), i.e. Fi = {(fi,1,1) .. (fi,k,n)}.
We also construct TF-IDF representations for the
set of dialog questions associated with sj, Fsj =
{(fj,1) .. (fj,x)}.

If the cosine similarity for all pairs of dialog
questions between Fi and Fsj are less than a certain
threshold, the candidate is accepted as a negative,
otherwise a new candidate is sampled and the pro-
cess is repeated. Then we iterate over all utterances
that contain (qi, ri) and use the negative scenario
to create one more utterance whenever the original
utterance has an empty scenario. The threshold
value was validated using manual verification. An
example is shown below:

R. You are allowed to make emergency calls to
911, and bluetooth devices can still be used while
driving.

S. The person I’m referring to can no longer take
care of their own affairs.
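A sketch of this rejection-sampling procedure is shown below; the similarity threshold of 0.3 is purely illustrative, since the threshold value was only validated manually.

import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sample_negative_scenario(own_followups, scenario_pool, threshold=0.3):
    # scenario_pool: list of (scenario_text, followup_questions) pairs taken from
    # other (question, rule text) pairs. A candidate is accepted only if every
    # pairwise similarity between its follow-up questions and ours is below the
    # threshold; otherwise a new candidate is drawn.
    vectorizer = TfidfVectorizer()
    for _ in range(1000):                      # bounded number of attempts
        scenario, their_followups = random.choice(scenario_pool)
        matrix = vectorizer.fit_transform(own_followups + their_followups)
        sims = cosine_similarity(matrix[:len(own_followups)], matrix[len(own_followups):])
        if sims.max() < threshold:
            return scenario
    raise ValueError("no sufficiently dissimilar scenario found")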

E Challenges

In this section we present a few interesting exam-
ples we encountered in order to provide a better
understanding of the requirements and challenges
of the proposed task.

E.1 Dialog Generation
Table 8 shows the breakdown of the types of chal-
lenges that exist in our dataset for dialog generation
and their proportion.

F Entailment Corpus

Using the scenarios and their associated questions
and answers we create an entailment corpus for
each of the train, development and test sets of
ShARC. For every dialog utterance that includes
a scenario, we create a number of data points as
follows:

Figure 6: Example of a complex and hard-to-
interpret rule relationship.

For every utterance in ShARC with input x =
(q, r, h, s) and output y where y = fm ∉
{YES, NO, IRRELEVANT}, we create an entail-
ment instance (xe, ye) such that xe = s
and:

• ye = ENTAILMENT if the answer am to follow-
up question fm is YES which can be derived
from s.

• ye = CONTRADICTION if the answer am to
follow-up question fm is NO which can be
derived from s.

• ye = NEUTRAL if the answer am to follow-up
question fm cannot be derived from s.
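A sketch of this mapping (the function signature is an assumption made for illustration) is:

def to_entailment_instance(scenario, follow_up, answer_derivable, answer=None):
    # Map one follow-up utterance to an entailment instance (premise, hypothesis, label).
    # answer_derivable: whether the scenario determines the answer to the follow-up
    # question; answer: "Yes" or "No" when it does.
    if not answer_derivable:
        label = "NEUTRAL"
    elif answer == "Yes":
        label = "ENTAILMENT"
    else:
        label = "CONTRADICTION"
    return scenario, follow_up, label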

Table 7 shows the statistics for the entailment cor-
pus.

Set ENTAILMENT CONTRADICTION NEUTRAL

Train 2373 2296 10912
Dev 271 253 1098
Test 919 944 4003

Table 7: Statistics of the entailment corpus created
from the ShARC dataset.

Figure 7: Example of a hard-to-interpret rule due
to complex negations. In this particular example,
majority vote was inaccurate.

Figure 8: Example of a conjunctive rule relation-
ship derived from a bulleted list, determined by the
presence of “, and” in the third bullet.

Figure 9: Example of a dialog-tree for a typical disjunctive bulleted list.

G Further details on Interpreting rules

Simple (31%)
Q: Can I claim extra MBS items?
R: If you’re providing a bulk billed service to a patient you may claim extra MBS items.

Bullet Points (34%)
Q: Do I qualify for assistance?
R: To qualify for assistance, applicants must meet all loan eligibility requirements including:
– Be unable to obtain credit elsewhere at reasonable rates and terms to meet actual needs;
– Possess legal capacity to incur loan obligations;

In-line Conditions (39%)
Q: Do these benefits apply to me?
R: These are benefits that apply to individuals who have earned enough Social Security credits and are at least age 62.

Conjunctions (18%)
Q: Could I qualify for Letting Relief?
R: If you qualify for Private Residence Relief and have a chargeable gain, you may also qualify for Letting Relief. This means you’ll pay less or no tax.

Disjunctions (41%)
Q: Can I get deported?
R: The United States may deport foreign nationals who participate in criminal acts, are a threat to public safety, or violate their visa.

Understanding Questioner Role (10%)
Q: Am I eligible?
R: The borrower must qualify for the portion of the loan used to purchase or refinance a home. Borrowers are not required to qualify on the portion of the loan used for making energy-efficient upgrades.

Negations (15%)
Q: Will I get the National Minimum Wage?
R: You won’t get the National Minimum Wage or National Living Wage if you’re work shadowing.

Conjunction Disjunction Combination (18%)
Q: Can my partner and I claim working tax credit?
R: You can claim if you work less than 24 hours a week between you and one of the following applies:
– you work at least 16 hours a week and you’re disabled or aged 60 or above
– you work at least 16 hours a week and your partner is incapacitated

World Knowledge Required to Resolve Ambiguity (13%)
Q: Do I qualify for Statutory Maternity Leave?
R: You qualify for Statutory Maternity Leave if:
– you’re an employee not a ‘worker’
– you give your employer the correct notice

Table 8: Types of features present for question, rule text pairs and their proportions in the dataset based
on 100 samples. World Knowledge Required to resolve ambiguity refers to where the rule itself doesn’t
syntactically indicate whether to apply a conjunction or disjunction, and world knowledge is required to
infer the rule.

H Further details on Follow-up Question
Generation Modelling

Table 9 details all the results for all the models
considered for follow-up question generation.

First Sent. Return the first sentence of the rule
text

Random Sent. Return a random sentence from
the rule text

SurfaceLR A simple binary logistic model,
which was trained to predict whether or not a given
sentence in a rule text had the highest trigram over-
lap with the target follow-up question, using a bag
of words feature set, augmented with 3 very simple
engineered features (the number of sentences in the
rule text, the number of tokens in the sentence and
the position of the sentence in the rule text).

Sequence Tag A simple neural model consisting
of a learnt word embedding followed by an LSTM.
Each word in the rule text is classified as either in
or out of the subsequence to return using an I/O
sequence tagging scheme.

H.1 Further details on neural models for
question generation

Table 10 details what the inputs and outputs of the
neural models should be.

The NMT-Copy model follows an encoder-
decoder architecture. The encoder is an LSTM. The
decoder is a GRU equipped with a copy mechanism,
with an attention mechanism over the encoder out-
puts and an additional attention over the encoder
outputs with respect to the previously copied to-
ken. We achieved best results by limiting the
model’s generator vocabulary to only very com-
mon interrogative words. We train with a 50:50
teacher-forcing / greedy decoding ratio. At test
time we greedily sample the next word to gener-
ate, but prevent repeated tokens being generated by
sampling the second highest scoring token if the
highest would result in a repeat.

In order to frame the task as a span extraction
task, a simple method of mapping a follow-up ques-
tion onto a span in the rule text was employed. The
longest common subsequence of tokens between
the rule text and follow-up question was found, and
if the subsequence length was greater than a cer-
tain threshold, the target span was generated by
increasing the length of the subsequence so that
it matched the length of the follow-up question.

These spans were then used to supervise the train-
ing of the BiDAF and sequence tagger models.
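A sketch of this span-mapping heuristic is given below; it uses the longest contiguous run of shared tokens (via difflib) as a stand-in for the common subsequence described above, and the minimum-overlap threshold of 3 tokens is illustrative.

from difflib import SequenceMatcher

def followup_to_span(rule_tokens, question_tokens, min_overlap=3):
    # Find the longest contiguous block of tokens the follow-up question shares
    # with the rule text; if it is long enough, widen it to the length of the
    # follow-up question and return token-level span boundaries in the rule text.
    match = SequenceMatcher(None, rule_tokens, question_tokens).find_longest_match(
        0, len(rule_tokens), 0, len(question_tokens))
    if match.size < min_overlap:
        return None
    end = min(len(rule_tokens), match.a + len(question_tokens))
    start = max(0, end - len(question_tokens))
    return start, end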

I Evaluating Utility of CMR

In order to evaluate the utility of conversational ma-
chine reading, we run a user study that compares
CMR with a setting where such an agent is not
available, i.e. the user has to read the rule text, the
question, and the scenario, and determine for them-
selves whether the answer to the question is “Yes”
or “No”. On the other hand, with the agent, the user
does not read the rule text, instead only responds
to follow-up questions with a “Yes” or “No”, based
on the scenario text and world knowledge.

We carry out a user study with 100 randomly
selected scenarios and questions, and elicit annota-
tion from 5 workers for each. As these instances
are from the CMR dataset, the quality is fairly high,
and thus we have access to the gold answers and
follow-ups questions for all possible responses by
the users. This allows us to evaluate the accuracy
of the users in answering the question, the primary
objective of any QA system. We also track a num-
ber of other metrics, such as the time taken by the
users to reach the conclusion.

In Figure 10a, we see that the users that have
access to the conversational agent are almost twice
as fast as the users that need to read the rule text. This
demonstrates that even though the users with the
conversational agent have to answer more ques-
tions (as many as the followup questions), they are
able to understand and apply the knowledge more
quickly. Further, in Figure 10b, we see that users
with access to the conversational agents are much
more accurate than ones without, demonstrating
that an accurate conversational agent can have a
considerable impact on efficiency.

Model BLEU-1 BLEU-2 BLEU-3 BLEU-4

Random Sent. 0.302 0.228 0.197 0.179
First Sent. 0.221 0.144 0.119 0.106
Last Sent. 0.314 0.247 0.217 0.197
Surface LR 0.293 0.233 0.205 0.186
NMT-Copy 0.339 0.206 0.139 0.102
Sequence Tag 0.212 0.151 0.126 0.110
BiDAF 0.450 0.375 0.338 0.312
Rule-based 0.533 0.437 0.379 0.344

Table 9: All results of the baseline models on follow-up question generation.

NMT-Copy: Input = r || q || f1 ? a1 || . . . || fm ? am; Output = fm+1
Sequence Tag: Input = r || q || f1 ? a1 || . . . || fm ? am; Output = span corresponding to the follow-up question
BiDAF: Question = q || f1 ? a1 || . . . || fm ? am, Context = r; Output = span corresponding to the follow-up question

Table 10: Inputs and outputs of neural models for question generation.

Figure 10: Utility of CMR. Panel (a) shows the time
taken to answer a query (in seconds) and panel (b)
the accuracy of the conclusion reached, each comparing
Manual Reading with Conversation (oracle). The user
study demonstrates that users with an accurate
conversational agent not only reach conclusions much
faster than ones that have to read the rule text, but
also that the conclusions they reach are correct much
more often.