
Explaining Question Answering Models through Text Generation

Veronica Latcinnik1 Jonathan Berant1,2

1School of Computer Science, Tel-Aviv University
2Allen Institute for AI

{veronical@mail,joberant@cs}.tau.ac.il

Abstract

Large pre-trained language models (LMs)
have been shown to perform surprisingly well
when fine-tuned on tasks that require common-
sense and world knowledge. However, in end-to-end architectures, it is difficult to explain what knowledge in the LM allows it to make a correct prediction. In this work,
we propose a model for multi-choice question
answering, where a LM-based generator gen-
erates a textual hypothesis that is later used by
a classifier to answer the question. The hy-
pothesis provides a window into the informa-
tion used by the fine-tuned LM that can be in-
spected by humans. A key challenge in this
setup is how to constrain the model to gener-
ate hypotheses that are meaningful to humans.
We tackle this by (a) jointly training with a simple similarity classifier that encourages meaningful hypotheses, and (b) adding loss functions that encourage natural text without repetitions. We show on several tasks that our model
reaches performance that is comparable to end-
to-end architectures, while producing hypothe-
ses that elucidate the knowledge used by the
LM for answering the question.

1 Introduction

Language Models (LMs), trained on large amounts
of data using self-supervised learning (Peters et al.,
2018; Devlin et al., 2018; Liu et al., 2019; Yang
et al., 2019; Raffel et al., 2019), have been recently
shown to encode substantial amounts of knowledge
in their parameters (Petroni et al., 2019; Jiang et al.,
2019; Talmor et al., 2019). This has been demon-
strated by their ability to answer questions that
require common sense and world knowledge, with-
out retrieving information from an external source
(Trinh and Le, 2018; Zhou et al., 2019; Roberts
et al., 2020; Ling et al., 2020). For example, the cur-
rent top model for COMMONSENSEQA (CSQA)

Figure 1: An overview of our approach: a generator takes
a question and outputs a textual hypothesis that is used by a
classifier to select the answer (example from CSQA).

(Talmor et al., 2018),1 a benchmark testing the
ability to answer commonsense questions, answers
questions using ALBERT (Lan et al., 2019) only.

Despite these impressive results, most current
models are based on end-to-end architectures,
where it is difficult to know how the model reached
its prediction. In fact, it has been repeatedly
shown that often models obtain high performance
through “shortcuts” rather than language under-
standing (Tsuchiya, 2018; Poliak et al., 2018; Gu-
rurangan et al., 2018; Geva et al., 2019). This has
sparked interest in explainable models, where in-
termediate parts of the model can be inspected and
interpreted by humans (Lipton, 2016; Ribeiro et al.,
2016; Lundberg and Lee, 2017; Camburu et al.,
2018; Thorne et al., 2019; Rajani et al., 2019; Jain
and Wallace, 2019).

In this paper, we investigate explainable LM-
based models for multi-choice question answering
(MC-QA). Specifically, we address the question:
What is the knowledge in the LM used for answer-
ing a question? Our approach consists of a gener-

1 https://www.tau-nlp.org/csqa-leaderboard

arXiv:2004.05569v1 [cs.CL] 12 Apr 2020

ator and a classifier (see Figure 1). The generator
takes a question and outputs a textual hypothesis,
which consists of a few words in natural language.
The classifier takes the hypothesis (and possibly the
question) to make the final prediction. Unlike end-
to-end models where reasoning is internal to the
model, the hypothesis is an inspectable intermedi-
ate representation that exposes relevant knowledge
extracted from the LM. To the best of our knowledge, we are the first to propose this intermediate textual layer as an explanation for a QA model.

Our setup can be viewed as an instance of con-
trolled text generation. However, generation is
not controlled by additional inputs (Kikuchi et al.,
2016; Ficler and Goldberg, 2017; Keskar et al.,
2019; Dathathri et al., 2019). Instead, we train
from weak supervision – the generator is trained
to produce hypotheses that are useful for the down-
stream QA application. We compare this approach
to existing methods for generating explanations,
where the explanation is provided as a target and
the model is trained to generate it explicitly (Cam-
buru et al., 2018; Rajani et al., 2019).

Training from weak supervision raises several
technical challenges. First, because the gener-
ator outputs discrete symbols, the loss is non-
differentiable with respect to its parameters. We
use the Gumbel-Softmax straight-through estima-
tor (Jang et al., 2016; Maddison et al., 2016) to
overcome this difficulty. Second, the classifier can
choose to ignore the generated hypothesis, or it can
coordinate with the generator to change the original
meaning of words. In such a case, the hypothesis will
not constitute a useful explanation. To encourage
meaningful hypotheses, we (a) train the classifier
along with an auxiliary similarity classifier that
must use the hypothesis, (b) add loss terms that
encourage the hypothesis to correspond to natural
language, and (c) feed the classifier with multiple
generator outputs.

We evaluate our approach in multiple setups.
First, in a synthetic setup, where the model learns to
output the hypernym of objects; Second, on CSQA,
which focuses on commonsense knowledge; and
third, on zero-shot transfer to QASC, a multi-
choice QA task with an emphasis on scientific
knowledge. We find that our approach reaches per-
formance that is comparable to end-to-end models,
while providing good hypotheses that are shown to
be used by the classifier for prediction. We analyze
the generated hypotheses and demonstrate that they

shed light on different reasons for model error in
different questions (e.g., missing world knowledge
vs. language understanding difficulties), and help
detect examples where the prediction of the model
does not reflect its true knowledge. Our approach
can be generalized outside of MC-QA to any sce-
nario where we want to control the text generated
from a LM using a signal from a downstream ap-
plication. Our code and data are available at https://github.com/nika2312/qa_explaination.

2 Background

Problem setting We consider the task of multi-choice question answering, where given a question $q$ and a list of candidate answers $\mathcal{A} = (a_1, \ldots, a_n)$, our goal is to choose the correct answer $a^* \in \mathcal{A}$. At training time, we observe question-candidates-answer triples $\{(q_i, \mathcal{A}_i, a^*_i)\}_{i=1}^{N}$, from which we train a model.
A standard end-to-end approach for MC-QA is to first obtain a contextualized representation for the question and each answer candidate $a_i$, by passing it through a transformer-based pre-trained LM (Vaswani et al., 2017): $h^i = \mathrm{LM}(q, a_i)$. Then, the contextualized representations are summarized into a single vector $g^i$. For example, in BERT (Devlin et al., 2018) the summary vector is $g^i = h^i_{\mathrm{CLS}}$, where $h^i_{\mathrm{CLS}}$ is the representation of the special CLS token. Last, a weight vector $w$ is learned to compute the score $s_i = w^\top g^i$. The final output distribution is $\mathrm{softmax}(s_1, \ldots, s_n)$, and the entire model
is trained to maximize the log-likelihood of $a^*$.
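To make the setup concrete, the following is a minimal PyTorch-style sketch of such a scoring head; the `encoder` callable (anything that maps the token ids of a [question; candidate] pair to a single summary vector, e.g., the CLS representation) and all names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EndToEndScorer(nn.Module):
    """Score each (question, answer-candidate) pair with a single learned weight vector w."""

    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder                      # maps token ids of [q; a_i] to a summary vector g_i
        self.w = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, candidate_token_ids):
        # candidate_token_ids: list of n tensors, each encoding the pair [question; answer_i]
        scores = [self.w(self.encoder(ids)) for ids in candidate_token_ids]
        scores = torch.cat(scores, dim=-1)          # (s_1, ..., s_n)
        return F.log_softmax(scores, dim=-1)        # trained to maximize log p(a*)
```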

This simple setup has been successful in multiple
MC-QA tasks that require commonsense reasoning
(Sap et al., 2018; Zellers et al., 2018; Chen et al.,
2019b; Talmor et al., 2018; Wang et al., 2019).
However, in end-to-end models, all reasoning is in-
ternal to the model, and it is difficult to know what
knowledge inside the LM was used for prediction.

In this work, we develop a model that exposes
more directly the knowledge inside the LM that is
used, by having it generate this information in the
form of language (Figure 1). Our goal is not to im-
prove performance, as we assume backpropagation
over a differentiable model is an effective method
for distilling information from a pre-trained LM.
Instead, we aim for explainability, aiming to help
both users and practitioners understand what the
model is doing. Recent work has shown (Petroni
et al., 2019; Jiang et al., 2019; Talmor et al., 2019)
that it is possible to extract information encoded


inside a LM by directly training it to output cer-
tain tokens in certain positions. Here, we make
a weaker assumption that we do not know what
words the model should generate, only that the text
should be useful for the downstream application.

3 Model

At a high-level our model consists of two com-
ponents, the generator and the classifier. The
generator takes a question q as input, and out-
puts a hypothesis, which is a sequence of tokens
$c = (c_1, \ldots, c_{|c|})$. The classifier takes the question,
generated hypothesis, and answer candidates A,
and predicts the answer.

We assume the generator is based on a large pre-trained autoregressive LM (encoding world knowledge in its parameters). The LM provides a distribution over the vocabulary $p_{\mathrm{gen}}(x_i \mid x_{1\ldots i-1})$, where $x_{1\ldots i-1} = (x_1, \ldots, x_{i-1})$ is a sequence of observed tokens. Thus, the hypothesis $c$ is generated by decoding it left-to-right, concatenating the decoded prefix to the question: $p_{\mathrm{gen}}(c \mid q) = \prod_{i=1}^{|c|} p_{\mathrm{gen}}(c_i \mid [q; c_{1\ldots i-1}])$. In this work, we utilize two well-known LMs, GPT-2 (Radford et al., 2019) and XLNet (Yang et al., 2019) (see §4).
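As a rough illustration of the decoding loop (not the training-time procedure, which samples through the estimator of §3.1), the following sketch assumes a hypothetical `next_token_logits` callable that wraps GPT-2 or XLNet and returns next-token logits for a given prefix.

```python
import torch

def decode_hypothesis(next_token_logits, question_ids, length):
    """Greedily decode a short hypothesis c = (c_1, ..., c_|c|) left-to-right,
    conditioning every step on the question followed by the prefix decoded so far."""
    prefix = question_ids.clone()
    hypothesis = []
    for _ in range(length):
        logits = next_token_logits(prefix)        # logits for p_gen(c_i | [q; c_1..i-1])
        next_id = torch.argmax(logits, dim=-1)    # greedy choice; sampling is also possible
        hypothesis.append(int(next_id))
        prefix = torch.cat([prefix, next_id.view(1)])
    return hypothesis
```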

The classifier is very similar to the end-to-end
architecture described in §2. It takes the question q,
the hypothesis c, and an answer candidate ai as in-
put, and outputs a score $s_i$. The final distribution is $p_{QA}(a \mid q, c) = \mathrm{softmax}(s_1, \ldots, s_n)$. We discuss
possible forms for the classifier and the effect on
performance and explainability in §3.2.

Our framework raises a few technical challenges.
First, because we sample discrete symbols from
the generator, the log-likelihood of the correct an-
swer a∗ is not differentiable with respect to the
parameters of the generator pgen. While this indeed
makes optimization challenging, we overcome this
by using the straight-through Gumbel-softmax es-
timator (Jang et al., 2016; Maddison et al., 2016),
which is a standard approach in this setup (§3.1).
Second, if the classifier is strong enough, for exam-
ple, another instance of a large pre-trained LM, it
can easily answer the question directly, completely
ignoring the hypothesis c generated by the gener-
ator. To overcome this, we train a much simpler
similarity classifier (possibly jointly with a more
complex classifier), which provides an incentive for the generator to produce useful hypotheses (§3.2).
Last, how do we make sure that the hypothesis is

a useful explanation for humans? We experiment
with both loss functions and decoding mechanisms
that improve the quality of the hypothesis (§3.3).

3.1 Training and Optimization
Our goal is to obtain a good MC-QA model, and thus we maximize the expected log-likelihood of the correct answer:

$$\mathbb{E}_{\hat{c} \sim p_{\mathrm{gen}}(c \mid q)}\left[\log p_{QA}(a^* \mid q, \hat{c})\right]. \quad (1)$$

Computing the expectation exactly is intractable, and so we approximate it by sampling $\hat{c}$.

Training the classifier is trivial, since the loss is
differentiable with respect to its parameters. How-
ever, the loss is not differentiable with respect to
the parameters of pgen. To overcome this diffi-
culty, we use the straight-through (ST) Gumbel-
softmax (GS) estimator, which has been shown to
produce better gradient estimates compared to RE-
INFORCE (Williams, 1992). The Gumbel-softmax
trick provides a continuous relaxation for the cat-
egorical distribution over the vocabulary, making
the model fully differentiable. However, it results
in a mismatch between training and test time, as at
test time we want discrete hypotheses. The straight-
through estimator solves that by using argmax in
the forward pass and softmax in the backward
pass. For details on the GS-ST estimator, please see
Bengio et al. (2013); Jang et al. (2016); Maddison
et al. (2016); Yin et al. (2019).
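A minimal sketch of the GS-ST estimator over the vocabulary is shown below, under the assumption that the generator exposes per-step logits; PyTorch's `torch.nn.functional.gumbel_softmax(..., hard=True)` implements the same idea, and the temperature schedule here is illustrative rather than the authors' setting.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_st(logits, tau=1.0):
    """Straight-through Gumbel-softmax sample over the vocabulary.

    Forward pass: a hard one-hot vector (the discrete token the classifier consumes).
    Backward pass: gradients flow through the soft relaxation, keeping the generator trainable.
    """
    gumbel_noise = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    y_soft = F.softmax((logits + gumbel_noise) / tau, dim=-1)
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
    return (y_hard - y_soft).detach() + y_soft

# Because the generator and classifier vocabularies are tied, the one-hot output can be
# multiplied with the shared embedding matrix to obtain the token embedding the classifier sees:
# token_emb = gumbel_softmax_st(step_logits) @ embedding_weight
```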

We note that because the loss is backpropagated
through the inputs of the classifier to the outputs of
the generator, the vocabularies of the generator and
classifier must be tied. In practice, we use XLNet
(or GPT-2) for both generation and classification.

3.2 Classifier Expressivity
We would like to have hypotheses that reflect
knowledge in the LM generator that is useful for
the MC-QA task. However, if the classifier has
the same knowledge as the generator (for example,
if they are initialized with the same parameters),
it has no incentive to use the hypothesis at all, as
it can extract the same knowledge from its own
weights. Moreover, even if the classifier is weaker
than the generator, once it has enough capacity, it
can coordinate with the generator in arbitrary ways,
and make the tokens decoded by the generator lose
their original meaning. Indeed, in §4 we show that
in such cases hypotheses are meaningless.

In this work, we consider a simple similarity classifier $p^{\mathrm{sim}}_{QA}$ whose only parameters are word embeddings (excluding the generator parameters). Specifically, given the sequence of word embeddings for the context $E_c = (e^1_c, \ldots, e^{|c|}_c)$ and the sequence of word embeddings for an answer candidate $E_{a_i} = (e^1_a, \ldots, e^{|a_i|}_a)$, we define the score for the candidate answer as:

$$s_i = \frac{1}{|c|\,|a_i|} \sum_{j=1}^{|c|} \sum_{k=1}^{|a_i|} e^{j\top}_c e^k_a = \mathrm{avg}\left(E_c^\top E_{a_i}\right).$$

This model does not utilize the question at all and
hence must use the hypothesis c to answer the ques-
tion. This pushes the generator towards generating
interpretable hypotheses that are more similar to
the true answer compared to the distractors.
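A minimal sketch of this classifier, assuming the hypothesis and candidate tokens have already been looked up in the shared embedding matrix (shapes and names are illustrative, not the authors' code):

```python
import torch

def similarity_score(hyp_emb, ans_emb):
    """s_i = avg(E_c^T E_{a_i}): mean pairwise dot product between hypothesis and answer tokens.

    hyp_emb: (|c|, d) embeddings of the generated hypothesis tokens
    ans_emb: (|a_i|, d) embeddings of one answer candidate's tokens
    """
    return (hyp_emb @ ans_emb.T).mean()

def similarity_classifier(hyp_emb, candidate_embs):
    """Distribution over answer candidates; the question itself is never consulted."""
    scores = torch.stack([similarity_score(hyp_emb, e) for e in candidate_embs])
    return torch.log_softmax(scores, dim=-1)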

The similarity classifier $p^{\mathrm{sim}}_{QA}$ encourages the generator to produce meaningful hypotheses, but this can come at a great cost to performance. Thus, we propose to train $p^{\mathrm{sim}}_{QA}$ jointly with a more expressive LM-based classifier $p^{\mathrm{LM}}_{QA}$. Specifically, this classifier is another large pre-trained LM almost identical to the end-to-end model described in §2. The only differences are that (a) it takes as input the concatenation of the question, answer candidate, and the generated hypothesis, and (b) the summary vector $g^i$ encoding the input is the last hidden state of the contextualized representation. The classifiers share word embeddings and receive the same generator output. The modified objective is hence:

$$\mathbb{E}_{\hat{c} \sim p_{\mathrm{gen}}(c \mid q)}\left[\log\left(p^{\mathrm{sim}}_{QA}(a^* \mid q, \hat{c})\, p^{\mathrm{LM}}_{QA}(a^* \mid q, \hat{c})\right)\right]. \quad (2)$$
We empirically show in §4 through ablation tests
that this LM-based classifier indeed uses the hy-
potheses c generated by the generator, even when
it is given the question as input.

Last, we note that $p^{\mathrm{sim}}_{QA}$ pushes the generator to produce hypotheses that are similar to the answer. In other setups, we might want the generator to produce hypotheses that complement a different source of information, for example, when the question is accompanied by a paragraph that contains relevant information. In this case, the classifier can be a reading comprehension (RC) model whose parameters are untrained, and the generator will be pushed to produce hypotheses that allow the RC model to answer the question.

3.3 Explainability
The similarity classifier encourages the generator
to generate meaningful hypotheses. However, there
is no guarantee that the meaning of words does not

change during training. Moreover, the similarity
classifier objective might lead to repetitions in the
generated text. We now describe variants aimed at
improving interpretability.

KLD loss We add a KLD loss regularization term
to the generator objective (Jaques et al., 2016):
given the original pre-trained LM $p_{\mathrm{NL}}$, we minimize at each decoding step the KL divergence $D_{\mathrm{KL}}(p_{\mathrm{gen}} \,\|\, p_{\mathrm{NL}})$ between the trained generator distribution and the original pre-trained LM distribution.
This prevents drift of the generator distribution.
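A per-step sketch of this regularizer, assuming both the trained generator and a frozen copy of the original LM expose logits at each decoding step (an illustrative sketch, not the exact implementation):

```python
import torch.nn.functional as F

def kld_regularizer(gen_logits, pretrained_logits):
    """KL(p_gen || p_NL) at a single decoding step: keeps the fine-tuned generator close
    to the original pre-trained LM distribution and prevents it from drifting."""
    log_p_gen = F.log_softmax(gen_logits, dim=-1)
    log_p_nl = F.log_softmax(pretrained_logits, dim=-1)
    p_gen = log_p_gen.exp()
    return (p_gen * (log_p_gen - log_p_nl)).sum(dim=-1).mean()
```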

Repetition penalty We adopt the unlikelihood
objective from Welleck et al. (2019) to discour-
age repetitions in the generated text. For each de-
coded token $c_i$, we add to the loss function the term $\sum_{w \in W^t} \log\left(1 - p_{\mathrm{gen}}(w \mid [q; c_{1\ldots i-1}])\right)$, where $W^t$ is a set of "negative" tokens we want to penalize, that is, those that already appeared in $c_{1\ldots i-1}$.
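A sketch of this penalty at a single decoding step, written as a term to be minimized, following the unlikelihood-training formulation of Welleck et al. (2019); shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

def repetition_penalty(step_logits, previous_token_ids):
    """Unlikelihood term for one decoding step: push down the probability the generator
    assigns to tokens that already appeared in the prefix (the negative set W^t).

    step_logits:        (vocab,) generator logits at the current step
    previous_token_ids: 1D tensor of token ids already generated
    """
    probs = F.softmax(step_logits, dim=-1)
    negative_probs = probs[previous_token_ids]
    return -torch.log(1.0 - negative_probs + 1e-20).sum()
```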

Top-K sampling Sampling a single hypothesis ĉ
provides a narrow channel for the generator to pass
information about its distribution to the classifier.
To provide more information, we modify the ST es-
timator forward pass, and output the top-K tokens
before passing the hypotheses to the classifier. This
produces a set of similar words that are guaranteed
to be distinct from one another. To avoid the com-
putational burden of running the classifier over a
beam of hypotheses, we perform just one step of
decoding, and concatenate the top-K tokens with
highest probability (thus, |c| = K).
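The forward-pass selection itself is a single top-K operation, sketched below; in the full model the selected tokens are still routed through the straight-through estimator so that gradients reach the generator.

```python
import torch

def top_k_hypothesis(step_logits, k=5):
    """One decoding step that outputs the K most probable (hence distinct) tokens;
    their concatenation forms a hypothesis of length |c| = K."""
    return torch.topk(step_logits, k=k, dim=-1).indices
```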

3.4 Supervised Generator

Several recent works learned to generate answers or
explanations (McCann et al., 2018; Camburu et al.,
2018; Rajani et al., 2019; Huang et al., 2019) using
the standard supervised sequence-to-sequence ob-
jective (Sutskever et al., 2014). Specifically, they
encode a source sequence and maximize the log-
likelihood of the target explanation (or answer)
token-by-token using cross-entropy loss over the
vocabulary. We adapt this approach to our setup,
and encode the question q as usual, but use the
correct answer a∗ as the target. Here, the model
obtains supervision in every decoding step, unlike
weak supervision where the loss is obtained once
for the entire hypothesis. We refer to this model
variant in §4 as SUPGEN.2

2Training SUPGEN jointly with the LM-based classifier
resulted in poor performance, due to optimization issues.
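A sketch of this token-level supervised objective (teacher forcing with the gold answer as target); names and shapes are illustrative assumptions, not the authors' code:

```python
import torch.nn.functional as F

def supgen_loss(decoder_logits, answer_token_ids):
    """Token-level cross-entropy for the supervised generator: the question is encoded
    and the gold answer a* is the decoding target, supervised at every step.

    decoder_logits:   (answer_len, vocab) logits produced with teacher forcing
    answer_token_ids: (answer_len,) gold answer token ids
    """
    return F.cross_entropy(decoder_logits, answer_token_ids)
```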

Figure 2: Examples from the different datasets we use.

4 Experiments

We evaluate on the synthetic task of hypernym ex-
traction, on CSQA, and on transfer to QASC.

4.1 Sanity Check: Hypernym Extraction

We investigate whether our model can extract hy-
pernyms, which have been shown to be encoded
well in LMs (Richardson and Sabharwal, 2019).

We automatically create a dataset of 7,625 ques-
tions of the form “What is a [hyponym]?”,
where [hyponym] is a slot filled by words, such
as “dog”, and the possible answers are six disjoint
hypernym categories: (a) “plant”, (b) “bird”, (c)

“fish”, (d) “mammal”, (e) “reptile”, and (f) “bac-
teria”. Hyponym-hypernym pairs are harvested
from ConceptNet (Speer et al., 2016). We create a
development set by randomly sampling 20% of the
data, and use the rest for training.

We use GPT-2 as a generator, generating a single word ($|c| = 1$), and optimize the similarity classifier only: $\mathbb{E}[\log p^{\mathrm{sim}}_{QA}(a^* \mid q, \hat{c})]$. Our goal is
to compare this with a standard end-to-end (§2)
model, termed END2END. We observe that the ac-
curacy of our model is 84.0, close to the accuracy
of END2END, at 86.5. This shows that the model
is able to convert the knowledge in its parameters
into language tokens. More results including the
effect of the components of the GS-ST estimator
are in Appendix B.

4.2 MC-QA Experiments

We evaluate our models on CSQA (Talmor et al.,
2018), a multi-choice commonsense reasoning QA
benchmark consisting of 12,102 questions with 5-
choice answers, and on zero-shot transfer to QASC
(Khot et al., 2019), a dataset containing science

questions that require external knowledge, consist-
ing of 9,980 examples with 8-choice answers (Fig-
ure 2). QASC contains a corpus of scientific facts
for information retrieval (IR) purposes, but we do
not use it in our setup.

All models are trained for 20 epochs, using
BertAdam optimizer with a learning rate of 2e−5,
batch size of 8, and dropout of 0.1. Training was
done using AllenNLP (Gardner et al., 2018), on top
of models from Hugging Face (Wolf et al., 2019).

4.2.1 Similarity classifier experiments

We start by examining the performance and ex-
plainability of models trained using a XLNet-based
similarity classifier only (Eq. 1), and examine the
effect of the variants from §3.

Baselines The aforementioned END2END base-
line (§4.1) obtains an accuracy of 71.0 on the de-
velopment set of CSQA. However, it has two ad-
vantages compared to our approach that we wish
to disentangle: (a) it is fully differentiable (this is
what we wish to isolate), but also (b) its architec-
ture is different: it is given both the question and
answer candidate as inputs to a transformer. Thus,
it can model arbitrary interactions between them
at the token level. This is in contrast to our model,
where the generator only observes the question, and
generates a hypothesis that is used to determinis-
tically select the answer. Thus, we introduce the
NOINTERACTION baseline, which is fully differ-
entiable, but has limited interaction between the
question and answer candidates.

In NOINTERACTION, we feed only the ques-
tion q as input to the pre-trained LM, and repre-
sent it with the last contextualized representation
gfinal. Then, the score for an answer candidate
is computed just like in the similarity classifier:
si = avg(g>finalEai). This results in an end-to-end
model without a hypothesis, where the question and
answer candidates have limited interaction. NOIN-
TERACTION obtains 63.7 accuracy on the develop-
ment set, showing that with this limited interaction
performance drops by 7-8 points. We view this
performance as an upper bound on the accuracy of
the similarity classifier.

Results Table 1 shows the performance of differ-
ent generator models, where we compare different
hypothesis lengths (|c|), use both the KLD and rep-
etition (REP) loss functions, and the ST estimator
with TOP-K sampling. Table 4 presents examples

Model Accuracy % repetitions
|c| = 1 53.3 –
|c| = 3 54.0 63
|c| = 3+KLD 51.3 42
|c| = 3+KLD+REP 49.0 19
|c| = 5 52.8 78
|c| = 5+KLD 52.2 68
|c| = 5+KLD+REP 50.7 14
TOP-K = 3 ST 58.0 –
TOP-K = 5 ST 56.2 –
SUPGEN |c| = 3 50.8 0.9
Comparable END2END 63.7 –

Table 1: Similarity classifier accuracy on the development set.
% repetitions corresponds to the number of repeated words in
a hypothesis across all examples.

of hypotheses generated by the different genera-
tors that highlights the differences in explainability
between them.

Our best model is TOP-K = 3 ST that reaches
58.0 accuracy, and by design outputs three different
words in one decoding step. This is 5.7 points lower
than NOINTERACTION, and we attribute the lower
performance to the narrow channel between the
generator and classifier (words instead of vectors),
and to the difficult optimization.

Models with multiple decoding steps tend to
produce repeated words, due to the nature of the
similarity classifier, which encourages generating
words that are similar to the correct answer. Adding
the KLD and especially the repetition loss reduces
repetitions dramatically, but also decreases accu-
racy. Table 4 shows that the generated text is rea-
sonable, even when different from the gold answer
(we evaluate and analyze the quality of hypotheses
more explicitly below). Surprisingly, KLD loss did
not yield more coherent phrases, perhaps due to
the known tendency of LMs to generate repetitions
(Holtzman et al., 2019; Welleck et al., 2019).

We evaluated the supervised generator, SUPGEN,
outputting exactly 3 tokens and then applying the
similarity classifier. SUPGEN produces text that is
more natural than our models, as evidenced by the
examples in Table 6. However, because it is not
optimized for QA, accuracy drops by 7.2 points.

4.2.2 LM-based classifier experiments
We now examine the performance and explainabil-
ity of joint training of the LM-based classifier and
the similarity classifier (Eq. 2). To show the impor-
tance of joint training, we evaluate NOSIMCLASSI-
FIER, where the model is trained without the simi-
larity classifier. When training jointly, we start with
the similarity classifier only, as a warm-up for the

Generator Q+C Q ∆(%) QASC
NOSIMCLASSIFIER 70.0 70.0 0.0 37.0
|c| = 1 69.0 62.0 -10.1 35.1
|c| = 3 69.0 63.5 -8.0 35.7
+KLD 70.0 63.5 -9.3 38.4
+KLD+REP 70.0 65.0 -7.1 36.4
TOP-K = 3 ST 70.0 60.4 -13.7 35.7
TOP-K = 5 ST 70.9 60.7 -13.1 39.2
SUPGEN |c| = 3 66.7 56.6 -15.4 –
SUPGEN |c| = 30 66.7 48.4 -27.4 –
END2END – 71.0 – 38.9

Table 2: Accuracy on the development set of the LM-based
classifier, trained jointly with the similarity classifier; Q+C:
The classifier takes the question and hypothesis as input at
training and test time; Q: The classifier takes the question
and hypothesis as input at training time, but only the question
at test time; ∆: performance drop (in %) between Q and
Q+C; SUPGEN was trained independently from the classifier
to generate the answer a∗. QASC: Accuracy when trained on
CSQA without further fine-tuning.

generator, and then add the LM-based classifier.
Table 2 shows the results. The Q+C column

shows results when the classifier is trained and
tested using the question and hypothesis. The Q
column shows results when training with the ques-
tion and hypothesis as usual, but at test time only
the question is passed, and the hypothesis input is
zeroed out. This indicates whether the LM-based
classifier uses the hypothesis or ignores it.

We observe that NOSIMCLASSIFIER gets high
accuracy, comparable to END2END. However, be-
cause the classifier is strong, it can ignore the hy-
pothesis, and generated hypotheses are meaning-
less. This is also evident by the fact that zeroing
out the hypothesis does not change performance.

All weakly-supervised generators reach roughly
the same accuracy, and their performance dramati-
cally drops in the absence of the hypothesis, showing that
the LM-based classifier uses it. TOP-K = 5 ST
performs best – almost the same as END2END.
However, models differ in terms of how much they
rely on the hypothesis (columns Q and ∆). Per-
formance drops by more than 13% for both TOP-
K = 3 ST and TOP-K = 5 ST, indicating that the
LM-based classifier strongly relies on the gener-
ated hypotheses. Table 4 shows some generated hy-
potheses, which are similar to those obtained when
training with a similarity classifier only (§4.2.1).

Tables 8 and 9 (Appendix A) present examples
where zeroing out the hypothesis input (column Q)
creates or fixes an error for TOP-K = 3 ST. We
observe how the hypothesis sways the prediction
of the model, and that often even when an error is
caused, the hypothesis is reasonable.

Model Score
|c| = 3+KLD+REP 0.72
TOP-K = 5 ST 0.74
SUPGEN |c| = 3 0.60
SUPGEN |c| = 30 0.55

Table 3: Human-evaluation results for how reasonable hy-
potheses are (CSQA development set). Each rater determined
whether a hypothesis is reasonable (1 point), somewhat rea-
sonable (0.5 point) or not reasonable (0 points). The score is
the average rating across raters and examples.

We evaluate SUPGEN, which was trained inde-
pendently as a sequence-to-sequence model and
is not optimized for QA. We experiment with de-
coding a hypothesis of length |c| = 3 and also of
maximal length |c| = 30. Results show that indeed
performance is 4 points lower than our best model,
but the output text is more natural. Moreover, the
LM-based classifier relies on it for its prediction,
especially when we decode very long hypotheses
|c| = 30. To summarize, there is a trade-off be-
tween the two types of supervision, where training
from the QA signal leads to higher accuracy, but
text that is “list-like” and not natural (Table 4),
while training to generate the answer yields lower
performance, but more natural text (Table 6).

Human evaluation To better understand the
quality of the generated text, we perform a hu-
man evaluation. We randomly sample 50 examples
from the development set, and generate hypothe-
ses from four models: (a) |c| = 3+KLD+REP,
(b) TOP-K = 5 ST, (c) SUPGEN |c| = 3 and (d)
SUPGEN |c| = 30. We randomly shuffle question-
hypotheses pairs from all models, show them to
six graduate students, and ask them to rate whether
a hypothesis is reasonable, somewhat reasonable,
or not reasonable. Each hypothesis is rated by 3
students, and the score for a model is the average
across raters and examples.

Table 3 shows the results of this experiment.
Top-K = 5 ST achieved the highest score of 0.74.
While SUPGEN models produce more natural texts,
they are judged to be less reasonable in the context
of the question.

Zero-shot transfer to QASC We examine
whether our model can generalize to the QASC
dataset, where general knowledge is needed. An
END2END model trained on CSQA obtains 38.9
accuracy (Table 2), showing that a model trained on
CSQA transfers reasonably well to QASC without
fine-tuning. Our hypothesis-generating models also

Table 4: Hypotheses generated when training the similarity classifier, for a sample of CSQA questions (columns: question, gold answer, |c| = 1, |c| = 3, |c| = 3+KLD+REP, |c| = 3+KLD+REP E2E, TOP-K = 5, TOP-K = 5 E2E). |c| indicates the hypothesis length; KLD and REP correspond to KLD loss and repetition loss; E2E indicates joint training with the LM-based classifier.

generalize without fine-tuning to QASC, where
TOP-K = 5 ST reaches the highest accuracy, 39.2,
while also providing the hypotheses as explanation.
Table 7 shows examples for generated hypotheses
for QASC questions.

CSQA Test set results We evaluated TOP-K =
5 ST on the test set of CSQA and obtained 63.5
accuracy. As a point of reference, the leaderboard
of CSQA reports one model that uses XLNet-large,
which obtains 66.9 accuracy, but also uses external
documents retrieved from Wikipedia with IR.3

4.3 Explainability Analysis

We analyze how the textual hypotheses provide in-
sights onto the abilities of the LM beyond what is
possible with an end-to-end architecture. We ana-
lyze our joint model TOP-K = 5 ST, and present
examples in Table 5.

First, we look at cases where both the similarity
and LM-based classifiers answered correctly (49%
of the cases). We manually annotated whether the
hypotheses provide a reasonable answer for a ran-
dom sample of 100 hypotheses from the develop-
ment set, assigning 1 point to a reasonable hypothe-
sis, 0.5 point to a somewhat reasonable hypothesis,
and 0 points to an unreasonable hypothesis. The
score was 0.94, indicating that when both classi-
fiers are correct, the hypotheses are reasonable,
and we can have confidence that the model does
not “cheat”. In the few cases where the hypothe-
sis is unreasonable, it reveals a shortcoming in the
model’s knowledge. For example, for the question

“The hostess was good at her job, she always had a
smile when she would what?”, the model outputs
the hypothesis “dinner eat serve food meals”, in-
dicating that it does not distinguish a hostess from
a waitress. However, the distractors are weak, and
the model correctly chooses “welcome guests”.

Next, we look at cases where both classifiers
were wrong (23% of the cases). Annotating 51 ex-
amples to estimate whether hypotheses are reason-
able, we obtain a very low score of 0.21. By exam-
ining the hypotheses, we can understand the main
reasons of error (examples in Table 5): (a) Miss-
ing knowledge: the question requires very specific
knowledge that the model does not generate, sug-
gesting that the knowledge is missing. (b) Semantic
errors: the model ignores parts of the question and
is misled by surface clues (ignoring negation in the

3 DREAM entry on https://www.tau-nlp.org/csqa-leaderboard.

first example, and a restrictive relative clause in
the second). These two reasons cover 54% of the
unreasonable hypotheses produced, showing how
the hypotheses provide important information for
“debugging” the model.

We also analyze cases where the LM-based clas-
sifier was correct but the similarity classifier was
wrong (22% of the cases). Annotating 50 such
examples produces a score of 0.55, showing that
many of the hypotheses were actually reasonable.
We observe that in two-thirds of these cases, the
error is related to the inability of the generator to
consider the distractors themselves, and while the
hypothesis was reasonable, it was more similar to
a distractor than to the gold answer (Table 5). Fi-
nally, transferring to QASC (Table 7) shows that
the model performs well on general knowledge
questions, which are somewhat similar to those in
CSQA, but fails on more scientific questions that
require encyclopedic knowledge.

Summary We observe that joint training leads to
a model that has comparable performance to an end-
to-end model, but also provides a window to the
information inside the LM. Specifically, using the
hypotheses we can detect cases where the model
was right, but lacks the necessary knowledge, cases
where the model was wrong, but provides reason-
able answers, and analyze different types of failures
related to knowledge and language.

5 Discussion and Related Work

Explanation datasets There has been substan-
tial effort recently to collect human-generated ex-
planations and train models to generate or select
them. Rajani et al. (2019) presented the CoS-E
dataset, where human-generated explanations to
commonsense questions are used to train a LM.
Wang et al. (2019) created a dataset for evalu-
ating whether a model can choose the right ex-
planation for its decision. Huang et al. (2019)
crowd-sourced a multi-choice reading comprehen-
sion dataset, where answer candidates are human-
generated explanations, and showed that a trained
generative model can produce semantically con-
sistent explanations. Sap et al. (2019) created a
dataset for explaining social situations. All these
approaches rely on human-generated explanations
that are expensive to collect. Moreover, it is unclear
whether generated explanations are actually used
for predicting the correct answer.


Question Hypothesis Answer candidates
“what mall store sells jeans for a decent price?” store mall stores retail shop apartment, GAP, bedroom, thrift store, clothing store
“the gimmicky low brow TV show was about animals when they what?” animals humans human aliens live sick, males, mammals, bite, attack
“where would you use a folding chair but not store one?” cupboard cabinet closet drawer room closet, beach, city hall, school, garage
“where would you get some maps that you own?” store stores bookstore shop Store important when traveling, library, cabinet, electrical circuit, bookstore
“athletes soak in hot tubs to relieve what after playing baseball?” sweat pain stress headache exhaustion strikes, pain, errors, fame, sore muscles
“where would you find a monkey in the wild?” wild woods bush range forest Thailand, captivity, barrel, research laboratory, zoo

Table 5: Top: examples for bad hypotheses due to missing knowledge. Middle: Examples for bad hypotheses due to semantic
errors. Bottom: Examples for reasonable hypotheses that do not fit the specific distractors. Bold: gold answer; underline:
predicted answer.

Explainability There has been ample research
recently on precisely defining what are explana-
tions and what are their different facets (see Lipton
(2016) and Wiegreffe and Pinter (2019) among
others). Specifically, an important question is
whether the explanation causes the prediction, that
is, whether the model actually uses the explana-
tion to reach its prediction. From this perspective,
our similarity classifier provides hypotheses that
strongly influence the prediction, as the question
is not even passed to the classifier. Our LM-based
classifier can potentially choose to ignore the hy-
pothesis, but we show experimentally that ablating
the hypothesis results in a performance drop. Thus,
it is more similar to attention as an explanation
(Serrano and Smith, 2019; Jain and Wallace, 2019;
Pruthi et al., 2019), where the attention structure
does not necessarily reveal what tokens are used
for prediction. Our supervised model is an even
weaker form of explanation since it is trained inde-
pendently from QA.

Multi-choice QA MC-QA has been a popular
format for QA (Lai et al., 2017; Zellers et al., 2018;
Clark et al., 2018) mostly because it simplifies eval-
uation of free text answers. However, users in the
real world ask questions in order to get an answer,
and not to test the knowledge of a model or stu-
dent (Chen et al., 2019a). Thus, it is difficult to
know whether models that succeed in MC-QA can
actually answer the question, as they may take ad-
vantage of weaknesses in the way distractors were
constructed. Our similarity classifier model can be
viewed as an abstractive QA model that answers the
question directly, and is evaluated with a multiple-
choice format. In future work, our similarity clas-
sifier can be replaced by more complex untrained
models that choose the right answer given the gen-
erated hypothesis.

6 Conclusion

In this work we propose a LM-based model for MC-QA, which generates, as an intermediate step, natural language text that can be used to understand the knowledge extracted from the LM. The
performance of our model is on par with end-to-end
models, while providing an inspectable layer for
practitioners and users. Our approach is supervised
from downstream application signal only, and thus
can be generalized to any scenario where we would
like to train a LM to generate text that is useful for
a downstream application.

Acknowledgements

We thank Inbar Oren and Guy Tevet for their useful
suggestions. This research was partially supported
by The Israel Science Foundation grant 942/16,
The Yandex Initiative for Machine Learning and
the European Research Council (ERC) under the
European Union Horizons 2020 research and inno-
vation programme (grant ERC DELPHI 802800).

Table 6: Hypotheses generated by the SUPGEN |c| = 3 and SUPGEN |c| = 30 models (columns: question, gold answer, SUPGEN |c| = 3, SUPGEN |c| = 30).

Table 7: Hypotheses generated by models trained on CSQA and evaluated on QASC (columns: question, gold answer, |c| = 3+KLD+REP E2E, TOP-K = 5 E2E, SUPGEN |c| = 3, SUPGEN |c| = 30). Upper part: examples where the models contained the required knowledge to answer. Lower part: examples where the models lacked the required knowledge.

References
Yoshua Bengio, Nicholas Léonard, and Aaron Courville.

2013. Estimating or propagating gradients through
stochastic neurons for conditional computation.

Oana-Maria Camburu, Tim Rocktäschel, Thomas
Lukasiewicz, and Phil Blunsom. 2018. e-snli: Nat-
ural language inference with natural language expla-
nations.

Anthony Chen, Gabriel Stanovsky, Sameer Singh, and
Matt Gardner. 2019a. Evaluating question answer-
ing evaluation. In Proceedings of the 2nd Workshop
on Machine Reading for Question Answering.

Michael Chen, Mike D’Arcy, Alisa Liu, Jared Fernan-
dez, and Doug Downey. 2019b. Codah: An adver-
sarially authored question-answer dataset for com-
mon sense.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot,
Ashish Sabharwal, Carissa Schoenick, and Oyvind
Tafjord. 2018. Think you have solved question an-
swering? try arc, the ai2 reasoning challenge.

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane
Hung, Eric Frank, Piero Molino, Jason Yosinski, and
Rosanne Liu. 2019. Plug and play language models:
A simple approach to controlled text generation.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2018. Bert: Pre-training of deep
bidirectional transformers for language understand-
ing. arXiv preprint arXiv:1810.04805.

Jessica Ficler and Yoav Goldberg. 2017. Controlling
linguistic style aspects in neural language genera-
tion. arXiv preprint arXiv:1707.02633.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind
Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Pe-
ters, Michael Schmitz, and Luke Zettlemoyer. 2018.
Allennlp: A deep semantic natural language process-
ing platform.

Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019.
Are we modeling the task or the annotator? an inves-
tigation of annotator bias in natural language under-
standing datasets.

Suchin Gururangan, Swabha Swayamdipta, Omer
Levy, Roy Schwartz, Samuel R Bowman, and
Noah A Smith. 2018. Annotation artifacts in
natural language inference data. arXiv preprint
arXiv:1803.02324.

Serhii Havrylov and Ivan Titov. 2017. Emergence of
language with multi-agent games: Learning to com-
municate with sequences of symbols.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and
Yejin Choi. 2019. The curious case of neural text
degeneration.

Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and
Yejin Choi. 2019. Cosmos qa: Machine reading
comprehension with contextual commonsense rea-
soning.

Sarthak Jain and Byron C. Wallace. 2019. Attention is
not explanation.

Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categor-
ical reparameterization with gumbel-softmax.

Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau,
José Miguel Hernández-Lobato, Richard E. Turner,
and Douglas Eck. 2016. Sequence tutor: Conserva-
tive fine-tuning of sequence generation models with
kl-control.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham
Neubig. 2019. How can we know what language
models know?

Nitish Shirish Keskar, Bryan McCann, Lav R. Varsh-
ney, Caiming Xiong, and Richard Socher. 2019.
Ctrl: A conditional transformer language model for
controllable generation.

Tushar Khot, Peter Clark, Michal Guerquin, Peter
Jansen, and Ashish Sabharwal. 2019. Qasc: A
dataset for question answering via sentence compo-
sition.

Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya
Takamura, and Manabu Okumura. 2016. Control-
ling output length in neural encoder-decoders.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang,
and Eduard Hovy. 2017. Race: Large-scale reading
comprehension dataset from examinations. arXiv
preprint arXiv:1704.04683.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman,
Kevin Gimpel, Piyush Sharma, and Radu Soricut.
2019. Albert: A lite bert for self-supervised learn-
ing of language representations.

Jeffrey Ling, Nicholas FitzGerald, Zifei Shan,
Livio Baldini Soares, Thibault Févry, David Weiss,
and Tom Kwiatkowski. 2020. Learning cross-
context entity representations from text.

Zachary Chase Lipton. 2016. The mythos of model
interpretability. CoRR, abs/1606.03490.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-
dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Luke Zettlemoyer, and Veselin Stoyanov. 2019.
Roberta: A robustly optimized bert pretraining ap-
proach.

Scott Lundberg and Su-In Lee. 2017. A unified ap-
proach to interpreting model predictions.

Chris J Maddison, Andriy Mnih, and Yee Whye Teh.
2016. The concrete distribution: A continuous relax-
ation of discrete random variables. arXiv preprint
arXiv:1611.00712.


Bryan McCann, Nitish Shirish Keskar, Caiming Xiong,
and Richard Socher. 2018. The natural language de-
cathlon: Multitask learning as question answering.
arXiv preprint arXiv:1806.08730.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt
Gardner, Christopher Clark, Kenton Lee, and Luke
Zettlemoyer. 2018. Deep contextualized word repre-
sentations.

Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton
Bakhtin, Yuxiang Wu, Alexander H. Miller, and Se-
bastian Riedel. 2019. Language models as knowl-
edge bases?

Adam Poliak, Jason Naradowsky, Aparajita Haldar,
Rachel Rudinger, and Benjamin Van Durme. 2018.
Hypothesis only baselines in natural language infer-
ence. arXiv preprint arXiv:1805.01042.

Danish Pruthi, Mansi Gupta, Bhuwan Dhingra, Gra-
ham Neubig, and Zachary C. Lipton. 2019. Learn-
ing to deceive with attention-based explanations.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan,
Dario Amodei, and Ilya Sutskever. 2019. Language
models are unsupervised multitask learners. OpenAI
Blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine
Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
Wei Li, and Peter J. Liu. 2019. Exploring the limits
of transfer learning with a unified text-to-text trans-
former.

Nazneen Fatema Rajani, Bryan McCann, Caiming
Xiong, and Richard Socher. 2019. Explain yourself!
leveraging language models for commonsense rea-
soning.

Marco Tulio Ribeiro, Sameer Singh, and Carlos
Guestrin. 2016. ”why should i trust you?”: Explain-
ing the predictions of any classifier.

Kyle Richardson and Ashish Sabharwal. 2019. What
does my qa model know? devising controlled probes
using expert knowledge. ArXiv.

Adam Roberts, Colin Raffel, and Noam Shazeer. 2020.
How much knowledge can you pack into the param-
eters of a language model?

Maarten Sap, Ronan LeBras, Emily Allaway, Chan-
dra Bhagavatula, Nicholas Lourie, Hannah Rashkin,
Brendan Roof, Noah A. Smith, and Yejin Choi. 2018.
Atomic: An atlas of machine commonsense for if-
then reasoning.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le-
Bras, and Yejin Choi. 2019. Socialiqa: Common-
sense reasoning about social interactions.

Sofia Serrano and Noah A. Smith. 2019. Is attention in-
terpretable? In Association for Computational Lin-
guistics (ACL).

Robyn Speer, Joshua Chin, and Catherine Havasi. 2016.
Conceptnet 5.5: An open multilingual graph of gen-
eral knowledge.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014.
Sequence to sequence learning with neural networks.
In Advances in neural information processing sys-
tems, pages 3104–3112.

Alon Talmor, Yanai Elazar, Yoav Goldberg, and
Jonathan Berant. 2019. olmpics – on what language
model pre-training captures.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and
Jonathan Berant. 2018. Commonsenseqa: A ques-
tion answering challenge targeting commonsense
knowledge. arXiv preprint arXiv:1811.00937.

James Thorne, Andreas Vlachos, Christos
Christodoulopoulos, and Arpit Mittal. 2019.
Generating token-level explanations for natural
language inference.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method
for commonsense reasoning.

Masatoshi Tsuchiya. 2018. Performance impact
caused by hidden bias of training data for recogniz-
ing textual entailment.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In Advances in neural information pro-
cessing systems, pages 5998–6008.

Cunxiang Wang, Shuailong Liang, Yue Zhang, Xiao-
nan Li, and Tian Gao. 2019. Does it make sense?
and why? a pilot study for sense making and expla-
nation.

Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Di-
nan, Kyunghyun Cho, and Jason Weston. 2019. Neu-
ral text generation with unlikelihood training.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is
not not explanation.

Ronald J Williams. 1992. Simple statistical gradient-
following algorithms for connectionist reinforce-
ment learning. Machine learning, 8(3-4):229–256.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien
Chaumond, Clement Delangue, Anthony Moi, Pier-
ric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz,
and Jamie Brew. 2019. Huggingface’s transformers:
State-of-the-art natural language processing.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car-
bonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019.
Xlnet: Generalized autoregressive pretraining for
language understanding.

Penghang Yin, Jiancheng Lyu, Shuai Zhang, Stanley
Osher, Yingyong Qi, and Jack Xin. 2019. Under-
standing straight-through estimator in training acti-
vation quantized neural nets.


Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin
Choi. 2018. Swag: A large-scale adversarial dataset
for grounded commonsense inference.

Xuhui Zhou, Yue Zhang, Leyang Cui, and Dandan
Huang. 2019. Evaluating commonsense in pre-
trained language models.


A Examples of the Impact of the Hypothesis on Prediction

Question Hypothesis Gold answer Predicted with hypothesis Predicted without hypothesis
“the dogs were protecting their own when they decided to what the bad
man?”

attack chase go attack attack defend

“what are you using if there are speakers strapped on your ears?” radio headphones music headphones headphones conference
“what does everyone have in relation to other people?” relationship relationships

feelings
feelings feelings unique personality

“if the president wanted to ban snakes, where would he issue such a
decree?”

congress law Congress white house white house new mexico

“what is the sun ultimately responsible for?” life world existence life on earth life on earth heat
“where can you store you spare linens near your socks?” closet bedroom drawer dresser drawers dresser drawers home
“what would be necessary for getting in shape?” exercise fitness exercises exercise exercise good health

Table 8: Examples where the hypothesis from TOP-K = 3 ST is useful, and zeroing it out causes an error.

Question Hypothesis Gold answer Predicted with hypothesis Predicted without hypothesis
“what can eating lunch cause that is painful?” headache headaches pain heartburn headache heartburn
“what happens to a dog before someone puts up posters of them?” tame trained clean get lost trained get lost
“john was an aristocratic fox hunter. where might he live?” England France London new hampshire england new hampshire
“where can children play with animals?” park zoo farm fairgrounds zoos fairgrounds

Table 9: Examples where the hypothesis from TOP-K = 3 ST hurts the model, and zeroing it out fixes an error.

B Hypernym Extraction

We report results on the synthetic hypernym extraction task with and without the Gumbel-softmax trick
and the ST estimator. Table 10 shows the results. We observe that the ST estimator is crucial even on such
a simple task, which aligns with prior observations (Havrylov and Titov, 2017) that ST helps overcome
the discrepancy between training time and test time. GS improved results without ST, but had little effect
with ST.

Model Accuracy
+GS +ST 84.0
+GS -ST 61.0
-GS +ST 84.7
-GS -ST 54.7
END2END 86.5

Table 10: Results of END2END compared to our model (with GS and ST variants) on hypernym extraction.