BERT Rediscovers the Classical NLP Pipeline
Ian Tenney1 Dipanjan Das1 Ellie Pavlick1,2
1Google Research 2Brown University
{iftenney,dipanjand,epavlick}@google.com
Abstract
Pre-trained text encoders have rapidly ad-
vanced the state of the art on many NLP
tasks. We focus on one such model, BERT,
and aim to quantify where linguistic informa-
tion is captured within the network. We find
that the model represents the steps of the tra-
ditional NLP pipeline in an interpretable and
localizable way, and that the regions respon-
sible for each step appear in the expected se-
quence: POS tagging, parsing, NER, semantic
roles, then coreference. Qualitative analysis
reveals that the model can and often does ad-
just this pipeline dynamically, revising lower-
level decisions on the basis of disambiguating
information from higher-level representations.
1 Introduction
Pre-trained sentence encoders such as ELMo (Pe-
ters et al., 2018a) and BERT (Devlin et al., 2019)
have rapidly improved the state of the art on many
NLP tasks, and seem poised to displace both static
word embeddings (Mikolov et al., 2013) and dis-
crete pipelines (Manning et al., 2014) as the basis
for natural language processing systems. While
this has been a boon for performance, it has come
at the cost of interpretability, and it remains un-
clear whether such models are in fact learning the
kind of abstractions that we intuitively believe are
important for representing natural language, or are
simply modeling complex co-occurrence statis-
tics.
A wave of recent work has begun to “probe”
state-of-the-art models to understand whether they
are representing language in a satisfying way.
Much of this work is behavior-based, designing
controlled test sets and analyzing errors in order
to reverse-engineer the types of abstractions the
model may or may not be representing (e.g. Con-
neau et al., 2018; Marvin and Linzen, 2018; Poliak
et al., 2018). Parallel efforts inspect the structure
of the network directly, to assess whether there
exist localizable regions associated with distinct
types of linguistic decisions. Such work has pro-
duced evidence that deep language models can en-
code a range of syntactic and semantic informa-
tion (e.g. Shi et al., 2016; Belinkov, 2018; Ten-
ney et al., 2019), and that more complex structures
are represented hierarchically in the higher layers
of the model (Peters et al., 2018b; Blevins et al.,
2018).
We build on this latter line of work, focusing
on the BERT model (Devlin et al., 2019), and use
a suite of probing tasks (Tenney et al., 2019) de-
rived from the traditional NLP pipeline to quantify
where specific types of linguistic information are
encoded. Building on observations (Peters et al.,
2018b) that lower layers of a language model en-
code more local syntax while higher layers capture
more complex semantics, we present two novel
contributions. First, we present an analysis that
spans the common components of a traditional
NLP pipeline. We show that the order in which
specific abstractions are encoded reflects the tradi-
tional hierarchy of these tasks. Second, we quali-
tatively analyze how individual sentences are pro-
cessed by the BERT network, layer-by-layer. We
show that while the pipeline order holds in ag-
gregate, the model can allow individual decisions
to depend on each other in arbitrary ways, de-
ferring ambiguous decisions or revising incorrect
ones based on higher-level information.
2 Model
Edge Probing. Our experiments are based on
the “edge probing” approach of Tenney et al.
(2019), which aims to measure how well infor-
mation about linguistic structure can be extracted
from a pre-trained encoder. Edge probing decom-
poses structured-prediction tasks into a common
format, where a probing classifier receives spans
$s_1 = [i_1, j_1)$ and (optionally) $s_2 = [i_2, j_2)$ and
must predict a label such as a constituent or rela-
tion type.1 The probing classifier has access only
to the per-token contextual vectors within the tar-
get spans, and so must rely on the encoder to pro-
vide information about the relation between these
spans and their role in the sentence.
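For concreteness, a minimal PyTorch sketch of such a span-pair probe is shown below. The class name, the mean-pooling, and the two-layer MLP are illustrative simplifications of ours; the actual probing architecture of Tenney et al. (2019) differs (e.g., in how spans are pooled), and single-span tasks would simply omit the second span.

```python
import torch
import torch.nn as nn


class SpanPairProbe(nn.Module):
    """Toy edge-probing classifier: it sees only the contextual vectors inside
    the target spans, pools them, and predicts a label from the result."""

    def __init__(self, hidden_dim: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_labels),
        )

    @staticmethod
    def pool(h: torch.Tensor, span) -> torch.Tensor:
        i, j = span                     # half-open span [i, j)
        return h[i:j].mean(dim=0)       # mean-pool the span's token vectors

    def forward(self, h: torch.Tensor, span1, span2) -> torch.Tensor:
        # h: [num_tokens, hidden_dim] contextual vectors from the (frozen) encoder
        pooled = torch.cat([self.pool(h, span1), self.pool(h, span2)], dim=-1)
        return self.classifier(pooled)  # unnormalized label scores
```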
We use eight labeling tasks from the edge
probing suite: part-of-speech (POS), constituents
(Consts.), dependencies (Deps.), entities, seman-
tic role labeling (SRL), coreference (Coref.), se-
mantic proto-roles (SPR; Reisinger et al., 2015),
and relation classification (SemEval). These tasks
are derived from standard benchmark datasets, and
evaluated with a common metric, micro-averaged F1, to facilitate comparison across tasks.2
BERT. The BERT model (Devlin et al., 2019)
has shown state-of-the-art performance on many
tasks, and its deep Transformer architecture
(Vaswani et al., 2017) is typical of many recent
models (e.g. Radford et al., 2018, 2019; Liu et al.,
2019). We focus on the stock BERT models
(base and large, uncased), which are trained with
a multi-task objective (masked language modeling
and next-sentence prediction) over a 3.3B word
English corpus. Since we want to understand how
the network represents language as a result of pre-
training, we follow Tenney et al. (2019) (depart-
ing from standard BERT usage) and freeze the en-
coder weights. This prevents the encoder from re-
arranging its internal representations to better suit
the probing task.
Given input tokens $T = [t_0, t_1, \ldots, t_n]$, a deep encoder produces a set of layer activations $H^{(0)}, H^{(1)}, \ldots, H^{(L)}$, where $H^{(\ell)} = [\mathbf{h}_0^{(\ell)}, \mathbf{h}_1^{(\ell)}, \ldots, \mathbf{h}_n^{(\ell)}]$ are the activation vectors of the $\ell$-th encoder layer and $H^{(0)}$ corresponds to the non-contextual word(piece) embeddings. We use a weighted sum across layers (§3.1) to pool these into a single set of per-token representation vectors $H = [\mathbf{h}_0, \mathbf{h}_1, \ldots, \mathbf{h}_n]$, and train a probing classifier $P_\tau$ for each task using the architecture and procedure of Tenney et al. (2019).
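As an illustration (not part of the original experimental setup), the following sketch shows one way to obtain the frozen per-layer activations $H^{(0)}, \ldots, H^{(L)}$ using the HuggingFace transformers package; the exact API and return types vary by library version, so treat this as an assumption-laden example.

```python
import torch
from transformers import AutoModel, AutoTokenizer


def frozen_layer_activations(text: str, model_name: str = "bert-base-uncased"):
    """Return a [L+1, num_wordpieces, hidden_dim] tensor of layer activations;
    index 0 is the non-contextual wordpiece embedding layer H^(0)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()                          # inference mode (e.g. no dropout)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():                 # encoder stays frozen: no gradients
        outputs = model(**inputs)
    # outputs.hidden_states is a tuple of L+1 tensors, each [1, num_wordpieces, dim]
    return torch.stack(outputs.hidden_states, dim=0).squeeze(1)
```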
1For single-span tasks (POS, entities, and constituents), $s_2$ is not used. For POS, $s_1 = [i, i+1)$ is a single token.
2We use the code from https://github.com/jsalt18-sentence-repl/jiant. Dependencies is the English Web Treebank (Silveira et al., 2014), SPR is the SPR1 dataset of Teichert et al. (2017), and relations is SemEval 2010 Task 8 (Hendrickx et al., 2009). All other tasks are from OntoNotes 5.0 (Weischedel et al., 2013).
Limitations. This work is intended to be exploratory. We focus on one particular encoder, BERT, to explore how information can be organized in a deep language model, and further work is required to determine to what extent the trends hold in general. Furthermore, our work carries
the limitations of all inspection-based probing: the
fact that a linguistic pattern is not observed by our
probing classifier does not guarantee that it is not
there, and the observation of a pattern does not
tell us how it is used. For this reason, we empha-
size the importance of combining structural analy-
sis with behavioral studies (as discussed in § 1) to
provide a more complete picture of what informa-
tion these models encode and how that informa-
tion affects performance on downstream tasks.
3 Metrics
We define two complementary metrics. The first, scalar mixing weights (§3.1), tells us which layers, in combination, are most relevant when a probing classifier has access to the whole BERT model. The second, cumulative scoring (§3.2), tells us how much higher we can score on a probing task with the introduction of each layer. These metrics provide complementary views on what is happening inside the model. Mixing weights are learned solely from the training data; they tell us which layers the probing model finds most useful. In contrast, cumulative scoring is derived entirely from an evaluation set, and tells us how many layers are needed for a correct prediction.
3.1 Scalar Mixing Weights
To pool across layers, we use the scalar mixing technique introduced by the ELMo model. Following Equation (1) of Peters et al. (2018a), for each task we introduce scalar parameters $\gamma_\tau$ and $a_\tau^{(0)}, a_\tau^{(1)}, \ldots, a_\tau^{(L)}$, and let:

$$\mathbf{h}_{i,\tau} = \gamma_\tau \sum_{\ell=0}^{L} s_\tau^{(\ell)} \mathbf{h}_i^{(\ell)} \qquad (1)$$

where $s_\tau = \mathrm{softmax}(a_\tau)$. We learn these weights jointly with the probing classifier $P_\tau$, in order to allow it to extract information from the many layers of an encoder without adding a large number of parameters. After the probing model is trained, we extract the learned coefficients in order to estimate the contribution of different layers to that particular task. We interpret higher weights as evidence that the corresponding layer contains more information related to that particular task.
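For concreteness, a minimal PyTorch sketch of this scalar mix (illustrative only; the class and parameter names are ours, not those of the released jiant code):

```python
import torch
import torch.nn as nn


class ScalarMix(nn.Module):
    """Task-specific softmax-weighted sum over layers (Eq. 1), learned jointly
    with the probing classifier."""

    def __init__(self, num_layers: int):
        # num_layers = L + 1 (embedding layer plus L encoder layers)
        super().__init__()
        self.a = nn.Parameter(torch.zeros(num_layers))  # a_tau^(0), ..., a_tau^(L)
        self.gamma = nn.Parameter(torch.ones(1))        # gamma_tau

    def forward(self, layer_activations: torch.Tensor) -> torch.Tensor:
        # layer_activations: [L+1, num_tokens, hidden_dim]
        s = torch.softmax(self.a, dim=0)                # s_tau = softmax(a_tau)
        mixed = torch.einsum("l,ltd->td", s, layer_activations)
        return self.gamma * mixed                       # pooled per-token vectors
```

After training, torch.softmax(mix.a, dim=0) recovers the per-layer coefficients $s_\tau^{(\ell)}$ that we analyze below.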
Figure 1: Summary statistics on BERT-large. Columns on left show F1 dev-set scores for the baseline ($P_\tau^{(0)}$) and full-model ($P_\tau^{(L)}$) probes. Dark (blue) are the mixing weight center of gravity (Eq. 2); light (purple) are the expected layer from the cumulative scores (Eq. 4).
Center-of-Gravity. As a summary statistic, we
define the mixing weight center of gravity as:
$$\bar{E}_s[\ell] = \sum_{\ell=0}^{L} \ell \cdot s_\tau^{(\ell)} \qquad (2)$$
This reflects the average layer attended to for each
task; intuitively, we can interpret a higher value to
mean that the information needed for that task is
captured by higher layers.
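For example, the center of gravity can be read directly off the trained mixing parameters (a sketch, reusing the hypothetical ScalarMix above):

```python
import torch


def center_of_gravity(a: torch.Tensor) -> float:
    """Eq. (2): mean layer index under the learned mixing distribution s_tau."""
    s = torch.softmax(a, dim=0)                      # s_tau^(0), ..., s_tau^(L)
    layers = torch.arange(s.numel(), dtype=s.dtype)  # 0, 1, ..., L
    return float((layers * s).sum())
```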
3.2 Cumulative Scoring
We would like to estimate at which layer in the
encoder a target $(s_1, s_2, \text{label})$ can be correctly
predicted. Mixing weights cannot tell us this di-
rectly, because they are learned as parameters and
do not correspond to a distribution over data. A
naive classifier at a single layer cannot either, be-
cause information about a particular span may be
spread out across several layers, and as observed
in Peters et al. (2018b) the encoder may choose to
discard information at higher layers.
To address this, we train a series of classifiers $\{P_\tau^{(\ell)}\}_\ell$ which use scalar mixing (Eq. 1) to attend to layer $\ell$ as well as all previous layers. $P_\tau^{(0)}$ corresponds to a non-contextual baseline that uses only a bag of word(piece) embeddings, while $P_\tau^{(L)} = P_\tau$ corresponds to probing all layers of the BERT model.

Figure 2: Layer-wise metrics on BERT-large. Solid (blue) are mixing weights $s_\tau^{(\ell)}$ (§3.1); outlined (purple) are differential scores $\Delta_\tau^{(\ell)}$ (§3.2), normalized for each task. Horizontal axis is encoder layer.

These classifiers are cumulative, in the sense that $P_\tau^{(\ell+1)}$ has a similar number of parameters but with access to strictly more information than $P_\tau^{(\ell)}$, and we see intuitively that performance (F1 score) generally increases as more layers are added.3 We can then compute a differential score $\Delta_\tau^{(\ell)}$, which measures how much better we do on the probing task if we observe one additional encoder layer $\ell$:

$$\Delta_\tau^{(\ell)} = \mathrm{Score}(P_\tau^{(\ell)}) - \mathrm{Score}(P_\tau^{(\ell-1)}) \qquad (3)$$
Expected Layer. Again, we compute a (pseudo)4 expectation over the differential scores as a summary statistic. To focus on the behavior of the contextual encoder layers, we omit the contribution of both the “trivial” examples resolved at layer 0, as well as the remaining headroom from the full model. Let:

$$\bar{E}_\Delta[\ell] = \frac{\sum_{\ell=1}^{L} \ell \cdot \Delta_\tau^{(\ell)}}{\sum_{\ell=1}^{L} \Delta_\tau^{(\ell)}} \qquad (4)$$

3Note that if a new layer provides distracting features, the probing model can overfit and performance can drop. We see this in particular in the last 1-2 layers (Figure 2).
4This is not a true expectation because the F1 score is not an expectation over examples.
This can be thought of as, approximately, the ex-
pected layer at which the probing model correctly
labels an example, assuming that example is re-
solved at some layer $\ell \geq 1$ of the encoder.
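A small sketch of Eqs. (3) and (4), assuming f1_by_layer[l] holds the evaluation-set score of the cumulative probe $P_\tau^{(\ell)}$ (the variable names are placeholders, not reported numbers):

```python
from typing import List


def differential_scores(f1_by_layer: List[float]) -> List[float]:
    """Eq. (3): improvement from adding each encoder layer, for l = 1, ..., L."""
    return [f1_by_layer[l] - f1_by_layer[l - 1] for l in range(1, len(f1_by_layer))]


def expected_layer(f1_by_layer: List[float]) -> float:
    """Eq. (4): (pseudo-)expected layer over the differential scores, ignoring
    layer-0 'trivial' examples and the headroom above the full model."""
    deltas = differential_scores(f1_by_layer)
    return sum(l * d for l, d in enumerate(deltas, start=1)) / sum(deltas)
```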
4 Results
Figure 1 reports summary statistics and absolute
F1 scores, and Figure 2 reports per-layer metrics.
Both show results on the 24-layer BERT-large
model. We also report $K(\star) = \mathrm{KL}(\star\,\|\,\mathrm{Uniform})$ to estimate how non-uniform5 each statistic ($\star = s_\tau, \Delta_\tau$) is for each task.
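This statistic can be computed as in the following sketch, which assumes a nonnegative input that can be normalized to a distribution over layers (e.g. the mixing weights; negative $\Delta$ values would need separate handling):

```python
import numpy as np


def kl_from_uniform(weights) -> float:
    """K(p) = KL(p || Uniform) = -H(p) + log(n): higher values mean the mass
    is concentrated on fewer layers (lower entropy)."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()        # normalize to a distribution over the n layers
    n = p.size             # support size of the uniform reference
    p = p[p > 0]           # zero entries contribute nothing (0 * log 0 = 0)
    return float(np.sum(p * np.log(p * n)))
```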
Linguistic Patterns. We observe a consistent
trend across both of our metrics, with the tasks
encoded in a natural progression: POS tags pro-
cessed earliest, followed by constituents, depen-
dencies, semantic roles, and coreference. That is,
it appears that basic syntactic information appears
earlier in the network, while high-level semantic
information appears at higher layers. We note that
this finding is consistent with initial observations
by Peters et al. (2018b), which found that con-
stituents are represented earlier than coreference.
In addition, we observe that in general, syntactic
information is more localizable, with weights re-
lated to syntactic tasks tending to be concentrated
on a few layers (high $K(s)$ and $K(\Delta)$), while in-
formation related to semantic tasks is generally
spread across the entire network. For example,
we find that for semantic relations and proto-roles
(SPR), the mixing weights are close to uniform,
and that nontrivial examples for these tasks are re-
solved gradually across nearly all layers. For entity labeling, many examples are resolved in layer
1, but with a long tail thereafter, and only a weak
concentration of mixing weights in high layers.
Further study is needed to determine whether this
is because BERT has difficulty representing the
correct abstraction for these tasks, or because se-
mantic information is inherently harder to localize.
Comparison of Metrics. For many tasks, we
find that the differential scores are highest in the
5$\mathrm{KL}(\star\,\|\,\mathrm{Uniform}) = -H(\star) + \mathrm{constant}$, so higher values correspond to lower entropy.
first few layers of the model (layers 1-7 for BERT-
large), i.e. most examples can be correctly classi-
fied very early on. We attribute this to the avail-
ability of heuristic shortcuts: while challenging
examples may not be resolved until much later,
many cases can be guessed from shallow statis-
tics. Conversely, we observe that the learned mix-
ing weights are concentrated much later, layers 9-
20 for BERT-large. We observe, particularly when weights are highly concentrated, that the highest weights are found on or just after the highest layers that give an improvement $\Delta_\tau^{(\ell)}$ in F1 score for that task.
This helps explain the observations on the se-
mantic relations and SPR tasks: cumulative scor-
ing shows continued improvement up to the high-
est layers of the model, while the lack of concentration in the mixing weights suggests that the BERT encoder does not expose a localized set of features that encode these more semantic phenomena. Similarly for entity types, we see continued improvements in the higher layers, perhaps related to fine-grained semantic distinctions like “Organization” (ORG) vs. “Geopolitical Entity” (GPE), while the low value for the expected layer
reflects that many examples require only limited
context to resolve.
Comparison of Encoders. We observe the same
general ordering on the 12-layer BERT-base
model (Figure A.2). In particular, there appears
to be a “stretching effect”, where the representa-
tions for a given task tend to concentrate at the
same layers relative to the top of the model; this
is illustrated side-by-side in Figure A.3.
4.1 Per-Example Analysis
We explore, qualitatively, how beliefs about the
structure of individual sentences develop over the
layers of the BERT network. The OntoNotes de-
velopment set contains annotations for five of our
probing tasks: POS, constituents, entities, SRL,
and coreference. We compile the predictions of
the per-layer classifiers P (`)τ for each task. Be-
cause many annotations are uninteresting – for ex-
ample, 89% of part-of-speech tags are correct at
layer 0 – we use a heuristic to identify ambigu-
ous sentences to visualize.6 Figure 3 shows two
6Specifically, we look for target edges $(s_1, s_2, \text{label})$ where the highest scoring label has an average score $\frac{1}{L+1}\sum_{\ell=0}^{L} P_\tau^{(\ell)}(\text{label} \mid s_1, s_2) \le 0.7$, and look at sentences with more than one such edge.
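A sketch of this selection heuristic, where top_label_probs_by_layer is assumed to hold $P_\tau^{(\ell)}(\text{label} \mid s_1, s_2)$ for the highest-scoring label at each layer $\ell = 0, \ldots, L$:

```python
import numpy as np


def is_ambiguous_edge(top_label_probs_by_layer, threshold: float = 0.7) -> bool:
    """An edge is 'ambiguous' if the top label's probability, averaged over
    all L+1 layers, is at most the threshold (footnote 6)."""
    return float(np.mean(top_label_probs_by_layer)) <= threshold


def is_interesting_sentence(edge_prob_lists) -> bool:
    """Visualize sentences that contain more than one ambiguous edge."""
    return sum(is_ambiguous_edge(probs) for probs in edge_prob_lists) > 1
```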
(a) he smoked toronto in the playoffs with six hits, seven
walks and eight stolen bases …
(b) china today blacked out a cnn interview that was …
Figure 3: Probing classifier predictions across lay-
ers of BERT-base. Blue is the correct label; or-
ange is the incorrect label with highest average score
over layers. Bar heights are (normalized) probabilities $P_\tau^{(\ell)}(\text{label} \mid s_1, s_2)$. In the interest of space, only se-
lected annotations are shown.
selected examples, and more are presented in Ap-
pendix A.2.
We find that while the pipeline order holds on
average (Figure 2), for individual examples the
model is free to and often does choose a differ-
ent order. In the first example, the model origi-
nally (incorrectly) assumes that “Toronto” refers
to the city, tagging it as a GPE. However, after resolving the semantic role, determining that “Toronto” is the thing getting “smoked” (ARG1), the entity-typing decision is revised in favor of
ORG (i.e. the sports team). In the second exam-
ple, the model initially tags “today” as a common
noun, date, and temporal modifier (ARGM-TMP).
However, this phrase is ambiguous, and it later
reinterprets “china today” as a proper noun (i.e.
the TV network) and updates its beliefs about the
entity type (to ORG), followed by the semantic role
(reinterpreting it as the agent ARG0).
5 Conclusion
We employ the edge probing task suite to explore
how the different layers of the BERT network can
resolve syntactic and semantic structure within a
sentence. We present two complementary mea-
surements: scalar mixing weights, learned from a
training corpus, and cumulative scoring, measured
on an evaluation set, and show that a consistent
ordering emerges. We find that while this tradi-
tional pipeline order holds in the aggregate, on in-
dividual examples the network can resolve out-of-
order, using high-level information like predicate-
argument relations to help disambiguate low-level
decisions like part-of-speech. This provides new
evidence corroborating that deep language mod-
els can represent the types of syntactic and se-
mantic abstractions traditionally believed neces-
sary for language processing, and moreover that
they can model complex interactions between dif-
ferent levels of hierarchical information.
Acknowledgments
Thanks to Kenton Lee, Emily Pitler, and Jon Clark
for helpful comments and feedback, and to the
members of the Google AI Language team for
many productive discussions.
References
Yonatan Belinkov. 2018. On internal language repre-
sentations in deep learning: An analysis of machine
translation and speech recognition. Ph.D. thesis,
Massachusetts Institute of Technology.
Terra Blevins, Omer Levy, and Luke Zettlemoyer.
2018. Deep RNNs encode soft hierarchical syntax.
In Proceedings of ACL.
Alexis Conneau, Germán Kruszewski, Guillaume
Lample, Loïc Barrault, and Marco Baroni. 2018.
What you can cram into a single $&#* vector: Prob-
ing sentence embeddings for linguistic properties.
In Proceedings of ACL.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of
deep bidirectional transformers for language under-
standing. In Proceedings of NAACL.
Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva,
Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian
Padó, Marco Pennacchiotti, Lorenza Romano, and
Stan Szpakowicz. 2009. SemEval-2010 Task 8:
Multi-way classification of semantic relations be-
tween pairs of nominals. In Proceedings of
the Workshop on Semantic Evaluations: Recent
Achievements and Future Directions.
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jian-
feng Gao. 2019. Multi-task deep neural networks
for natural language understanding. arXiv preprint
1901.11504.
Christopher Manning, Mihai Surdeanu, John Bauer,
Jenny Finkel, Steven Bethard, and David McClosky.
2014. The Stanford CoreNLP natural language pro-
cessing toolkit. In Proceedings of ACL: System
Demonstrations.
Rebecca Marvin and Tal Linzen. 2018. Targeted syn-
tactic evaluation of language models. In Proceed-
ings of EMNLP.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Cor-
rado, and Jeff Dean. 2013. Distributed representa-
tions of words and phrases and their compositional-
ity. In Proceedings of NIPS.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt
Gardner, Christopher Clark, Kenton Lee, and Luke
Zettlemoyer. 2018a. Deep contextualized word rep-
resentations. In Proceedings of NAACL.
Matthew Peters, Mark Neumann, Luke Zettlemoyer,
and Wen-tau Yih. 2018b. Dissecting contextual
word embeddings: Architecture and representation.
In Proceedings of EMNLP.
Adam Poliak, Aparajita Haldar, Rachel Rudinger,
J. Edward Hu, Ellie Pavlick, Aaron Steven White,
and Benjamin Van Durme. 2018. Collecting di-
verse natural language inference problems for sen-
tence representation evaluation. In Proceedings of
EMNLP.
Alec Radford, Karthik Narasimhan, Tim Salimans,
and Ilya Sutskever. 2018. Improving lan-
guage understanding by generative pre-training.
https://blog.openai.com/language-unsupervised.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan,
Dario Amodei, and Ilya Sutskever. 2019. Lan-
guage models are unsupervised multitask learners.
https://blog.openai.com/better-language-models.
Drew Reisinger, Rachel Rudinger, Francis Ferraro,
Craig Harman, Kyle Rawlins, and Benjamin Van
Durme. 2015. Semantic proto-roles. Transactions of the Association for Computational Linguistics.
Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does
string-based neural MT learn source syntax? In Pro-
ceedings of EMNLP.
Natalia Silveira, Timothy Dozat, Marie-Catherine
de Marneffe, Samuel Bowman, Miriam Connor,
John Bauer, and Christopher D. Manning. 2014. A
gold standard dependency corpus for English. In
Proceedings of the Ninth International Conference
on Language Resources and Evaluation.
Adam Teichert, Adam Poliak, Benjamin Van Durme,
and Matthew Gormley. 2017. Semantic proto-role
labeling. In Proceedings of AAAI.
Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang,
Adam Poliak, R Thomas McCoy, Najoung Kim,
Benjamin Van Durme, Sam Bowman, Dipanjan Das,
and Ellie Pavlick. 2019. What do you learn from
context? probing for sentence structure in contextu-
alized word representations. In International Con-
ference on Learning Representations.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In Proceedings of NIPS.
Ralph Weischedel, Martha Palmer, Mitchell Marcus,
Eduard Hovy, Sameer Pradhan, Lance Ramshaw,
Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle
Franchini, et al. 2013. OntoNotes release
5.0 LDC2013T19. Linguistic Data Consortium,
Philadelphia, PA.
A Appendix
A.1 Comparison of Encoders
We reproduce Figure 1 and Figure 2 (which depict
metrics on BERT-large) from the main paper be-
low, and show analogous plots for the BERT-base
models. We observe that the most important layers
for a given task appear in roughly the same relative
position on both the 24-layer BERT-large and 12-
layer BERT-base models, and that tasks generally
appear in the same order.
Additionally, in Figure A.1 we show scalar mix-
ing weights for the ELMo encoder (Peters et al.,
2018a), which consists of two LSTM layers over a
per-word character CNN. We observe that the first
LSTM layer (layer 1) is most informative for all
tasks, which corroborates the observations of Fig-
ure 2 of Peters et al. (2018a). As with BERT, we
observe that the weights are only weakly concen-
trated for the relations and SPR tasks. However,
unlike BERT, we see only a weak concentration in
the weights on the coreference task, which agrees
with the finding of Tenney et al. (2019) that ELMo
presents only weak features for coreference.
A.2 Additional Examples
We provide additional examples in the style of
Figure 3, which illustrate sequential decisions in
the layers of the BERT-base model.
Figure A.1: Scalar mixing weights for the ELMo en-
coder. Layer 0 is the character CNN that produces per-
word representations, and layers 1 and 2 are the LSTM
layers.
(a) BERT-base (b) BERT-large
Figure A.2: Summary statistics on BERT-base (left) and BERT-large (right). Columns on left show F1 dev-set
scores for the baseline ($P_\tau^{(0)}$) and full-model ($P_\tau^{(L)}$) probes. Dark (blue) are the mixing weight center of gravity;
light (purple) are the expected layer from the cumulative scores.
(a) BERT-base (b) BERT-large
Figure A.3: Layer-wise metrics on BERT-base (left) and BERT-large (right). Solid (blue) are mixing weights $s_\tau^{(\ell)}$; outlined (purple) are differential scores $\Delta_\tau^{(\ell)}$, normalized for each task. Horizontal axis is encoder layer.
Figure A.4: Trace of selected annotations that intersect
the token “basque” in the above sentence. We see the
model recognize this as part of a proper noun (NNP)
in layer 2, which leads it to revise its hypothesis about
the constituent “petro basque” from an ordinary noun
phrase (NP) to a nominal mention (NML) in layers 3-4.
We also see that from layer 3 onwards, the model rec-
ognizes “petro basque” as either an organization (ORG)
or a national or religious group (NORP), but does not
strongly disambiguate between the two.
Figure A.5: Trace of selected annotations that intersect
the second “today” in the above sentence. The model ini-
tially believes this to be a date and a common noun, but
by layer 4 realizes that this is the TV show (entity tag
WORK OF ART) and subsequently revises its hypothe-
ses about the constituent type and part-of-speech.
Figure A.6: Trace of selected coreference annotations
on the above sentence. Not shown are two coreference
edges that the model has correctly resolved at layer
0 (guessing from embeddings alone): “him” and “the
hurt man” are coreferent, as are “he” and “he”. We see
that the remaining edges, between non-coreferent men-
tions, are resolved in several stages.
Figure A.7: Trace of selected coreference and SRL an-
notations on the above sentence. The model resolves
the semantic role (purpose, ARGM-PRP) of the phrase
“to help him” in layers 5-7, then quickly resolves at
layer 8 that “him” and “he” (the agent of “stop”) are
not coreferent. Also shown is the correct prediction
that “him” is the recipient (ARG1, patient) of “help”.