Computational Linguistics
CSC 485, Summer 2020
13. Unsupervised Parsing
Gerald Penn
Department of Computer Science, University of Toronto
Copyright © 2020 Gerald Penn. All rights reserved.
Unsupervised parsing
• Parsing without training on parse trees. How could such a thing be learned?
• Well, unsupervised doesn’t always mean no supervision…
• Parts of speech
• Binary-branch restrictions
• …and we often lower the bar in terms of what we expect the system to learn:
• Unlabelled (binary) trees
• Hierarchical structure without explicit, recursive rules.
2
PRPN: parse-read-predict
• PRPN trains a sequence of components that build a parse tree on the way to predicting the next word in a string of words – a fancy language model.
• But that means that supervising the whole system end to end requires nothing more than the words themselves, in sequence…
• for a parser, that counts as unsupervised!
• When we are done, we can break off the later components and use the parser by itself.
3
Some terminology
• “Corner-dominant”
• The highest ancestor for which a node is the left corner, e.g.:
[Figure: a tree highlighting a node, the node’s parent, its corner-dominant, and the corner-dominant’s parent.]
4
Some terminology
• “Left extent”
• l_t = the left corner of a pre-terminal node’s corner-dominant’s parent, for t > 0, e.g.:
[Figure: the example tree, marking the pre-terminal node x_2, its corner-dominant, the corner-dominant’s parent, and the left extent l_2.]
6
Some terminology
• “Left extent”
• l_t = the left corner of a pre-terminal node’s corner-dominant’s parent, for t > 0, e.g.:
[Figure: the same tree, marking the pre-terminal node x_4, its corner-dominant, the corner-dominant’s parent, and the left extent l_4.]
7
Some terminology
• “Dependency-range gate”
• $g^t_i = \begin{cases} 1, & l_t \le i < t \\ 0, & 0 \le i < l_t \end{cases}$, which labels the left extent of x_t, e.g.:
[Figure: the example tree, marking x_2, x_2’s corner-dominant, and its corner-dominant’s parent.]
g^2: 1  1  --  --  --  --  --  --
8
Some terminology
• “Dependency-range gate”
• $g^t_i = \begin{cases} 1, & l_t \le i < t \\ 0, & 0 \le i < l_t \end{cases}$, which labels the left extent of x_t, e.g.:
[Figure: the same tree, marking x_4, x_4’s corner-dominant, and its corner-dominant’s parent.]
g^4: 0  0  1  1  --  --  --  --
9
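These three definitions are easy to check mechanically. Below is a minimal Python sketch (the tree encoding and helper names are mine, not the lecture's); it assumes the example tree in these figures is the standard parse of the 8-word Buffalo sentence from a later slide, which does reproduce the g^2 and g^4 rows shown here.

# A CNF tree over pre-terminals x_0 .. x_{n-1}, written as nested pairs of indices.
# Assumed example: the standard parse of the 8-word "Buffalo" sentence.
buffalo = (((0, 1), ((2, 3), 4)), (5, (6, 7)))

def span(node):
    """Leftmost and rightmost pre-terminal index covered by a node."""
    if isinstance(node, int):
        return node, node
    (lo, _), (_, hi) = span(node[0]), span(node[1])
    return lo, hi

def left_extent(tree, t):
    """l_t: the left corner of x_t's corner-dominant's parent, where the corner-dominant
    is the highest ancestor of which x_t is the left corner."""
    path, node = [], tree                 # nodes on the path from the root down to x_t
    while not isinstance(node, int):
        path.append(node)
        node = node[0] if t <= span(node[0])[1] else node[1]
    path.append(node)
    k = next(i for i, n in enumerate(path) if span(n)[0] == t)   # the corner-dominant
    return 0 if k == 0 else span(path[k - 1])[0]                 # left corner of its parent

def gate_row(tree, t):
    """g^t_i = 1 if l_t <= i < t, and 0 if 0 <= i < l_t."""
    lt = left_extent(tree, t)
    return [1 if lt <= i < t else 0 for i in range(t)]

print(left_extent(buffalo, 2), gate_row(buffalo, 2))   # 0 [1, 1]
print(left_extent(buffalo, 4), gate_row(buffalo, 4))   # 2 [0, 0, 1, 1]
print(left_extent(buffalo, 6), gate_row(buffalo, 6))   # 5 [0, 0, 0, 0, 0, 1]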
Some terminology
• “Height”
• h(w) = 1 for a word w,
• $h(n) = \max_{m \in T_n \setminus \{n\}} h(m) + 1$ for any other node n.
Note: height is not depth, nor is it h(root) − depth. Count from the bottom.
[Figure: the example tree with every node labelled by its height: the words have height 1, the pre-terminals height 2, and the root height 6.]
10
Some terminology
• “Roark-Hollingshead distance”
• $d(i) = d_i = \dfrac{h(w_{i-1}, w_i) - 2}{h(r) - 1}$,
  where $h(w_{-1}, w_0) = h(w_{L-1}, w_L) = h(r) + 1$, and
  $h(u, v) = h(u \sqcup v)$ everywhere else (trees are CNF).
[Figure: the same tree over words w_0, …, w_7 (L = 8), with e.g. d(0) = (6+1−2)/(6−1) = 1, d(2) = (5−2)/(6−1) = 3/5, and d(4) = (4−2)/(6−1) = 2/5.]
11
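Continuing the sketch above (reusing buffalo and span from it), the height and RH-distance definitions translate directly; the distances below reproduce d(0) = 1, d(2) = 3/5 and d(4) = 2/5 from the figure. The function names are again my own.

def height(node):
    """h(w) = 1 for a word, so a pre-terminal (a bare index here) has height 2;
    any other node is one higher than its tallest descendant."""
    if isinstance(node, int):
        return 2
    return max(height(node[0]), height(node[1])) + 1

def join_height(tree, i, j):
    """h(w_i, w_j) = h(w_i ⊔ w_j): the height of the lowest node covering both words, i < j."""
    for child in tree:
        if not isinstance(child, int):
            lo, hi = span(child)
            if lo <= i and j <= hi:
                return join_height(child, i, j)
    return height(tree)          # no single child covers both words, so tree is the join

def rh_distances(tree, n):
    """d_i = (h(w_{i-1}, w_i) - 2) / (h(r) - 1), with h(w_{-1}, w_0) = h(r) + 1."""
    hr = height(tree)
    return [((hr + 1 if i == 0 else join_height(tree, i - 1, i)) - 2) / (hr - 1)
            for i in range(n)]

d = rh_distances(buffalo, 8)
print(d)    # [1.0, 0.2, 0.6, 0.2, 0.4, 0.8, 0.4, 0.2]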
Roark-Hollingshead Conjecture
Q: How much of the tree does the sequence of RH distances preserve?
A: All of it (except labels)!
Very cool, because this is a “local” statistic.
[Figure: the parse tree and its RH distances for “Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.”]
12
Some terminology
• “Dependency range”
• $\alpha^t_i = \dfrac{\mathrm{sign}(d_t - d_{i+1}) + 1}{2}$, where $i < t$, e.g.:
α^6: 1  0  1  ½  0  ½  --  --
13
PRPN’s big idea
$g^t_i = P(l_t \le i) \approx \prod_{j=i+1}^{t-1} \alpha^t_j.$
α^6: 1  0  1  ½  0  ½  --  --
g^6: 0  0  0  0  0  1  --  --
14
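Continuing the same sketch, the dependency range and the gate product come straight from these d_i values; comparing the approximate gates with the exact g^6 row computed from the tree earlier shows how close the surrogate is.

def dependency_range(d, t):
    """alpha^t_i = (sign(d_t - d_{i+1}) + 1) / 2, for i < t."""
    sign = lambda x: (x > 0) - (x < 0)
    return [(sign(d[t] - d[i + 1]) + 1) / 2 for i in range(t)]

def soft_gates(alpha):
    """g^t_i ≈ prod over j = i+1 .. t-1 of alpha^t_j."""
    t, g = len(alpha), [1.0] * len(alpha)
    for i in range(t - 2, -1, -1):
        g[i] = g[i + 1] * alpha[i + 1]
    return g

alpha6 = dependency_range(d, 6)
print(alpha6)                 # [1.0, 0.0, 1.0, 0.5, 0.0, 0.5]  -- the alpha^6 row above
print(soft_gates(alpha6))     # [0.0, 0.0, 0.0, 0.0, 0.5, 1.0]
print(gate_row(buffalo, 6))   # [0, 0, 0, 0, 0, 1]              -- the exact gates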
PRPN’s big idea
$g^t_i = P(l_t \le i) \approx \prod_{j=i+1}^{t-1} \alpha^t_j.$
[Figure: the tree again, marking the joins x_2 ⊔ x_3, x_4 ⊔ x_5, and x_5 ⊔ x_6.]
α^6: 1  0  1  ½  0  ½  --  --
g^6: 0  0  0  0  0  1  --  --
15
PRPN’s big idea
$g^t_i = P(l_t \le i) \approx \prod_{j=i+1}^{t-1} \alpha^t_j.$
[Figure: the tree again, marking the join x_4 ⊔ x_5.]
α^5: 1  1  1  1  ½  --  --  --
g^5: 1  1  1  1  1  --  --  --
16
PRPN’s big idea
• But if $P(l_t \le i) \approx \prod_{j=i+1}^{t-1} \alpha^t_j$, then:
  $P(l_t \le i) \approx \prod_{j=i+2}^{t-1} \alpha^t_j \cdot \alpha^t_{i+1} \approx P(l_t \le i+1) \cdot \alpha^t_{i+1}$, so:
  $\alpha^t_{i+1} \approx P(l_t \ne i+1 \mid l_t \le i+1)$,
  $1 - \alpha^t_i \approx P(l_t = i \mid l_t \le i)$, and:
  $P(l_t = i) = P(l_t = i \mid l_t \le i) \cdot P(l_t \le i) \approx (1 - \alpha^t_i) \cdot \prod_{j=i+1}^{t-1} \alpha^t_j.$
• This is an example of the well-known stick-breaking process. When $\alpha_j = 1 - \beta_j$, and the $\beta_j$ are samples from beta distributions, this process is an instance of a Dirichlet process.
17
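Concretely, and continuing the sketch, each P(l_t = i) is the chance of breaking the stick exactly at position i; the masses sum to 1 − ∏_j α^t_j, which is 1 here because α^6_1 = 0.

def left_extent_distribution(alpha):
    """P(l_t = i) ≈ (1 - alpha^t_i) * prod over j = i+1 .. t-1 of alpha^t_j."""
    return [(1 - a) * g for a, g in zip(alpha, soft_gates(alpha))]

print(left_extent_distribution(alpha6))
# [0.0, 0.0, 0.0, 0.0, 0.5, 0.5]: half the mass on i = 4, half on the true l_6 = 5.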
PRPN’s big idea
• As above, $P(l_t = i) \approx (1 - \alpha^t_i) \cdot \prod_{j=i+1}^{t-1} \alpha^t_j.$
• This is very well suited to modelling the dependence of l_t upon as many words/pre-terminals as we want.
18
PRPN: parse
• Soften up “Dependency range”:
• $\alpha^t_i = \dfrac{\mathrm{sign}(d_t - d_{i+1}) + 1}{2}$, where $i < t$, becomes:
• $\alpha^t_i = \dfrac{\mathrm{hardtanh}\big((d_t - d_{i+1}) \cdot \tau\big) + 1}{2}$, where τ is temperature,
• and $\mathrm{hardtanh}(x) = \max(-1, \min(1, x))$.
• Then learn RH distance with a 2-layer convolution:
• $q_t = \mathrm{ReLU}(W_q\,[e_{t-L}; e_{t-L+1}; \ldots; e_t] + b_q)$   (a stack of the word vectors for w_{t−L}, …, w_t),
• $d_t = \mathrm{ReLU}(W_d q_t + b_d)$.
• But we’re not going to supervise this with d_t from actual trees...
19
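Here is a rough NumPy sketch of this parsing step. The class name, shapes, initialization, and the value of τ are illustrative assumptions, not the published PRPN hyperparameters; positions before the start of the sentence are simply zero-padded.

import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def hardtanh(x):
    return np.maximum(-1.0, np.minimum(1.0, x))

class ParsingNetwork:
    """Two-layer convolution over the last L+1 word embeddings -> d_t, then soft gates."""
    def __init__(self, emb_dim, hid_dim, L, tau=10.0):
        self.L, self.tau = L, tau
        self.W_q = rng.normal(0.0, 0.1, (hid_dim, (L + 1) * emb_dim))
        self.b_q = np.zeros(hid_dim)
        self.W_d = rng.normal(0.0, 0.1, (1, hid_dim))
        self.b_d = np.zeros(1)

    def distance(self, embeddings, t):
        """q_t = ReLU(W_q [e_{t-L}; ...; e_t] + b_q), d_t = ReLU(W_d q_t + b_d)."""
        window = [embeddings[i] if i >= 0 else np.zeros_like(embeddings[0])
                  for i in range(t - self.L, t + 1)]
        q_t = relu(self.W_q @ np.concatenate(window) + self.b_q)
        return relu(self.W_d @ q_t + self.b_d).item()

    def gates(self, d, t):
        """alpha^t_i = (hardtanh((d_t - d_{i+1}) * tau) + 1) / 2, for i < t;
        g^t_i ≈ prod over j = i+1 .. t-1 of alpha^t_j."""
        alpha = (hardtanh((d[t] - np.array(d[1:t + 1])) * self.tau) + 1) / 2
        g = np.ones(t)
        for i in range(t - 2, -1, -1):
            g[i] = g[i + 1] * alpha[i + 1]
        return alpha, g

parser = ParsingNetwork(emb_dim=4, hid_dim=8, L=2)
embs = [rng.normal(size=4) for _ in range(8)]          # stand-in word embeddings
d_soft = [parser.distance(embs, t) for t in range(8)]
alpha6_soft, g6_soft = parser.gates(d_soft, t=6)       # soft analogues of the rows above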
PRPN: read
• Instead, we couple the input to memory states m_i and use RH distance to interpolate mixtures of previous time steps into “summary vectors” that predict subsequent memory states:
• $k_t = W_m m_{t-1} + W_e e_t$,
• $\bar{s}^t_i = \mathrm{softmax}_i\!\left(\frac{m_i^{\top} k_t}{\sqrt{\dim(k)}}\right)$,
• $s^t_i = \frac{g^t_i}{\sum_j g^t_j}\,\bar{s}^t_i$   (big idea: $g^t_i$ depends on the $d_i$’s now),
• $\bar{c}_t = \sum_{i=1}^{t-1} s^t_i \cdot c_i$ and $\bar{m}_t = \sum_{i=1}^{t-1} s^t_i \cdot m_i$   (the summary vectors),
• an LSTM cell then takes e_t together with the summary $(\bar{c}_t, \bar{m}_t)$ and produces $(c_t, m_t)$.
20
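A matching sketch of the reading step's structured attention (NumPy again; the function name and toy dimensions are mine). It stops at the summary vectors; in the full model they and e_t are handed to an LSTM cell in place of (c_{t−1}, m_{t−1}).

import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def read_summary(memories, cells, e_t, g_t, W_m, W_e):
    """k_t      = W_m m_{t-1} + W_e e_t
       sbar^t_i = softmax_i(m_i . k_t / sqrt(dim k))
       s^t_i    = (g^t_i / sum_j g^t_j) * sbar^t_i      (attention gated by the parse)
       cbar_t   = sum_i s^t_i c_i,   mbar_t = sum_i s^t_i m_i."""
    k_t = W_m @ memories[-1] + W_e @ e_t
    s_bar = softmax(np.array([m_i @ k_t for m_i in memories]) / np.sqrt(k_t.size))
    s = (g_t / g_t.sum()) * s_bar
    c_bar = sum(si * ci for si, ci in zip(s, cells))
    m_bar = sum(si * mi for si, mi in zip(s, memories))
    return c_bar, m_bar

# Toy usage: five previous steps, eight-dimensional states, gates from the parsing network.
rng = np.random.default_rng(1)
dim = 8
memories = [rng.normal(size=dim) for _ in range(5)]    # m_0 .. m_4
cells = [rng.normal(size=dim) for _ in range(5)]       # c_0 .. c_4
W_m, W_e = rng.normal(0, 0.1, (dim, dim)), rng.normal(0, 0.1, (dim, dim))
g5 = np.array([1.0, 1.0, 1.0, 1.0, 1.0])               # the g^5 row from earlier
c_bar, m_bar = read_summary(memories, cells, rng.normal(size=dim), g5, W_m, W_e)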
PRPN: predict
• Now, predict e_{t+1}, given m_0, …, m_t, which in turn depend upon e_0, …, e_t:
• $k_t = W_m m_{t-1} + W_e e_t$,
• $\bar{s}^t_i = \mathrm{softmax}_i\!\left(\frac{m_i^{\top} k_t}{\sqrt{\dim(k)}}\right)$,
• $r^t_i = \frac{g^{t+1}_i}{\sum_j g^{t+1}_j}\,\bar{s}^t_i$   ($g^{t+1}$ depends on d_{t+1}),
• $\bar{l}_t = \sum_{i=l_{t+1}}^{t-1} r^t_i \cdot m_i$   (stick-breaking process: also depends on d_{t+1}),
• Estimate $\tilde{d}_{t+1} \approx \mathrm{ReLU}(W_d m_t + b_d)$,
• then estimate $\tilde{e}_{t+1} = \tanh(W_f [\bar{l}_t; m_t] + b_f)$.
21
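And a brief sketch of the prediction step (NumPy, names mine). It reuses the attention weights s̄^t from the reading step but gates them with g^{t+1}, which is built from the estimated distance d̃_{t+1}; as a simplification, the sum runs over all previous positions and lets the gates do the truncation softly.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def estimate_next_distance(m_t, W_d, b_d):
    """d~_{t+1} ≈ ReLU(W_d m_t + b_d): the RH distance of the not-yet-seen word,
    from which the gates g^{t+1} are built."""
    return relu(W_d @ m_t + b_d).item()

def predict_next_embedding(memories, m_t, s_bar, g_next, W_f, b_f):
    """r^t_i    = (g^{t+1}_i / sum_j g^{t+1}_j) * sbar^t_i
       lbar_t   = sum_i r^t_i m_i          (gates softly restrict i to l_{t+1} .. t-1)
       e~_{t+1} = tanh(W_f [lbar_t; m_t] + b_f)."""
    r = (g_next / g_next.sum()) * s_bar
    l_bar = sum(ri * mi for ri, mi in zip(r, memories))
    return np.tanh(W_f @ np.concatenate([l_bar, m_t]) + b_f)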
CCM: brackets and spans
• Predict syntax directly, but not with trees.
• Instead, use bracket matrices and “spans,” which here consist also of yields and contexts:
22
CCM: brackets and spans
• Predict syntax directly, but not with trees.
• Instead, use bracket matrices and “spans,” which here consist also of yields and contexts.
• $P(S, B) = P(B)\, P(S \mid B)$
• $P(S \mid B) = \prod_{i,j} P(\alpha_{ij} \mid B_{ij})\, P(x_{ij} \mid B_{ij})$   (one factor each for the yield and the context of span (i, j))
• Then, use Expectation Maximization:
• E-step: calculate P(B | S, Θ).
• P(B) is not recalculated – it is a uniform distribution over tree-consistent bracketings.
• M-step: fixing those, calculate: $\operatorname*{argmax}_{\Theta} \sum_B P(B \mid S, \Theta) \log P(S, B \mid \Theta)$.
23
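To make the E-step concrete, here is a brute-force sketch for toy inputs (all names, the toy POS sentence, and the seed probabilities are mine): it enumerates every binary bracketing, scores P(S, B) under the current span and context multinomials, and normalizes. Real CCM implementations replace the enumeration with an inside-outside-style dynamic program, and the M-step then re-estimates the multinomials from these posteriors.

def bracketings(i, j):
    """All binary bracketings of [i, j): each is a frozenset of constituent spans."""
    if j - i == 1:
        return [frozenset([(i, j)])]
    out = []
    for k in range(i + 1, j):
        for left in bracketings(i, k):
            for right in bracketings(k, j):
                out.append(left | right | {(i, j)})
    return out

def joint_score(sentence, B, p_yield, p_context, smooth=1e-3):
    """P(S, B) up to a constant: P(B) is uniform, and every span (i, j) contributes
    P(yield | B_ij) * P(context | B_ij), with B_ij = 1 for constituents, 0 for distituents."""
    n, score = len(sentence), 1.0
    for i in range(n):
        for j in range(i + 1, n + 1):
            label = 1 if (i, j) in B else 0
            yield_ij = tuple(sentence[i:j])
            context_ij = (sentence[i - 1] if i > 0 else '#',
                          sentence[j] if j < n else '#')
            score *= (p_yield[label].get(yield_ij, smooth)
                      * p_context[label].get(context_ij, smooth))
    return score

def e_step(sentence, p_yield, p_context):
    """P(B | S, Theta), by enumeration (feasible only for very short sentences)."""
    Bs = bracketings(0, len(sentence))
    scores = [joint_score(sentence, B, p_yield, p_context) for B in Bs]
    Z = sum(scores)
    return [(B, s / Z) for B, s in zip(Bs, scores)]

# Toy run over a POS sequence; the model is seeded to like "DT NN" as a constituent.
sentence = ['DT', 'NN', 'VBD', 'RB']
p_yield = [{}, {('DT', 'NN'): 0.05}]     # index 0: distituent, 1: constituent
p_context = [{}, {}]
for B, p in e_step(sentence, p_yield, p_context):
    print(sorted(B), round(p, 3))        # bracketings grouping DT NN get most of the mass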
Performance on WSJ10
24