
Computational
Linguistics
Summer 2020
CSC 485
13
13. Unsupervised Parsing
Gerald Penn
Department of Computer Science, University of Toronto
Copyright © 2020 Gerald Penn. All rights reserved.

Unsupervised parsing
• Parsing without training on parse trees. How could such a thing be learned?
• Well, unsupervised doesn’t always mean no supervision…
• Parts of speech
• Binary-branch restrictions
• …and we often lower the bar in terms of what we expect the system to learn:
• Unlabelled (binary) trees
• Hierarchical structure without explicit, recursive rules.
2

PRPN: parse-read-predict
• PRPN trains a sequence of components that build a parse tree on the way to predicting the next word in a string of words – a fancy language model.
• But then supervising the whole system end to end requires nothing more than the words themselves, presented in sequence…
• for a parser, that counts as unsupervised!
• When we are done, we can break off the later components and use the parser by itself.
3

Some terminology
• “Corner-dominant”
• The highest ancestor for which a node is the left corner, e.g.:
[Tree diagram: a node, the node’s parent, its corner-dominant, and the corner-dominant’s parent]
4
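
To make the definition concrete, here is a minimal sketch (not from the lecture: the nested-tuple representation, the toy sentence and the function names are mine) that finds a leaf’s corner-dominant in an unlabelled binary tree:

```python
def leaves(t):
    """Left-to-right leaves of a binary tree given as nested 2-tuples of strings."""
    return [t] if isinstance(t, str) else leaves(t[0]) + leaves(t[1])

def corner_dominant(tree, leaf):
    """Highest ancestor whose left corner (leftmost leaf) is `leaf`.

    Assumes binary branching and that `leaf` occurs exactly once in `tree`."""
    if leaves(tree)[0] == leaf:
        return tree                       # `leaf` is the left corner of this whole subtree
    left, right = tree
    return corner_dominant(left if leaf in leaves(left) else right, leaf)

# Toy unlabelled binary tree for "the dog chased a cat"
toy = (("the", "dog"), (("chased", "a"), "cat"))
print(corner_dominant(toy, "chased"))     # (('chased', 'a'), 'cat')
print(corner_dominant(toy, "the"))        # the whole tree (its left corner is "the")
print(corner_dominant(toy, "dog"))        # 'dog' -- no larger constituent starts with it
```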


Some terminology
• “Left extent”
• l_t = the left corner of a pre-terminal node’s corner-dominant’s parent, for t > 0, e.g.:
[Tree diagram: x_2 (pre-terminal node), its corner-dominant, that node’s parent, and l_2 (left extent)]
6

Some terminology
• “Left extent”
• l_t = the left corner of a pre-terminal node’s corner-dominant’s parent, for t > 0, e.g.:
[Tree diagram: x_4 (pre-terminal node), its corner-dominant, that node’s parent, and l_4 (left extent)]
7
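
A companion sketch for the left extent (again illustrative only; the toy sentence and names are mine). It uses the equivalent characterization that l_t is the start of the smallest constituent containing x_t that does not begin at position t, which is exactly the corner-dominant’s parent:

```python
def n_leaves(t):
    """Number of leaves in a binary tree given as nested 2-tuples of strings."""
    return 1 if isinstance(t, str) else n_leaves(t[0]) + n_leaves(t[1])

def spans(tree, start=0):
    """Half-open (i, j) spans, over leaf positions, of every subtree (leaves included)."""
    if isinstance(tree, str):
        return [(start, start + 1)]
    left, right = tree
    mid = start + n_leaves(left)
    return ([(start, start + n_leaves(tree))]
            + spans(left, start) + spans(right, mid))

def left_extent(tree, t):
    """l_t: the left corner of x_t's corner-dominant's parent, for t > 0.

    Equivalently: the start of the smallest span containing t that does not start at t."""
    containing = [(i, j) for (i, j) in spans(tree) if i < t < j]
    return min(containing, key=lambda s: s[1] - s[0])[0]

# Toy sentence "the dog chased a cat", leaf positions 0..4
toy = (("the", "dog"), (("chased", "a"), "cat"))
print([left_extent(toy, t) for t in range(1, 5)])   # [0, 0, 2, 2]
```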

Some terminology
• “Dependency-range gate”
• g_i^t = {1 if l_t ≤ i < t; 0 if 0 ≤ i < l_t} labels the left extent of x_t, e.g.:
[Tree diagram: x_2, its corner-dominant and that corner-dominant’s parent; g^2: 1 1 -- -- -- -- -- --]
8

Some terminology
• “Dependency-range gate”
• g_i^t = {1 if l_t ≤ i < t; 0 if 0 ≤ i < l_t} labels the left extent of x_t, e.g.:
[Tree diagram: x_4, its corner-dominant and that corner-dominant’s parent; g^4: 0 0 1 1 -- -- -- --]
9

Some terminology
• “Height”
• h(w) = 1 for a word w,
• h(n) = max_{m ∈ T_n \ {n}} h(m) + 1, where T_n is the subtree rooted at n.
• Note: height is not depth, nor is it h(root) − depth. Count from the bottom.
[Tree diagram for an 8-word sentence with every node labelled by its height: words at 1, pre-terminals at 2, up to the root at 6]
10

Some terminology
• “Roark-Hollingshead distance”
• d(i) = d_i = (h(w_{i−1}, w_i) − 2) / (h(r) − 1),
  where h(w_{−1}, w_0) = h(w_{L−1}, w_L) = h(r) + 1, h(u, v) = h(u ⊔ v) everywhere else (trees are CNF), r is the root and L the sentence length.
[Worked example on the height-6 tree above, word positions −1, 0, 1, …, 7 = L−1:
 d(0) = (6+1−2)/(6−1) = 1, d(2) = (5−2)/(6−1) = 3/5, d(4) = (4−2)/(6−1) = 2/5]
11

Roark-Hollingshead Conjecture
Q: How much of the tree do these distances preserve?
A: All of it (except labels)! Very cool, because this is a “local” statistic.
[Example tree: “Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo”]
12
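
Here is a small sketch of height, the RH distance and the conjecture on a toy tree (the tree, the names and the greedy reconstruction are mine, not the lecture’s; recursively splitting at the largest boundary distance is one standard way to read an unlabelled binary tree back off such a sequence):

```python
def boundary_heights(t):
    """Return (bs, h): bs[i] = h(w_{i-1}, w_i) = h(w_{i-1} ⊔ w_i) for each interior
    word boundary of t's yield, and h = the height of t (leaves have height 1)."""
    if isinstance(t, str):
        return [], 1
    lb, lh = boundary_heights(t[0])
    rb, rh = boundary_heights(t[1])
    h = 1 + max(lh, rh)
    return lb + [h] + rb, h        # the boundary between the two children joins at this node

def rh_distances(tree):
    """d_i = (h(w_{i-1}, w_i) - 2) / (h(r) - 1); the sentence-edge boundaries
    get h(r) + 1 by convention, so their distance is exactly 1."""
    bs, hr = boundary_heights(tree)
    return [1.0] + [(h - 2) / (hr - 1) for h in bs] + [1.0]

def rebuild(words, d_int):
    """The conjecture, constructively: d_int[i] is the distance at the boundary
    between words[i] and words[i+1]; splitting recursively at the largest
    distance recovers the unlabelled binary tree."""
    if len(words) == 1:
        return words[0]
    k = max(range(len(d_int)), key=lambda i: d_int[i])
    return (rebuild(words[:k + 1], d_int[:k]),
            rebuild(words[k + 1:], d_int[k + 1:]))

toy = (("the", "dog"), (("chased", "a"), "cat"))
d = rh_distances(toy)
print(d)                                                          # ~ [1, 0, 2/3, 0, 1/3, 1]
print(rebuild("the dog chased a cat".split(), d[1:-1]) == toy)    # True
```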

Some terminology
• “Dependency range”
• α_i^t = (sign(d_t − d_i) + 1) / 2, where i < t, e.g.:
[Worked example: α^6: 1 0 1 1 0 1 -- --]
13

PRPN’s big idea
• g_i^t = P(l_t ≤ i) ≈ ∏_{j=i+1}^{t−1} α_j^t.
[Worked examples (slides 14-16), highlighting the constituents x_2 ⊔ x_3, x_4 ⊔ x_5 and x_5 ⊔ x_6:
 α^6: 1 0 1 1 0 1 -- --   g^6: 0 0 0 0 0 1 -- --
 α^5: 1 1 1 1 1 -- -- --   g^5: 1 1 1 1 1 -- -- --]
14-16

PRPN’s big idea
• But if P(l_t ≤ i) ≈ ∏_{j=i+1}^{t−1} α_j^t, then:
  P(l_t ≤ i) ≈ (∏_{j=i+2}^{t−1} α_j^t) · α_{i+1}^t ≈ P(l_t ≤ i+1) · α_{i+1}^t, so:
  α_{i+1}^t ≈ P(l_t ≠ i+1 | l_t ≤ i+1), 1 − α_i^t ≈ P(l_t = i | l_t ≤ i), and:
  P(l_t = i) = P(l_t = i | l_t ≤ i) · P(l_t ≤ i) ≈ (1 − α_i^t) · ∏_{j=i+1}^{t−1} α_j^t.
• This is an example of the well-known stick-breaking process. When the α_j are 1 − β_j, and the β_j are samples from beta distributions, this process is an instance of a Dirichlet process.
• This is very well suited to modelling the dependence of l_t upon as many words/pre-terminals as we want (see the numeric sketch at the end of these notes).
17-18

PRPN: parse
• Soften up “Dependency range”:
  • α_i^t = (sign(d_t − d_i) + 1) / 2, where i < t, becomes:
  • α_i^t = (hardtanh((d_t − d_i) · τ) + 1) / 2, where τ is a temperature,
  • and hardtanh(x) = max(−1, min(1, x)).
• Then learn RH distance with a 2-layer convolution over the word vectors e_{t−L}, …, e_t for w_{t−L}, …, w_t:
  • q_t = ReLU(W_q [e_{t−L}; e_{t−L+1}; …; e_t] + b_q),
  • d_t = ReLU(W_d q_t + b_d).
• But we’re not going to supervise this with d_t from actual trees...
19

PRPN: read
• Instead, we couple the input to memory states m_i and use RH distance to interpolate mixtures of previous time steps into “summary vectors” that predict subsequent memory states:
  • k_t = W_m m_{t−1} + W_e e_t,
  • \bar{s}_i^t = softmax_i(m_i^T k_t / √dim(k)),
  • s_i^t = (g_i^t / Σ_j g_j^t) · \bar{s}_i^t   ← big idea: this depends on the d_i’s now,
  • \bar{c}_t = Σ_{i=1}^{t−1} s_i^t · c_i and \bar{m}_t = Σ_{i=1}^{t−1} s_i^t · m_i are the summary vectors,
  • which, together with e_t, feed an LSTM-style update that produces c_t and m_t.
20

PRPN: predict
• Now, predict e_{t+1}, given m_0, …, m_t, which in turn depend upon e_0, …, e_t:
  • k_t = W_m m_{t−1} + W_e e_t,
  • \bar{s}_i^t = softmax_i(m_i^T k_t / √dim(k)),
  • r_i^t = (g_i^{t+1} / Σ_j g_j^{t+1}) · \bar{s}_i^t   ← depends on d_{t+1},
  • \bar{l}_t = Σ_{i=l_{t+1}}^{t−1} r_i^t · m_i   ← stick-breaking process: also depends on d_{t+1}.
• Estimate \tilde{d}_{t+1} ≈ ReLU(W_d m_t + b_d),
• then estimate \tilde{e}_{t+1} = tanh(W_f [m_t; \bar{l}_t] + b_f).
21

CCM: brackets and spans
• Predict syntax directly, but not with trees.
• Instead, use bracket matrices and “spans,” which here consist also of yields and contexts.
22

CCM: brackets and spans
• P(S, B) = P(B) P(S | B)
• P(S | B) = ∏_{i,j} P(α_ij | B_ij) P(x_ij | B_ij), where α_ij is the yield and x_ij the context of span (i, j).
• Then, use Expectation Maximization:
  • E-step: calculate P(B | S, \hat{Θ}).
    • P(B) is not recalculated – it is a uniform distribution over tree-consistent bracketings.
  • M-step: fixing those, calculate argmax_Θ Σ_B P(B | S, \hat{Θ}) log P(S, B | Θ).
23

Performance on WSJ10
24
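
Finally, the numeric sketch promised above: it turns a vector of RH distances into softened α’s, gates and a stick-breaking distribution over l_t. The function names, the temperature value and the toy distances are mine, and the α formula is the reconstruction given on the “PRPN: parse” slide, not code from the PRPN authors.

```python
import numpy as np

def alphas(d, t, tau=10.0):
    """Softened dependency range: α_i^t = (hardtanh((d_t - d_i)·τ) + 1) / 2
    for i < t, where hardtanh(x) = max(-1, min(1, x))."""
    return (np.clip((d[t] - d[:t]) * tau, -1.0, 1.0) + 1.0) / 2.0

def left_extent_distribution(d, t, tau=10.0):
    """Gates g_i^t = P(l_t <= i) ≈ prod_{j=i+1}^{t-1} α_j^t, and the stick-breaking
    distribution P(l_t = i) ≈ (1 - α_i^t)·g_i^t, with the leftover stick mass
    assigned to i = 0 (since P(l_t = 0) = P(l_t <= 0) = g_0^t)."""
    a = alphas(d, t, tau)                     # a[i] = α_i^t for i = 0 .. t-1
    g = np.ones(t)
    if t > 1:
        g[:-1] = np.cumprod(a[:0:-1])[::-1]   # g[i] = a[i+1]·a[i+2]·...·a[t-1]
    p = (1.0 - a) * g
    p[0] = g[0]                               # leftover mass; p now sums to 1
    return g, p

# RH distances for the toy sentence "the dog chased a cat":
# d[i] is the distance at the boundary *before* word i (d[0] is the left sentence edge).
d = np.array([1.0, 0.0, 2/3, 0.0, 1/3])
g, p = left_extent_distribution(d, t=4)       # word 4 = "cat"
print(g)                                      # [0. 0. 1. 1.]
print(p)                                      # [0. 0. 1. 0.]  ->  l_4 = 2, as in the earlier sketches
```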