4 Probabilistic CFGs
The “hidden structure” here has a more complicated shape than we had to deal with for PFSAs:
• With a (P)FSA, the hidden additional structure was a collection of states, indexed by their position
in a sequence (i.e. Q0, Q1, Q2, …).
• Now with a (P)CFG, the hidden additional structure is a collection of nonterminals, indexed by their
position in a tree.
The hidden structure associated with the one tree via which we can generate ‘watches with telescopes’ (as
a VP) can be expressed as follows, where the nonterminals are indexed by Gorn addresses:
(1) Pr(Nε = VP, N0 = VP, X0 = watches, N1 = PP, N10 = P, X10 = with, N11 = NP, X11 = telescopes)
[VP [VP watches] [PP [P with] [NP telescopes]]]
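To make the Gorn-address bookkeeping concrete, the hidden structure in (1) can be written out as two Python dictionaries, one giving the nonterminal at each address and one giving the terminal below it (a minimal sketch; the variable names are just illustrative):

# Nonterminals indexed by Gorn address: the root has the empty address,
# a node's left daughter appends 0 and its right daughter appends 1.
nonterminal_at = {'': 'VP', '0': 'VP', '1': 'PP', '10': 'P', '11': 'NP'}

# Terminals dominated at the leaf-level addresses (X0, X10 and X11 in (1)).
terminal_at = {'0': 'watches', '10': 'with', '11': 'telescopes'}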
We’ll set up a PCFG as the kind of thing that defines joint probabilities like (1).
(2) A probabilistic context-free grammar (PCFG) is a four-tuple ⟨N, Σ, I, R⟩ where:
• N is a finite set of nonterminal symbols;
• Σ, the alphabet, is a finite set of terminal symbols (disjoint from N);
• I : N → [0, 1] is a function assigning initial probabilities; and
• R : N × ((N × N ) ∪ Σ) → [0, 1] is a function assigning rule probabilities; with the additional requirements that
• ∑_{n∈N} I(n) = 1;
• for all n ∈ N, ∑_{l∈N} ∑_{r∈N} R(n, l, r) + ∑_{x∈Σ} R(n, x) = 1; and
• there are no “dead ends”.
We take the values assigned by the I and R functions to be giving us “local” conditional probabilities:
(3) I(n) = Pr(Nε = n)
    R(n, l, r) = Pr(Nα0 = l, Nα1 = r | Nα = n)
    R(n, x) = Pr(Xα = x | Nα = n)
And we make the following important independence assumptions, from which it will follow that the big joint probability in (1) can be computed as a product of the grammar’s local conditional probabilities:
(4) a. Limited horizon:
       The choice of how to expand a node Nα is independent of the ancestors of node Nα.
    b. Context-free-ness:
       The choice of how to expand a node Nα is independent of anything to the left or right of node Nα.
    c. Place invariance:
       The choice of how to expand a node Nα is independent of the position α of the node in the tree.
Specifically:
(5) Pr(Nε = VP, N0 = VP, X0 = watches, N1 = PP, N10 = P, X10 = with, N11 = NP, X11 = telescopes)
    = Pr(Nε = VP) × Pr(N0 = VP, N1 = PP | Nε = VP) × Pr(X0 = watches | N0 = VP)
      × Pr(N10 = P, N11 = NP | N1 = PP) × Pr(X10 = with | N10 = P) × Pr(X11 = telescopes | N11 = NP)
    = I(VP) × R(VP, VP, PP) × R(VP, watches) × R(PP, P, NP) × R(P, with) × R(NP, telescopes)
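To see how this product comes out numerically, here is a minimal Python sketch of a toy PCFG in the format of (2), with I and R as dictionaries. The particular probability values are invented for illustration only; the assertions check the two sum-to-one requirements in (2), and the final product is exactly the one in (5).

# A toy PCFG: I maps nonterminals to initial probabilities; R maps either
# (parent, left, right) triples or (parent, terminal) pairs to rule probabilities.
# (All probability values here are made up for illustration.)
I = {'VP': 1.0, 'PP': 0.0, 'P': 0.0, 'NP': 0.0}
R = {
    ('VP', 'VP', 'PP'): 0.3, ('VP', 'watches'): 0.7,
    ('PP', 'P', 'NP'): 1.0,
    ('P', 'with'): 1.0,
    ('NP', 'telescopes'): 1.0,
}

# Requirement: the initial probabilities sum to one.
assert abs(sum(I.values()) - 1.0) < 1e-9

# Requirement: for each nonterminal, the probabilities of all of its
# expansions (binary rules plus terminal rules) sum to one.
for n in I:
    assert abs(sum(p for (parent, *rest), p in R.items() if parent == n) - 1.0) < 1e-9

# The joint probability in (5), as a product of local conditional probabilities.
prob = (I['VP']
        * R[('VP', 'VP', 'PP')] * R[('VP', 'watches')]
        * R[('PP', 'P', 'NP')] * R[('P', 'with')] * R[('NP', 'telescopes')])
print(prob)   # 1.0 * 0.3 * 0.7 * 1.0 * 1.0 * 1.0 = 0.21 with the made-up numbers above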
Some notation for dealing with this hidden tree structure:
(6) a. Xα is the random variable for the substring dominated by the node at position α.
b. X←α is the random variable for the substring to the left of the node at position α.
c. X→α is the random variable for the substring to the right of the node at position α.
The probability of a string involves summing over all tree structures:
(7) Pr(Xε = watches with telescopes) = ∑_t Pr(Xε = watches with telescopes, T = t)
It's hard to enumerate all those possible choices of t directly, so instead we interleave the enumeration of trees with the calculation of probabilities for subparts of the tree (just like we did for boolean CFGs).
4.1 Inside probabilities
Let’s just add in a specific choice of root nonterminal:
(8) Pr(Xε = x1…xm) = ∑_{n∈N} Pr(Xε = x1…xm, Nε = n)
                   = ∑_{n∈N} Pr(Nε = n) × Pr(Xε = x1…xm | Nε = n)
A probability of the form Pr(Xα = w | Nα = n) is what we previously called inside(w)(n). So the equation says the same thing as this equation that we’ve seen before:
(9) f(w) = ∑_{n∈N} I(n) × inside(w)(n)
We can also break these down recursively:
(10) Pr(Xα = x1…xm | Nα = n)
     = ∑_i ∑_l ∑_r Pr(Nα0 = l, Nα1 = r, Xα0 = x1…xi, Xα1 = xi+1…xm | Nα = n)
     = ∑_i ∑_l ∑_r Pr(Nα0 = l, Nα1 = r | Nα = n) × Pr(Xα0 = x1…xi | Nα0 = l) × Pr(Xα1 = xi+1…xm | Nα1 = r)
This is the recursive description of inside values that we saw before:
(11) f(w) = ∑_{n∈N} I(n) × inside(w)(n)
     inside(x1…xm)(n) = ∑_i ∑_l ∑_r R(n, l, r) × inside(x1…xi)(l) × inside(xi+1…xm)(r)
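Here is a small Python sketch of these inside calculations, using the same illustrative dictionary format for the grammar as the sketch above, and adding the base case for a single terminal, inside(x)(n) = R(n, x), which isn't repeated in (11) but follows from (3). The names here are illustrative, not the course's own code.

import functools

# The same toy grammar as in the earlier sketch (probabilities made up for illustration).
I = {'VP': 1.0, 'PP': 0.0, 'P': 0.0, 'NP': 0.0}
R = {
    ('VP', 'VP', 'PP'): 0.3, ('VP', 'watches'): 0.7,
    ('PP', 'P', 'NP'): 1.0,
    ('P', 'with'): 1.0,
    ('NP', 'telescopes'): 1.0,
}
nonterminals = list(I)

@functools.lru_cache(maxsize=None)
def inside(xs, n):
    """Pr(X_alpha = xs | N_alpha = n), where xs is a tuple of terminal symbols."""
    if len(xs) == 1:
        # Base case: a one-symbol string can only come from rewriting n as that terminal.
        return R.get((n, xs[0]), 0.0)
    # Recursive case, as in (11): sum over split points i and daughter labels l and r.
    return sum(R.get((n, l, r), 0.0) * inside(xs[:i], l) * inside(xs[i:], r)
               for i in range(1, len(xs))
               for l in nonterminals
               for r in nonterminals)

def f(xs):
    """Pr(X_epsilon = xs): sum over the choice of root nonterminal, as in (8)/(11)."""
    return sum(I[n] * inside(xs, n) for n in nonterminals)

print(f(('watches', 'with', 'telescopes')))   # 0.21 with the toy grammar above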
4.2 Outside probabilities
Things work just the same way for outside probabilities. If we add in just a single chosen leaf node at address α:
(12) Pr(Xε = x1…xm)
     = ∑_{n∈N} Pr(Nα = n, X←α = x1…xi−1, Xα = xi, X→α = xi+1…xm)
     = ∑_{n∈N} Pr(Nα = n, X←α = x1…xi−1, X→α = xi+1…xm) × Pr(Xα = xi | Nα = n)

which we saw before as:

(13) f(x1…xm) = ∑_n outside(x1…xi−1, xi+1…xm)(n) × R(n)(xi)
And the recursive description:
(14) Pr(X←α = y1…yn, X→α = z1…zm, Nα = x)
     = Pr(X←β0 = y1…yn, X→β0 = z1…zm, Nβ0 = x) + Pr(X←β1 = y1…yn, X→β1 = z1…zm, Nβ1 = x)
     = ∑_i ∑_p ∑_r Pr(Nβ0 = x, Nβ1 = r | Nβ = p) × Pr(Xβ1 = z1…zi | Nβ1 = r) × Pr(X←β = y1…yn, X→β = zi+1…zm, Nβ = p)
     + ∑_i ∑_p ∑_l Pr(Nβ0 = l, Nβ1 = x | Nβ = p) × Pr(Xβ0 = yi+1…yn | Nβ0 = l) × Pr(X←β = y1…yi, X→β = z1…zm, Nβ = p)
which we saw before as:
(15) outsideG(y1…ym, z1…zq)(n)
     = ∑_i ∑_p ∑_r R(p, n, r) × insideG(z1…zi)(r) × outsideG(y1…ym, zi+1…zq)(p)
     + ∑_i ∑_p ∑_l R(p, l, n) × insideG(yi+1…ym)(l) × outsideG(y1…yi, z1…zq)(p)
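Here is a corresponding Python sketch of the outside calculation in (15), continuing with the toy grammar and the inside function from the sketch above (so those definitions need to be in scope). The base case, where both the left and the right context are empty so the node must be the root and the value is just I(n), isn't shown in (15); it's assumed here, matching the role I plays in the other formulas.

def outside(ys, zs, n):
    """Pr(X_left_alpha = ys, X_right_alpha = zs, N_alpha = n), where ys and zs are
    tuples of terminal symbols (the left and right context of the node)."""
    if not ys and not zs:
        # Assumed base case: no context at all means the node is the root.
        return I[n]
    total = 0.0
    # First summand of (15): the node is a left daughter; its sister derives z1...zi.
    for i in range(1, len(zs) + 1):
        for p in nonterminals:
            for r in nonterminals:
                total += (R.get((p, n, r), 0.0)
                          * inside(zs[:i], r)
                          * outside(ys, zs[i:], p))
    # Second summand of (15): the node is a right daughter; its sister derives y_{i+1}...y_m.
    for i in range(len(ys)):
        for p in nonterminals:
            for l in nonterminals:
                total += (R.get((p, l, n), 0.0)
                          * inside(ys[i:], l)
                          * outside(ys[:i], zs, p))
    return total

# Sanity check against (12)/(13): summing outside times R(n, x_i) over nonterminals
# recovers the string probability computed by f above.
ys, x, zs = ('watches',), 'with', ('telescopes',)
print(sum(outside(ys, zs, n) * R.get((n, x), 0.0) for n in nonterminals))   # 0.21 again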
5 Probabilities of structure given a string
This PFSA assigns joint probabilities, each of which we can think of as the probability of a certain string along with a certain syllable structure.
[PFSA diagram over the alphabet {C, V}: states 1, 2 and 3, with transitions labelled symbol/probability (e.g. C/0.5, V/0.2, C/1, V/0.5) and a probability of 0.1 marked on one of the states.]
6 Fitting a probabilistic model to data
6.1 Three-sided coin example
Suppose we have a “three-sided coin”, with faces A, B and C.
We know that there’s some number w (and 0 ≤ w ≤ 1) such that, on any particular flip of the coin:
• the probability of A is w²,
• the probability of B is w(1 − w), and
• the probability of C is 1 − w.
So, for example:

(16)                      w = 0    w = 0.25   w = 0.5   w = 0.75   w = 1
      probability of A      0       0.0625     0.25      0.5625      1
      probability of B      0       0.1875     0.25      0.1875      0
      probability of C      1       0.75       0.5       0.25        0
We flip the coin ten times and get three As, five Bs and two Cs. In general, the probability or likelihood of this outcome is:
(17) (probability of A)³ × (probability of B)⁵ × (probability of C)²

This value will depend on the value of w:

(18) a. If w were 0.25, then the likelihood would be (0.0625)³ × (0.1875)⁵ × (0.75)² ≈ 3.18 × 10⁻⁸.
     b. If w were 0.5, then the likelihood would be (0.25)³ × (0.25)⁵ × (0.5)² ≈ 3.81 × 10⁻⁶.
     c. If w were 0.75, then the likelihood would be (0.5625)³ × (0.1875)⁵ × (0.25)² ≈ 2.58 × 10⁻⁶.
So of these three options, w = 0.5 makes the likelihood the greatest. But which is the best option overall?
(19) likelihood L(w) = (probability of A)³ × (probability of B)⁵ × (probability of C)²
                     = (w²)³ × (w(1 − w))⁵ × (1 − w)²
                     = w¹¹(1 − w)⁷

(20) dL(w)/dw = 11w¹⁰ × (1 − w)⁷ + w¹¹ × (−7(1 − w)⁶)        (product rule for differentiation)
              = w¹⁰(1 − w)⁶ [11(1 − w) − 7w]

This derivative is zero when:

(21) w = 0    or    1 − w = 0, i.e. w = 1    or    11(1 − w) − 7w = 0
                                                   11 − 11w − 7w = 0
                                                   11 = 18w
                                                   w = 11/18 ≈ 0.611
Of these, the only local maximum is w ≈ 0.611. So this is the value of w which assigns the highest likelihood to the observed flips. (And with this value of w, the probabilities of A, B and C are about 0.373, 0.238 and 0.389, respectively.)
$ python3 -q
>>> from scipy.optimize import minimize_scalar
>>> minimize_scalar(lambda w: - w**11 * (1-w)**7)
     fun: -5.9717377174644157e-06
    nfev: 13
     nit: 12
 success: True
       x: 0.61111111104523153
>>>
[Plot of the likelihood L(w) = w¹¹(1 − w)⁷ against w, for w from −0.2 to 1; vertical axis from −6 × 10⁻⁶ to 6 × 10⁻⁶.]
6.2 Two-sided coin example
Suppose now we have a more traditional two-sided coin, with faces H and T.
We know that there’s some number w (and 0 ≤ w ≤ 1) such that, on any particular flip of the coin:
• the probability of H is w, and
• the probability of T is 1 − w.
We flip the coin ten times and get seven Hs and three Ts.
(22) likelihood L(w) = (probability of H)⁷ × (probability of T)³ = w⁷(1 − w)³

(23) dL(w)/dw = 7w⁶ × (1 − w)³ + w⁷ × (−3(1 − w)²)        (product rule for differentiation)
              = w⁶(1 − w)² [7(1 − w) − 3w]

This derivative is zero when:

(24) w = 0    or    1 − w = 0, i.e. w = 1    or    7(1 − w) − 3w = 0
                                                   7 − 7w − 3w = 0
                                                   7 = 10w
                                                   w = 7/10
$ python3 -q
>>> from scipy.optimize import minimize_scalar
>>> minimize_scalar(lambda w: - w**7 * (1-w)**3)
     fun: -0.0022235660999999998
    nfev: 13
     nit: 12
 success: True
       x: 0.7000000001004768
>>>
[Plot of the likelihood L(w) = w⁷(1 − w)³ against w, for w from −0.2 to 1; vertical axis from −0.001 to 0.0025.]

6.3 Two-parameter three-sided coin example
Suppose now we have another three-sided coin, with faces A, B and C.
We know that there are two numbers w0 and w1 such that, on any particular flip of the coin:
• the probability of A is w0,
• the probability of B is w1, and
• the probability of C is (1 − w0 − w1).
We flip the coin ten times and get three As, five Bs and two Cs.
(25) likelihood L(w0, w1) = (probability of A)³ × (probability of B)⁵ × (probability of C)²
                          = w0³ × w1⁵ × (1 − w0 − w1)²

High-school calculus is less help here, but the principles are the same.
[Surface plot of the likelihood L(w0, w1) = w0³ × w1⁵ × (1 − w0 − w1)² over w0 and w1, each ranging from 0 to 1.]
$ python3 -q
>>> from scipy.optimize import minimize
>>> l = lambda w: - (w[0]**3 * w[1]**5 * (1 - w[0] - w[1])**2)
>>> c = {'type': 'ineq', 'fun': (lambda w: 1 - w[0] - w[1])}
>>> b = [(0,1),(0,1)]
>>> minimize(l, (0.2,0.2), bounds=b, constraints=c, tol=1e-10)
     fun: -3.3749999736171599e-05
     jac: array([ -3.45876288e-08,  -9.02400643e-09,   0.00000000e+00])
 message: 'Optimization terminated successfully.'
    nfev: 70
     nit: 17
    njev: 17
  status: 0
 success: True
       x: array([ 0.29998248,  0.50000868])
>>>
6.4 Relative frequency and likelihood
So in general: when our hypothesis space is set up with parameters corresponding to each possible outcome of a multinomial distribution, taking probabilities to be relative frequencies maximizes likelihood.
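As a quick numerical sanity check of that claim, using the likelihood from (25): with three As, five Bs and two Cs, the relative frequencies 3/10 and 5/10 should do at least as well as any other choice, and should reproduce the maximum that scipy found above.

# Relative frequencies from the observed counts: three As, five Bs, two Cs.
counts = {'A': 3, 'B': 5, 'C': 2}
total = sum(counts.values())
w0, w1 = counts['A'] / total, counts['B'] / total     # 0.3 and 0.5

# The likelihood from (25).
L = lambda w0, w1: w0**3 * w1**5 * (1 - w0 - w1)**2

print(L(w0, w1))                                      # ≈ 3.375e-05, matching the scipy result above
print(L(w0, w1) >= L(0.25, 0.55), L(w0, w1) >= L(0.35, 0.45))   # True True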