Distributional Semantics
For this notebook, we'll be using the 500-document Brown corpus included in NLTK.
In [1]:
from nltk.corpus import brown
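If the corpus data isn't already installed, nltk.download("brown") will fetch it. A quick sanity check on the document count:
In [ ]:
import nltk
# nltk.download("brown")  # uncomment if the corpus data hasn't been downloaded yet
print(len(brown.fileids()))  # expect 500 documents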
This notebook is divided into two independent parts: the first uses pointwise mutual information (PMI) to identify good collocations, and the second builds a vector space model.
For the PMI portion, we'll use a function that extracts the statistics we need for a particular two-word collocation (the count of each word individually, the count of the collocation, and the total number of word tokens in the corpus) and then calculates the PMI from them.
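As a reminder, the PMI of a word pair $(w_1, w_2)$ is $\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$, where the probabilities are maximum-likelihood estimates computed from those corpus counts: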
In [2]:
import math

def get_PMI_for_collocation_brown(word1, word2):
    # count word1, word2, and the bigram "word1 word2" in a single pass
    # over the (lowercased) Brown corpus
    word1_count = 0
    word2_count = 0
    both_count = 0
    total_count = 0.0 # so that division results in a float
    for sent in brown.sents():
        sent = [word.lower() for word in sent]
        for i in range(len(sent)):
            total_count += 1
            if sent[i] == word1:
                word1_count += 1
                if i < len(sent) - 1 and sent[i + 1] == word2:
                    both_count += 1
            elif sent[i] == word2:
                word2_count += 1
    # PMI = log2( P(word1, word2) / (P(word1) * P(word2)) )
    return math.log((both_count/total_count)/((word1_count/total_count)*(word2_count/total_count)), 2)
Note that in a typical use case we probably wouldn't do it this way: we'd usually want to calculate PMI across many different word pairs, in which case the statistics can be collected in a single pass over the corpus for all words, with the PMI calculated afterwards in a separate function (we sketch this single-pass approach a little further below). Anyway, let's compare the PMI for two phrases, "hard work" and "some work":
In [31]:
print(get_PMI_for_collocation_brown("hard","work"))
print(get_PMI_for_collocation_brown("some","work"))
5.237244531670497
1.9135320271049516
Based on PMI, "hard work" appears to be a much better collocation than "some work", which matches our intuition. Go ahead and try this out with some other collocations.
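To explore many collocations at once, it's more efficient to collect all the counts in a single pass, as mentioned above, and then compute PMI from the cached counts. Here's a minimal sketch of that approach (get_PMI_from_counts is just an illustrative helper name); for distinct word pairs it gives the same values as get_PMI_for_collocation_brown:
In [ ]:
from collections import Counter

unigram_counts = Counter()
bigram_counts = Counter()
total_count = 0
for sent in brown.sents():
    sent = [word.lower() for word in sent]
    unigram_counts.update(sent)
    bigram_counts.update(zip(sent, sent[1:]))  # adjacent pairs within the sentence
    total_count += len(sent)

def get_PMI_from_counts(word1, word2):
    # PMI from the cached counts; assumes the pair occurs at least once
    p_both = bigram_counts[(word1, word2)] / total_count
    p1 = unigram_counts[word1] / total_count
    p2 = unigram_counts[word2] / total_count
    return math.log(p_both / (p1 * p2), 2)

print(get_PMI_from_counts("hard", "work"))
print(get_PMI_from_counts("some", "work"))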
For the second part of the notebook, let's first create a sparse document-term matrix using scikit-learn. We'll then apply tf-idf weighting and SVD to learn word vectors.
In [32]:
from sklearn.feature_extraction import DictVectorizer

def get_BOW(text):
    # bag of words: a dictionary of lowercased token counts for one document
    BOW = {}
    for word in text:
        BOW[word.lower()] = BOW.get(word.lower(), 0) + 1
    return BOW

texts = []
for fileid in brown.fileids():
    texts.append(get_BOW(brown.words(fileid)))

# convert the list of count dictionaries into a sparse document-term matrix
vectorizer = DictVectorizer()
brown_matrix = vectorizer.fit_transform(texts)
print(brown_matrix)
(0, 49) 1.0
(0, 58) 1.0
(0, 169) 1.0
(0, 181) 1.0
(0, 205) 1.0
(0, 238) 1.0
(0, 322) 33.0
(0, 373) 3.0
(0, 374) 3.0
(0, 393) 87.0
(0, 395) 4.0
(0, 405) 88.0
(0, 454) 4.0
(0, 465) 1.0
(0, 695) 1.0
(0, 720) 1.0
(0, 939) 1.0
(0, 1087) 1.0
(0, 1103) 1.0
(0, 1123) 1.0
(0, 1159) 1.0
(0, 1170) 1.0
(0, 1173) 1.0
(0, 1200) 3.0
(0, 1451) 1.0
: :
(499, 49161) 1.0
(499, 49164) 1.0
(499, 49242) 1.0
(499, 49253) 1.0
(499, 49275) 1.0
(499, 49301) 1.0
(499, 49313) 1.0
(499, 49369) 1.0
(499, 49385) 1.0
(499, 49386) 4.0
(499, 49390) 2.0
(499, 49410) 2.0
(499, 49446) 1.0
(499, 49576) 1.0
(499, 49590) 1.0
(499, 49613) 3.0
(499, 49691) 42.0
(499, 49694) 3.0
(499, 49697) 3.0
(499, 49698) 1.0
(499, 49707) 17.0
(499, 49708) 1.0
(499, 49710) 4.0
(499, 49711) 1.0
(499, 49797) 1.0
Our matrix is sparse: for instance, columns 0-48 of row 0 are empty and are simply left out; only the cells with non-zero values are stored and displayed.
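You can check just how sparse it is from the matrix's shape and its number of stored entries (brown_matrix is a SciPy sparse matrix, so it has shape and nnz attributes):
In [ ]:
# fraction of cells in the document-term matrix that are actually non-zero
num_docs, vocab_size = brown_matrix.shape
print(num_docs, vocab_size)
print(brown_matrix.nnz / (num_docs * vocab_size))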
Rather than removing stopwords as we did for text classification, let's add some idf weighting to this matrix. Scikit-learn has a built-in tf-idf transformer for just this purpose.
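With these settings (smooth_idf=False and no normalisation), recent versions of scikit-learn weight each raw count as $\mathrm{tf} \times \mathrm{idf}$ with $\mathrm{idf}(t) = \ln\frac{n}{\mathrm{df}(t)} + 1$, where $n$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing term $t$; a word that appears in every document therefore keeps its raw count.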
In [33]:
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer(smooth_idf=False,norm=None)
brown_matrix_tfidf = transformer.fit_transform(brown_matrix)
print(brown_matrix_tfidf)
(0, 49646) 1.7298111649315369
(0, 49613) 1.3681693233644676
(0, 49596) 3.7066318654255337
(0, 49386) 9.98833379406486
(0, 49378) 8.731629015654066
(0, 49313) 2.62964061975162
(0, 49301) 7.374075931214787
(0, 49292) 2.184170177029756
(0, 49224) 3.385966701933097
(0, 49147) 6.0
(0, 49041) 3.407945608651872
(0, 49003) 22.210096880912054
(0, 49001) 5.741605353137016
(0, 48990) 16.84677293625242
(0, 48951) 4.7297014486341915
(0, 48950) 4.939351940117883
(0, 48932) 3.9565115604007097
(0, 48867) 7.046120322868667
(0, 48777) 1.41855034765682
(0, 48771) 13.694210097452498
(0, 48769) 6.236428984115791
(0, 48753) 1.2957142441490452
(0, 48749) 3.1984194075136347
(0, 48720) 1.1648746431902341
(0, 48670) 2.1974319458783156
: :
(499, 2710) 3.120263536200091
(499, 2688) 2.04412410338404
(499, 2670) 3.9565115604007097
(499, 2611) 4.270169119255751
(499, 2468) 6.521460917862246
(499, 2439) 4.170085660698769
(499, 2415) 4.122633007848826
(499, 2413) 2.320337504305643
(499, 2388) 2.096614286005437
(499, 2358) 6.115995809754082
(499, 2290) 61.0
(499, 2289) 7.5533024513831695
(499, 2286) 11.156201344558639
(499, 2285) 20.714812015222506
(499, 2283) 1.2256466815323281
(499, 1345) 6.521460917862246
(499, 1141) 4.506557897319982
(499, 405) 83.0
(499, 395) 12.710333931244342
(499, 393) 188.0
(499, 374) 4.0872168559431525
(499, 373) 4.095849955425997
(499, 354) 7.214608098422191
(499, 322) 7.538167310351703
(499, 320) 3.4769384801388235
Next, let's apply SVD. Scikit-learn does not expose the internal details of the decomposition; we just use the TruncatedSVD class directly to get a matrix with k dimensions. Since the Brown corpus is fairly small, we'll use k=10. Note that we'll first transpose the document-term sparse matrix to a term-document matrix before applying SVD, so that it's the words (rather than the documents) that get the low-dimensional representation.
In [46]:
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
#dimension of brown_matrix_tfidf = num_documents x num_vocab
#dimension of brown_matrix_tfidf_transposed = num_vocab x num_documents
brown_matrix_tfidf_transposed = csr_matrix(brown_matrix_tfidf).transpose()
svd = TruncatedSVD(n_components=10)
brown_matrix_lowrank = svd.fit_transform(brown_matrix_tfidf_transposed)
print(brown_matrix_lowrank.shape)
print(brown_matrix_lowrank)
(49815, 10)
[[ 1.46529922e+02 -1.56578300e+02 3.77295895e+01 ... 1.87574330e+01
5.17826940e+00 -1.32357467e+01]
[ 6.10797743e-01 6.77336542e-01 -2.04392054e-01 ... -1.02796238e+00
-1.14385161e-01 -1.12871217e+00]
[ 1.00411586e+00 1.99456979e-01 -1.25054329e-01 ... 1.14578446e+00
-4.14250674e-01 -1.68706426e-01]
...
[ 3.26612758e-01 2.53370725e-01 -2.71177861e-01 ... 2.51508282e-01
1.31093947e-01 1.59715022e-01]
[ 6.35382477e-01 7.12100488e-01 -2.82140022e-02 ... 6.70060518e-01
1.78645267e-01 3.52829119e-01]
[ 3.27037764e-01 7.38765531e-01 2.09243078e+00 ... -2.95536854e-01
-3.95585989e-01 -1.02777409e-02]]
The returned matrix corresponds to the transformed term/word matrix $U \Sigma$ from the SVD factorisation $X \approx U \Sigma V^T$, where $X$ is brown_matrix_tfidf_transposed. Note that the resulting matrix is no longer sparse.
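The other factor, $V^T$, is available as svd.components_ after fitting; a quick check of the shapes (using standard TruncatedSVD attributes) helps keep the factorisation straight:
In [ ]:
# U * Sigma: one 10-dimensional vector per word; V^T: 10 components x 500 documents
print(svd.components_.shape)
print(svd.singular_values_.shape)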
The last thing we'll do is to compare some words and see if their similarity fits our intuition.
In [38]:
import numpy as np
from numpy.linalg import norm

# look up each word's row in the low-rank matrix via the vectorizer's vocabulary
v1 = brown_matrix_lowrank[vectorizer.vocabulary_["medical"]]
v2 = brown_matrix_lowrank[vectorizer.vocabulary_["health"]]
v3 = brown_matrix_lowrank[vectorizer.vocabulary_["gun"]]

def cos_sim(a, b):
    # cosine similarity: dot product divided by the product of the vector norms
    return np.dot(a, b) / (norm(a) * norm(b))

print(cos_sim(v1, v2))
print(cos_sim(v1, v3))
0.8095752240062064
0.14326323746458713
There'll be some variability in the exact cosine similarity values you get (feel free to re-run the SVD and check this), but hopefully you'll find that "medical" and "health" are more closely related to each other than "medical" and "gun" are.
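If you'd rather get identical vectors on every run, TruncatedSVD accepts a random_state argument that fixes the seed of its randomised solver; for example, when constructing the model above:
In [ ]:
# fix the seed so fit_transform gives the same low-rank matrix each run
svd = TruncatedSVD(n_components=10, random_state=42)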
Next, let's try "information", "retrieval" and "science"!
In [47]:
v1 = brown_matrix_lowrank[vectorizer.vocabulary_["information"]]
v2 = brown_matrix_lowrank[vectorizer.vocabulary_["retrieval"]]
v3 = brown_matrix_lowrank[vectorizer.vocabulary_["science"]]
print(cos_sim(v1, v2))
print(cos_sim(v1, v3))
0.4883440848184145
0.43337516485029776
What did you find? Did you get results similar to Discussion Q2 in the worksheet? (You might want to re-run the SVD and see whether you find contradictory results.)