lecture11.pptx
LECTURE 11
Word Senses and Similarity
Arkaitz Zubiaga, 14
th
February, 2018
2
Word Senses: Concepts.
Thesauri: Wordnet.
Thesaurus Methods.
Distributonal Models of Similarity.
Evaluaton.
LECTURE 11: CONTENTS
WORD SENSES: CONCEPTS
4
Homonymy: same word can have differen, Snrela,ed meaninges :
I put my money in the bank
1
.
We took the picture in the east bank
2
of the Nile river.
bank
1
: fnancial insttuton.
bank
2
: sloping land.
HOMONYMY
5
Polysemy: same word, rela,ed meaninges:
The bank
1
was constructed in 1875 out of local red brick.
I withdrew the money from the bank
2
.
bank
1
: the building belonging to a fnancial insttuton.
bank
2
: a fnancial insttuton.
POLYSEMY
6
Polysemy is often sys,emati:
BSildinge & orgeanisaton:
I’m heading to the Sniversi,y.
The Sniversi,y staf has gone on strike.
AS,hor & work:
Shakespeare wrote nice stuf.
I love reading Shakespeare.
SYSTEMATIC POLYSEMY
7
Synonyms: words with same meaning in some or all contexts.
flbert hazelnut
couch sofa
big large
SYNONYMY
8
But there are few (or no) examples of perfect synonymy.
e.g. big large:
My bige sister… [= older sister]
My largee sister… [≠ older sister]
SYNONYMY
9
An,onyms: Senses that are opposi,es wi,h respei, ,o one
fea,Sre of meaninge, very similar otherwise:
dark light.
short long.
fast slow.
hot cold.
up down.
ANTONYMY
10
One sense is a hyponym of another if the frst sense is more
speiifi, denotng a sSbilass of the other:
car is a hyponym of vehicle
apple is a hyponym of fruit
Conversely hypernym:
vehicle is a hypernym of car
fruit is a hypernym of apple
HYPONYMY AND HYPERNYMY
11
An ins,anie is an individual, a proper noun that is a unique
entty:
London is an ins,anie of city.
But city is a ilass:
city is a hyponym of municipality or locaton.
INSTANCES VS HYPONYMS
THESAURI
13
THESAURI
Thesaurus (plural thesauri) is a reference work that lists words
grouped together according to similarity of meaning.
Useful for different tasks (which we’ll see in next lectures):
Information Extraction.
Information Retrieval.
Question Answering.
Machine Translation.
14
WORDNET 3.1
Wordnet is a popular dataset (thesaurus + aspects of dictionary):
https://wordnet.princeton.edu/
It’s integrated in NLTK:
http://www.nltk.org/howto/wordnet.html
Structured into 117,000 synsets (synonym sets).
Linked to other synsets through “conceptual relations”.
Synsets have brief definition (“gloss”) and example sentences.
15
SENSES OF “BASS” IN WORDNET
16
SYNSETS: SYNONYM SETS
Example: chump as a noun with the gloss:
“a person who is gullible and easy to take advantage of”
This sense of “chump” is shared by 9 words:
chump
1
, fool
2
, gull
1
, mark
9
, patsy
1
, fall guy
1
, sucker
1
, soft touch
1
,
mug
2
Note: only those senses of the words, e.g. mug
1
is a “cup”,
belongs to a different synset.
17
WORDNET HYPERNYM HIERARCHY FOR “BASS”
18
WORDNET NOUN RELATIONS
THESAURUS METHODS
20
WORD SIMILARITY AND RELATEDNESS
Synonymy is binary.
Similarity is a looser metric, two words are similar when they share
features of meaning.
bank
1
is similar to fund
3
.
bank
2
is similar to slope
5
.
Relatedness measures possible associations.
motorbike and bike are similar.
motorbike and fuel are related, not similar.
21
TWO TYPES OF SIMILARITY ALGORITHMS
Thesaurus-based algorithms:
Are words “nearby” in hypernym hierarchy?
Do words have similar glosses (definitions)?
Distributional algorithms:
Do words have similar distributional contexts?
22
PATH-BASED SIMILARITY
Two senses are similar if there is
a short path between them in the
thesaurus hierarchy
pathlen(c1,c2) = 1 + # of edges in shortest path
simpath(c1,c2) =
simpath(nickel,coin) = 1/2 = .5
simpath(nickel,Richter scale) = 1/8 = .125
23
PATH-BASED SIMILARITY
Two senses are similar if there is
a short path between them in the
thesaurus hierarchy
pathlen(c1,c2) = 1 + # of edges in shortest path
simpath(c1,c2) =
simpath(nickel,coin) = 1/2 = .5
simpath(nickel,Richter scale) = 1/8 = .125
Problem: assumes uniform distance for all
links.
simpath(nickel, money) = 6
simpath(nickel, standard) = 6
24
ALTERNATIVE THESAURUS-BASED SIMILARITY
Thesaurus-based similarity algorithms:
Information content for similarity measurement (Resnik).
Lin similarity function.
The Lesk algorithm.
25
INFORMATION CONTENT
Train by counting in a corpus.
Each instance of hill counts towards
frequency of its hypernyms:
natural elevation, geological formation, etc
words(c): set of words children of node c
words(“geo-formation”) = {hill,ridge,grotto,coast,cave,shore,natural elevation}
words(“natural elevation”) = {hill, ridge}
P(c)=
count(w)
wÎwords(c)
å
N
Probability that a random word
in the corpus is instance of c
geological-formaton
shore
hill
natural elevaton
coast
cave
grotoridge
…
entty
(Resnik 1995)
26
INFORMATION CONTENT
Information content:
● IC(c) = -log P(c)
Lowest common subsumer:
● LCS(c1,c2)
The most informative (lowest) node
in the hierarchy subsuming both c1 and c2
e.g. LCS(hill, coast) = geological-formation
27
INFORMATION CONTENT FOR SIMILARITY
Intuition: the more two words have in common, the more similar they
are.
Resnik: similarity = information content of the lowest common
subsumer (LCS) of the two nodes.
sim
resnik
(c1,c2) = -log P(LCS(c1,c2))
(Resnik 1995, 1999)
28
DEKANG LIN METHOD
Intuition: look at both commonalities and differences to measure
similarity of words A and B.
Commonality: the more A and B have in common, the more
similar they are.
IC(common(A,B))
Difference: the more differences between A and B, the less
similar.
IC(description(A,B)-IC(common(A,B))
(Lin 1998)
29
DEKANG LIN SIMILARITY THEOREM
Similarity between A and B is measured by the ratio between:
the amount of information needed to state the commonality of
A and B.
the information needed to fully describe what A and B are.
Lin defines IC(common(A,B)) as 2 x information of the LCS.
simLin(c1,c2 )=
2logP(LCS(c1,c2 ))
logP(c1)+ logP(c2 )
simLin(A,B)µ
IC(common(A,B))
IC(description(A,B))
30
EXAMPLE OF LIN SIMILARITY FUNCTION
simLin(A,B)=
2logP(LCS(c1,c2 ))
logP(c1)+ logP(c2 )
simLin(hill, coast)=
2 logP(geological-formation)
logP(hill)+ logP(coast)
=
2ln0.00176
ln0.0000189 + ln 0.0000216
=.59
31
THE LESK ALGORITHM
Intuition: A and B are similar if their glosses contain similar words.
Drawing paper: paper that is specially prepared for use in drafting.
Decal: the art of transferring designs from specially prepared paper to a wood or
glass or metal surface.
For each word phrase of length n that’s in both glosses:
Add a score of n2 (paper and specially prepared: 12 + 22 = 5).
Compute overlap also for glosses of hypernyms and hyponyms.
(Lesk 1986)
DISTRIBUTIONAL MODELS OF
SIMILARITY
33
THESAURUS-BASED APPROACHES: LIMITATIONS
We don’t have a thesaurus for every language.
Even if we do, they have problems with coverage:
Many words are missing.
Most (if not all) phrases are missing.
Some connections between senses are missing.
Thesauri work less well for verbs and adjectives, which have
less structured hyponymy relations.
34
DISTRIBUTIONAL MODELS OF MEANING
Also called vector-space models of meaning.
Offer much higher recall than hand-built thesauri.
Although they tend to have lower precision.
Intuition of distributional models of meaning:
If A and B have almost identical environments we say that
they are synonyms.
Remember lecture 6, word embeddings?
35
SYNONYMS IN SIMILAR ENVIRONMENTS
tackle the task, tackle fake news.
deal with the task, deal with fake news.
tackle = deal with
I like chocolate, chocolate is tasty, recipe for chocolate.
I like NLP, NLP is hard, I’m in an NLP lecture.
NLP ≠ chocolate
How do we achieve this?
36
WORD-WORD CO-OCCURRENCE MATRIX
Again, word-word co-occurrence matrix.
37
SAMPLE CONTEXTS: BROWN CORPUS
equal amount of sugar, a sliced lemon, a tablespoonful of apricot preserve or
jam, a pinch each of clove and nutmeg,
on board for their enjoyment. Cautiously she sampled her first pineapple and
another fruit whose taste she likened to that of
of a recursive type well suited to programming on the digital computer. In
finding the optimal R-stage policy from that of
substantially affect commerce, for the purpose of gathering data and
information necessary for the study authorized in the first section of this
38
TERM-CONTEXT MATRIX FOR SIMILARITY
Two words are similar in meaning if their context vectors are similar.
38
aardvark computer data pinch result sugar …
apricot 0 0 0 1 0 1
pineapple 0 0 0 1 0 1
digital 0 2 1 0 1 0
information 0 1 6 0 4 0
39
POINTWISE MUTUAL INFORMATION
Instead of raw counts, Pointwise Mutual Information (PMI) is often used.
Do events x and y co-occur more than if they were independent?
PMI between two words: (Church & Hanks 1989)
Do words x and y co-occur more than if they were independent?
Positive PMI between two words (Niwa & Nitta 1994)
Replace all PMI values less than 0 with zero
PMI(X,Y )=log2
P(x,y)
P(x)P(y)
PMI(word1,word2 )=log2
P(word1,word2)
P(word1)P(word2)
40
COMPUTING PMI
P(w=information,c=data) = 6/19 = .32
P(w=information) = 11/19 = .58
P(c=data) = 7/19 = .37
p(w,context) p(w)
computer data pinch result sugar
apricot 0.00 0.00 0.05 0.00 0.05 0.11
pineapple 0.00 0.00 0.05 0.00 0.05 0.11
digital 0.11 0.05 0.00 0.05 0.00 0.21
information 0.05 0.32 0.00 0.21 0.00 0.58
p(context) 0.16 0.37 0.11 0.26 0.11
41
COMPUTING PMI
PMI(information,data) = log
2
(.32 / (.37 * .58)) = .57
p(w,context) p(w)
computer data pinch result sugar
apricot 0.00 0.00 0.05 0.00 0.05 0.11
pineapple 0.00 0.00 0.05 0.00 0.05 0.11
digital 0.11 0.05 0.00 0.05 0.00 0.21
information 0.05 0.32 0.00 0.21 0.00 0.58
p(context) 0.16 0.37 0.11 0.26 0.11
PPMI(w,context)
computer data pinch result sugar
apricot – – 2.25 – 2.25
pineapple – – 2.25 – 2.25
digital 1.66 0.00 – 0.00 –
information 0.00 0.57 – 0.47 –
pmiij =log2
pij
pi*p* j
42
WEIGHING PMI
PMI is biased toward infrequent events.
Various weighting schemes help alleviate this, see e.g. Turney and Pantel
(2010)
Add-one smoothing can also help.
43
STATE-OF-THE-ART WORD SIMILARITY
Currently, the state of the art approach for measuring word similarity is using
word embeddings.
See Lecture 6!
44
RELATED TASK: WSD
Word Sense Disambiguation (WSD).
Related task in which we aim to identify which specific sense of a word
is being used in a particular instance in text.
I put my money in the bank → bank
1
We took the picture in the east bank of the Nile river → bank2
As with computation of similarity, context can help WSD.
https://www.cs.york.ac.uk/semeval-2013/task12/
EVALUATION
46
TWO TYPES OF EVALUATION
Evaluation can be:
Intrinsic (in-vitro) evaluation.
Extrinsic (in-vivo) evaluation.
47
INTRINSIC EVALUATION
Need a corpus with human-annotated similarity scores.
Correlation between algorithm and human word similarity ratings.
The WordSimilarity-353 Test Collection:
http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/
48
EXTRINSIC EVALUATION
A number of tasks to test with:
Spelling error detection.
Word sense disambiguation.
Taking multiple-choice vocabulary tests (TOEFL/Cambridge).
49
RESOURCES
WordNet web interface:
http://wordnetweb.princeton.edu/perl/webwn
MeSH (Medical Subject Headings) thesaurus:
https://www.ncbi.nlm.nih.gov/mesh
50
REFERENCES
Resnik, P. (1995). Using Information Content to Evaluate Semantic Similarity
in a Taxonomy. In Proceedings of the 14th International Joint Conference on
Artificial Intelligence (IJCAI-95).
Resnik, P. (1999). Semantic similarity in a taxonomy: An information-based
measure and its application to problems of ambiguity in natural language. J.
Artif. Intell. Res.(JAIR), 11, 95-130.
Lin, D. (1998). An information-theoretic definition of similarity. In ICML, pp.
296-304.
Lesk, M. (1986). Automatic sense disambiguation using machine readable
dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of
the 5th annual international conference on Systems documentation (pp. 24-
26). ACM.
51
REFERENCES
Church, K., Gale, W., Hanks, P., & Hindle, D. (1989). Word associations and
typical predicate-argument relations. In In Proceedings of the International
Workshop on Parsing Technologies.
Niwa, Y., & Nitta, Y. (1994, August). Co-occurrence vectors from corpora vs.
distance vectors from dictionaries. In Proceedings of the 15th conference on
Computational linguistics-Volume 1 (pp. 304-309). Association for
Computational Linguistics.
Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space
models of semantics. Journal of artificial intelligence research, 37, 141-188.
52
ASSOCIATED READING
Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language
Processing: An Introduction to Natural Language Processing, Speech
Recognition, and Computational Linguistics. 3rd edition. Chapter 17.