LECTURE 11
Word Senses and Similarity
Arkaitz Zubiaga, 14th February, 2018
2
Word Senses: Concepts.
Thesauri: Wordnet.
Thesaurus Methods.
Distributional Models of Similarity.
Evaluation.
LECTURE 11: CONTENTS
WORD SENSES: CONCEPTS
4
Homonymy: same word can have different, unrelated meanings:
I put my money in the bank1.
We took the picture in the east bank2 of the Nile river.
bank1: financial institution.
bank2: sloping land.
HOMONYMY
5
Polysemy: same word, related meanings:
The bank1 was constructed in 1875 out of local red brick.
I withdrew the money from the bank2.
bank1: the building belonging to a financial institution.
bank2: a financial institution.
POLYSEMY
6
Polysemy is often systematic:
Building & organisation:
I’m heading to the university.
The university staff has gone on strike.
Author & work:
Shakespeare wrote nice stuf.
I love reading Shakespeare.
SYSTEMATIC POLYSEMY
7
Synonyms: words with same meaning in some or all contexts.
filbert / hazelnut
couch / sofa
big / large
SYNONYMY
8
But there are few (or no) examples of perfect synonymy.
e.g. big/large:
My big sister… [= older sister]
My large sister… [≠ older sister]
SYNONYMY
9
Antonyms: senses that are opposites with respect to one
feature of meaning, very similar otherwise:
dark/light.
short/long.
fast/slow.
hot/cold.
up/down.
ANTONYMY
10
One sense is a hyponym of another if the first sense is more
specific, denoting a subclass of the other:
car is a hyponym of vehicle
apple is a hyponym of fruit
Conversely hypernym:
vehicle is a hypernym of car
fruit is a hypernym of apple
HYPONYMY AND HYPERNYMY
11
An instance is an individual, a proper noun that is a unique
entity:
London is an instance of city.
But city is a class:
city is a hyponym of municipality or location.
INSTANCES VS HYPONYMS
THESAURI
13
THESAURI
A thesaurus (plural: thesauri) is a reference work that lists words
grouped together according to similarity of meaning.
Useful for different tasks (which we’ll see in next lectures):
Information Extraction.
Information Retrieval.
Question Answering.
Machine Translation.
14
WORDNET 3.1
Wordnet is a popular dataset (thesaurus + aspects of dictionary):
https://wordnet.princeton.edu/
It’s integrated in NLTK:
http://www.nltk.org/howto/wordnet.html
Structured into 117,000 synsets (synonym sets).
Linked to other synsets through “conceptual relations”.
Synsets have brief definition (“gloss”) and example sentences.
15
SENSES OF “BASS” IN WORDNET
16
SYNSETS: SYNONYM SETS
Example: chump as a noun with the gloss:
“a person who is gullible and easy to take advantage of”
This sense of “chump” is shared by 9 words:
chump1, fool2, gull1, mark9, patsy1, fall guy1, sucker1, soft touch1, mug2.
Note: only those particular senses of the words share it; e.g. mug1 is a
“cup” and belongs to a different synset.
17
WORDNET HYPERNYM HIERARCHY FOR “BASS”
18
WORDNET NOUN RELATIONS
THESAURUS METHODS
20
WORD SIMILARITY AND RELATEDNESS
Synonymy is binary.
Similarity is a looser metric: two words are similar when they share
features of meaning.
bank1 is similar to fund3.
bank2 is similar to slope5.
Relatedness measures possible associations.
motorbike and bike are similar.
motorbike and fuel are related, not similar.
21
TWO TYPES OF SIMILARITY ALGORITHMS
Thesaurus-based algorithms:
Are words “nearby” in hypernym hierarchy?
Do words have similar glosses (definitions)?
Distributional algorithms:
Do words have similar distributional contexts?
22
PATH-BASED SIMILARITY
Two senses are similar if there is
a short path between them in the
thesaurus hierarchy
pathlen(c1,c2) = 1 + number of edges in the shortest path between c1 and c2
simpath(c1,c2) = 1 / pathlen(c1,c2)
simpath(nickel, coin) = 1/2 = .5
simpath(nickel, Richter scale) = 1/8 = .125
23
PATH-BASED SIMILARITY
Two senses are similar if there is
a short path between them in the
thesaurus hierarchy
pathlen(c1,c2) = 1 + # of edges in shortest path
simpath(c1,c2) =
simpath(nickel,coin) = 1/2 = .5
simpath(nickel,Richter scale) = 1/8 = .125
Problem: this assumes that every link represents a uniform distance:
pathlen(nickel, money) = 6
pathlen(nickel, standard) = 6
yet nickel is intuitively much closer to money than to standard.
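As a sketch, the path-based measure can be implemented over a small hand-built hierarchy. The node names and edges below are an illustrative fragment modelled on the nickel/Richter-scale example, not a real thesaurus:

```python
from collections import deque

# Toy fragment of a concept hierarchy (illustrative, not WordNet).
EDGES = [
    ("standard", "medium_of_exchange"), ("standard", "scale"),
    ("scale", "richter_scale"),
    ("medium_of_exchange", "money"), ("medium_of_exchange", "currency"),
    ("currency", "coinage"), ("coinage", "coin"),
    ("coin", "nickel"), ("coin", "dime"),
]

# Undirected adjacency list, so paths can go both up and down the tree.
adj = {}
for a, b in EDGES:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

def pathlen(c1, c2):
    """1 + number of edges on the shortest path between c1 and c2 (BFS)."""
    dist = {c1: 0}
    queue = deque([c1])
    while queue:
        node = queue.popleft()
        if node == c2:
            return 1 + dist[node]
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return float("inf")

def sim_path(c1, c2):
    return 1.0 / pathlen(c1, c2)

print(sim_path("nickel", "coin"))           # 1/2 = 0.5
print(sim_path("nickel", "richter_scale"))  # 1/8 = 0.125
```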
24
ALTERNATIVE THESAURUS-BASED SIMILARITY
Thesaurus-based similarity algorithms:
Information content for similarity measurement (Resnik).
Lin similarity function.
The Lesk algorithm.
25
INFORMATION CONTENT
Train by counting in a corpus.
Each instance of hill counts towards
frequency of its hypernyms:
natural elevation, geological formation, etc
words(c): set of words children of node c
words(“geo-formation”) = {hill,ridge,grotto,coast,cave,shore,natural elevation}
words(“natural elevation”) = {hill, ridge}
P(c) = Σ_{w ∈ words(c)} count(w) / N
Probability that a random word in the corpus is an instance of c.
[Figure: fragment of the noun hierarchy — entity → … → geological-formation,
with children natural elevation (hill, ridge), shore (coast), and cave (grotto)]
(Resnik 1995)
26
INFORMATION CONTENT
Information content:
● IC(c) = -log P(c)
Lowest common subsumer:
● LCS(c1,c2)
The most informative (lowest) node
in the hierarchy subsuming both c1 and c2
e.g. LCS(hill, coast) = geological-formation
27
INFORMATION CONTENT FOR SIMILARITY
Intuition: the more two words have in common, the more similar they
are.
Resnik: similarity = information content of the lowest common
subsumer (LCS) of the two nodes.
simresnik(c1,c2) = -log P(LCS(c1,c2))
(Resnik 1995, 1999)
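A minimal sketch of Resnik similarity over a toy taxonomy. The corpus counts and concept groupings below are invented for illustration, loosely following the hill/coast example:

```python
import math

# Invented corpus counts for each word (toy numbers, not from a real corpus).
COUNTS = {"hill": 2, "ridge": 1, "coast": 4, "shore": 1, "grotto": 1, "cave": 1}
N = sum(COUNTS.values())  # 10

# words(c): which words each concept node subsumes.
WORDS = {
    "natural_elevation": {"hill", "ridge"},
    "geological_formation": {"hill", "ridge", "coast", "shore", "grotto", "cave"},
}

def P(c):
    # Probability that a random word in the corpus is an instance of c.
    return sum(COUNTS[w] for w in WORDS[c]) / N

def IC(c):
    # Information content: IC(c) = -log P(c).
    return -math.log2(P(c))

def sim_resnik(lcs):
    # Resnik similarity of two senses = IC of their lowest common subsumer.
    return IC(lcs)

# LCS(hill, ridge) = natural_elevation; LCS(hill, coast) = geological_formation
print(round(sim_resnik("natural_elevation"), 2))  # -log2(3/10) ≈ 1.74
print(sim_resnik("geological_formation"))         # -log2(1) = 0.0
```

Note that two senses whose only common subsumer is the root get similarity 0, since the root subsumes everything and so carries no information.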
28
DEKANG LIN METHOD
Intuition: look at both commonalities and differences to measure
similarity of words A and B.
Commonality: the more A and B have in common, the more
similar they are.
IC(common(A,B))
Difference: the more differences between A and B, the less
similar.
IC(description(A,B)) - IC(common(A,B))
(Lin 1998)
29
DEKANG LIN SIMILARITY THEOREM
Similarity between A and B is measured by the ratio between:
the amount of information needed to state the commonality of
A and B.
the information needed to fully describe what A and B are.
Lin defines IC(common(A,B)) as 2 x information of the LCS.
simLin(c1,c2) = 2 log P(LCS(c1,c2)) / ( log P(c1) + log P(c2) )
simLin(A,B) ∝ IC(common(A,B)) / IC(description(A,B))
30
EXAMPLE OF LIN SIMILARITY FUNCTION
simLin(c1,c2) = 2 log P(LCS(c1,c2)) / ( log P(c1) + log P(c2) )
simLin(hill, coast) = 2 log P(geological-formation) / ( log P(hill) + log P(coast) )
= 2 ln 0.00176 / ( ln 0.0000189 + ln 0.0000216 )
= .59
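The arithmetic above can be checked directly; the base of the logarithm cancels in the ratio, so natural logs are fine:

```python
import math

# Probabilities from the slide's hill/coast example.
p_lcs, p_hill, p_coast = 0.00176, 0.0000189, 0.0000216

# simLin = 2 log P(LCS) / (log P(c1) + log P(c2))
sim_lin = (2 * math.log(p_lcs)) / (math.log(p_hill) + math.log(p_coast))
print(round(sim_lin, 2))  # 0.59
```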
31
THE LESK ALGORITHM
Intuition: A and B are similar if their glosses contain similar words.
Drawing paper: paper that is specially prepared for use in drafting.
Decal: the art of transferring designs from specially prepared paper to a wood or
glass or metal surface.
For each word phrase of length n that appears in both glosses:
Add a score of n² (here: paper and specially prepared give 1² + 2² = 5).
Compute the overlap also for the glosses of hypernyms and hyponyms.
(Lesk 1986)
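A sketch of the gloss-overlap scoring, assuming a greedy longest-phrase-first matching strategy (one plausible reading of the scoring rule, not Lesk's exact procedure):

```python
def lesk_overlap(gloss_a, gloss_b):
    """Score = sum of n^2 over maximal word phrases shared by the two glosses."""
    a, b = gloss_a.lower().split(), gloss_b.lower().split()
    used_a, used_b = [False] * len(a), [False] * len(b)
    score = 0
    # Match the longest shared phrases first, marking words as used so a
    # phrase is not counted again inside a longer match.
    for n in range(min(len(a), len(b)), 0, -1):
        for i in range(len(a) - n + 1):
            if any(used_a[i:i + n]):
                continue
            for j in range(len(b) - n + 1):
                if any(used_b[j:j + n]) or a[i:i + n] != b[j:j + n]:
                    continue
                score += n * n
                used_a[i:i + n] = [True] * n
                used_b[j:j + n] = [True] * n
                break
    return score

drawing_paper = "paper that is specially prepared for use in drafting"
decal = ("the art of transferring designs from specially prepared paper "
         "to a wood or glass or metal surface")
print(lesk_overlap(drawing_paper, decal))  # "specially prepared" + "paper": 4 + 1 = 5
```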
DISTRIBUTIONAL MODELS OF
SIMILARITY
33
THESAURUS-BASED APPROACHES: LIMITATIONS
We don’t have a thesaurus for every language.
Even if we do, they have problems with coverage:
Many words are missing.
Most (if not all) phrases are missing.
Some connections between senses are missing.
Thesauri work less well for verbs and adjectives, which have
less structured hyponymy relations.
34
DISTRIBUTIONAL MODELS OF MEANING
Also called vector-space models of meaning.
Offer much higher recall than hand-built thesauri.
Although they tend to have lower precision.
Intuition of distributional models of meaning:
If A and B have almost identical environments we say that
they are synonyms.
Remember lecture 6, word embeddings?
35
SYNONYMS IN SIMILAR ENVIRONMENTS
tackle the task, tackle fake news.
deal with the task, deal with fake news.
tackle = deal with
I like chocolate, chocolate is tasty, recipe for chocolate.
I like NLP, NLP is hard, I’m in an NLP lecture.
NLP ≠ chocolate
How do we achieve this?
36
WORD-WORD CO-OCCURRENCE MATRIX
Again, word-word co-occurrence matrix.
37
SAMPLE CONTEXTS: BROWN CORPUS
equal amount of sugar, a sliced lemon, a tablespoonful of apricot preserve or
jam, a pinch each of clove and nutmeg,
on board for their enjoyment. Cautiously she sampled her first pineapple and
another fruit whose taste she likened to that of
of a recursive type well suited to programming on the digital computer. In
finding the optimal R-stage policy from that of
substantially affect commerce, for the purpose of gathering data and
information necessary for the study authorized in the first section of this
38
TERM-CONTEXT MATRIX FOR SIMILARITY
Two words are similar in meaning if their context vectors are similar.
aardvark computer data pinch result sugar …
apricot 0 0 0 1 0 1
pineapple 0 0 0 1 0 1
digital 0 2 1 0 1 0
information 0 1 6 0 4 0
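Vector similarity is typically measured with the cosine of the angle between context vectors; a sketch using the rows of the matrix above:

```python
import math

# Context vectors from the term-context matrix above
# (columns: aardvark, computer, data, pinch, result, sugar).
vectors = {
    "apricot":     [0, 0, 0, 1, 0, 1],
    "pineapple":   [0, 0, 0, 1, 0, 1],
    "digital":     [0, 2, 1, 0, 1, 0],
    "information": [0, 1, 6, 0, 4, 0],
}

def cosine(u, v):
    # cos(u, v) = (u · v) / (|u| |v|)
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

print(round(cosine(vectors["apricot"], vectors["pineapple"]), 2))   # 1.0
print(round(cosine(vectors["digital"], vectors["information"]), 2)) # 0.67
print(round(cosine(vectors["apricot"], vectors["digital"]), 2))     # 0.0
```

With identical contexts (apricot, pineapple) the cosine is 1; with no shared contexts (apricot, digital) it is 0.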
39
POINTWISE MUTUAL INFORMATION
Instead of raw counts, Pointwise Mutual Information (PMI) is often used.
PMI asks: do events x and y co-occur more often than if they were independent?
PMI(x,y) = log2 [ P(x,y) / ( P(x) P(y) ) ]   (Church & Hanks 1989)
Applied to two words:
PMI(word1, word2) = log2 [ P(word1, word2) / ( P(word1) P(word2) ) ]
Positive PMI (PPMI) replaces all PMI values less than 0 with zero.
(Niwa & Nitta 1994)
40
COMPUTING PMI
P(w=information,c=data) = 6/19 = .32
P(w=information) = 11/19 = .58
P(c=data) = 7/19 = .37
p(w,context) p(w)
computer data pinch result sugar
apricot 0.00 0.00 0.05 0.00 0.05 0.11
pineapple 0.00 0.00 0.05 0.00 0.05 0.11
digital 0.11 0.05 0.00 0.05 0.00 0.21
information 0.05 0.32 0.00 0.21 0.00 0.58
p(context) 0.16 0.37 0.11 0.26 0.11
41
COMPUTING PMI
PMI(information, data) = log2( .32 / (.37 × .58) ) = .57
p(w,context) p(w)
computer data pinch result sugar
apricot 0.00 0.00 0.05 0.00 0.05 0.11
pineapple 0.00 0.00 0.05 0.00 0.05 0.11
digital 0.11 0.05 0.00 0.05 0.00 0.21
information 0.05 0.32 0.00 0.21 0.00 0.58
p(context) 0.16 0.37 0.11 0.26 0.11
PPMI(w,context)
computer data pinch result sugar
apricot – – 2.25 – 2.25
pineapple – – 2.25 – 2.25
digital 1.66 0.00 – 0.00 –
information 0.00 0.57 – 0.47 –
pmi_ij = log2 ( p_ij / ( p_i* × p_*j ) )
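Working back from the probability tables (the raw counts below are reconstructed from the probabilities, with N = 19), PPMI can be computed as:

```python
import math

# Raw co-occurrence counts behind the probability tables above (N = 19).
counts = {
    "apricot":     {"pinch": 1, "sugar": 1},
    "pineapple":   {"pinch": 1, "sugar": 1},
    "digital":     {"computer": 2, "data": 1, "result": 1},
    "information": {"computer": 1, "data": 6, "result": 4},
}
N = sum(c for row in counts.values() for c in row.values())  # 19

def p_w(w):
    # Marginal probability of word w (row sum / N).
    return sum(counts[w].values()) / N

def p_c(c):
    # Marginal probability of context c (column sum / N).
    return sum(row.get(c, 0) for row in counts.values()) / N

def ppmi(w, c):
    # PPMI = max(0, log2 P(w,c) / (P(w) P(c))); zero for unseen pairs.
    p_wc = counts[w].get(c, 0) / N
    if p_wc == 0:
        return 0.0
    return max(0.0, math.log2(p_wc / (p_w(w) * p_c(c))))

print(round(ppmi("information", "data"), 2))  # 0.57, as in the table
print(round(ppmi("digital", "computer"), 2))  # 1.66, as in the table
```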
42
WEIGHING PMI
PMI is biased toward infrequent events.
Various weighting schemes help alleviate this, see e.g. Turney and Pantel
(2010)
Add-one smoothing can also help.
43
STATE-OF-THE-ART WORD SIMILARITY
Currently, the state-of-the-art approach to measuring word similarity uses
word embeddings.
See Lecture 6!
44
RELATED TASK: WSD
Word Sense Disambiguation (WSD).
Related task in which we aim to identify which specific sense of a word
is being used in a particular instance in text.
I put my money in the bank → bank1
We took the picture in the east bank of the Nile river → bank2
As with computation of similarity, context can help WSD.
https://www.cs.york.ac.uk/semeval-2013/task12/
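A toy “most gloss overlap wins” disambiguator in the spirit of simplified Lesk; the glosses and stopword list below are made up for illustration, not taken from WordNet:

```python
# Toy sense inventory for "bank" (invented glosses).
SENSES = {
    "bank_1": "a financial institution that accepts deposits of money",
    "bank_2": "sloping land beside a river or lake",
}

STOPWORDS = {"a", "an", "the", "of", "in", "that", "we", "i", "my"}

def disambiguate(sentence):
    # Pick the sense whose gloss shares the most content words
    # with the sentence's context.
    context = set(sentence.lower().split()) - STOPWORDS
    def overlap(sense):
        return len(context & set(SENSES[sense].split()))
    return max(SENSES, key=overlap)

print(disambiguate("I put my money in the bank"))                             # bank_1
print(disambiguate("We took the picture in the east bank of the Nile river"))  # bank_2
```

“money” pulls the first sentence toward the financial sense, “river” pulls the second toward the sloping-land sense; richer context features work the same way at scale.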
EVALUATION
46
TWO TYPES OF EVALUATION
Evaluation can be:
Intrinsic (in-vitro) evaluation.
Extrinsic (in-vivo) evaluation.
47
INTRINSIC EVALUATION
Need a corpus with human-annotated similarity scores.
Correlation between algorithm and human word similarity ratings.
The WordSimilarity-353 Test Collection:
http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/
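Correlation with human ratings is usually reported as Spearman's rank correlation; a self-contained sketch with toy scores (a dataset like WordSim-353 would supply the human column):

```python
# Toy evaluation data: one entry per word pair (no tied values).
human = [10.0, 8.0, 6.0, 4.0, 2.0]   # human similarity ratings
model = [0.9, 0.7, 0.8, 0.3, 0.1]    # model similarity scores

def ranks(xs):
    # Rank 1 = highest value; this simple version assumes no ties.
    order = sorted(range(len(xs)), key=lambda i: -xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    # Spearman's rho = 1 - 6 * sum(d^2) / (n (n^2 - 1)), d = rank differences.
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(round(spearman(human, model), 2))  # 0.9: model mostly preserves the ordering
```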
48
EXTRINSIC EVALUATION
A number of tasks to test with:
Spelling error detection.
Word sense disambiguation.
Taking multiple-choice vocabulary tests (TOEFL/Cambridge).
49
RESOURCES
WordNet web interface:
http://wordnetweb.princeton.edu/perl/webwn
MeSH (Medical Subject Headings) thesaurus:
https://www.ncbi.nlm.nih.gov/mesh
50
REFERENCES
Resnik, P. (1995). Using Information Content to Evaluate Semantic Similarity
in a Taxonomy. In Proceedings of the 14th International Joint Conference on
Artificial Intelligence (IJCAI-95).
Resnik, P. (1999). Semantic similarity in a taxonomy: An information-based
measure and its application to problems of ambiguity in natural language.
Journal of Artificial Intelligence Research, 11, 95-130.
Lin, D. (1998). An information-theoretic definition of similarity. In ICML, pp.
296-304.
Lesk, M. (1986). Automatic sense disambiguation using machine readable
dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of
the 5th annual international conference on Systems documentation (pp. 24-
26). ACM.
51
REFERENCES
Church, K., Gale, W., Hanks, P., & Hindle, D. (1989). Word associations and
typical predicate-argument relations. In Proceedings of the International
Workshop on Parsing Technologies.
Niwa, Y., & Nitta, Y. (1994, August). Co-occurrence vectors from corpora vs.
distance vectors from dictionaries. In Proceedings of the 15th conference on
Computational linguistics-Volume 1 (pp. 304-309). Association for
Computational Linguistics.
Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space
models of semantics. Journal of artificial intelligence research, 37, 141-188.
52
ASSOCIATED READING
Jurafsky, Daniel, and James H. Martin. Speech and Language
Processing: An Introduction to Natural Language Processing, Computational
Linguistics, and Speech Recognition. 3rd edition draft. Chapter 17.