LECTURE 11
Word Senses and Similarity
Arkaitz Zubiaga, 14th February, 2018
2
Word Senses: Concepts.
Thesauri: Wordnet.
Thesaurus Methods.
Distributional Models of Similarity.
Evaluation.
LECTURE 11: CONTENTS
WORD SENSES: CONCEPTS
4
Homonymy: same word can have different, unrelated meanings:
I put my money in the bank1.
We took the picture in the east bank2 of the Nile river.
bank1: financial institution.
bank2: sloping land.
HOMONYMY
5
Polysemy: same word, related meanings:
The bank1 was constructed in 1875 out of local red brick.
I withdrew the money from the bank2.
bank1: the building belonging to a financial institution.
bank2: a financial institution.
POLYSEMY
6
Polysemy is often systematic:
Building & organisation:
I’m heading to the university.
The university staff has gone on strike.
Author & work:
Shakespeare wrote nice stuf.
I love reading Shakespeare.
SYSTEMATIC POLYSEMY
7
Synonyms: words with same meaning in some or all contexts.
filbert / hazelnut
couch / sofa
big / large
SYNONYMY
8
But there are few (or no) examples of perfect synonymy.
e.g. big/large:
My big sister… [= older sister]
My large sister… [≠ older sister]
SYNONYMY
9
Antonyms: senses that are opposites with respect to one
feature of meaning, very similar otherwise:
dark/light.
short/long.
fast/slow.
hot/cold.
up/down.
ANTONYMY
10
One sense is a hyponym of another if the first sense is more
specific, denoting a subclass of the other:
car is a hyponym of vehicle
apple is a hyponym of fruit
Conversely hypernym:
vehicle is a hypernym of car
fruit is a hypernym of apple
HYPONYMY AND HYPERNYMY
11
An instance is an individual, a proper noun that is a unique
entity:
London is an instance of city.
But city is a class:
city is a hyponym of municipality or location.
INSTANCES VS HYPONYMS
THESAURI
13
THESAURI
A thesaurus (plural: thesauri) is a reference work that lists words
grouped together according to similarity of meaning.
Useful for different tasks (which we’ll see in next lectures):
Information Extraction.
Information Retrieval.
Question Answering.
Machine Translation.
14
WORDNET 3.1
Wordnet is a popular dataset (thesaurus + aspects of dictionary):
https://wordnet.princeton.edu/
It’s integrated in NLTK:
http://www.nltk.org/howto/wordnet.html
Structured into 117,000 synsets (synonym sets).
Linked to other synsets through “conceptual relations”.
Synsets have brief definition (“gloss”) and example sentences.
15
SENSES OF “BASS” IN WORDNET
16
SYNSETS: SYNONYM SETS
Example: chump as a noun with the gloss:
“a person who is gullible and easy to take advantage of”
This sense of “chump” is shared by 9 words:
chump1, fool2, gull1, mark9, patsy1, fall guy1, sucker1, soft touch1, mug2.
Note: only those particular senses of the words share it; e.g. mug1 is a
“cup” and belongs to a different synset.
17
WORDNET HYPERNYM HIERARCHY FOR “BASS”
18
WORDNET NOUN RELATIONS
THESAURUS METHODS
20
WORD SIMILARITY AND RELATEDNESS
Synonymy is binary.
Similarity is a looser metric: two words are similar when they share
features of meaning.
bank1 is similar to fund3.
bank2 is similar to slope5.
Relatedness measures possible associations.
motorbike and bike are similar.
motorbike and fuel are related, not similar.
21
TWO TYPES OF SIMILARITY ALGORITHMS
Thesaurus-based algorithms:
Are words “nearby” in hypernym hierarchy?
Do words have similar glosses (definitions)?
Distributional algorithms:
Do words have similar distributional contexts?
22
PATH-BASED SIMILARITY
Two senses are similar if there is
a short path between them in the
thesaurus hierarchy
pathlen(c1,c2) = 1 + number of edges in the shortest path between c1 and c2
simpath(c1,c2) = 1 / pathlen(c1,c2)
simpath(nickel, coin) = 1/2 = .5
simpath(nickel, Richter scale) = 1/8 = .125
23
PATH-BASED SIMILARITY
Two senses are similar if there is
a short path between them in the
thesaurus hierarchy
pathlen(c1,c2) = 1 + # of edges in shortest path
simpath(c1,c2) =
simpath(nickel,coin) = 1/2 = .5
simpath(nickel,Richter scale) = 1/8 = .125
Problem: this assumes that every link represents a uniform distance:
pathlen(nickel, money) = 6
pathlen(nickel, standard) = 6
yet nickel is intuitively much closer to money than to standard.
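As a sketch, the path-based measure can be implemented over a small hand-built hierarchy. The node names and edges below are an illustrative fragment modelled on the nickel/Richter-scale example, not a real thesaurus:

```python
from collections import deque

# Toy fragment of a concept hierarchy (illustrative, not WordNet).
EDGES = [
    ("standard", "medium_of_exchange"), ("standard", "scale"),
    ("scale", "richter_scale"),
    ("medium_of_exchange", "money"), ("medium_of_exchange", "currency"),
    ("currency", "coinage"), ("coinage", "coin"),
    ("coin", "nickel"), ("coin", "dime"),
]

# Undirected adjacency list, so paths can go both up and down the tree.
adj = {}
for a, b in EDGES:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

def pathlen(c1, c2):
    """1 + number of edges on the shortest path between c1 and c2 (BFS)."""
    dist = {c1: 0}
    queue = deque([c1])
    while queue:
        node = queue.popleft()
        if node == c2:
            return 1 + dist[node]
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return float("inf")

def sim_path(c1, c2):
    return 1.0 / pathlen(c1, c2)

print(sim_path("nickel", "coin"))           # 1/2 = 0.5
print(sim_path("nickel", "richter_scale"))  # 1/8 = 0.125
```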
24
ALTERNATIVE THESAURUS-BASED SIMILARITY
Thesaurus-based similarity algorithms:
Information content for similarity measurement (Resnik).
Lin similarity function.
The Lesk algorithm.
25
INFORMATION CONTENT
Train by counting in a corpus.
Each instance of hill counts towards
frequency of its hypernyms:
natural elevation, geological formation, etc
words(c): set of words children of node c
words(“geo-formation”) = {hill,ridge,grotto,coast,cave,shore,natural elevation}
words(“natural elevation”) = {hill, ridge}
P(c) = Σ_{w ∈ words(c)} count(w) / N
Probability that a random word in the corpus is an instance of c.
[Figure: fragment of the noun hierarchy — entity → … → geological-formation,
with children natural elevation (hill, ridge), shore (coast), and cave (grotto)]
(Resnik 1995)
26
INFORMATION CONTENT
Information content:
● IC(c) = -log P(c)
Lowest common subsumer:
● LCS(c1,c2)
The most informative (lowest) node
in the hierarchy subsuming both c1 and c2
e.g. LCS(hill, coast) = geological-formation
27
INFORMATION CONTENT FOR SIMILARITY
Intuition: the more two words have in common, the more similar they
are.
Resnik: similarity = information content of the lowest common
subsumer (LCS) of the two nodes.
simresnik(c1,c2) = -log P(LCS(c1,c2))
(Resnik 1995, 1999)
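A minimal sketch of Resnik similarity over a toy taxonomy. The corpus counts and concept groupings below are invented for illustration, loosely following the hill/coast example:

```python
import math

# Invented corpus counts for each word (toy numbers, not from a real corpus).
COUNTS = {"hill": 2, "ridge": 1, "coast": 4, "shore": 1, "grotto": 1, "cave": 1}
N = sum(COUNTS.values())  # 10

# words(c): which words each concept node subsumes.
WORDS = {
    "natural_elevation": {"hill", "ridge"},
    "geological_formation": {"hill", "ridge", "coast", "shore", "grotto", "cave"},
}

def P(c):
    # Probability that a random word in the corpus is an instance of c.
    return sum(COUNTS[w] for w in WORDS[c]) / N

def IC(c):
    # Information content: IC(c) = -log P(c).
    return -math.log2(P(c))

def sim_resnik(lcs):
    # Resnik similarity of two senses = IC of their lowest common subsumer.
    return IC(lcs)

# LCS(hill, ridge) = natural_elevation; LCS(hill, coast) = geological_formation
print(round(sim_resnik("natural_elevation"), 2))  # -log2(3/10) ≈ 1.74
print(sim_resnik("geological_formation"))         # -log2(1) = 0.0
```

Note that two senses whose only common subsumer is the root get similarity 0, since the root subsumes everything and so carries no information.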
28
DEKANG LIN METHOD
Intuition: look at both commonalities and differences to measure
similarity of words A and B.
Commonality: the more A and B have in common, the more
similar they are.
IC(common(A,B))
Difference: the more differences between A and B, the less
similar.
IC(description(A,B)) - IC(common(A,B))
(Lin 1998)
29
DEKANG LIN SIMILARITY THEOREM
Similarity between A and B is measured by the ratio between:
the amount of information needed to state the commonality of
A and B.
the information needed to fully describe what A and B are.
Lin defines IC(common(A,B)) as 2 x information of the LCS.
simLin(c1,c2) = 2 log P(LCS(c1,c2)) / ( log P(c1) + log P(c2) )
simLin(A,B) ∝ IC(common(A,B)) / IC(description(A,B))
30
EXAMPLE OF LIN SIMILARITY FUNCTION
simLin(c1,c2) = 2 log P(LCS(c1,c2)) / ( log P(c1) + log P(c2) )
simLin(hill, coast) = 2 log P(geological-formation) / ( log P(hill) + log P(coast) )
= 2 ln 0.00176 / ( ln 0.0000189 + ln 0.0000216 )
= .59
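The arithmetic above can be checked directly; the base of the logarithm cancels in the ratio, so natural logs are fine:

```python
import math

# Probabilities from the slide's hill/coast example.
p_lcs, p_hill, p_coast = 0.00176, 0.0000189, 0.0000216

# simLin = 2 log P(LCS) / (log P(c1) + log P(c2))
sim_lin = (2 * math.log(p_lcs)) / (math.log(p_hill) + math.log(p_coast))
print(round(sim_lin, 2))  # 0.59
```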
31
THE LESK ALGORITHM
Intuition: A and B are similar if their glosses contain similar words.
Drawing paper: paper that is specially prepared for use in drafting.
Decal: the art of transferring designs from specially prepared paper to a wood or
glass or metal surface.
For each word phrase of length n that appears in both glosses:
Add a score of n² (here: paper and specially prepared give 1² + 2² = 5).
Compute the overlap also for the glosses of hypernyms and hyponyms.
(Lesk 1986)
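A sketch of the gloss-overlap scoring, assuming a greedy longest-phrase-first matching strategy (one plausible reading of the scoring rule, not Lesk's exact procedure):

```python
def lesk_overlap(gloss_a, gloss_b):
    """Score = sum of n^2 over maximal word phrases shared by the two glosses."""
    a, b = gloss_a.lower().split(), gloss_b.lower().split()
    used_a, used_b = [False] * len(a), [False] * len(b)
    score = 0
    # Match the longest shared phrases first, marking words as used so a
    # phrase is not counted again inside a longer match.
    for n in range(min(len(a), len(b)), 0, -1):
        for i in range(len(a) - n + 1):
            if any(used_a[i:i + n]):
                continue
            for j in range(len(b) - n + 1):
                if any(used_b[j:j + n]) or a[i:i + n] != b[j:j + n]:
                    continue
                score += n * n
                used_a[i:i + n] = [True] * n
                used_b[j:j + n] = [True] * n
                break
    return score

drawing_paper = "paper that is specially prepared for use in drafting"
decal = ("the art of transferring designs from specially prepared paper "
         "to a wood or glass or metal surface")
print(lesk_overlap(drawing_paper, decal))  # "specially prepared" + "paper": 4 + 1 = 5
```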
DISTRIBUTIONAL MODELS OF
SIMILARITY
33
THESAURUS-BASED APPROACHES: LIMITATIONS
We don’t have a thesaurus for every language.
Even if we do, they have problems with coverage:
Many words are missing.
Most (if not all) phrases are missing.
Some connections between senses are missing.
Thesauri work less well for verbs and adjectives, which have
less structured hyponymy relations.
34
DISTRIBUTIONAL MODELS OF MEANING
Also called vector-space models of meaning.
Offer much higher recall than hand-built thesauri.
Although they tend to have lower precision.
Intuition of distributional models of meaning:
If A and B have almost identical environments we say that
they are synonyms.
Remember lecture 6, word embeddings?
35
SYNONYMS IN SIMILAR ENVIRONMENTS
tackle the task, tackle fake news.
deal with the task, deal with fake news.
tackle = deal with
I like chocolate, chocolate is tasty, recipe for chocolate.
I like NLP, NLP is hard, I’m in an NLP lecture.
NLP ≠ chocolate
How do we achieve this?
36
WORD-WORD CO-OCCURRENCE MATRIX
Again, word-word co-occurrence matrix.
37
SAMPLE CONTEXTS: BROWN CORPUS
equal amount of sugar, a sliced lemon, a tablespoonful of apricot preserve or
jam, a pinch each of clove and nutmeg,
on board for their enjoyment. Cautiously she sampled her first pineapple and
another fruit whose taste she likened to that of
of a recursive type well suited to programming on the digital computer. In
finding the optimal R-stage policy from that of
substantially affect commerce, for the purpose of gathering data and
information necessary for the study authorized in the first section of this
38
TERM-CONTEXT MATRIX FOR SIMILARITY
Two words are similar in meaning if their context vectors are similar.
aardvark computer data pinch result sugar …
apricot 0 0 0 1 0 1
pineapple 0 0 0 1 0 1
digital 0 2 1 0 1 0
information 0 1 6 0 4 0
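Vector similarity is typically measured with the cosine of the angle between context vectors; a sketch using the rows of the matrix above:

```python
import math

# Context vectors from the term-context matrix above
# (columns: aardvark, computer, data, pinch, result, sugar).
vectors = {
    "apricot":     [0, 0, 0, 1, 0, 1],
    "pineapple":   [0, 0, 0, 1, 0, 1],
    "digital":     [0, 2, 1, 0, 1, 0],
    "information": [0, 1, 6, 0, 4, 0],
}

def cosine(u, v):
    # cos(u, v) = (u · v) / (|u| |v|)
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

print(round(cosine(vectors["apricot"], vectors["pineapple"]), 2))   # 1.0
print(round(cosine(vectors["digital"], vectors["information"]), 2)) # 0.67
print(round(cosine(vectors["apricot"], vectors["digital"]), 2))     # 0.0
```

With identical contexts (apricot, pineapple) the cosine is 1; with no shared contexts (apricot, digital) it is 0.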
39
POINTWISE MUTUAL INFORMATION
Instead of raw counts, Pointwise Mutual Information (PMI) is often used.
PMI asks: do events x and y co-occur more often than if they were independent?
PMI(x,y) = log2 [ P(x,y) / ( P(x) P(y) ) ]   (Church & Hanks 1989)
Applied to two words:
PMI(word1, word2) = log2 [ P(word1, word2) / ( P(word1) P(word2) ) ]
Positive PMI (PPMI) replaces all PMI values less than 0 with zero.
(Niwa & Nitta 1994)
40
COMPUTING PMI
P(w=information,c=data) = 6/19 = .32
P(w=information) = 11/19 = .58
P(c=data) = 7/19 = .37
p(w,context) p(w)
computer data pinch result sugar
apricot 0.00 0.00 0.05 0.00 0.05 0.11
pineapple 0.00 0.00 0.05 0.00 0.05 0.11
digital 0.11 0.05 0.00 0.05 0.00 0.21
information 0.05 0.32 0.00 0.21 0.00 0.58
p(context) 0.16 0.37 0.11 0.26 0.11
41
COMPUTING PMI
PMI(information, data) = log2( .32 / (.37 × .58) ) = .57
p(w,context) p(w)
computer data pinch result sugar
apricot 0.00 0.00 0.05 0.00 0.05 0.11
pineapple 0.00 0.00 0.05 0.00 0.05 0.11
digital 0.11 0.05 0.00 0.05 0.00 0.21
information 0.05 0.32 0.00 0.21 0.00 0.58
p(context) 0.16 0.37 0.11 0.26 0.11
PPMI(w,context)
computer data pinch result sugar
apricot – – 2.25 – 2.25
pineapple – – 2.25 – 2.25
digital 1.66 0.00 – 0.00 –
information 0.00 0.57 – 0.47 –
pmi_ij = log2 ( p_ij / ( p_i* × p_*j ) )
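Working back from the probability tables (the raw counts below are reconstructed from the probabilities, with N = 19), PPMI can be computed as:

```python
import math

# Raw co-occurrence counts behind the probability tables above (N = 19).
counts = {
    "apricot":     {"pinch": 1, "sugar": 1},
    "pineapple":   {"pinch": 1, "sugar": 1},
    "digital":     {"computer": 2, "data": 1, "result": 1},
    "information": {"computer": 1, "data": 6, "result": 4},
}
N = sum(c for row in counts.values() for c in row.values())  # 19

def p_w(w):
    # Marginal probability of word w (row sum / N).
    return sum(counts[w].values()) / N

def p_c(c):
    # Marginal probability of context c (column sum / N).
    return sum(row.get(c, 0) for row in counts.values()) / N

def ppmi(w, c):
    # PPMI = max(0, log2 P(w,c) / (P(w) P(c))); zero for unseen pairs.
    p_wc = counts[w].get(c, 0) / N
    if p_wc == 0:
        return 0.0
    return max(0.0, math.log2(p_wc / (p_w(w) * p_c(c))))

print(round(ppmi("information", "data"), 2))  # 0.57, as in the table
print(round(ppmi("digital", "computer"), 2))  # 1.66, as in the table
```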
42
WEIGHING PMI
PMI is biased toward infrequent events.
Various weighting schemes help alleviate this, see e.g. Turney and Pantel
(2010)
Add-one smoothing can also help.
43
STATE-OF-THE-ART WORD SIMILARITY
Currently, the state-of-the-art approach to measuring word similarity uses
word embeddings.
See Lecture 6!
44
RELATED TASK: WSD
Word Sense Disambiguation (WSD).
Related task in which we aim to identify which specific sense of a word
is being used in a particular instance in text.
I put my money in the bank → bank1
We took the picture in the east bank of the Nile river → bank2
As with computation of similarity, context can help WSD.
https://www.cs.york.ac.uk/semeval-2013/task12/
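A toy “most gloss overlap wins” disambiguator in the spirit of simplified Lesk; the glosses and stopword list below are made up for illustration, not taken from WordNet:

```python
# Toy sense inventory for "bank" (invented glosses).
SENSES = {
    "bank_1": "a financial institution that accepts deposits of money",
    "bank_2": "sloping land beside a river or lake",
}

STOPWORDS = {"a", "an", "the", "of", "in", "that", "we", "i", "my"}

def disambiguate(sentence):
    # Pick the sense whose gloss shares the most content words
    # with the sentence's context.
    context = set(sentence.lower().split()) - STOPWORDS
    def overlap(sense):
        return len(context & set(SENSES[sense].split()))
    return max(SENSES, key=overlap)

print(disambiguate("I put my money in the bank"))                             # bank_1
print(disambiguate("We took the picture in the east bank of the Nile river"))  # bank_2
```

“money” pulls the first sentence toward the financial sense, “river” pulls the second toward the sloping-land sense; richer context features work the same way at scale.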
EVALUATION
46
TWO TYPES OF EVALUATION
Evaluation can be:
Intrinsic (in-vitro) evaluation.
Extrinsic (in-vivo) evaluation.
47
INTRINSIC EVALUATION
Need a corpus with human-annotated similarity scores.
Correlation between algorithm and human word similarity ratings.
The WordSimilarity-353 Test Collection:
http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/
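Correlation with human ratings is usually reported as Spearman's rank correlation; a self-contained sketch with toy scores (a dataset like WordSim-353 would supply the human column):

```python
# Toy evaluation data: one entry per word pair (no tied values).
human = [10.0, 8.0, 6.0, 4.0, 2.0]   # human similarity ratings
model = [0.9, 0.7, 0.8, 0.3, 0.1]    # model similarity scores

def ranks(xs):
    # Rank 1 = highest value; this simple version assumes no ties.
    order = sorted(range(len(xs)), key=lambda i: -xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    # Spearman's rho = 1 - 6 * sum(d^2) / (n (n^2 - 1)), d = rank differences.
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(round(spearman(human, model), 2))  # 0.9: model mostly preserves the ordering
```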
48
EXTRINSIC EVALUATION
A number of tasks to test with:
Spelling error detection.
Word sense disambiguation.
Taking multiple-choice vocabulary tests (TOEFL/Cambridge).
49
RESOURCES
WordNet web interface:
http://wordnetweb.princeton.edu/perl/webwn
MeSH (Medical Subject Headings) thesaurus:
https://www.ncbi.nlm.nih.gov/mesh
50
REFERENCES
Resnik, P. (1995). Using Information Content to Evaluate Semantic Similarity
in a Taxonomy. In Proceedings of the 14th International Joint Conference on
Artificial Intelligence (IJCAI-95).
Resnik, P. (1999). Semantic similarity in a taxonomy: An information-based
measure and its application to problems of ambiguity in natural language.
Journal of Artificial Intelligence Research, 11, 95-130.
Lin, D. (1998). An information-theoretic definition of similarity. In ICML, pp.
296-304.
Lesk, M. (1986). Automatic sense disambiguation using machine readable
dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of
the 5th annual international conference on Systems documentation (pp. 24-
26). ACM.
51
REFERENCES
Church, K., Gale, W., Hanks, P., & Hindle, D. (1989). Word associations and
typical predicate-argument relations. In Proceedings of the International
Workshop on Parsing Technologies.
Niwa, Y., & Nitta, Y. (1994, August). Co-occurrence vectors from corpora vs.
distance vectors from dictionaries. In Proceedings of the 15th conference on
Computational linguistics-Volume 1 (pp. 304-309). Association for
Computational Linguistics.
Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space
models of semantics. Journal of artificial intelligence research, 37, 141-188.
52
ASSOCIATED READING
Jurafsky, Daniel, and James H. Martin. Speech and Language
Processing: An Introduction to Natural Language Processing, Computational
Linguistics, and Speech Recognition. 3rd edition draft. Chapter 17.