程序代写代做代考 scheme information retrieval algorithm lecture11.pptx

lecture11.pptx

LECTURE 11

Word Senses and Similarity

Arkaitz Zubiaga, 14
th
February, 2018

2

 Word Senses: Concepts.

 Thesauri: Wordnet.

 Thesaurus Methods.

 Distributonal Models of Similarity.

 Evaluaton.

LECTURE 11: CONTENTS

WORD SENSES: CONCEPTS

4

 Homonymy: same word can have differen, Snrela,ed meaninges :

 I put my money in the bank
1
.

 We took the picture in the east bank
2
of the Nile river.

 bank
1
: fnancial insttuton.

 bank
2
: sloping land.

HOMONYMY

5

 Polysemy: same word, rela,ed meaninges:

 The bank
1
was constructed in 1875 out of local red brick.

 I withdrew the money from the bank
2
.

 bank
1
: the building belonging to a fnancial insttuton.

 bank
2
: a fnancial insttuton.

POLYSEMY

6

 Polysemy is often sys,emati:

 BSildinge & orgeanisaton:

 I’m heading to the Sniversi,y.

 The Sniversi,y staf has gone on strike.

 AS,hor & work:

 Shakespeare wrote nice stuf.

 I love reading Shakespeare.

SYSTEMATIC POLYSEMY

7

 Synonyms: words with same meaning in some or all contexts.

 flbert hazelnut

 couch sofa

 big large

SYNONYMY

8

 But there are few (or no) examples of perfect synonymy.

 e.g. big large:

 My bige sister… [= older sister]

 My largee sister… [≠ older sister]

SYNONYMY

9

 An,onyms: Senses that are opposi,es wi,h respei, ,o one
fea,Sre of meaninge, very similar otherwise:

 dark light.

 short long.

 fast slow.

 hot cold.

 up down.

ANTONYMY

10

 One sense is a hyponym of another if the frst sense is more
speiifi, denotng a sSbilass of the other:

 car is a hyponym of vehicle

 apple is a hyponym of fruit

 Conversely hypernym:

 vehicle is a hypernym of car

 fruit is a hypernym of apple

HYPONYMY AND HYPERNYMY

11

 An ins,anie is an individual, a proper noun that is a unique
entty:

 London is an ins,anie of city.

 But city is a ilass:

 city is a hyponym of municipality or locaton.

INSTANCES VS HYPONYMS

THESAURI

13

THESAURI

 Thesaurus (plural thesauri) is a reference work that lists words
grouped together according to similarity of meaning.

 Useful for different tasks (which we’ll see in next lectures):
 Information Extraction.
 Information Retrieval.
 Question Answering.
 Machine Translation.

14

WORDNET 3.1

 Wordnet is a popular dataset (thesaurus + aspects of dictionary):
https://wordnet.princeton.edu/

 It’s integrated in NLTK:
http://www.nltk.org/howto/wordnet.html

 Structured into 117,000 synsets (synonym sets).
 Linked to other synsets through “conceptual relations”.
 Synsets have brief definition (“gloss”) and example sentences.

15

SENSES OF “BASS” IN WORDNET

16

SYNSETS: SYNONYM SETS

 Example: chump as a noun with the gloss:
 “a person who is gullible and easy to take advantage of”

 This sense of “chump” is shared by 9 words:

 chump
1
, fool

2
, gull

1
, mark

9
, patsy

1
, fall guy

1
, sucker

1
, soft touch

1
,

mug
2

 Note: only those senses of the words, e.g. mug
1
is a “cup”,

belongs to a different synset.

17

WORDNET HYPERNYM HIERARCHY FOR “BASS”

18

WORDNET NOUN RELATIONS

THESAURUS METHODS

20

WORD SIMILARITY AND RELATEDNESS

 Synonymy is binary.
 Similarity is a looser metric, two words are similar when they share

features of meaning.

 bank
1
is similar to fund

3
.

 bank
2
is similar to slope

5
.

 Relatedness measures possible associations.
 motorbike and bike are similar.
 motorbike and fuel are related, not similar.

21

TWO TYPES OF SIMILARITY ALGORITHMS

 Thesaurus-based algorithms:
 Are words “nearby” in hypernym hierarchy?
 Do words have similar glosses (definitions)?

 Distributional algorithms:
 Do words have similar distributional contexts?

22

PATH-BASED SIMILARITY

 Two senses are similar if there is
a short path between them in the
thesaurus hierarchy

 pathlen(c1,c2) = 1 + # of edges in shortest path

 simpath(c1,c2) =

 simpath(nickel,coin) = 1/2 = .5
 simpath(nickel,Richter scale) = 1/8 = .125

23

PATH-BASED SIMILARITY

 Two senses are similar if there is
a short path between them in the
thesaurus hierarchy

 pathlen(c1,c2) = 1 + # of edges in shortest path

 simpath(c1,c2) =

 simpath(nickel,coin) = 1/2 = .5
 simpath(nickel,Richter scale) = 1/8 = .125

Problem: assumes uniform distance for all
links.
simpath(nickel, money) = 6
simpath(nickel, standard) = 6

24

ALTERNATIVE THESAURUS-BASED SIMILARITY

 Thesaurus-based similarity algorithms:
 Information content for similarity measurement (Resnik).
 Lin similarity function.
 The Lesk algorithm.

25

INFORMATION CONTENT

 Train by counting in a corpus.
 Each instance of hill counts towards

frequency of its hypernyms:
natural elevation, geological formation, etc

 words(c): set of words children of node c

words(“geo-formation”) = {hill,ridge,grotto,coast,cave,shore,natural elevation}
words(“natural elevation”) = {hill, ridge}

P(c)=

count(w)
wÎwords(c)

å

N

Probability that a random word
in the corpus is instance of c

geological-formaton

shore

hill

natural elevaton

coast

cave

grotoridge

entty

(Resnik 1995)

26

INFORMATION CONTENT

 Information content:
● IC(c) = -log P(c)

 Lowest common subsumer:
● LCS(c1,c2)

 The most informative (lowest) node
in the hierarchy subsuming both c1 and c2

e.g. LCS(hill, coast) = geological-formation

27

INFORMATION CONTENT FOR SIMILARITY

 Intuition: the more two words have in common, the more similar they
are.

 Resnik: similarity = information content of the lowest common
subsumer (LCS) of the two nodes.

 sim
resnik

(c1,c2) = -log P(LCS(c1,c2))

(Resnik 1995, 1999)

28

DEKANG LIN METHOD

 Intuition: look at both commonalities and differences to measure
similarity of words A and B.

 Commonality: the more A and B have in common, the more
similar they are.

IC(common(A,B))

 Difference: the more differences between A and B, the less
similar.

IC(description(A,B)-IC(common(A,B))
(Lin 1998)

29

DEKANG LIN SIMILARITY THEOREM

 Similarity between A and B is measured by the ratio between:
 the amount of information needed to state the commonality of

A and B.
 the information needed to fully describe what A and B are.

 Lin defines IC(common(A,B)) as 2 x information of the LCS.

simLin(c1,c2 )=
2logP(LCS(c1,c2 ))
logP(c1)+ logP(c2 )

simLin(A,B)µ
IC(common(A,B))
IC(description(A,B))

30

EXAMPLE OF LIN SIMILARITY FUNCTION

simLin(A,B)=
2logP(LCS(c1,c2 ))
logP(c1)+ logP(c2 )

simLin(hill, coast)=
2 logP(geological-formation)
logP(hill)+ logP(coast)

=
2ln0.00176

ln0.0000189 + ln 0.0000216
=.59

31

THE LESK ALGORITHM

 Intuition: A and B are similar if their glosses contain similar words.

 Drawing paper: paper that is specially prepared for use in drafting.

 Decal: the art of transferring designs from specially prepared paper to a wood or
glass or metal surface.

 For each word phrase of length n that’s in both glosses:

 Add a score of n2 (paper and specially prepared: 12 + 22 = 5).
 Compute overlap also for glosses of hypernyms and hyponyms.

(Lesk 1986)

DISTRIBUTIONAL MODELS OF
SIMILARITY

33

THESAURUS-BASED APPROACHES: LIMITATIONS

 We don’t have a thesaurus for every language.
 Even if we do, they have problems with coverage:

 Many words are missing.
 Most (if not all) phrases are missing.
 Some connections between senses are missing.
 Thesauri work less well for verbs and adjectives, which have

less structured hyponymy relations.

34

DISTRIBUTIONAL MODELS OF MEANING

 Also called vector-space models of meaning.
 Offer much higher recall than hand-built thesauri.

 Although they tend to have lower precision.

 Intuition of distributional models of meaning:
 If A and B have almost identical environments we say that

they are synonyms.
 Remember lecture 6, word embeddings?

35

SYNONYMS IN SIMILAR ENVIRONMENTS

 tackle the task, tackle fake news.
deal with the task, deal with fake news.

tackle = deal with

 I like chocolate, chocolate is tasty, recipe for chocolate.
I like NLP, NLP is hard, I’m in an NLP lecture.

NLP ≠ chocolate

 How do we achieve this?

36

WORD-WORD CO-OCCURRENCE MATRIX

 Again, word-word co-occurrence matrix.

37

SAMPLE CONTEXTS: BROWN CORPUS

 equal amount of sugar, a sliced lemon, a tablespoonful of apricot preserve or
jam, a pinch each of clove and nutmeg,

 on board for their enjoyment. Cautiously she sampled her first pineapple and
another fruit whose taste she likened to that of

 of a recursive type well suited to programming on the digital computer. In
finding the optimal R-stage policy from that of

 substantially affect commerce, for the purpose of gathering data and
information necessary for the study authorized in the first section of this

38

TERM-CONTEXT MATRIX FOR SIMILARITY

 Two words are similar in meaning if their context vectors are similar.

38

aardvark computer data pinch result sugar …

apricot 0 0 0 1 0 1

pineapple 0 0 0 1 0 1

digital 0 2 1 0 1 0

information 0 1 6 0 4 0

39

POINTWISE MUTUAL INFORMATION

 Instead of raw counts, Pointwise Mutual Information (PMI) is often used.
 Do events x and y co-occur more than if they were independent?

 PMI between two words: (Church & Hanks 1989)
 Do words x and y co-occur more than if they were independent?

 Positive PMI between two words (Niwa & Nitta 1994)
 Replace all PMI values less than 0 with zero

PMI(X,Y )=log2
P(x,y)
P(x)P(y)

PMI(word1,word2 )=log2
P(word1,word2)
P(word1)P(word2)

40

COMPUTING PMI

 P(w=information,c=data) = 6/19 = .32
 P(w=information) = 11/19 = .58
 P(c=data) = 7/19 = .37

p(w,context) p(w)
computer data pinch result sugar

apricot 0.00 0.00 0.05 0.00 0.05 0.11
pineapple 0.00 0.00 0.05 0.00 0.05 0.11
digital 0.11 0.05 0.00 0.05 0.00 0.21
information 0.05 0.32 0.00 0.21 0.00 0.58

p(context) 0.16 0.37 0.11 0.26 0.11

41

COMPUTING PMI

PMI(information,data) = log
2
(.32 / (.37 * .58)) = .57

p(w,context) p(w)
computer data pinch result sugar

apricot 0.00 0.00 0.05 0.00 0.05 0.11
pineapple 0.00 0.00 0.05 0.00 0.05 0.11

digital 0.11 0.05 0.00 0.05 0.00 0.21
information 0.05 0.32 0.00 0.21 0.00 0.58

p(context) 0.16 0.37 0.11 0.26 0.11

PPMI(w,context)
computer data pinch result sugar

apricot – – 2.25 – 2.25
pineapple – – 2.25 – 2.25

digital 1.66 0.00 – 0.00 –
information 0.00 0.57 – 0.47 –

pmiij =log2
pij

pi*p* j

42

WEIGHING PMI

 PMI is biased toward infrequent events.
 Various weighting schemes help alleviate this, see e.g. Turney and Pantel

(2010)
 Add-one smoothing can also help.

43

STATE-OF-THE-ART WORD SIMILARITY

 Currently, the state of the art approach for measuring word similarity is using
word embeddings.

 See Lecture 6!

44

RELATED TASK: WSD

 Word Sense Disambiguation (WSD).
 Related task in which we aim to identify which specific sense of a word

is being used in a particular instance in text.

 I put my money in the bank → bank
1

 We took the picture in the east bank of the Nile river → bank2
 As with computation of similarity, context can help WSD.

https://www.cs.york.ac.uk/semeval-2013/task12/

EVALUATION

46

TWO TYPES OF EVALUATION

 Evaluation can be:
 Intrinsic (in-vitro) evaluation.
 Extrinsic (in-vivo) evaluation.

47

INTRINSIC EVALUATION

 Need a corpus with human-annotated similarity scores.

 Correlation between algorithm and human word similarity ratings.

 The WordSimilarity-353 Test Collection:
http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/

48

EXTRINSIC EVALUATION

 A number of tasks to test with:
 Spelling error detection.
 Word sense disambiguation.
 Taking multiple-choice vocabulary tests (TOEFL/Cambridge).

49

RESOURCES

 WordNet web interface:
http://wordnetweb.princeton.edu/perl/webwn

 MeSH (Medical Subject Headings) thesaurus:
https://www.ncbi.nlm.nih.gov/mesh

50

REFERENCES

 Resnik, P. (1995). Using Information Content to Evaluate Semantic Similarity
in a Taxonomy. In Proceedings of the 14th International Joint Conference on
Artificial Intelligence (IJCAI-95).

 Resnik, P. (1999). Semantic similarity in a taxonomy: An information-based
measure and its application to problems of ambiguity in natural language. J.
Artif. Intell. Res.(JAIR), 11, 95-130.

 Lin, D. (1998). An information-theoretic definition of similarity. In ICML, pp.
296-304.

 Lesk, M. (1986). Automatic sense disambiguation using machine readable
dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of
the 5th annual international conference on Systems documentation (pp. 24-
26). ACM.

51

REFERENCES

 Church, K., Gale, W., Hanks, P., & Hindle, D. (1989). Word associations and
typical predicate-argument relations. In In Proceedings of the International
Workshop on Parsing Technologies.

 Niwa, Y., & Nitta, Y. (1994, August). Co-occurrence vectors from corpora vs.
distance vectors from dictionaries. In Proceedings of the 15th conference on
Computational linguistics-Volume 1 (pp. 304-309). Association for
Computational Linguistics.

 Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space
models of semantics. Journal of artificial intelligence research, 37, 141-188.

52

ASSOCIATED READING

 Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language
Processing: An Introduction to Natural Language Processing, Speech
Recognition, and Computational Linguistics. 3rd edition. Chapter 17.