COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
COMP90042
Natural Language Processing
Lecture 9
Semester 1 2021 Week 5
Jey Han Lau
Lexical Semantics
Sentiment Analysis
• Bag of words, kNN classifier. Training data:
‣ “This is a good movie.” → ☺
‣ “This is a great movie.” → ☺
‣ “This is a terrible film.” → ☹
• “This is a wonderful film.” → ?
• Two problems:
‣ The model does not know that “movie” and “film” are synonyms. Since “film” appears only in negative examples, the model learns that it is a negative word.
‣ “wonderful” is not in the vocabulary (OOV – Out-Of-
Vocabulary).
Sentiment Analysis
• Comparing words directly will not work. How can we make sure we compare word meanings instead?
• Solution: add this information explicitly through a
lexical database.
Word Semantics
• Lexical semantics (this lecture)
‣ How the meanings of words connect to one another.
‣ Manually constructed resources: lexical database.
• Distributional semantics (next)
‣ How words relate to each other in the text.
‣ Automatically created resources from corpora.
Outline
• Lexical Database
• Word Similarity
• Word Sense Disambiguation
What Is Meaning?
• A word’s dictionary definition
‣ But dictionary definitions are necessarily circular
‣ Only useful if the meaning is already understood
• A word’s relationships with other words
‣ Also circular, but better suited to text analysis
Definitions
• A word sense describes one aspect of the
meaning of a word
• If a word has multiple senses, it is polysemous
Meaning Through Dictionary
• Gloss: textual definition of a sense, given by a
dictionary
• Bank
‣ financial institution that accepts deposits and
channels the money into lending activities
‣ sloping land (especially the slope beside a body
of water)
Meaning Through Relations
• Another way to define meaning: by looking at how
it relates to other words
• Synonymy: near identical meaning
‣ vomit vs. throw up
‣ big vs. large
• Antonymy: opposite meaning
‣ long vs. short
‣ big vs. little
Meaning Through Relations (2)
• Hypernymy: is-a relation
‣ cat is an animal
‣ mango is a fruit
• Meronymy: part-whole relation
‣ leg is part of a chair
‣ wheel is part of a car
What are the relations for these words?
• dragon and creature
• book and page
• comedy and tragedy
PollEv.com/jeyhanlau569
Meaning Through Relations (3)
(Figure: hypernymy hierarchy over word senses)
WordNet
• A database of lexical relations
• English WordNet includes ~120,000 nouns,
~12,000 verbs, ~21,000 adjectives, ~4,000
adverbs
• On average: a noun has 1.23 senses; a verb 2.16
• WordNets available in most major languages
(www.globalwordnet.org, https://babelnet.org/)
• English version freely available (accessible via
NLTK)
WordNet Example
Synsets
• Nodes of WordNet are not words or lemmas, but senses
• They are represented by sets of synonyms, or synsets
• Bass synsets:
‣ {bass1, deep6}
‣ {bass6, bass voice1, basso2}
• Another synset:
‣ {chump1, fool2, gull1, mark9, patsy1, fall guy1, sucker1,
soft touch1, mug2}
‣ Gloss: a person who is gullible and easy to take
advantage of
Synsets (2)
>>> nltk.corpus.wordnet.synsets('bank')
[Synset('bank.n.01'), Synset('depository_financial_institution.n.01'), Synset('bank.n.03'),
Synset('bank.n.04'), Synset('bank.n.05'), Synset('bank.n.06'), Synset('bank.n.07'),
Synset('savings_bank.n.02'), Synset('bank.n.09'), Synset('bank.n.10'), Synset('bank.v.01'),
Synset('bank.v.02'), Synset('bank.v.03'), Synset('bank.v.04'), Synset('bank.v.05'), Synset('deposit.v.02'),
Synset('bank.v.07'), Synset('trust.v.01')]
>>> nltk.corpus.wordnet.synsets('bank')[0].definition()
u'sloping land (especially the slope beside a body of water)'
>>> nltk.corpus.wordnet.synsets('bank')[1].lemma_names()
[u'depository_financial_institution', u'bank', u'banking_concern', u'banking_company']
Noun Relations in WordNet
Hypernymy Chain
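A hypernymy chain can be read directly off WordNet with NLTK. A minimal sketch, using the financial-institution sense of bank from the earlier example (hypernym_paths() returns every path from the root of the hierarchy down to the synset):

from nltk.corpus import wordnet as wn

# Hypernymy chain for one sense of "bank" (the financial institution)
synset = wn.synset('depository_financial_institution.n.01')
for path in synset.hypernym_paths():   # each path runs from the root down to this synset
    print(' -> '.join(s.name() for s in path))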
Word Similarity
Word Similarity
• Synonymy: film vs. movie
• What about show vs. film? opera vs. film?
• Unlike synonymy (which is a binary relation), word
similarity is a spectrum
• We can use a lexical database (e.g. WordNet) or a thesaurus to estimate word similarity
Word Similarity with Paths
• Given WordNet, find similarity based on path length
• pathlen(c1, c2) = 1 + number of edges in the shortest path between senses c1 and c2
• Similarity between two senses (synsets):
‣ simpath(c1, c2) = 1 / pathlen(c1, c2)
• Similarity between two words:
‣ wordsim(w1, w2) = max_{c1 ∈ senses(w1), c2 ∈ senses(w2)} simpath(c1, c2)
• Remember that a node in the WordNet graph is a synset (sense), not a word!
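A minimal sketch of both formulas with NLTK (path_similarity already computes 1 / pathlen; it returns None when two synsets share no path, e.g. across parts of speech). The exact scores over the full WordNet graph may differ from the toy examples on the next slide:

from nltk.corpus import wordnet as wn

def wordsim(w1, w2):
    # Maximise sense-level path similarity over all sense pairs of the two words
    scores = [s1.path_similarity(s2)
              for s1 in wn.synsets(w1)
              for s2 in wn.synsets(w2)]
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else 0.0

print(wordsim('nickel', 'coin'))
print(wordsim('nickel', 'money'))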
Examples
simpath(nickel, coin) = 1/2 = 0.5
simpath(nickel, currency) = 1/4 = 0.25
simpath(nickel, money) = 1/6 = 0.17
simpath(nickel, Richter scale) = 1/8 = 0.13
(Figure: fragment of the WordNet hierarchy containing these senses. Each node is a synset; for simplicity we show just a representative word.)
simpath(c1, c2) = 1 / pathlen(c1, c2) = 1 / (1 + edgelen(c1, c2))
Beyond Path Length
• simpath(nickel, money) = 0.17
• simpath(nickel, Richter scale) = 0.13
• Problem: edges vary widely in actual semantic distance
‣ Much bigger jumps near the top of the hierarchy
• Solution 1: include depth information (Wu & Palmer)
‣ Use the path to find the lowest common subsumer (LCS)
‣ Compare using depths:
simwup(c1, c2) = 2 × depth(LCS(c1, c2)) / (depth(c1) + depth(c2))
‣ High simwup when the parent (LCS) is deep and the senses are shallow
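NLTK implements Wu & Palmer similarity directly. A small sketch; the synset ids below are assumptions, so verify the sense you want with wn.synsets('nickel'):

from nltk.corpus import wordnet as wn

nickel = wn.synset('nickel.n.02')   # assumed to be the coin sense; check wn.synsets('nickel')
money = wn.synset('money.n.01')
print(nickel.lowest_common_hypernyms(money))   # the LCS used in the formula
print(nickel.wup_similarity(money))            # 2 * depth(LCS) / (depth(c1) + depth(c2))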
Examples
simwup(c1, c2) = 2 × depth(LCS(c1, c2)) / (depth(c1) + depth(c2))
simwup(nickel, money) = 2*2 / (6+3) = 0.44
simwup(dime, Richter scale) = ?
PollEv.com/jeyhanlau569
Examples
simwup(c1, c2) = 2 × depth(LCS(c1, c2)) / (depth(c1) + depth(c2))
simwup(nickel, money) = 2*2 / (6+3) = 0.44
simwup(dime, Richter scale) = 2*1 / (6+3) = 0.22
Abstract Nodes
• But node depth is still a poor semantic distance metric
‣ simwup(nickel, money) = 0.44
‣ simwup(nickel, Richter scale) = 0.22
• Nodes high in the hierarchy are very abstract or general
• How can we capture this better?
Concept Probability Of A Node
• Intuition:
‣ general node → high concept probability (e.g. object)
‣ narrow node → low concept probability (e.g. vocalist)
• Find all the child nodes, and sum up their unigram probabilities!
• child(c): synsets that are children of c
• child(geological-formation) =
{hill, ridge, grotto, coast,
natural elevation, cave, shore}
• child(natural elevation) =
{hill, ridge}
P(c) = Σ_{s ∈ child(c)} count(s) / N   (N = total number of word tokens)
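A tiny self-contained sketch of this computation; the counts and N below are invented for illustration, not real corpus statistics:

# Hypothetical unigram counts for the synsets in the example above
counts = {'hill': 800, 'ridge': 200, 'grotto': 50, 'coast': 900,
          'natural elevation': 10, 'cave': 300, 'shore': 700}
N = 1_000_000   # total number of word tokens in the (hypothetical) corpus

def concept_prob(children):
    # P(c) = sum of counts of the synsets under c, divided by N
    return sum(counts[s] for s in children) / N

print(concept_prob(['hill', 'ridge', 'grotto', 'coast',
                    'natural elevation', 'cave', 'shore']))   # geological-formation: general, higher P(c)
print(concept_prob(['hill', 'ridge']))                        # natural elevation: narrower, lower P(c)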
Example
• Abstract nodes higher in the hierarchy have a higher P(c)
(Figure: hierarchy fragment annotated with P(c) values, e.g. P(entity) = 0.395, P(geological-formation) = 0.00176, P(hill) = 0.0000189, P(coast) = 0.0000216)
Similarity with Information Content
• Use information content (IC) instead of depth (as in simwup):
‣ IC(c) = −log P(c)
‣ general concept → small IC; narrow concept → large IC
• simlin(c1, c2) = 2 × IC(LCS(c1, c2)) / (IC(c1) + IC(c2))
• High simlin when:
‣ the concept of the parent (LCS) is narrow
‣ the concepts of the senses are general
‣ e.g. if the LCS is entity, the numerator is only −2 log(0.395), so similarity is low
• simlin(hill, coast) = 2 × −log P(geological-formation) / (−log P(hill) − log P(coast))
= −2 log(0.00176) / (−log(0.0000189) − log(0.0000216))
Word Sense Disambiguation
Word Sense Disambiguation
• Task: select the correct sense for each word in a sentence
• Baseline:
‣ Assume the most popular sense
• Good WSD potentially useful for many tasks
‣ Knowing which sense of mouse is used in a
sentence is important!
‣ Less popular nowadays, because sense information is implicitly captured by contextual representations (lecture 11)
Supervised WSD
• Apply standard machine learning classifiers
• Feature vectors are typically built from the words and syntax around the target word
‣ But context is ambiguous too!
‣ How big should the context window be? (in practice, small)
• Requires sense-tagged corpora
‣ E.g. SENSEVAL, SEMCOR (available in NLTK)
‣ Very time consuming to create!
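A minimal sketch of this setup, assuming we already have (context, sense) training pairs; the toy contexts and sense labels below are invented, and a bag of context words stands in for richer features:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sense-tagged contexts for "bank" (a real system would use SENSEVAL/SEMCOR annotations)
contexts = [
    'the bank approved my loan application',
    'deposit the cheque at the bank tomorrow',
    'we walked along the bank of the river',
    'fish swam near the muddy bank of the stream',
]
senses = ['bank_financial', 'bank_financial', 'bank_river', 'bank_river']

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(contexts, senses)
print(clf.predict(['she opened an account at the bank']))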
Unsupervised: Lesk
• Lesk: Choose sense whose WordNet gloss overlaps most
with the context
• The bank can guarantee deposits will eventually cover
future tuition costs because it invests in adjustable-rate
mortgage securities.
• bank1: 2 overlapping non-stopwords, deposits and mortgage
• bank2: 0
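NLTK provides an implementation of simplified Lesk. A small sketch on the sentence above; note that NLTK's gloss overlap may not reproduce the manual count on this slide exactly:

from nltk.tokenize import word_tokenize   # requires nltk.download('punkt')
from nltk.wsd import lesk

sent = ('The bank can guarantee deposits will eventually cover future tuition '
        'costs because it invests in adjustable-rate mortgage securities.')
sense = lesk(word_tokenize(sent.lower()), 'bank', pos='n')
print(sense, '-', sense.definition())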
Unsupervised: Clustering
• Gather usages of the word
… a bank is a financial institution that …
… reserve bank, or monetary authority is an institution…
… bed of a river, or stream. The bank consists of the sides …
… right bank to the right. The river channel …
• Perform clustering on context words to learn the
different senses
‣ Rationale: context words of the same sense
should be similar
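A minimal sketch of this idea, clustering bag-of-words representations of the contexts with k-means; the number of clusters is fixed to 2 here purely for illustration:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Contexts in which "bank" is used (taken from the examples above)
usages = [
    'a bank is a financial institution that accepts deposits',
    'reserve bank, or monetary authority is an institution',
    'bed of a river, or stream. The bank consists of the sides',
    'right bank to the right. The river channel',
]
X = TfidfVectorizer(stop_words='english').fit_transform(usages)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # usages with the same label are hypothesised to share a sense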
Unsupervised: Clustering
• Disadvantages:
‣ Sense clusters are not very interpretable
‣ Need to align with dictionary senses
Final Words
• Creation of a lexical database involves expert curation (by linguists)
• Modern methods attempt to derive semantic
information directly from corpora, without human
intervention
• Distributional semantics (next lecture!)
Reading
• JM3 Ch 18-18.4.1