Lexical Semantics
In this notebook we will use NLTK to access WordNet, look at some senses and lexical relations, and find paths between words. First, let’s load NLTK and make sure WordNet is accessible.
In [1]:
from nltk.corpus import wordnet as wn
print(wn.readme(lang="eng"))
This is the README file for WordNet 3.0
1. About WordNet
WordNet was developed at Princeton University’s Cognitive Science
Laboratory under the direction of George Miller, James S. McDonnell
Distinguished University Professor of Psychology, Emeritus. Over the
years many linguists, lexicographers, students, and software engineers
have contributed to the project.
WordNet is an online lexical reference system. Word forms in WordNet
are represented in their familiar orthography; word meanings are
represented by synonym sets (synsets) – lists of synonymous word forms
that are interchangeable in some context. Two kinds of relations are
recognized: lexical and semantic. Lexical relations hold between word
forms; semantic relations hold between word meanings.
To learn more about WordNet, the book “WordNet: An Electronic Lexical
Database,” containing an updated version of “Five Papers on WordNet”
and additional papers by WordNet users, is available from MIT Press:
http://mitpress.mit.edu/book-home.tcl?isbn=026206197X
2. The WordNet Web Site
We maintain a Web site at:
http://wordnet.princeton.edu
Information about WordNet, access to our online interface, and the
various WordNet packages that you can download are available from our
web site. All of the software documentation is available online, as
well as a FAQ. On this site we also have information about other
applications that use WordNet. If you have an application that you
would like included, please send e-mail to the above address.
3. Contacting Us
Ongoing deveopment work and WordNet related projects are done by a
small group of researchers, lexicographers, and systems programmers.
Since our resources are VERY limited, we request that you please
confine correspondence to WordNet topics only. Please check the
documentation, FAQ, and other resources for the answer to your
question or problem before contacting us.
If you have trouble installing or downloading WordNet, have a bug to
report, or any other problem, please refer to the online FAQ file
first. If you can heal thyself, please do so. The FAQ will be
updated over time. And if you do find a previously unreported
problem, please use our Bug Report Form:
http://wordnet.princeton.edu/cgi-bin/bugsubmit.pl
When reporting a problem, please be as specific as possible, stating
the computer platform you are using, which interface you are using,
and the exact error. The more details you can provide, the more
likely it is that you will get an answer.
There is a WordNet user discussion group mailing list that we invite
our users to join. Users use this list to ask questions of one
another, announce extensions to WordNet that they’ve developed, and
other topics of general usefulness to the user community.
Information on joining the user discussion list, reporting bugs and other
contact information is in found on our website at:
http://wordnet.princeton.edu/contact
4. Current Release
WordNet Version 3.0 is the latest version available for download. Two
basic database packages are available – one for Windows and one for
Unix platforms (including Mac OS X). See the file ChangeLog (Unix) or
CHANGES.txt (Windows) for a list of changes from previous versions.
WordNet packages can either be downloaded from our web site via:
http://wordnet.princeton.edu/obtain
The Windows package is a self-extracting archive that installs itself
when you double-click on it.
Beginning with Version 2.1, we changed the Unix package to a GNU Autotools
package. The WordNet browser makes use of the open source Tcl and Tk
packages. Many systems come with either or both pre-installed. If
your system doesn’t (some systems have Tcl installed, but not Tk)
Tcl/Tk can be downloaded from:
http://www.tcl.tk/
Tcl and Tk must be installed BEFORE you compile WordNet. You must also
have a C compiler before installing Tcl/Tk or WordNet. WordNet has
been built and tested with the GNU gcc compiler. This is
pre-installed on most Unix systems, and can be downloaded from:
http://gcc.gnu.org/
See the file INSTALL for detailed WordNet installation instructions.
As mentioned in lecture, the main nodes in WordNet are synsets, not words. Given any word, we can access its synsets using the synsets method, optionally limiting the results to a particular word category (n = noun, v = verb, a = adjective, r = adverb). For each of the noun synsets of the word “book”, let’s look at the definition, the corresponding lemmas, examples of usage, and the hypernyms (often only one, but there can be several).
In [2]:
for synset in wn.synsets("book", "n"):
    print(synset.name())
    print(synset.definition())
    print(synset.lemma_names())
    print(synset.examples())
    print(synset.hypernyms())
    print("-------")
book.n.01
a written work or composition that has been published (printed on pages bound together)
['book']
['I am reading a good book on economics']
[Synset('publication.n.01')]
-------
book.n.02
physical objects consisting of a number of pages bound together
['book', 'volume']
['he used a large book as a doorstop']
[Synset('product.n.02')]
-------
record.n.05
a compilation of the known facts regarding something or someone
['record', 'record_book', 'book']
["Al Smith used to say, `Let's look at the record'", 'his name is in all the record books']
[Synset('fact.n.02')]
-------
script.n.01
a written version of a play or other dramatic composition; used in preparing for a performance
['script', 'book', 'playscript']
[]
[Synset('dramatic_composition.n.01')]
-------
ledger.n.01
a record in which commercial accounts are recorded
['ledger', 'leger', 'account_book', 'book_of_account', 'book']
['they got a subpoena to examine our books']
[Synset('record.n.07')]
-------
book.n.06
a collection of playing cards satisfying the rules of a card game
['book']
[]
[Synset('collection.n.01')]
-------
book.n.07
a collection of rules or prescribed standards on the basis of which decisions are made
['book', 'rule_book']
['they run things by the book around here']
[Synset('collection.n.01')]
-------
koran.n.01
the sacred writings of Islam revealed by God to the prophet Muhammad during his life at Mecca and Medina
['Koran', 'Quran', "al-Qur'an", 'Book']
[]
[]
-------
bible.n.01
the sacred writings of the Christian religions
['Bible', 'Christian_Bible', 'Book', 'Good_Book', 'Holy_Scripture', 'Holy_Writ', 'Scripture', 'Word_of_God', 'Word']
['he went to carry the Word to the heathen']
[Synset('sacred_text.n.01')]
-------
book.n.10
a major division of a long written composition
['book']
['the book of Isaiah']
[Synset('section.n.01')]
-------
book.n.11
a number of sheets (ticket or stamps etc.) bound together on one edge
['book']
['he bought a book of stamps']
[Synset('product.n.02')]
-------
We can see here why WordNet is sometimes seen as too fine-grained, particularly for word sense disambiguation: several of these senses are closely related in meaning. In any case, once we know its name, we can access a particular synset with the synset method and look at other relations, such as hyponyms, meronyms and holonyms. Note that meronyms and holonyms come in three types (part, member and substance), though we’ll only look at part here.
In [3]:
print(wn.synset("book.n.02").hyponyms())
print(wn.synset("book.n.02").part_meronyms())
print(wn.synset("book.n.10").part_holonyms()) # "book" meaning a division of a text
[Synset('album.n.02'), Synset('coffee-table_book.n.01'), Synset('folio.n.03'), Synset('hardback.n.01'), Synset('journal.n.04'), Synset('notebook.n.01'), Synset('novel.n.02'), Synset('order_book.n.02'), Synset('paperback_book.n.01'), Synset('picture_book.n.01'), Synset('sketchbook.n.01')]
[Synset('binding.n.05'), Synset('fore_edge.n.01'), Synset('spine.n.04')]
[Synset('text.n.01')]
Each synset has a set of lemmas associated with it. Since antonyms are often specific to the word form, they are defined on lemmas, not synsets. Another method, derivationally_related_forms, gives other lemmas which are related by derivational morphology, though its coverage is not comprehensive. Finally, lemmas have a count associated with them, derived from a sense-tagged corpus; these counts can be used to identify which senses of a word are more common.
In [4]:
print(wn.synsets("happy")[0])
print(wn.synsets("happy")[0].lemmas()[0].antonyms())
print(wn.synsets("happy")[0].lemmas()[0].derivationally_related_forms())
print(wn.synsets("happy")[0].lemmas()[0].count())
Synset('happy.a.01')
[Lemma('unhappy.a.01.unhappy')]
[Lemma('happiness.n.01.happiness'), Lemma('happiness.n.02.happiness')]
37
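One way such counts might be used is sketched below on made-up numbers rather than real WordNet data, so that the block runs without the corpus. The sense names, the counts, and the helper names rank_senses and drop_rare_senses are all hypothetical. With NLTK, the pairs could instead be built by calling count() on the relevant lemma of each synset.

```python
# Hypothetical (sense name, count) pairs; the numbers are made up for illustration.
sense_counts = [("sense_a", 37), ("sense_b", 2), ("sense_c", 0), ("sense_d", 11)]

def rank_senses(pairs):
    """Sort senses from most to least frequent."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

def drop_rare_senses(pairs, min_count=1):
    """Keep only the names of senses seen at least min_count times."""
    return [name for name, count in pairs if count >= min_count]

print(rank_senses(sense_counts)[0][0])              # sense_a
print(drop_rare_senses(sense_counts, min_count=5))  # ['sense_a', 'sense_d']
```

The threshold is a judgment call: a count of zero only means the sense did not occur in the tagged corpus, not that it never occurs.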
All of the basic similarity measures mentioned in class (and several others) are available in the NLTK WordNet interface, as are other functions used in their derivation. For similarity metrics that require information content, we can load statistics from available corpora (the SemCor and Brown corpora are popular options).
In [5]:
from nltk.corpus import wordnet_ic
import nltk
nltk.download('wordnet_ic')
print(wn.synset("book.n.02").path_similarity(wn.synset("newspaper.n.03")))
print(wn.synset("book.n.02").wup_similarity(wn.synset("newspaper.n.03")))
semcor_ic = wordnet_ic.ic('ic-semcor.dat')
print(wn.synset("book.n.02").lin_similarity(wn.synset("newspaper.n.03"),semcor_ic))
[nltk_data] Downloading package wordnet_ic to
[nltk_data] /Users/laujh/nltk_data…
[nltk_data] Unzipping corpora/wordnet_ic.zip.
0.3333333333333333
0.875
0.5763952661933001
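To make the first two scores less mysterious, here is a minimal sketch of path similarity and Wu-Palmer similarity on a toy hierarchy. It assumes a strict tree (one parent per node) with made-up node names, and counts depth in nodes with the root at depth 1. WordNet itself allows multiple hypernyms, so NLTK's implementation has to do more work than this.

```python
# Toy is-a hierarchy (child -> parent); the node names are made up for illustration.
PARENT = {
    "artifact": "entity",
    "product": "artifact",
    "book": "product",
    "newspaper": "product",
    "tool": "artifact",
    "hammer": "tool",
}

def ancestors(node):
    """Return [node, parent, ..., root]."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):  # walks upward, so the first hit is the deepest
        if node in seen:
            return node
    return None

def depth(node):
    """Number of nodes on the chain from the root down to node (root = 1)."""
    return len(ancestors(node))

def path_similarity(a, b):
    """1 / (number of edges on the shortest path + 1)."""
    lcs = lowest_common_subsumer(a, b)
    edges = (depth(a) - depth(lcs)) + (depth(b) - depth(lcs))
    return 1.0 / (edges + 1)

def wup_similarity(a, b):
    """Wu-Palmer: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))

print(path_similarity("book", "newspaper"))  # two edges via "product"
print(wup_similarity("book", "newspaper"))   # 2 * 3 / (4 + 4) = 0.75
print(path_similarity("book", "hammer"))     # four edges via "artifact" -> 0.2
```

In a tree the shortest path always passes through the lowest common subsumer, which is why both measures can share the same helper here.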
However, the built-in measures are somewhat opaque in their operation, and only work on synsets. Let’s create a version of basic path distance which doesn’t require you to select a specific synset in advance, and which shows the exact path through the WordNet hierarchy that the score is based on. There are many ways to do this; we’ll do it in a fairly clear but not entirely optimal way. First, given a set of synsets, let’s build a dictionary whose keys are all the hypernym synsets reachable from them, and whose values are the next step down on the shortest path to one of the initial synsets.
In [6]:
def get_hypernym_path_dict(synsets):
    hypernym_dict = {}
    synsets_to_expand = synsets
    while synsets_to_expand:
        new_synsets_to_expand = set()
        for synset in synsets_to_expand:
            for hypernym in synset.hypernyms():
                if hypernym not in hypernym_dict: # this ensures we get the shortest path
                    hypernym_dict[hypernym] = synset
                    new_synsets_to_expand.add(hypernym)
        synsets_to_expand = new_synsets_to_expand
    return hypernym_dict

hypernym_dict = get_hypernym_path_dict(wn.synsets("book","n"))
print(hypernym_dict)
{Synset('publication.n.01'): Synset('book.n.01'), Synset('product.n.02'): Synset('book.n.02'), Synset('fact.n.02'): Synset('record.n.05'), Synset('dramatic_composition.n.01'): Synset('script.n.01'), Synset('record.n.07'): Synset('ledger.n.01'), Synset('collection.n.01'): Synset('book.n.06'), Synset('sacred_text.n.01'): Synset('bible.n.01'), Synset('section.n.01'): Synset('book.n.10'), Synset('writing.n.02'): Synset('sacred_text.n.01'), Synset('creation.n.02'): Synset('product.n.02'), Synset('group.n.01'): Synset('collection.n.01'), Synset('music.n.01'): Synset('section.n.01'), Synset('work.n.02'): Synset('publication.n.01'), Synset('document.n.03'): Synset('record.n.07'), Synset('information.n.01'): Synset('fact.n.02'), Synset('artifact.n.01'): Synset('creation.n.02'), Synset('auditory_communication.n.01'): Synset('music.n.01'), Synset('communication.n.02'): Synset('document.n.03'), Synset('abstraction.n.06'): Synset('group.n.01'), Synset('written_communication.n.01'): Synset('writing.n.02'), Synset('message.n.02'): Synset('information.n.01'), Synset('entity.n.01'): Synset('abstraction.n.06'), Synset('whole.n.02'): Synset('artifact.n.01'), Synset('object.n.01'): Synset('whole.n.02'), Synset('physical_entity.n.01'): Synset('object.n.01')}
We also need a way to build the path using this information.
In [7]:
def get_path_using_hypernym_dict(hypernym,hypernym_dict,synsets):
    path = [hypernym]
    current_synset = hypernym_dict[hypernym]
    while current_synset not in synsets:
        path.append(current_synset)
        current_synset = hypernym_dict[current_synset]
    path.append(current_synset)
    return path

print(get_path_using_hypernym_dict(wn.synset('physical_entity.n.01'),hypernym_dict,wn.synsets("book","n")))
[Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('creation.n.02'), Synset('product.n.02'), Synset('book.n.02')]
Now we can build ancestor dictionaries for each of the words, look at the intersection, and then find the shortest path.
In [8]:
def get_shortest_path_between(word1,word2):
    synsets1 = wn.synsets(word1)
    synsets2 = wn.synsets(word2)
    # added these two lines to catch situation where word1 and word2 share a synset
    match = set(synsets1).intersection(set(synsets2))
    if match: return [list(match)[0]]

    hypernym_dict1 = get_hypernym_path_dict(synsets1)
    hypernym_dict2 = get_hypernym_path_dict(synsets2)
    best_path = []
    for hypernym in hypernym_dict1:
        if hypernym in hypernym_dict2 and hypernym_dict1[hypernym] != hypernym_dict2[hypernym]:
            path1 = get_path_using_hypernym_dict(hypernym,hypernym_dict1,synsets1)
            path2 = get_path_using_hypernym_dict(hypernym,hypernym_dict2,synsets2)
            if not best_path or len(path1) + len(path2) - 1 < len(best_path):
                path1.reverse()
                best_path = path1 + path2[1:]
    return best_path

path = get_shortest_path_between("book","newspaper")
print(1.0/len(path))
print(path)
path = get_shortest_path_between("dog","cat")
print(1.0/len(path))
print(path)
# to see that the last synset includes the "cat" lemma
print(path[-1].lemma_names())
path = get_shortest_path_between("nickel","money")
print(1.0/len(path))
print(path)
path = get_shortest_path_between("computer","pizza")
print(1.0/len(path))
print(path)
path = get_shortest_path_between("film","movie")
print(1.0/len(path))
print(path)
0.3333333333333333
[Synset('book.n.02'), Synset('product.n.02'), Synset('newspaper.n.03')]
0.2
[Synset('dog.n.03'), Synset('chap.n.01'), Synset('male.n.02'), Synset('man.n.01'), Synset('guy.n.01')]
['guy', 'cat', 'hombre', 'bozo']
0.2
[Synset('nickel.n.02'), Synset('coin.n.01'), Synset('coinage.n.01'), Synset('currency.n.01'), Synset('money.n.03')]
0.09090909090909091
[Synset('calculator.n.01'), Synset('expert.n.01'), Synset('person.n.01'), Synset('causal_agent.n.01'), Synset('physical_entity.n.01'), Synset('matter.n.03'), Synset('substance.n.07'), Synset('food.n.01'), Synset('nutriment.n.01'), Synset('dish.n.02'), Synset('pizza.n.01')]
1.0
[Synset('movie.n.01')]
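The calls above compute 1.0/len(path) directly, which raises ZeroDivisionError whenever get_shortest_path_between returns an empty list, for example when one of the words is not in WordNet and has no synsets. A small guard avoids that; path_score is a hypothetical helper name, and returning None for "no path" is just one possible convention.

```python
def path_score(path):
    """Return 1 / len(path), or None when no path was found."""
    if not path:
        return None
    return 1.0 / len(path)

print(path_score(["a", "b", "c", "d", "e"]))  # 0.2
print(path_score([]))                         # None
```

It can be used as path_score(get_shortest_path_between(word1, word2)) in place of the bare division.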
The shortest path does not always correspond to the most obvious relationship between two words: for instance, newspaper and book are joined as products (not reading materials), and dog and cat by informal senses referring to people rather than animals. Metrics based on depth and information content can improve this situation. Another approach is to use lemma counts to ignore rare senses. Note that doing all this for other metrics is somewhat different, because they are based on the idea of the lowest common subsumer, which is not necessarily on the shortest path.
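As a rough illustration of an information-content metric built on the lowest common subsumer, the sketch below computes the standard Lin formula, 2 * IC(lcs) / (IC(a) + IC(b)) with IC(c) = -log p(c), on a toy tree. The node names and frequencies are made up, and the details (a single parent per node, no smoothing) are simplifying assumptions, not a description of how NLTK computes lin_similarity.

```python
import math

# Toy is-a tree (child -> parent) and made-up corpus counts for each node itself.
PARENT = {"product": "entity", "idea": "entity", "book": "product", "newspaper": "product"}
OWN_COUNT = {"entity": 0, "product": 0, "idea": 10, "book": 5, "newspaper": 5}

def ancestors(node):
    """Return [node, parent, ..., root]."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def subtree_count(node):
    """Occurrences of node plus all of its descendants."""
    return sum(count for name, count in OWN_COUNT.items() if node in ancestors(name))

TOTAL = subtree_count("entity")  # every occurrence falls under the root

def information_content(node):
    """IC(c) = -log p(c), written as log(TOTAL / count); the root has IC 0."""
    return math.log(TOTAL / subtree_count(node))

def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    return None

def lin_similarity(a, b):
    """Lin: 2 * IC(lcs) / (IC(a) + IC(b)); undefined if both have IC 0."""
    lcs = lowest_common_subsumer(a, b)
    return 2.0 * information_content(lcs) / (information_content(a) + information_content(b))

print(round(lin_similarity("book", "newspaper"), 3))  # 0.5: they share the informative node "product"
print(lin_similarity("book", "idea"))                 # 0.0: they only share the root
```

Unlike path length, the score depends on how informative the shared ancestor is, so two words that meet only at a very general node score low however short the path.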