WordNet

Table of contents
  1. Lexical Semantics
  2. Installation
  3. Synsets
    1. Hypernyms
    2. Hyponyms
  4. Shortcomings of Thesaurus

Lexical Semantics

One way to allow computers to understand the semantics of words is to use a thesaurus, and WordNet is one such thesaurus.

The WordNet approach represents the lexical semantics of natural language, that is, how the meaning of words is structured, used, and understood.

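It is important to distinguish lexical semantics from distributional semantics (distributional representations): lexical resources such as WordNet encode word meaning in hand-curated relations, whereas distributional approaches infer meaning from the contexts in which words occur in a corpus.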


Installation

WordNet is available through the NLTK library, which you can install with conda:

conda install nltk

Once the library is installed, download the WordNet data:

import nltk
nltk.download('wordnet')

Download Path

By default, the data is downloaded to /Users/${username}/nltk_data (on macOS; that is, nltk_data in your home directory).

To download the data to a specific path, pass download_dir:

nltk.download('wordnet', download_dir='/path/to/download')

By default, the library searches for data in the following paths (shown here for macOS with a conda environment named nlp):

- '/Users/${username}/nltk_data'
- '/opt/homebrew/Caskroom/miniforge/base/envs/nlp/nltk_data'
- '/opt/homebrew/Caskroom/miniforge/base/envs/nlp/share/nltk_data'
- '/opt/homebrew/Caskroom/miniforge/base/envs/nlp/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
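
A directory passed to download_dir is not necessarily on this search list. The following is a minimal sketch of one way to register such a directory, assuming you append it to nltk.data.path, the list NLTK consults when loading data. The helper names register_data_dir and download_wordnet_to are hypothetical and exist only for this illustration.

```python
import os


def register_data_dir(data_dir, search_path):
    """Append data_dir to search_path if it is missing; return the absolute path."""
    expanded = os.path.abspath(os.path.expanduser(data_dir))
    if expanded not in search_path:
        search_path.append(expanded)
    return expanded


def download_wordnet_to(data_dir):
    """Download WordNet into data_dir and make NLTK search that directory."""
    import nltk  # imported lazily so register_data_dir works without NLTK

    target = register_data_dir(data_dir, nltk.data.path)
    nltk.download('wordnet', download_dir=target)
    return target
```

Calling download_wordnet_to('/path/to/download') before importing the corpus should then let NLTK find the data in that directory.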

Import WordNet:

from nltk.corpus import wordnet as wn

Synsets

WordNet groups words into sets of synonyms called synsets.

A word can belong to several synsets, one for each of its senses:

wn.synsets('dog')
>> [Synset('dog.n.01'),
    Synset('frump.n.01'),
    Synset('dog.n.03'),
    Synset('cad.n.01'),
    Synset('frank.n.02'),
    Synset('pawl.n.01'),
    Synset('andiron.n.01'),
    Synset('chase.v.01')]

You can also query a specific part of speech (POS):

wn.synsets('dog', pos=wn.VERB)
>> [Synset('chase.v.01')]
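
To see what distinguishes the synsets, it can help to list each one's synonyms and gloss. The sketch below relies on the Synset methods name(), pos(), lemma_names(), and definition(); the helper describe_synsets is a hypothetical name used only for illustration.

```python
def describe_synsets(synsets):
    """Return (name, part of speech, lemma names, definition) for each synset."""
    rows = []
    for synset in synsets:
        rows.append((
            synset.name(),         # identifier such as 'dog.n.01'
            synset.pos(),          # 'n' for noun, 'v' for verb, and so on
            synset.lemma_names(),  # the synonyms grouped in this synset
            synset.definition(),   # short gloss describing this sense
        ))
    return rows
```

With WordNet imported as wn, looping over describe_synsets(wn.synsets('dog')) and printing each row shows one line per sense.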

To select a specific synset, look it up by name:

dog = wn.synset('dog.n.01')

Hypernyms

A synset can have hypernyms (more general concepts, i.e. its parents):

dog.hypernyms()
>> [Synset('canine.n.02'), Synset('domestic_animal.n.01')]
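
Because hypernyms() returns only the direct parents, reaching the most general concept means following the relation repeatedly. The sketch below does this under a simplifying assumption: at each step it follows only the first hypernym. The helper hypernym_chain is a hypothetical name, not part of NLTK.

```python
def hypernym_chain(synset):
    """Follow the first hypernym of each synset until a root is reached."""
    chain = [synset.name()]
    seen = {synset.name()}  # guards against revisiting a synset
    current = synset
    while True:
        parents = current.hypernyms()
        if not parents:
            break  # no parents left: this is a root
        current = parents[0]
        if current.name() in seen:
            break
        seen.add(current.name())
        chain.append(current.name())
    return chain
```

With the dog synset selected above, hypernym_chain(dog) returns the synset names from dog.n.01 up to a root. Taking only the first parent is a simplification, since a synset such as dog.n.01 has more than one hypernym.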

Hyponyms

A synset can have hyponyms (more specific concepts, i.e. its children):

dog.hyponyms()
>> [...,
    Synset('corgi.n.01'),
    ...]
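
hyponyms() likewise returns only the direct children. As a rough sketch, the whole subtree below a synset can be collected by visiting hyponyms iteratively; the helper all_hyponyms is a hypothetical name introduced for illustration.

```python
def all_hyponyms(synset):
    """Collect the names of every synset below `synset`, at any depth."""
    found = []
    seen = set()  # a synset can be reachable through more than one parent
    stack = list(synset.hyponyms())
    while stack:
        node = stack.pop()
        if node.name() in seen:
            continue
        seen.add(node.name())
        found.append(node.name())
        stack.extend(node.hyponyms())
    return sorted(found)
```

With the dog synset selected above, len(all_hyponyms(dog)) gives the number of more specific concepts WordNet records beneath it.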

Shortcomings of Thesaurus

  • Hard to maintain/keep up-to-date
  • Managed by humans, which is costly
  • Hard to capture nuances of language
  • Subjective
