NLP unit 4
• Thesaurus-Based Methods
• Distributional Methods
Thesaurus-Based Methods
Use predefined lexical resources, such as WordNet or Roget’s Thesaurus, to
measure word similarity.
Compute similarity from the lexical relations these resources encode, such as
synonymy, antonymy, and the hypernym/hyponym hierarchy.
Example of Thesaurus-Based Word Similarity:
Using WordNet:
• Word pair: car and truck
• Similarity is calculated based on:
• Path distance in the hierarchy.
• Common parent nodes (hypernyms).
• Semantic relationships like synonymy and hyponymy.
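A minimal sketch of the path-based idea above, using a small hand-made hypernym hierarchy rather than real WordNet data. The HYPERNYM table and the function names are assumptions made for illustration. The score is one common path-based definition: 1 / (1 + number of edges on the shortest path between the two words).

```python
# Toy hypernym hierarchy (child -> parent). Hypothetical data, not taken from WordNet.
HYPERNYM = {
    "car": "motor_vehicle",
    "truck": "motor_vehicle",
    "motor_vehicle": "vehicle",
    "bicycle": "vehicle",
    "vehicle": "entity",
    "dog": "animal",
    "animal": "entity",
}

def ancestors(word):
    """Return the chain [word, parent, grandparent, ..., root]."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def lowest_common_hypernym(a, b):
    """Return the first node on a's chain that also lies on b's chain."""
    b_chain = set(ancestors(b))
    for node in ancestors(a):
        if node in b_chain:
            return node
    return None

def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest path between a and b)."""
    common = lowest_common_hypernym(a, b)
    if common is None:
        return 0.0
    edges = ancestors(a).index(common) + ancestors(b).index(common)
    return 1.0 / (1.0 + edges)

print(lowest_common_hypernym("car", "truck"))
print(path_similarity("car", "truck"))
print(path_similarity("car", "dog"))
```

In this toy hierarchy, car and truck share the parent motor_vehicle and are two edges apart, while car and dog meet only at the root, so the first pair scores higher.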
Advantages of Thesaurus-Based Methods
• Provides human-curated lexical relationships.
• Suitable for understanding semantic word similarity (e.g., synonyms).
• Useful in ontology-based NLP applications.
Distributional Methods
• Based on the Distributional Hypothesis: words that occur in similar
contexts tend to have similar meanings.
• Measure similarity using word co-occurrence statistics in large
corpora.
• Common techniques:
• Bag of Words (BoW)
• Word Embeddings (Word2Vec, GloVe)
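As a rough illustration of co-occurrence statistics, the sketch below builds a context-count vector for each word from a tiny made-up corpus and compares words by cosine similarity. The corpus, the window size of 2, and the function names are assumptions chosen for the example. Real systems use far larger corpora and usually reweight the raw counts.

```python
from collections import Counter, defaultdict
import math

# Tiny made-up corpus, for illustration only.
corpus = [
    "a dog chased the ball",
    "the cat chased the mouse",
    "the dog ate the food",
    "the cat ate the food",
    "the car needs fuel",
]

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vecs = cooccurrence_vectors(corpus)
print("dog vs cat:", cosine(vecs["dog"], vecs["cat"]))
print("dog vs car:", cosine(vecs["dog"], vecs["car"]))
```

Because dog and cat appear next to the same words (chased, ate), their count vectors point in nearly the same direction, while car shares only the function word "the" with them.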
Distributional Word Similarity Techniques
• Bag of Words (BoW):
• Represents a text as a vector of word counts, so each word is described by how often it occurs in each document.
• Limitation: discards word order and local context.
• Word Embeddings (Dense Vectors):
• Techniques: Word2Vec, GloVe, FastText.
• Represent each word as a dense, real-valued vector, typically with a few hundred dimensions (far fewer than the vocabulary-sized sparse vectors of BoW).
• Capture semantic similarity and analogies (e.g., king - man + woman ≈ queen).
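The well-known analogy usually written as king - man + woman ≈ queen can be sketched with vector arithmetic. The three-dimensional vectors below are hand-made, hypothetical values chosen so the arithmetic works out; trained embeddings are learned from a corpus and only approximate this behaviour. The EMB table and the analogy function are names invented for this example.

```python
import math

# Hand-made toy vectors (hypothetical values, not trained embeddings).
# The dimensions loosely stand for: royalty, maleness, femaleness.
EMB = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Solve a - b + c ≈ ? by nearest cosine neighbour, excluding the inputs."""
    target = [x - y + z for x, y, z in zip(EMB[a], EMB[b], EMB[c])]
    candidates = [w for w in EMB if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, EMB[w]))

print(analogy("king", "man", "woman"))
```

Excluding the three input words from the candidates is a common convention, since the nearest vector to the result is otherwise often one of the inputs themselves.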
Example of Distributional Word Similarity
• Word2Vec Embedding:
• The similarity between dog and cat is measured by how close their vectors lie in the embedding
space, typically using cosine similarity.
• Embeddings are often visualized as points, usually after projecting to two dimensions, where semantically
related words cluster together.
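A small sketch of vector proximity and clustering, assuming hand-made four-dimensional vectors in place of trained Word2Vec embeddings. The values in EMB and the nearest helper are hypothetical and serve only to show the computation.

```python
import math

# Hypothetical toy embeddings; trained Word2Vec vectors are learned from a
# corpus and usually have hundreds of dimensions.
EMB = {
    "dog":   [0.8, 0.7, 0.1, 0.0],
    "cat":   [0.7, 0.8, 0.2, 0.0],
    "puppy": [0.9, 0.6, 0.1, 0.1],
    "car":   [0.1, 0.0, 0.9, 0.8],
    "truck": [0.0, 0.1, 0.8, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, k=2):
    """Return the k other words closest to `word` by cosine similarity."""
    others = [w for w in EMB if w != word]
    return sorted(others, key=lambda w: cosine(EMB[word], EMB[w]), reverse=True)[:k]

print("dog vs cat:", cosine(EMB["dog"], EMB["cat"]))
print("dog vs car:", cosine(EMB["dog"], EMB["car"]))
print("nearest to dog:", nearest("dog"))
print("nearest to car:", nearest("car", k=1))
```

With these toy values the animal words end up as each other's nearest neighbours and the vehicle words form a separate group, which is the clustering effect described above.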