Lemmatization

  • Definition: Lemmatization reduces a word to its base or dictionary form (the lemma). Unlike stemming, it relies on a vocabulary, morphological analysis, and part-of-speech (POS) tags.

  • It produces real words (not chopped forms).

  • Example:

    • Stemming: studies → studi

    • Lemmatization: studies → study
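
To make the contrast concrete, here is a minimal toy sketch. It is not the Porter stemmer or the WordNet lemmatizer: `crude_stem` chops suffixes by rule, while `toy_lemmatize` looks words up in a tiny hand-written table. The function names and the `LEMMA_DICT` table are hypothetical, made up for illustration only.

```python
# Toy illustration only: real stemmers and lemmatizers are far more elaborate.

# Hypothetical mini-dictionary mapping inflected forms to their lemmas.
LEMMA_DICT = {"studies": "study", "children": "child", "cats": "cat"}


def crude_stem(word: str) -> str:
    """Chop common suffixes by rule, without checking the result is a real word."""
    if word.endswith("ies"):
        return word[:-2]  # drops "es", so "studies" becomes "studi"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]  # drops the plural "s"
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMA_DICT.get(word, word)


for w in ["studies", "cats", "children"]:
    print(w, "| stem:", crude_stem(w), "| lemma:", toy_lemmatize(w))
```

The rule-based function has no way to handle an irregular plural such as "children", whereas the dictionary lookup returns "child" because the form is listed explicitly.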


🔹 Why is Lemmatization Better than Stemming?

  1. Valid Words: Lemmas are dictionary words.

  2. POS-Aware: The lemmatizer needs to know whether the word is a noun, verb, adjective, etc. Example:

    • “better” →

      • As an adjective: good

      • As a verb: better

  3. Context-Sensitive: Uses linguistic rules to avoid incorrect chopping.
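
The POS dependence can be sketched as a lookup keyed on both the word and its tag. This is a toy under stated assumptions: `LEMMAS_BY_POS` and `lemma_for` are hypothetical names, the coarse tags (`ADJ`, `VERB`, `NOUN`) are chosen for readability, and the "meeting" entries are an extra illustration not taken from the text above.

```python
# Hypothetical (word, POS) -> lemma table; a real lemmatizer uses a full lexicon.
LEMMAS_BY_POS = {
    ("better", "ADJ"): "good",      # "a better plan"
    ("better", "VERB"): "better",   # "to better oneself"
    ("meeting", "NOUN"): "meeting", # "a long meeting"
    ("meeting", "VERB"): "meet",    # "we are meeting today"
}


def lemma_for(word: str, pos: str) -> str:
    """Return the lemma for a (word, POS) pair, or the word unchanged if unknown."""
    return LEMMAS_BY_POS.get((word.lower(), pos), word)


print(lemma_for("better", "ADJ"))
print(lemma_for("better", "VERB"))
```

The same surface form maps to different lemmas depending on the tag, which is why a lemmatizer called without POS information can return the wrong base form.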


🔹 Process of Lemmatization

  1. Tokenization → Split text into words/sentences.

  2. POS Tagging → Assign part of speech to each word.

  3. Lemmatization → Use dictionary (WordNet in NLTK or spaCy lexicon) to get lemma.
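
The three steps above can be sketched with the standard library alone. This is only a toy pipeline: the regex tokenizer and the `TOY_TAGS` and `TOY_LEMMAS` tables are hypothetical stand-ins for a trained tagger and a real lexicon such as WordNet, and the short input sentence is chosen for illustration.

```python
import re

# Hypothetical lookup tables standing in for a trained POS tagger and a lexicon.
TOY_TAGS = {"the": "DET", "children": "NOUN", "were": "VERB", "playing": "VERB"}
TOY_LEMMAS = {
    ("children", "NOUN"): "child",
    ("were", "VERB"): "be",
    ("playing", "VERB"): "play",
}


def tokenize(text):
    """Step 1: split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tag(tokens):
    """Step 2: attach a coarse POS tag to each token ('X' when unknown)."""
    return [(tok, TOY_TAGS.get(tok.lower(), "X")) for tok in tokens]


def lemmatize(tagged):
    """Step 3: look up (token, tag); fall back to the lowercased token."""
    return [TOY_LEMMAS.get((tok.lower(), pos), tok.lower()) for tok, pos in tagged]


tokens = tokenize("The children were playing.")
tagged = tag(tokens)
lemmas = lemmatize(tagged)
print(lemmas)
```

Each stage consumes the output of the previous one, so a tagging mistake in step 2 carries straight through to the lemma chosen in step 3.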


🔹 Example

Sentence: 👉 “The cats are sitting outside, and the children were playing happily.”

  • cats → cat

  • sitting → sit

  • children → child

  • playing → play

  • happily → happily (adverbs often remain unchanged)


🔹 Tools for Lemmatization

  • NLTK WordNet Lemmatizer (basic, needs POS tags for accuracy).

  • spaCy lemmatizer (runs inside a full pipeline that predicts POS tags for you, so no manual tag mapping is needed).


🔹 When to Use Lemmatization

  • ✅ When meaning and grammar matter (chatbots, translation, search engines).

  • ✅ For semantic NLP tasks: text classification, QA, summarization.

  • ❌ Stemming may be enough for speed-focused tasks like keyword extraction.


In short:

  • Stemming → Fast but crude chopping (connection → connect).

  • Lemmatization → Slower but linguistically accurate (better → good).

!python -m spacy download en_core_web_sm
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
# Demonstration of Lemmatization using NLTK and spaCy

# Import necessary libraries
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import spacy

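# Download required resources for NLTK
nltk.download('punkt')                           # tokenizer models (older NLTK releases)
nltk.download('punkt_tab')                       # tokenizer tables expected by newer NLTK releases
nltk.download('averaged_perceptron_tagger')      # POS tagger (older NLTK releases)
nltk.download('averaged_perceptron_tagger_eng')  # POS tagger (newer NLTK releases)
nltk.download('wordnet')                         # WordNet lexicon used by the lemmatizer
nltk.download('omw-1.4')                         # Open Multilingual WordNet data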


# Function to convert POS tags to WordNet format for lemmatization
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# Sample sentence
sentence = "The cats are sitting outside, and the children were playing happily."

# Tokenize and POS tagging
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)

# Initialize WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Apply lemmatization with POS tags
lemmatized_words_nltk = [lemmatizer.lemmatize(word, get_wordnet_pos(pos)) for word, pos in pos_tags]

# Now using spaCy for comparison
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
lemmatized_words_spacy = [token.lemma_ for token in doc]

tokens, lemmatized_words_nltk, lemmatized_words_spacy
(['The', 'cats', 'are', 'sitting', 'outside', ',', 'and', 'the', 'children', 'were', 'playing', 'happily', '.'],
 ['The', 'cat', 'be', 'sit', 'outside', ',', 'and', 'the', 'child', 'be', 'play', 'happily', '.'],
 ['the', 'cat', 'be', 'sit', 'outside', ',', 'and', 'the', 'child', 'be', 'play', 'happily', '.'])