Lemmatization
Definition: Lemmatization reduces a word to its base or dictionary form (the lemma). Unlike stemming, it relies on a vocabulary, morphological analysis, and part-of-speech (POS) tags.
As a result, it produces real words rather than chopped-off fragments.
Example:
Stemming: studies → studi
Lemmatization: studies → study
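The contrast can be sketched without any library. In the snippet below, a crude suffix stripper stands in for a stemmer and a tiny hand-written dictionary stands in for a lemmatizer's vocabulary. The names `toy_stem`, `SUFFIXES`, and `TOY_LEMMAS` are hypothetical and exist only for this illustration; this is not the Porter algorithm or WordNet.

```python
# Toy illustration only: real stemmers and lemmatizers use far richer
# rules and vocabularies than the hypothetical tables below.

SUFFIXES = ("es", "ing", "s")  # assumed suffix list, checked in this order


def toy_stem(word: str) -> str:
    """Chop the first matching suffix without checking the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Assumed mini-dictionary mapping inflected forms to their lemmas.
TOY_LEMMAS = {"studies": "study", "children": "child", "cats": "cat"}


def toy_lemmatize(word: str) -> str:
    """Look the word up; fall back to the word itself when it is unknown."""
    return TOY_LEMMAS.get(word, word)


for w in ("studies", "children", "cats"):
    print(f"{w:<10} stem={toy_stem(w):<10} lemma={toy_lemmatize(w)}")
```

The stripper turns "studies" into the non-word "studi" and leaves the irregular plural "children" untouched, while the dictionary lookup returns "study" and "child".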
🔹 Why is Lemmatization Better than Stemming?
Valid Words: Lemmas are dictionary words.
POS-Aware: The lemmatizer needs to know whether the word is a noun, verb, adjective, etc. Example:
“better” →
As an adjective: good
As a verb: better
Context-Sensitive: Uses linguistic rules to avoid incorrect chopping.
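To make the POS dependence concrete, here is a minimal sketch in which the lemma is looked up by the pair (word, POS) rather than by the word alone. The table `IRREGULAR` and the function `lemma_for` are hypothetical names, and the tag strings "ADJ" and "VERB" are assumed labels for this example only.

```python
# Hypothetical exception table: (surface form, POS tag) -> lemma.
IRREGULAR = {
    ("better", "ADJ"): "good",
}


def lemma_for(word: str, pos: str) -> str:
    """Return the lemma for (word, pos); unknown pairs fall back to the word."""
    key = (word.lower(), pos)
    return IRREGULAR.get(key, word.lower())


print(lemma_for("better", "ADJ"))   # adjective reading
print(lemma_for("better", "VERB"))  # verb reading
```

The same surface form yields "good" when tagged as an adjective and stays "better" when tagged as a verb, which is why a lemmatizer without POS information has to guess.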
🔹 Process of Lemmatization
Tokenization → Split text into words/sentences.
POS Tagging → Assign a part of speech to each word.
Lemmatization → Use a dictionary (WordNet in NLTK or the spaCy lexicon) to get the lemma.
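The three steps above can be sketched as a small pipeline where each stage feeds the next. Everything here is a toy stand-in: `TOY_TAGS` and `TOY_LEMMAS` are hypothetical lookup tables, and a real system would use a trained tagger and a full lexicon instead.

```python
import re

# Hypothetical toy resources, assumed for this illustration only.
TOY_TAGS = {
    "the": "DET", "cats": "NOUN", "are": "VERB",
    "sitting": "VERB", "outside": "ADV",
}
TOY_LEMMAS = {
    ("cats", "NOUN"): "cat",
    ("are", "VERB"): "be",
    ("sitting", "VERB"): "sit",
}


def tokenize(text):
    # Step 1: words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def pos_tag(tokens):
    # Step 2: look each token up; unknown tokens get the placeholder tag "X".
    return [(tok, TOY_TAGS.get(tok.lower(), "X")) for tok in tokens]


def lemmatize(tagged):
    # Step 3: dictionary lookup keyed by (word, tag); fall back to the token.
    return [TOY_LEMMAS.get((tok.lower(), tag), tok) for tok, tag in tagged]


tokens = tokenize("The cats are sitting outside.")
tagged = pos_tag(tokens)
lemmas = lemmatize(tagged)
print(tokens)
print(tagged)
print(lemmas)
```

Keeping the stages as separate functions mirrors how the tag produced in step 2 becomes part of the lookup key in step 3.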
🔹 Example
Sentence: “The cats are sitting outside, and the children were playing happily.”
cats → cat
sitting → sit
children → child
playing → play
happily → happily (adverbs often remain unchanged)
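🔹 Tools for Lemmatization
NLTK WordNet Lemmatizer (basic; it treats every word as a noun unless a POS tag is passed, so tags are needed for accuracy).
spaCy Lemmatizer (more powerful; it runs as part of a trained pipeline that assigns POS tags automatically).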
🔹 When to Use Lemmatization
When meaning and grammar matter (chatbots, translation, search engines).
For semantic NLP tasks: text classification, QA, summarization.
Stemming may be enough for speed-focused tasks like keyword extraction.
In short:
Stemming → Fast but crude chopping (connection → connect).
Lemmatization → Slower but linguistically accurate (better → good).
!python -m spacy download en_core_web_sm
Collecting en-core-web-sm==3.8.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
# Demonstration of Lemmatization using NLTK and spaCy
# Import necessary libraries
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import spacy
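# Download required resources for NLTK
# Note: NLTK 3.9 and later look for 'punkt_tab' rather than 'punkt'; if
# word_tokenize raises a LookupError, also run nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger_eng')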
# Function to convert Penn Treebank POS tags to WordNet format for lemmatization
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        # Default to noun for every other tag
        return wordnet.NOUN
# Sample sentence
sentence = "The cats are sitting outside, and the children were playing happily."
# Tokenize and POS tagging
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
# Initialize WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()
# Apply lemmatization with POS tags
lemmatized_words_nltk = [lemmatizer.lemmatize(word, get_wordnet_pos(pos)) for word, pos in pos_tags]
# Now using spaCy for comparison
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
lemmatized_words_spacy = [token.lemma_ for token in doc]
tokens, lemmatized_words_nltk, lemmatized_words_spacy
(['The',
'cats',
'are',
'sitting',
'outside',
',',
'and',
'the',
'children',
'were',
'playing',
'happily',
'.'],
['The',
'cat',
'be',
'sit',
'outside',
',',
'and',
'the',
'child',
'be',
'play',
'happily',
'.'],
['the',
'cat',
'be',
'sit',
'outside',
',',
'and',
'the',
'child',
'be',
'play',
'happily',
'.'])
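The three lists are easier to compare side by side. The sketch below copies the output shown above into literals (so it runs without NLTK or spaCy installed) and prints one row per token, flagging rows where the two lemmatizers disagree. The variable names are chosen for this example.

```python
# Values copied from the notebook output above.
tokens = ['The', 'cats', 'are', 'sitting', 'outside', ',', 'and',
          'the', 'children', 'were', 'playing', 'happily', '.']
nltk_lemmas = ['The', 'cat', 'be', 'sit', 'outside', ',', 'and',
               'the', 'child', 'be', 'play', 'happily', '.']
spacy_lemmas = ['the', 'cat', 'be', 'sit', 'outside', ',', 'and',
                'the', 'child', 'be', 'play', 'happily', '.']

print(f"{'token':<10}{'NLTK':<10}{'spaCy':<10}")
for tok, n_lemma, s_lemma in zip(tokens, nltk_lemmas, spacy_lemmas):
    flag = "" if n_lemma == s_lemma else "  <- differs"
    print(f"{tok:<10}{n_lemma:<10}{s_lemma:<10}{flag}")

# Collect only the rows where the two tools disagree.
differences = [
    (tok, n_lemma, s_lemma)
    for tok, n_lemma, s_lemma in zip(tokens, nltk_lemmas, spacy_lemmas)
    if n_lemma != s_lemma
]
print(differences)
```

In this run the only disagreement is the sentence-initial "The": the NLTK result keeps the original capitalization, while the spaCy lemma is lowercased.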