Stemming#
Stemming is the process of reducing words to their root/base form by chopping off prefixes or suffixes, without necessarily producing a valid dictionary word.
👉 Example:
Playing → play
Studies → studi
Better → better (unchanged: rule-based chopping cannot reduce an irregular form like this one to good)
The output of stemming is called a stem, which may or may not be a valid English word.
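To make the idea of "chopping" concrete, here is a minimal sketch of a suffix stripper. It is not the Porter algorithm or any library's implementation; the rule list, the `min_stem_len` guard, and the name `toy_stem` are all made up for illustration.

```python
# Toy suffix stripper: an illustration of rule-based chopping only.
# The rules and the function name are hypothetical, not a real algorithm.
SUFFIX_RULES = [
    ("ies", "i"),   # studies -> studi
    ("ing", ""),    # playing -> play
    ("ed", ""),     # played  -> play
    ("s", ""),      # plays   -> play
]


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Apply the first suffix rule that leaves a long enough stem."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            # Skip rules that would leave a very short stem (e.g. "is" -> "i").
            if len(stem) >= min_stem_len:
                return stem
    return word  # no rule applied, e.g. "better"


for w in ["Playing", "Studies", "Better"]:
    print(w, "->", toy_stem(w))
```

Real stemmers use many more rules plus conditions on the remaining stem, but the principle is the same: pattern-match the ending and rewrite it, with no dictionary check on the result.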
Why do we need Stemming?#
In NLP, words like connect, connected, connection, and connecting all share the same base meaning (root = connect). If we treat them as different words, the model's vocabulary becomes unnecessarily large and sparse.
Stemming helps by:
Reducing vocabulary size (fewer features → faster training).
Improving generalization (model sees all variations as the same root).
Making text preprocessing more efficient.
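The vocabulary reduction can be seen on the connect family from above. This is a small sketch: the `STEM_OF` mapping is written by hand for illustration (in practice the stems would come from a stemmer), and the token list is invented.

```python
from collections import Counter

# Hand-written stems for the example word family (illustrative only).
STEM_OF = {
    "connect": "connect",
    "connected": "connect",
    "connection": "connect",
    "connecting": "connect",
}

# A made-up token stream containing several variants of the same root.
tokens = ["connect", "connected", "connection", "connecting", "connected"]

# Bag-of-words counts without and with stemming.
raw_counts = Counter(tokens)
stem_counts = Counter(STEM_OF.get(t, t) for t in tokens)

print(len(raw_counts), "features before stemming:", dict(raw_counts))
print(len(stem_counts), "feature after stemming:", dict(stem_counts))
```

Four separate features collapse into one, and that one feature carries all five occurrences instead of spreading them thinly across variants.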
Types of Stemmers#
Porter Stemmer (most popular, simple, rule-based).
Example: “caresses → caress, ponies → poni”.
Snowball Stemmer (improved version of Porter, supports multiple languages).
Slightly more aggressive than the original Porter stemmer, and more efficient.
Lancaster Stemmer (very aggressive, may over-stem words).
Example: “maximum → maxim, waiting → wait, studies → study”.
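The multi-language support of the Snowball stemmer can be tried directly in NLTK. The sketch below assumes NLTK is installed; the Spanish word is only an arbitrary sample, and no particular stem is claimed for it.

```python
from nltk.stem import SnowballStemmer

# Language names accepted by NLTK's Snowball implementation.
print(SnowballStemmer.languages)

# One stemmer object per language; the language is passed as a string.
english = SnowballStemmer("english")
spanish = SnowballStemmer("spanish")

print(english.stem("running"))
print(spanish.stem("corriendo"))  # arbitrary sample word
```

Porter and Lancaster, by contrast, are English-only in NLTK and take no language argument.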
Example of Stemming vs Original Words#
| Word | Porter Stemmer | Lancaster Stemmer |
|---|---|---|
| Studies | studi | study |
| Studying | studi | study |
| University | univers | univ |
👉 Notice that sometimes stemming produces strange stems (like studi or univers). That’s fine because the algorithm only cares about grouping variations together, not correctness.
Stemming vs Lemmatization (Important!)#
Stemming → Rule-based chopping (may produce invalid words).
Lemmatization → Uses dictionaries + grammar rules (produces valid words).
Example:
“Studies” →
Stemming: studi
Lemmatization: study ✅
So, stemming is faster but less accurate, while lemmatization is slower but linguistically correct.
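The contrast can be sketched in a few lines. This is a toy under stated assumptions: `LEMMAS` is a tiny hand-written stand-in for the dictionary a real lemmatizer consults, and `chop_suffix` is a two-rule stand-in for a real stemmer. Neither name comes from a library.

```python
# Hand-written lemma dictionary (illustrative entries only).
LEMMAS = {"studies": "study", "studying": "study", "better": "good"}


def toy_lemmatize(word: str) -> str:
    """Dictionary lookup: returns a valid word, or the input if unknown."""
    word = word.lower()
    return LEMMAS.get(word, word)


def chop_suffix(word: str) -> str:
    """Rule-based chopping: no dictionary, so the result may not be a word."""
    word = word.lower()
    if word.endswith("ies"):
        return word[:-3] + "i"
    if word.endswith("ing"):
        return word[:-3]
    return word


for w in ["Studies", "Better"]:
    print(f"{w}: stem={chop_suffix(w)}, lemma={toy_lemmatize(w)}")
```

The lookup handles the irregular form (better → good) that no suffix rule can reach, but it only works for words the dictionary knows. That dependence on a lexical resource is why lemmatization is slower.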
In short
Stemming = crude, fast, root chopping.
Useful in search engines, IR systems, text mining where exact correctness is not critical.
# Demonstration of Stemming in NLP
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

# Sample words
words = ["running", "runs", "easily", "studies", "studying", "better", "connection", "connected"]

# Initialize stemmers
porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

# Print a header row, then one row of stems per word
print("Word".ljust(15), "Porter".ljust(15), "Snowball".ljust(15), "Lancaster".ljust(15))
print("-" * 60)
for w in words:
    print(
        w.ljust(15),
        porter.stem(w).ljust(15),
        snowball.stem(w).ljust(15),
        lancaster.stem(w).ljust(15),
    )
Word            Porter          Snowball        Lancaster
------------------------------------------------------------
running         run             run             run
runs            run             run             run
easily          easili          easili          easy
studies         studi           studi           study
studying        studi           studi           study
better          better          better          bet
connection      connect         connect         connect
connected       connect         connect         connect