Stemming

Stemming is the process of reducing words to their root/base form by stripping affixes (most often suffixes) with simple rules, without necessarily producing a valid dictionary word.

👉 Example:

  • Playing → play

  • Studies → studi

  • Better → better (no change; irregular forms are often not reduced correctly)

The output of stemming is called a stem, which may or may not be a valid English word.
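
To make the idea of "chopping off suffixes" concrete, here is a minimal sketch of a toy stemmer. It is not the Porter algorithm or any library implementation: the rule list, the `toy_stem` name, and the minimum stem length of 3 are assumptions chosen only to reproduce the three examples above.

```python
# Toy suffix-stripping stemmer (illustration only, not a real algorithm).
# Rules are tried in order; the first matching suffix is replaced.
SUFFIX_RULES = [
    ("ies", "i"),  # studies -> studi
    ("ing", ""),   # playing -> play
    ("ed", ""),    # played  -> play
    ("s", ""),     # plays   -> play
]

MIN_STEM_LENGTH = 3  # assumed guard so short words such as "is" are left alone


def toy_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem_length = len(word) - len(suffix)
        if word.endswith(suffix) and stem_length >= MIN_STEM_LENGTH:
            return word[:stem_length] + replacement
    return word


for w in ["Playing", "Studies", "Better"]:
    print(w, "->", toy_stem(w))
```

A sketch this small breaks quickly (for example, "running" becomes "runn"), which is why practical stemmers add many more rules and conditions.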


Why do we need Stemming?

In NLP, words like connect, connected, connection, and connecting all share the same base meaning (root = connect). If we treat them as different words, the model's vocabulary becomes unnecessarily large and its features sparse.

Stemming helps by:

  • Reducing vocabulary size (fewer features → faster training).

  • Improving generalization (the model treats all variants of a word as the same root).

  • Making downstream text processing more efficient (fewer distinct tokens to index, count, and compare).
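
The vocabulary reduction described above can be sketched with a simple count. To stay dependency-free, the stemmer here is only a stand-in: `EXAMPLE_STEMS` hard-codes the "root = connect" example from the text, and `lookup_stem` and `count_features` are hypothetical helper names. In practice, any callable that maps a token to its stem could be passed in instead.

```python
from collections import Counter

# Stand-in for a real stemmer: this table just encodes the
# "root = connect" example from the text (an assumption for illustration).
EXAMPLE_STEMS = {
    "connect": "connect",
    "connected": "connect",
    "connection": "connect",
    "connecting": "connect",
}


def lookup_stem(token: str) -> str:
    """Return the hard-coded stem, or the token itself if it is unknown."""
    return EXAMPLE_STEMS.get(token, token)


def count_features(tokens, stem=None):
    """Count tokens, optionally normalizing each one with `stem` first."""
    if stem is None:
        return Counter(tokens)
    return Counter(stem(t) for t in tokens)


tokens = ["connect", "connected", "connection", "connecting", "connected"]

raw_counts = count_features(tokens)
stemmed_counts = count_features(tokens, stem=lookup_stem)

print("Vocabulary without stemming:", len(raw_counts))      # 4 distinct features
print("Vocabulary with stemming:   ", len(stemmed_counts))  # 1 distinct feature
print(stemmed_counts)
```

All five occurrences end up under the single feature `connect`, so the counts are pooled rather than spread over four sparse features.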


Types of Stemmers

  1. Porter Stemmer (most popular, simple, rule-based).

    • Example: “caresses → caress, ponies → poni”.

  2. Snowball Stemmer (improved version of Porter, supports multiple languages).

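    • Often described as slightly more aggressive and faster than Porter; on the sample words at the end of this section, the two produce identical stems.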

  3. Lancaster Stemmer (very aggressive, may over-stem words).

    • Example: “maximum → maxim, waiting → wait, studies → study”.
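
The Porter examples in the list above (caresses → caress, ponies → poni) come from the first plural-handling step of the algorithm, usually labelled step 1a in Porter's original description. The sketch below implements only that one step, so it is far from a complete Porter stemmer. The function name `porter_step_1a` is an assumption, and library implementations such as NLTK's may apply extra variations.

```python
# Step 1a of the original Porter algorithm (plural handling only).
# Rules are ordered longest suffix first, and only the first match is applied.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (protects words ending in "ss")
    ("s", ""),       # cats     -> cat
]


def porter_step_1a(word: str) -> str:
    """Apply only step 1a of the Porter stemmer to a lowercase word."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The rule order matters: checking "ss" before "s" is what stops "caress" from being cut down to "cares".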


Example of Stemming vs Original Words

Word            Porter Stemmer  Lancaster Stemmer
-------------------------------------------------
Studies         studi           study
Studying        studi           study
University      univers         univ

👉 Notice that stemming sometimes produces strange stems (like studi or univers). That is fine, because the goal is to map the variants of a word to the same token, not to produce a correct dictionary word.


Stemming vs Lemmatization (Important!)

  • Stemming → Rule-based chopping (may produce invalid words).

  • Lemmatization → Uses dictionaries + grammar rules (produces valid words).

Example:

  • “Studies” →

    • Stemming: studi

    • Lemmatization: study

So, stemming is faster but less accurate, while lemmatization is slower but produces valid dictionary forms.


In short

  • Stemming = crude, fast, root chopping.

  • Useful in search engines, information retrieval (IR) systems, and text mining, where exact linguistic correctness is not critical.
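
As a rough sketch of the search use case, the snippet below builds a tiny inverted index keyed by stems, so a query matches documents that contain a different form of the same word. Everything here is hypothetical: `crude_stem` is a stand-in suffix stripper rather than a real stemmer, and `build_index`, `search`, and the sample `docs` are made up for illustration.

```python
from collections import defaultdict


def crude_stem(token: str) -> str:
    """Hypothetical stand-in stemmer: strips a few common suffixes."""
    for suffix in ("ion", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)]
    return token


def build_index(docs, stem):
    """Map each stem to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[stem(token)].add(doc_id)
    return index


def search(index, query, stem):
    """Return ids of documents sharing a stem with the single-word query."""
    return index.get(stem(query.lower()), set())


docs = {
    1: "the server connected to the database",
    2: "connection timed out",
    3: "studies of network traffic",
}
index = build_index(docs, crude_stem)
print(sorted(search(index, "connecting", crude_stem)))  # [1, 2]
```

The query word "connecting" appears in none of the documents, yet documents 1 and 2 are found because the query and the indexed text are normalized the same way. Whether the stem is a real word does not matter here.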

# Demonstration of Stemming in NLP
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

# Sample words
words = ["running", "runs", "easily", "studies", "studying", "better", "connection", "connected"]

# Initialize stemmers
porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

print("Word".ljust(15), "Porter".ljust(15), "Snowball".ljust(15), "Lancaster".ljust(15))
print("-"*60)

for w in words:
    print(
        w.ljust(15),
        porter.stem(w).ljust(15),
        snowball.stem(w).ljust(15),
        lancaster.stem(w).ljust(15)
    )

Output:

Word            Porter          Snowball        Lancaster
------------------------------------------------------------
running         run             run             run            
runs            run             run             run            
easily          easili          easili          easy           
studies         studi           studi           study          
studying        studi           studi           study          
better          better          better          bet            
connection      connect         connect         connect        
connected       connect         connect         connect