Stopwords#
Stopwords are common words in a language that usually don’t carry significant meaning in text analysis.
Examples in English:
["is", "am", "are", "the", "in", "on", "at", "a", "of", "this", "that"]
👉 In the sentence:
"The cat is on the mat."
The important words are: ["cat", "mat"]
Words like "the", "is", "on" are stopwords.
🔹 Why Remove Stopwords?#
- They don’t add much meaning to most NLP tasks (such as classification or topic modeling).
- They increase dataset size and computation without improving accuracy.
- Removing them helps models focus on meaningful words.
⚠️ BUT: In some tasks (e.g., translation, question answering), stopwords must be kept, because they affect grammar and meaning.
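A minimal illustration of this pitfall, in plain Python with a small hand-rolled stopword set (note that "not" appears in many default stopword lists, including NLTK's):

```python
# Hand-rolled stopword set for illustration only
stop_words = {"is", "the", "this", "a", "not"}

text = "This movie is not good"
kept = [w for w in text.lower().split() if w not in stop_words]
print(kept)  # ['movie', 'good'] -- the negation is gone, and the sentiment flips
```

This is why sentiment analysis pipelines often keep negation words even when removing other stopwords.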
🔹 How to Remove Stopwords?#
1. Using NLTK#
```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download stopwords (only once)
nltk.download("punkt")
nltk.download("stopwords")

text = "This is a simple example showing the removal of stopwords in NLP."

# Tokenize
words = word_tokenize(text)

# Build the stopword set once for fast lookups
stop_words = set(stopwords.words("english"))

# Remove stopwords (and punctuation, via isalpha)
filtered_words = [w for w in words if w.lower() not in stop_words and w.isalpha()]

print("Original:", words)
print("After Stopword Removal:", filtered_words)
```
✅ Output

```
Original: ['This', 'is', 'a', 'simple', 'example', 'showing', 'the', 'removal', 'of', 'stopwords', 'in', 'NLP', '.']
After Stopword Removal: ['simple', 'example', 'showing', 'removal', 'stopwords', 'NLP']
```
2. Using spaCy#
```python
import spacy

# Requires the model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("This is a simple example showing the removal of stopwords in NLP.")

# Drop stopwords and punctuation
filtered_words = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(filtered_words)
```
✅ Output:
```
['simple', 'example', 'showing', 'removal', 'stopwords', 'NLP']
```
3. Using Custom Stopwords List#
You can add/remove words from the default list:
```python
# Start from the default NLTK list, then customize it
custom_stopwords = set(stopwords.words("english"))
custom_stopwords.update(["example", "showing"])  # add extra words
custom_stopwords.discard("not")                  # keep a default stopword if needed

# Reuses `words` tokenized in the NLTK example above
filtered_words = [w for w in words if w.lower() not in custom_stopwords]
print(filtered_words)
```
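For reuse across scripts, the same pattern can be wrapped in a small helper. This is a minimal sketch in plain Python (the function name `remove_stopwords` and the `keep` parameter are illustrative, not from any library); with NLTK you would pass `set(stopwords.words("english"))` as `stop_words`:

```python
def remove_stopwords(tokens, stop_words, keep=()):
    """Filter out stopwords case-insensitively, preserving any
    tokens listed in `keep` (e.g. negations)."""
    keep = {w.lower() for w in keep}
    return [t for t in tokens
            if t.lower() not in stop_words or t.lower() in keep]

# Hand-rolled stopword set for illustration; swap in
# set(stopwords.words("english")) when using NLTK.
stop_words = {"the", "is", "on", "not"}
print(remove_stopwords(["The", "film", "is", "not", "bad"], stop_words, keep=["not"]))
# ['film', 'not', 'bad']
```

The `keep` parameter makes the grammar caveat above explicit in code: you decide per task which "stopwords" actually carry meaning.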
Summary#
- Stopwords = common words with little meaning.
- Removing them helps simplify text and improves performance in many NLP tasks.
- Different libraries (NLTK, spaCy, sklearn) provide stopword lists.
- You can always create a custom list depending on your dataset.
🔹 Complete Example#

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download required resources (run once)
nltk.download("punkt")
nltk.download("stopwords")

# Example text
text = "This is a great movie and I really enjoyed it!"

# Tokenize text into words
words = word_tokenize(text)

# Load English stopwords
stop_words = set(stopwords.words("english"))

# Remove stopwords and punctuation
filtered_words = [w for w in words if w.lower() not in stop_words and w.isalpha()]

print("Original Text:", text)
print("Tokenized Words:", words)
print("After Stopword Removal:", filtered_words)
```
✅ Output

```
Original Text: This is a great movie and I really enjoyed it!
Tokenized Words: ['This', 'is', 'a', 'great', 'movie', 'and', 'I', 'really', 'enjoyed', 'it', '!']
After Stopword Removal: ['great', 'movie', 'really', 'enjoyed']
```