Stopwords#

Stopwords are common words in a language that usually don’t carry significant meaning in text analysis.

Examples in English: ["is", "am", "are", "the", "in", "on", "at", "a", "of", "this", "that"]

👉 In the sentence "The cat is on the mat.", the important words are ["cat", "mat"]; words like "the", "is", and "on" are stopwords.


🔹 Why Remove Stopwords?#

  • They don’t add much meaning to most NLP tasks (like classification, topic modeling).

  • They increase dataset size & computation without improving accuracy.

  • Removing them helps models focus on meaningful words.
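To make the size point concrete, here is a quick count using a small hand-picked stopword set (the words chosen here are illustrative, not any library's official list):

```python
# Illustrative hand-picked stopword set
stop_words = {"the", "is", "on", "a", "of", "in", "and", "to"}

text = "the cat is on the mat and the dog is in the house"
tokens = text.split()
n_stop = sum(1 for t in tokens if t in stop_words)

print(f"{n_stop}/{len(tokens)} tokens are stopwords")  # 9/13 tokens are stopwords
```

Even in this tiny sentence, most tokens are stopwords, which is why dropping them shrinks a corpus substantially.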

⚠️ BUT: in some tasks (e.g., translation, question answering, sentiment analysis), stopwords must be kept because they affect grammar and meaning.
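For instance, blindly removing stopwords can destroy negation, which matters a lot for sentiment analysis. A minimal sketch with a small hand-picked stopword list (illustrative only, not any library's official list):

```python
# Hand-picked stopword set; note that it contains "not"
stop_words = {"this", "is", "a", "the", "not"}

sentence = "This movie is not good"
kept = [w for w in sentence.split() if w.lower() not in stop_words]

print(kept)  # the negation is gone: ['movie', 'good']
```

After removal, "not good" and "good" look identical to a model, so for such tasks negation words should stay in the text.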


🔹 How to Remove Stopwords?#

1. Using NLTK#

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download stopwords (only once)
nltk.download("punkt")
nltk.download("stopwords")

text = "This is a simple example showing the removal of stopwords in NLP."

# Tokenize
words = word_tokenize(text)

# Remove stopwords and punctuation (a set makes membership checks fast)
stop_words = set(stopwords.words("english"))
filtered_words = [w for w in words if w.isalpha() and w.lower() not in stop_words]

print("Original:", words)
print("After Stopword Removal:", filtered_words)

Output

Original: ['This', 'is', 'a', 'simple', 'example', 'showing', 'the', 'removal', 'of', 'stopwords', 'in', 'NLP', '.']
After Stopword Removal: ['simple', 'example', 'showing', 'removal', 'stopwords', 'NLP']

2. Using spaCy#

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a simple example showing the removal of stopwords in NLP.")

# Keep tokens that are neither stopwords nor punctuation
filtered_words = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(filtered_words)

Output

['simple', 'example', 'showing', 'removal', 'stopwords', 'NLP']
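spaCy's default English list can also be inspected directly, without loading a model (a sketch, assuming spaCy is installed):

```python
from spacy.lang.en.stop_words import STOP_WORDS

# STOP_WORDS is a plain Python set with a few hundred entries
print("the" in STOP_WORDS)  # True
print("cat" in STOP_WORDS)  # False
```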

3. Using Custom Stopwords List#

You can add/remove words from the default list:

# Reuses `words` and the NLTK stopword list from the example above
custom_stopwords = set(stopwords.words("english"))
custom_stopwords.update(["example", "showing"])  # add extra, domain-specific words

filtered_words = [w for w in words if w.lower() not in custom_stopwords]
print(filtered_words)

🔹 Summary#

  • Stopwords = common words with little meaning.

  • Removing them helps simplify text and improves performance in many NLP tasks.

  • Different libraries (NLTK, spaCy, sklearn) provide stopword lists.

  • You can always create a custom list depending on your dataset.

🔹 Full Example#

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download required resources (run once)
nltk.download("punkt")
nltk.download("stopwords")

# Example text
text = "This is a great movie and I really enjoyed it!"

# Tokenize text into words
words = word_tokenize(text)

# Load English stopwords
stop_words = set(stopwords.words("english"))

# Remove stopwords
filtered_words = [w for w in words if w.lower() not in stop_words and w.isalpha()]

print("Original Text:", text)
print("Tokenized Words:", words)
print("After Stopword Removal:", filtered_words)
Output
Original Text: This is a great movie and I really enjoyed it!
Tokenized Words: ['This', 'is', 'a', 'great', 'movie', 'and', 'I', 'really', 'enjoyed', 'it', '!']
After Stopword Removal: ['great', 'movie', 'really', 'enjoyed']