Text Preprocessing#
Lowercasing#
Convert all text to lowercase to avoid treating "Apple" and "apple" as different words. Example:
"Natural Language Processing" → "natural language processing"
text = "Natural Language Processing"
text_lower = text.lower()
print(text_lower) # 'natural language processing'
Tokenization#
Breaking text into smaller units (sentences or words).
Example:
"I love NLP."→['I', 'love', 'NLP', '.']
import nltk
from nltk.tokenize import word_tokenize
nltk.download("punkt")  # tokenizer models required by word_tokenize
print(word_tokenize("I love NLP."))  # ['I', 'love', 'NLP', '.']
Removing Punctuation / Special Characters#
Punctuation often doesn’t add much meaning in text classification tasks.
Example:
"Hello!!! How are you??"→"Hello How are you"
import re
text = "Hello!!! How are you??"
cleaned = re.sub(r'[^\w\s]', '', text) # keep only words and spaces
print(cleaned) # 'Hello How are you'
Alternative: tokenize first, then drop punctuation-only tokens.
from nltk.tokenize import word_tokenize
import string
text = "Hello!!! How are you?? I'm fine... thanks :)"
tokens = word_tokenize(text)
tokens = [word for word in tokens if word not in string.punctuation]
print(tokens)  # note: multi-character tokens such as '...' or ':)' survive this filter
Stopword Removal#
Stopwords are common words (e.g., “is”, “the”, “and”) that carry little meaning.
Example:
"This is a good book"→"good book"
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download("stopwords")
words = word_tokenize("This is a good book")
filtered = [w for w in words if w.lower() not in stopwords.words("english")]
print(filtered)  # ['good', 'book']
Stemming#
Reducing words to their root form (not always valid words).
Example:
"playing" → "play", "studies" → "studi"
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem("playing")) # play
print(stemmer.stem("studies")) # studi
Lemmatization#
Similar to stemming but uses vocabulary + grammar → produces valid words.
Example:
"studies" → "study", "better" → "good"
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))          # study
print(lemmatizer.lemmatize("better", pos="a"))  # good
Handling Numbers#
Numbers may or may not be useful.
Option 1: Remove numbers →
"I bought 3 apples" → "I bought apples"
Option 2: Keep numbers but normalize →
"3" → "three"
Handling Emojis / Emoticons (Optional)#
Emojis can carry meaning in sentiment analysis.
Example:
"I am happy 😊"→"happy"
Spelling Correction#
Example:
"I lve NLP"→"I love NLP"
Libraries like TextBlob or SymSpell can fix spelling.
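The core idea behind such correctors can be sketched in plain Python: propose all strings within edit distance 1 of a token and keep any that appear in a known vocabulary. The three-word vocabulary here is a toy assumption; real systems use large frequency-ranked lexicons:

```python
# Minimal dictionary-based spelling correction (Norvig-style sketch)
VOCAB = {"i", "love", "nlp"}  # toy vocabulary, illustrative only

def edits1(word):
    """All strings one delete, insert, or replace away from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    inserts = [a + c + b for a, b in splits for c in letters]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    return set(deletes + inserts + replaces)

def correct(word):
    w = word.lower()
    if w in VOCAB:
        return word  # already a known word; keep original casing
    candidates = edits1(w) & VOCAB
    return candidates.pop() if candidates else word

print(" ".join(correct(w) for w in "I lve NLP".split()))  # I love NLP
```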
Text Normalization#
Expanding contractions →
"don't" → "do not"
Normalizing slang →
"u" → "you"
Vectorization#
Final step: convert words into numerical form.
Bag of Words (BoW)
TF-IDF (Term Frequency – Inverse Document Frequency)
Word Embeddings (Word2Vec, GloVe, BERT, etc.)
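A stdlib-only sketch of the first two schemes makes the idea concrete: BoW is raw counts over a shared vocabulary, and TF-IDF down-weights words that appear in every document. This uses an unsmoothed IDF for clarity; in practice scikit-learn's CountVectorizer and TfidfVectorizer (which apply smoothing and normalization) are the usual choice:

```python
import math
from collections import Counter

docs = ["natural language processing", "language models process language"]

# Bag of Words: raw term counts over a shared, sorted vocabulary
vocab = sorted({w for d in docs for w in d.split()})
bow = [[Counter(d.split())[w] for w in vocab] for d in docs]

def tfidf(doc):
    """Term frequency times unsmoothed inverse document frequency."""
    counts = Counter(doc.split())
    n = len(doc.split())
    return [
        (counts[w] / n) * math.log(len(docs) / sum(w in d.split() for d in docs))
        for w in vocab
    ]

print(vocab)  # ['language', 'models', 'natural', 'process', 'processing']
print(bow)    # [[1, 0, 1, 0, 1], [2, 1, 0, 1, 0]]
# 'language' appears in every document, so its TF-IDF weight is zero
print([round(x, 3) for x in tfidf(docs[0])])  # [0.0, 0.0, 0.231, 0.0, 0.231]
```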
Workflow Summary#
Lowercasing
Tokenization (sentence/word level)
Cleaning (punctuation, numbers, special chars)
Stopword removal
Normalization (stemming/lemmatization)
Spelling correction / Slang normalization (if needed)
Convert to vectors (BoW, TF-IDF, embeddings)
⚡ In practice: preprocessing steps depend on the task.
For sentiment analysis, emojis might matter.
For legal/medical NLP, numbers and special terms matter.
For chatbots, spelling correction and contraction expansion are crucial.
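The workflow above can be sketched end-to-end with the standard library alone; the stopword set is a tiny illustrative sample, and stemming/lemmatization and vectorization would plug in after these steps using NLTK and scikit-learn:

```python
import re

STOPWORDS = {"is", "a", "the", "and", "this"}  # tiny illustrative set

def preprocess(text):
    text = text.lower()                   # 1. lowercase
    text = re.sub(r"[^\w\s]", " ", text)  # 2. strip punctuation / special chars
    text = re.sub(r"\d+", " ", text)      # 3. strip numbers
    tokens = text.split()                 # 4. naive whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]  # 5. stopword removal

print(preprocess("This is a GOOD book!!! Worth $20."))  # ['good', 'book', 'worth']
```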