Text Preprocessing#

Lowercasing#

  • Convert all text into lowercase to avoid treating "Apple" and "apple" as different words.

  • Example: "Natural Language Processing""natural language processing"

text = "Natural Language Processing"
text_lower = text.lower()
print(text_lower)  # 'natural language processing'

Tokenization#

  • Breaking text into smaller units (sentences or words).

  • Example: "I love NLP."['I', 'love', 'NLP', '.']

from nltk.tokenize import word_tokenize  # requires nltk.download("punkt") once

print(word_tokenize("I love NLP."))  # ['I', 'love', 'NLP', '.']

Removing Punctuation / Special Characters#

  • Punctuation often doesn’t add much meaning in text classification tasks.

  • Example: "Hello!!! How are you??""Hello How are you"

import re
text = "Hello!!! How are you??"
cleaned = re.sub(r'[^\w\s]', '', text)  # keep only words and spaces
print(cleaned)  # 'Hello How are you'

from nltk.tokenize import word_tokenize
import string

text = "Hello!!! How are you?? I'm fine... thanks :)"

tokens = word_tokenize(text)
# drop single-character punctuation tokens; multi-character ones like '...' survive this check
tokens = [word for word in tokens if word not in string.punctuation]

print(tokens)


Stopword Removal#

  • Stopwords are common words (e.g., “is”, “the”, “and”) that carry little meaning.

  • Example: "This is a good book""good book"

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download("stopwords")

words = word_tokenize("This is a good book")
filtered = [w for w in words if w.lower() not in stopwords.words("english")]
print(filtered)  # ['good', 'book']

Stemming#

  • Reducing words to their root form (not always valid words).

  • Example: "playing" "play", "studies" "studi"

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem("playing"))  # play
print(stemmer.stem("studies"))  # studi

Lemmatization#

  • Similar to stemming but uses vocabulary + grammar → produces valid words.

  • Example: "studies" "study", "better" "good"

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))          # study (default pos is noun)
print(lemmatizer.lemmatize("better", pos="a"))  # good  (pos="a" = adjective)

Handling Numbers#

  • Numbers may or may not be useful.

    • Option 1: Remove numbers: "I bought 3 apples" → "I bought apples"

    • Option 2: Keep numbers but normalize: "3" → "three" (see the sketch below)
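A minimal sketch of both options: a regex for removal and a tiny lookup table for spelling numbers out. The lookup table is illustrative only; a package such as num2words covers the general case.

import re

text = "I bought 3 apples"

# Option 1: drop digits, then collapse the extra space left behind
no_numbers = re.sub(r'\d+', '', text)
print(' '.join(no_numbers.split()))  # 'I bought apples'

# Option 2: spell out numbers via a small lookup (illustrative only)
number_words = {"3": "three"}
print(' '.join(number_words.get(tok, tok) for tok in text.split()))  # 'I bought three apples'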


Handling Emojis / Emoticons (Optional)#

  • Emojis can carry meaning in sentiment analysis.

    • Example: "I am happy 😊""happy"


Spelling Correction#

  • Example: "I lve NLP""I love NLP"

Libraries like TextBlob or SymSpell can fix spelling.
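A small sketch with TextBlob's correct() method (the textblob package is assumed to be installed); correction quality varies, and out-of-vocabulary words such as "NLP" may also be altered.

# Assumes the third-party 'textblob' package (pip install textblob)
from textblob import TextBlob

corrected = str(TextBlob("I lve NLP").correct())
print(corrected)  # e.g. 'I love NLP' (results vary by input)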


Text Normalization#

  • Expanding contractions: "don't" → "do not"

  • Normalizing slang: "u" → "you" (a small sketch follows)
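A minimal dictionary-based sketch; dedicated packages such as contractions handle the many edge cases a hand-made table misses.

# Tiny illustrative lookup table for contractions and slang
replacements = {"don't": "do not", "u": "you"}

text = "u know I don't mind"
print(' '.join(replacements.get(tok, tok) for tok in text.split()))
# 'you know I do not mind'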


Vectorization#

  • Final step: convert words into numerical form.

    • Bag of Words (BoW)

    • TF-IDF (Term Frequency – Inverse Document Frequency)

    • Word Embeddings (Word2Vec, GloVe, BERT, etc.)
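A minimal scikit-learn sketch of the first two options (assuming scikit-learn ≥ 1.0 for get_feature_names_out):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["I love NLP", "NLP loves me"]

bow = CountVectorizer()                        # Bag of Words: raw term counts
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())             # vocabulary behind the columns

tfidf = TfidfVectorizer()                      # TF-IDF: counts reweighted by rarity across documents
print(tfidf.fit_transform(corpus).toarray())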


Workflow Summary#

  1. Lowercasing

  2. Tokenization (sentence/word level)

  3. Cleaning (punctuation, numbers, special chars)

  4. Stopword removal

  5. Normalization (stemming/lemmatization)

  6. Spelling correction / Slang normalization (if needed)

  7. Convert to vectors (BoW, TF-IDF, embeddings)


⚡ In practice: preprocessing steps depend on the task.

  • For sentiment analysis, emojis might matter.

  • For legal/medical NLP, numbers and special terms matter.

  • For chatbots, spelling correction and contraction expansion are crucial.
