Text Preprocessing#

Lowercasing#

  • Convert all text into lowercase to avoid treating "Apple" and "apple" as different words.

  • Example: "Natural Language Processing""natural language processing"

text = "Natural Language Processing"
text_lower = text.lower()
print(text_lower)  # 'natural language processing'

Tokenization#

  • Breaking text into smaller units (sentences or words).

  • Example: "I love NLP."['I', 'love', 'NLP', '.']

from nltk.tokenize import word_tokenize  # requires nltk.download("punkt") once

print(word_tokenize("I love NLP."))  # ['I', 'love', 'NLP', '.']

Removing Punctuation / Special Characters#

  • Punctuation often doesn’t add much meaning in text classification tasks.

  • Example: "Hello!!! How are you??""Hello How are you"

import re
text = "Hello!!! How are you??"
cleaned = re.sub(r'[^\w\s]', '', text)  # keep only words and spaces
print(cleaned)  # 'Hello How are you'

from nltk.tokenize import word_tokenize
import string

text = "Hello!!! How are you?? I'm fine... thanks :)"

tokens = word_tokenize(text)
# drop single-character punctuation tokens; multi-character ones like '...' survive this check
tokens = [word for word in tokens if word not in string.punctuation]

print(tokens)


Stopword Removal#

  • Stopwords are common words (e.g., “is”, “the”, “and”) that carry little meaning.

  • Example: "This is a good book""good book"

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download("stopwords")

words = word_tokenize("This is a good book")
filtered = [w for w in words if w.lower() not in stopwords.words("english")]
print(filtered)  # ['good', 'book']

Stemming#

  • Reducing words to their root form (not always valid words).

  • Example: "playing" "play", "studies" "studi"

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem("playing"))  # play
print(stemmer.stem("studies"))  # studi

Lemmatization#

  • Similar to stemming but uses vocabulary + grammar → produces valid words.

  • Example: "studies" "study", "better" "good"

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))          # study (default pos is noun)
print(lemmatizer.lemmatize("better", pos="a"))  # good  (pos="a" = adjective)

Handling Numbers#

  • Numbers may or may not be useful.

    • Option 1: Remove numbers: "I bought 3 apples" → "I bought apples"

    • Option 2: Keep numbers but normalize: "3" → "three" (see the sketch below)
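A minimal sketch of both options: a regex for removal and a tiny lookup table for spelling numbers out. The lookup table is illustrative only; a package such as num2words covers the general case.

import re

text = "I bought 3 apples"

# Option 1: drop digits, then collapse the extra space left behind
no_numbers = re.sub(r'\d+', '', text)
print(' '.join(no_numbers.split()))  # 'I bought apples'

# Option 2: spell out numbers via a small lookup (illustrative only)
number_words = {"3": "three"}
print(' '.join(number_words.get(tok, tok) for tok in text.split()))  # 'I bought three apples'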


Handling Emojis / Emoticons (Optional)#

  • Emojis can carry meaning in sentiment analysis.

    • Example: "I am happy 😊""happy"


Spelling Correction#

  • Example: "I lve NLP""I love NLP"

Libraries like TextBlob or SymSpell can fix spelling.
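A small sketch with TextBlob's correct() method (the textblob package is assumed to be installed); correction quality varies, and out-of-vocabulary words such as "NLP" may also be altered.

# Assumes the third-party 'textblob' package (pip install textblob)
from textblob import TextBlob

corrected = str(TextBlob("I lve NLP").correct())
print(corrected)  # e.g. 'I love NLP' (results vary by input)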


Text Normalization#

  • Expanding contractions: "don't" → "do not"

  • Normalizing slang: "u" → "you" (a small sketch follows)
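A minimal dictionary-based sketch; dedicated packages such as contractions handle the many edge cases a hand-made table misses.

# Tiny illustrative lookup table for contractions and slang
replacements = {"don't": "do not", "u": "you"}

text = "u know I don't mind"
print(' '.join(replacements.get(tok, tok) for tok in text.split()))
# 'you know I do not mind'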


Vectorization#

  • Final step: convert words into numerical form.

    • Bag of Words (BoW)

    • TF-IDF (Term Frequency – Inverse Document Frequency)

    • Word Embeddings (Word2Vec, GloVe, BERT, etc.)
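A minimal scikit-learn sketch of the first two options (assuming scikit-learn ≥ 1.0 for get_feature_names_out):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["I love NLP", "NLP loves me"]

bow = CountVectorizer()                        # Bag of Words: raw term counts
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())             # vocabulary behind the columns

tfidf = TfidfVectorizer()                      # TF-IDF: counts reweighted by rarity across documents
print(tfidf.fit_transform(corpus).toarray())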


Workflow Summary#

  1. Lowercasing

  2. Tokenization (sentence/word level)

  3. Cleaning (punctuation, numbers, special chars)

  4. Stopword removal

  5. Normalization (stemming/lemmatization)

  6. Spelling correction / Slang normalization (if needed)

  7. Convert to vectors (BoW, TF-IDF, embeddings)


⚡ In practice: preprocessing steps depend on the task.

  • For sentiment analysis, emojis might matter.

  • For legal/medical NLP, numbers and special terms matter.

  • For chatbots, spelling correction and contraction expansion are crucial.
