Text Normalization#

Text normalization is the process of transforming text into a standard form, so that variations in text (like case differences, punctuation, or spelling) do not affect analysis. Essentially, it reduces the complexity and variability of textual data.

Goal: Make text uniform so NLP models can focus on the semantics, not on irrelevant variations.


Key Steps in Text Normalization#

  1. Lowercasing

    • Convert all text to lowercase to avoid treating “Apple” and “apple” differently.

    • Example: "NLP is Amazing!" → "nlp is amazing!"

  2. Removing Punctuation

    • Punctuation carries little meaning for many NLP tasks, such as classification.

    • Example: "Hello, world!" → "Hello world"

  3. Removing Numbers (optional)

    • Numbers may or may not be relevant depending on the task.

    • Example: "I have 2 apples" → "I have apples"

  4. Removing Stopwords

    • Words like “is,” “the,” “a” are common but carry little meaning.

    • Example: "This is a sample text" → "sample text"

  5. Spelling Correction

    • Correct typos to standardize words.

    • Example: "loove" → "love"

  6. Tokenization

    • Split text into smaller units like sentences or words.

    • Example: "I love NLP." → ["I", "love", "NLP"]

  7. Stemming

    • Reduce words to their root form.

    • Example: "running" → "run"

  8. Lemmatization

    • Convert words to their dictionary form using context.

    • Example: "better" → "good"

  9. Handling Emojis and Emoticons

    • Convert emojis to text or remove them.

    • Example: "I am happy 😊" → "I am happy"

  10. Handling Contractions

    • Expand contractions to standard form.

    • Example: "I’m happy" → "I am happy"

  11. Removing Extra Whitespaces

    • Clean up unnecessary spaces.

    • Example: "Hello   world" → "Hello world"
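Steps 9–11 can be sketched with Python's standard library alone. The contraction dictionary and emoji ranges below are toy examples for illustration, not complete mappings:

```python
import re

def expand_contractions(text, mapping=None):
    # Toy mapping; a real pipeline would use a fuller dictionary or a library.
    mapping = mapping or {"i'm": "i am", "i’m": "i am",
                          "it's": "it is", "don't": "do not"}
    for short, full in mapping.items():
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    return text

def remove_emojis(text):
    # Strips common emoji code-point ranges; not an exhaustive list.
    return re.sub(r"[\U0001F300-\U0001FAFF\U00002600-\U000027BF]", "", text)

def collapse_whitespace(text):
    # Collapse runs of whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

print(expand_contractions("I'm happy"))                     # i am happy
print(collapse_whitespace(remove_emojis("I am happy 😊")))  # I am happy
print(collapse_whitespace("Hello   world"))                 # Hello world
```

Dedicated libraries (e.g. `contractions` or `emoji` on PyPI) cover far more cases, but these regex versions show the core idea.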


Python Demonstration: Text Normalization Pipeline#

# Install required libraries
# !pip install nltk textblob

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from textblob import TextBlob

nltk.download('stopwords')
nltk.download('wordnet')

text = "I loove NLP! It's amazing 😊 and I have 2 apples."

# 1️⃣ Lowercase
text = text.lower()

# 2️⃣ Remove punctuation
text = re.sub(r'[^\w\s]', '', text)

# 3️⃣ Remove numbers
text = re.sub(r'\d+', '', text)

# 4️⃣ Tokenization
words = text.split()

# 5️⃣ Remove stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]

# 6️⃣ Spelling Correction
words = [str(TextBlob(w).correct()) for w in words]

# 7️⃣ Stemming
stemmer = PorterStemmer()
words = [stemmer.stem(w) for w in words]

# 8️⃣ Lemmatization (note: lemmatizing already-stemmed tokens is largely
# redundant; in practice, choose stemming OR lemmatization, not both)
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(w) for w in words]

normalized_text = " ".join(words)
print("Normalized Text:")
print(normalized_text)
Normalized Text:
love nap amaz appl

✅ Output Example#

Original Text: "I loove NLP! It's amazing 😊 and I have 2 apples."

Normalized Text: "love nap amaz appl"

Note how aggressive normalization can backfire: TextBlob's spelling correction "fixed" the domain term "nlp" into "nap", and stemming truncated "amazing" to "amaz". Choose steps with the downstream task in mind.

After normalization, text is clean, uniform, and ready for NLP models.


Key Points#

  • Normalization reduces noise in text.

  • Helps models perform better on NLP tasks like text classification, sentiment analysis, and language modeling.

  • Steps vary by task; for example, numbers or emojis may carry important signal in sentiment analysis.
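For instance, rather than deleting emojis as the pipeline above does, a sentiment task might map them to words so their signal survives normalization. A toy mapping (a real pipeline might use a library such as `emoji` from PyPI):

```python
def emoji_to_text(text, mapping=None):
    # Toy mapping for illustration; extend or replace with a real lexicon.
    mapping = mapping or {"😊": " happy ", "😢": " sad "}
    for emo, word in mapping.items():
        text = text.replace(emo, word)
    # Re-join to clean up any doubled spaces the replacement introduced.
    return " ".join(text.split())

print(emoji_to_text("great movie 😊"))  # great movie happy
```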