Text Normalization#

Text normalization is the process of transforming text into a standard form, so that variations in text (like case differences, punctuation, or spelling) do not affect analysis. Essentially, it reduces the complexity and variability of textual data.

Goal: Make text uniform so NLP models can focus on the semantics, not on irrelevant variations.


Key Steps in Text Normalization#

  1. Lowercasing

    • Convert all text to lowercase to avoid treating “Apple” and “apple” differently.

    • Example: "NLP is Amazing!" → "nlp is amazing!"

  2. Removing Punctuation

    • Punctuation carries little meaning for many NLP tasks, such as classification.

    • Example: "Hello, world!" → "Hello world"

  3. Removing Numbers (optional)

    • Numbers may or may not be relevant depending on the task.

    • Example: "I have 2 apples" → "I have apples"

  4. Removing Stopwords

    • Words like “is,” “the,” “a” are common but carry little meaning.

    • Example: "This is a sample text" → "sample text"

  5. Spelling Correction

    • Correct typos to standardize words.

    • Example: "loove" → "love"

  6. Tokenization

    • Split text into smaller units like sentences or words.

    • Example: "I love NLP." → ["I", "love", "NLP"]

  7. Stemming

    • Reduce words to their root form.

    • Example: "running" → "run"

  8. Lemmatization

    • Convert words to their dictionary form using context.

    • Example: "better" → "good"

  9. Handling Emojis and Emoticons

    • Convert emojis to text or remove them.

    • Example: "I am happy 😊" → "I am happy"

  10. Handling Contractions

    • Expand contractions to standard form.

    • Example: "I’m happy" → "I am happy"

  11. Removing Extra Whitespaces

    • Clean up unnecessary spaces.

    • Example: "Hello   world" → "Hello world"
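Steps 9–11 can be sketched with Python's standard library alone. The contraction dictionary and emoji ranges below are toy examples for illustration, not complete mappings:

```python
import re

def expand_contractions(text, mapping=None):
    # Toy mapping; a real pipeline would use a fuller dictionary or a library.
    mapping = mapping or {"i'm": "i am", "i’m": "i am",
                          "it's": "it is", "don't": "do not"}
    for short, full in mapping.items():
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    return text

def remove_emojis(text):
    # Strips common emoji code-point ranges; not an exhaustive list.
    return re.sub(r"[\U0001F300-\U0001FAFF\U00002600-\U000027BF]", "", text)

def collapse_whitespace(text):
    # Collapse runs of whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

print(expand_contractions("I'm happy"))                     # i am happy
print(collapse_whitespace(remove_emojis("I am happy 😊")))  # I am happy
print(collapse_whitespace("Hello   world"))                 # Hello world
```

Dedicated libraries (e.g. `contractions` or `emoji` on PyPI) cover far more cases, but these regex versions show the core idea.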


Python Demonstration: Text Normalization Pipeline#

# Install required libraries
# !pip install nltk textblob

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from textblob import TextBlob

nltk.download('stopwords')
nltk.download('wordnet')

text = "I loove NLP! It's amazing 😊 and I have 2 apples."

# 1️⃣ Lowercase
text = text.lower()

# 2️⃣ Remove punctuation
text = re.sub(r'[^\w\s]', '', text)

# 3️⃣ Remove numbers
text = re.sub(r'\d+', '', text)

# 4️⃣ Tokenization
words = text.split()

# 5️⃣ Remove stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]

# 6️⃣ Spelling Correction
words = [str(TextBlob(w).correct()) for w in words]

# 7️⃣ Stemming
stemmer = PorterStemmer()
words = [stemmer.stem(w) for w in words]

# 8️⃣ Lemmatization (note: lemmatizing already-stemmed tokens is largely
# redundant; in practice, choose stemming OR lemmatization, not both)
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(w) for w in words]

normalized_text = " ".join(words)
print("Normalized Text:")
print(normalized_text)
Normalized Text:
love nap amaz appl

✅ Output Example#

Original Text: "I loove NLP! It's amazing 😊 and I have 2 apples."

Normalized Text: "love nap amaz appl"

Note how aggressive normalization can backfire: TextBlob's spelling correction "fixed" the domain term "nlp" into "nap", and stemming truncated "amazing" to "amaz". Choose steps with the downstream task in mind.

After normalization, text is clean, uniform, and ready for NLP models.


Key Points#

  • Normalization reduces noise in text.

  • Helps models perform better on NLP tasks like text classification, sentiment analysis, and language modeling.

  • Steps vary by task; for example, numbers or emojis may carry important signal in sentiment analysis.
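For instance, rather than deleting emojis as the pipeline above does, a sentiment task might map them to words so their signal survives normalization. A toy mapping (a real pipeline might use a library such as `emoji` from PyPI):

```python
def emoji_to_text(text, mapping=None):
    # Toy mapping for illustration; extend or replace with a real lexicon.
    mapping = mapping or {"😊": " happy ", "😢": " sad "}
    for emo, word in mapping.items():
        text = text.replace(emo, word)
    # Re-join to clean up any doubled spaces the replacement introduced.
    return " ".join(text.split())

print(emoji_to_text("great movie 😊"))  # great movie happy
```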