Text Normalization#
Text normalization is the process of transforming text into a standard form, so that variations in text (like case differences, punctuation, or spelling) do not affect analysis. Essentially, it reduces the complexity and variability of textual data.
Goal: Make text uniform so NLP models can focus on the semantics, not on irrelevant variations.
Key Steps in Text Normalization#
Lowercasing
Convert all text to lowercase to avoid treating “Apple” and “apple” differently.
Example:
"NLP is Amazing!" → "nlp is amazing!"
Removing Punctuation
Punctuation often carries little meaning for tasks such as classification, so it is commonly stripped.
Example:
"Hello, world!" → "Hello world"
Removing Numbers (optional)
Numbers may or may not be relevant depending on the task.
Example:
"I have 2 apples"→"I have apples"
Removing Stopwords
Words like “is,” “the,” “a” are common but carry little meaning.
Example:
"This is a sample text"→"sample text"
Spelling Correction
Correct typos to standardize words.
Example:
"loove"→"love"
Tokenization
Split text into smaller units like sentences or words.
Example:
"I love NLP." → ["I", "love", "NLP"]
Stemming
Reduce words to their root form.
Example:
"running" → "run"
Lemmatization
Convert words to their dictionary form using context.
Example:
"better" → "good"
Handling Emojis and Emoticons
Convert emojis to text or remove them.
Example:
"I am happy 😊"→"I am happy"
Handling Contractions
Expand contractions to standard form.
Example:
"I’m happy"→"I am happy"
Removing Extra Whitespaces
Clean up unnecessary spaces.
Example:
"Hello world"→"Hello world"
Python Demonstration: Text Normalization Pipeline#
!pip install textblob
# Install required libraries
# !pip install nltk textblob
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from textblob import TextBlob
nltk.download('stopwords')
nltk.download('wordnet')
text = "I loove NLP! It's amazing 😊 and I have 2 apples."
# 1️⃣ Lowercase
text = text.lower()
# 2️⃣ Remove punctuation
text = re.sub(r'[^\w\s]', '', text)
# 3️⃣ Remove numbers
text = re.sub(r'\d+', '', text)
# 4️⃣ Tokenization
words = text.split()
# 5️⃣ Remove stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
# 6️⃣ Spelling Correction
words = [str(TextBlob(w).correct()) for w in words]
# 7️⃣ Stemming
stemmer = PorterStemmer()
words = [stemmer.stem(w) for w in words]
# 8️⃣ Lemmatization (in practice you would pick stemming OR lemmatization;
# lemmatizing already-stemmed tokens rarely changes them)
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(w) for w in words]
normalized_text = " ".join(words)
print("Normalized Text:")
print(normalized_text)
Normalized Text:
love nap amaz appl
✅ Output Example#
Original Text:
"I loove NLP! It's amazing 😊 and I have 2 apples."
Normalized Text:
"love nap amaz appl"
Note that TextBlob's spell checker "corrects" the out-of-vocabulary word "nlp" to "nap", a reminder that aggressive spelling correction can mangle domain-specific terms.
After normalization, text is clean, uniform, and ready for NLP models.
Key Points
Normalization reduces noise in text.
It helps models perform better on tasks such as text classification, sentiment analysis, and language modeling.
The right steps depend on the task. For example, numbers or emojis may carry useful signal in sentiment analysis.