Handle Numbers in NLP#
Raw numbers (e.g., 12345, 3.14159, 2025) don’t have semantic meaning for models.
Numbers may represent:
Quantities (e.g., “10 apples”)
Dates (e.g., “2025-09-18”)
Money (e.g., “$100”)
IDs/phone numbers (usually irrelevant as features)
If left untreated, every distinct number can become its own vocabulary entry, so the vocabulary grows unnecessarily and model efficiency suffers.
Common Approaches to Handle Numbers#
1. Remove Numbers#
Use this when numbers are irrelevant to the task (e.g., in a review such as “Good movie 10/10”, the numbers may not matter).
Example:
"The price is 100 dollars" → "The price is dollars"
2. Replace with a Special Token#
Replace any number with a placeholder like <NUM>.
Example:
"The price is 100 dollars" → "The price is <NUM> dollars"
Useful in deep learning (so the model treats all numbers uniformly).
3. Normalize Numbers#
Convert numbers into a canonical form:
"100" → "one hundred"
"3.14" → "three point one four"
Useful in speech or translation systems.
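The conversion to words can be sketched in plain Python. This is a minimal illustration, not a full normalizer: the helper names (`int_to_words`, `number_to_words`, `normalize_numbers`) are hypothetical, only whole parts from 0 to 999 are spelled out, and decimals are read digit by digit as in the "3.14" example. Anything larger is left unchanged.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def int_to_words(n):
    """Spell out an integer in the range 0..999 (assumed limit of this sketch)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + int_to_words(rest) if rest else "")

def number_to_words(token):
    """Spell out a numeric string; the fractional part is read digit by digit."""
    whole, _, fraction = token.partition(".")
    if int(whole) > 999:
        return token  # out of range for this sketch: leave the number as-is
    words = int_to_words(int(whole))
    if fraction:
        words += " point " + " ".join(ONES[int(d)] for d in fraction)
    return words

def normalize_numbers(text):
    """Replace every integer or decimal in the text with its spelled-out form."""
    return re.sub(r"\d+(?:\.\d+)?", lambda m: number_to_words(m.group()), text)

print(normalize_numbers("The price is 100 dollars"))  # The price is one hundred dollars
print(number_to_words("3.14"))                        # three point one four
```

For production use, a dedicated number-to-words library is likely a better fit, since it would also cover large numbers, ordinals, and other languages.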
4. Bucket / Categorize#
Group numbers into ranges:
Age: 23 → 20-30
Salary: 85000 → 80k-90k
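The grouping into ranges can be sketched with integer division. The function names and the default bucket widths (10 for ages, 10,000 for salaries) are assumptions chosen to reproduce the ranges listed here. Each bucket is treated as half-open, so a value of 30 falls into 30-40 rather than 20-30.

```python
import re

def bucket_label(value, width=10):
    """Return the range label 'low-high' for the bucket that contains value."""
    low = (value // width) * width
    return f"{low}-{low + width}"

def salary_bucket(value, width=10_000):
    """Same idea for salaries, with the bounds abbreviated in thousands."""
    low = (value // width) * width
    return f"{low // 1000}k-{(low + width) // 1000}k"

def bucket_ages(text):
    """Replace a number that is directly followed by 'years old' with its bucket."""
    return re.sub(
        r"\b(\d+)(?= years old\b)",
        lambda m: f"[{bucket_label(int(m.group(1)))}]",
        text,
    )

print(bucket_label(23))                   # 20-30
print(salary_bucket(85000))               # 80k-90k
print(bucket_ages("He is 27 years old"))  # He is [20-30] years old
```

The lookahead keeps the rule narrow on purpose: only numbers that are clearly ages get bucketed, and other numbers in the sentence are left alone.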
Example:
"He is 27 years old" → "He is [20-30] years old"
5. Keep as Features (when meaningful)#
In tasks like financial text mining, numbers are critical.
Instead of discarding them, extract the values and keep them as separate numeric features alongside the text features in ML models.
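One way to keep numbers as features is to pull them out of the text and summarize them in a few numeric columns. The sketch below is illustrative only: the function name, the feature names (`num_count`, `num_max`, `num_sum`), and the sample sentence are assumptions, and which statistics are useful depends on the task.

```python
import re

def numeric_features(text):
    """Extract integers and decimals from text and summarize them as features."""
    values = [float(v) for v in re.findall(r"\d+(?:\.\d+)?", text)]
    return {
        "num_count": len(values),              # how many numbers appear
        "num_max": max(values, default=0.0),   # largest value, 0.0 if none
        "num_sum": float(sum(values)),         # total of all values
    }

features = numeric_features("Revenue rose 12.5 percent to 340 million")
print(features)  # {'num_count': 2, 'num_max': 340.0, 'num_sum': 352.5}
```

These values can then be appended to whatever text representation the model uses, so the magnitude information survives even if the tokens themselves are replaced or removed.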
6. Advanced Contextual Handling#
Use NLP + Regex to detect patterns:
Dates: 2025-09-18 → <DATE>
Percentages: 85% → <PERCENT>
Currency: $100 → <MONEY>
Example:
"I paid $500 on 2025-09-18" → "I paid <MONEY> on <DATE>"
✅ Example in Python#
import re
text = "I bought 3 apples for $10 on 18/09/2025."
# 1. Remove numbers
remove_numbers = re.sub(r'\d+', '', text)
# 2. Replace numbers with <NUM>
replace_numbers = re.sub(r'\d+', '<NUM>', text)
# 3. Replace dates with <DATE>, money with <MONEY>
custom_replace = re.sub(r'\d{2}/\d{2}/\d{4}', '<DATE>', text)
custom_replace = re.sub(r'\$\d+', '<MONEY>', custom_replace)
print("Original:", text)
print("Remove:", remove_numbers)
print("Replace <NUM>:", replace_numbers)
print("Custom Replace:", custom_replace)
✅ Output#
Original: I bought 3 apples for $10 on 18/09/2025.
Remove: I bought  apples for $ on //.
Replace <NUM>: I bought <NUM> apples for $<NUM> on <NUM>/<NUM>/<NUM>.
Custom Replace: I bought 3 apples for <MONEY> on <DATE>.
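The output above shows two weaknesses of matching plain digit runs: removal leaves stray whitespace and symbols behind, and a pattern like \d+ splits decimals and dates into pieces. A slightly more careful sketch follows. The pattern, the function names, and the choice to tag any leftover number as <NUM> are assumptions made for illustration, not part of the example above.

```python
import re

# Integers, decimals, and comma-grouped thousands (e.g. 3, 3.14, 1,250.75).
NUMBER = r"\d+(?:,\d{3})*(?:\.\d+)?"

# Specific patterns come first so the generic <NUM> rule cannot split them.
TYPED_PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{2}/\d{2}/\d{4}\b"), "<DATE>"),
    (re.compile(r"\$" + NUMBER), "<MONEY>"),
    (re.compile(NUMBER + r"%"), "<PERCENT>"),
    (re.compile(NUMBER), "<NUM>"),
]

def remove_numbers(text):
    """Delete numbers, then collapse the extra whitespace they leave behind."""
    return re.sub(r"\s+", " ", re.sub(NUMBER, "", text)).strip()

def replace_numbers(text, token="<NUM>"):
    """Replace each whole number (including decimals) with a single token."""
    return re.sub(NUMBER, token, text)

def tag_numbers(text):
    """Apply the typed patterns in order, most specific first."""
    for pattern, token in TYPED_PATTERNS:
        text = pattern.sub(token, text)
    return text

text = "I bought 3 apples for $10 on 18/09/2025."
print(remove_numbers("The price is 100 dollars"))  # The price is dollars
print(replace_numbers("Pi is about 3.14159"))      # Pi is about <NUM>
print(tag_numbers(text))                           # I bought <NUM> apples for <MONEY> on <DATE>.
print(tag_numbers("I paid $500 on 2025-09-18"))    # I paid <MONEY> on <DATE>
```

The order of the patterns is the main design choice: if the generic number rule ran first, the date and money rules would have nothing left to match.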
Summary:
If numbers are irrelevant → remove them or replace them with <NUM>.
If numbers carry meaning → normalize or categorize.
If domain-specific (finance, medicine, dates) → extract and treat separately.