Handle Numbers in NLP

  • Raw numbers (e.g., 12345, 3.14159, 2025) carry little semantic meaning for most text models on their own: each distinct value is just another token.

  • Numbers may represent:

    • Quantities (e.g., “10 apples”)

    • Dates (e.g., “2025-09-18”)

    • Money (e.g., “$100”)

    • IDs/phone numbers (usually irrelevant as features)

  • If left untreated, every distinct number becomes its own vocabulary entry, so the vocabulary grows unnecessarily and hurts model efficiency.
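
The vocabulary growth described above can be seen on a toy corpus. This is a minimal sketch: the corpus, the whitespace tokenizer, and the `vocabulary` helper are made up for illustration.

```python
import re

# Hypothetical toy corpus: every sentence contains a different number.
corpus = [
    "order 1001 shipped",
    "order 1002 shipped",
    "order 1003 delayed",
    "order 1004 shipped",
]

def vocabulary(sentences):
    """Return the set of whitespace-separated tokens in the sentences."""
    return {token for sentence in sentences for token in sentence.split()}

raw_vocab = vocabulary(corpus)
# Collapse every run of digits into a single placeholder token.
masked_vocab = vocabulary(re.sub(r"\d+", "<NUM>", s) for s in corpus)

print(len(raw_vocab))     # 7: order, shipped, delayed + four distinct numbers
print(len(masked_vocab))  # 4: order, shipped, delayed, <NUM>
```

In a real corpus the effect is much larger, since the number of distinct numeric strings is practically unbounded.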


Common Approaches to Handle Numbers

1. Remove Numbers

  • Use when numbers are irrelevant to the task (e.g., in the review “Good movie 10/10”, the numbers may not matter).

  • Example: "The price is 100 dollars" → "The price is dollars"


2. Replace with a Special Token

  • Replace any number with a placeholder like <NUM>.

  • Example: "The price is 100 dollars" → "The price is <NUM> dollars"

  • Useful in deep learning (so the model treats all numbers uniformly).
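
A plain `\d+` pattern splits decimals and comma-grouped numbers into several placeholders. The sketch below treats such a number as a single token. The `NUMBER` pattern and the `mask_numbers` helper are illustrative assumptions: the pattern expects English-style formatting ("," for thousands, "." for decimals).

```python
import re

# Assumed number shape: digits, optional thousands groups (",ddd"),
# optional decimal part (".ddd"). Adjust for other locales.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

def mask_numbers(text, token="<NUM>"):
    """Replace each number in text with a single placeholder token."""
    return NUMBER.sub(token, text)

print(mask_numbers("Pi is 3.14 and the city has 1,250,000 people."))
# Pi is <NUM> and the city has <NUM> people.
```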


3. Normalize Numbers

  • Convert numbers into a canonical form:

    • "100" → "one hundred"

    • "3.14" → "three point one four"

  • Useful in speech or translation systems.
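
The conversions above can be sketched without external libraries. This is only an illustration: the helper names are made up, and it covers integers from 0 to 999 plus a decimal part read digit by digit. Larger numbers are left unchanged. A production system would more likely use a dedicated number-to-words library.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def int_to_words(n):
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + int_to_words(rest) if rest else "")

def number_to_words(match):
    """Regex callback: spell out one matched number."""
    whole, _, fraction = match.group().partition(".")
    if len(whole) > 3:
        return match.group()  # out of range for this sketch: leave as-is
    words = int_to_words(int(whole))
    if fraction:
        # Decimal digits are read one at a time: "14" -> "one four".
        words += " point " + " ".join(ONES[int(d)] for d in fraction)
    return words

def normalize_numbers(text):
    """Spell out every supported number found in text."""
    return re.sub(r"\d+(?:\.\d+)?", number_to_words, text)

print(normalize_numbers("100"))   # one hundred
print(normalize_numbers("3.14"))  # three point one four
```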


4. Bucket / Categorize

  • Group numbers into ranges:

    • Age: 23 → 20-30

    • Salary: 85000 → 80k-90k

  • Example: "He is 27 years old" → "He is [20-30] years old"
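
Bucketing with fixed-width ranges reduces to integer division. The sketch below is one possible implementation. The function names are illustrative, and it assumes half-open ranges, so a value of exactly 30 falls into 30-40 rather than 20-30.

```python
import re

def bucket(value, width):
    """Return the half-open range [low, low + width) containing value."""
    low = (value // width) * width
    return low, low + width

def bucket_ages(text, width=10):
    """Replace 'N years old' with the age range containing N."""
    def repl(match):
        low, high = bucket(int(match.group(1)), width)
        return f"[{low}-{high}] years old"
    return re.sub(r"(\d+) years old", repl, text)

print(bucket(23, 10))                     # (20, 30)
low, high = bucket(85000, 10_000)
print(f"{low // 1000}k-{high // 1000}k")  # 80k-90k
print(bucket_ages("He is 27 years old"))  # He is [20-30] years old
```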


5. Keep as Features (when meaningful)

  • In tasks like financial text mining, numbers are critical.

  • Instead of discarding them, extract them and pass them to the ML model as separate numeric features alongside the text.
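
One way to keep numbers as features is to pull them out of the text, mask them in the text itself, and compute simple statistics over the extracted values. The function name and the three features below (`num_count`, `num_max`, `num_sum`) are illustrative choices, not a standard feature set.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def split_text_and_numbers(text):
    """Return (masked_text, numeric_features) for one document."""
    values = [float(m) for m in NUMBER.findall(text)]
    masked = NUMBER.sub("<NUM>", text)
    features = {
        "num_count": len(values),
        "num_max": max(values, default=0.0),
        "num_sum": sum(values),
    }
    return masked, features

masked, features = split_text_and_numbers("Revenue rose 12.5 percent to 480 million")
print(masked)    # Revenue rose <NUM> percent to <NUM> million
print(features)  # {'num_count': 2, 'num_max': 480.0, 'num_sum': 492.5}
```

The masked text can go through the usual text pipeline, while the numeric features are concatenated with the text representation before the model.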


6. Advanced Contextual Handling

  • Use NLP + Regex to detect patterns:

    • Dates: 2025-09-18 → <DATE>

    • Percentages: 85% → <PERCENT>

    • Currency: $100 → <MONEY>

  • Example: "I paid $500 on 2025-09-18" → "I paid <MONEY> on <DATE>"
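
The pattern-based tagging above can be sketched as an ordered list of regexes, applied from most specific to most generic. The patterns and the `tag_numbers` helper are simplified assumptions: ISO dates only, dollar amounts only, and a generic <NUM> fallback that is not part of the list above.

```python
import re

# Order matters: specific patterns run before the generic number fallback.
PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "<DATE>"),  # ISO dates, e.g. 2025-09-18
    (re.compile(r"\$\d+(?:\.\d+)?"), "<MONEY>"),       # dollar amounts
    (re.compile(r"\d+(?:\.\d+)?%"), "<PERCENT>"),      # percentages
    (re.compile(r"\d+(?:\.\d+)?"), "<NUM>"),           # anything left over
]

def tag_numbers(text):
    """Apply each pattern in turn, replacing matches with its token."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

print(tag_numbers("I paid $500 on 2025-09-18"))
# I paid <MONEY> on <DATE>
print(tag_numbers("Sales rose 85% across 3 regions"))
# Sales rose <PERCENT> across <NUM> regions
```

If the generic pattern ran first, it would break a date into three separate <NUM> tokens, which is why the fallback comes last.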


✅ Example in Python

import re

text = "I bought 3 apples for $10 on 18/09/2025."

# 1. Remove numbers
remove_numbers = re.sub(r'\d+', '', text)

# 2. Replace numbers with <NUM>
replace_numbers = re.sub(r'\d+', '<NUM>', text)

# 3. Replace dates with <DATE>, money with <MONEY>
custom_replace = re.sub(r'\d{2}/\d{2}/\d{4}', '<DATE>', text)
custom_replace = re.sub(r'\$\d+', '<MONEY>', custom_replace)

print("Original:", text)
print("Remove:", remove_numbers)
print("Replace <NUM>:", replace_numbers)
print("Custom Replace:", custom_replace)

✅ Output

Original: I bought 3 apples for $10 on 18/09/2025.
Remove: I bought  apples for $ on //.
Replace <NUM>: I bought <NUM> apples for $<NUM> on <NUM>/<NUM>/<NUM>.
Custom Replace: I bought 3 apples for <MONEY> on <DATE>.

Summary:

  • If numbers are irrelevant → remove them or replace them with <NUM>.

  • If numbers carry meaning → normalize or categorize.

  • If domain-specific (finance, medicine, dates) → extract and treat separately.