Tokenization#

1. Sentence Tokenization (Sentence Segmentation)#

  • Breaks a paragraph/document into sentences.

  • Useful for tasks like summarization, translation, and dialogue systems.

  • Example:

    Text: "I love NLP. It is amazing!"
    Sentence Tokens: ["I love NLP.", "It is amazing!"]
    

2. Word Tokenization#

  • Splits sentences into individual words.

  • Example:

    Text: "I love NLP."
    Word Tokens: ["I", "love", "NLP", "."]
    

3. Character Tokenization#

  • Splits text into individual characters.

  • Useful for handling misspellings, rare words, and languages like Chinese.

  • Example:

    Text: "NLP"
    Character Tokens: ["N", "L", "P"]
    

4. Subword Tokenization#

  • Breaks words into meaningful sub-units instead of whole words or characters.

  • Handles out-of-vocabulary (OOV) words better.

  • Used in modern NLP models like BERT, GPT, T5.

  • Example:

    Word: "unhappiness"
    Subword Tokens: ["un", "happi", "ness"]
    

Common algorithms:

  • Byte Pair Encoding (BPE)

  • WordPiece (used in BERT)

  • SentencePiece (used in T5, XLNet, ALBERT)
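Of these, BPE is the simplest to sketch: start from characters and repeatedly merge the most frequent adjacent pair of symbols. Below is a minimal, illustrative implementation of the BPE training loop on a toy corpus (the corpus, merge count, and helper names are made up for illustration; production tokenizers such as the Hugging Face tokenizers library are far more optimized):

```python
# Minimal sketch of Byte Pair Encoding (BPE) training on a toy corpus.
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the pair into a single new symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is pre-split into space-separated characters
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(5):                    # learn 5 merge rules
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)  # learned merge rules, in order
print(vocab)   # words now represented with merged subword symbols
```

The learned merges are the tokenizer: at inference time they are replayed in the same order on a new word, so frequent fragments like "est" become single tokens while rare words still decompose into smaller known pieces.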


5. Whitespace Tokenization#

  • Simply splits text by spaces.

  • Fast but naive: "NLP-based tokenization" → ["NLP-based", "tokenization"]

  • Problem: doesn’t handle punctuation well.


6. Regex Tokenization#

  • Uses regular expressions to define custom rules.

  • Example: split on non-alphanumeric characters, keeping runs of letters and digits (e.g. the pattern r"\w+").

    Text: "Email me at abc123@gmail.com!"
    Regex Tokens: ["Email", "me", "at", "abc123", "gmail", "com"]
    

7. Morphological Tokenization#

  • Splits words into roots, prefixes, suffixes (morphological units).

  • Example (English): "playing" → ["play", "ing"]

  • Example (Turkish): "evlerinizden" (from your houses) → ["ev" (house), "ler" (plural), "iniz" (your), "den" (from)]
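A crude way to approximate this in English is rule-based affix stripping. The sketch below (the affix lists and `morph_split` name are made up for illustration; real morphological analyzers such as Morfessor or spaCy's lemmatizer use dictionaries and statistics) peels off one common prefix and one common suffix:

```python
# Naive rule-based morphological splitting: peel off one known prefix
# and one known suffix, keeping the remainder as the "root".
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "ing", "ed", "ly", "s"]

def morph_split(word):
    parts = []
    for p in PREFIXES:
        # only strip if a plausibly long root remains
        if word.startswith(p) and len(word) > len(p) + 2:
            parts.append(p)
            word = word[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s) + 2:
            suffix = s
            word = word[: -len(s)]
            break
    parts.append(word)
    if suffix:
        parts.append(suffix)
    return parts

print(morph_split("playing"))      # ['play', 'ing']
print(morph_split("unhappiness"))  # ['un', 'happi', 'ness']
```

Fixed affix lists like this break quickly (e.g. "sing" ends in "ing" but has no suffix), which is why serious morphological tokenization relies on learned or dictionary-based analyzers.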


8. Byte-Level Tokenization#

  • Works at the raw byte level (instead of characters).

  • Used in GPT-2 and GPT-3 models.

  • Handles any language, emoji, or special character.
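As a quick illustration using plain UTF-8 encoding (note: this shows only the raw bytes, not an actual GPT-2 byte-level BPE vocabulary):

```python
# Byte-level view: UTF-8 encodes every character, including emoji, as 1-4 bytes,
# so the base "vocabulary" is just the 256 possible byte values.
text = "🔥 NLP"
byte_tokens = list(text.encode("utf-8"))
print(byte_tokens)  # [240, 159, 148, 165, 32, 78, 76, 80]; the emoji alone takes 4 bytes
```

Because any string reduces to bytes, a byte-level tokenizer can never encounter an out-of-vocabulary symbol; GPT-2 then applies BPE merges on top of these bytes to keep sequences short.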


Summary Table

| Tokenization Type | Example Input | Example Output | Used In |
| --- | --- | --- | --- |
| Sentence Tokenization | “I love NLP. It’s fun.” | [“I love NLP.”, “It’s fun.”] | Summarization, Translation |
| Word Tokenization | “I love NLP.” | [“I”, “love”, “NLP”, “.”] | Most NLP tasks |
| Character Tokenization | “NLP” | [“N”, “L”, “P”] | Chinese/Japanese, OCR |
| Subword Tokenization | “unhappiness” | [“un”, “happi”, “ness”] | BERT, GPT, T5 |
| Whitespace Tokenization | “NLP-based model” | [“NLP-based”, “model”] | Simple tasks |
| Regex Tokenization | “abc123@gmail.com” | [“abc123”, “gmail”, “com”] | Custom NLP pipelines |
| Morphological Tokenization | “playing” | [“play”, “ing”] | Morphology-heavy languages |
| Byte-Level Tokenization | “🔥 NLP” | [bytes representing emoji + “NLP”] | GPT-2, GPT-3 |


👉 In modern NLP, subword tokenization (BPE, WordPiece, SentencePiece, Byte-Level) is the most popular because it balances vocabulary size and handles rare words gracefully.

# Demonstration of different types of tokenization

import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# The Punkt models are required once for sent_tokenize/word_tokenize
# (newer NLTK releases may ask for "punkt_tab" instead)
nltk.download("punkt", quiet=True)

text = "I love NLP. It's amazing! Unhappiness can't stop us 😊."

# 1. Sentence Tokenization
sent_tokens = sent_tokenize(text)

# 2. Word Tokenization
word_tokens = word_tokenize(text)

# 3. Character Tokenization
char_tokens = list(text)

# 4. Whitespace Tokenization
whitespace_tokens = text.split()

# 5. Regex Tokenization (keep only words, split on non-alphabetic)
regex_tokens = re.findall(r"[A-Za-z]+", text)

# 6. Subword Tokenization (simple example: split prefixes/suffixes manually)
example_word = "unhappiness"
subword_tokens = ["un", "happi", "ness"]

# 7. Byte-Level Tokenization (encode text into bytes)
byte_tokens = list(text.encode("utf-8"))

results = {
    "Sentence Tokenization": sent_tokens,
    "Word Tokenization": word_tokens,
    "Character Tokenization": char_tokens[:20],  # show first 20 chars
    "Whitespace Tokenization": whitespace_tokens,
    "Regex Tokenization": regex_tokens,
    "Subword Tokenization (example)": subword_tokens,
    "Byte-Level Tokenization (first 20 bytes)": byte_tokens[:20],
}

results
{'Sentence Tokenization': ['I love NLP.',
  "It's amazing!",
  "Unhappiness can't stop us 😊."],
 'Word Tokenization': ['I',
  'love',
  'NLP',
  '.',
  'It',
  "'s",
  'amazing',
  '!',
  'Unhappiness',
  'ca',
  "n't",
  'stop',
  'us',
  '😊',
  '.'],
 'Character Tokenization': ['I',
  ' ',
  'l',
  'o',
  'v',
  'e',
  ' ',
  'N',
  'L',
  'P',
  '.',
  ' ',
  'I',
  't',
  "'",
  's',
  ' ',
  'a',
  'm',
  'a'],
 'Whitespace Tokenization': ['I',
  'love',
  'NLP.',
  "It's",
  'amazing!',
  'Unhappiness',
  "can't",
  'stop',
  'us',
  '😊.'],
 'Regex Tokenization': ['I',
  'love',
  'NLP',
  'It',
  's',
  'amazing',
  'Unhappiness',
  'can',
  't',
  'stop',
  'us'],
 'Subword Tokenization (example)': ['un', 'happi', 'ness'],
 'Byte-Level Tokenization (first 20 bytes)': [73,
  32,
  108,
  111,
  118,
  101,
  32,
  78,
  76,
  80,
  46,
  32,
  73,
  116,
  39,
  115,
  32,
  97,
  109,
  97]}