Tokenization
1. Sentence Tokenization (Sentence Segmentation)
Breaks a paragraph/document into sentences.
Useful for tasks like summarization, translation, and dialogue systems.
Example:
Text: "I love NLP. It is amazing!" Sentence Tokens: ["I love NLP.", "It is amazing!"]
2. Word Tokenization
Splits sentences into individual words.
Example:
Text: "I love NLP." Word Tokens: ["I", "love", "NLP", "."]
3. Character Tokenization
Splits text into individual characters.
Useful for handling misspellings, rare words, and languages like Chinese.
Example:
Text: "NLP" Character Tokens: ["N", "L", "P"]
4. Subword Tokenization
Breaks words into meaningful sub-units rather than treating whole words or single characters as tokens.
Handles out-of-vocabulary (OOV) words better.
Used in modern NLP models like BERT, GPT, T5.
Example:
Word: "unhappiness" Subword Tokens: ["un", "happi", "ness"]
Common algorithms:
Byte Pair Encoding (BPE)
WordPiece (used in BERT)
SentencePiece (used in T5, XLNet)
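A quick sketch using the Hugging Face transformers library (an assumption; it is not used elsewhere in this section). The exact pieces depend on each model's learned vocabulary, so treat the output as illustrative:

from transformers import AutoTokenizer

# WordPiece tokenizer trained for BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.tokenize("unhappiness")
# pieces vary with the learned vocabulary; word-internal pieces are prefixed with '##'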
5. Whitespace Tokenization
Simply splits text by spaces.
Fast but naive:
"NLP-based tokenization" → ["NLP-based", "tokenization"]
Problem: it doesn't handle punctuation well.
6. Regex Tokenization
Uses regular expressions to define custom rules.
Example: Split on non-alphanumeric characters, keeping only word-like tokens.
Text: "Email me at abc123@gmail.com!" Regex Tokens: ["Email", "me", "at", "abc123", "gmail", "com"]
7. Morphological Tokenization
Splits words into roots, prefixes, suffixes (morphological units).
Example (English):
"playing" → ["play", "ing"]
Example (Turkish):
"evlerinizden" (from your houses) → ["ev" (house), "ler" (plural), "iniz" (your), "den" (from)]
8. Byte-Level Tokenization
Works at the raw byte level (instead of characters).
Used in GPT-2 and GPT-3 models.
Handles any language, emoji, or special character.
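Python's built-in encode makes this concrete; a single emoji becomes several bytes:

list("🔥".encode("utf-8"))
# [240, 159, 148, 165]  (four UTF-8 bytes for one emoji)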
Summary Table

| Tokenization Type | Example Input | Example Output | Used In |
|---|---|---|---|
| Sentence Tokenization | "I love NLP. It's fun." | ["I love NLP.", "It's fun."] | Summarization, Translation |
| Word Tokenization | "I love NLP." | ["I", "love", "NLP", "."] | Most NLP tasks |
| Character Tokenization | "NLP" | ["N", "L", "P"] | Chinese/Japanese, OCR |
| Subword Tokenization | "unhappiness" | ["un", "happi", "ness"] | BERT, GPT, T5 |
| Whitespace Tokenization | "NLP-based model" | ["NLP-based", "model"] | Simple tasks |
| Regex Tokenization | "abc123@gmail.com" | ["abc123", "gmail", "com"] | Custom NLP pipelines |
| Morphological Tokenization | "playing" | ["play", "ing"] | Morphology-heavy languages |
| Byte-Level Tokenization | "🔥 NLP" | [bytes for emoji + "NLP"] | GPT-2, GPT-3 |
👉 In modern NLP, subword tokenization (BPE, WordPiece, SentencePiece, byte-level BPE) is the most popular approach because it balances vocabulary size against coverage and handles rare words gracefully.
# Demonstration of different types of tokenization
import re

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)  # one-time download of the tokenizer models

text = "I love NLP. It's amazing! Unhappiness can't stop us 😊."

# 1. Sentence Tokenization
sent_tokens = sent_tokenize(text)

# 2. Word Tokenization
word_tokens = word_tokenize(text)

# 3. Character Tokenization
char_tokens = list(text)

# 4. Whitespace Tokenization
whitespace_tokens = text.split()

# 5. Regex Tokenization (keep only alphabetic runs, split on everything else)
regex_tokens = re.findall(r"[A-Za-z]+", text)

# 6. Subword Tokenization (illustrative manual split; real tokenizers learn this)
example_word = "unhappiness"
subword_tokens = ["un", "happi", "ness"]

# 7. Byte-Level Tokenization (encode text into raw UTF-8 bytes)
byte_tokens = list(text.encode("utf-8"))

results = {
    "Sentence Tokenization": sent_tokens,
    "Word Tokenization": word_tokens,
    "Character Tokenization": char_tokens[:20],  # show first 20 chars
    "Whitespace Tokenization": whitespace_tokens,
    "Regex Tokenization": regex_tokens,
    "Subword Tokenization (example)": subword_tokens,
    "Byte-Level Tokenization (first 20 bytes)": byte_tokens[:20],
}
results
{'Sentence Tokenization': ['I love NLP.',
"It's amazing!",
"Unhappiness can't stop us 😊."],
'Word Tokenization': ['I',
'love',
'NLP',
'.',
'It',
"'s",
'amazing',
'!',
'Unhappiness',
'ca',
"n't",
'stop',
'us',
'😊',
'.'],
'Character Tokenization': ['I',
' ',
'l',
'o',
'v',
'e',
' ',
'N',
'L',
'P',
'.',
' ',
'I',
't',
"'",
's',
' ',
'a',
'm',
'a'],
'Whitespace Tokenization': ['I',
'love',
'NLP.',
"It's",
'amazing!',
'Unhappiness',
"can't",
'stop',
'us',
'😊.'],
'Regex Tokenization': ['I',
'love',
'NLP',
'It',
's',
'amazing',
'Unhappiness',
'can',
't',
'stop',
'us'],
'Subword Tokenization (example)': ['un', 'happi', 'ness'],
'Byte-Level Tokenization (first 20 bytes)': [73,
32,
108,
111,
118,
101,
32,
78,
76,
80,
46,
32,
73,
116,
39,
115,
32,
97,
109,
97]}