Tokenization
1. Sentence Tokenization (Sentence Segmentation)
Breaks a paragraph/document into sentences.
Useful for tasks like summarization, translation, and dialogue systems.
Example:
Text: "I love NLP. It is amazing!" Sentence Tokens: ["I love NLP.", "It is amazing!"]
2. Word Tokenization
Splits sentences into individual words.
Example:
Text: "I love NLP." Word Tokens: ["I", "love", "NLP", "."]
3. Character Tokenization
Splits text into individual characters.
Useful for handling misspellings, rare words, and languages like Chinese.
Example:
Text: "NLP" Character Tokens: ["N", "L", "P"]
4. Subword Tokenization
Breaks words into meaningful sub-units rather than treating whole words or single characters as tokens.
Handles out-of-vocabulary (OOV) words better.
Used in modern NLP models like BERT, GPT, T5.
Example:
Word: "unhappiness" Subword Tokens: ["un", "happi", "ness"]
Common algorithms:
Byte Pair Encoding (BPE)
WordPiece (used in BERT)
SentencePiece (used in T5, XLNet)
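A quick sketch using the Hugging Face transformers library (an assumption; it is not used elsewhere in this section). The exact pieces depend on each model's learned vocabulary, so treat the output as illustrative:

from transformers import AutoTokenizer

# WordPiece tokenizer trained for BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.tokenize("unhappiness")
# pieces vary with the learned vocabulary; word-internal pieces are prefixed with '##'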
5. Whitespace Tokenization
Simply splits text by spaces.
Fast but naive:
"NLP-based tokenization" → ["NLP-based", "tokenization"]
Problem: it doesn't handle punctuation well.
6. Regex Tokenization
Uses regular expressions to define custom rules.
Example: Split on non-alphanumeric characters, keeping only word-like tokens.
Text: "Email me at abc123@gmail.com!" Regex Tokens: ["Email", "me", "at", "abc123", "gmail", "com"]
7. Morphological Tokenization
Splits words into roots, prefixes, suffixes (morphological units).
Example (English):
"playing" → ["play", "ing"]
Example (Turkish):
"evlerinizden" (from your houses) → ["ev" (house), "ler" (plural), "iniz" (your), "den" (from)]
8. Byte-Level Tokenization
Works at the raw byte level (instead of characters).
Used in GPT-2 and GPT-3 models.
Handles any language, emoji, or special character.
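Python's built-in encode makes this concrete; a single emoji becomes several bytes:

list("🔥".encode("utf-8"))
# [240, 159, 148, 165]  (four UTF-8 bytes for one emoji)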
Summary Table

| Tokenization Type | Example Input | Example Output | Used In |
|---|---|---|---|
| Sentence Tokenization | "I love NLP. It's fun." | ["I love NLP.", "It's fun."] | Summarization, Translation |
| Word Tokenization | "I love NLP." | ["I", "love", "NLP", "."] | Most NLP tasks |
| Character Tokenization | "NLP" | ["N", "L", "P"] | Chinese/Japanese, OCR |
| Subword Tokenization | "unhappiness" | ["un", "happi", "ness"] | BERT, GPT, T5 |
| Whitespace Tokenization | "NLP-based model" | ["NLP-based", "model"] | Simple tasks |
| Regex Tokenization | "abc123@gmail.com" | ["abc123", "gmail", "com"] | Custom NLP pipelines |
| Morphological Tokenization | "playing" | ["play", "ing"] | Morphology-heavy languages |
| Byte-Level Tokenization | "🔥 NLP" | [bytes for emoji + "NLP"] | GPT-2, GPT-3 |
👉 In modern NLP, subword tokenization (BPE, WordPiece, SentencePiece, byte-level BPE) is the most popular approach because it balances vocabulary size against coverage and handles rare words gracefully.
# Demonstration of different types of tokenization
import re

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)  # one-time download of the tokenizer models

text = "I love NLP. It's amazing! Unhappiness can't stop us 😊."

# 1. Sentence Tokenization
sent_tokens = sent_tokenize(text)

# 2. Word Tokenization
word_tokens = word_tokenize(text)

# 3. Character Tokenization
char_tokens = list(text)

# 4. Whitespace Tokenization
whitespace_tokens = text.split()

# 5. Regex Tokenization (keep only alphabetic runs, split on everything else)
regex_tokens = re.findall(r"[A-Za-z]+", text)

# 6. Subword Tokenization (illustrative manual split; real tokenizers learn this)
example_word = "unhappiness"
subword_tokens = ["un", "happi", "ness"]

# 7. Byte-Level Tokenization (encode text into raw UTF-8 bytes)
byte_tokens = list(text.encode("utf-8"))

results = {
    "Sentence Tokenization": sent_tokens,
    "Word Tokenization": word_tokens,
    "Character Tokenization": char_tokens[:20],  # show first 20 chars
    "Whitespace Tokenization": whitespace_tokens,
    "Regex Tokenization": regex_tokens,
    "Subword Tokenization (example)": subword_tokens,
    "Byte-Level Tokenization (first 20 bytes)": byte_tokens[:20],
}
results
{'Sentence Tokenization': ['I love NLP.',
"It's amazing!",
"Unhappiness can't stop us 😊."],
'Word Tokenization': ['I',
'love',
'NLP',
'.',
'It',
"'s",
'amazing',
'!',
'Unhappiness',
'ca',
"n't",
'stop',
'us',
'😊',
'.'],
'Character Tokenization': ['I',
' ',
'l',
'o',
'v',
'e',
' ',
'N',
'L',
'P',
'.',
' ',
'I',
't',
"'",
's',
' ',
'a',
'm',
'a'],
'Whitespace Tokenization': ['I',
'love',
'NLP.',
"It's",
'amazing!',
'Unhappiness',
"can't",
'stop',
'us',
'😊.'],
'Regex Tokenization': ['I',
'love',
'NLP',
'It',
's',
'amazing',
'Unhappiness',
'can',
't',
'stop',
'us'],
'Subword Tokenization (example)': ['un', 'happi', 'ness'],
'Byte-Level Tokenization (first 20 bytes)': [73,
32,
108,
111,
118,
101,
32,
78,
76,
80,
46,
32,
73,
116,
39,
115,
32,
97,
109,
97]}