Terminologies#
Corpus → A collection of text (paragraphs, sentences). Example: one big paragraph is a corpus.
Document → An individual sentence or text unit inside the corpus.
Words → The individual elements (tokens) within a sentence.
Vocabulary → The set of unique words in the corpus.
Tokenization:
Breaking text into smaller units (tokens).
Levels of tokenization:
Paragraph → Sentences (sentence tokenization).
Sentence → Words (word tokenization).
Example:
Corpus: “My name is Krish. I am also a YouTuber.”
Sentence Tokens:
“My name is Krish”
“I am also a YouTuber”
Word Tokens: [My, name, is, Krish, I, am, also, a, YouTuber].
Vocabulary Example:
Text: “I like to drink apple juice. My friend likes mango juice.”
Total words = 11, Unique words = 10 (counting “like” and “likes” separately; if they were merged into one word, the count would drop to 9).
Vocabulary = {I, like, to, drink, apple, juice, my, friend, likes, mango}.
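A quick way to check these counts is sketched below, using only the Python standard library; the text is lowercased so case differences don’t inflate the vocabulary.
# Count total words, unique words, and frequencies for the example text
import re
from collections import Counter

text = "I like to drink apple juice. My friend likes mango juice."

# Keep only alphabetic tokens, lowercased
tokens = re.findall(r"[a-z]+", text.lower())
vocabulary = set(tokens)

print(len(tokens))        # 11 total words
print(len(vocabulary))    # 10 unique words ("like" and "likes" stay separate)
print(Counter(tokens))    # "juice" appears twice, everything else once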
Importance → Tokenization is a key step in text preprocessing for NLP tasks because models require numerical representations (vectors) of words.
Elaboration & Deeper Insights#
Why Corpus, Document, Vocabulary Matter?
Corpus = dataset you are working with (like raw text).
Document = training instance (like one review in sentiment analysis).
Vocabulary = the dictionary of your text world → forms the basis for encoding words into vectors.
Example: In text classification, your vocabulary determines the feature space.
Types of Tokenization
Sentence Tokenization → Splits by punctuation (., !, ?).
Useful in summarization, translation.
Word Tokenization → Splits by spaces, punctuation.
Useful in bag-of-words, embeddings.
Subword Tokenization (modern NLP) → Breaks words into smaller chunks (e.g., “playing” → “play” + “ing”).
Used in BERT, GPT, Hugging Face models to handle unknown words and reduce vocabulary size.
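A rough illustration of subword tokenization, using the Hugging Face transformers library with the bert-base-uncased tokenizer (an assumption: the library must be installed and the model files downloadable); the exact splits depend on the tokenizer’s learned vocabulary.
# Subword (WordPiece) tokenization sketch -- requires `pip install transformers`
from transformers import AutoTokenizer

# bert-base-uncased is assumed here; any WordPiece/BPE tokenizer behaves similarly
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["playing", "tokenization", "YouTuber"]:
    # Rare or compound words get split into smaller pieces marked with "##"
    print(word, "->", tokenizer.tokenize(word))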
Practical Importance of Tokenization
Models like Naive Bayes, SVM, or deep neural networks require text → numbers.
Tokenization provides the units for encoding into:
Count Vectors (Bag of Words)
TF-IDF
Word Embeddings (Word2Vec, GloVe)
Transformers embeddings (BERT, GPT)
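For the first two, a minimal sketch with scikit-learn (assumed to be installed) on the running example shows how the vocabulary becomes the feature space:
# Bag of Words and TF-IDF sketch -- requires `pip install scikit-learn`
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "My name is Krish. I am also a YouTuber.",
    "I like to drink apple juice. My friend likes mango juice.",
]

# Count Vectors: each document becomes a row of word counts over the vocabulary
count_vec = CountVectorizer()
counts = count_vec.fit_transform(documents)
print(count_vec.get_feature_names_out())   # the vocabulary = the feature space
print(counts.toarray())

# TF-IDF: counts re-weighted so words shared by every document matter less
tfidf_vec = TfidfVectorizer()
print(tfidf_vec.fit_transform(documents).toarray())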
Challenges in Tokenization
Ambiguity: “New York” should be one token, not two.
Languages: In Chinese/Japanese, words are not separated by spaces.
Morphology: “like” vs “likes” vs “liked” → same base meaning but different tokens.
Modern approaches use lemmatization + subword tokenization to solve these.
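A small sketch of the lemmatization part with NLTK’s WordNetLemmatizer (assumes the wordnet data has been downloaded), mapping the inflected forms back to a common base:
# Lemmatization sketch with NLTK (reduces inflected forms to a base form)
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")   # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()
for word in ["like", "likes", "liked"]:
    # pos="v" tells the lemmatizer to treat the word as a verb
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))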
Takeaway
Corpus → Documents → Sentences → Words → Vocabulary forms the basic hierarchy of NLP.
Tokenization is the gateway to all NLP tasks: without breaking text into structured units, no ML/DL model can process it.
Future steps after tokenization include: normalization, stopword removal, stemming, lemmatization, embeddings.
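A minimal sketch of two of those next steps, stopword removal and stemming, again with NLTK (assumes the stopwords corpus has been downloaded):
# Stopword removal and stemming sketch with NLTK
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("stopwords")
nltk.download("punkt")

text = "I like to drink apple juice. My friend likes mango juice."
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

# Keep only alphabetic, non-stopword tokens, then reduce each to its stem
tokens = [t for t in word_tokenize(text.lower()) if t.isalpha() and t not in stop_words]
print([stemmer.stem(t) for t in tokens])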
# Demonstration of NLP basics: Corpus, Documents, Vocabulary, Tokenization
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import Counter
# Download necessary NLTK data
nltk.download("punkt")
# Example corpus
corpus = "My name is Krish. I am also a YouTuber. I like to drink apple juice. My friend likes mango juice."
# Sentence Tokenization
sentences = sent_tokenize(corpus)
# Word Tokenization
words = word_tokenize(corpus)
# Vocabulary (unique words)
vocabulary = set(words)
# Word Frequency (to show importance in corpus)
word_freq = Counter(words)
sentences, words, vocabulary, word_freq.most_common(10)
(['My name is Krish.',
'I am also a YouTuber.',
'I like to drink apple juice.',
'My friend likes mango juice.'],
['My',
'name',
'is',
'Krish',
'.',
'I',
'am',
'also',
'a',
'YouTuber',
'.',
'I',
'like',
'to',
'drink',
'apple',
'juice',
'.',
'My',
'friend',
'likes',
'mango',
'juice',
'.'],
{'.',
'I',
'Krish',
'My',
'YouTuber',
'a',
'also',
'am',
'apple',
'drink',
'friend',
'is',
'juice',
'like',
'likes',
'mango',
'name',
'to'},
[('.', 4),
('My', 2),
('I', 2),
('juice', 2),
('name', 1),
('is', 1),
('Krish', 1),
('am', 1),
('also', 1),
('a', 1)])