Terminologies#
Corpus → A collection of text (paragraphs, sentences). Example: one big paragraph is a corpus.
Document → An individual sentence or text unit inside the corpus.
Words → The individual elements (tokens) within a sentence.
Vocabulary → The set of unique words in the corpus.
Tokenization:
Breaking text into smaller units (tokens).
Levels of tokenization:
Paragraph → Sentences (sentence tokenization).
Sentence → Words (word tokenization).
Example:
Corpus: “My name is Krish. I am also a YouTuber.”
Sentence Tokens:
“My name is Krish”
“I am also a YouTuber”
Word Tokens: [My, name, is, Krish, I, am, also, a, YouTuber].
Vocabulary Example:
Text: “I like to drink apple juice. My friend likes mango juice.”
Total words = 11, Unique words = 10 (counting “like” and “likes” separately; if they were merged into one word, the count would drop to 9).
Vocabulary = {I, like, to, drink, apple, juice, my, friend, likes, mango}.
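A quick way to check these counts is sketched below, using only the Python standard library; the text is lowercased so case differences don’t inflate the vocabulary.
# Count total words, unique words, and frequencies for the example text
import re
from collections import Counter

text = "I like to drink apple juice. My friend likes mango juice."

# Keep only alphabetic tokens, lowercased
tokens = re.findall(r"[a-z]+", text.lower())
vocabulary = set(tokens)

print(len(tokens))        # 11 total words
print(len(vocabulary))    # 10 unique words ("like" and "likes" stay separate)
print(Counter(tokens))    # "juice" appears twice, everything else once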
Importance → Tokenization is a key step in text preprocessing for NLP tasks because models require numerical representations (vectors) of words.
Elaboration & Deeper Insights#
Why Corpus, Document, Vocabulary Matter?
Corpus = dataset you are working with (like raw text).
Document = training instance (like one review in sentiment analysis).
Vocabulary = the dictionary of your text world → forms the basis for encoding words into vectors.
Example: In text classification, your vocabulary determines the feature space.
Types of Tokenization
Sentence Tokenization → Splits by punctuation (., !, ?).
Useful in summarization, translation.
Word Tokenization → Splits by spaces, punctuation.
Useful in bag-of-words, embeddings.
Subword Tokenization (modern NLP) → Breaks words into smaller chunks (e.g., “playing” → “play” + “ing”).
Used in BERT, GPT, Hugging Face models to handle unknown words and reduce vocabulary size.
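A rough illustration of subword tokenization, using the Hugging Face transformers library with the bert-base-uncased tokenizer (an assumption: the library must be installed and the model files downloadable); the exact splits depend on the tokenizer’s learned vocabulary.
# Subword (WordPiece) tokenization sketch -- requires `pip install transformers`
from transformers import AutoTokenizer

# bert-base-uncased is assumed here; any WordPiece/BPE tokenizer behaves similarly
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["playing", "tokenization", "YouTuber"]:
    # Rare or compound words get split into smaller pieces marked with "##"
    print(word, "->", tokenizer.tokenize(word))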
Practical Importance of Tokenization
Models like Naive Bayes, SVM, or deep neural networks require text → numbers.
Tokenization provides the units for encoding into:
Count Vectors (Bag of Words)
TF-IDF
Word Embeddings (Word2Vec, GloVe)
Transformers embeddings (BERT, GPT)
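For the first two, a minimal sketch with scikit-learn (assumed to be installed) on the running example shows how the vocabulary becomes the feature space:
# Bag of Words and TF-IDF sketch -- requires `pip install scikit-learn`
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "My name is Krish. I am also a YouTuber.",
    "I like to drink apple juice. My friend likes mango juice.",
]

# Count Vectors: each document becomes a row of word counts over the vocabulary
count_vec = CountVectorizer()
counts = count_vec.fit_transform(documents)
print(count_vec.get_feature_names_out())   # the vocabulary = the feature space
print(counts.toarray())

# TF-IDF: counts re-weighted so words shared by every document matter less
tfidf_vec = TfidfVectorizer()
print(tfidf_vec.fit_transform(documents).toarray())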
Challenges in Tokenization
Ambiguity: “New York” should be one token, not two.
Languages: In Chinese/Japanese, words are not separated by spaces.
Morphology: “like” vs “likes” vs “liked” → same base meaning but different tokens.
Modern approaches use lemmatization + subword tokenization to solve these.
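A small sketch of the lemmatization part with NLTK’s WordNetLemmatizer (assumes the wordnet data has been downloaded), mapping the inflected forms back to a common base:
# Lemmatization sketch with NLTK (reduces inflected forms to a base form)
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")   # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()
for word in ["like", "likes", "liked"]:
    # pos="v" tells the lemmatizer to treat the word as a verb
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))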
Takeaway
Corpus → Documents → Sentences → Words → Vocabulary forms the basic hierarchy of NLP.
Tokenization is the gateway to all NLP tasks: without breaking text into structured units, no ML/DL model can process it.
Future steps after tokenization include: normalization, stopword removal, stemming, lemmatization, embeddings.
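A minimal sketch of two of those next steps, stopword removal and stemming, again with NLTK (assumes the stopwords corpus has been downloaded):
# Stopword removal and stemming sketch with NLTK
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("stopwords")
nltk.download("punkt")

text = "I like to drink apple juice. My friend likes mango juice."
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

# Keep only alphabetic, non-stopword tokens, then reduce each to its stem
tokens = [t for t in word_tokenize(text.lower()) if t.isalpha() and t not in stop_words]
print([stemmer.stem(t) for t in tokens])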
# Demonstration of NLP basics: Corpus, Documents, Vocabulary, Tokenization
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import Counter
# Download necessary NLTK data
nltk.download("punkt")
# Example corpus
corpus = "My name is Krish. I am also a YouTuber. I like to drink apple juice. My friend likes mango juice."
# Sentence Tokenization
sentences = sent_tokenize(corpus)
# Word Tokenization
words = word_tokenize(corpus)
# Vocabulary (unique words)
vocabulary = set(words)
# Word Frequency (to show importance in corpus)
word_freq = Counter(words)
sentences, words, vocabulary, word_freq.most_common(10)
(['My name is Krish.',
'I am also a YouTuber.',
'I like to drink apple juice.',
'My friend likes mango juice.'],
['My',
'name',
'is',
'Krish',
'.',
'I',
'am',
'also',
'a',
'YouTuber',
'.',
'I',
'like',
'to',
'drink',
'apple',
'juice',
'.',
'My',
'friend',
'likes',
'mango',
'juice',
'.'],
{'.',
'I',
'Krish',
'My',
'YouTuber',
'a',
'also',
'am',
'apple',
'drink',
'friend',
'is',
'juice',
'like',
'likes',
'mango',
'name',
'to'},
[('.', 4),
('My', 2),
('I', 2),
('juice', 2),
('name', 1),
('is', 1),
('Krish', 1),
('am', 1),
('also', 1),
('a', 1)])