Terminologies

  • Corpus → The full collection of text you are working with (paragraphs, sentences). Example: one big paragraph can serve as a corpus.

  • Document → An individual sentence or text unit inside the corpus.

  • Words → The individual elements (tokens) within a sentence.

  • Vocabulary → The set of unique words in the corpus.

  • Tokenization:

    • Breaking text into smaller units (tokens).

    • Levels of tokenization:

      1. Paragraph → Sentences (sentence tokenization).

      2. Sentence → Words (word tokenization).

    • Example:

      • Corpus: “My name is Krish. I am also a YouTuber.”

      • Sentence Tokens:

        1. “My name is Krish”

        2. “I am also a YouTuber”

      • Word Tokens: [My, name, is, Krish, I, am, also, a, YouTuber].

  • Vocabulary Example:

    • Text: “I like to drink apple juice. My friend likes mango juice.”

    • Total words = 11; unique words = 10 when “like” and “likes” are counted as separate words (merging them into one base form would give 9). The pure-Python sketch after this list reproduces these counts.

    • Vocabulary = {I, like, to, drink, apple, juice, my, friend, likes, mango}.

  • Importance → Tokenization is a key step in text preprocessing for NLP tasks because models require numerical representations (vectors) of words.
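
The hierarchy above can be reproduced in a few lines of plain Python. This is a minimal sketch using a naive regex split; real tokenizers such as NLTK's (shown in the demo at the end of these notes) handle punctuation and edge cases more carefully.

import re

corpus = "I like to drink apple juice. My friend likes mango juice."

# Sentence tokenization: naive split at sentence-ending punctuation
sentences = [s.strip() for s in re.split(r"[.!?]", corpus) if s.strip()]

# Word tokenization: lowercase the text, then extract alphabetic tokens
words = re.findall(r"[a-z]+", corpus.lower())

# Vocabulary: the set of unique words
vocabulary = set(words)

print(sentences)         # ['I like to drink apple juice', 'My friend likes mango juice']
print(len(words))        # 11 total words
print(len(vocabulary))   # 10 unique words ('juice' repeats; 'like' and 'likes' stay separate)
print(sorted(vocabulary))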


Elaboration & Deeper Insights

  1. Why Do Corpus, Document, and Vocabulary Matter?

    • Corpus = dataset you are working with (like raw text).

    • Document = training instance (like one review in sentiment analysis).

    • Vocabulary = the dictionary of your text world → forms the basis for encoding words into vectors.

    • Example: In text classification, your vocabulary determines the feature space.
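
To make that last point concrete, here is a minimal sketch (the helper name count_vector is made up for this example): the vocabulary fixes one dimension per unique word, and every document is encoded against exactly those dimensions.

documents = ["I like apple juice", "My friend likes mango juice"]

# The vocabulary defines the feature space: one dimension per unique word
vocabulary = sorted({w for doc in documents for w in doc.lower().split()})
index = {word: i for i, word in enumerate(vocabulary)}

def count_vector(doc):
    """Encode a document as word counts over the shared vocabulary."""
    vec = [0] * len(vocabulary)
    for word in doc.lower().split():
        if word in index:  # out-of-vocabulary words are simply dropped
            vec[index[word]] += 1
    return vec

print(vocabulary)                                  # 8 dimensions
print(count_vector("I like apple juice"))          # [1, 0, 1, 1, 1, 0, 0, 0]
print(count_vector("My friend likes mango juice"))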


  2. Types of Tokenization

    • Sentence Tokenization → Splits by punctuation (., !, ?).

      • Useful in summarization, translation.

    • Word Tokenization → Splits by spaces, punctuation.

      • Useful in bag-of-words, embeddings.

    • Subword Tokenization (modern NLP) → Breaks words into smaller chunks (e.g., “playing” → “play” + “ing”).

      • Used in BERT, GPT, Hugging Face models to handle unknown words and reduce vocabulary size.
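
The idea can be illustrated with a toy WordPiece-style tokenizer (greedy longest-match-first, the scheme BERT uses). The tiny subword vocabulary below is invented for this sketch; real models learn vocabularies of tens of thousands of subwords from data.

# Toy subword vocabulary; "##" marks a continuation piece, as in WordPiece
subword_vocab = {"play", "like", "juice", "##ing", "##ed", "##s"}

def subword_tokenize(word):
    """Greedily match the longest known piece from the left, WordPiece-style."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # non-initial pieces carry the "##" prefix
            if piece in subword_vocab:
                tokens.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no known piece matched: the word is unknown
        start = end
    return tokens

print(subword_tokenize("playing"))  # ['play', '##ing']
print(subword_tokenize("played"))   # ['play', '##ed']
print(subword_tokenize("likes"))    # ['like', '##s']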


  3. Practical Importance of Tokenization

    • Models like Naive Bayes, SVM, or deep neural networks require text to be converted into numbers.

    • Tokenization provides the units for encoding into:

      • Count Vectors (Bag of Words)

      • TF-IDF

      • Word Embeddings (Word2Vec, GloVe)

      • Transformer embeddings (BERT, GPT)
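
A brief, hedged sketch of the first two encodings using scikit-learn (assumed installed); the embedding families need separate libraries and pretrained weights, so they are left out here.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["I like to drink apple juice", "My friend likes mango juice"]

# Bag of Words: raw token counts per document
bow = CountVectorizer()  # note: the default token pattern drops 1-letter tokens like "I"
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())  # the learned vocabulary = the feature space
print(counts.toarray())

# TF-IDF: counts reweighted so words common to every document matter less
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
print(weights.toarray().round(2))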


  4. Challenges in Tokenization

    • Ambiguity: “New York” should be one token, not two.

    • Languages: In Chinese/Japanese, words are not separated by spaces.

    • Morphology: “like” vs “likes” vs “liked” → same base meaning but different tokens.

    • Modern approaches combine lemmatization and subword tokenization to address these issues.
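
For the morphology problem, a short sketch with NLTK's WordNetLemmatizer (it needs the "wordnet" resource; handling multi-word tokens like "New York" would additionally require n-gram or named-entity detection, which is out of scope here):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # lemmatizer dictionary; some NLTK versions also need "omw-1.4"

lemmatizer = WordNetLemmatizer()

# pos="v" tells the lemmatizer to treat each word as a verb
for word in ["like", "likes", "liked"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))
# all three map to the same base form: "like"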


Takeaway

  • Corpus → Documents → Sentences → Words → Vocabulary forms the basic hierarchy of NLP.

  • Tokenization is the gateway to all NLP tasks: without breaking text into structured units, no ML/DL model can process it.

  • Future steps after tokenization include: normalization, stopword removal, stemming, lemmatization, embeddings.
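
Those future steps can be chained into a small pipeline. A hedged sketch using NLTK's stopword list and PorterStemmer (the step order shown is a common convention, not a fixed rule):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("stopwords")  # "punkt" is downloaded in the demo below

text = "I like to drink apple juice. My friend likes mango juice."

tokens = word_tokenize(text.lower())                 # tokenization + case normalization
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]  # drop stopwords/punctuation
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]            # stemming

print(tokens)  # e.g. ['like', 'drink', 'apple', 'juice', 'friend', 'likes', 'mango', 'juice']
print(stems)   # e.g. ['like', 'drink', 'appl', 'juic', 'friend', 'like', 'mango', 'juic']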

# Demonstration of NLP basics: Corpus, Documents, Vocabulary, Tokenization

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import Counter

# Download the tokenizer models (newer NLTK releases may also need "punkt_tab")
nltk.download("punkt")

# Example corpus
corpus = "My name is Krish. I am also a YouTuber. I like to drink apple juice. My friend likes mango juice."

# Sentence Tokenization
sentences = sent_tokenize(corpus)

# Word Tokenization
words = word_tokenize(corpus)

# Vocabulary (unique words)
vocabulary = set(words)

# Word Frequency (to show importance in corpus)
word_freq = Counter(words)

sentences, words, vocabulary, word_freq.most_common(10)
(['My name is Krish.',
  'I am also a YouTuber.',
  'I like to drink apple juice.',
  'My friend likes mango juice.'],
 ['My', 'name', 'is', 'Krish', '.', 'I', 'am', 'also', 'a', 'YouTuber', '.',
  'I', 'like', 'to', 'drink', 'apple', 'juice', '.',
  'My', 'friend', 'likes', 'mango', 'juice', '.'],
 {'.', 'I', 'Krish', 'My', 'YouTuber', 'a', 'also', 'am', 'apple', 'drink',
  'friend', 'is', 'juice', 'like', 'likes', 'mango', 'name', 'to'},
 [('.', 4), ('My', 2), ('I', 2), ('juice', 2), ('name', 1),
  ('is', 1), ('Krish', 1), ('am', 1), ('also', 1), ('a', 1)])