Vectorization#
1. Bag of Words (BoW)#
Concept: Represents a text as a vector of word counts. Ignores grammar and word order.
Steps:
Build a vocabulary of all unique words across the corpus.
Count the frequency of each word in every document.
Pros: Simple, easy to implement.
Cons: Ignores context and semantics, sparse vectors.
Example: Corpus: [“I love NLP”, “NLP is amazing”]
Vocabulary: [I, is, NLP, amazing, love]
BoW vectors:
Doc1: [1, 0, 1, 0, 1]
Doc2: [0, 1, 1, 1, 0]
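The steps above can be sketched in a few lines of plain Python. The vocabulary order is fixed by hand so the output matches the example vectors; a real library such as scikit-learn's `CountVectorizer` would build and order the vocabulary for you.

```python
from collections import Counter

# Corpus and vocabulary from the example above (vocabulary order fixed by hand).
corpus = ["I love NLP", "NLP is amazing"]
vocab = ["I", "is", "NLP", "amazing", "love"]

def bow_vector(doc, vocab):
    """Count how often each vocabulary word appears in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc, vocab) for doc in corpus]
print(vectors)  # [[1, 0, 1, 0, 1], [0, 1, 1, 1, 0]]
```

Note that word order is lost: “NLP love I” would produce the same vector as “I love NLP”.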
2. TF (Term Frequency)#
Concept: Represents words by their frequency in a document.
Formula:
\[ TF(word) = \frac{\text{Number of times word appears in document}}{\text{Total number of words in document}} \]
Pros: Captures importance of words relative to the document.
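The formula above translates directly into code; this minimal sketch assumes simple whitespace tokenization.

```python
from collections import Counter

def term_frequency(doc):
    """TF(word) = count of word in document / total words in document."""
    tokens = doc.split()
    total = len(tokens)
    return {word: count / total for word, count in Counter(tokens).items()}

tf = term_frequency("NLP is amazing and NLP is fun")
print(tf["NLP"])  # 2 occurrences out of 7 tokens -> 2/7
```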
3. TF-IDF (Term Frequency-Inverse Document Frequency)#
Concept: Adjusts term frequency by how rare a word is across all documents. Rare words get more weight.
Formula:
\[ TFIDF(word) = TF(word) \times \log\frac{N}{DF(word)} \]
\(N\) = Total number of documents
\(DF(word)\) = Number of documents containing the word
Pros: Reduces importance of common words like “the”, “is”.
Use: Widely used in text classification and information retrieval.
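A toy implementation of the formula above, reusing the example corpus. This is the unsmoothed textbook form; production libraries such as scikit-learn's `TfidfVectorizer` add smoothing and vector normalization on top of it.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Per-document TF-IDF scores: TF(word) * log(N / DF(word))."""
    docs = [doc.split() for doc in corpus]
    N = len(docs)
    # DF(word): number of documents containing the word
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({w: (c / total) * math.log(N / df[w])
                       for w, c in counts.items()})
    return scores

scores = tfidf(["I love NLP", "NLP is amazing"])
# "NLP" appears in every document, so log(N / DF) = log(2/2) = 0:
print(scores[0]["NLP"])  # 0.0
```

This shows the key behavior: a word that occurs in every document gets weight zero, while rarer words keep a positive weight.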
4. Word Embeddings#
Concept: Dense vector representations capturing semantic meaning of words.
Techniques:
Word2Vec: Predicts a word from its context (skip-gram or CBOW).
GloVe: Uses global word co-occurrence matrix.
FastText: Captures subword information for rare words.
Pros: Captures meaning, similar words have close vectors.
Cons: Pretrained embeddings may not always fit your corpus.
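The claim that similar words have close vectors is usually measured with cosine similarity. The 3-dimensional vectors below are made up for illustration (real Word2Vec/GloVe embeddings have hundreds of dimensions), but the comparison works the same way.

```python
import math

# Toy embeddings, invented for illustration -- not from a trained model.
embeddings = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(embeddings["king"], embeddings["queen"]))  # high: related words
print(cosine(embeddings["king"], embeddings["apple"]))  # low: unrelated words
```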
5. Contextualized Embeddings (Transformers)#
Concept: Word representation depends on context in the sentence.
Models: BERT, GPT, RoBERTa, XLNet, etc.
Pros: Handles polysemy (words with multiple meanings), state-of-the-art performance.
Cons: Computationally expensive.
6. One-Hot Encoding#
Concept: Represent each word as a binary vector with a 1 at the word’s index in the vocabulary.
Pros: Simple, easy to implement.
Cons: Very sparse, does not capture meaning or similarity.
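A minimal sketch over the example vocabulary: each word maps to a binary vector with a single 1 at its vocabulary index.

```python
vocab = ["I", "is", "NLP", "amazing", "love"]

def one_hot(word, vocab):
    """Binary vector with a 1 at the word's index in the vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("NLP", vocab))  # [0, 0, 1, 0, 0]
```

Every pair of distinct words is equally distant in this representation, which is why one-hot vectors capture no notion of similarity.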
Comparison Summary

| Technique | Sparse/Dense | Captures Semantics | Context Awareness |
|---|---|---|---|
| Bag of Words | Sparse | ❌ | ❌ |
| TF | Sparse | ❌ | ❌ |
| TF-IDF | Sparse | ❌ | ❌ |
| Word Embeddings | Dense | ✅ | ❌ |
| Contextual Embeddings | Dense | ✅ | ✅ |
| One-Hot Encoding | Sparse | ❌ | ❌ |