Vectorization#
1. Bag of Words (BoW)#
Concept: Represents a text as a vector of word counts. Ignores grammar and word order.
Steps:
Build a vocabulary of all unique words across the corpus.
Count the frequency of each word in every document.
Pros: Simple, easy to implement.
Cons: Ignores context and semantics, sparse vectors.
Example: Corpus: [“I love NLP”, “NLP is amazing”]
Vocabulary: [I, is, NLP, amazing, love]
BoW vectors:
Doc1: [1, 0, 1, 0, 1]
Doc2: [0, 1, 1, 1, 0]
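The steps above can be sketched in a few lines of plain Python. The vocabulary order is fixed by hand so the output matches the example vectors; a real library such as scikit-learn's `CountVectorizer` would build and order the vocabulary for you.

```python
from collections import Counter

# Corpus and vocabulary from the example above (vocabulary order fixed by hand).
corpus = ["I love NLP", "NLP is amazing"]
vocab = ["I", "is", "NLP", "amazing", "love"]

def bow_vector(doc, vocab):
    """Count how often each vocabulary word appears in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc, vocab) for doc in corpus]
print(vectors)  # [[1, 0, 1, 0, 1], [0, 1, 1, 1, 0]]
```

Note that word order is lost: “NLP love I” would produce the same vector as “I love NLP”.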
2. TF (Term Frequency)#
Concept: Represents words by their frequency in a document.
Formula:
\[ TF(word) = \frac{\text{Number of times word appears in document}}{\text{Total number of words in document}} \]
Pros: Captures importance of words relative to the document.
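The formula above translates directly into code; this minimal sketch assumes simple whitespace tokenization.

```python
from collections import Counter

def term_frequency(doc):
    """TF(word) = count of word in document / total words in document."""
    tokens = doc.split()
    total = len(tokens)
    return {word: count / total for word, count in Counter(tokens).items()}

tf = term_frequency("NLP is amazing and NLP is fun")
print(tf["NLP"])  # 2 occurrences out of 7 tokens -> 2/7
```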
3. TF-IDF (Term Frequency-Inverse Document Frequency)#
Concept: Adjusts term frequency by how rare a word is across all documents. Rare words get more weight.
Formula:
\[ TFIDF(word) = TF(word) \times \log\frac{N}{DF(word)} \]
\(N\) = Total number of documents
\(DF(word)\) = Number of documents containing the word
Pros: Reduces importance of common words like “the”, “is”.
Use: Widely used in text classification and information retrieval.
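A toy implementation of the formula above, reusing the example corpus. This is the unsmoothed textbook form; production libraries such as scikit-learn's `TfidfVectorizer` add smoothing and vector normalization on top of it.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Per-document TF-IDF scores: TF(word) * log(N / DF(word))."""
    docs = [doc.split() for doc in corpus]
    N = len(docs)
    # DF(word): number of documents containing the word
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({w: (c / total) * math.log(N / df[w])
                       for w, c in counts.items()})
    return scores

scores = tfidf(["I love NLP", "NLP is amazing"])
# "NLP" appears in every document, so log(N / DF) = log(2/2) = 0:
print(scores[0]["NLP"])  # 0.0
```

This shows the key behavior: a word that occurs in every document gets weight zero, while rarer words keep a positive weight.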
4. Word Embeddings#
Concept: Dense vector representations capturing semantic meaning of words.
Techniques:
Word2Vec: Predicts a word from its context (skip-gram or CBOW).
GloVe: Uses global word co-occurrence matrix.
FastText: Captures subword information for rare words.
Pros: Captures meaning, similar words have close vectors.
Cons: Pretrained embeddings may not always fit your corpus.
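The claim that similar words have close vectors is usually measured with cosine similarity. The 3-dimensional vectors below are made up for illustration (real Word2Vec/GloVe embeddings have hundreds of dimensions), but the comparison works the same way.

```python
import math

# Toy embeddings, invented for illustration -- not from a trained model.
embeddings = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(embeddings["king"], embeddings["queen"]))  # high: related words
print(cosine(embeddings["king"], embeddings["apple"]))  # low: unrelated words
```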
5. Contextualized Embeddings (Transformers)#
Concept: Word representation depends on context in the sentence.
Models: BERT, GPT, RoBERTa, XLNet, etc.
Pros: Handles polysemy (words with multiple meanings), state-of-the-art performance.
Cons: Computationally expensive.
6. One-Hot Encoding#
Concept: Represent each word as a binary vector with a 1 at the word’s index in the vocabulary.
Pros: Simple, easy to implement.
Cons: Very sparse, does not capture meaning or similarity.
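A minimal sketch over the example vocabulary: each word maps to a binary vector with a single 1 at its vocabulary index.

```python
vocab = ["I", "is", "NLP", "amazing", "love"]

def one_hot(word, vocab):
    """Binary vector with a 1 at the word's index in the vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("NLP", vocab))  # [0, 0, 1, 0, 0]
```

Every pair of distinct words is equally distant in this representation, which is why one-hot vectors capture no notion of similarity.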
Comparison Summary

| Technique | Sparse/Dense | Captures Semantics | Context Awareness |
|---|---|---|---|
| Bag of Words | Sparse | ❌ | ❌ |
| TF | Sparse | ❌ | ❌ |
| TF-IDF | Sparse | ❌ | ❌ |
| Word Embeddings | Dense | ✅ | ❌ |
| Contextual Embeddings | Dense | ✅ | ✅ |
| One-Hot Encoding | Sparse | ❌ | ❌ |