Vectorization#

1. Bag of Words (BoW)#

  • Concept: Represents a text as a vector of word counts. Ignores grammar and word order.

  • Steps:

    1. Build a vocabulary of all unique words across the corpus.

    2. Count the frequency of each word in every document.

  • Pros: Simple, easy to implement.

  • Cons: Ignores context and semantics, sparse vectors.

Example: Corpus: [“I love NLP”, “NLP is amazing”]. Vocabulary: [I, is, NLP, amazing, love]. BoW vectors:

  • Doc1: [1, 0, 1, 0, 1]

  • Doc2: [0, 1, 1, 1, 0]
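A minimal sketch of BoW with scikit-learn's CountVectorizer (the corpus is the one above; note that CountVectorizer lowercases text and orders the vocabulary alphabetically, so the column order differs from the hand-worked vectors, and the custom token pattern is only there to keep the single-character word "I"):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love NLP", "NLP is amazing"]

# Keep single-character tokens like "I"; the default pattern drops them.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(corpus)        # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # vocabulary (alphabetical, lowercased)
print(X.toarray())                          # word counts per document
```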


2. TF (Term Frequency)#

  • Concept: Represents words by their frequency in a document.

  • Formula:

    \[ TF(word) = \frac{\text{Number of times word appears in document}}{\text{Total number of words in document}} \]
  • Pros: Captures importance of words relative to the document.
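A small sketch of the TF formula applied to a single document (the sentence is illustrative; tokenization is plain whitespace splitting):

```python
from collections import Counter

def term_frequency(document: str) -> dict:
    """TF(word) = count of word in document / total words in document."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

print(term_frequency("NLP is fun and NLP is very fun"))
# {'nlp': 0.25, 'is': 0.25, 'fun': 0.25, 'and': 0.125, 'very': 0.125}
```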


3. TF-IDF (Term Frequency-Inverse Document Frequency)#

  • Concept: Adjusts term frequency by how rare a word is across all documents. Rare words get more weight.

  • Formula:

    \[ TFIDF(word) = TF(word) \times \log\frac{N}{DF(word)} \]
    • \(N\) = Total number of documents

    • \(DF(word)\) = Number of documents containing the word

  • Pros: Reduces importance of common words like “the”, “is”.

  • Use: Widely used in text classification and information retrieval.
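A hedged sketch using scikit-learn's TfidfVectorizer (the corpus is illustrative; scikit-learn applies a smoothed IDF and L2 normalization by default, so the numbers differ slightly from the plain formula above, but frequent words still receive lower weights):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "I love NLP",
    "NLP is amazing",
    "the cat sat on the mat",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)        # sparse TF-IDF matrix

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))                 # words shared across documents get lower weight
```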


4. Word Embeddings#

  • Concept: Dense vector representations capturing semantic meaning of words.

  • Techniques:

    • Word2Vec: Learns embeddings by predicting context words from a target word (skip-gram) or the target word from its context (CBOW).

    • GloVe: Uses global word co-occurrence matrix.

    • FastText: Represents words as bags of character n-grams (subwords), which helps with rare and out-of-vocabulary words.

  • Pros: Captures meaning, similar words have close vectors.

  • Cons: Pretrained embeddings may not always fit your corpus.
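A minimal Word2Vec sketch using gensim (gensim 4.x API assumed; the toy corpus is far too small to learn meaningful embeddings and is only here to show the workflow — in practice you would train on a large corpus or load pretrained vectors):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (illustrative only).
sentences = [
    ["i", "love", "nlp"],
    ["nlp", "is", "amazing"],
    ["i", "love", "machine", "learning"],
]

# sg=1 selects skip-gram; sg=0 would use CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

vector = model.wv["nlp"]                     # dense 50-dimensional vector
print(vector.shape)                          # (50,)
print(model.wv.most_similar("nlp", topn=3))  # nearest words in embedding space
```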


5. Contextualized Embeddings (Transformers)#

  • Concept: Word representation depends on context in the sentence.

  • Models: BERT, GPT, RoBERTa, XLNet, etc.

  • Pros: Handles polysemy (words with multiple meanings), state-of-the-art performance.

  • Cons: Computationally expensive.
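A hedged sketch with Hugging Face Transformers and BERT (model choice and sentences are illustrative): the same word "bank" receives a different vector in each sentence because its context differs.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "I deposited money at the bank",
    "We sat on the bank of the river",
]

with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)
        # One contextual vector per token: shape (1, num_tokens, hidden_size).
        print(text, "->", outputs.last_hidden_state.shape)
```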


6. One-Hot Encoding#

  • Concept: Represent each word as a binary vector with a 1 at the word’s index in the vocabulary.

  • Pros: Simple, easy to implement.

  • Cons: Very sparse, does not capture meaning or similarity.
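A minimal NumPy sketch of one-hot encoding (the corpus, lowercasing, and alphabetical indexing are illustrative choices):

```python
import numpy as np

corpus = ["I love NLP", "NLP is amazing"]

# Build the vocabulary: one index per unique (lowercased) word.
vocab = sorted({word for doc in corpus for word in doc.lower().split()})
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Binary vector with a single 1 at the word's vocabulary index."""
    vector = np.zeros(len(vocab), dtype=int)
    vector[word_to_index[word.lower()]] = 1
    return vector

print(vocab)            # ['amazing', 'i', 'is', 'love', 'nlp']
print(one_hot("NLP"))   # [0 0 0 0 1]
```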


Comparison Summary

| Technique             | Sparse/Dense | Captures Semantics | Context Awareness |
|-----------------------|--------------|--------------------|-------------------|
| Bag of Words          | Sparse       | No                 | No                |
| TF                    | Sparse       | No                 | No                |
| TF-IDF                | Sparse       | No                 | No                |
| Word Embeddings       | Dense        | Yes                | No                |
| Contextual Embeddings | Dense        | Yes                | Yes               |
| One-Hot Encoding      | Sparse       | No                 | No                |
