TF-IDF#

1. Definition#

TF-IDF (Term Frequency – Inverse Document Frequency) is a numerical statistic that reflects how important a word is to a document relative to a collection of documents (the corpus).

It combines two measures:

  1. TF (Term Frequency): How often a word appears in a document.

  2. IDF (Inverse Document Frequency): How unique or rare a word is across all documents in the corpus.

Mathematically:

\[ \text{TF-IDF}(t,d) = \text{TF}(t,d) \times \text{IDF}(t) \]

Where:

\[ \text{IDF}(t) = \log \frac{N}{1 + \text{DF}(t)} \]

  • \(t\) → term (word)

  • \(d\) → document

  • \(N\) → total number of documents

  • \(\text{DF}(t)\) → number of documents containing the term \(t\)

  • Adding 1 in the denominator avoids division by zero
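
These two formulas drop straight into Python. Below is a minimal sketch (assuming lowercase whitespace tokenization; the scikit-learn example later refines this):

import math

def tf(term, doc):
    # Term Frequency: occurrences of the term divided by document length
    words = doc.lower().split()
    return words.count(term.lower()) / len(words)

def idf(term, corpus):
    # Inverse Document Frequency, with the +1 smoothing from the formula above
    df = sum(term.lower() in doc.lower().split() for doc in corpus)
    return math.log(len(corpus) / (1 + df))

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)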


2. Intuition#

  • High TF: Word appears frequently in a document → important for that document.

  • High IDF: Word appears in fewer documents → more unique → carries more information.

  • High TF-IDF: Word is frequent in a document and rare across other documents → highly significant.

Example:

  • Word “the” → high TF but appears in almost all documents → low IDF → low TF-IDF.

  • Word “NLP” → appears multiple times in a specific document but rarely elsewhere → high TF-IDF.


3. Example#

Suppose we have 3 documents:

Doc1: "I love NLP"
Doc2: "NLP is amazing"
Doc3: "I love Python"

Step 1: Calculate TF

  • Doc1: I(1), love(1), NLP(1) → total 3 words

  • TF(“NLP”, Doc1) = 1 / 3 ≈ 0.333

Step 2: Calculate IDF

  • NLP appears in 2 documents → DF(“NLP”) = 2

  • Total documents N = 3

  • IDF(“NLP”) = log(3 / (1+2)) = log(3/3) = log(1) = 0

Step 3: TF-IDF

  • TF-IDF(“NLP”, Doc1) = TF × IDF = 0.333 × 0 = 0

Note: Words appearing in almost all documents get TF-IDF close to 0.

  • Word “Python” appears only in Doc3 → its TF-IDF will be high for Doc3 (see the sketch below).
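
A short from-scratch sketch that reproduces Steps 1–3 for this corpus (whitespace tokenization assumed):

import math

docs = ["I love NLP", "NLP is amazing", "I love Python"]

def tf_idf(term, doc, corpus):
    words = doc.lower().split()
    tf = words.count(term.lower()) / len(words)                  # Step 1: TF
    df = sum(term.lower() in d.lower().split() for d in corpus)  # DF(term)
    idf = math.log(len(corpus) / (1 + df))                       # Step 2: smoothed IDF
    return tf * idf                                              # Step 3: TF-IDF

print(tf_idf("NLP", docs[0], docs))     # 0.0    -> 0.333 * log(3/3)
print(tf_idf("Python", docs[2], docs))  # ~0.135 -> 0.333 * log(3/2)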


4. Implementation with scikit-learn#

The same computation using scikit-learn's TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
docs = [
    "I love NLP",
    "NLP is amazing",
    "I love Python"
]

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Feature names (words)
print("Vocabulary:", vectorizer.get_feature_names_out())

# TF-IDF matrix
print("TF-IDF matrix:\n", X.toarray())
Vocabulary: ['amazing' 'is' 'love' 'nlp' 'python']
TF-IDF matrix:
 [[0.         0.         0.70710678 0.70710678 0.        ]
 [0.62276601 0.62276601 0.         0.4736296  0.        ]
 [0.         0.         0.60534851 0.         0.79596054]]
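
Two details explain why this output differs from the hand calculation: the vectorizer's default token pattern keeps only tokens of two or more characters (so "I" is dropped from the vocabulary), and scikit-learn's default IDF is the smoothed variant log((1+N)/(1+DF)) + 1 (natural log), with each row then L2-normalized. The fitted IDF weights can be inspected via the vectorizer's idf_ attribute:

# Per-term IDF weights learned by the vectorizer
for word, weight in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(f"{word}: {weight:.4f}")
# amazing / is / python -> log(4/2) + 1 ≈ 1.6931; love / nlp -> log(4/3) + 1 ≈ 1.2877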

5. Key Points#

  • TF-IDF reduces the weight of common words like “the”, “is”, “and”.

  • Highlights words unique to a document.

  • Widely used in text classification, search engines, and recommendation systems.



6. TF-IDF Intuition#

TF-IDF stands for Term Frequency – Inverse Document Frequency. It is a way to weight words by how important they are within a corpus. The key idea is:

  1. Words that appear often in a document are important → TF (Term Frequency).

  2. Words that appear in many documents are less informative → IDF (Inverse Document Frequency).


Step 1: Term Frequency (TF)#

  • Measures how often a word occurs in a document.

  • Intuition: If a word occurs more often in a document, it’s probably important for that document.

\[ TF(word) = \frac{\text{Number of times word appears in doc}}{\text{Total number of words in doc}} \]

Example:

Document: "I love NLP and NLP is amazing"

  • Word "NLP" occurs 2 times out of 6 words → TF(NLP) = 2/6 = 0.33

  • Word "amazing" occurs 1 time out of 6 → TF(amazing) = 1/6 = 0.167


Step 2: Inverse Document Frequency (IDF)#

  • Measures how rare a word is across all documents.

  • Intuition: Words like "the", "is", "and" appear everywhere → not important. Words like "NLP", "Python" appear less → more important.

\[ IDF(word) = \log \frac{\text{Total number of documents}}{\text{Number of documents containing the word}} \]

Here \(\log\) is the natural logarithm, and this is the unsmoothed variant; the formula in Section 1 adds 1 to the denominator to avoid division by zero.

Example:

Corpus:

  1. "I love NLP"

  2. "NLP is amazing"

  3. "I love Python"

  • "NLP" appears in 2 documents → IDF(NLP) = log(3/2) ≈ 0.176

  • "Python" appears in 1 document → IDF(Python) = log(3/1) ≈ 1.098


Step 3: TF-IDF#

Finally, multiply TF and IDF to get TF-IDF weight:

\[ TF\text{-}IDF(word, doc) = TF(word, doc) \times IDF(word) \]

  • High TF-IDF → Important word in the document

  • Low TF-IDF → Common or less relevant word

Intuition:

  • Words that are frequent in a document but rare in the corpus get the highest scores.

  • Words that are common across documents get low scores, even if frequent in one document.
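
Putting the numbers from Steps 1 and 2 together makes the contrast concrete: both words occur once in a 3-word document, so their TF is identical, but the rarer word wins on IDF.

# Same TF (1/3), different IDF -> the corpus-rare word scores higher
print((1 / 3) * 0.405)  # "NLP" in Doc1:    ≈ 0.135 (in 2 of 3 docs)
print((1 / 3) * 1.099)  # "Python" in Doc3: ≈ 0.366 (in 1 of 3 docs)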


Analogy#

  • Imagine reading news articles:

    • The word "the" appears in every article → not useful.

    • The word "NLP" appears in only tech articles → important.

  • TF-IDF mathematically captures this intuition.