TF-IDF#

1. Definition#

TF-IDF (Term Frequency – Inverse Document Frequency) is a numerical statistic that reflects how important a word is to a document relative to a collection of documents (the corpus).

It combines two measures:

  1. TF (Term Frequency): How often a word appears in a document.

  2. IDF (Inverse Document Frequency): How unique or rare a word is across all documents in the corpus.

Mathematically:

\[ \text{TF-IDF}(t,d) = \text{TF}(t,d) \times \text{IDF}(t) \]

Where:

\[ \text{IDF}(t) = \log \frac{N}{1 + \text{DF}(t)} \]

  • \(t\) → term (word)

  • \(d\) → document

  • \(N\) → total number of documents

  • \(\text{DF}(t)\) → number of documents containing the term \(t\)

  • Adding 1 in the denominator avoids division by zero
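
These two formulas drop straight into Python. Below is a minimal sketch (assuming lowercase whitespace tokenization; the scikit-learn example later refines this):

import math

def tf(term, doc):
    # Term Frequency: occurrences of the term divided by document length
    words = doc.lower().split()
    return words.count(term.lower()) / len(words)

def idf(term, corpus):
    # Inverse Document Frequency, with the +1 smoothing from the formula above
    df = sum(term.lower() in doc.lower().split() for doc in corpus)
    return math.log(len(corpus) / (1 + df))

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)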


2. Intuition#

  • High TF: Word appears frequently in a document → important for that document.

  • High IDF: Word appears in fewer documents → more unique → carries more information.

  • High TF-IDF: Word is frequent in a document and rare across other documents → highly significant.

Example:

  • Word “the” → high TF but appears in almost all documents → low IDF → low TF-IDF.

  • Word “NLP” → appears multiple times in a specific document but rarely elsewhere → high TF-IDF.


3. Example#

Suppose we have 3 documents:

Doc1: "I love NLP"
Doc2: "NLP is amazing"
Doc3: "I love Python"

Step 1: Calculate TF

  • Doc1: I(1), love(1), NLP(1) → total 3 words

  • TF(“NLP”, Doc1) = 1 / 3 ≈ 0.333

Step 2: Calculate IDF

  • NLP appears in 2 documents → DF(“NLP”) = 2

  • Total documents N = 3

  • IDF(“NLP”) = log(3 / (1+2)) = log(3/3) = log(1) = 0

Step 3: TF-IDF

  • TF-IDF(“NLP”, Doc1) = TF × IDF = 0.333 × 0 = 0

Note: Words appearing in almost all documents get TF-IDF close to 0.

  • Word “Python” appears only in Doc3 → its TF-IDF will be high for Doc3 (see the sketch below).
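
A short from-scratch sketch that reproduces Steps 1–3 for this corpus (whitespace tokenization assumed):

import math

docs = ["I love NLP", "NLP is amazing", "I love Python"]

def tf_idf(term, doc, corpus):
    words = doc.lower().split()
    tf = words.count(term.lower()) / len(words)                  # Step 1: TF
    df = sum(term.lower() in d.lower().split() for d in corpus)  # DF(term)
    idf = math.log(len(corpus) / (1 + df))                       # Step 2: smoothed IDF
    return tf * idf                                              # Step 3: TF-IDF

print(tf_idf("NLP", docs[0], docs))     # 0.0    -> 0.333 * log(3/3)
print(tf_idf("Python", docs[2], docs))  # ~0.135 -> 0.333 * log(3/2)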


4. Implementation with scikit-learn#

The same computation using scikit-learn's TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
docs = [
    "I love NLP",
    "NLP is amazing",
    "I love Python"
]

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Feature names (words)
print("Vocabulary:", vectorizer.get_feature_names_out())

# TF-IDF matrix
print("TF-IDF matrix:\n", X.toarray())
Vocabulary: ['amazing' 'is' 'love' 'nlp' 'python']
TF-IDF matrix:
 [[0.         0.         0.70710678 0.70710678 0.        ]
 [0.62276601 0.62276601 0.         0.4736296  0.        ]
 [0.         0.         0.60534851 0.         0.79596054]]
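
Two details explain why this output differs from the hand calculation: the vectorizer's default token pattern keeps only tokens of two or more characters (so "I" is dropped from the vocabulary), and scikit-learn's default IDF is the smoothed variant log((1+N)/(1+DF)) + 1 (natural log), with each row then L2-normalized. The fitted IDF weights can be inspected via the vectorizer's idf_ attribute:

# Per-term IDF weights learned by the vectorizer
for word, weight in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(f"{word}: {weight:.4f}")
# amazing / is / python -> log(4/2) + 1 ≈ 1.6931; love / nlp -> log(4/3) + 1 ≈ 1.2877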

5. Key Points#

  • TF-IDF reduces the weight of common words like “the”, “is”, “and”.

  • Highlights words unique to a document.

  • Widely used in text classification, search engines, and recommendation systems.



6. TF-IDF Intuition#

TF-IDF stands for Term Frequency – Inverse Document Frequency. It is a way to weight words by how important they are within a corpus. The key idea is:

  1. Words that appear often in a document are important → TF (Term Frequency).

  2. Words that appear in many documents are less informative → IDF (Inverse Document Frequency).


Step 1: Term Frequency (TF)#

  • Measures how often a word occurs in a document.

  • Intuition: If a word occurs more often in a document, it’s probably important for that document.

\[ TF(word) = \frac{\text{Number of times word appears in doc}}{\text{Total number of words in doc}} \]

Example:

Document: "I love NLP and NLP is amazing"

  • Word "NLP" occurs 2 times out of 6 words → TF(NLP) = 2/6 = 0.33

  • Word "amazing" occurs 1 time out of 6 → TF(amazing) = 1/6 = 0.167


Step 2: Inverse Document Frequency (IDF)#

  • Measures how rare a word is across all documents.

  • Intuition: Words like "the", "is", "and" appear everywhere → not important. Words like "NLP", "Python" appear less → more important.

\[ IDF(word) = \log \frac{\text{Total number of documents}}{\text{Number of documents containing the word}} \]

Here \(\log\) is the natural logarithm, and this is the unsmoothed variant; the formula in Section 1 adds 1 to the denominator to avoid division by zero.

Example:

Corpus:

  1. "I love NLP"

  2. "NLP is amazing"

  3. "I love Python"

  • "NLP" appears in 2 documents → IDF(NLP) = log(3/2) ≈ 0.176

  • "Python" appears in 1 document → IDF(Python) = log(3/1) ≈ 1.098


Step 3: TF-IDF#

Finally, multiply TF and IDF to get TF-IDF weight:

\[ TF\text{-}IDF(word, doc) = TF(word, doc) \times IDF(word) \]

  • High TF-IDF → Important word in the document

  • Low TF-IDF → Common or less relevant word

Intuition:

  • Words that are frequent in a document but rare in the corpus get the highest scores.

  • Words that are common across documents get low scores, even if frequent in one document.
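
Putting the numbers from Steps 1 and 2 together makes the contrast concrete: both words occur once in a 3-word document, so their TF is identical, but the rarer word wins on IDF.

# Same TF (1/3), different IDF -> the corpus-rare word scores higher
print((1 / 3) * 0.405)  # "NLP" in Doc1:    ≈ 0.135 (in 2 of 3 docs)
print((1 / 3) * 1.099)  # "Python" in Doc3: ≈ 0.366 (in 1 of 3 docs)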


Analogy#

  • Imagine reading news articles:

    • The word "the" appears in every article → not useful.

    • The word "NLP" appears in only tech articles → important.

  • TF-IDF mathematically captures this intuition.