1. TF-IDF#
TF-IDF is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents (corpus).
It combines two measures:
TF (Term Frequency): How often a word appears in a document.
IDF (Inverse Document Frequency): How unique or rare a word is across all documents in the corpus.
Mathematically:

\[
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
\]

\[
\text{TF}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total terms in } d},
\qquad
\text{IDF}(t) = \log\left(\frac{N}{1 + \text{DF}(t)}\right)
\]
Where:
\(t\) → term (word)
\(d\) → document
\(N\) → total number of documents
\(\text{DF}(t)\) → number of documents containing the term \(t\)
Adding 1 to the denominator avoids division by zero when a term appears in no documents.
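The definitions above can be sketched as plain Python functions. This is a minimal illustration of the formulas, not an optimized implementation; the +1 smoothing in the IDF denominator follows the formula above:

```python
import math

def tf(term, doc_tokens):
    # Term frequency: occurrences of the term divided by total tokens in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # Inverse document frequency with +1 smoothing in the denominator
    # to avoid division by zero for unseen terms
    df = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / (1 + df))

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

corpus = [d.lower().split() for d in ["I love NLP", "NLP is amazing", "I love Python"]]
print(tf_idf("nlp", corpus[0], corpus))  # log(3 / (1 + 2)) = log(1) = 0, so the score is 0.0
```

Note that `tf_idf` is a hypothetical helper name, not a standard library function.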
2. Intuition#
High TF: Word appears frequently in a document → important for that document.
High IDF: Word appears in fewer documents → more unique → carries more information.
High TF-IDF: Word is frequent in a document and rare across other documents → highly significant.
Example:
Word “the” → high TF but appears in almost all documents → low IDF → low TF-IDF.
Word “NLP” → appears multiple times in a specific document but rarely elsewhere → high TF-IDF.
3. Example#
Suppose we have 3 documents:
Doc1: "I love NLP"
Doc2: "NLP is amazing"
Doc3: "I love Python"
Step 1: Calculate TF
Doc1: I(1), love(1), NLP(1) → total 3 words
TF(“NLP”, Doc1) = 1 / 3 ≈ 0.333
Step 2: Calculate IDF
NLP appears in 2 documents → DF(“NLP”) = 2
Total documents N = 3
IDF(“NLP”) = log(3 / (1+2)) = log(3/3) = log(1) = 0
Step 3: TF-IDF
TF-IDF(“NLP”, Doc1) = TF × IDF = 0.333 × 0 = 0
Note: Words appearing in almost all documents get TF-IDF close to 0.
Word “Python” appears only in Doc3 → TF-IDF will be high for Doc3.
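The three steps above can be checked with a few lines of Python, using the same +1-smoothed IDF as in the formula section:

```python
import math

docs = [["i", "love", "nlp"], ["nlp", "is", "amazing"], ["i", "love", "python"]]
N = len(docs)

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)                 # term frequency in this document
    df = sum(1 for d in docs if term in d)          # document frequency across the corpus
    return tf * math.log(N / (1 + df))              # +1 smoothing in the denominator

print(round(tfidf("nlp", docs[0]), 3))     # 0.0   -> "NLP" appears in 2 of 3 docs
print(round(tfidf("python", docs[2]), 3))  # 0.135 -> "Python" appears only in Doc3
```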
4. Implementation with Scikit-learn#

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
docs = [
    "I love NLP",
    "NLP is amazing",
    "I love Python"
]

# Initialize the TF-IDF vectorizer and build the weighted matrix
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Feature names (words)
print("Vocabulary:", vectorizer.get_feature_names_out())

# TF-IDF matrix
print("TF-IDF matrix:\n", X.toarray())
```
```text
Vocabulary: ['amazing' 'is' 'love' 'nlp' 'python']
TF-IDF matrix:
 [[0.         0.         0.70710678 0.70710678 0.        ]
  [0.62276601 0.62276601 0.         0.4736296  0.        ]
  [0.         0.         0.60534851 0.         0.79596054]]
```

Note: the single-character token "I" is dropped by the vectorizer's default tokenizer, and the values differ from the hand calculation because scikit-learn uses a smoothed IDF, \(\ln\frac{1+N}{1+\text{DF}(t)} + 1\), and L2-normalizes each row.
5. Key Points#
TF-IDF reduces the weight of common words like “the”, “is”, “and”.
Highlights words unique to a document.
Widely used in text classification, search engines, and recommendation systems.
TF-IDF Intuition#
TF-IDF stands for Term Frequency – Inverse Document Frequency. It’s a way to weight words based on importance in a corpus. The key idea is:
Words that appear often in a document are important → TF (Term Frequency).
Words that appear in many documents are less informative → IDF (Inverse Document Frequency).
Step 1: Term Frequency (TF)#
Measures how often a word occurs in a document.
Intuition: If a word occurs more often in a document, it’s probably important for that document.
Example:
Document: "I love NLP and NLP is amazing"
Word "NLP" occurs 2 times out of 7 words → TF(NLP) = 2/7 ≈ 0.286
Word "amazing" occurs 1 time out of 7 → TF(amazing) = 1/7 ≈ 0.143
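The token counts above can be computed directly. The sentence has 7 tokens, including "and":

```python
doc = "I love NLP and NLP is amazing".lower().split()

# Term frequency for every distinct word in the document
tf = {w: doc.count(w) / len(doc) for w in set(doc)}

print(len(doc))                 # 7 tokens
print(round(tf["nlp"], 3))      # 0.286
print(round(tf["amazing"], 3))  # 0.143
```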
Step 2: Inverse Document Frequency (IDF)#
Measures how rare a word is across all documents.
Intuition: Words like "the", "is", "and" appear everywhere → not important. Words like "NLP", "Python" appear less often → more important.
Example:
Corpus:
"I love NLP"
"NLP is amazing"
"I love Python"
"NLP" appears in 2 documents → IDF(NLP) = log(3/2) ≈ 0.405
"Python" appears in 1 document → IDF(Python) = log(3/1) ≈ 1.099
(using the natural logarithm, and no smoothing here for simplicity)
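These IDF values can be reproduced with `math.log` (natural log, unsmoothed, matching the example above):

```python
import math

corpus = [set(d.lower().split()) for d in ["I love NLP", "NLP is amazing", "I love Python"]]
N = len(corpus)

def idf(term):
    # Unsmoothed natural-log IDF: log(N / DF), as in the example above
    df = sum(1 for doc in corpus if term in doc)
    return math.log(N / df)

print(round(idf("nlp"), 3))     # 0.405 -> appears in 2 of 3 documents
print(round(idf("python"), 3))  # 1.099 -> appears in only 1 document
```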
Step 3: TF-IDF#
Finally, multiply TF and IDF to get the TF-IDF weight:

\[
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
\]
High TF-IDF → Important word in the document
Low TF-IDF → Common or less relevant word
Intuition:
Words that are frequent in a document but rare in the corpus get the highest scores.
Words that are common across documents get low scores, even if frequent in one document.
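Putting the two steps together, here is a short sketch that scores every word of Doc3 against the corpus (unsmoothed natural-log IDF, as in this section):

```python
import math

corpus = [d.lower().split() for d in ["I love NLP", "NLP is amazing", "I love Python"]]
N = len(corpus)

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(N / df)

doc = corpus[2]  # "I love Python"
scores = {w: round(tf_idf(w, doc), 3) for w in set(doc)}
print(scores)  # "python" scores highest: rare in the corpus, present in this doc
```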
Analogy#
Imagine reading news articles:
The word "the" appears in every article → not useful.
The word "NLP" appears only in tech articles → important.
TF-IDF mathematically captures this intuition.