TF (Term Frequency)
Term Frequency (TF) measures how frequently a term (word) occurs in a document relative to the total number of words in that document.
Mathematically:
\[
\text{TF}(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
\]
\(t\) → specific term (word)
\(d\) → specific document
TF uses a word's relative frequency as a proxy for how important that word is within the document.
2. Example
Suppose we have a document:
Document: "I love NLP and I love Python"
Step 1: Count each word
| Word | Count |
|---|---|
| I | 2 |
| love | 2 |
| NLP | 1 |
| and | 1 |
| Python | 1 |
Step 2: Total words = 7
Step 3: Calculate TF
\[
\text{TF}(\text{"I"}) = 2 / 7 \approx 0.286
\]
\[
\text{TF}(\text{"love"}) = 2 / 7 \approx 0.286
\]
\[
\text{TF}(\text{"NLP"}) = 1 / 7 \approx 0.143
\]
…and so on.
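The three steps above can be reproduced with a few lines of standard-library Python. This is only a minimal sketch, assuming plain whitespace tokenization with case preserved; the helper name `term_frequency` is illustrative and does not come from any library.

```python
from collections import Counter


def term_frequency(document: str) -> dict[str, float]:
    """Return the TF of each whitespace-separated token (hypothetical helper)."""
    tokens = document.split()   # split the document into words
    counts = Counter(tokens)    # Step 1: count each word
    total = len(tokens)         # Step 2: total number of words
    # Step 3: divide each count by the total
    return {term: count / total for term, count in counts.items()}


tf = term_frequency("I love NLP and I love Python")
for term, value in tf.items():
    print(f"{term}: {value:.3f}")
```

Under these assumptions, "I" and "love" each come out at 2/7 and the remaining words at 1/7, and the values sum to 1.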
3. Key Points
- TF only considers frequency in a single document.
- It does not consider the importance across multiple documents (that’s where TF-IDF comes in).
- Common words like “the”, “is”, “and” usually have high TF but may not be important semantically.
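To illustrate the last point, the sketch below ranks the terms of a short sentence by TF. The sentence is made up for this illustration and whitespace tokenization is assumed; it is not data from the example above.

```python
from collections import Counter

# Illustrative sentence (invented for this sketch)
sentence = "the cat sat on the mat and the dog sat on the rug"
tokens = sentence.split()
counts = Counter(tokens)

# TF for every term, sorted from most to least frequent
tf_ranked = sorted(
    ((term, count / len(tokens)) for term, count in counts.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for term, value in tf_ranked[:3]:
    print(f"{term}: {value:.3f}")
```

Here "the" ranks first with a TF of 4/13, even though it says little about what the sentence is about.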
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Sample document (a corpus containing a single document)
doc = ["I love NLP and I love Python"]

# CountVectorizer computes raw term counts. By default it lowercases the text
# and drops single-character tokens, so "I" is excluded from the vocabulary.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(doc)

# Feature names (the learned vocabulary)
print("Vocabulary:", vectorizer.get_feature_names_out())

# Term counts
counts = X.toarray()
print("Term counts:", counts)

# Term Frequency: each count divided by the total number of counted tokens
# (5 here rather than 7, because "I" was dropped)
tf = counts[0] / np.sum(counts[0])
print("Term Frequency (TF):", tf)
Vocabulary: ['and' 'love' 'nlp' 'python']
Term counts: [[1 2 1 1]]
Term Frequency (TF): [0.2 0.4 0.2 0.2]
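The printed values differ from the hand calculation (0.4 rather than 2/7 for "love"). `CountVectorizer` lowercases the text, and its default `token_pattern`, `(?u)\b\w\w+\b`, only matches tokens of two or more characters, so "I" never reaches the vocabulary. The sketch below imitates that tokenization with the standard `re` module instead of calling scikit-learn; the helper `tf_with_pattern` is hypothetical and exists only for this comparison.

```python
import re
from collections import Counter


def tf_with_pattern(document: str, pattern: str) -> dict[str, float]:
    """TF over lowercased regex tokens (hypothetical helper, not a scikit-learn API)."""
    tokens = re.findall(pattern, document.lower())
    counts = Counter(tokens)
    # Sort terms alphabetically, similar to how the vocabulary is printed above
    return {term: counts[term] / len(tokens) for term in sorted(counts)}


doc = "I love NLP and I love Python"

# Same pattern as CountVectorizer's default: tokens need at least two characters
tf_default = tf_with_pattern(doc, r"(?u)\b\w\w+\b")

# Relaxed pattern that also keeps single-character tokens such as "i"
tf_relaxed = tf_with_pattern(doc, r"(?u)\b\w+\b")

print(tf_default)
print(tf_relaxed)
```

With the relaxed pattern, "i" and "love" each get 2/7, matching the manual example. Passing `token_pattern=r"(?u)\b\w+\b"` to `CountVectorizer` should retain "I" in the same way (as lowercase "i").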