TF (Term Frequency)


Term Frequency (TF) measures how frequently a term (word) occurs in a document relative to the total number of words in that document.

Mathematically:

\[ \text{TF}(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} \]
  • \(t\) → specific term (word)

  • \(d\) → specific document

TF is a rough proxy for how important a word is within a single document: the more often a term appears relative to the document's length, the higher its TF.
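As a minimal sketch of the formula above, the function below divides a term's count by the total number of tokens. The name `term_frequency` and the sample sentence are illustrative assumptions, not part of any library, and the document is assumed to be already split into tokens.

```python
def term_frequency(term, tokens):
    """Return count(term) / len(tokens); 0.0 for an empty document."""
    if not tokens:
        return 0.0
    return tokens.count(term) / len(tokens)


# Hypothetical sample document, tokenized by whitespace
tokens = "the cat sat on the mat".split()

tf_the = term_frequency("the", tokens)  # "the" occurs 2 times out of 6 tokens
tf_dog = term_frequency("dog", tokens)  # "dog" never occurs

print("TF(the):", tf_the)
print("TF(dog):", tf_dog)
```

A term that does not appear in the document simply gets a TF of zero.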


2. Example

Suppose we have a document:

Document: "I love NLP and I love Python"

Step 1: Count each word

| Word   | Count |
|--------|-------|
| I      | 2     |
| love   | 2     |
| NLP    | 1     |
| and    | 1     |
| Python | 1     |

Step 2: Total words = 7

Step 3: Calculate TF

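\[ \text{TF}(\text{"I"}, d) = 2 / 7 \approx 0.286 \]
\[ \text{TF}(\text{"love"}, d) = 2 / 7 \approx 0.286 \]
\[ \text{TF}(\text{"NLP"}, d) = 1 / 7 \approx 0.143 \]

The remaining words, "and" and "Python", each appear once, so their TF is also 1 / 7.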
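The three steps can be reproduced in a few lines of plain Python. This is only a sketch: it assumes naive, case-sensitive whitespace tokenization, which matches the hand count here but is not how most NLP libraries tokenize.

```python
from collections import Counter

document = "I love NLP and I love Python"

# Step 1: count each word (whitespace split, case-sensitive)
tokens = document.split()
counts = Counter(tokens)

# Step 2: total number of words
total = len(tokens)

# Step 3: TF = count / total, for every distinct word
tf = {word: count / total for word, count in counts.items()}

for word, value in tf.items():
    print(f"{word}: {counts[word]}/{total} = {value:.3f}")
```

Because every token contributes to exactly one count, the TF values of a document always sum to 1.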


3. Key Points

  • TF only considers frequency in a single document.

  • It ignores how common a term is across a collection of documents (that’s where TF-IDF comes in).

  • Common words like “the”, “is”, “and” usually have high TF but may not be important semantically.
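To illustrate the last point, the sketch below computes TF for a made-up sentence (an assumption for demonstration only) and picks the highest-scoring term. The function word "the" wins even though it says little about the content.

```python
from collections import Counter

# Hypothetical sentence chosen so that a function word dominates
text = "the model reads the text and the model scores the text"
tokens = text.split()

counts = Counter(tokens)
tf = {word: count / len(tokens) for word, count in counts.items()}

# Term with the highest TF in this single document
top_word = max(tf, key=tf.get)
print("Highest TF:", top_word, f"({counts[top_word]}/{len(tokens)})")
```

Here "the" occurs 4 times out of 11 tokens, ahead of the more informative "model" and "text" at 2 each.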

from sklearn.feature_extraction.text import CountVectorizer

# Sample document
doc = ["I love NLP and I love Python"]

# CountVectorizer computes raw term counts.
# Note: with default settings it lowercases the text and drops
# single-character tokens, so "I" is not part of the vocabulary.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(doc)

# Feature names
print("Vocabulary:", vectorizer.get_feature_names_out())

# Term counts
print("Term counts:", X.toarray())

# Term Frequency (computed manually): count / total counted tokens.
# The total is 5 here, not 7, because both occurrences of "I" were dropped,
# so these values differ from the hand calculation in the example.
counts = X.toarray()[0]
tf = counts / counts.sum()
print("Term Frequency (TF):", tf)

Output:

Vocabulary: ['and' 'love' 'nlp' 'python']
Term counts: [[1 2 1 1]]
Term Frequency (TF): [0.2 0.4 0.2 0.2]
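The scikit-learn result gives "love" a TF of 0.4, while the hand calculation gives 2/7. The gap comes from tokenization: `CountVectorizer` lowercases by default and its default `token_pattern`, `(?u)\b\w\w+\b`, only matches tokens of two or more characters. The sketch below mimics that behaviour with the standard `re` module rather than calling scikit-learn; the variable names are illustrative.

```python
import re

text = "I love NLP and I love Python".lower()  # CountVectorizer lowercases by default

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # scikit-learn's default: 2+ word characters
ONE_CHAR_PATTERN = r"(?u)\b\w+\b"   # alternative that also keeps 1-character tokens

default_tokens = re.findall(DEFAULT_PATTERN, text)
all_tokens = re.findall(ONE_CHAR_PATTERN, text)

# TF of "love" under each tokenization
tf_love_default = default_tokens.count("love") / len(default_tokens)
tf_love_all = all_tokens.count("love") / len(all_tokens)

print("Default tokens:", default_tokens)
print("All tokens:    ", all_tokens)
print(f"TF(love): {tf_love_default:.3f} vs {tf_love_all:.3f}")
```

Passing `token_pattern=r"(?u)\b\w+\b"` to `CountVectorizer` should keep one-character tokens such as "i", which brings its counts in line with the hand calculation.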