## Bag of Words (BoW)
The Bag of Words (BoW) model represents text numerically by treating a document as a “bag” of its words: grammar and word order are ignored, but multiplicity (how many times each word appears) is kept. In effect, BoW converts text into a vector of numbers that machine learning algorithms can consume.
### How BoW Works: Step-by-Step
1. **Collect the corpus.** A corpus is the entire collection of text documents you want to analyze. Example corpus:

   - Doc1: "I love NLP"
   - Doc2: "NLP is amazing"
   - Doc3: "I love machine learning"

2. **Create the vocabulary.** Extract all unique words from the corpus:

   `[I, love, NLP, is, amazing, machine, learning]`

3. **Vectorize the documents.** For each document, count the occurrences of each vocabulary word and represent the document as a vector of those counts.
Example table:

| Document | I | love | NLP | is | amazing | machine | learning |
|---|---|---|---|---|---|---|---|
| Doc1: I love NLP | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| Doc2: NLP is amazing | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| Doc3: I love machine learning | 1 | 1 | 0 | 0 | 0 | 1 | 1 |
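The same vectors can be reproduced in a few lines of plain Python, which makes the counting step explicit. A minimal sketch using only the standard library (the variable names are illustrative):

```python
from collections import Counter

corpus = ["I love NLP", "NLP is amazing", "I love machine learning"]

# Vocabulary: every unique word, in order of first appearance
vocabulary = list(dict.fromkeys(word for doc in corpus for word in doc.split()))

# One count per vocabulary word, per document
vectors = [[Counter(doc.split())[word] for word in vocabulary] for doc in corpus]

print(vocabulary)  # ['I', 'love', 'NLP', 'is', 'amazing', 'machine', 'learning']
for doc, vec in zip(corpus, vectors):
    print(f"{doc!r} -> {vec}")
```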
### Key Features of Bag of Words
1. **Simplicity**: very easy to understand and implement.
2. **Ignores grammar and word order**: only the presence and frequency of words matter, not their sequence, so "I love NLP" and "NLP love I" are treated the same.
3. **Frequency-based representation**: each vector entry shows how many times a word occurs in the document.
4. **Sparse vectors**: many vocabulary words do not appear in any given document, so most entries are zeros.
### Advantages of BoW
- Simple and intuitive.
- Works well for basic text classification problems.
- Can be combined with TF-IDF weighting to improve performance (a sketch follows this list).
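To illustrate that last point: TF-IDF re-weights raw counts so that words appearing in many documents count for less. A minimal sketch using scikit-learn's TfidfVectorizer, which has the same interface as CountVectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love NLP", "NLP is amazing", "I love machine learning"]

# Same fit/transform interface as CountVectorizer, but TF-IDF weights
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))  # weighted values instead of raw counts
```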
### Disadvantages of BoW
- Ignores word order → loses context.
- High dimensionality → a very large vocabulary leads to large, sparse vectors.
- Cannot capture semantics → "good" and "great" are treated as completely unrelated words.
### BoW in Python using scikit-learn
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love NLP",
    "NLP is amazing",
    "I love machine learning",
]

# Learn the vocabulary and build the document-term count matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW Vectors:\n", X.toarray())
```
```text
Vocabulary: ['amazing' 'is' 'learning' 'love' 'machine' 'nlp']
BoW Vectors:
 [[0 0 0 1 0 1]
 [1 1 0 0 0 1]
 [0 0 1 1 1 0]]
```

Note that this differs slightly from the hand-built table above: CountVectorizer lowercases text by default, and its default token pattern keeps only tokens of two or more characters, so the word "I" is dropped from the vocabulary.
**Key Points**

- BoW converts text to numeric vectors.
- Represents word presence and frequency.
- Does not capture meaning or context.
## What are N-Grams?
- **Definition**: N-Grams are contiguous sequences of *n* items (usually words or characters) from a given text.
- They help capture context and word order, which the basic Bag-of-Words model ignores.
- The "n" in N-Grams refers to the number of items in the sequence.
### Types of N-Grams
1. **Unigram (1-gram)**: a sequence of 1 word; captures individual word frequency.

   Text: "I love NLP" → Unigrams: ["I", "love", "NLP"]

2. **Bigram (2-gram)**: a sequence of 2 consecutive words; captures some local context.

   Text: "I love NLP" → Bigrams: ["I love", "love NLP"]

3. **Trigram (3-gram)**: a sequence of 3 consecutive words; captures slightly longer context.

   Text: "I love NLP models" → Trigrams: ["I love NLP", "love NLP models"]

4. **n-gram (general)**: a sequence of n consecutive words. For example, the 4-grams (quadgrams) of "I love NLP models today" are ["I love NLP models", "love NLP models today"].
### Why use N-Grams?
- Help capture context and word order in text.
- Useful in:
  - Text classification
  - Sentiment analysis
  - Spam detection
  - Predictive text / autocomplete
- Can be used for both words and characters: character-level n-grams are useful for spelling correction, language modeling, or handling noisy text (see the sketch below).
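Character n-grams use the same slicing idea at the character level; a minimal sketch (again, the helper name is illustrative):

```python
def char_ngrams(text, n):
    """Return all contiguous n-character substrings of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("love", 2))  # ['lo', 'ov', 've']
print(char_ngrams("love", 3))  # ['lov', 'ove']
```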
**Trade-offs**

| N-Gram Type | Pros | Cons |
|---|---|---|
| Unigram | Simple, less memory | Ignores word order/context |
| Bigram | Captures local context | Increases feature space |
| Trigram | Captures longer context | Higher dimensionality, sparsity |
| Higher N | More context | Exponential increase in features |
N-Grams are a bridge between simple Bag-of-Words and advanced embeddings, giving models some sense of word sequences without requiring deep learning.
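The two ideas also combine directly: CountVectorizer accepts an ngram_range parameter, so a BoW matrix can count bigrams (or longer n-grams) alongside single words. A minimal sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love NLP", "NLP is amazing", "I love machine learning"]

# ngram_range=(1, 2) counts both unigrams and bigrams as features
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```

The following example does the same tokenization and n-gram extraction with NLTK: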
```python
# Import required libraries
import nltk
from nltk import ngrams
from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # uncomment on first use to fetch the tokenizer data

# Sample text
text = "I love natural language processing"

# Tokenize the text into words
tokens = word_tokenize(text)
print("Tokens:", tokens)

# Unigrams (1-gram)
unigrams = list(ngrams(tokens, 1))
print("\nUnigrams:")
for uni in unigrams:
    print(uni)

# Bigrams (2-gram)
bigrams = list(ngrams(tokens, 2))
print("\nBigrams:")
for bi in bigrams:
    print(bi)

# Trigrams (3-gram)
trigrams = list(ngrams(tokens, 3))
print("\nTrigrams:")
for tri in trigrams:
    print(tri)
```
```text
Tokens: ['I', 'love', 'natural', 'language', 'processing']

Unigrams:
('I',)
('love',)
('natural',)
('language',)
('processing',)

Bigrams:
('I', 'love')
('love', 'natural')
('natural', 'language')
('language', 'processing')

Trigrams:
('I', 'love', 'natural')
('love', 'natural', 'language')
('natural', 'language', 'processing')
```