## Bag of Words (BoW)
The Bag of Words (BoW) model represents text numerically by treating a document as a “bag” of its words: grammar and word order are ignored, but multiplicity (how many times each word appears) is kept. In effect, BoW converts text into a vector of numbers that machine learning algorithms can consume.
### How BoW Works: Step-by-Step
1. **Collect the corpus.** A corpus is the entire collection of text documents you want to analyze. Example corpus:

   - Doc1: "I love NLP"
   - Doc2: "NLP is amazing"
   - Doc3: "I love machine learning"

2. **Create the vocabulary.** Extract all unique words from the corpus:

   `[I, love, NLP, is, amazing, machine, learning]`

3. **Vectorize the documents.** For each document, count the occurrences of each vocabulary word and represent the document as a vector of those counts.
Example table:

| Document | I | love | NLP | is | amazing | machine | learning |
|---|---|---|---|---|---|---|---|
| Doc1: I love NLP | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| Doc2: NLP is amazing | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| Doc3: I love machine learning | 1 | 1 | 0 | 0 | 0 | 1 | 1 |
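The same vectors can be reproduced in a few lines of plain Python, which makes the counting step explicit. A minimal sketch using only the standard library (the variable names are illustrative):

```python
from collections import Counter

corpus = ["I love NLP", "NLP is amazing", "I love machine learning"]

# Vocabulary: every unique word, in order of first appearance
vocabulary = list(dict.fromkeys(word for doc in corpus for word in doc.split()))

# One count per vocabulary word, per document
vectors = [[Counter(doc.split())[word] for word in vocabulary] for doc in corpus]

print(vocabulary)  # ['I', 'love', 'NLP', 'is', 'amazing', 'machine', 'learning']
for doc, vec in zip(corpus, vectors):
    print(f"{doc!r} -> {vec}")
```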
### Key Features of Bag of Words
1. **Simplicity**: very easy to understand and implement.
2. **Ignores grammar and word order**: only the presence and frequency of words matter, not their sequence, so "I love NLP" and "NLP love I" are treated the same.
3. **Frequency-based representation**: each vector entry shows how many times a word occurs in the document.
4. **Sparse vectors**: many vocabulary words do not appear in any given document, so most entries are zeros.
### Advantages of BoW
- Simple and intuitive.
- Works well for basic text classification problems.
- Can be combined with TF-IDF weighting to improve performance (a sketch follows this list).
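To illustrate that last point: TF-IDF re-weights raw counts so that words appearing in many documents count for less. A minimal sketch using scikit-learn's TfidfVectorizer, which has the same interface as CountVectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love NLP", "NLP is amazing", "I love machine learning"]

# Same fit/transform interface as CountVectorizer, but TF-IDF weights
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))  # weighted values instead of raw counts
```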
### Disadvantages of BoW
- Ignores word order → loses context.
- High dimensionality → a very large vocabulary leads to large, sparse vectors.
- Cannot capture semantics → "good" and "great" are treated as completely unrelated words.
### BoW in Python using scikit-learn
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love NLP",
    "NLP is amazing",
    "I love machine learning",
]

# Learn the vocabulary and build the document-term count matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW Vectors:\n", X.toarray())
```
```text
Vocabulary: ['amazing' 'is' 'learning' 'love' 'machine' 'nlp']
BoW Vectors:
 [[0 0 0 1 0 1]
 [1 1 0 0 0 1]
 [0 0 1 1 1 0]]
```

Note that this differs slightly from the hand-built table above: CountVectorizer lowercases text by default, and its default token pattern keeps only tokens of two or more characters, so the word "I" is dropped from the vocabulary.
**Key Points**

- BoW converts text to numeric vectors.
- Represents word presence and frequency.
- Does not capture meaning or context.
## What are N-Grams?
- **Definition**: N-Grams are contiguous sequences of *n* items (usually words or characters) from a given text.
- They help capture context and word order, which the basic Bag-of-Words model ignores.
- The "n" in N-Grams refers to the number of items in the sequence.
### Types of N-Grams
1. **Unigram (1-gram)**: a sequence of 1 word; captures individual word frequency.

   Text: "I love NLP" → Unigrams: ["I", "love", "NLP"]

2. **Bigram (2-gram)**: a sequence of 2 consecutive words; captures some local context.

   Text: "I love NLP" → Bigrams: ["I love", "love NLP"]

3. **Trigram (3-gram)**: a sequence of 3 consecutive words; captures slightly longer context.

   Text: "I love NLP models" → Trigrams: ["I love NLP", "love NLP models"]

4. **n-gram (general)**: a sequence of n consecutive words. For example, the 4-grams (quadgrams) of "I love NLP models today" are ["I love NLP models", "love NLP models today"].
### Why use N-Grams?
- Help capture context and word order in text.
- Useful in:
  - Text classification
  - Sentiment analysis
  - Spam detection
  - Predictive text / autocomplete
- Can be used for both words and characters: character-level n-grams are useful for spelling correction, language modeling, or handling noisy text (see the sketch below).
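Character n-grams use the same slicing idea at the character level; a minimal sketch (again, the helper name is illustrative):

```python
def char_ngrams(text, n):
    """Return all contiguous n-character substrings of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("love", 2))  # ['lo', 'ov', 've']
print(char_ngrams("love", 3))  # ['lov', 'ove']
```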
**Trade-offs**

| N-Gram Type | Pros | Cons |
|---|---|---|
| Unigram | Simple, less memory | Ignores word order/context |
| Bigram | Captures local context | Increases feature space |
| Trigram | Captures longer context | Higher dimensionality, sparsity |
| Higher N | More context | Exponential increase in features |
N-Grams are a bridge between simple Bag-of-Words and advanced embeddings, giving models some sense of word sequences without requiring deep learning.
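The two ideas also combine directly: CountVectorizer accepts an ngram_range parameter, so a BoW matrix can count bigrams (or longer n-grams) alongside single words. A minimal sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love NLP", "NLP is amazing", "I love machine learning"]

# ngram_range=(1, 2) counts both unigrams and bigrams as features
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```

The following example does the same tokenization and n-gram extraction with NLTK: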
```python
# Import required libraries
import nltk
from nltk import ngrams
from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # uncomment on first use to fetch the tokenizer data

# Sample text
text = "I love natural language processing"

# Tokenize the text into words
tokens = word_tokenize(text)
print("Tokens:", tokens)

# Unigrams (1-gram)
unigrams = list(ngrams(tokens, 1))
print("\nUnigrams:")
for uni in unigrams:
    print(uni)

# Bigrams (2-gram)
bigrams = list(ngrams(tokens, 2))
print("\nBigrams:")
for bi in bigrams:
    print(bi)

# Trigrams (3-gram)
trigrams = list(ngrams(tokens, 3))
print("\nTrigrams:")
for tri in trigrams:
    print(tri)
```
```text
Tokens: ['I', 'love', 'natural', 'language', 'processing']

Unigrams:
('I',)
('love',)
('natural',)
('language',)
('processing',)

Bigrams:
('I', 'love')
('love', 'natural')
('natural', 'language')
('language', 'processing')

Trigrams:
('I', 'love', 'natural')
('love', 'natural', 'language')
('natural', 'language', 'processing')
```