Bag of Words (BoW)

The Bag of Words model is a way to represent text data numerically by treating a document as a “bag” of its words, ignoring grammar and word order, but keeping multiplicity (how many times a word appears).

Essentially, BoW converts text into a vector of numbers, which can then be used as input to machine learning algorithms.


How BoW Works: Step-by-Step

  1. Collect the corpus

    • A corpus is the entire collection of text documents you want to analyze.

    • Example corpus:

      Doc1: I love NLP
      Doc2: NLP is amazing
      Doc3: I love machine learning
      
  2. Create the vocabulary

    • Extract all unique words from the corpus.

    • Vocabulary = [I, love, NLP, is, amazing, machine, learning]

  3. Vectorize the documents

    • For each document, count the occurrence of each word in the vocabulary.

    • Represent each document as a vector of word counts.

Example Table:

| Document                     | I | love | NLP | is | amazing | machine | learning |
|------------------------------|---|------|-----|----|---------|---------|----------|
| Doc1: I love NLP             | 1 | 1    | 1   | 0  | 0       | 0       | 0        |
| Doc2: NLP is amazing         | 0 | 0    | 1   | 1  | 1       | 0       | 0        |
| Doc3: I love machine learning| 1 | 1    | 0   | 0  | 0       | 1       | 1        |

Key Features of Bag of Words

  1. Simplicity

    • Very easy to understand and implement.

  2. Ignores grammar and word order

    • Only considers presence and frequency of words, not sequence.

    • “I love NLP” and “NLP love I” are treated the same (see the sketch after this list).

  3. Frequency-based representation

    • Each vector entry shows how many times a word occurs in the document.

  4. Sparse vectors

    • Many words in the vocabulary may not appear in every document, resulting in zeros.
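The order-invariance in point 2 is easy to verify with scikit-learn's CountVectorizer (shown in full later in this section); both orderings produce the identical count vector:

from sklearn.feature_extraction.text import CountVectorizer

# Two documents with the same words in a different order
X = CountVectorizer().fit_transform(["I love NLP", "NLP love I"])

# Identical rows: word order is ignored
print(X.toarray())
# [[1 1]
#  [1 1]]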


Advantages of BoW

  • Simple and intuitive.

  • Works well for basic text classification problems.

  • Can be combined with TF-IDF to improve performance (see the sketch below).
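For example, scikit-learn's TfidfVectorizer applies TF-IDF weighting on top of the same counting idea; a minimal sketch on the example corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "I love NLP",
    "NLP is amazing",
    "I love machine learning",
]

# TF-IDF down-weights words that appear in many documents
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("TF-IDF Vectors:\n", X.toarray().round(2))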


Disadvantages of BoW

  • Ignores word order → loses context.

  • High dimensionality → very large vocabulary can lead to large sparse vectors.

  • Cannot capture semantics → “good” and “great” are treated as different words.


BoW in Python using scikit-learn

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love NLP",
    "NLP is amazing",
    "I love machine learning"
]

# Learn the vocabulary and count word occurrences in each document
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW Vectors:\n", X.toarray())

Output:

Vocabulary: ['amazing' 'is' 'learning' 'love' 'machine' 'nlp']
BoW Vectors:
 [[0 0 0 1 0 1]
 [1 1 0 0 0 1]
 [0 0 1 1 1 0]]

Note that “I” is missing from the vocabulary: CountVectorizer lowercases the text, and its default tokenizer drops single-character tokens, unlike the hand-built example above.

Key Points

  • BoW converts text to numeric vectors.

  • Represents word presence and frequency.

  • Does not capture meaning or context.

What are N-Grams?

  • Definition: N-Grams are contiguous sequences of n items (usually words or characters) from a given text.

  • They help capture context and word order, which the basic Bag-of-Words model ignores.

  • The “n” in N-Grams refers to the number of items in the sequence.


Types of N-Grams

  1. Unigram (1-gram)

    • Sequence of 1 word.

    • Captures individual word frequency.

    • Example:

      Text: "I love NLP"
      Unigrams: ["I", "love", "NLP"]
      
  2. Bigram (2-gram)

    • Sequence of 2 consecutive words.

    • Captures some local context.

    • Example:

      Text: "I love NLP"
      Bigrams: ["I love", "love NLP"]
      
  3. Trigram (3-gram)

    • Sequence of 3 consecutive words.

    • Captures slightly longer context.

    • Example:

      Text: "I love NLP models"
      Trigrams: ["I love NLP", "love NLP models"]
      
  4. n-gram (general)

    • Sequence of n consecutive words (a general sketch follows this list).

    • Example: 4-gram (quadgram) from “I love NLP models today”:

      ["I love NLP models", "love NLP models today"]
      

Why use N-Grams?

  • Helps capture context and word order in text.

  • Useful in:

    • Text classification

    • Sentiment analysis

    • Spam detection

    • Predictive text / autocomplete

  • Can be used for both words and characters:

    • Character-level n-grams are useful for spelling correction, language modeling, or handling noisy text (see the example below).
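For instance, scikit-learn's CountVectorizer supports both levels through its ngram_range and analyzer parameters; a small sketch:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love NLP", "NLP is amazing"]

# Word-level unigrams and bigrams
word_vec = CountVectorizer(ngram_range=(1, 2)).fit(corpus)
print(word_vec.get_feature_names_out())
# ['amazing' 'is' 'is amazing' 'love' 'love nlp' 'nlp' 'nlp is']

# Character-level trigrams, restricted to within word boundaries
char_vec = CountVectorizer(analyzer='char_wb', ngram_range=(3, 3)).fit(corpus)
print(char_vec.get_feature_names_out()[:5])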


Trade-offs

| N-Gram Type | Pros                    | Cons                            |
|-------------|-------------------------|---------------------------------|
| Unigram     | Simple, less memory     | Ignores word order/context      |
| Bigram      | Captures local context  | Increases feature space         |
| Trigram     | Captures longer context | Higher dimensionality, sparsity |
| Higher N    | More context            | Exponential increase in features|
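The growth in the last row can be observed directly; a small sketch (the exact feature counts depend on the corpus):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love natural language processing",
    "natural language processing is amazing",
]

# The feature count climbs quickly as longer n-grams are included
for n in range(1, 5):
    vec = CountVectorizer(ngram_range=(1, n)).fit(corpus)
    print(f"ngram_range=(1, {n}): {len(vec.get_feature_names_out())} features")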


N-Grams are a bridge between simple Bag-of-Words and advanced embeddings, giving models some sense of word sequences without requiring deep learning.
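The NLTK example below tokenizes a sentence and prints its unigrams, bigrams, and trigrams: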

# Import required libraries
import nltk
from nltk import ngrams
from nltk.tokenize import word_tokenize

# Download the Punkt tokenizer models on first run (word_tokenize needs them;
# newer NLTK versions may ask for 'punkt_tab' instead)
nltk.download('punkt')

# Sample text
text = "I love natural language processing"

# Tokenize the text into words
tokens = word_tokenize(text)

print("Tokens:", tokens)

# Unigrams (1-gram)
unigrams = list(ngrams(tokens, 1))
print("\nUnigrams:")
for uni in unigrams:
    print(uni)

# Bigrams (2-gram)
bigrams = list(ngrams(tokens, 2))
print("\nBigrams:")
for bi in bigrams:
    print(bi)

# Trigrams (3-gram)
trigrams = list(ngrams(tokens, 3))
print("\nTrigrams:")
for tri in trigrams:
    print(tri)

Output:

Tokens: ['I', 'love', 'natural', 'language', 'processing']

Unigrams:
('I',)
('love',)
('natural',)
('language',)
('processing',)

Bigrams:
('I', 'love')
('love', 'natural')
('natural', 'language')
('language', 'processing')

Trigrams:
('I', 'love', 'natural')
('love', 'natural', 'language')
('natural', 'language', 'processing')