Bag of Words Algorithm in Python

Introduction

Are you trying to implement a machine learning algorithm to classify documents? Do you need to determine the intent of a sentence for a chatbot? Then you are probably asking yourself the same question: how do I convert text into a form that my machine learning algorithm can use?

In the following post we will go over a simple-to-use model for converting sentences into vectors, called the Bag of Words model. We will implement this algorithm in Python from scratch, and then we will use scikit-learn's built-in functions to vectorize sentences.

After reading this blog post you will be able to:

• Understand the Bag of Words model

• Implement your own Bag of Words algorithm in Python

• Vectorize sentences using scikit-learn's CountVectorizer

What is the Bag of Words Model?

The Bag of Words model is one of a series of techniques from the field of computer science known as Natural Language Processing (NLP) for extracting features from text. It works by counting the frequency of words in a document. A document can be defined however you need: it can be a single sentence or all of Wikipedia. The output of the Bag of Words model is a frequency vector.
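At its core this is just word counting. As a quick illustration (using Python's standard library, separate from the implementation we build below), collections.Counter already performs the counting step:

from collections import Counter

Counter("machine learning is great".split())
# Counter({'machine': 1, 'learning': 1, 'is': 1, 'great': 1})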

The first step in this model is defining the vocabulary. Let's say we have the following three sentences:

  • Machine learning is great
  • Natural Language Processing is a complex field
  • Natural Language Processing is used in machine learning

Our vocabulary is all the distinct words we extracted from our documents arranged as a vector.

The second step is to use the vocabulary vector to convert sentences into frequency vectors. The Bag of Words vector for the first sentence (Machine learning is great) is shown below in the Count column:

Word          Count
complex       0
field         0
great         1
in            0
is            1
language      0
learning      1
machine       1
natural       0
processing    0
used          0

The result is a numerical vector that can be used as input to various machine learning algorithms: to classify documents into topics, to power chatbots, or for many other applications where text is the initial input. Although this is a simple model, it has proven effective in a variety of scenarios.
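As a quick, hypothetical sketch of that downstream use (the topic labels and the classifier choice below are illustrative assumptions, not part of the model itself), frequency vectors like these can be fed directly into a classifier such as scikit-learn's MultinomialNB:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Bag of Words vectors for the three sentences above, with made-up topic labels
X = np.array([[0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0],   # Machine learning is great
              [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0],   # Natural Language Processing is a complex field
              [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]])  # Natural Language Processing is used in machine learning
y = ["ml", "nlp", "nlp"]

clf = MultinomialNB().fit(X, y)
print(clf.predict(X))  # predicted topics for each sentence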

Bag of Words Custom Python Code

Let’s now implement a simple Bag of Words model in Python from scratch, using the above three sentences as our documents.

First, import the required dependencies. We will use NumPy to handle our vectors and the regular expression library re to extract the words from the sentences.

import numpy as np
import re

The first function we will implement extracts the words from a document using a regular expression. As we do so, we convert all words to lower case and exclude our stop words.

def extract_words(sentence):
    ignore_words = ['a']
    # Replace any non-word character with a space, then split on whitespace
    # (nltk.word_tokenize(sentence) would be an alternative here)
    words = re.sub(r"[^\w]", " ", sentence).split()
    # Lowercase every word and drop stop words, comparing in lower case
    words_cleaned = [w.lower() for w in words if w.lower() not in ignore_words]
    return words_cleaned
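For example, running the function on our second sentence returns lowercase tokens with the stop word "a" removed:

extract_words("Natural Language Processing is a complex field")
# ['natural', 'language', 'processing', 'is', 'complex', 'field']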

Next, we implement our tokenize_sentences function. This function builds our vocabulary by looping through all our documents (sentences), extracting the words from each, removing duplicates with a set, and returning a sorted list of words.

def tokenize_sentences(sentences):
    words = []
    for sentence in sentences:
        w = extract_words(sentence)
        words.extend(w)

    # Deduplicate with a set, then sort for a stable vocabulary order
    words = sorted(set(words))
    return words
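A small illustration on two made-up documents shows the deduplication and sorting:

tokenize_sentences(["great machine", "machine learning"])
# ['great', 'learning', 'machine']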

Our last function is the implementation of the Bag of Words model itself. It takes as input a sentence and words (our vocabulary). It extracts the words from the input sentence using the previously defined function, then creates a vector of zeros with numpy's zeros function, with length equal to the number of words in our vocabulary.

Lastly, for each word in our sentence, we loop through our vocabulary and, if the word matches, increase its count by one. The function returns the numpy array of frequency counts.

def bagofwords(sentence, words):
    sentence_words = extract_words(sentence)
    # Frequency vector, one slot per vocabulary word
    bag = np.zeros(len(words))
    for sw in sentence_words:
        for i, word in enumerate(words):
            if word == sw:
                bag[i] += 1

    return bag  # already a numpy array

Convert Sentences to Vectors

Let’s now apply our Bag of Words model to our three sentences, first creating our vocabulary:

sentences = ["Machine learning is great",
             "Natural Language Processing is a complex field",
             "Natural Language Processing is used in machine learning"]
vocabulary = tokenize_sentences(sentences)

The vocabulary variable now contains the following list of words:

• ['complex', 'field', 'great', 'in', 'is', 'language', 'learning', 'machine', 'natural', 'processing', 'used']

Passing our first sentence, "Machine learning is great", through our Bag of Words model returns the vector of frequency counts shown previously:

bagofwords("Machine learning is great", vocabulary)

• array([ 0.,  0.,  1.,  0.,  1.,  0.,  1.,  1.,  0.,  0.,  0.])

scikit-learn CountVectorizer

Python’s scikit-learn library provides built-in functions that implement the above Bag of Words model. Let’s implement all of the above in just four lines of code.

First, we import the CountVectorizer class from scikit-learn's feature_extraction module. We then initialize the class, leaving most parameters at their defaults except analyzer, which we set to "word". The next line fits our vectorizer, learning the vocabulary from our sentences. Lastly, we use the vectorizer to transform our first sentence. This last line returns the same array as our custom implementation:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word", tokenizer=None,
                             preprocessor=None, stop_words=None,
                             max_features=5000)
train_data_features = vectorizer.fit_transform(sentences)
vectorizer.transform(["Machine learning is great"]).toarray()

• array([[0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0]], dtype=int64)

Conclusion

We have gone through a simple implementation of the Bag of Words model from the NLP toolbox. This model extracts features from text in the form of word frequency counts. The resulting vectors can be used in various machine learning algorithms, whether to classify documents into topics or as an important part of a chatbot pipeline. Lastly, we used the built-in Bag of Words implementation from scikit-learn's feature-extraction module to convert sentences into vectors.

Full Python Code

import numpy as np
import re

def extract_words(sentence):
    ignore_words = ['a']
    # Replace any non-word character with a space, then split on whitespace
    words = re.sub(r"[^\w]", " ", sentence).split()
    # Lowercase every word and drop stop words, comparing in lower case
    words_cleaned = [w.lower() for w in words if w.lower() not in ignore_words]
    return words_cleaned

def tokenize_sentences(sentences):
    words = []
    for sentence in sentences:
        w = extract_words(sentence)
        words.extend(w)

    # Deduplicate with a set, then sort for a stable vocabulary order
    words = sorted(set(words))
    return words
    
def bagofwords(sentence, words):
    sentence_words = extract_words(sentence)
    # Frequency vector, one slot per vocabulary word
    bag = np.zeros(len(words))
    for sw in sentence_words:
        for i, word in enumerate(words):
            if word == sw:
                bag[i] += 1

    return bag  # already a numpy array

sentences = ["Machine learning is great","Natural Language Processing is a complex field","Natural Language Processing is used in machine learning"]
vocabulary = tokenize_sentences(sentences)
bagofwords("Machine learning is great", vocabulary)

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word", tokenizer=None,
                             preprocessor=None, stop_words=None,
                             max_features=5000)
train_data_features = vectorizer.fit_transform(sentences)
vectorizer.transform(["Machine learning is great"]).toarray()

