Understanding the Naïve Bayes Algorithm



Table of contents:

  1. Introduction
    1.1 Overview
    1.2 The Bayes Theorem
    1.3 The Naive Assumption
    1.4 Decision Boundaries
    1.5 Different Classifiers
  2. Implementing Naive Bayes with Sklearn
  3. Behind the scenes
    3.1 Gaussian Naive Bayes (For numeric features)
    3.2 Multinomial classifier (For text classification)
  4. References

Prerequisites: Basic understanding of Python & Scikit-learn library

1. Introduction

1.1 Overview

Naive Bayes is a supervised classification algorithm which belongs to the family of "probabilistic classifiers". As the name suggests, it uses Bayes' theorem at its core, together with a 'naive' assumption.

The algorithm is widely used for simple classification problems and works well with text data in bag-of-words form. In fact, it was most popularized by its use as a spam email classifier at Google. To this day, it is widely used as a benchmark model by data scientists in hackathons on Kaggle.

Before we get into the crux of the algorithm, it's important to know what Bayes' rule is.

1.2 The Bayes Theorem

You can get a great intuitive sense of conditional probability from the lectures of Prof. John Tsitsiklis (MIT OpenCourseWare).

"You know something about the world, and based on what you know, you setup a probability model and you write down probabilities about the different outcomes. Then someone gives you some new information, which changes your beliefs and thus changes the probabilities of your outcomes."

In a simple sense, Bayes' theorem talks about the 'new' probabilities of an experiment given that an event has occurred.

Let's say you have a sample space in which the probabilities of the different events are as shown below: two overlapping regions $A$ and $B$, with $P(A \cap B) = 0.3$ and the part of $B$ outside $A$ carrying probability $0.2$.

Now let's say someone tells you that B has occurred. The area outside the circle that represents B is now meaningless, since the new sample space is B. This means we need to revise the probabilities that were assigned to the regions inside B.

Now someone asks you: what is the probability that A occurs, given that B has occurred? Since B has occurred and the new sample space is B, the area that was initially represented by $P(A \cap B)$ now represents $P(A \mid B)$, but the probability has to be recalculated.

What is the probability of this new section? Well, we can use simple ratios to solve this.

$$ P (A \mid B) = {P(A \cap B) \over P(B)} = {0.3 \over (0.3+0.2)} = 0.6$$

Now, the whole exercise we did above can also be done for when A has occurred, meaning that by symmetry we can show -

$$ P (B \mid A) = {P(A \cap B) \over P(A)}$$

Using the two symmetric equations and replacing $P(A \cap B)$, we can write the famous equation -

$$ P (A \mid B) = {P (B \mid A) \times P(A) \over P(B)}$$
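To make this concrete, here is a quick numeric check of the formula using the probabilities from the example above ($P(A \cap B) = 0.3$, $P(B) = 0.5$), together with an assumed $P(A) = 0.4$ purely for illustration:

```python
# Probabilities from the example above
p_a_and_b = 0.3        # P(A ∩ B)
p_b = 0.3 + 0.2        # P(B)
p_a = 0.4              # P(A) -- assumed here purely for illustration

# Conditional probabilities from the ratio definition
p_a_given_b = p_a_and_b / p_b   # 0.6
p_b_given_a = p_a_and_b / p_a   # 0.75

# Bayes' theorem recovers P(A|B) from P(B|A), P(A) and P(B)
assert abs(p_a_given_b - p_b_given_a * p_a / p_b) < 1e-12
print(p_a_given_b)  # 0.6
```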

1.3 The Naive Assumption

Now, let's see an 'expanded' version of this definition which is more relevant to data science. We want to know the probability of $y$ given that $x_1, x_2, ... , x_n$ have occurred (where the $x_i$ are different features and $y$ is the class label we want to predict). This is given by -

$$P(y \mid x_1, x_2, ... , x_n) = {P(x_1, x_2, ... , x_n \mid y) \times P(y) \over P(x_1, x_2, ... , x_n)}$$

Here comes the naive assumption of conditional independence. The assumption is that all the features $x_1, x_2, ..., x_n$ are independent of each other given the class, i.e. $P(x_i, x_j \mid y) = P(x_i \mid y).P(x_j \mid y)$. Applying this to the numerator of the above equation we get -

$$P(y \mid x_1, x_2, ... , x_n) = {P(x_1 \mid y).P(x_2 \mid y)... P(x_n \mid y).P(y) \over P(x_1, x_2, ... , x_n)}$$

Using proper convention -

$$P(y \mid x_1, x_2, ... , x_n) = {P(y)\prod_{i=1}^{n} P(x_i \mid y) \over P(x_1, x_2, ... , x_n)}$$

Since the denominator does not depend on $y$ (it is constant for a given input), we can write the equation as -

$$P(y \mid x_1, x_2, ... , x_n) \propto P(y)\prod_{i=1}^{n} P(x_i \mid y)$$

In order to turn this into a classifier, we need to pick the class with the maximum probability. This can be expressed as -

$$\hat{y} = \arg\max_{y}\Big(P(y)\prod_{i=1}^{n} P(x_i \mid y)\Big)$$

This last step is known as the Maximum A Posteriori (MAP) decision rule.
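As a tiny worked example of this rule, suppose we have two classes with made-up priors and per-feature likelihoods; the predicted class is simply the one with the larger product:

```python
import numpy as np

# Made-up priors P(y) and likelihoods P(x_i | y) for two classes and three features
priors = np.array([0.6, 0.4])               # P(y=0), P(y=1)
likelihoods = np.array([[0.2, 0.5, 0.1],    # P(x_i | y=0)
                        [0.4, 0.3, 0.3]])   # P(x_i | y=1)

# Posterior-proportional scores: P(y) * prod_i P(x_i | y)
scores = priors * likelihoods.prod(axis=1)
print(scores)           # [0.006  0.0144]
print(scores.argmax())  # 1 -> class 1 is the MAP prediction
```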

1.4 Decision Boundaries

When working with different algorithms, it's important to understand where a particular algorithm will work and where it may not. For this, I have always found a two-dimensional example of decision boundaries quite useful from an intuitive point of view, since the core 'nature' of the classifier carries over to higher-dimensional spaces as well. The objective is to separate the blue points from the red points; the decision boundary, along with the classifier's confidence, is plotted.

Notice below (image from the sklearn documentation) that the Naive Bayes classifier is capable of fitting smooth, continuous decision boundaries, but fails when the data requires a high-degree polynomial boundary.
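If you want to reproduce a plot like that yourself, a minimal sketch (using sklearn's make_moons toy data, chosen here just for illustration) could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.naive_bayes import GaussianNB

# Two-dimensional toy data so the decision boundary can be visualised
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
clf = GaussianNB().fit(X, y)

# Evaluate the predicted probability of class 1 over a grid covering the data
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
zz = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1].reshape(xx.shape)

plt.contourf(xx, yy, zz, cmap='RdBu', alpha=0.6)                # confidence shading
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu', edgecolor='k')  # blue vs red points
plt.show()
```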

1.5 Different Classifiers

Now, when you actually plan on implementing this, you will notice there are multiple classifiers available for a Naive Bayes model. The three classifiers behave in almost the same way, with minor differences in their assumptions. The above equation calculates $P(x_i \mid y)$ in a straightforward way for features with discrete values, where you can simply count the number of times a discrete value occurs together with the specific class label $y$. But what if the features are continuous variables? Then you need to 'assume' the nature of their distribution, and in that case we can use classifiers such as GaussianNB.

Simply put, by choosing different classifiers, you get to choose the assumptions regarding the nature of distributions of $P(x_i \mid y)$. More about these classifiers in later sections.

2. Implementing Naive Bayes with Sklearn

Let's do a quick implementation of Naive Bayes. This is how you would actually implement it when working with data; the focus will be on the algorithm rather than data preprocessing for now. For this example, I will use the iris data, which is available as part of the sklearn datasets API.
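A minimal version of that workflow (the particular train/test split, metric and sample point below are my choices, not fixed by the dataset) might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the iris data from the sklearn datasets API
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a Gaussian Naive Bayes model and evaluate it on held-out data
model = GaussianNB()
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))

# Predict a single versicolor-like sample; this is expected to return 1
print(model.predict([[6.0, 2.9, 4.5, 1.5]]))
```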

The value 1 corresponds to versicolor, as seen by the y_labels (0 represents setosa). This is a straightforward sklearn workflow and should be quite familiar if you have implemented any model with sklearn before.

3. Behind the scenes

In this section we are going to implement the different classifiers from scratch, to show what is actually happening behind the scenes. Also, since I am not a big fan of for loops, I will give a vectorized implementation of the model wherever possible.

3.1 Gaussian Naive Bayes

When your features are continuous in nature, you assume that they are conditionally independent and that the data associated with each class is normally distributed. To calculate the $P(x_i \mid y)$ values, we segment the data by class, compute the mean and variance, and then derive the probability density as below.

$$P(x_i \mid y) = {1 \over \sqrt{2 \pi \sigma^2_y}}exp \Bigg(-{(x_i - \mu_y)^2 \over 2 \sigma^2_y} \Bigg)$$

The first step is to calculate the summary statistics. For the Gaussian classifier, we need the class-wise means and standard deviations, along with the class priors, which are simply the class probabilities in the entire dataset. With these three, we can predict which class a particular sample belongs to. The following diagram shows the vectorized computation of the Gaussian classifier in a graphical representation.
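In code, a vectorized version of that computation might look roughly like the sketch below (not sklearn's internals, just the three matrices and the MAP rule in log space):

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Summary statistics: class priors, class-wise means and class-wise standard deviations
priors = np.array([np.mean(y == c) for c in classes])          # shape (n_classes,)
means  = np.array([X[y == c].mean(axis=0) for c in classes])   # shape (n_classes, n_features)
stds   = np.array([X[y == c].std(axis=0) for c in classes])    # shape (n_classes, n_features)

def predict(samples):
    # Gaussian log-likelihood of every sample under every class, via broadcasting:
    # (n_samples, 1, n_features) against (n_classes, n_features)
    x = samples[:, None, :]
    log_lik = -0.5 * np.log(2 * np.pi * stds ** 2) - (x - means) ** 2 / (2 * stds ** 2)
    # Sum the log-likelihoods over features and add the log prior (MAP in log space)
    log_post = log_lik.sum(axis=2) + np.log(priors)
    return log_post.argmax(axis=1)

print(predict(X[:1]))  # [0]
```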

A predicted value of 0 here corresponds to the first class, 'setosa'.

A few interesting notes based on this calculation -

  1. The prediction here is made directly using the priors matrix, the class-wise mean matrix and the class-wise standard deviation matrix.
  2. Training the algorithm on new data simply means updating these three matrices and re-running the calculation to predict the labels.

3.2 Multinomial classifier

The multinomial classifier is a bit more complex, but it is the usual choice for text-based data. Here you assume that your data is multinomially distributed. The multinomial distribution is a generalization of the binomial, Bernoulli and categorical distributions, depending on the number of trials (n) and the number of possible outcomes (k). If there are only 2 possible outcomes (k = 2) and multiple trials (n > 1), the multinomial distribution reduces to a binomial distribution.
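You can see this reduction directly with scipy: for k = 2 outcomes the multinomial pmf matches the binomial pmf (the particular n, p and k below are arbitrary):

```python
from scipy.stats import binom, multinomial

n, p, k = 10, 0.3, 4   # 10 trials, success probability 0.3, 4 successes

# A multinomial over two outcomes (success, failure) vs. the plain binomial
print(multinomial.pmf([k, n - k], n=n, p=[p, 1 - p]))  # ≈ 0.2001
print(binom.pmf(k, n=n, p=p))                          # same value
```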

Let's say you have a few sentences with labels -

Let's say you are asked to classify the sentence "Great smartphone". What you are trying to do is find the probability of the words "Great" and "Smartphone" occurring with the label Positive/Negative. Based on the product of the probabilities of the two words (remember the 'naive' assumption), you calculate which label has the higher product of probabilities and predict that as the corresponding label. This is done by simply calculating $N_w / N$, where $N_w$ is the number of times the word occurs with the label, and $N$ is the total number of words that occur with the given label.

Here comes the issue with doing this: if a word isn't present in the training data (like "Smartphone" in this case), the product of probabilities becomes 0. How do we solve this? We use a technique called additive smoothing (Laplace smoothing), which adds a small constant to both the numerator and the denominator. The smoothed estimate of $P(x_i \mid y)$, i.e. the probability $\hat\theta_i$ of word $i$ appearing in a sample belonging to class $y$, is -

$$ \hat\theta_i = {N_w + \alpha \over N + \alpha d} $$

Here $\alpha$ is set to 1 for Laplace smoothing, while $\alpha < 1$ gives Lidstone smoothing. $N_w$ is the number of times the word occurs with that label, $N$ is the total count of all words occurring with that label, and $d$ is the size of the vocabulary.
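In code, the smoothed estimate is a one-liner; the per-word counts below are hypothetical, just to show the effect on unseen words:

```python
import numpy as np

alpha = 1.0                                # alpha = 1 -> Laplace smoothing
word_counts = np.array([3, 0, 2, 1, 0])    # N_w: hypothetical counts of each vocabulary word within one class
N = word_counts.sum()                      # total word count for that class
d = len(word_counts)                       # vocabulary size

theta = (word_counts + alpha) / (N + alpha * d)
print(theta)         # unseen words get a small non-zero probability instead of 0
print(theta.sum())   # 1.0 -- still a valid probability distribution
```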

Note that for the implementation I will only show the case where we use a count vectorizer, but to improve the model it is important to leverage techniques such as stopword removal, lemmatization, TF-IDF vectorization, etc. We will start by creating a small text dataset with 2 labels.
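A minimal sketch of that setup (the sentences and labels below are made up for illustration) could be:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# A tiny, made-up dataset: 1 = positive, 0 = negative
texts = ["great device, loved it",
         "amazing device and battery",
         "terrible experience, would not buy",
         "worst experience ever"]
labels = [1, 1, 0, 0]

# Bag-of-words counts, then a multinomial model with Laplace smoothing (alpha=1 by default)
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(texts)
clf = MultinomialNB().fit(X_counts, labels)

print(clf.predict(vectorizer.transform(["great smartphone"])))  # expected: [1] (positive)
```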

The 1 here represents the 'positive' label.

Quite clearly, the first improvement that comes to mind is to replace the count vectorizer with a TF-IDF vectorizer: words that are not very representative of a label get lower weights, while words that actually represent a label get higher weights. In this case, words like 'device' and 'experience' each contribute to the positive and negative labels respectively.
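Swapping in TF-IDF is then a one-line change; a pipeline keeps things tidy (a sketch reusing the toy texts and labels from above):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Same model as before, but TF-IDF down-weights words that are common across both labels
tfidf_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
tfidf_model.fit(texts, labels)   # texts and labels from the toy dataset above
print(tfidf_model.predict(["great smartphone"]))
```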

Similarly, removing stopwords and lemmatizing words will have even more impact on the accuracy of the model.

4. References