Naive Bayes Classification
Naive Bayes is an incredibly powerful, fast and accurate machine learning model despite its age. In this mini-tutorial we will walk through the different Naive Bayes implementations you can use with Scikit-Learn and Python for classification problems.
How to Construct?
1- GaussianNB
We can implement the Gaussian Naive Bayes algorithm simply by using Scikit-Learn’s GaussianNB class to create a machine learning model. The GaussianNB class belongs to the naive_bayes module and can be used to construct a model. Gaussian Naive Bayes is used on features with continuous numerical values. For other types of Naive Bayes implementations please see the explanations further down the page.
a) Creating GaussianNB Model:
You can create a GaussianNB model using Python and Scikit-Learn. Here is a simple Python code to do that:
from sklearn import naive_bayes
GNB = naive_bayes.GaussianNB()
b) Training GaussianNB Model:
Once we have a Gaussian Naive Bayes model, we can train it as below, so we can use it for predictions:
GNB.fit(X_train, y_train)
c) Predicting with GaussianNB Model:
After the Gaussian Naive Bayes model is trained it will be ready for inference, in other words the process of making predictions. Naive Bayes is used solely for classification predictions, so we can only use it in classification problems.
Most other supervised machine learning algorithms also have regression versions and are capable of solving regression problems by predicting continuous values. Naive Bayes, like Logistic Regression, is one of the exceptions that can only do classification.
Here is a basic Python code that shows how prediction can be done using a trained Gaussian Naive Bayes model:
yhat = GNB.predict(X_test)
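To put these steps together, here is a minimal end-to-end sketch. The iris dataset, the 80/20 split and the fixed random_state are arbitrary illustrative choices:
from sklearn import naive_bayes
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Illustrative dataset with continuous features
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create, train and predict with a Gaussian Naive Bayes model
GNB = naive_bayes.GaussianNB()
GNB.fit(X_train, y_train)
yhat = GNB.predict(X_test)
print(accuracy_score(y_test, yhat))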
Remember, machine learning is usually an iterative process which requires going back and forth to adjust the data, the process or the model in use. Naive Bayes offers plenty of optimization opportunities like most other machine learning algorithms.
You can have better control over the performance, accuracy, efficiency and general outcome of your Naive Bayes implementation by adjusting its hyperparameters. We have prepared a separate tutorial that shows how Naive Bayes models can be tuned.
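As a minimal sketch of what tuning can look like, GaussianNB exposes a var_smoothing hyperparameter that can be searched with GridSearchCV. The grid below is an arbitrary example rather than a recommendation, and X_train and y_train are assumed to come from the earlier snippets:
import numpy as np
from sklearn import naive_bayes
from sklearn.model_selection import GridSearchCV

# var_smoothing adds a fraction of the largest feature variance
# to all variances for numerical stability
params = {"var_smoothing": np.logspace(-9, -3, 7)}  # example grid only
search = GridSearchCV(naive_bayes.GaussianNB(), params, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)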
2- Different Naive Bayes Models Based on the Distribution of Feature Values
Although GaussianNB is the most commonly used Naive Bayes implementation, depending on the distribution of values in your data you can also choose from a number of other options when constructing your Naive Bayes classification model.
Luckily Scikit-Learn has built-in classes for these different Naive Bayes versions. Here are different options that can be used for different distributions:
- Gaussian: Continuous feature values
- Multinomial: Discrete feature values
- Complement: Multinomial variant with extra robustness for imbalanced data
- Bernoulli: Binary feature values
- Categorical: Categorical feature values
a) MultinomialNB
Multinomial Naive Bayes works well with frequencies. Instead of features with continuous values, MultinomialNB works well with discrete features. Some examples of such features are:
- Word counts: 1, 2, 3, 4, 5, 6, 7, 8, 9… etc.
from sklearn import naive_bayes
MNB = naive_bayes.MultinomialNB()
Just create the MultinomialNB model using the naive_bayes module from sklearn, and you can follow the rest of the procedure in the same way as the Gaussian example above.
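As a small sketch of the word-count use case, counts produced by CountVectorizer are a natural fit for MultinomialNB. The tiny corpus and labels below are made up purely for illustration:
from sklearn import naive_bayes
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example documents and labels (1 = spam, 0 = not spam)
docs = ["free prize now", "meeting at noon", "win a free prize", "lunch meeting today"]
labels = [1, 0, 1, 0]

# Turn each document into a vector of word counts
counts = CountVectorizer().fit_transform(docs)

MNB = naive_bayes.MultinomialNB()
MNB.fit(counts, labels)
print(MNB.predict(counts[:1]))  # predict the label of the first document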
b) ComplementNB
Another option is the Complement Naive Bayes implementation. Complement Naive Bayes is derived from Multinomial Naive Bayes as a better alternative for imbalanced datasets where some classes are not represented as frequently as others, and it aims to correct the severe assumptions the standard multinomial model makes in those situations.
The idea behind the ComplementNB implementation is that, instead of estimating each class from its own samples, it uses statistics from the complement of each class (all the samples that do not belong to it), which gives more stable parameter estimates for underrepresented classes. The model can be created in Scikit-Learn as below:
from sklearn import naive_bayes
CNB = naive_bayes.ComplementNB()
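ComplementNB shares the MultinomialNB interface, so it can be dropped in on the same kind of count data. Here is a minimal sketch with a deliberately imbalanced, made-up dataset:
import numpy as np
from sklearn import naive_bayes

# Made-up count features where class 0 heavily outnumbers class 1
X = np.array([[2, 1, 0], [3, 0, 1], [1, 2, 0], [4, 1, 1], [0, 0, 5]])
y = np.array([0, 0, 0, 0, 1])

# Each class is estimated from the statistics of all other classes
CNB = naive_bayes.ComplementNB()
CNB.fit(X, y)
print(CNB.predict([[0, 1, 4]]))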
c) BernoulliNB
Bernoulli Naive Bayes is a binary implementation of the Naive Bayes algorithm. BernoulliNB is another option in Scikit-Learn. It works with binary or boolean values instead of occurrence counts as in the Multinomial example. Some example features would be:
- character exists or not
- word exists or not
- age above 70: True or False
- age below 10: True or False
- student or alumni
- employed or unemployed
- electric or fossil fuel etc.
It can be constructed using the following Python code:
from sklearn import naive_bayes
BNB = naive_bayes.BernoulliNB()
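As a minimal sketch, BernoulliNB can be fit directly on boolean features like the ones listed above. The feature matrix here is fabricated for illustration; the binarize parameter, which normally thresholds non-binary inputs, is set to None because the inputs are already 0/1:
import numpy as np
from sklearn import naive_bayes

# Fabricated binary features: [word exists, age above 70, employed]
X = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y = np.array([1, 0, 0, 1])

# binarize=None tells the model the inputs are already binary
BNB = naive_bayes.BernoulliNB(binarize=None)
BNB.fit(X, y)
print(BNB.predict([[1, 0, 0]]))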
d) CategoricalNB
Finally, you can also use Categorical Naive Bayes when features have categorical discrete values. This means features have discrete, categorical values, but they are not necessarily binary. Some examples of such features are:
- Color: red, blue, brown, black, green, yellow
- Movie rating: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
- Energy source: electric, diesel, gas, biofuel, hydrogen
- Transportation mode: air, land, rail, road, water, cable, space, human-powered, tube, pipeline
from sklearn import naive_bayes
CatNB = naive_bayes.CategoricalNB()  # named CatNB so it does not clash with the ComplementNB example
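CategoricalNB expects every feature to be encoded as non-negative integer category indices, which OrdinalEncoder can provide. The color and energy-source records below are invented for illustration:
from sklearn import naive_bayes
from sklearn.preprocessing import OrdinalEncoder

# Invented categorical records: [color, energy source]
X_raw = [["red", "electric"], ["blue", "diesel"], ["red", "gas"], ["green", "electric"]]
y = [1, 0, 0, 1]

# Map each category to an integer index, as CategoricalNB requires
enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)

CatNB = naive_bayes.CategoricalNB()
CatNB.fit(X, y)
print(CatNB.predict(enc.transform([["red", "diesel"]])))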
Here is a quick summary of which implementation fits which kind of feature data:
- GaussianNB: works well with continuous feature values
- BernoulliNB: works well with binary features
- MultinomialNB: works well with frequency counts; when data is imbalanced, ComplementNB can be used to fix severe bias
- CategoricalNB: works well with categorical features
Summary
In this Naive Bayes classifier tutorial we have seen how to create, train and predict with Naive Bayes models. Additionally, we have learned the major differences between the Naive Bayes implementations as well as their use cases and advantages.
For a more general list of the advantages of Naive Bayes machine learning models, you can refer to our separate article.