Naive Bayes Tuning
The Naive Bayes model has a couple of useful hyperparameters to tune in Scikit-Learn. Aside from hyperparameters, probably the most important factor in a Naive Bayes implementation is the independence of the predictors (features).
Although the independent variables (features) are expected to be independent of each other, this is often not the case, and there is usually some degree of correlation between features.
For Naive Bayes to make accurate predictions, the main requirement is an input of strongly independent features, and if the model is producing subpar predictions this should be the first thing to investigate.
That being said, here are some useful parameters that can be optimized and tuned in Naive Bayes machine learning implementations:
Primer on Naive Bayes, Priors and Conditional Probability
There are two types of probabilities in a Naive Bayes implementation:
- Prior probability: Probability of a feature or target before new data is introduced to the model.
- Conditional probability: Probability of an event given that another event has occurred, hence conditional.
Priors are probabilities of each class or sample. We also have two types of priors in Naive Bayes applications:
- target prior: Prior probability of the target occurring. The class (target) prior is usually shown as P(A).
- feature prior: Prior probability of the feature occurring. The feature prior is shown as P(B).
P(A | B) is the conditional probability of the class given the feature, and it is what Naive Bayes attempts to solve. The model is called Naive because it assumes independence between features, which is seldom the case; hence the model is being naive about it.
Bayes' Theorem can be shown with this formula (a small numeric example follows the list):
- P(A|B) = ( P(B|A) * P(A) ) / P(B)
- P(A|B): the conditional probability of A given the event B (Posterior)
- P(B|A): the conditional probability of B given the event A (Likelihood)
- P(A): the prior probability of event A (Prior)
- P(B): the prior probability of event B (also called Normalization)
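To make the formula concrete, here is a minimal numeric sketch in Python; the probability values below are made up purely for illustration:

p_a = 0.3          # P(A): prior probability of class A
p_b = 0.2          # P(B): prior probability of observing feature value B
p_b_given_a = 0.5  # P(B|A): likelihood of B when the class is A

# Posterior: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = (p_b_given_a * p_a) / p_b
print(p_a_given_b)  # 0.75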
1- priors
The priors parameter gives you the option to specify priors instead of having the model derive them from the frequencies in the data, and this provides advanced control over a Naive Bayes model's probability calculations in Scikit-Learn.
The priors should always add up to 1. This holds both when the model calculates the priors itself and when the user passes priors as an array to the model. If the priors don't add up to 1, the following error will be raised:
ValueError: The sum of the priors should be 1.
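As a quick illustration, here is a minimal sketch with a tiny made-up dataset; fit() is expected to raise the error above because the priors sum to 1.1 (the exact message may vary by scikit-learn version):

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny made-up dataset with two classes
X = np.array([[1.0], [1.2], [3.0], [3.2]])
y = np.array([0, 0, 1, 1])

# These priors sum to 1.1, so fitting should fail with the ValueError above
GaussianNB(priors=[0.6, 0.5]).fit(X, y)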
Here is a Python example showing a custom prior array being passed to a Gaussian Naive Bayes model:
from sklearn.naive_bayes import GaussianNB

GNB = GaussianNB(priors=[0.33, 0.33, 0.34])
Adjusting priors can be useful for addressing bias in a dataset. For example, if the dataset is small and the target values occur in a biased way, the Naive Bayes model may conclude that the frequency of target A is lower than that of targets B and C, and this will affect the results. By intervening and assigning custom priors that you know are more accurate, you can improve the accuracy of the model.
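As a rough sketch of that idea, the snippet below subsamples scikit-learn's iris dataset (used purely for illustration) to simulate a biased sample, then compares the priors the model derives from class frequencies with priors supplied by hand, using the fitted class_prior_ attribute:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Deliberately keep only 10 samples of class 2 to simulate a biased sample
mask = np.concatenate([np.arange(100), np.arange(100, 110)])
X_small, y_small = X[mask], y[mask]

# Default: priors are estimated from the (biased) class frequencies
gnb_default = GaussianNB().fit(X_small, y_small)
print(gnb_default.class_prior_)   # roughly [0.45, 0.45, 0.09]

# Custom: override with priors you believe reflect the true distribution
gnb_custom = GaussianNB(priors=[0.33, 0.33, 0.34]).fit(X_small, y_small)
print(gnb_custom.class_prior_)    # [0.33, 0.33, 0.34]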
A major benefit of Naive Bayes is that it calculates the prior probabilities for a given dataset only once, which is a computationally light operation, and the model then makes predictions based on those values. This approach makes Naive Bayes remarkably fast. To recap, these are the probabilities involved:
prior probability:
- target prior
- feature prior
conditional probability:
- P(A | B): probability of the target (A) given the feature (B)
- P(B | A): probability of the feature (B) given the target (A)
Using the training data, the Naive Bayes model estimates the priors P(A) and the conditional probabilities P(B | A) from the targets at hand.
During the inference (prediction) phase, these stored values are then combined through Bayes' Theorem to compute the conditional probability P(A | B) for each class, without needing the target values.
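To make the training/inference split concrete, here is a small sketch, again assuming the iris dataset purely for illustration. The attribute names below (class_prior_, theta_, var_) are scikit-learn's fitted GaussianNB attributes; older versions expose the variances as sigma_ instead of var_:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gnb = GaussianNB().fit(X_train, y_train)

# Learned during training:
print(gnb.class_prior_)  # priors P(A) per class
print(gnb.theta_)        # per-class feature means    (parameters of P(B|A))
print(gnb.var_)          # per-class feature variances (parameters of P(B|A))

# At inference time the stored priors and likelihood parameters are combined
# via Bayes' Theorem to produce posteriors P(A|B); no targets are needed.
print(gnb.predict_proba(X_test[:3]))
print(gnb.predict(X_test[:3]))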
2- var_smoothing
var_smoothing is a parameter that takes a float value and is 1e-9 by default.
Variance smoothing improves calculation stability by adding a portion of the largest variance of all features to each feature's variance.
Here is a Python code example:
GNB = GaussianNB(var_smoothing=3e-10)
Furthermore, you can see the absolute value that gets added to the variances by printing the epsilon_ attribute of the model.
Here is an example (assuming the model is already trained):
print(GNB.epsilon_)
9.858497916666661e-10
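In current scikit-learn versions, epsilon_ is simply var_smoothing multiplied by the largest feature variance in the training data. The sketch below, again using iris purely for illustration, checks that relationship:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

gnb = GaussianNB(var_smoothing=3e-10).fit(X, y)

# epsilon_ is the absolute value added to every per-class feature variance
print(gnb.epsilon_)

# It should equal var_smoothing times the largest feature variance:
print(3e-10 * np.var(X, axis=0).max())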
If var_smoothing is increased too much, the likelihood probabilities for all classes converge to a uniform distribution. This means predictions are spread across the target values with equal probability, which renders them pretty much useless and turns the Naive Bayes model into a coin toss.
Here are sample confusion matrices for a model with a very high var_smoothing: (Model was run multiple times to create predictions.)
GNB = GaussianNB(var_smoothing=1e+4)
Each time a Naive Bayes model with a uniform distribution makes predictions, there is an equal chance of predicting each sample as Class 0, Class 1 or Class 2 (33.3% each).
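A rough sketch of that flattening effect, assuming the iris dataset purely for illustration (exact numbers will differ from the confusion matrices above):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for vs in (1e-9, 1e+4):  # default smoothing vs. an absurdly large value
    gnb = GaussianNB(var_smoothing=vs).fit(X_train, y_train)
    # With vs=1e+4 the class probabilities flatten toward ~[0.33, 0.33, 0.33]
    print(vs, gnb.predict_proba(X_test[:3]).round(3))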
This is why var_smoothing is such a small value by default, and in most applications it should remain very small, just enough to add a stabilizing amount of variance. (Variance smoothing plays the same role for Gaussian Naive Bayes that Laplace smoothing, also called additive smoothing, plays for count-based Naive Bayes models.)
The benefit of var_smoothing is that when data is missing, or a value has never been observed for a class, the smoothing keeps the model from breaking down.
When a feature's prior probability is missing (the value has never occurred before), we end up with one of two undesirable cases: the posterior probability of the event becomes 1 (neglecting the missing data), or it becomes 0 (assigning zero to the likelihood because the event never happened).
Thanks to the smoothing term, the model ends up with a more reasonable equation instead of breaking down when an event has never occurred before. And since the term is usually a very small value, its effect is negligible when the probabilities are present, but it makes a difference when a probability is missing. Schematically:
P(A|B) = ( P(B|A) * P(A) + var_smoothing ) / ( P(B) + var_smoothing * n_features )
Laplace is the father of this smoothing, and he came up with the idea while thinking about the probability of the sun not rising. He reasoned that since it had never happened before, Bayes' Theorem couldn't deal with it.
The theorem could be tweaked to assign the event a probability of 1, which makes no sense (a 100% chance), or it could be tweaked to assign it a probability of 0, which is probably more sensible but still not ideal, since we can't call an event improbable just because it has never happened before. Financial markets have taught us that time and again. Hence, by introducing a tiny variance for previously unseen observations we preserve the model's integrity.
Here is a quote from a Wikipedia article:
In a bag of words model of natural language processing and information retrieval, the data consists of the number of occurrences of each word in a document. Additive smoothing allows the assignment of non-zero probabilities to words which do not occur in the sample. Recent studies have proven that additive smoothing is more effective than other probability smoothing methods in several retrieval tasks such as language-model-based pseudo-relevance feedback and recommender systems.
-Wikipedia: Additive Smoothing
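For count-based text features like the bag-of-words case above, additive smoothing is exposed as the alpha parameter of scikit-learn's MultinomialNB rather than var_smoothing. Here is a minimal sketch with a made-up corpus:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus, just to illustrate the mechanics
docs = ["free prize money", "free money now", "meeting at noon", "project meeting notes"]
labels = [1, 1, 0, 0]  # 1 = spam-like, 0 = normal

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# alpha is the additive (Laplace/Lidstone) smoothing term: it assigns a small
# non-zero probability to words a class never produced in the training data
mnb = MultinomialNB(alpha=1.0).fit(X, labels)

# "prize" never appears in a class-0 document; without smoothing its
# likelihood for class 0 would be zero and the posterior would collapse
print(mnb.predict_proba(vectorizer.transform(["prize meeting"])))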
Summary
Naive Bayes is a fast and accurate probabilistic classification model that is still popular 250+ years after Thomas Bayes came up with Bayes' Theorem and his friend Richard Price compiled and published his work. More details below:
In this Naive Bayes Tutorial, we have learned how to tune Naive Bayes models using priors and var_smoothing parameters. We have also discussed the importance of independence between features beyond parameter optimization.
Additionally, choosing the right Naive Bayes implementation can make a difference. Some of the options among Naive Bayes classifier models are Gaussian, Multinomial, Bernoulli, Categorical, Complement, and Out-of-Core Naive Bayes implementations. You can see more details about the different Naive Bayes classifiers in the tutorial below: