[1]:
%run loadmlfuncs.py
Naive Bayes¶
Naive Bayes models are a group of extremely fast and simple classification algorithms that are often suitable for very high-dimensional datasets. Because they are so fast and have so few tunable parameters, they end up being very useful as a quick-and-dirty baseline for a classification problem. This section will focus on an intuitive explanation of how naive Bayes classifiers work.
Bayesian Classification¶
Naive Bayes classifiers are built on Bayesian classification methods. These rely on Bayes’s theorem, which is an equation describing the relationship of conditional probabilities of statistical quantities.
In Bayesian classification, we’re interested in finding the probability of a label given some observed features, which we can write as \(P(C~|~{\rm features})\). Bayes’s theorem tells us how to express this in terms of quantities we can compute more directly:
\begin{align} P(C~|~{\rm features}) = \frac{P({\rm features}~|~C)~P(C)}{P({\rm features})} \nonumber \end{align}
If we are trying to decide between two labels — let’s call them \(C_1\) and \(C_2\) — then one way to make this decision is to compute the ratio of the posterior probabilities for each label:
\begin{align} \frac{P(C_1~|~{\rm features})}{P(C_2~|~{\rm features})} = \frac{P({\rm features}~|~C_1)~P(C_1)}{P({\rm features}~|~C_2)~P(C_2)} \nonumber \end{align}
All we need now is some model by which we can compute \(P({\rm features}~|~C_i)\) for each label. Such a model is called a generative model because it specifies the hypothetical random process that generates the data.
Specifying this generative model for each label is the main piece of the training of such a Bayesian classifier. The general version of such a training step is a very difficult task, but we can make it simpler through the use of some simplifying assumptions about the form of this model.
This is where the “naive” in “naive Bayes” comes in: if we make very naive assumptions about the generative model for each label, we can find a rough approximation of the generative model for each class, and then proceed with the Bayesian classification.
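To make the decision rule concrete, here is a minimal sketch in Python; the prior and likelihood values are made-up numbers chosen purely for illustration, not values from any dataset.
# Minimal sketch of the two-label Bayesian decision rule.
# All numbers below are made up for illustration only.
prior = {'C1': 0.6, 'C2': 0.4}             # P(C_i)
likelihood = {'C1': 0.02, 'C2': 0.05}      # P(features | C_i) from some generative model

# posterior ratio P(C1|features) / P(C2|features); P(features) cancels out
ratio = (likelihood['C1'] * prior['C1']) / (likelihood['C2'] * prior['C2'])
prediction = 'C1' if ratio > 1 else 'C2'
print(ratio, prediction)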
Bernoulli Naive Bayes¶
There are 856 people who have either tried or not tried a company’s new frozen lasagna product. The data includes the categorical dependent variable tried and several other explanatory variables.
[2]:
df_lasagna = pd.read_csv(dataurl+'lasagna.csv', header=0, index_col='person')
df_lasagna.head()
[2]:
person | age | weight | income | car_value | debt | mall_trips | gender | alone | dwell | pay_type | nbhd | tried
---|---|---|---|---|---|---|---|---|---|---|---|---
1 | 48 | 175 | 65500 | 2190 | 3510 | 7 | Male | No | Home | Hourly | East | No
2 | 33 | 202 | 29100 | 2110 | 740 | 4 | Female | No | Condo | Hourly | East | Yes
3 | 51 | 188 | 32200 | 5140 | 910 | 1 | Male | No | Condo | Salaried | East | No
4 | 56 | 244 | 19000 | 700 | 1620 | 3 | Female | No | Home | Hourly | West | No
5 | 28 | 218 | 81400 | 26620 | 600 | 3 | Male | No | Apt | Salaried | West | Yes
For this naive Bayes method, numeric predictors must first be binned, i.e., made categorical. In this example, each numeric variable is binned by its quartiles, shown below.
[3]:
df_quantile = df_lasagna.quantile([0, .25, .5, .75, 1])
df_quantile
[3]:
quantile | age | weight | income | car_value | debt | mall_trips
---|---|---|---|---|---|---
0.00 | 22.0 | 142.0 | 2600.0 | 130.0 | 0.0 | 0.0
0.25 | 31.0 | 174.0 | 24475.0 | 2110.0 | 560.0 | 3.0
0.50 | 37.5 | 190.0 | 39950.0 | 4175.0 | 1020.0 | 4.0
0.75 | 46.0 | 210.0 | 58225.0 | 7717.5 | 1972.5 | 7.0
1.00 | 64.0 | 258.0 | 190500.0 | 33870.0 | 8960.0 | 17.0
[4]:
for col in df_lasagna.columns:
    if df_lasagna[col].dtypes == 'int64':
        # bin each numeric column by its (rounded) quartile edges; the lower
        # edges are shifted down by 1 so the minimum value falls in the first bin
        df_lasagna[col] = pd.cut(df_lasagna[col],
                                 bins=round(df_quantile[col]) - [1, 1, 1, 1, 0],
                                 labels=False)
df_lasagna.head()
[4]:
person | age | weight | income | car_value | debt | mall_trips | gender | alone | dwell | pay_type | nbhd | tried
---|---|---|---|---|---|---|---|---|---|---|---|---
1 | 3 | 1 | 3 | 1 | 3 | 3 | Male | No | Home | Hourly | East | No
2 | 1 | 2 | 1 | 1 | 1 | 2 | Female | No | Condo | Hourly | East | Yes
3 | 3 | 1 | 1 | 2 | 1 | 0 | Male | No | Condo | Salaried | East | No
4 | 3 | 3 | 0 | 0 | 2 | 1 | Female | No | Home | Hourly | West | No
5 | 0 | 3 | 3 | 3 | 1 | 1 | Male | No | Apt | Salaried | West | Yes
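As an aside, an equivalent quartile binning could also be obtained with pandas’ qcut; the sketch below (using a hypothetical copy df_alt) is only an alternative illustration, not the transformation actually applied above, and its bin labels may differ slightly at the quartile boundaries.
# Alternative sketch: quartile binning with pd.qcut on a fresh copy of the data
df_alt = pd.read_csv(dataurl+'lasagna.csv', header=0, index_col='person')
for col in df_alt.columns:
    if df_alt[col].dtype == 'int64':
        df_alt[col] = pd.qcut(df_alt[col], q=4, labels=False, duplicates='drop')
df_alt.head()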
We partition the data into training and testing datasets.
[5]:
df_lasagna_train = df_lasagna.iloc[:700]
df_lasagna_test = df_lasagna.iloc[700:]
We compute frequency counts from the training data, which give the probability of a feature value given an individual’s class. For example, person 1 has value ‘No’ for tried and dwell ‘Home’. We obtain the probability
\begin{align} p(\text{dwell $=$ 'Home'}|\text{No}) = \frac{\text{# of persons with dwell $=$ 'Home' and tried $=$ 'No'}}{\text{# of persons with tried $=$ 'No'}} = 0.468 \nonumber \end{align}
The value (‘No’, ‘Home’): 0.468 shows that, if a customer did not try the lasagna product, the probability that his/her dwell type is Home is 0.468.
[6]:
def frequency_count(col, normalize):
    return dict(df_lasagna_train.groupby('tried')[col].value_counts(normalize=normalize))

# interact and widgets are from ipywidgets
interact(frequency_count,
         col=widgets.Dropdown(options=df_lasagna_train.columns, value='dwell',
                              description='column:', disabled=False),
         normalize=widgets.Checkbox(value=True, description='normalize', disabled=False)
);
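For readers not running the interactive widget, the same conditional probabilities can be printed directly; for instance, this sketch reproduces the entry for (‘No’, ‘Home’) discussed above.
# Non-interactive sketch of the per-class frequency table for 'dwell';
# the entry ('No', 'Home') corresponds to p(dwell = 'Home' | No)
cond_prob = df_lasagna_train.groupby('tried')['dwell'].value_counts(normalize=True)
print(cond_prob[('No', 'Home')])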
Similarly, we can calculate
\begin{align} p(\text{dwell $=$ 'Home'}|\text{Yes}), p(\text{age = '3'}|\text{No}), p(\text{income = '2'}|\text{Yes}), \nonumber \end{align}
and so on. In summary, we have \(p(\text{feature}|\text{No})\) and \(p(\text{feature}|\text{Yes})\) for each possible feature value of a customer, which together can be viewed as a feature dictionary.
With this feature dictionary, we want to obtain the probabilities \begin{align} p(\text{person 1}|\text{No}) \text{ and } p(\text{person 1}|\text{Yes}) \nonumber \end{align}
where the probability \(p(\text{person 1}|\text{No})\) can be interpreted as:
if a person did not try the lasagna product, s/he has probability \(p(\text{person 1}|\text{No})\) of having the same feature values as person 1.
The prediction rule is then simple:
If \(p(\text{person 1}|\text{No}) > p(\text{person 1}|\text{Yes})\), we predict the person will not try the lasagna product;
otherwise, we predict the person will try the lasagna product.
(This rule compares the class-conditional probabilities only, which implicitly treats the two classes as equally likely a priori; to apply the posterior-ratio rule from the earlier section exactly, each side would also be multiplied by the corresponding class prior.)
Here is how probabilities \begin{align} p(\text{person 1}|\text{No}) \text{ and } p(\text{person 1}|\text{Yes}) \nonumber \end{align}
are derived. For example, for the first person
\begin{align} p(\text{person 1}|\text{No}) = & p(\text{age}=3|\text{No}) \times p(\text{weight}=1|\text{No}) \times \cdots \times\nonumber \\ & p(\text{pay_type $=$ hourly}|\text{No}) \times p(\text{nbhd $=$ East}|\text{No}) \nonumber \end{align}
and
\begin{align} p(\text{person 1}|\text{Yes}) = & p(\text{age}=3|\text{Yes}) \times p(\text{weight}=1|\text{Yes}) \times \cdots \times \nonumber \\ & p(\text{pay_type $=$ hourly}|\text{Yes}) \times p(\text{nbhd $=$ East}|\text{Yes}) \nonumber \end{align}
Note that we assume all the features are independent given the class, so that a simple multiplication can be applied (this is why the method is called naive). Thus, \(p(\text{person 1}|\text{No})\) is the probability that a random person who did not try the lasagna product has exactly the same feature values as person 1.
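For concreteness, the sketch below computes these two products for person 1 directly, using the frequency_count function defined above; it mirrors what the predict function in the next cell does for every row.
# Sketch: class-conditional products for person 1
person = df_lasagna_train.iloc[0]
feature_cols = [col for col in df_lasagna_train.columns if col not in ['tried', 'prediction']]
p_no  = np.prod([frequency_count(col, True)[('No',  person[col])] for col in feature_cols])
p_yes = np.prod([frequency_count(col, True)[('Yes', person[col])] for col in feature_cols])
print(p_no, p_yes, 'No' if p_no > p_yes else 'Yes')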
[7]:
def predict(df):
    # predict 'No' if the product of per-feature conditional probabilities is
    # larger under class 'No' than under class 'Yes', and 'Yes' otherwise
    df['prediction'] = ['No'
                        if np.prod([frequency_count(col, True)[('No', df.iloc[row][col])]
                                    for col in df.columns if col not in ['tried', 'prediction']]) >
                           np.prod([frequency_count(col, True)[('Yes', df.iloc[row][col])]
                                    for col in df.columns if col not in ['tried', 'prediction']])
                        else 'Yes'
                        for row in range(df.shape[0])]

predict(df_lasagna_train)
df_lasagna_train.head()
[7]:
person | age | weight | income | car_value | debt | mall_trips | gender | alone | dwell | pay_type | nbhd | tried | prediction
---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | 3 | 1 | 3 | 1 | 3 | 3 | Male | No | Home | Hourly | East | No | Yes
2 | 1 | 2 | 1 | 1 | 1 | 2 | Female | No | Condo | Hourly | East | Yes | No
3 | 3 | 1 | 1 | 2 | 1 | 0 | Male | No | Condo | Salaried | East | No | No
4 | 3 | 3 | 0 | 0 | 2 | 1 | Female | No | Home | Hourly | West | No | No
5 | 0 | 3 | 3 | 3 | 1 | 1 | Male | No | Apt | Salaried | West | Yes | Yes
The classification matrix (confusion matrix) for the training data is
[8]:
df_lasagna_train.groupby(['prediction','tried']).size().unstack(level=1, fill_value=0)
[8]:
prediction \ tried | No | Yes
---|---|---
No | 250 | 81
Yes | 49 | 320
The same feature dictionary can be applied to the testing data; its classification matrix is
[9]:
predict(df_lasagna_test)
df_lasagna_test.groupby(['prediction','tried']).size().unstack(level=1, fill_value=0)
[9]:
prediction \ tried | No | Yes
---|---|---
No | 56 | 21
Yes | 6 | 73
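The overall accuracy on each partition follows directly from these classification matrices; as a quick sanity check:
# Overall accuracy = correct predictions / total, per partition
train_acc = (df_lasagna_train['prediction'] == df_lasagna_train['tried']).mean()
test_acc = (df_lasagna_test['prediction'] == df_lasagna_test['tried']).mean()
print(train_acc, test_acc)   # equivalently (250+320)/700 and (56+73)/156 from the tables above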
Multinomial Naive Bayes¶
In multinomial naive Bayes, the features are assumed to be generated from a simple multinomial distribution.
The multinomial distribution describes the probability of observing counts among a number of categories, and thus multinomial naive Bayes is most appropriate for features that represent counts or count rates.
The idea is precisely the same as before, except that instead of modeling each feature with per-class frequency counts as we did above, we model the data distribution with a best-fit multinomial distribution.
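As a small illustration of the multinomial likelihood itself, the class-conditional probability of a vector of counts is proportional to the product of per-category probabilities raised to the observed counts; the word probabilities and counts below are made-up numbers, unrelated to the text data used next.
# Sketch: multinomial class-conditional likelihood, up to the multinomial
# coefficient (which cancels when comparing classes). All numbers are made up.
word_probs = {'class_A': np.array([0.70, 0.20, 0.10]),   # P(word_i | class)
              'class_B': np.array([0.10, 0.30, 0.60])}
counts = np.array([2, 0, 3])                             # word counts in one document

for c, p in word_probs.items():
    print(c, np.prod(p ** counts))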
Example: Classifying Text¶
One place where multinomial naive Bayes is often used is in text classification, where the features are related to word counts or frequencies within the documents to be classified.
Here we will use the sparse word count features from the 20 Newsgroups corpus to show how we might classify these short documents into categories.
[10]:
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()
display(data.target_names)

# choose a subset of categories to learn
categories = ['talk.religion.misc',
              'soc.religion.christian',
              'sci.space',
              'comp.graphics']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
['alt.atheism',
'comp.graphics',
'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware',
'comp.windows.x',
'misc.forsale',
'rec.autos',
'rec.motorcycles',
'rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.electronics',
'sci.med',
'sci.space',
'soc.religion.christian',
'talk.politics.guns',
'talk.politics.mideast',
'talk.politics.misc',
'talk.religion.misc']
Here is a representative entry from the data:
[11]:
print(train.data[5])
From: dmcgee@uluhe.soest.hawaii.edu (Don McGee)
Subject: Federal Hearing
Originator: dmcgee@uluhe
Organization: School of Ocean and Earth Science and Technology
Distribution: usa
Lines: 10
Fact or rumor....? Madalyn Murray O'Hare an atheist who eliminated the
use of the bible reading and prayer in public schools 15 years ago is now
going to appear before the FCC with a petition to stop the reading of the
Gospel on the airways of America. And she is also campaigning to remove
Christmas programs, songs, etc from the public schools. If it is true
then mail to Federal Communications Commission 1919 H Street Washington DC
20054 expressing your opposition to her request. Reference Petition number
2493.
[12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix

# Fit the model and show the classification matrix.
# TfidfVectorizer converts the content of each string into a vector of numbers.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train.data, train.target)

labels = model.predict(test.data)
mat = confusion_matrix(test.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');

def predict_category(s, train=train, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]
Evidently, even this very simple classifier can successfully separate space talk from computer talk, but it gets confused between talk about religion and talk about Christianity. This is perhaps an expected area of confusion!
The very cool thing here is that we now have the tools to determine the category for any string, using the predict() method of this pipeline. The quick utility function predict_category defined above returns the prediction for a single string:
[13]:
predict_category('discussing islam vs atheism')
[13]:
'soc.religion.christian'
[14]:
predict_category('determining the screen resolution')
[14]:
'comp.graphics'
[15]:
predict_category('sending a payload to the ISS')
[15]:
'sci.space'
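Because the final step of the pipeline is MultinomialNB, the pipeline also exposes predict_proba, which returns the posterior probability of each category for a string; a quick sketch:
# Posterior probabilities P(category | string) for one example string
probs = model.predict_proba(['sending a payload to the ISS'])[0]
for name, p in zip(train.target_names, probs):
    print(name, round(p, 3))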
When to Use Naive Bayes¶
Because naive Bayesian classifiers make such stringent assumptions about data, they will generally not perform as well as a more complicated model. That said, they have several advantages:
- They are extremely fast for both training and prediction
- They provide straightforward probabilistic prediction
- They are often very easily interpretable
- They have very few (if any) tunable parameters
These advantages mean a naive Bayesian classifier is often a good choice as an initial baseline classification.