[1]:
%run ../initscript.py
[2]:
%run loadmlfuncs.py

df_book_part1 = pd.read_csv(dataurl+'book_train.csv', header=0, index_col='customer')
df_book_part2 = pd.read_csv(dataurl+'book_validation.csv', header=0, index_col='customer')

toggle()
[2]:

Logistic Regression

Logistic regression is a popular method for classifying individuals, despite the word “regression” in its name, given the values of a set of explanatory variables. It estimates the probability that an individual belongs to a particular category. We demonstrate the method with a book club case.

Book Club

A book club has a new title, “The Art History of Florence”, ready for release. The club has sent promotion mail to a sample of customers from its customer base on two separate occasions, each time randomly selecting 1000 customers.

[3]:
df_book_part1.head()
[3]:
customer  month  art_book  purchased
1            24         0          0
2            16         0          0
3            15         0          0
4            22         0          0
5            15         0          1
[4]:
df_book_part2.head()
[4]:
customer  month  art_book  purchased
1001         30         0          0
1002         12         0          0
1003         18         0          0
1004         27         1          0
1005          4         1          0

The book club has recorded the following variables for all 2000 customers:

  • month: number of months since the customer’s last purchase at the time the promotion mail was sent

  • art_book: number of art books the customer has purchased before

  • purchased: whether the customer bought the new title “The Art History of Florence” (1 if yes, 0 if no)

Sending a mail costs the book club $1, and each sale of the book generates $7 in profit. After the two promotions, an analyst at the book club realizes that the club actually lost money on both.

[5]:
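def calc_profit(df):
    mail_cost = 1        # cost of sending one promotion mail ($)
    selling_profit = 7   # profit from selling one book ($)
    # count() gives the number of non-missing month values, used here as the number of mails sent
    profit = df.purchased.sum() * selling_profit - df.month.count() * mail_cost
    return profit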

print('net profit for the 1st promotion:', calc_profit(df_book_part1))
print('net profit for the 2nd promotion:', calc_profit(df_book_part2))
net profit for the 1st promotion: -418
net profit for the 2nd promotion: -433

The manager believes the book club should build a model that predicts each customer’s probability of purchasing, and then send promotion mail only when that probability is high enough.

  • Can we derive a prediction model from the data collected in the first promotion?

  • Can this prediction model improve the second promotion?

We expect this model to use month and art_book to predict purchased, which suggests a regression equation purchased \(\sim\) month \(+\) art_book. However, the \(y\) variable purchased is either 0 or 1, as a scatter plot of purchased against month (\(y\) vs \(x\)) shows:

[6]:
df_book_part1.plot.scatter(x='month', y='purchased')
plt.show()
[Figure: scatter plot of purchased versus month]

The graph violates several linear regression assumptions:

  • there is no linear relationship between the independent and dependent variables.

  • the error term cannot be normally distributed, because the dependent variable takes only two values.
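
To make the problem concrete, the sketch below fits an ordinary least-squares line to a small made-up binary data set and evaluates it at the edges of the data. The x and y values are invented for illustration and are not taken from the book club data; the point is only that a straight line fitted to 0/1 outcomes can return values below 0 and above 1, which cannot be probabilities.

```python
# Hypothetical toy data: x plays the role of `month`, y of `purchased`.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form simple linear regression (ordinary least squares).
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x


def predict(x_new):
    """Fitted value of the straight line at x_new."""
    return intercept + slope * x_new


# Fitted values at the edges fall outside the valid probability range [0, 1].
print(predict(0), predict(10))
```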

The Model

Instead of using the binary variable (purchased or not), we may consider the purchasing probability \(p\) as the dependent variable in a regression equation such as

\begin{align} p &= \beta_0 + \beta_1 \times \text{month} + \beta_2 \times \text{art_book} \end{align}

Although \(p\) is continuous, we still cannot run a linear regression on \(p\) because it is bounded to the range \([0,1]\), whereas in linear regression the dependent variable should be able to take any value in \((-\infty, +\infty)\).

We therefore introduce the odds and the utility (the log of the odds, also known as the logit):

\begin{align} \text{odds} &= \frac{p}{1-p} \nonumber \\ \text{utility} &= \log(\text{odds}) \nonumber \end{align}

Note that as \(p\) ranges over \((0,1)\), the utility ranges over \((-\infty, +\infty)\). Now a regression equation can be used:

\begin{align} \text{utility} &= \beta_0 + \beta_1 \times \text{month} + \beta_2 \times \text{art_book} \end{align}

[7]:
# Exclude the endpoints: the odds are undefined at p = 1 and log(odds) at p = 0.
p = np.linspace(0.01, 0.99, 99)
odds = p / (1 - p)
utility = np.log(odds)

fig, (ax_odds, ax_utility) = plt.subplots(1, 2, figsize=(12, 5))
ax_odds.plot(p, odds)
ax_odds.set_xlabel('$p$')
ax_odds.set_ylabel('odds')

ax_utility.plot(p, utility)
ax_utility.set_xlabel('$p$')
ax_utility.set_ylabel('utility')

plt.show()
toggle()
[Figure: odds (left) and utility (right) plotted against p]
[7]:

In practice, we do not need to apply these transformations by hand: logistic regression takes the binary dependent variable directly, and statistical tools perform the transformation for us. In Python, we can use either statsmodels, which provides a statistical summary, or the sklearn package.

[8]:
from statsmodels.api import Logit, add_constant

# Logit does not add an intercept by itself, so add a constant column explicitly.
X = add_constant(df_book_part1[['month','art_book']])
y = df_book_part1['purchased']
model = Logit(y, X)
model.fit().summary()
Optimization terminated successfully.
         Current function value: 0.251705
         Iterations 7
[8]:
Logit Regression Results
Dep. Variable:   purchased          No. Observations:   999
Model:           Logit              Df Residuals:       996
Method:          MLE                Df Model:           2
Date:            Mon, 29 Jul 2019   Pseudo R-squ.:      0.1206
Time:            20:41:18           Log-Likelihood:     -251.45
converged:       True               LL-Null:            -285.95
                                    LLR p-value:        1.044e-15

            coef      std err   z        P>|z|   [0.025   0.975]
const       -2.2262   0.239     -9.316   0.000   -2.695   -1.758
month       -0.0706   0.019     -3.670   0.000   -0.108   -0.033
art_book     0.9888   0.135      7.343   0.000    0.725    1.253

The signs of the coefficients indicate whether the probability of purchasing the book increases or decreases as each variable increases. For example, the probability of purchasing decreases as month increases (negative coefficient) and increases as art_book increases (positive coefficient).

However, be cautious when interpreting the magnitudes of the coefficients, because they depend on the scale of each variable. For example, the coefficient of month is smaller in absolute value than that of art_book partly because month generally takes larger values than art_book.

The value \(\exp\)(coefficient) is more interpretable. For example, if art_book increases by 1, the odds of purchasing the book are multiplied by about \(\exp(0.9888) \approx 2.69\). So, look out for values well above or below 1.
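
As a small illustration of this interpretation, the sketch below exponentiates the two slope coefficients reported in the statsmodels summary above. The dictionary and variable names are chosen here for illustration only.

```python
import math

# Slope coefficients copied from the statsmodels summary above.
coefs = {"month": -0.0706, "art_book": 0.9888}

# exp(coefficient) is the factor by which the odds of purchasing are multiplied
# for a one-unit increase in that variable, holding the other variable fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds are multiplied by {ratio:.3f} per unit increase")
```

A factor below 1 (month) means the odds shrink with each additional month, while a factor above 1 (art_book) means the odds grow with each additional art book previously purchased.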

[9]:
from sklearn import linear_model

X = df_book_part1[['month','art_book']]
y = df_book_part1['purchased']

clf = linear_model.LogisticRegression(C=1e5, solver='lbfgs')
clf.fit(X, y)
print('intercept=', clf.intercept_, '\ncoefficient =', clf.coef_)
intercept= [-2.22621349]
coefficient = [[-0.07061966  0.98880806]]

Validation

Once we have estimated \(\beta_0, \beta_1\) and \(\beta_2\), we can apply equation \((2)\) to the validation data to compute each customer’s utility. The probability is then recovered by

\begin{align} p = \frac{\exp(\text{utility})}{1+\exp(\text{utility})} \nonumber \end{align}

[10]:
utility = np.linspace(-10,10,100)
p = np.exp(utility) / (1 + np.exp(utility))

plt.subplots(1, 1, figsize=(12,5))
plt.plot(utility, p)
plt.xlabel('utility')
plt.ylabel('$p$')
plt.show()
toggle()
[Figure: p plotted against utility]
[10]:
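
As a worked example of this back-transformation, the sketch below plugs the intercept and coefficients printed by the sklearn fit above into the utility equation and converts the result to a probability. The helper name purchase_probability is hypothetical, and the two inputs correspond to the first and fifth rows of the validation data shown earlier (month=30, art_book=0 and month=4, art_book=1).

```python
import math

# Intercept and coefficients as printed by the sklearn fit above.
BETA_0 = -2.22621349
BETA_MONTH = -0.07061966
BETA_ART_BOOK = 0.98880806


def purchase_probability(month, art_book):
    """Utility (log-odds) from the regression equation, mapped back to a probability."""
    utility = BETA_0 + BETA_MONTH * month + BETA_ART_BOOK * art_book
    return math.exp(utility) / (1 + math.exp(utility))


print(round(purchase_probability(30, 0), 4))  # customer with month=30, art_book=0
print(round(purchase_probability(4, 1), 4))   # customer with month=4, art_book=1
```

A recent purchase and a previously bought art book both push the utility up, so the second customer receives a much higher predicted probability than the first.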

A common default is to classify with a threshold of 0.5: if the predicted probability that a customer purchases the book is greater than 0.5, we send a mail.

However, the book club case has a natural break-even point. Mailing a customer with purchase probability \(p\) costs $1 and earns $7 with probability \(p\), so the expected profit is \(7p - 1\), which is positive only when \(p\) exceeds the cost-to-profit ratio \(1/7\). Our strategy is therefore as follows: if the predicted probability that a customer purchases the book is greater than \(1/7\), we send a mail; otherwise we do not.
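
The break-even rule can be sketched in a few lines. The function names below are illustrative, and the defaults are the $1 mailing cost and $7 profit per sale stated earlier.

```python
def break_even_probability(mail_cost=1.0, selling_profit=7.0):
    """Purchase probability at which the expected profit of one mail is zero."""
    return mail_cost / selling_profit


def expected_profit(p, mail_cost=1.0, selling_profit=7.0):
    """Expected profit of mailing one customer whose purchase probability is p."""
    return p * selling_profit - mail_cost


threshold = break_even_probability()
print(f"break-even probability: {threshold:.4f}")
print(f"expected profit at p=0.5: {expected_profit(0.5):.2f}")
print(f"expected profit at p=0.1: {expected_profit(0.1):.2f}")
```

Any customer whose predicted probability lies above the threshold is expected to bring in more than the cost of the mail, even if that probability is far below 0.5.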

We use sklearn here because its predict_proba method returns the predicted probabilities directly.

[11]:
df_book_part2['purchase_prob'] = clf.predict_proba(df_book_part2[['month','art_book']])[:,1]
df_book_part2['send'] = df_book_part2['purchase_prob'] > 1/7
df_book_part2.head()
[11]:
customer  month  art_book  purchased  purchase_prob   send
1001         30         0          0       0.012808  False
1002         12         0          0       0.044207  False
1003         18         0          0       0.029387  False
1004         27         1          0       0.041323  False
1005          4         1          0       0.179479   True
[12]:
num_mail_send = df_book_part2[df_book_part2['send']].shape[0]
# Parentheses are required: & binds more tightly than ==.
num_purchased = df_book_part2[df_book_part2['send'] & (df_book_part2['purchased'] == 1)].shape[0]
profit = num_purchased * 7 - num_mail_send
print('Based on our prediction model, we should send {} mails.'.format(num_mail_send))
print('We would expect receiving {} orders and our profit is ${}.'.format(num_purchased, profit))
toggle()
Based on our prediction model, we should send 128 mails.
We would expect receiving 38 orders and our profit is $138.
[12]: