{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "
Enter Password:
\n", " \n", " \n", "
\n", "\n", "show code\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%run ../initscript.py\n", "HTML(\"\"\"\n", "
\n", "
Enter Password:
\n", " \n", " \n", "
\n", "\n", "show code\n", "\"\"\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " show code\n", " " ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%run loadmlfuncs.py\n", "\n", "df_book_part1 = pd.read_csv(dataurl+'book_train.csv', header=0, index_col='customer')\n", "df_book_part2 = pd.read_csv(dataurl+'book_validation.csv', header=0, index_col='customer')\n", "\n", "toggle()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Logistic Regression\n", "\n", "Logistic regression is a popular method for classifying individuals, despite the word \"regression\" in its name. Given the values of a set of explanatory variables, it estimates the probability that an individual belongs to a particular category. We will demonstrate the method with a book club case." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Book Club\n", "\n", "A book club is about to release a new title, \"The Art History of Florence\". The club has mailed a promotion to a sample of its customer base on two separate occasions, each time to 1000 randomly selected customers." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>month</th><th>art_book</th><th>purchased</th></tr>\n",
"    <tr><th>customer</th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1</th><td>24</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>16</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>15</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>22</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>5</th><td>15</td><td>0</td><td>1</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " month art_book purchased\n", "customer \n", "1 24 0 0\n", "2 16 0 0\n", "3 15 0 0\n", "4 22 0 0\n", "5 15 0 1" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_book_part1.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>month</th><th>art_book</th><th>purchased</th></tr>\n",
"    <tr><th>customer</th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1001</th><td>30</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>1002</th><td>12</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>1003</th><td>18</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>1004</th><td>27</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>1005</th><td>4</td><td>1</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
\n", "
" ], "text/plain": [ " month art_book purchased\n", "customer \n", "1001 30 0 0\n", "1002 12 0 0\n", "1003 18 0 0\n", "1004 27 1 0\n", "1005 4 1 0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_book_part2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The book club has collected the following variables for all 2000 customers:\n", "\n", "- `month`: months since the customer's last purchase at the time the promotion mail was sent\n", "\n", "- `art_book`: number of art books the customer has purchased before\n", "\n", "- `purchased`: whether the customer bought the new title \"The Art History of Florence\"\n", "\n", "It costs the book club \\$1 to send a mail, and each book sold generates \\$7 in profit. After the two promotions, an analyst at the book club realizes that the club actually lost money on both." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "net profit for the 1st promotion: -418\n", "net profit for the 2nd promotion: -433\n" ] } ], "source": [ "def calc_profit(df):\n", "    mail_cost = 1       # cost of sending one promotion mail\n", "    selling_profit = 7  # profit from selling one book\n", "    # profit from all sales minus the cost of mailing every customer in df\n", "    profit = df.purchased.sum() * selling_profit - df.month.count() * mail_cost\n", "    return profit\n", "\n", "print('net profit for the 1st promotion:', calc_profit(df_book_part1))\n", "print('net profit for the 2nd promotion:', calc_profit(df_book_part2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The manager believes that the book club should build a model to predict each customer's probability of purchasing, and then send the promotion mail only when that probability is high enough.\n", "\n", "- can we derive a prediction model after collecting the data from the first promotion? 
\n", "\n", "- can this prediction model improve the second promotion?\n", "\n", "We expect this prediction model to use `month` and `art_book` to predict `purchased`, which suggests a regression equation `purchased` $\\sim$ `month` $+$ `art_book`. However, the $y$ variable `purchased` is either 0 or 1, as a scatter plot of `purchased` against `month` ($y$ vs $x$) shows:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df_book_part1.plot.scatter(x='month', y='purchased')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The graph violates several linear regression assumptions:\n", "\n", "- there is no linear relationship between the independent and dependent variables.\n", "\n", "- the error term is probably not normally distributed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Model\n", "\n", "Instead of using the binary variable (purchased or not), we may consider the **purchasing probability** $p$ as the dependent variable in a regression equation such as\n", "\n", "\\begin{align}\n", "p &= \\beta_0 + \\beta_1 \\times \\text{month} + \\beta_2 \\times \\text{art_book}\n", "\\end{align}\n", "\n", "\n", "Although $p$ is continuous, we still cannot run a linear regression on $p$ because it is bounded in the range $[0,1]$. In linear regression, the dependent variable should be able to take any value in $(-\\infty, +\\infty)$.\n", "\n", "We introduce **odds** and **utility** (the log of the odds, also known as the logit)\n", "\n", "\\begin{align}\n", "\\text{odds} &= \\frac{p}{1-p} \\nonumber \\\\\n", "\\text{utility} &= \\log(\\text{odds}) \\nonumber\n", "\\end{align}\n", "\n", "Note that **utility** ranges over $(-\\infty, +\\infty)$, so a regression equation can now be used\n", "\n", "\\begin{align}\n", "\\text{utility} &= \\beta_0 + \\beta_1 \\times \\text{month} + \\beta_2 \\times \\text{art_book}\n", "\\end{align}" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " show code\n", " " ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p = np.linspace(0,1,100)\n", "odds = p / (1-p)\n", "utility = np.log(odds)\n", "\n", "plt.subplots(1, 2, figsize=(12,5))\n", "plt.subplot(1, 2, 1)\n", "plt.plot(p, odds)\n", "plt.xlabel('$p$')\n", "plt.ylabel('odds')\n", "\n", "plt.subplot(1, 2, 2)\n", "plt.plot(p, utility)\n", "plt.xlabel('$p$')\n", "plt.ylabel('utility')\n", "\n", "plt.show()\n", "toggle()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In practice, we can fit a logistic regression directly on the binary dependent variable; statistical tools perform all of these transformations for us. In Python, we can use either `statsmodels`, which provides a statistical summary, or `sklearn`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Optimization terminated successfully.\n", " Current function value: 0.251705\n", " Iterations 7\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Logit Regression Results
Dep. Variable: purchased No. Observations: 999
Model: Logit Df Residuals: 996
Method: MLE Df Model: 2
Date: Mon, 29 Jul 2019 Pseudo R-squ.: 0.1206
Time: 20:41:18 Log-Likelihood: -251.45
converged: True LL-Null: -285.95
LLR p-value: 1.044e-15
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err z P>|z| [0.025 0.975]
const -2.2262 0.239 -9.316 0.000 -2.695 -1.758
month -0.0706 0.019 -3.670 0.000 -0.108 -0.033
art_book 0.9888 0.135 7.343 0.000 0.725 1.253
" ], "text/plain": [ "\n", "\"\"\"\n", " Logit Regression Results \n", "==============================================================================\n", "Dep. Variable: purchased No. Observations: 999\n", "Model: Logit Df Residuals: 996\n", "Method: MLE Df Model: 2\n", "Date: Mon, 29 Jul 2019 Pseudo R-squ.: 0.1206\n", "Time: 20:41:18 Log-Likelihood: -251.45\n", "converged: True LL-Null: -285.95\n", " LLR p-value: 1.044e-15\n", "==============================================================================\n", " coef std err z P>|z| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "const -2.2262 0.239 -9.316 0.000 -2.695 -1.758\n", "month -0.0706 0.019 -3.670 0.000 -0.108 -0.033\n", "art_book 0.9888 0.135 7.343 0.000 0.725 1.253\n", "==============================================================================\n", "\"\"\"" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# import Logit from statsmodels.api; recent statsmodels releases no longer expose it in formula.api\n", "from statsmodels.api import Logit, add_constant\n", "X = add_constant(df_book_part1[['month','art_book']])  # add the intercept column\n", "y = df_book_part1['purchased']\n", "model = Logit(y, X)\n", "model.fit().summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The signs of the coefficients indicate whether the probability of purchasing the book increases or decreases as each variable increases. For example, the probability of purchasing the book decreases as `month` increases (because of its minus sign) and increases as `art_book` increases (because of its plus sign).\n", "\n", "However, use caution when interpreting the magnitudes of the coefficients, because they depend on the scale of each variable. For example, the coefficient of `month` is smaller in absolute value than that of `art_book` in part because `month` generally takes larger values than `art_book`.\n", "\n", "The value $\\exp$(coefficient) is more interpretable. 
For example, if `art_book` increases by 1, the odds of purchasing the book are multiplied by about $\\exp(0.9888) \\approx 2.69$. Values of $\\exp$(coefficient) well above or below 1 therefore indicate variables with a strong effect on the odds." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "intercept= [-2.22621349] \n", "coefficient = [[-0.07061966 0.98880806]]\n" ] } ], "source": [ "from sklearn import linear_model\n", "\n", "X = df_book_part1[['month','art_book']]\n", "y = df_book_part1['purchased']\n", "\n", "# C is the inverse regularization strength; a large C makes the penalty negligible\n", "clf = linear_model.LogisticRegression(C=1e5, solver='lbfgs')\n", "clf.fit(X, y)\n", "print('intercept=', clf.intercept_, '\\ncoefficient =', clf.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Validation\n", "\n", "After we have obtained $\\beta_0, \\beta_1$ and $\\beta_2$, we can apply equation $(2)$ to the validation data to compute each customer's utility. The probability then follows from\n", "\n", "\\begin{align}\n", "p = \\frac{\\exp(\\text{utility})}{1+\\exp(\\text{utility})} \\nonumber\n", "\\end{align}" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": false }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " show code\n", " " ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "utility = np.linspace(-10,10,100)\n", "p = np.exp(utility) / (1 + np.exp(utility))\n", "\n", "plt.subplots(1, 1, figsize=(12,5))\n", "plt.plot(utility, p)\n", "plt.xlabel('utility')\n", "plt.ylabel('$p$')\n", "plt.show()\n", "toggle()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A common default is to decide based on a threshold of 0.5. That is, if the probability that a customer purchases the book is greater than 0.5, we send a mail.\n", "\n", "However, the book club case has a simple break-even point. Mailing a customer with purchase probability $p$ yields an expected profit of $7p - 1$, which is positive only when $p$ exceeds the cost-to-profit ratio $1/7$. Our strategy is therefore as follows: if the probability that a customer purchases the book is greater than $1/7$, we send a mail; otherwise we do not.\n", "\n", "We use `sklearn` here because its `predict_proba` method returns the predicted probabilities directly." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>month</th><th>art_book</th><th>purchased</th><th>purchase_prob</th><th>send</th></tr>\n",
"    <tr><th>customer</th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1001</th><td>30</td><td>0</td><td>0</td><td>0.012808</td><td>False</td></tr>\n",
"    <tr><th>1002</th><td>12</td><td>0</td><td>0</td><td>0.044207</td><td>False</td></tr>\n",
"    <tr><th>1003</th><td>18</td><td>0</td><td>0</td><td>0.029387</td><td>False</td></tr>\n",
"    <tr><th>1004</th><td>27</td><td>1</td><td>0</td><td>0.041323</td><td>False</td></tr>\n",
"    <tr><th>1005</th><td>4</td><td>1</td><td>0</td><td>0.179479</td><td>True</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
\n", "
" ], "text/plain": [ " month art_book purchased purchase_prob send\n", "customer \n", "1001 30 0 0 0.012808 False\n", "1002 12 0 0 0.044207 False\n", "1003 18 0 0 0.029387 False\n", "1004 27 1 0 0.041323 False\n", "1005 4 1 0 0.179479 True" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# column 1 of predict_proba is the probability of class 1 (purchased)\n", "df_book_part2['purchase_prob'] = clf.predict_proba(df_book_part2[['month','art_book']])[:,1]\n", "# mail only customers whose predicted probability exceeds the break-even point 1/7\n", "df_book_part2['send'] = df_book_part2['purchase_prob'] > 1/7\n", "df_book_part2.head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Based on our prediction model, we should send 128 mails.\n", "We would expect to receive 38 orders and our profit is $138.\n" ] }, { "data": { "text/html": [ "\n", " show code\n", " " ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "num_mail_send = df_book_part2[df_book_part2['send']].shape[0]\n", "# parenthesize the comparison: & binds more tightly than == in Python\n", "num_purchased = df_book_part2[df_book_part2['send'] & (df_book_part2['purchased'] == 1)].shape[0]\n", "profit = num_purchased * 7 - num_mail_send\n", "print('Based on our prediction model, we should send {} mails.'.format(num_mail_send))\n", "print('We would expect to receive {} orders and our profit is ${}.'.format(num_purchased, profit))\n", "toggle()" ] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, 
"toc_window_display": true }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 2 }