[1]:
%run initscript.py
%run ./files/loadregfuncs.py

Applied Regression Analysis

Regression Assumptions

As one of the most important types of data analysis, regression analysis is a way of mathematically sorting out which factors have an impact on an outcome. It focuses on questions such as:

  • Which factors matter most?

  • Which can we ignore?

  • How do those factors interact with each other?

  • Perhaps most importantly, how certain are we about all of these factors?

Those factors are called variables, and we have

  • Dependent (or response or target) variable: the main factor that you’re trying to understand or predict

  • Independent (or explanatory or predictor) variables: the factors you suspect have an impact on your dependent variable

Generally, a regression model finds a line that fits the data “best” such that the \(n\) residuals — one for each observed data point — are as small as possible in some overall sense. One way to achieve this goal is to use the “ordinary least squares (OLS) criterion,” which says to “minimize the sum of the squared prediction errors.”

What does the line estimate?

When looking to summarize the relationship between predictors \(X\) and a response \(Y\), we are interested in knowing the relationship in the population. The only way we could ever know it, though, is to be able to collect data on everybody in the population — most often an impossible task. We have to rely on taking and using a sample of data from the population to estimate the population regression line.

Therefore, several assumptions about the population are required in order to use the sample regression line to estimate the population regression line — this is statistical inference in a regression context.

Regression assumptions:

  1. There is a population regression line. It joins the means of the dependent variable for all values of the explanatory variables, and the mean of the errors is zero.

  2. For any values of the explanatory variables, the variance (or standard deviation) of the dependent variable is a constant, the same for all such values.

  3. For any values of the explanatory variables, the dependent variable is normally distributed.

  4. The errors are probabilistically independent.

These assumptions represent an idealization of reality. From a practical point of view, if they are a close approximation to reality, then the analysis is valid. But if the assumptions are grossly violated, statistical inferences that are based on them should be viewed with suspicion.

First Assumption

Let \(Y\) be the dependent variable and there are \(k\) explanatory variables \(X_1\) through \(X_k\). The assumption implies that there is an exact linear relationship between the mean of all \(Y\)’s for any fixed values of \(X\)’s and the \(X\)’s.

[2]:
interact(draw_sample,
         flag=widgets.Checkbox(value=False,description='Show Sample',disabled=False),
         id=widgets.Dropdown(options=range(1,4),value=1,description='Sample ID:',disabled=False));

Suppose we have a population as shown in the plot. For example, \(X\) is the years of college education and \(Y\) is the annual salary. Given the years of education, there is a range of possible salaries for different individuals.

The first assumption says that, for each \(X\), the mean of the \(Y\)’s (\(\overline{Y}\)) lies on the maroon line \(\overline{Y} = 3.08 X + 9.87\), which is the population regression line joining the means. For each data point, its \(Y\) value is expressed by the population regression line plus an error, \(Y = 3.08 X + 9.87 + \epsilon\), where \(\epsilon\) is the deviation of the data point from the maroon line.

Note that since the population is not accessible, the population regression (maroon) line is not observable in reality.

[3]:
interact(draw_sample,
         flag=widgets.Checkbox(value=True,description='Show Sample',disabled=False),
         id=widgets.Dropdown(options=range(1,4),value=1,description='Sample ID:',disabled=False));

The estimated regression line, shown in orange, is based on the sample data and is usually different from the population regression line. The regression lines derived from the first two samples deviate substantially from the population regression line.

Note that an error \(\epsilon\) is not quite the same as a residual \(e\). An error is the vertical distance from a point to the (unobservable) population regression line. A residual is the vertical distance from a point to the estimated regression line.

Second Assumption

The second assumption requires constant error variance; the technical term is homoscedasticity. This assumption is often questionable because the variation in \(Y\) often increases as \(X\) increases. For example, the variation in spending increases as salary increases. We say the data exhibit heteroscedasticity (nonconstant error variance) if the variability of the \(Y\) values is larger for some \(X\) values than for others. The easiest way to detect nonconstant error variance is through a visual inspection of a scatterplot.

It is usually sufficient to “visually” interpret a residuals versus fitted values plot. However, there are several hypothesis tests on the residuals for checking constant variance.

  • Brown-Forsythe test (Modified Levene Test)

  • Cook-Weisberg score test (Breusch-Pagan Test)

  • Bartlett’s Test
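For instance, the Breusch-Pagan (Cook-Weisberg) test is available in statsmodels. The following is a minimal sketch, not part of the original analysis, assuming a fitted OLS results object named result such as those produced later in this notebook:

# Hedged sketch: Breusch-Pagan test for nonconstant error variance.
# Assumes `result` is a fitted statsmodels OLS results object.
from statsmodels.stats.diagnostic import het_breuschpagan

# The test regresses the squared residuals on the model's design matrix.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(result.resid, result.model.exog)
print('Breusch-Pagan LM p-value: {:.4f}'.format(lm_pvalue))
# A small p-value is evidence against the constant-variance assumption.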

Third Assumption

The third assumption is equivalent to stating that the errors are normally distributed. We can check this by forming a histogram (or a Q-Q plot) of the residuals.

  • If the assumption holds, the histogram should be approximately symmetric and bell-shaped, and the points of a Q-Q plot should be close to a 45 degree line.

  • If there is an obvious skewness or some other nonnormal property, this indicates a violation of assumption 3.

[4]:
interact(draw_qq,
         dist=widgets.Dropdown(options=['normal','student_t','uniform','triangular'],
                               value='student_t',description='Distribution:',disabled=False));

The plot shows the Q-Q plot and histogram of four distributions: normal, Student’s t, uniform, and triangular.

Besides graphical method for assessing residual normality, there are several hypothesis tests in which the null hypothesis is that the errors have a normal distribution, such as

  • Kolmogorov-Smirnov Test

  • Anderson-Darling Test

  • Shapiro-Wilk Test

  • Ryan-Joiner Test

If those tests are available in software, then a large p-value indicates failure to reject this null hypothesis.
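Several of these tests are available in scipy. A minimal sketch (an illustration added here, not part of the original notebook), assuming a fitted statsmodels OLS results object named result:

# Hedged sketch: normality tests applied to the residuals of a fitted model.
# Assumes `result` is a fitted statsmodels OLS results object.
from scipy import stats

resid = result.resid
sw_stat, sw_p = stats.shapiro(resid)                              # Shapiro-Wilk
ad = stats.anderson(resid, dist='norm')                           # Anderson-Darling
ks = stats.kstest((resid - resid.mean()) / resid.std(), 'norm')   # Kolmogorov-Smirnov
print('Shapiro-Wilk p-value: {:.4f}'.format(sw_p))
print('Anderson-Darling statistic: {:.3f}'.format(ad.statistic))
print('Kolmogorov-Smirnov p-value: {:.4f}'.format(ks.pvalue))
# Large p-values mean we fail to reject the null hypothesis of normality.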

Fourth Assumption

The fourth assumption requires probabilistic independence of the errors.

  • This assumption means that information on some of the errors provides no information on the values of the other errors. However, when the sample data have been collected over time and the regression model fails to effectively capture any time trends, the random errors in the model are often positively correlated over time. This phenomenon is known as autocorrelation (or serial correlation) and can sometimes be detected by plotting the model residuals versus time.

  • For cross-sectional data, this assumption is usually taken for granted.

  • For time-series data, this assumption is often violated because of autocorrelation.

  • The Durbin-Watson statistic is one measure of autocorrelation and thus measures the extent to which assumption 4 is violated.
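The Durbin-Watson statistic appears in the statsmodels summary tables shown later; it can also be computed directly. A minimal sketch, assuming a fitted OLS results object named result:

# Hedged sketch: Durbin-Watson statistic for autocorrelation of the residuals.
# Assumes `result` is a fitted statsmodels OLS results object.
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(result.resid)
# Values near 2 suggest little autocorrelation; values well below 2 suggest
# positive autocorrelation, and values well above 2 negative autocorrelation.
print('Durbin-Watson statistic: {:.3f}'.format(dw))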

Summary

We now summarize the four assumptions that underlie the linear regression model:

  • The mean of the response is a Linear function of the explanatory variables.

  • The errors are Independent.

  • The errors are Normally distributed.

  • The errors have Equal variances.

The first letters that are highlighted spell “LINE”.

Regression Estimation

We consider a simple example to illustrate how to use python package statsmodels to perform regression analysis and predictions.

Case: Manufacture Costs

A manufacturer wants to get a better understanding of its costs. We consider two variables: machine hours and production runs. A production run can produce a large batch of products. Once a production run is completed, the factory must set up for the next production run. Therefore, the costs include fixed costs associated with each production run and variable costs based on machine hours.

A scatterplot (with regression line and mean confidence interval) and a time series plot are presented:

[5]:
df = pd.read_csv(dataurl+'costs.csv', header=0, index_col='month')
def draw_scatterplot(x, y):
    fig, axes = plt.subplots(1, 2, figsize = (12,6))
    sns.regplot(x=df[x], y=df[y], data=df, ax=axes[0])
    axes[0].set_title('Scatterplot: X={} Y={}'.format(x,y))
    df[y].plot(ax=axes[1], xlim=(0,40))
    axes[1].set_title('Time series plot: {}'.format(y))
    plt.show()

interact(draw_scatterplot,
         x=widgets.Dropdown(options=df.columns,value=df.columns[0],
                            description='X:',disabled=False),
         y=widgets.Dropdown(options=df.columns,value=df.columns[2],
                            description='Y:',disabled=False));

g=sns.pairplot(df, height=3);
for i, j in zip(*np.triu_indices_from(g.axes, 1)):
    g.axes[i, j].set_visible(False)

toggle()
[5]:
../_images/docs_regression_analysis_24_2.png

We regress costs against production runs with formula:

\[\text{cost} \sim \text{production runs}\]

which is a simple regression because there is only one explanatory variable.

[6]:
df = pd.read_csv(dataurl+'costs.csv', header=0, index_col='month')
result = analysis(df, 'cost', ['production_runs'], printlvl=5)

fig = plt.figure(figsize=(15,8))
fig = sm.graphics.plot_regress_exog(result, "production_runs", fig=fig);
OLS Regression Results
Dep. Variable: cost R-squared: 0.271
Model: OLS Adj. R-squared: 0.250
Method: Least Squares F-statistic: 12.64
Date: Wed, 14 Aug 2019 Prob (F-statistic): 0.00114
Time: 08:54:41 Log-Likelihood: -379.62
No. Observations: 36 AIC: 763.2
Df Residuals: 34 BIC: 766.4
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 7.561e+04 6808.611 11.104 0.000 6.18e+04 8.94e+04
production_runs 655.0707 184.275 3.555 0.001 280.579 1029.562
Omnibus: 0.597 Durbin-Watson: 2.090
Prob(Omnibus): 0.742 Jarque-Bera (JB): 0.683
Skew: -0.264 Prob(JB): 0.711
Kurtosis: 2.580 Cond. No. 160.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

standard error of estimate:9457.23946


ANOVA Table:

sum_sq df F PR(>F)
production_runs 1.130248e+09 1.0 12.637029 0.001135
Residual 3.040939e+09 34.0 NaN NaN
../_images/docs_regression_analysis_26_3.png
../_images/docs_regression_analysis_26_4.png
KstestResult(statistic=0.5, pvalue=7.573316153566455e-09)
../_images/docs_regression_analysis_26_6.png

We perform a multiple regression on costs against both explanatory variables (production runs and machine hours) by using formula:

\[\text{cost} \sim \text{production runs} + \text{machine hours}\]
[7]:
df = pd.read_csv(dataurl+'costs.csv', header=0, index_col='month')
result = analysis(df, 'cost', ['production_runs','machine_hours'], printlvl=5)

fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_partregress_grid(result, fig=fig);

fig = plt.figure(figsize=(10,8))
fig = sm.graphics.influence_plot(result,criterion="cook", fig=fig);
OLS Regression Results
Dep. Variable: cost R-squared: 0.866
Model: OLS Adj. R-squared: 0.858
Method: Least Squares F-statistic: 107.0
Date: Wed, 14 Aug 2019 Prob (F-statistic): 3.75e-15
Time: 08:54:44 Log-Likelihood: -349.07
No. Observations: 36 AIC: 704.1
Df Residuals: 33 BIC: 708.9
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3996.6782 6603.651 0.605 0.549 -9438.551 1.74e+04
production_runs 883.6179 82.251 10.743 0.000 716.276 1050.960
machine_hours 43.5364 3.589 12.129 0.000 36.234 50.839
Omnibus: 3.142 Durbin-Watson: 1.313
Prob(Omnibus): 0.208 Jarque-Bera (JB): 2.259
Skew: -0.609 Prob(JB): 0.323
Kurtosis: 3.155 Cond. No. 1.42e+04


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.42e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

standard error of estimate:4108.99309


ANOVA Table:

sum_sq df F PR(>F)
production_runs 1.948557e+09 1.0 115.409712 2.611395e-12
machine_hours 2.483773e+09 1.0 147.109602 1.046454e-13
Residual 5.571662e+08 33.0 NaN NaN
../_images/docs_regression_analysis_28_3.png
KstestResult(statistic=0.5833333333333333, pvalue=3.6048681911998855e-12)
../_images/docs_regression_analysis_28_5.png
<Figure size 720x576 with 0 Axes>
../_images/docs_regression_analysis_28_7.png

Statistical Inferences

The regression output includes statistics for the regression model:

\(R^2\) (coefficient of determination) is an important measure of the goodness of fit of the least squares line. The formula for \(R^2\) is

\[R^2 = 1 - \frac{\sum e_i^2}{ \sum (Y_i - \overline{Y})^2} = 1 - \frac{\sum (Y_i - \hat{Y}_i)^2}{ \sum (Y_i - \overline{Y})^2}\]

where \(\overline{Y} = \frac{1}{n}\sum Y_i\) is the mean of the observed data and \(\hat{Y}_i = a + b X_i\) is the fitted value.
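As a quick check of this definition, \(R^2\) can be computed directly from the observed and fitted values. A minimal sketch, where y and y_hat are hypothetical arrays of observed and fitted values:

# Hedged sketch: computing R-squared from its definition.
# `y` and `y_hat` are hypothetical numpy arrays of observed and fitted values.
import numpy as np

sse = np.sum((y - y_hat)**2)            # sum of squared residuals
sst = np.sum((y - np.mean(y))**2)       # total sum of squares about the mean
r_squared = 1 - sse / sst
# For a fitted statsmodels result, this matches result.rsquared.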

  • It is the percentage of variation of the dependent variable explained by the regression.

  • It always ranges between 0 and 1. The better the linear fit is, the closer \(R^2\) is to 1.

  • \(R^2\) is the square of the multiple R, i.e., the correlation between the actual \(Y\) values and the fitted \(\hat{Y}\) values.

  • \(R^2\) can only increase when extra explanatory variables are added to an equation.

  • If \(R^2\) = 1, all of the data points fall perfectly on the regression line. The predictor \(X\) accounts for all of the variation in \(Y\).

  • If \(R^2\) = 0, the estimated regression line is perfectly horizontal. The predictor \(X\) accounts for none of the variation in \(Y\).

Adjusted \(R^2\) is an alternative measure that adjusts \(R^2\) for the number of explanatory variables in the equation. It is used primarily to monitor whether extra explanatory variables really belong in the equation.

Standard error of estimate (\(s_e\)), also known as the regression standard error or the residual standard error, provides a good indication of how useful the regression line is for predicting \(Y\) values from \(X\) values.

The \(F\)-statistic reports the result of the \(F\)-test for the overall fit, where the null hypothesis is that all coefficients of the explanatory variables are zero and the alternative is that at least one of these coefficients is not zero. It is possible, because of multicollinearity, to have all small \(t\)-values even though the variables as a whole have significant explanatory power.

Also, for each explanatory variable:

  • std err: reflects the level of accuracy of the coefficient estimates. The lower it is, the higher the level of accuracy.

  • \(t\)-values for the individual regression coefficients, computed as \(t\text{-value} = \text{coefficient}/\text{std err}\).

  • \(P >|t|\) is the p-value. A p-value of less than 0.05 is considered statistically significant.

    • When the p-value of a coefficient is greater than or equal to 0.05, we don’t reject the null hypothesis \(H_0\): the coefficient is zero. The following two realities are possible:

      • We committed a Type II error. That is, in reality the coefficient is nonzero and our sample data just didn’t provide enough evidence.

      • There is not much of a linear relationship between explanatory and dependent variables, but they could have nonlinear relationship.

    • When the p-value of a coefficient is less than 0.05, we reject the null hypothesis \(H_0\): the coefficient is zero. The following two realities are possible:

      • We committed a Type I error. That is, in reality the coefficient is zero and we have a bad sample.

      • There is indeed a linear relationship, but a nonlinear function would fit the data even better.

  • Confidence interval \([0.025, 0.975]\): the range in which the true coefficient is likely to fall (with 95% confidence).
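All of these quantities can also be read programmatically from a fitted statsmodels results object. A minimal sketch, assuming a fitted object named result such as those produced in the cells above:

# Hedged sketch: extracting fit statistics from a statsmodels results object.
# Assumes `result` is a fitted OLS results object.
import numpy as np

print('R-squared:             ', result.rsquared)
print('Adjusted R-squared:    ', result.rsquared_adj)
print('Std error of estimate: ', np.sqrt(result.mse_resid))
print('F-statistic, p-value:  ', result.fvalue, result.f_pvalue)
print(result.params)                  # coefficients
print(result.bse)                     # standard errors
print(result.tvalues)                 # t-values
print(result.pvalues)                 # p-values
print(result.conf_int(alpha=0.05))    # 95% confidence intervals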

R-squared

  • Since \(R^2\) is the square of a correlation, it only quantifies the strength of a linear relationship. In other words, it is possible that \(R^2 = 0\%\), suggesting there is no linear relation between \(X\) and \(Y\), and yet a perfect curved (or “curvilinear”) relationship exists.

[8]:
x = np.linspace(-5, 5, 10)
y = x*x
df = pd.DataFrame({'X':x,'Y':y})
result = ols(formula="Y ~ X", data=df).fit()

fig, axes = plt.subplots(1,1,figsize=(5,4))
_ = sns.regplot(x=df.X, y=df.Y, data=df, ci=None, ax=axes)

plt.ylim(-5, 30)
axes.text(0.6, 28, 'R-squared = %0.2f' % result.rsquared)
axes.text(0.6, 25, 'Correlation = %0.2f' % np.sqrt(result.rsquared))
plt.show()
toggle()
../_images/docs_regression_analysis_33_0.png
[8]:
  • \(R^2\) can be greatly affected by just one data point (or a few data points).

[9]:
x = np.linspace(-5, 5, 5)
y = -1.5*x + 1*np.random.normal(size=5)
y[4] = 50
df = pd.DataFrame({'X':x,'Y':y})
res = ols(formula="Y ~ X", data=df).fit()

fig, axes = plt.subplots(1,2,figsize=(12,6))
_ = sns.regplot(x=df.X, y=df.Y, data=df, ci=None, ax=axes[0])
axes[0].text(-4, 48, 'R-squared = %0.2f' % res.rsquared)
axes[0].text(-4, 45, 'Correlation = %0.2f' % np.sqrt(res.rsquared))

xx = x[:4]
yy = y[:4]
df = pd.DataFrame({'X':xx,'Y':yy})
res = ols(formula="Y ~ X", data=df).fit()
_ = sns.regplot(x=df.X, y=df.Y, data=df, ci=None, ax=axes[1])
axes[1].text(-4, 48, 'R-squared = %0.2f' % res.rsquared)
axes[1].text(-4, 45, 'Correlation = %0.2f' % np.sqrt(res.rsquared))
axes[1].plot(x[4], y[4], 'bo')

plt.show()
toggle()
../_images/docs_regression_analysis_35_0.png
[9]:
  • A large \(R^2\) value should not be interpreted as meaning that the estimated regression line fits the data well. Another function might better describe the trend in the data.

[10]:
x = np.linspace(1, 5, 20)
y = np.power(x,3)
df = pd.DataFrame({'X':x,'Y':y})
res = ols(formula="Y ~ X", data=df).fit()

fig, axes = plt.subplots(1,1,figsize=(6,6))
_ = sns.regplot(x=df.X, y=df.Y, data=df, ci=None, ax=axes)
axes.text(1, 120, 'R-squared = %0.2f' % res.rsquared)
plt.title('Data generated by equation $Y = X^3$')
plt.show()
toggle()
../_images/docs_regression_analysis_37_0.png
[10]:
  • Because statistical significance does not imply practical significance, a large \(R^2\) value does not imply that the regression coefficients are meaningfully different from 0. And a large \(R^2\) value does not necessarily mean the regression model can provide good predictions: it is still possible to get prediction intervals or confidence intervals that are too wide to be useful.

Residual Plot

A residual plot checks whether the regression model is appropriate for a dataset. It fits a simple linear regression, removes the fitted trend, and then plots the residual value for each observation. Ideally, these values should be randomly scattered around \(y = 0\).
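A minimal sketch of such a plot, assuming a fitted statsmodels OLS results object named result (the analysis helper used in this notebook produces these plots automatically):

# Hedged sketch: residuals versus fitted values plot.
# Assumes `result` is a fitted statsmodels OLS results object.
import matplotlib.pyplot as plt

plt.scatter(result.fittedvalues, result.resid)
plt.axhline(y=0, color='gray', linestyle='--')   # reference line at zero
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()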

A “Good” Plot

Example: The dataset includes measurements of the total lifetime consumption of alcohol and of arm strength on a random sample of n = 50 alcoholic men. We are interested in determining whether or not alcohol consumption is linearly related to muscle strength.

[11]:
df = pd.read_csv(dataurl+'alcoholarm.txt', delim_whitespace=True, header=0)
result = analysis(df, 'strength', ['alcohol'], printlvl=2)
../_images/docs_regression_analysis_42_0.png
../_images/docs_regression_analysis_42_1.png

The scatter plot (on the left) suggests a decreasing linear relationship between alcohol and arm strength. It also suggests that there are no unusual data points in the data set.

This residual (residuals vs. fits) plot is a good example of one suggesting that the simple linear regression model is appropriate:

  • The residuals “bounce randomly” around the 0 line. This suggests that the assumption that the relationship is linear is reasonable.

  • The residuals roughly form a “horizontal band” around the 0 line. This suggests that the variances of the error terms are equal.

  • No one residual “stands out” from the basic random pattern of residuals. This suggests that there are no outliers.

In general, you expect your residuals vs. fits plots to look something like the plot above. Interpreting these plots is subjective, and many first-time users tend to over-interpret them by treating every twist and turn as something potentially troublesome. We also should not put too much weight on residual plots based on small data sets, because the sample size may be too small to make interpretation of a residual plot worthwhile.

Nonlinear Relationship

Example: Is tire tread wear linearly related to mileage? As a result of the experiment, the researchers obtained a data set containing the mileage (in 1000 miles) driven and the depth of the remaining groove (in mils).

[12]:
df = pd.read_csv(dataurl+'treadwear.txt', sep='\s+', header=0)
result = analysis(df, 'groove', ['mileage'], printlvl=2)
../_images/docs_regression_analysis_45_0.png
../_images/docs_regression_analysis_45_1.png

Non-constant Error Variance

Example: How is plutonium activity related to alpha particle counts? Plutonium emits subatomic particles called alpha particles. Devices used to detect plutonium record the intensity of alpha particle strikes in counts per second. The dataset contains 23 samples of plutonium.

[13]:
df = pd.read_csv(dataurl+'alphapluto.txt', sep='\s+', header=0)
result = analysis(df, 'alpha', ['pluto'], printlvl=2)
../_images/docs_regression_analysis_47_0.png
../_images/docs_regression_analysis_47_1.png
[14]:
df = pd.read_csv(dataurl+'realestate.txt', sep='\s+', header=0)
result = analysis(df, 'SalePrice', ['SqFeet'], printlvl=2)
../_images/docs_regression_analysis_48_0.png
../_images/docs_regression_analysis_48_1.png

Outliers

[15]:
df = pd.read_csv(dataurl+'anscombe.txt', sep='\s+', header=0)
result = analysis(df, 'y3', ['x3'], printlvl=2)
../_images/docs_regression_analysis_50_0.png
../_images/docs_regression_analysis_50_1.png

An alternative to the residuals vs. fits plot is a residuals vs. predictor plot. It is a scatter plot with the residuals on the \(y\)-axis and the predictor values on the \(x\)-axis. The interpretation of a residuals vs. predictor plot is identical to that for a residuals vs. fits plot.

Normal Q-Q Plot of Residuals

A normal Q-Q plot shows whether the residuals are normally distributed by checking whether they follow a straight line closely or deviate from it severely.
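A minimal sketch for drawing such a plot with statsmodels, assuming a fitted OLS results object named result:

# Hedged sketch: normal Q-Q plot of the residuals.
# Assumes `result` is a fitted statsmodels OLS results object.
import statsmodels.api as sm
import matplotlib.pyplot as plt

sm.qqplot(result.resid, line='s')   # 's' adds a standardized reference line
plt.show()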

A “Good” Plot

[16]:
df = pd.read_csv(dataurl+'alcoholarm.txt', sep='\s+', header=0)
result = analysis(df, 'strength', ['alcohol'], printlvl=3)
../_images/docs_regression_analysis_55_0.png
../_images/docs_regression_analysis_55_1.png
KstestResult(statistic=0.3217352865304308, pvalue=4.179928356341629e-05)

More Examples

Example 1: A Q-Q plot of the residuals after a simple regression model is used to fit ‘time to next eruption’ against ‘duration of last eruption’ for Old Faithful geyser, which was named for its frequent and somewhat predictable eruptions. Old Faithful is located in Yellowstone’s Upper Geyser Basin in the southwest section of the park.

[17]:
df = pd.read_csv(dataurl+'oldfaithful.txt', delim_whitespace=True, header=0)
result = analysis(df, 'waiting', ['eruption'], printlvl=3)
../_images/docs_regression_analysis_58_0.png
../_images/docs_regression_analysis_58_1.png
KstestResult(statistic=0.3748566542043297, pvalue=7.87278404247441e-35)

How to interpret?

[18]:
hide_answer()
[18]:

The histogram is roughly bell-shaped so it is an indication that it is reasonable to assume that the errors have a normal distribution.

The pattern of the normal probability plot is straight, so this plot also provides evidence that it is reasonable to assume that the errors have a normal distribution.

\(\blacksquare\)

Example 2: The data contains ‘infection risk in a hospital’ and ‘average length of stay in the hospital’.

[19]:
df = pd.read_csv(dataurl+'infection12.txt', delim_whitespace=True, header=0)
result = analysis(df, 'InfctRsk', ['Stay'], printlvl=3)
../_images/docs_regression_analysis_63_0.png
../_images/docs_regression_analysis_63_1.png
KstestResult(statistic=0.11353657481026463, pvalue=0.39524100803555084)

How to interpret?

[20]:
hide_answer()
[20]:

The plot shows some deviation from the straight-line pattern indicating a distribution with heavier tails than a normal distribution. Also, the residual plot shows two possible outliers.

\(\blacksquare\)

Example 3: A stopping distance dataset with the stopping distance of a car and the speed of the car when the brakes were applied.

[21]:
df = pd.read_csv(dataurl+'carstopping.txt', delim_whitespace=True, header=0)
result = analysis(df, 'StopDist', ['Speed'], printlvl=3)
../_images/docs_regression_analysis_68_0.png
../_images/docs_regression_analysis_68_1.png
KstestResult(statistic=0.49181308907235927, pvalue=1.3918967453080744e-14)

How to interpret?

[22]:
hide_answer()
[22]:

Fitting a simple linear regression model to these data leads to problems with both curvature and nonconstant variance. The nonlinearity is reasonable due to physical laws. One possible remedy is to transform variable(s).

\(\blacksquare\)

Advice to Managers

An old statistics joke

A physicist, an engineer, and a statistician are on a hunting trip. They are walking through the woods when they spot a deer in the clearing.

The physicist calculates the distance to the target, the velocity and drop of the bullet, adjusts, and fires, missing the deer by five feet to the left.

The engineer looks frustrated. “You forgot to account for the wind. Give it here.” After licking a finger to determine the wind speed and direction, the engineer snatches the rifle and fires, missing the deer by five feet to the right.

Suddenly, without firing a shot, the statistician cheers, “Woo hoo! We got it!”

[23]:
hide_answer()
[23]:

Being precisely perfect on average can mean being actually wrong each time. Regression can keep missing several feet to the left or several feet to the right. Even if it averages out to the correct answer, regression can mean never actually hitting the target.

Unlike regression, machine learning predictions might be wrong on average, but when the predictions miss, they often don’t miss by much. Statisticians describe this as allowing some bias in exchange for reducing variance.

\(\blacksquare\)

Correlation vs Causation

  • Correlation is not causation. Recall the sales prediction case. It’s easy to say that there is a correlation between rain and monthly sales as shown by the regression. However, unless you are selling umbrellas, it might be difficult to prove that there is cause and effect.

  • Petabytes allow us to say “correlation is enough” and causality is dead (Chris Anderson). Given enough statistical evidence, it’s no longer necessary to understand why things happen: we need only know what things happen together.

What do you think?

[24]:
hide_answer()
[24]:

Correlation is not causation. When you see a correlation from a regression analysis, you cannot make assumptions. You have to go out and see what’s happening in the real world: what is the physical mechanism causing the relationship?

For example, go out and observe consumers buying your product in the rain, talk to them and find out what is actually causing them to make the purchase. The goal is NOT to figure out what is going on in the data but to figure out what is going on in the world.

Correlation is enough: the beer-and-diapers story. Some time ago, Wal-Mart ran queries on its point-of-sale systems covering 1.2 million baskets in all. Many correlations appeared. Some of these were obvious.


However, one correlation stood out like a sore thumb because it was so unexpected. Those queries revealed that, between 5pm and 7pm, customers tended to co-purchase beer and diapers. By moving these two items closer together, Wal-Mart reportedly saw the sales of both items increase geometrically.

The key question is “Can we take action on the basis of a correlation finding?” The answer depends primarily on two factors:

  1. Confidence that the correlation will reliably recur in the future. The higher that confidence level, the more reasonable it is to take action in response.

  2. The tradeoff between the risk and reward of acting. If the risk of acting and being wrong is extremely high, for example, acting on even a strong correlation may be a mistake.


\(\blacksquare\)

How do managers use it?

[25]:
hide_answer()
[25]:

Regression analysis is the go-to method in analytics. It helps us figure out what we can do. Managers often use it to

  • explain a phenomenon they want to understand (e.g. why did customer service calls drop last month?)

  • predict things about the future (e.g. what will sales look like over the next six months?)

  • decide what to do (e.g. should we go with this promotion or a different one?)

\(\blacksquare\)

What mistakes do managers make?

[26]:
hide_answer()
[26]:

Managers need to scope the problem. It is the manager’s job to identify the factors that may have an impact and ask the analyst to look at those. If you ask a data analyst to tell you something you don’t know, then you deserve what you get, which is bad analysis. It’s the same principle as flipping a coin: if you flip it enough times, you will eventually think you see something interesting, like a bunch of heads all in a row.

Regression analysis is very sensitive to bad data, so be careful about data collection and about whether to act on the data.

Don’t make the mistake of ignoring the error term. If the regression explains 90% of the relationship, that is great. But if it explains 10%, and you act like it is 90%, that is not good.

Don’t let data replace your intuition and judgment.

\(\blacksquare\)

Regression Analysis


Influential Points

An influential point is an outlier that greatly affects the slope of the regression line.

  • An outlier is a data point whose response \(y\) does not follow the general trend of the rest of the data.

  • A data point has high leverage if it has “extreme” predictor \(x\) values. The lack of neighboring observations means that the fitted regression model will pass close to that particular observation.

Example 1:

[27]:
df = pd.read_csv(dataurl+'influence1.txt', delim_whitespace=True, header=0)
result = analysis(df, 'y', ['x'], printlvl=1)

fig, ax = plt.subplots(figsize=(10,6), dpi=80)
fig = sm.graphics.influence_plot(result, alpha  = 0.05, ax = ax, criterion="cooks")
../_images/docs_regression_analysis_89_0.png
../_images/docs_regression_analysis_89_1.png

Example 2:

[28]:
df = pd.read_csv(dataurl+'influence2.txt', delim_whitespace=True, header=0)
result = analysis(df, 'y', ['x'], printlvl=1)

fig, ax = plt.subplots(figsize=(10,6), dpi=80)
fig = sm.graphics.influence_plot(result, alpha  = 0.05, ax = ax, criterion="cooks")
../_images/docs_regression_analysis_91_0.png
../_images/docs_regression_analysis_91_1.png

Example 3:

[29]:
df = pd.read_csv(dataurl+'influence3.txt', delim_whitespace=True, header=0)
result = analysis(df, 'y', ['x'], printlvl=1)

fig, ax = plt.subplots(figsize=(10,6), dpi=80)
fig = sm.graphics.influence_plot(result, alpha  = 0.05, ax = ax, criterion="cooks")
../_images/docs_regression_analysis_93_0.png
../_images/docs_regression_analysis_93_1.png

Example 4:

[30]:
df = pd.read_csv(dataurl+'influence4.txt', delim_whitespace=True, header=0)
result = analysis(df, 'y', ['x'], printlvl=1)

fig, ax = plt.subplots(figsize=(10,6), dpi=80)
fig = sm.graphics.influence_plot(result, alpha  = 0.05, ax = ax, criterion="cooks")
../_images/docs_regression_analysis_95_0.png
../_images/docs_regression_analysis_95_1.png

In all examples above, do they contain outliers or high leverage data points? Are identified data points influential?

[31]:
hide_answer()
[31]:

If an observation has a studentized residual that is larger than 3 (in absolute value), we can call it an outlier.
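Studentized residuals and leverage values can be obtained from statsmodels influence measures. A minimal sketch, assuming a fitted OLS results object named result:

# Hedged sketch: studentized residuals and leverage from a fitted model.
# Assumes `result` is a fitted statsmodels OLS results object.
import numpy as np

influence = result.get_influence()
student_resid = influence.resid_studentized_external   # externally studentized residuals
leverage = influence.hat_matrix_diag                    # leverage (hat) values
print('Possible outliers at observations:', np.where(np.abs(student_resid) > 3)[0])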

Example 1:

All of the data points follow the general trend of the rest of the data, so there are no outliers (in the \(Y\) direction). And, none of the data points are extreme with respect to \(X\), so there are no high leverage points. Overall, none of the data points would appear to be influential with respect to the location of the best fitting line.

Example 2:

There is one outlier, but no high leverage points. An easy way to determine if the data point is influential is to find the best fitting line twice — once with the red data point included and once with the red data point excluded. The data point is not influential.

Example 3:

The data point is not influential, nor is it an outlier, but it does have high leverage.

Example 4:

The data point is deemed both high leverage and an outlier, and it turned out to be influential too.

\(\blacksquare\)

Partial Regression Plots

For a particular independent variable \(X_i\), a partial regression plot shows:

  • residuals from regressing \(Y\) (the response variable) against all the independent variables except \(X_i\)

  • residuals from regressing \(X_i\) against the remaining independent variables.

It has a slope that is the same as the multiple regression coefficient for that predictor and also has the same residuals as the full multiple regression. So we can spot any outliers or influential points and tell whether they’ve affected the estimation of this particular coefficient. For the same reasons that we always look at a scatterplot before interpreting a simple regression coefficient, it’s a good idea to make a partial regression plot for any multiple regression coefficient that we hope to understand or interpret.

ANOVA

Consider the heart attacks in rabbits study with variables:

  • the infarcted area (in grams) of a rabbit

  • the region at risk (in grams) of a rabbit

  • \(X2\) =1 if early cooling of a rabbit, 0 if not

  • \(X3\) =1 if late cooling of a rabbit, 0 if not

The regression formula is

\begin{equation*} \text{Infarc} = \beta_0 + \beta_1 \times \text{Area} + \beta_2 \times \text{X2} + \beta_3 \times \text{X3} \end{equation*}

[32]:
df = pd.read_csv(dataurl+'coolhearts.txt', delim_whitespace=True, header=0)
result = analysis(df, 'Infarc', ['Area','X2','X3'], printlvl=0)
sm.stats.anova_lm(result, typ=2)
[32]:
sum_sq df F PR(>F)
Area 0.637420 1.0 32.753601 0.000004
X2 0.297327 1.0 15.278050 0.000536
X3 0.019814 1.0 1.018138 0.321602
Residual 0.544910 28.0 NaN NaN
  • sum_sq[Residual] is sum of squared residuals;

  • sum_sq[Area] is the sum_sq[Residual] of the regression Infarc ~ X2 + X3 (i.e., the regression without Area) minus the sum_sq[Residual] of the full model;

  • df[Residual] is degree of freedom \(=\) total number of observations (result.nobs) \(-\) the number of explanatory variables (3 or result.df_model) \(- 1 =\) result.df_resid;

The ANOVA (analysis of variance) table for a regression can be used to test the hypothesis:

\begin{align} H_0:& \text{ a subset of explanatory variables all have $0$ slope} \nonumber\\ H_a:& \text{ at least one variable in the subset of explanatory variables has a nonzero slope} \nonumber \end{align}

Example 1: \begin{align} H_0:& \quad \beta_i = 0 ~ \forall i \in \{1\} \nonumber\\ H_a:& \quad \beta_i \neq 0 \text{ for at least one } i \in \{1\}. \nonumber \end{align}

[33]:
df = pd.read_csv(dataurl+'coolhearts.txt', delim_whitespace=True, header=0)
_ = GL_Ftest(df, 'Infarc', ['Area','X2','X3'], ['X2','X3'], 0.05)
The F-value is 32.754 and p-value is 0.000. Reject null hypothesis.
At least one variable in {'Area'} has nonzero slope.

It shows the size of the infarct is significantly related to the size of the area at risk. When a single coefficient is tested, the result of such an F-test is always equivalent to the corresponding t-test result.

Example 2: \begin{align} H_0:& \quad \beta_i = 0 ~ \forall i \in \{1,2,3\} \nonumber\\ H_a:& \quad \beta_i \neq 0 \text{ for at least one } i \in \{1,2,3\}. \nonumber \end{align}

[34]:
df = pd.read_csv(dataurl+'coolhearts.txt', delim_whitespace=True, header=0)
_ = GL_Ftest(df, 'Infarc', ['Area','X2','X3'], [], 0.05)
The F-value is 16.431 and p-value is 0.000. Reject null hypothesis.
At least one variable in {'Area', 'X3', 'X2'} has nonzero slope.

This is also called a test for the overall fit. The F-statistic and its p-value are also available from the fvalue and f_pvalue properties of the regression result in statsmodels.

Example 3: \begin{align} H_0:& \quad \beta_i = 0 ~ \forall i \in \{2,3\} \nonumber\\ H_a:& \quad \beta_i \neq 0 \text{ for at least one } i \in \{2,3\}. \nonumber \end{align}

[35]:
df = pd.read_csv(dataurl+'coolhearts.txt', delim_whitespace=True, header=0)
_ = GL_Ftest(df, 'Infarc', ['Area','X2','X3'], ['Area'], 0.05)
The F-value is 8.590 and p-value is 0.001. Reject null hypothesis.
At least one variable in {'X3', 'X2'} has nonzero slope.

Let \(N\) be the index set of all explanatory variables and \(K \subseteq N\). We have the two hypotheses \begin{align} H_0:& \quad \beta_i = 0 ~ \forall i \in K \nonumber\\ H_a:& \quad \beta_i \neq 0 \text{ for at least one } i \in K. \nonumber \end{align}

We consider two models \begin{align} \text{full model:} & \quad Y = \beta_0 + \sum_{i \in N} \beta_i X_i \nonumber\\ \text{reduced model:} & \quad Y = \beta_0 + \sum_{i \in N \setminus K} \beta_i X_i. \nonumber \end{align}

Let \(SSE(F)\) and \(SSE(R)\) be the sum of squared errors for the full and reduced models respectively where

\begin{equation*} SSE = \text{sum of squared residuals} = \sum e^2_i = \sum (Y_i - \hat{Y}_i)^2. \end{equation*}

The general linear statistic:

\begin{equation*} F^\ast = \frac{SSE(R) - SSE(F)}{df_R - df_F} \div \frac{SSE(F)}{df_F} \end{equation*}

and \(p\)-value is \begin{equation*} 1 - \text{F-dist}(F^\ast, df_R - df_F, df_F). \end{equation*}
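GL_Ftest is a helper loaded at the top of this notebook; a minimal sketch of the same computation using only statsmodels and scipy (the model choices below mirror Example 3 and are assumptions for illustration):

# Hedged sketch: general linear F-test by comparing a full and a reduced model.
# Assumes `df` is the coolhearts DataFrame loaded in the cells above.
from statsmodels.formula.api import ols
from scipy import stats

full = ols('Infarc ~ Area + X2 + X3', data=df).fit()   # full model
reduced = ols('Infarc ~ Area', data=df).fit()          # reduced model (beta_2 = beta_3 = 0)

f_star = ((reduced.ssr - full.ssr) / (reduced.df_resid - full.df_resid)) / (full.ssr / full.df_resid)
p_value = stats.f.sf(f_star, reduced.df_resid - full.df_resid, full.df_resid)
print('F* = {:.3f}, p-value = {:.4f}'.format(f_star, p_value))

# statsmodels provides the same comparison directly:
print(full.compare_f_test(reduced))   # (F-value, p-value, df difference)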

Why don’t we perform a series of individual t-tests? For example, for the rabbit heart attacks study, we could have done the following in Example 3:

  • Fit the model InfSize ~ Area \(+\) x2 \(+\) x3 and use an individual t-test for x3.

  • If the test results indicate that we can drop x3, then fit the model InfSize ~ Area \(+\) x2 and use an individual t-test for x2.

[36]:
hide_answer()
[36]:

Note that we’re using two individual t-tests instead of one F-test in the suggested method. It means the chance of drawing an incorrect conclusion in our testing procedure is higher. Recall that every time we do a hypothesis test, we can draw an incorrect conclusion by:

  • rejecting a true null hypothesis, i.e., make a type 1 error by concluding the tested predictor(s) should be retained in the model, when in truth it/they should be dropped; or

  • failing to reject a false null hypothesis, i.e., make a type 2 error by concluding the tested predictor(s) should be dropped from the model, when in truth it/they should be retained.

Thus, in general, the fewer tests we perform the better. In this case, this means that wherever possible using one F-test in place of multiple individual t-tests is preferable.

\(\blacksquare\)

Prediction

After we have estimated a regression equation from a dataset, we can use this equation to predict the value of the dependent variable for new observations.

[37]:
df = pd.read_csv(dataurl+'costs.csv', header=0, index_col='month')
result = analysis(df, 'cost', ['production_runs','machine_hours'], printlvl=0)

df_pred = pd.DataFrame({'machine_hours':[1430, 1560, 1520],'production_runs': [35,45,40]})
predict = result.get_prediction(df_pred)
predict.summary_frame(alpha=0.05)
[37]:
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower obs_ci_upper
0 97180.354898 697.961311 95760.341932 98600.367863 88700.800172 105659.909623
1 111676.265905 1135.822591 109365.417468 113987.114342 103002.948661 120349.583149
2 105516.720354 817.257185 103853.998110 107179.442598 96993.161428 114040.279281
  • mean is the value calculated based on the (sample) regression line

  • mean (or confidence) lower and upper limits give the confidence interval for the mean of the dependent variable at the given values of the explanatory variables. In other words, if we run the experiment many times with the given values of the explanatory variables, we have 95% confidence that the mean of the dependent variable will be in this interval.

  • obs (or prediction) lower and upper limits give the prediction interval for a single observation. In other words, if we run one experiment with the given values of the explanatory variables, we have 95% confidence that the value of the dependent variable will be in this interval.

Cross Validation

The fit from a regression analysis is often overly optimistic (over-fitted). There is no guarantee that the fit will be as good when the estimated regression equation is applied to new data.

To validate the fit, we can gather new data, predict the dependent variable and compare with known values of the dependent variable. However, most of the time we cannot obtain new independent data to validate our model. An alternative is to partition the sample data into a training set and testing (validation) set.

The simplest approach to cross-validation is to partition the sample observations randomly with 50% of the sample in each set or 60%/40% or 70%/30%.

[38]:
df = pd.read_csv(dataurl+'costs.csv', header=0, index_col='month')
result = analysis(df, 'cost', ['production_runs','machine_hours'], printlvl=0)

df_validate = pd.read_csv(dataurl+'validation.csv', header=0)
predict = result.get_prediction(df_validate)
df_validate['predict'] = predict.predicted_mean
r2 = np.corrcoef(df_validate.cost,df_validate.predict)[0,1]**2
stderr = np.sqrt(((df_validate.cost - df_validate.predict)**2).sum()/(df_validate.shape[0]-3))
print('Validation R-squared:\t\t{:.3f}'.format(r2))
print('Validation StErr of Est:\t{:.3f}'.format(stderr))
Validation R-squared:           0.773
Validation StErr of Est:        5032.718

\(K\)-fold cross-validation can also be used. This partitions the sample dataset into \(K\) subsets of equal size. Each subset is used in turn as the testing set, with the remaining \(K-1\) subsets as the training set. We then calculate the sum of squared prediction errors and combine the \(K\) estimates of prediction error to produce a \(K\)-fold cross-validation estimate.
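A minimal sketch of \(K\)-fold cross-validation for the cost model, written directly with numpy and the statsmodels formula API (the fold count and formula here are assumptions for illustration):

# Hedged sketch: K-fold cross-validation of the cost regression.
# Assumes `df` is the costs DataFrame loaded in the cell above.
import numpy as np
from statsmodels.formula.api import ols

K = 4
folds = np.array_split(np.random.permutation(df.index), K)   # shuffled folds of row labels

sse = 0.0
for k in range(K):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    model = ols('cost ~ production_runs + machine_hours', data=df.loc[train_idx]).fit()
    pred = model.predict(df.loc[test_idx])
    sse += ((df.loc[test_idx, 'cost'] - pred) ** 2).sum()

print('K-fold estimate of prediction error (RMSE): {:.2f}'.format(np.sqrt(sse / len(df))))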

Multicollinearity

Multicollinearity exists when two or more of the predictors in a regression model are moderately or highly correlated with one another. An exact linear relationship among predictors is fairly simple to spot because regression packages will typically respond with an error message. A more common and serious problem is near multicollinearity, where explanatory variables are highly, but not exactly, correlated.

There are two types of multicollinearity:

  • Structural multicollinearity is a mathematical artifact caused by creating new predictors from other predictors — such as, creating the predictor \(X^2\) from the predictor \(X\).

  • Data-based multicollinearity, on the other hand, is a result of a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which the data are collected.

Variance inflation factor (VIF) can be used to identify multicollinearity. The VIF for \(X\) is defined as 1/(1 - R-square for \(X\)), where R-square for \(X\) variable is the usual R-square value from a regression with that \(X\) as the dependent variable and the other \(X\)’s as the explanatory variables.

[39]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from patsy import dmatrices

df = pd.read_csv(dataurl+'multicollinearity.csv',header=0, index_col='person')
features = "+".join(df.drop(columns=['salary']).columns)
y, X = dmatrices('salary ~' + features, df, return_type='dataframe')
VIF = pd.DataFrame()
VIF["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(1,X.shape[1])]
VIF["Features"] = X.drop(columns=['Intercept']).columns
display(VIF)

def calculate_vif_(X, thresh=5.0):
    cols = X.columns
    variables = np.arange(X.shape[1])
    dropped=True
    while dropped:
        dropped=False
        c = X[cols[variables]].values
        vif = [variance_inflation_factor(c, ix) for ix in np.arange(c.shape[1])]

        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print('dropping \'' + X[cols[variables]].columns[maxloc] + '\' at index: ' + str(maxloc))
            variables = np.delete(variables, maxloc)
            dropped=True

    print('Remaining variables:')
    print(X.columns[variables])

calculate_vif_(X.drop(columns=['Intercept']))
toggle()
VIF Factor Features
0 1.001383 gender[T.Male]
1 31.503655 age
2 38.153090 experience
3 5.594549 seniority
dropping 'experience' at index: 2
dropping 'age' at index: 1
Remaining variables:
Index(['gender[T.Male]', 'seniority'], dtype='object')
[39]:

Regression Modeling

Categorical Variables

The only meaningful way to summarize categorical data is with counts of observations in the categories. We consider a bank’s employee data.

[40]:
df_salary = pd.read_csv(dataurl+'salaries.csv', header=0, index_col='employee')
interact(plot_hist,
         column=widgets.Dropdown(options=df_salary.columns[:-1],value=df_salary.columns[0],
                            description='Column:',disabled=False));

To analyze whether the bank discriminates against females in terms of salary, we consider regression model:

\[\text{salary} \sim \text{years of experience with this bank (year1)} + \text{years of work experience prior to this bank (year2)} + \text{gender}\]
[41]:
df = pd.read_csv(dataurl+'salaries.csv', header=0, index_col='employee')
result = analysis(df, 'salary', ['years1', 'years2', 'C(gender, Treatment(\'Male\'))'], printlvl=5)
OLS Regression Results
Dep. Variable: salary R-squared: 0.492
Model: OLS Adj. R-squared: 0.485
Method: Least Squares F-statistic: 65.93
Date: Wed, 14 Aug 2019 Prob (F-statistic): 7.61e-30
Time: 08:55:10 Log-Likelihood: -2164.5
No. Observations: 208 AIC: 4337.
Df Residuals: 204 BIC: 4350.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3.549e+04 1341.022 26.466 0.000 3.28e+04 3.81e+04
C(gender, Treatment('Male'))[T.Female] -8080.2121 1198.170 -6.744 0.000 -1.04e+04 -5717.827
years1 987.9937 80.928 12.208 0.000 828.431 1147.557
years2 131.3379 180.923 0.726 0.469 -225.381 488.057
Omnibus: 15.654 Durbin-Watson: 1.286
Prob(Omnibus): 0.000 Jarque-Bera (JB): 25.154
Skew: 0.434 Prob(JB): 3.45e-06
Kurtosis: 4.465 Cond. No. 34.6


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

standard error of estimate:8079.39743


ANOVA Table:

sum_sq df F PR(>F)
C(gender, Treatment('Male')) 2.968701e+09 1.0 45.478753 1.556196e-10
years1 9.728975e+09 1.0 149.042170 4.314965e-26
years2 3.439940e+07 1.0 0.526979 4.687119e-01
Residual 1.331644e+10 204.0 NaN NaN
../_images/docs_regression_analysis_132_3.png
KstestResult(statistic=0.5240384615384616, pvalue=1.0087139661949263e-53)

The regression is equivalent to fitting two separate regression functions, one for each gender. However, it is usually easier and more efficient to answer research questions concerning the binary predictor variable by using one combined regression function.

[42]:
def update(column):
    df = pd.read_csv(dataurl+'salaries.csv', header=0, index_col='employee')
    sns.lmplot(x=column, y='salary', data=df, fit_reg=True, hue='gender', legend=True, height=6);

interact(update,
         column=widgets.Dropdown(options=['years1', 'years2', 'age', 'education'],value='years1',
                            description='Column:',disabled=False));

Interaction Variables

Interaction with Categorical Variable

Example: consider the salary data with the interaction gender*years1

[43]:
result = ols(formula="salary ~ years1 + C(gender, Treatment('Male'))*years1", data=df_salary).fit()
display(result.summary())
print("\nANOVA Table:\n")
display(sm.stats.anova_lm(result, typ=1))
OLS Regression Results
Dep. Variable: salary R-squared: 0.639
Model: OLS Adj. R-squared: 0.633
Method: Least Squares F-statistic: 120.2
Date: Wed, 14 Aug 2019 Prob (F-statistic): 7.51e-45
Time: 08:55:12 Log-Likelihood: -2129.2
No. Observations: 208 AIC: 4266.
Df Residuals: 204 BIC: 4280.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3.043e+04 1216.574 25.013 0.000 2.8e+04 3.28e+04
C(gender, Treatment('Male'))[T.Female] 4098.2519 1665.842 2.460 0.015 813.776 7382.727
years1 1527.7617 90.460 16.889 0.000 1349.405 1706.119
C(gender, Treatment('Male'))[T.Female]:years1 -1247.7984 136.676 -9.130 0.000 -1517.277 -978.320
Omnibus: 11.523 Durbin-Watson: 1.188
Prob(Omnibus): 0.003 Jarque-Bera (JB): 15.614
Skew: 0.379 Prob(JB): 0.000407
Kurtosis: 4.108 Cond. No. 58.3


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

ANOVA Table:

df sum_sq mean_sq F PR(>F)
C(gender, Treatment('Male')) 1.0 3.149634e+09 3.149634e+09 67.789572 2.150454e-14
years1 1.0 9.726635e+09 9.726635e+09 209.346370 4.078705e-33
C(gender, Treatment('Male')):years1 1.0 3.872606e+09 3.872606e+09 83.350112 6.833545e-17
Residual 204.0 9.478232e+09 4.646192e+07 NaN NaN

Example: Let’s consider some contrived data

[44]:
np.random.seed(123)
size = 20
x1 = np.linspace(-5,5,size)
group = np.random.randint(0, 2, size=size)
y = 5 + (3-7*group)*x1 + 3*np.random.normal(size=size)
df = pd.DataFrame({'y':y, 'x1':x1, 'group':group})
sns.lmplot(x='x1', y='y', data=df, fit_reg=True, hue='group', legend=True);
result = analysis(df, 'y', ['x1', 'group'], printlvl=4)
OLS Regression Results
Dep. Variable: y R-squared: 0.004
Model: OLS Adj. R-squared: -0.113
Method: Least Squares F-statistic: 0.03659
Date: Wed, 14 Aug 2019 Prob (F-statistic): 0.964
Time: 08:55:13 Log-Likelihood: -72.898
No. Observations: 20 AIC: 151.8
Df Residuals: 17 BIC: 154.8
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3.1116 3.075 1.012 0.326 -3.377 9.600
x1 0.1969 0.765 0.257 0.800 -1.417 1.811
group 0.0733 4.667 0.016 0.988 -9.773 9.920
Omnibus: 0.957 Durbin-Watson: 2.131
Prob(Omnibus): 0.620 Jarque-Bera (JB): 0.741
Skew: 0.008 Prob(JB): 0.690
Kurtosis: 2.057 Cond. No. 7.06


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

standard error of estimate:10.04661

../_images/docs_regression_analysis_139_2.png
../_images/docs_regression_analysis_139_3.png
KstestResult(statistic=0.44990149282919256, pvalue=0.0003257504098859564)

The scatterplot shows that both explanatory variables x1 and group are related to y. However, the regression results show that there is insufficient evidence to conclude that x1 and group are related to y. The example shows that if we leave an important interaction term out of our model, our analysis can lead us to erroneous conclusions.

Without interaction variables, we assume the effect of each \(X\) on \(Y\) is independent of the value of the other \(X\)’s, which is not the case in this contrived data as shown in the scatterplot.

Now, consider a regression model that contains an interaction term between the two predictors.

[45]:
result = analysis(df, 'y', ['x1', 'x1*C(group)'], printlvl=4)
OLS Regression Results
Dep. Variable: y R-squared: 0.897
Model: OLS Adj. R-squared: 0.878
Method: Least Squares F-statistic: 46.57
Date: Wed, 14 Aug 2019 Prob (F-statistic): 3.95e-08
Time: 08:55:14 Log-Likelihood: -50.187
No. Observations: 20 AIC: 108.4
Df Residuals: 16 BIC: 112.4
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 4.7029 1.027 4.578 0.000 2.525 6.881
C(group)[T.1] 1.7810 1.552 1.147 0.268 -1.509 5.071
x1 2.4905 0.319 7.798 0.000 1.813 3.168
x1:C(group)[T.1] -6.1842 0.524 -11.792 0.000 -7.296 -5.072
Omnibus: 1.739 Durbin-Watson: 1.433
Prob(Omnibus): 0.419 Jarque-Bera (JB): 1.403
Skew: 0.607 Prob(JB): 0.496
Kurtosis: 2.541 Cond. No. 7.68


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

standard error of estimate:3.32671

../_images/docs_regression_analysis_141_2.png
KstestResult(statistic=0.40384330620002296, pvalue=0.0018511879743272586)

The non-significance of the group coefficient indicates no observed difference between the mean values of y for the two groups when x1 = 0.

Example: Researchers are interested in comparing the effectiveness of three treatments (A,B,C) for severe depression. They collect the data on a random sample of \(n = 36\) severely depressed individuals including following variables:

  • measure of the effectiveness of the treatment for each individual

  • age (in years) of each individual

  • x2 = 1 if an individual received treatment A, and 0 if not

  • x3 = 1 if an individual received treatment B, and 0 if not

The regression model:

\begin{align} \text{y} = \beta_0 + \beta_1 \times \text{age} + \beta_2 \times \text{x2} + \beta_3 \times \text{x3} + \beta_{12} \times \text{age} \times \text{x2} + \beta_{13} \times \text{age} \times \text{x3} \end{align}

[46]:
df = pd.read_csv(dataurl+'depression.txt', delim_whitespace=True, header=0)
sns.lmplot(x='age', y='y', data=df, fit_reg=True, hue='TRT', legend=True);
../_images/docs_regression_analysis_144_0.png
[47]:
result = analysis(df, 'y', ['age', 'age*x2', 'age*x3'], printlvl=4)
sm.stats.anova_lm(result, typ=1)
OLS Regression Results
Dep. Variable: y R-squared: 0.914
Model: OLS Adj. R-squared: 0.900
Method: Least Squares F-statistic: 64.04
Date: Wed, 14 Aug 2019 Prob (F-statistic): 4.26e-15
Time: 08:55:16 Log-Likelihood: -97.024
No. Observations: 36 AIC: 206.0
Df Residuals: 30 BIC: 215.5
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 6.2114 3.350 1.854 0.074 -0.630 13.052
age 1.0334 0.072 14.288 0.000 0.886 1.181
x2 41.3042 5.085 8.124 0.000 30.920 51.688
age:x2 -0.7029 0.109 -6.451 0.000 -0.925 -0.480
x3 22.7068 5.091 4.460 0.000 12.310 33.104
age:x3 -0.5097 0.110 -4.617 0.000 -0.735 -0.284
Omnibus: 2.593 Durbin-Watson: 1.633
Prob(Omnibus): 0.273 Jarque-Bera (JB): 1.475
Skew: -0.183 Prob(JB): 0.478
Kurtosis: 2.079 Cond. No. 529.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

standard error of estimate:3.92491

../_images/docs_regression_analysis_145_2.png
KstestResult(statistic=0.3553429742849312, pvalue=0.00014233708080337406)
[47]:
df sum_sq mean_sq F PR(>F)
age 1.0 3424.431786 3424.431786 222.294615 2.058902e-15
x2 1.0 803.804498 803.804498 52.178412 4.856763e-08
age:x2 1.0 375.254832 375.254832 24.359407 2.794727e-05
x3 1.0 0.936671 0.936671 0.060803 8.069102e-01
age:x3 1.0 328.424475 328.424475 21.319447 6.849515e-05
Residual 30.0 462.147738 15.404925 NaN NaN

For each age, is there a difference in the mean effectiveness for the three treatments?

[48]:
hide_answer()
[48]:

\begin{align*} \text{Treatment A:} & \quad \mu_Y = (\beta_0 + \beta_2) + (\beta_1 + \beta_{12}) \times \text{age} \\ \text{Treatment B:} & \quad \mu_Y = (\beta_0 + \beta_3) + (\beta_1 + \beta_{13}) \times \text{age} \\ \text{Treatment C:} & \quad \mu_Y = \beta_0 + \beta_1 \times \text{age} \end{align*}

The research question is equivalent to ask \begin{align*} \text{Treatment A} - \text{Treatment C} &= 0 = \beta_2 + \beta_{12} \times \text{age} \\ \text{Treatment B} - \text{Treatment C} &= 0 = \beta_3 + \beta_{13} \times \text{age} \end{align*}

Thus, we test hypothesis \begin{align} H_0:& \quad \beta_2 = \beta_3 = \beta_{12} = \beta_{13} = 0 \nonumber\\ H_a:& \quad \text{at least one is nonzero} \nonumber \end{align}

The python code is:

_ = GL_Ftest(df, 'y', ['age', 'x2', 'x3', 'age:x2', 'age:x3'], ['age'], 0.05, typ=1)

\(\blacksquare\)

Does the effect of age on the treatment’s effectiveness depend on treatment?

[49]:
hide_answer()
[49]:

The question asks if the two interaction parameters \(\beta_{12}\) and \(\beta_{13}\) are significant.

The python code is:

_ = GL_Ftest(df, 'y', ['age', 'x2', 'x3', 'age:x2', 'age:x3'], ['age', 'x2', 'x3'], 0.05, typ=1)

\(\blacksquare\)

Interactions Between Continuous Variables

Typically, regression models that include interactions between quantitative predictors adhere to the hierarchy principle, which says that if your model includes an interaction term, \(X_1 X_2\), and \(X_1 X_2\) is shown to be a statistically significant predictor of Y, then your model should also include the “main effects,” \(X_1\) and \(X_2\), whether or not the coefficients for these main effects are significant. Depending on the subject area, there may be circumstances where a main effect could be excluded, but this tends to be the exception.
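With the formula API, writing x1 * x2 includes both main effects and the interaction, which respects the hierarchy principle. A minimal sketch with contrived data (the variable names and coefficients below are assumptions for illustration):

# Hedged sketch: interaction between two continuous predictors.
# x1, x2, and y are contrived illustrative data.
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

np.random.seed(1)
x1 = np.random.uniform(0, 10, 100)
x2 = np.random.uniform(0, 10, 100)
y = 2 + 1.5*x1 - 0.8*x2 + 0.5*x1*x2 + np.random.normal(size=100)
df_int = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2})

# "x1 * x2" expands to x1 + x2 + x1:x2 (main effects plus interaction).
result_int = ols('y ~ x1 * x2', data=df_int).fit()
print(result_int.params)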

Transformations

Log-transform

  • Transformations are used widely in regression analysis because they have a meaningful interpretation.

    • \(Y = a \log(X) + \cdots\): 1% increase by X, Y is expected to increase by 0.01*a amount

    • \(\log(Y) = a X + \cdots\): 1unit increase by X, Y is expected to increase by (100*a)%

    • \(\log(Y) = a \log(X) + \cdots\): 1% increase by X, Y is expected to increase by a%

  • Transforming the \(X\) values is appropriate when non-linearity is the only problem (i.e., the independence, normality, and equal variance conditions are met)

  • Transforming the \(Y\) values should be considered when non-normality and/or unequal variances are the problems with the model.
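A quick numeric check of the log-log interpretation (pure illustration, with a hypothetical slope value): if \(\log(Y) = a \log(X)\), multiplying \(X\) by 1.01 multiplies \(Y\) by \(1.01^a \approx 1 + 0.01a\), i.e., roughly an \(a\%\) increase.

import numpy as np

a = 2.5                               # hypothetical log-log slope
print(1.01 ** a - 1)                  # exact relative change in Y: ~0.0252
print(np.exp(a * np.log(1.01)) - 1)   # same value via the log identity
print(0.01 * a)                       # the 'a%' approximation: 0.025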

[50]:
df = pd.read_csv(dataurl+'shortleaf.txt', delim_whitespace=True, header=0)
result = analysis(df, 'Vol', ['Diam'], printlvl=4)
OLS Regression Results
Dep. Variable: Vol R-squared: 0.893
Model: OLS Adj. R-squared: 0.891
Method: Least Squares F-statistic: 564.9
Date: Wed, 14 Aug 2019 Prob (F-statistic): 1.17e-34
Time: 08:55:18 Log-Likelihood: -258.61
No. Observations: 70 AIC: 521.2
Df Residuals: 68 BIC: 525.7
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -41.5681 3.427 -12.130 0.000 -48.406 -34.730
Diam 6.8367 0.288 23.767 0.000 6.263 7.411
Omnibus: 29.509 Durbin-Watson: 0.562
Prob(Omnibus): 0.000 Jarque-Bera (JB): 85.141
Skew: 1.240 Prob(JB): 3.25e-19
Kurtosis: 7.800 Cond. No. 34.8


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

standard error of estimate:9.87485

../_images/docs_regression_analysis_155_2.png
../_images/docs_regression_analysis_155_3.png
KstestResult(statistic=0.4372892531717568, pvalue=1.0701622883725878e-12)

The residual plot suggests non-linearity and the Q-Q plot suggests that the error terms are not normal. We first try log-transforming \(X\) (Diam).

[51]:
df = pd.read_csv(dataurl+'shortleaf.txt', delim_whitespace=True, header=0)
df['logvol'] = np.log(df.Vol)
df['logdiam'] = np.log(df.Diam)
result = analysis(df, 'Vol', ['logdiam'], printlvl=4)
OLS Regression Results
Dep. Variable: Vol R-squared: 0.746
Model: OLS Adj. R-squared: 0.743
Method: Least Squares F-statistic: 200.2
Date: Wed, 14 Aug 2019 Prob (F-statistic): 6.10e-22
Time: 08:55:20 Log-Likelihood: -288.66
No. Observations: 70 AIC: 581.3
Df Residuals: 68 BIC: 585.8
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -116.1618 10.830 -10.726 0.000 -137.773 -94.551
logdiam 64.5358 4.562 14.148 0.000 55.433 73.638
Omnibus: 50.047 Durbin-Watson: 0.324
Prob(Omnibus): 0.000 Jarque-Bera (JB): 219.500
Skew: 2.099 Prob(JB): 2.17e-48
Kurtosis: 10.591 Cond. No. 16.6


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

standard error of estimate:15.17022

../_images/docs_regression_analysis_157_2.png
../_images/docs_regression_analysis_157_3.png
KstestResult(statistic=0.5855934039207538, pvalue=2.338971194588551e-23)

Transforming only the \(X\) values does not remove the non-linearity, nor does it improve the normality of the error terms. We therefore log-transform both \(X\) and \(Y\).

[52]:
result = analysis(df, 'logvol', ['logdiam'], printlvl=4)
OLS Regression Results
Dep. Variable: logvol R-squared: 0.974
Model: OLS Adj. R-squared: 0.973
Method: Least Squares F-statistic: 2509.
Date: Wed, 14 Aug 2019 Prob (F-statistic): 2.08e-55
Time: 08:55:21 Log-Likelihood: 25.618
No. Observations: 70 AIC: -47.24
Df Residuals: 68 BIC: -42.74
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -2.8718 0.122 -23.626 0.000 -3.114 -2.629
logdiam 2.5644 0.051 50.090 0.000 2.462 2.667
Omnibus: 1.405 Durbin-Watson: 1.359
Prob(Omnibus): 0.495 Jarque-Bera (JB): 1.119
Skew: -0.051 Prob(JB): 0.572
Kurtosis: 2.389 Cond. No. 16.6


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

standard error of estimate:0.17026

../_images/docs_regression_analysis_159_2.png
../_images/docs_regression_analysis_159_3.png
KstestResult(statistic=0.3711075900985334, pvalue=3.691529639884377e-09)

Both the residual plot and the Q-Q plot have improved substantially. A sketch of back-transforming predictions from this log-log fit to the original volume scale follows.
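To predict volume on the original scale, exponentiate the predicted log-volume. A minimal sketch, refitting explicitly with ols so the formula is unambiguous; note that exponentiating the predicted mean of \(\log(Y)\) gives a median-type prediction rather than the mean, and the diameter value of 10 is only illustrative:

import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

loglog = ols('logvol ~ logdiam', data=df).fit()

# predicted volume for an illustrative diameter of 10 (same units as Diam)
new = pd.DataFrame({'logdiam': [np.log(10)]})
print(np.exp(loglog.predict(new)))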

Polynomial Transformations

[53]:
df = pd.read_csv(dataurl+'costunits.csv', header=0, index_col='month')
result = analysis(df, 'cost', ['units'], printlvl=4)
OLS Regression Results
Dep. Variable: cost R-squared: 0.736
Model: OLS Adj. R-squared: 0.728
Method: Least Squares F-statistic: 94.75
Date: Wed, 14 Aug 2019 Prob (F-statistic): 2.32e-11
Time: 08:55:23 Log-Likelihood: -334.94
No. Observations: 36 AIC: 673.9
Df Residuals: 34 BIC: 677.0
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 2.365e+04 1917.137 12.337 0.000 1.98e+04 2.75e+04
units 30.5331 3.137 9.734 0.000 24.158 36.908
Omnibus: 4.527 Durbin-Watson: 2.144
Prob(Omnibus): 0.104 Jarque-Bera (JB): 1.765
Skew: -0.042 Prob(JB): 0.414
Kurtosis: 1.918 Cond. No. 2.57e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.57e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

standard error of estimate:2733.74240

../_images/docs_regression_analysis_162_2.png
../_images/docs_regression_analysis_162_3.png
KstestResult(statistic=0.5555555555555556, pvalue=5.544489818114514e-11)
[54]:
df = pd.read_csv(dataurl+'costunits.csv', header=0, index_col='month')
fit2 = np.poly1d(np.polyfit(df.units, df.cost, 2))
xp = np.linspace(np.min(df.units), np.max(df.units), 100)
fig = plt.figure(figsize=(8,6))
plt.plot(df.units, df.cost, '.', xp, fit2(xp), '-')
plt.title('cost vs units')
plt.xlabel('units')
plt.ylabel('cost')
plt.show()

toggle()
../_images/docs_regression_analysis_163_0.png
[54]:
[55]:
df = pd.read_csv(dataurl+'costunits.csv', header=0, index_col='month')
df['units2'] = np.power(df.units,2)
result = ols(formula='cost ~ units + units2', data=df).fit()
display(result.summary())
fig, axes = plt.subplots(1,3,figsize=(18,6))
sns.residplot(result.fittedvalues, result.resid , lowess=False, scatter_kws={"s": 80},
              line_kws={'color':'r', 'lw':1}, ax=axes[0])
axes[0].set_title('Residual plot')
axes[0].set_xlabel('Fitted values')
axes[0].set_ylabel('Residuals')

axes[1].relim()
stats.probplot(result.resid, dist='norm', plot=axes[1])
axes[1].set_title('Normal Q-Q plot')
axes[2].relim()
sns.distplot(result.resid, ax=axes[2]);
plt.show()
toggle()
OLS Regression Results
Dep. Variable: cost R-squared: 0.822
Model: OLS Adj. R-squared: 0.811
Method: Least Squares F-statistic: 75.98
Date: Wed, 14 Aug 2019 Prob (F-statistic): 4.45e-13
Time: 08:55:25 Log-Likelihood: -327.88
No. Observations: 36 AIC: 661.8
Df Residuals: 33 BIC: 666.5
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 5792.7983 4763.058 1.216 0.233 -3897.717 1.55e+04
units 98.3504 17.237 5.706 0.000 63.282 133.419
units2 -0.0600 0.015 -3.981 0.000 -0.091 -0.029
Omnibus: 2.245 Durbin-Watson: 2.123
Prob(Omnibus): 0.326 Jarque-Bera (JB): 2.021
Skew: -0.553 Prob(JB): 0.364
Kurtosis: 2.651 Cond. No. 5.12e+06


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.12e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
../_images/docs_regression_analysis_164_1.png
[55]:
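The summary above warns about a large condition number (5.12e+06), which arises because units and units2 are highly correlated. A common remedy, sketched below rather than part of the original analysis, is to center units before squaring; the fitted values and R-squared are unchanged, only the parameterization differs:

from statsmodels.formula.api import ols

# center the predictor, then square the centered values
df['cunits'] = df.units - df.units.mean()
df['cunits2'] = df.cunits ** 2

result_centered = ols(formula='cost ~ cunits + cunits2', data=df).fit()
print(result_centered.summary())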
[56]:
df = pd.read_csv(dataurl+'carstopping.txt', delim_whitespace=True, header=0)
df['sqrtdist'] = np.sqrt(df.StopDist)
result = ols(formula='sqrtdist ~ Speed', data=df).fit()
fig, axes = plt.subplots(1,3,figsize=(18,5))
sns.residplot(result.fittedvalues, result.resid , lowess=False, scatter_kws={"s": 80},
              line_kws={'color':'r', 'lw':1}, ax=axes[0])
axes[0].set_title('Residual plot')
axes[0].set_xlabel('Fitted values')
axes[0].set_ylabel('Residuals')

axes[1].relim()
stats.probplot(result.resid, dist='norm', plot=axes[1])
axes[1].set_title('Normal Q-Q plot')
axes[2].relim()
sns.distplot(result.resid, ax=axes[2]);
plt.show()

toggle()
../_images/docs_regression_analysis_165_0.png
[56]:

Heteroscedasticity

Excessive nonconstant variance can create technical difficulties with a multiple linear regression model. For example, if the residual variance increases with the fitted values, then prediction intervals will tend to be wider than they should be at low fitted values and narrower than they should be at high fitted values. Some remedies for refining a model exhibiting excessive nonconstant variance include the following; a diagnostic check with the Breusch-Pagan test is sketched after the list:

  • Apply a variance-stabilizing transformation to the response variable, for example a logarithmic transformation (or a square root transformation if a logarithmic transformation is “too strong” or a reciprocal transformation if a logarithmic transformation is “too weak”).

  • Weight the variances so that they can be different for each set of predictor values. This leads to weighted least squares, in which the data observations are given different weights when estimating the model – see below.

  • A generalization of weighted least squares is to allow the regression errors to be correlated with one another in addition to having different variances. This leads to generalized least squares, in which various forms of nonconstant variance can be modeled.

  • For some applications we can explicitly model the variance as a function of the mean, \(E(Y)\). This approach uses the framework of generalized linear models.
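Before applying any of these remedies, a formal check such as the Breusch-Pagan test can complement the residual plots. A minimal sketch using statsmodels, where result is any fitted OLS results object (for example, the one returned by the analysis call in the next cell):

from statsmodels.stats.diagnostic import het_breuschpagan

# H0: constant error variance; a small p-value indicates heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(result.resid, result.model.exog)
print(lm_pvalue)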

[57]:
df = pd.read_csv(dataurl+'ca_learning.txt', sep='\s+', header=0)
result = analysis(df, 'cost', ['responses'], printlvl=4)
OLS Regression Results
Dep. Variable: cost R-squared: 0.889
Model: OLS Adj. R-squared: 0.878
Method: Least Squares F-statistic: 80.19
Date: Wed, 14 Aug 2019 Prob (F-statistic): 4.33e-06
Time: 08:55:27 Log-Likelihood: -34.242
No. Observations: 12 AIC: 72.48
Df Residuals: 10 BIC: 73.45
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 19.4727 5.516 3.530 0.005 7.182 31.764
responses 3.2689 0.365 8.955 0.000 2.456 4.082
Omnibus: 2.739 Durbin-Watson: 1.783
Prob(Omnibus): 0.254 Jarque-Bera (JB): 1.001
Skew: 0.038 Prob(JB): 0.606
Kurtosis: 1.587 Cond. No. 63.1


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

standard error of estimate:4.59830

../_images/docs_regression_analysis_168_2.png
../_images/docs_regression_analysis_168_3.png
KstestResult(statistic=0.4553950067071834, pvalue=0.008567457152187175)

A plot of the residuals versus the predictor values indicates possible nonconstant variance since there is a very slight “megaphone” pattern. We therefore fit a simple linear regression model of the absolute residuals on the predictor and calculate weights as 1 over the squared fitted values from this model.

[58]:
result_ols = ols(formula='cost ~ responses', data=df).fit()   # ordinary least squares fit
df['resid'] = abs(result_ols.resid)                           # absolute OLS residuals
result = ols(formula='resid ~ responses', data=df).fit()      # model |residuals| ~ predictor
X = sm.add_constant(df.responses)
w = 1./(result.fittedvalues ** 2)                             # weights = 1 / (fitted |residuals|)^2
result_wls = sm.WLS(df.cost, X, weights=w).fit()              # weighted least squares fit
display(result_wls.summary())
fig, axes = plt.subplots(1,3,figsize=(18,5))
sns.residplot(result_wls.fittedvalues, result_wls.wresid, lowess=False, scatter_kws={"s": 80},
              line_kws={'color':'r', 'lw':1}, ax=axes[0])
axes[0].set_title('Residual plot')
axes[0].set_xlabel('Fitted values')
axes[0].set_ylabel('Residuals')

axes[1].relim()
stats.probplot(result_wls.resid, dist='norm', plot=axes[1])
axes[1].set_title('Normal Q-Q plot')
axes[2].relim()
sns.distplot(result_wls.resid, ax=axes[2]);
plt.show()

x = np.linspace(np.min(df.responses), np.max(df.responses), 20)
y_wls = result_wls.params[0] + result_wls.params[1] * x
y_ols = result_ols.params[0] + result_ols.params[1] * x
plt.plot(df.responses, df.cost, 'o', label='data')
plt.plot(x, y_ols, '-', label='ols method')
plt.plot(x, y_wls, '-', label='wls method')
plt.legend()
plt.show()

toggle()
WLS Regression Results
Dep. Variable: cost R-squared: 0.895
Model: WLS Adj. R-squared: 0.885
Method: Least Squares F-statistic: 85.35
Date: Wed, 14 Aug 2019 Prob (F-statistic): 3.27e-06
Time: 08:55:28 Log-Likelihood: -33.248
No. Observations: 12 AIC: 70.50
Df Residuals: 10 BIC: 71.47
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 17.3006 4.828 3.584 0.005 6.544 28.058
responses 3.4211 0.370 9.238 0.000 2.596 4.246
Omnibus: 4.602 Durbin-Watson: 1.706
Prob(Omnibus): 0.100 Jarque-Bera (JB): 1.241
Skew: 0.062 Prob(JB): 0.538
Kurtosis: 1.429 Cond. No. 56.7


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
../_images/docs_regression_analysis_170_1.png
../_images/docs_regression_analysis_170_2.png
[58]: