[1]:
%run initscript.py
%run ./files/loadregfuncs.py
HTML("""
<div id="popup" style="padding-bottom:5px; display:none;">
<div>Enter Password:</div>
<input id="password" type="password"/>
<button onclick="done()" style="border-radius: 12px;">Submit</button>
</div>
<button onclick="unlock()" style="border-radius: 12px;">Unclock</button>
<a href="#" onclick="code_toggle(this); return false;">show code</a>
""")
[1]:
Applied Regression Analysis¶
Regression Assumptions¶
As one of the most important types of data analysis, regression analysis is a way of mathematically sorting out the factors that may have an impact. It focuses on questions such as:
Which factors matter most?
Which can we ignore?
How do those factors interact with each other?
Perhaps most importantly, how certain are we about all of these factors?
Those factors are called variables, and we have
Dependent (or response or target) variable: the main factor that you’re trying to understand or predict
Independent (or explanatory or predictor) variables: the factors you suspect have an impact on your dependent variable
Generally, a regression model finds a line that fits the data “best,” in the sense that the \(n\) residuals — one for each observed data point — are as small as possible in some overall sense. One way to achieve this goal is to use the “ordinary least squares (OLS) criterion,” which says to “minimize the sum of the squared prediction errors.”
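As a rough illustration of the OLS criterion, the following sketch (synthetic data; closed-form least squares formulas for the simple-regression case) picks the intercept and slope that minimize the sum of squared prediction errors:
import numpy as np

# Synthetic data from a known line plus noise
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 30)
Y = 9.87 + 3.08 * X + rng.normal(scale=2.0, size=X.size)

# Closed-form OLS estimates for a simple regression: minimize the sum of squared errors
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()

residuals = Y - (a + b * X)
print(f"intercept = {a:.2f}, slope = {b:.2f}, SSE = {np.sum(residuals**2):.2f}")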
What does the line estimate?
When looking to summarize the relationship between predictors \(X\) and a response \(Y\), we are interested in knowing the relationship in the population. The only way we could ever know it, though, is to be able to collect data on everybody in the population — most often an impossible task. We have to rely on taking and using a sample of data from the population to estimate the population regression line.
Therefore, several assumptions about the population are required in order to use the sample regression line to estimate the population regression line — this is statistical inference in a regression context.
Regression assumptions:
1. There is a population regression line. It joins the means of the dependent variable for all values of the explanatory variables; the mean of the errors is zero.
2. For any values of the explanatory variables, the variance (or standard deviation) of the dependent variable is a constant, the same for all such values.
3. For any values of the explanatory variables, the dependent variable is normally distributed.
4. The errors are probabilistically independent.
These assumptions represent an idealization of reality. From a practical point of view, if they represent a close approximation to reality, then the analysis is valid. But if the assumptions are grossly violated, statistical inferences that are based on these assumptions should be viewed with suspicion.
First Assumption¶
Let \(Y\) be the dependent variable and let there be \(k\) explanatory variables \(X_1\) through \(X_k\). The assumption implies that, for any fixed values of the \(X\)’s, the mean of all corresponding \(Y\)’s is an exact linear function of the \(X\)’s.
[2]:
interact(draw_sample,
flag=widgets.Checkbox(value=False,description='Show Sample',disabled=False),
id=widgets.Dropdown(options=range(1,4),value=1,description='Sample ID:',disabled=False));
Suppose we have population as shown in the plot. For example, \(X\) is the years of college education and \(Y\) is the annual salary. Given years of education, there is a range of possible salaries for different individuals.
The first assumption says, for each \(X\), the mean of \(Y\)s (\(=\overline{Y}\)) lies on the maroon line \(\overline{Y} = 3.08 X + 9.87\), which is the population regression line joining means. For each data point, its \(Y\) value is expressed by the population regression line with error \(Y = 3.08 X + 9.87 + \epsilon\) where \(\epsilon\) is the deviation of the data point to the maroon line.
Note that since the population is not accessible, the population regression (maroon) line is not observable in reality.
[3]:
interact(draw_sample,
flag=widgets.Checkbox(value=True,description='Show Sample',disabled=False),
id=widgets.Dropdown(options=range(1,4),value=1,description='Sample ID:',disabled=False));
The estimated regression line, shown in orange, is based on the sample data and is usually different from the population regression line. The regression lines derived from the first two samples deviate substantially from the population regression line.
Note that an error \(\epsilon\) is not quite the same as a residual \(e\). An error is the vertical distance from a point to the (unobservable) population regression line. A residual is the vertical distance from a point to the estimated regression line.
Second Assumption¶
The second assumption requires constant error variance; the technical term is homoscedasticity. This assumption is often questionable because the variation in \(Y\) often increases as \(X\) increases. For example, the variation of spending increases as salary increases. We say the data exhibit heteroscedasticity (nonconstant error variance) if the variability of \(Y\) values is larger for some \(X\) values than for others. The easiest way to detect nonconstant error variance is through a visual inspection of a scatterplot.
It is usually sufficient to “visually” interpret a residuals versus fitted values plot. However, there are several hypothesis tests on the residuals for checking constant variance.
Brown-Forsythe test (Modified Levene Test)
Cook-Weisberg score test (Breusch-Pagan Test)
Bartlett’s Test
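As a sketch of how one of these tests might be run in Python (synthetic data; het_breuschpagan is the statsmodels implementation of the Breusch-Pagan score test):
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic data whose error spread grows with x (heteroscedasticity)
rng = np.random.default_rng(1)
x = np.linspace(1, 10, 100)
y = 2 + 3 * x + rng.normal(scale=x, size=x.size)
res = ols("y ~ x", data=pd.DataFrame({"x": x, "y": y})).fit()

# Breusch-Pagan test: a small p-value suggests nonconstant error variance
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(res.resid, res.model.exog)
print(f"Breusch-Pagan LM statistic = {lm_stat:.2f}, p-value = {lm_pvalue:.4f}")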
Third Assumption¶
The third assumption is equivalent to stating that the errors are normally distributed. We can check this by forming a histogram (or a Q-Q plot) of the residuals.
If the assumption holds, the histogram should be approximately symmetric and bell-shaped, and the points of a Q-Q plot should be close to a 45 degree line.
If there is an obvious skewness or some other nonnormal property, this indicates a violation of assumption 3.
[4]:
interact(draw_qq,
dist=widgets.Dropdown(options=['normal','student_t','uniform','triangular'],
value='student_t',description='Distribution:',disabled=False));
The plot shows the Q-Q plot and histogram for four distributions: normal, Student’s t, uniform, and triangular.
Besides graphical method for assessing residual normality, there are several hypothesis tests in which the null hypothesis is that the errors have a normal distribution, such as
Kolmogorov-Smirnov Test
Anderson-Darling Test
Shapiro-Wilk Test
Ryan-Joiner Test
If those tests are available in software, then a large p-value indicates failure to reject this null hypothesis.
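A minimal sketch of how a few of these tests can be applied with scipy.stats (here resid is a synthetic stand-in; in practice it would be the residuals of a fitted model, e.g. result.resid):
import numpy as np
from scipy import stats

# Synthetic stand-in for model residuals
rng = np.random.default_rng(2)
resid = rng.normal(size=100)

# Shapiro-Wilk: a large p-value means we fail to reject normality
print(stats.shapiro(resid))
# Kolmogorov-Smirnov on standardized residuals against the standard normal
print(stats.kstest((resid - resid.mean()) / resid.std(ddof=1), "norm"))
# Anderson-Darling: compare the statistic to the listed critical values
print(stats.anderson(resid, dist="norm"))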
Fourth Assumption¶
The fourth assumption requires probabilistic independence of the errors.
This assumption means that information on some of the errors provides no information on the values of the other errors. However, when the sample data have been collected over time and the regression model fails to effectively capture any time trends, the random errors in the model are often positively correlated over time. This phenomenon is known as autocorrelation (or serial correlation) and can sometimes be detected by plotting the model residuals versus time.
For cross-sectional data, this assumption is usually taken for granted.
For time-series data, this assumption is often violated because of autocorrelation.
The Durbin-Watson statistic is one measure of autocorrelation and thus measures the extent to which assumption 4 is violated.
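A small sketch of the Durbin-Watson statistic (statsmodels’ durbin_watson function applied to synthetic residual series): values near 2 indicate little autocorrelation, values well below 2 indicate positive autocorrelation.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
e = rng.normal(size=200)

# Build a positively autocorrelated residual series for comparison
resid_ar = np.empty_like(e)
resid_ar[0] = e[0]
for t in range(1, e.size):
    resid_ar[t] = 0.8 * resid_ar[t - 1] + e[t]

print("independent residuals:   ", durbin_watson(e))        # close to 2
print("autocorrelated residuals:", durbin_watson(resid_ar))  # well below 2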
Summary¶
We now summarize the four assumptions that underlie the linear regression model:
The mean of the response is a Linear function of the explanatory variables.
The errors are Independent.
The errors are Normally distributed.
The errors have Equal variances.
The first letters that are highlighted spell “LINE”.
Regression Estimation¶
We consider a simple example to illustrate how to use the Python package statsmodels to perform regression analysis and predictions.
Case: Manufacture Costs¶
A manufacturer wants to get a better understanding of its costs. We consider two variables: machine hours and production runs. A production run can produce a large batch of products. Once a production run is completed, the factory must set up for the next production run. Therefore, the costs include fixed costs associated with each production run and variable costs based on machine hours.
A scatterplot (with regression line and mean confidence interval) and a time series plot are presented:
[5]:
df = pd.read_csv(dataurl+'costs.csv', header=0, index_col='month')
def draw_scatterplot(x, y):
fig, axes = plt.subplots(1, 2, figsize = (12,6))
sns.regplot(x=df[x], y=df[y], data=df, ax=axes[0])
axes[0].set_title('Scatterplot: X={} Y={}'.format(x,y))
df[y].plot(ax=axes[1], xlim=(0,40))
axes[1].set_title('Time series plot: {}'.format(y))
plt.show()
interact(draw_scatterplot,
x=widgets.Dropdown(options=df.columns,value=df.columns[0],
description='X:',disabled=False),
y=widgets.Dropdown(options=df.columns,value=df.columns[2],
description='Y:',disabled=False));
g=sns.pairplot(df, height=3);
for i, j in zip(*np.triu_indices_from(g.axes, 1)):
g.axes[i, j].set_visible(False)
toggle()
[5]:
We regress cost against production runs with the formula
\begin{equation*} \text{cost} = \beta_0 + \beta_1 \times \text{production\_runs}, \end{equation*}
which is a simple regression because there is only one explanatory variable.
[6]:
df = pd.read_csv(dataurl+'costs.csv', header=0, index_col='month')
result = analysis(df, 'cost', ['production_runs'], printlvl=5)
fig = plt.figure(figsize=(15,8))
fig = sm.graphics.plot_regress_exog(result, "production_runs", fig=fig);
Dep. Variable: | cost | R-squared: | 0.271 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.250 |
Method: | Least Squares | F-statistic: | 12.64 |
Date: | Wed, 14 Aug 2019 | Prob (F-statistic): | 0.00114 |
Time: | 08:54:41 | Log-Likelihood: | -379.62 |
No. Observations: | 36 | AIC: | 763.2 |
Df Residuals: | 34 | BIC: | 766.4 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 7.561e+04 | 6808.611 | 11.104 | 0.000 | 6.18e+04 | 8.94e+04 |
production_runs | 655.0707 | 184.275 | 3.555 | 0.001 | 280.579 | 1029.562 |
Omnibus: | 0.597 | Durbin-Watson: | 2.090 |
---|---|---|---|
Prob(Omnibus): | 0.742 | Jarque-Bera (JB): | 0.683 |
Skew: | -0.264 | Prob(JB): | 0.711 |
Kurtosis: | 2.580 | Cond. No. | 160. |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
standard error of estimate:9457.23946
ANOVA Table:
sum_sq | df | F | PR(>F) | |
---|---|---|---|---|
production_runs | 1.130248e+09 | 1.0 | 12.637029 | 0.001135 |
Residual | 3.040939e+09 | 34.0 | NaN | NaN |
KstestResult(statistic=0.5, pvalue=7.573316153566455e-09)
We perform a multiple regression of cost against both explanatory variables (production runs and machine hours) using the formula
\begin{equation*} \text{cost} = \beta_0 + \beta_1 \times \text{production\_runs} + \beta_2 \times \text{machine\_hours}. \end{equation*}
[7]:
df = pd.read_csv(dataurl+'costs.csv', header=0, index_col='month')
result = analysis(df, 'cost', ['production_runs','machine_hours'], printlvl=5)
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_partregress_grid(result, fig=fig);
fig = plt.figure(figsize=(10,8))
fig = sm.graphics.influence_plot(result,criterion="cook", fig=fig);
Dep. Variable: | cost | R-squared: | 0.866 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.858 |
Method: | Least Squares | F-statistic: | 107.0 |
Date: | Wed, 14 Aug 2019 | Prob (F-statistic): | 3.75e-15 |
Time: | 08:54:44 | Log-Likelihood: | -349.07 |
No. Observations: | 36 | AIC: | 704.1 |
Df Residuals: | 33 | BIC: | 708.9 |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 3996.6782 | 6603.651 | 0.605 | 0.549 | -9438.551 | 1.74e+04 |
production_runs | 883.6179 | 82.251 | 10.743 | 0.000 | 716.276 | 1050.960 |
machine_hours | 43.5364 | 3.589 | 12.129 | 0.000 | 36.234 | 50.839 |
Omnibus: | 3.142 | Durbin-Watson: | 1.313 |
---|---|---|---|
Prob(Omnibus): | 0.208 | Jarque-Bera (JB): | 2.259 |
Skew: | -0.609 | Prob(JB): | 0.323 |
Kurtosis: | 3.155 | Cond. No. | 1.42e+04 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.42e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
standard error of estimate:4108.99309
ANOVA Table:
sum_sq | df | F | PR(>F) | |
---|---|---|---|---|
production_runs | 1.948557e+09 | 1.0 | 115.409712 | 2.611395e-12 |
machine_hours | 2.483773e+09 | 1.0 | 147.109602 | 1.046454e-13 |
Residual | 5.571662e+08 | 33.0 | NaN | NaN |
KstestResult(statistic=0.5833333333333333, pvalue=3.6048681911998855e-12)
Statistical Inferences¶
The regression output includes statistics for the regression model:
\(R^2\) (coefficient of determination) is an important measure of the goodness of fit of the least squares line. The formula for \(R^2\) is
\begin{equation*} R^2 = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \overline{Y})^2}, \end{equation*}
where \(\overline{Y}\) is the mean of the observed data with \(\overline{Y} = \frac{1}{n} \sum Y_i\) and \(\hat{Y}_i = a + b X_i\) is the fitted value.
It is the percentage of variation of the dependent variable explained by the regression.
It always ranges between 0 and 1. The better the linear fit is, the closer \(R^2\) is to 1.
\(R^2\) is the square of the multiple R, i.e., the correlation between the actual \(Y\) values and the fitted \(\hat{Y}\) values.
\(R^2\) can only increase when extra explanatory variables are added to an equation.
If \(R^2\) = 1, all of the data points fall perfectly on the regression line. The predictor \(X\) accounts for all of the variation in \(Y\).
If \(R^2\) = 0, the estimated regression line is perfectly horizontal. The predictor \(X\) accounts for none of the variation in \(Y\).
Adjusted \(R^2\) is an alternative measure that adjusts \(R^2\) for the number of explanatory variables in the equation. It is used primarily to monitor whether extra explanatory variables really belong in the equation.
Standard error of estimate (\(s_e\)), also known as the regression standard error or the residual standard error, provides a good indication of how useful the regression line is for predicting \(Y\) values from \(X\) values.
The \(F\) statistic reports the result of the \(F\)-test for the overall fit, where the null hypothesis is that all coefficients of the explanatory variables are zero and the alternative is that at least one of these coefficients is not zero. It is possible, because of multicollinearity, to have all small \(t\)-values even though the variables as a whole have significant explanatory power.
Also, for each explanatory variable:
std err: reflects the level of accuracy of the coefficient estimates; the lower it is, the higher the accuracy.
\(t\)-values for the individual regression coefficients, which is \(t\)-\(\text{value} = \text{coefficient}/\text{std err}\).
\(P >|t|\) is the p-value. A p-value of less than 0.05 is conventionally taken to indicate statistical significance.
When the p-value of a coefficient is greater than or equal to 0.05, we don’t reject the null hypothesis \(H_0\): the coefficient is zero. The following two realities are possible:
We committed a Type II error. That is, in reality the coefficient is nonzero and our sample data just didn’t provide enough evidence.
There is not much of a linear relationship between explanatory and dependent variables, but they could have nonlinear relationship.
When the p-value of a coefficient is less than 0.05, we reject the null hypothesis \(H_0\): the coefficient is zero. The following two realities are possible:
We committed a Type I error. That is, in reality the coefficient is zero and our sample just happened to suggest otherwise.
There is indeed a linear relationship, but a nonlinear function would fit the data even better.
The confidence interval columns \([0.025, 0.975]\) give the range in which each coefficient is estimated to fall with 95% confidence.
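As a sketch of where these quantities live on a fitted statsmodels result (synthetic data; the attribute names below are standard OLSResults properties):
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Fit a small model on synthetic data
rng = np.random.default_rng(4)
x = rng.normal(size=50)
y = 1 + 2 * x + rng.normal(size=50)
res = ols("y ~ x", data=pd.DataFrame({"x": x, "y": y})).fit()

print("R-squared:            ", res.rsquared)
print("Adjusted R-squared:   ", res.rsquared_adj)
print("Std error of estimate:", np.sqrt(res.mse_resid))
print("F-statistic, p-value: ", res.fvalue, res.f_pvalue)
print(res.params)                 # coefficients
print(res.bse)                    # std err of each coefficient
print(res.tvalues)                # t-values
print(res.pvalues)                # p-values (P>|t|)
print(res.conf_int(alpha=0.05))   # 95% confidence intervals [0.025, 0.975]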
R-squared¶
Since \(R^2\) is the square of a correlation, it quantifies the strength of a linear relationship. In other words, it is possible that \(R^2 = 0\%\), suggesting there is no linear relation between \(X\) and \(Y\), and yet a perfect curved (or “curvilinear”) relationship exists.
[8]:
x = np.linspace(-5, 5, 10)
y = x*x
df = pd.DataFrame({'X':x,'Y':y})
result = ols(formula="Y ~ X", data=df).fit()
fig, axes = plt.subplots(1,1,figsize=(5,4))
_ = sns.regplot(x=df.X, y=df.Y, data=df, ci=None, ax=axes)
plt.ylim(-5, 30)
axes.text(0.6, 28, 'R-squared = %0.2f' % result.rsquared)
axes.text(0.6, 25, 'Correlation = %0.2f' % np.sqrt(result.rsquared))
plt.show()
toggle()
[8]:
\(R^2\) can be greatly affected by just one data point (or a few data points).
[9]:
x = np.linspace(-5, 5, 5)
y = -1.5*x + 1*np.random.normal(size=5)
y[4] = 50
df = pd.DataFrame({'X':x,'Y':y})
res = ols(formula="Y ~ X", data=df).fit()
fig, axes = plt.subplots(1,2,figsize=(12,6))
_ = sns.regplot(x=df.X, y=df.Y, data=df, ci=None, ax=axes[0])
axes[0].text(-4, 48, 'R-squared = %0.2f' % res.rsquared)
axes[0].text(-4, 45, 'Correlation = %0.2f' % np.sqrt(res.rsquared))
xx = x[:4]
yy = y[:4]
df = pd.DataFrame({'X':xx,'Y':yy})
res = ols(formula="Y ~ X", data=df).fit()
_ = sns.regplot(x=df.X, y=df.Y, data=df, ci=None, ax=axes[1])
axes[1].text(-4, 48, 'R-squared = %0.2f' % res.rsquared)
axes[1].text(-4, 45, 'Correlation = %0.2f' % np.sqrt(res.rsquared))
axes[1].plot(x[4], y[4], 'bo')
plt.show()
toggle()
[9]:
A large \(R^2\) value should not be interpreted as meaning that the estimated regression line fits the data well. Another function might better describe the trend in the data.
[10]:
x = np.linspace(1, 5, 20)
y = np.power(x,3)
df = pd.DataFrame({'X':x,'Y':y})
res = ols(formula="Y ~ X", data=df).fit()
fig, axes = plt.subplots(1,1,figsize=(6,6))
_ = sns.regplot(x=df.X, y=df.Y, data=df, ci=None, ax=axes)
axes.text(1, 120, 'R-squared = %0.2f' % res.rsquared)
plt.title('Data generated by equation $Y = X^3$')
plt.show()
toggle()
[10]:
Because statistical significance does not imply practical significance, a large \(R^2\) value does not imply that the regression coefficients are meaningfully different from 0. Nor does a large \(R^2\) value necessarily mean the regression model provides good predictions: it is still possible to get prediction intervals or confidence intervals that are too wide to be useful.
Residual Plot¶
A residual plot checks whether the regression model is appropriate for a dataset. The regression is fitted and then removed, and the residual value of each observation is plotted. Ideally, these values should be randomly scattered around \(y = 0\).
A “Good” Plot¶
Example: The dataset includes measurements of the total lifetime consumption of alcohol on a random sample of n = 50 alcoholic men. We are interested in determining whether or not alcohol consumption was linearly related to muscle strength.
[11]:
df = pd.read_csv(dataurl+'alcoholarm.txt', delim_whitespace=True, header=0)
result = analysis(df, 'strength', ['alcohol'], printlvl=2)
The scatter plot (on the left) suggests a decreasing linear relationship between alcohol and arm strength. It also suggests that there are no unusual data points in the data set.
This residual (residuals vs. fits) plot is a good example of one suggesting that the simple linear regression model is appropriate:
The residuals “bounce randomly” around the 0 line. This suggests that the assumption that the relationship is linear is reasonable.
The residuals roughly form a “horizontal band” around the 0 line. This suggests that the variances of the error terms are equal.
No one residual “stands out” from the basic random pattern of residuals. This suggests that there are no outliers.
In general, you expect your residual vs. fits plots to look something like the plot above. Interpreting these plots is subjective, and many first-time users tend to over-interpret them, treating every twist and turn as something potentially troublesome. We should not put too much weight on residual plots based on small data sets, because the sample size may be too small to make interpretation of the plot worthwhile.
Nonlinear Relationship¶
Example: Is tire tread wear linearly related to mileage? As a result of the experiment, the researchers obtained a data set containing the mileage (in 1000 miles) driven and the depth of the remaining groove (in mils).
[12]:
df = pd.read_csv(dataurl+'treadwear.txt', sep='\s+', header=0)
result = analysis(df, 'groove', ['mileage'], printlvl=2)
Non-constant Error Variance¶
Example: How is plutonium activity related to alpha particle counts? Plutonium emits subatomic particles — called alpha particles. Devices used to detect plutonium record the intensity of alpha particle strikes in counts per second. The dataset contains 23 samples of plutonium.
[13]:
df = pd.read_csv(dataurl+'alphapluto.txt', sep='\s+', header=0)
result = analysis(df, 'alpha', ['pluto'], printlvl=2)
[14]:
df = pd.read_csv(dataurl+'realestate.txt', sep='\s+', header=0)
result = analysis(df, 'SalePrice', ['SqFeet'], printlvl=2)
Outliers¶
[15]:
df = pd.read_csv(dataurl+'anscombe.txt', sep='\s+', header=0)
result = analysis(df, 'y3', ['x3'], printlvl=2)
An alternative to the residuals vs. fits plot is a residuals vs. predictor plot. It is a scatter plot of residuals on the y axis and the predictor values on the \(x\)-axis. The interpretation of a residuals vs. predictor plot is identical to that for a residuals vs. fits plot.
Normal Q-Q Plot of Residuals¶
A normal Q-Q plot shows whether the residuals are normally distributed: if they are, the points follow a straight line closely rather than deviating severely from it.
A “Good” Plot¶
[16]:
df = pd.read_csv(dataurl+'alcoholarm.txt', sep='\s+', header=0)
result = analysis(df, 'strength', ['alcohol'], printlvl=3)
KstestResult(statistic=0.3217352865304308, pvalue=4.179928356341629e-05)
More Examples¶
Example 1: A Q-Q plot of the residuals after a simple regression model is fitted to ‘time to next eruption’ against ‘duration of last eruption’ for Old Faithful geyser, which was named for its frequent and somewhat predictable eruptions. Old Faithful is located in Yellowstone’s Upper Geyser Basin in the southwest section of the park.
[17]:
df = pd.read_csv(dataurl+'oldfaithful.txt', delim_whitespace=True, header=0)
result = analysis(df, 'waiting', ['eruption'], printlvl=3)
KstestResult(statistic=0.3748566542043297, pvalue=7.87278404247441e-35)
How to interpret?
[18]:
hide_answer()
[18]:
The histogram is roughly bell-shaped so it is an indication that it is reasonable to assume that the errors have a normal distribution.
The pattern of the normal probability plot is straight, so this plot also provides evidence that it is reasonable to assume that the errors have a normal distribution.
\(\blacksquare\)
Example 2: The data contains ‘infection risk in a hospital’ and ‘average length of stay in the hospital’.
[19]:
df = pd.read_csv(dataurl+'infection12.txt', delim_whitespace=True, header=0)
result = analysis(df, 'InfctRsk', ['Stay'], printlvl=3)
KstestResult(statistic=0.11353657481026463, pvalue=0.39524100803555084)
How to interpret?
[20]:
hide_answer()
[20]:
The plot shows some deviation from the straight-line pattern indicating a distribution with heavier tails than a normal distribution. Also, the residual plot shows two possible outliers.
\(\blacksquare\)
Example: A stopping distance dataset with the stopping distance of a car and the speed of the car when the brakes were applied.
[21]:
df = pd.read_csv(dataurl+'carstopping.txt', delim_whitespace=True, header=0)
result = analysis(df, 'StopDist', ['Speed'], printlvl=3)
KstestResult(statistic=0.49181308907235927, pvalue=1.3918967453080744e-14)
How to interpret?
[22]:
hide_answer()
[22]:
Fitting a simple linear regression model to these data leads to problems with both curvature and nonconstant variance. The nonlinearity is reasonable due to physical laws. One possible remedy is to transform variable(s).
\(\blacksquare\)
Advice to Managers¶
An old statistics joke
A physicist, an engineer, and a statistician are on a hunting trip. They are walking through the woods when they spot a deer in the clearing.
The physicist calculates the distance to the target, the velocity and drop of the bullet, adjusts, and fires, missing the deer by five feet to the left.
The engineer looks frustrated. “You forgot to account for the wind. Give it here.” After licking a finger to determine the wind speed and direction, the engineer snatches the rifle and fires, missing the deer by five feet to the right.
Suddenly, without firing a shot, the statistician cheers, “Woo hoo! We got it!”
[23]:
hide_answer()
[23]:
Being precisely perfect on average can mean being actually wrong each time. Regression can keep missing several feet to the left or several feet to the right. Even if it averages out to the correct answer, regression can mean never actually hitting the target.
Unlike regression, machine learning predictions might be wrong on average, but when the predictions miss, they often don’t miss by much. Statisticians describe this as allowing some bias in exchange for reducing variance.
\(\blacksquare\)
Correlation vs Causation
Correlation is not causation. Recall the sales prediction case. It’s easy to say that there is a correlation between rain and monthly sales as shown by the regression. However, unless you are selling umbrellas, it might be difficult to prove that there is cause and effect.
Petabytes allow us to say “correlation is enough” and causality is dead (Chris Anderson). Given enough statistical evidence, it is no longer necessary to understand why things happen; we need only know which things happen together.
What do you think?
[24]:
hide_answer()
[24]:
Correlation is not causation. When you see a correlation from a regression analysis, you cannot make assumptions. You have to go out and see what is happening in the real world: what is the physical mechanism causing the relationship?
For example, go out and observe consumers buying your product in the rain, talk to them and find out what is actually causing them to make the purchase. The goal is NOT to figure out what is going on in the data but to figure out what is going on in the world.
Correlation is enough. Consider the beer-and-diapers story. Some time ago, Wal-Mart ran queries on its point-of-sale systems, covering 1.2 million baskets in all. Many correlations appeared, and some of these were obvious.
However, one correlation stood out like a sore thumb because it was so unexpected. Those queries revealed that, between 5pm and 7pm, customers tended to co-purchase beer and diapers. By moving these two items closer together, Wal-Mart reportedly saw the sales of both items increase geometrically.
The key question is “Can we take action on the basis of a correlation finding?” The answer depends primarily on two factors:
Confidence that the correlation will reliably recur in the future. The higher that confidence level, the more reasonable it is to take action in response.
The tradeoff between the risk and reward of acting. If the risk of acting and being wrong is extremely high, for example, acting on even a strong correlation may be a mistake.
\(\blacksquare\)
How do managers use it?¶
[25]:
hide_answer()
[25]:
Regression analysis is the go-to method in analytics. It helps us figure out what we can do. Managers often use it to
explain a phenomenon they want to understand (e.g. why did customer service calls drop last month?)
predict things about the future (e.g. what will sales look like over the next six months?)
decide what to do (e.g. should we go with this promotion or a different one?)
\(\blacksquare\)
What mistakes do managers make?¶
[26]:
hide_answer()
[26]:
Managers need to scope the problem. It is a manager’s job to identify the factors that may have an impact and to ask the analyst to look at those. If you ask a data analyst to tell you something you don’t know, then you deserve what you get, which is bad analysis. It’s the same principle as flipping a coin: if you flip it enough times, you will eventually think you see something interesting, like a bunch of heads all in a row.
Regression analysis is very sensitive to bad data, so be careful about how the data are collected and whether to act on them.
Don’t make the mistake of ignoring the error term. If the regression explains 90% of the relationship, that is great. But if it explains 10%, and you act like it is 90%, that is not good.
Don’t let data replace your intuition and judgment.
\(\blacksquare\)
Regression Analysis¶
We consider a simple example to illustrate how to use the Python package statsmodels to perform regression analysis and predictions.
Influential Points¶
An influential point is an outlier that greatly affects the slope of the regression line.
An outlier is a data point whose response y does not follow the general trend of the rest of the data.
A data point has high leverage if it has “extreme” predictor x values. The lack of neighboring observations means that the fitted regression model will pass close to that particular observation.
Example 1:
[27]:
df = pd.read_csv(dataurl+'influence1.txt', delim_whitespace=True, header=0)
result = analysis(df, 'y', ['x'], printlvl=1)
fig, ax = plt.subplots(figsize=(10,6), dpi=80)
fig = sm.graphics.influence_plot(result, alpha = 0.05, ax = ax, criterion="cooks")
Example 2:
[28]:
df = pd.read_csv(dataurl+'influence2.txt', delim_whitespace=True, header=0)
result = analysis(df, 'y', ['x'], printlvl=1)
fig, ax = plt.subplots(figsize=(10,6), dpi=80)
fig = sm.graphics.influence_plot(result, alpha = 0.05, ax = ax, criterion="cooks")
Example 3:
[29]:
df = pd.read_csv(dataurl+'influence3.txt', delim_whitespace=True, header=0)
result = analysis(df, 'y', ['x'], printlvl=1)
fig, ax = plt.subplots(figsize=(10,6), dpi=80)
fig = sm.graphics.influence_plot(result, alpha = 0.05, ax = ax, criterion="cooks")
Example 4:
[30]:
df = pd.read_csv(dataurl+'influence4.txt', delim_whitespace=True, header=0)
result = analysis(df, 'y', ['x'], printlvl=1)
fig, ax = plt.subplots(figsize=(10,6), dpi=80)
fig = sm.graphics.influence_plot(result, alpha = 0.05, ax = ax, criterion="cooks")
In the examples above, do the data contain outliers or high leverage points? Are the identified data points influential?
[31]:
hide_answer()
[31]:
If an observation has a studentized residual that is larger than 3 (in absolute value), we can call it an outlier.
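A sketch of how such points can be flagged from a fitted statsmodels result (here result refers to a fitted OLS result like those in the cells above):
import numpy as np

# Influence measures from a fitted OLS result
influence = result.get_influence()
studentized = influence.resid_studentized_external   # externally studentized residuals
leverage = influence.hat_matrix_diag                 # leverage (hat) values

print("possible outliers (|studentized residual| > 3):", list(np.where(np.abs(studentized) > 3)[0]))
print("highest-leverage observation:", int(np.argmax(leverage)), "with leverage", leverage.max())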
Example 1:
All of the data points follow the general trend of the rest of the data, so there are no outliers (in the \(Y\) direction). And, none of the data points are extreme with respect to \(X\), so there are no high leverage points. Overall, none of the data points would appear to be influential with respect to the location of the best fitting line.
Example 2:
There is one outlier, but no high leverage points. An easy way to determine if the data point is influential is to find the best fitting line twice — once with the red data point included and once with the red data point excluded. The data point is not influential.
Example 3:
The data point is not influential, nor is it an outlier, but it does have high leverage.
Example 4:
The data point is deemed both high leverage and an outlier, and it turned out to be influential too.
\(\blacksquare\)
Partial Regression Plots¶
For a particular independent variable \(X_i\), a partial regression plot shows:
residuals from regressing \(Y\) (the response variable) against all the independent variables except \(X_i\)
residuals from regressing \(X_i\) against the remaining independent variables.
It has a slope that is the same as the multiple regression coefficient for that predictor and also has the same residuals as the full multiple regression. So we can spot any outliers or influential points and tell whether they’ve affected the estimation of this particular coefficient. For the same reasons that we always look at a scatterplot before interpreting a simple regression coefficient, it’s a good idea to make a partial regression plot for any multiple regression coefficient that we hope to understand or interpret.
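The following sketch builds a partial regression plot for one predictor “by hand” on synthetic data, following the definition above (in practice, sm.graphics.plot_partregress_grid, used earlier in this notebook, produces these plots directly):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols

# Synthetic data with two correlated predictors
rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)
dfp = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

# Residuals of y and of x1, each regressed on the remaining predictor(s)
res_y = ols("y ~ x2", data=dfp).fit()
res_x = ols("x1 ~ x2", data=dfp).fit()

plt.scatter(res_x.resid, res_y.resid)
plt.xlabel("x1 residuals")
plt.ylabel("y residuals")
plt.title("Partial regression plot for x1")
plt.show()

# The slope through this cloud equals the x1 coefficient in the full regression
print(ols("y ~ x1 + x2", data=dfp).fit().params["x1"])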
ANOVA¶
Consider the heart attacks in rabbits study with variables:
\(\text{Infarc}\) = the infarcted area (in grams) of a rabbit
\(\text{Area}\) = the region at risk (in grams) of a rabbit
\(X2\) = 1 if early cooling of a rabbit, 0 if not
\(X3\) = 1 if late cooling of a rabbit, 0 if not
The regression formula is
\begin{equation*} \text{Infarc} = \beta_0 + \beta_1 \times \text{Area} + \beta_2 \times \text{X2} + \beta_3 \times \text{X3} \end{equation*}
[32]:
df = pd.read_csv(dataurl+'coolhearts.txt', delim_whitespace=True, header=0)
result = analysis(df, 'Infarc', ['Area','X2','X3'], printlvl=0)
sm.stats.anova_lm(result, typ=2)
[32]:
sum_sq | df | F | PR(>F) | |
---|---|---|---|---|
Area | 0.637420 | 1.0 | 32.753601 | 0.000004 |
X2 | 0.297327 | 1.0 | 15.278050 | 0.000536 |
X3 | 0.019814 | 1.0 | 1.018138 | 0.321602 |
Residual | 0.544910 | 28.0 | NaN | NaN |
sum_sq[Residual] is the sum of squared residuals;
sum_sq[Area] is the sum_sq[Residual] of the regression Infarc ~ X2 + X3 (i.e., the regression without Area) minus the sum_sq[Residual] of the full regression;
df[Residual] is the degrees of freedom \(=\) total number of observations (result.nobs) \(-\) the number of explanatory variables (3, or result.df_model) \(- 1 =\) result.df_resid.
ANOVA (analysis of variance) table for regression can be used to test hypothesis:
\begin{align} H_0:& \text{ a subset of explanatory variables have $0$ slope} \nonumber\\ H_a:& \text{ at least one variable in the subset of explanatory variables have nonzero slope} \nonumber \end{align}
Example 1: \begin{align} H_0:& \quad \beta_i = 0 ~ \forall i \in \{1\} \nonumber\\ H_a:& \quad \beta_i \neq 0 \text{ for at least one } i \in \{1\}. \nonumber \end{align}
[33]:
df = pd.read_csv(dataurl+'coolhearts.txt', delim_whitespace=True, header=0)
_ = GL_Ftest(df, 'Infarc', ['Area','X2','X3'], ['X2','X3'], 0.05)
The F-value is 32.754 and p-value is 0.000. Reject null hypothesis.
At least one variable in {'Area'} has nonzero slope.
It shows that the size of the infarct is significantly related to the size of the area at risk. When a single coefficient is tested, the result of such an F-test is equivalent to the corresponding t-test result.
Example 2: \begin{align} H_0:& \quad \beta_i = 0 ~ \forall i \in \{1,2,3\} \nonumber\\ H_a:& \quad \beta_i \neq 0 \text{ for at least one } i \in \{1,2,3\}. \nonumber \end{align}
[34]:
df = pd.read_csv(dataurl+'coolhearts.txt', delim_whitespace=True, header=0)
_ = GL_Ftest(df, 'Infarc', ['Area','X2','X3'], [], 0.05)
The F-value is 16.431 and p-value is 0.000. Reject null hypothesis.
At least one variable in {'Area', 'X3', 'X2'} has nonzero slope.
This is also called a test for the overall fit. The F-statistic and its p-value are also available from the fvalue and f_pvalue properties of the regression result in statsmodels.
Example 3: \begin{align} H_0:& \quad \beta_i = 0 ~ \forall i \in \{2,3\} \nonumber\\ H_a:& \quad \beta_i \neq 0 \text{ for at least one } i \in \{2,3\}. \nonumber \end{align}
[35]:
df = pd.read_csv(dataurl+'coolhearts.txt', delim_whitespace=True, header=0)
_ = GL_Ftest(df, 'Infarc', ['Area','X2','X3'], ['Area'], 0.05)
The F-value is 8.590 and p-value is 0.001. Reject null hypothesis.
At least one variable in {'X3', 'X2'} has nonzero slope.
Let \(K \subseteq N\) be a subset of the explanatory variables. We have the two hypotheses \begin{align} H_0:& \quad \beta_i = 0 ~ \forall i \in K \nonumber\\ H_a:& \quad \beta_i \neq 0 \text{ for at least one } i \in K. \nonumber \end{align}
We consider two models \begin{align} \text{full model:} & \quad Y = \beta_0 + \sum_{i \in N} \beta_i X_i \nonumber\\ \text{reduced model:} & \quad Y = \beta_0 + \sum_{i \in N \setminus K} \beta_i X_i. \nonumber\\ \end{align}
Let \(SSE(F)\) and \(SSE(R)\) be the sum of squared errors for the full and reduced models respectively where
\begin{equation*} SSE = \text{sum of squared residuals} = \sum e^2_i = \sum (Y_i - \hat{Y}_i)^2. \end{equation*}
The general linear statistic:
\begin{equation*} F^\ast = \frac{SSE(R) - SSE(F)}{df_R - df_F} \div \frac{SSE(F)}{df_F} \end{equation*}
and \(p\)-value is \begin{equation*} 1 - \text{F-dist}(F^\ast, df_R - df_F, df_F). \end{equation*}
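As a sketch, the general linear F-statistic can be computed directly from the full and reduced model fits (this mirrors what the GL_Ftest calls above report; df is assumed to be the coolhearts data frame loaded in the cells above):
from scipy import stats
from statsmodels.formula.api import ols

# Full model and the reduced model that drops X2 and X3 (as in Example 3)
full = ols("Infarc ~ Area + X2 + X3", data=df).fit()
reduced = ols("Infarc ~ Area", data=df).fit()

# General linear F-statistic from the two SSEs and their degrees of freedom
num = (reduced.ssr - full.ssr) / (reduced.df_resid - full.df_resid)
F = num / (full.ssr / full.df_resid)
p_value = stats.f.sf(F, reduced.df_resid - full.df_resid, full.df_resid)
print(f"F = {F:.3f}, p-value = {p_value:.4f}")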
Why don’t we perform a series of individual t-tests? For example, for the rabbit heart attacks study, we could have done the following in Example 3:
Fit the model InfSize ~ Area \(+\) x2 \(+\) x3 and use an individual t-test for x3.
If the test results indicate that we can drop x3, then fit the model InfSize ~ Area \(+\) x2 and use an individual t-test for x2.
[36]:
hide_answer()
[36]:
Note that we’re using two individual t-tests instead of one F-test in the suggested method. It means the chance of drawing an incorrect conclusion in our testing procedure is higher. Recall that every time we do a hypothesis test, we can draw an incorrect conclusion by:
rejecting a true null hypothesis, i.e., make a type 1 error by concluding the tested predictor(s) should be retained in the model, when in truth it/they should be dropped; or
failing to reject a false null hypothesis, i.e., make a type 2 error by concluding the tested predictor(s) should be dropped from the model, when in truth it/they should be retained.
Thus, in general, the fewer tests we perform the better. In this case, this means that wherever possible using one F-test in place of multiple individual t-tests is preferable.
\(\blacksquare\)
Prediction¶
After we have estimated a regression equation from a dataset, we can use this equation to predict the value of the dependent variable for new observations.
[37]:
df = pd.read_csv(dataurl+'costs.csv', header=0, index_col='month')
result = analysis(df, 'cost', ['production_runs','machine_hours'], printlvl=0)
df_pred = pd.DataFrame({'machine_hours':[1430, 1560, 1520],'production_runs': [35,45,40]})
predict = result.get_prediction(df_pred)
predict.summary_frame(alpha=0.05)
[37]:
mean | mean_se | mean_ci_lower | mean_ci_upper | obs_ci_lower | obs_ci_upper | |
---|---|---|---|---|---|---|
0 | 97180.354898 | 697.961311 | 95760.341932 | 98600.367863 | 88700.800172 | 105659.909623 |
1 | 111676.265905 | 1135.822591 | 109365.417468 | 113987.114342 | 103002.948661 | 120349.583149 |
2 | 105516.720354 | 817.257185 | 103853.998110 | 107179.442598 | 96993.161428 | 114040.279281 |
mean is the value calculated based on the (sample) regression line
mean (or confidence) lower and upper limits give the confidence interval for the mean of the dependent variable at the given values of the explanatory variables. In other words, if we run experiments many times with these values of the explanatory variables, we have 95% confidence that the mean of the dependent variable will be in this interval.
obs (or prediction) lower and upper limits give the prediction interval for a single new observation. In other words, if we run one experiment with the given values of the explanatory variables, we have 95% confidence that the value of the dependent variable will be in this interval.
Cross Validation¶
The fit from a regression analysis is often overly optimistic (over-fitted). There is no guarantee that the fit will be as good when the estimated regression equation is applied to new data.
To validate the fit, we can gather new data, predict the dependent variable and compare with known values of the dependent variable. However, most of the time we cannot obtain new independent data to validate our model. An alternative is to partition the sample data into a training set and testing (validation) set.
The simplest approach to cross-validation is to partition the sample observations randomly with 50% of the sample in each set or 60%/40% or 70%/30%.
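A minimal sketch of such a random split with statsmodels (assuming df is the costs data frame loaded as above; the cell below instead uses a separate validation file):
import numpy as np
from statsmodels.formula.api import ols

# Random 70%/30% split of the sample
train = df.sample(frac=0.7, random_state=1)
test = df.drop(train.index)

# Fit on the training set, predict on the held-out set
model = ols("cost ~ production_runs + machine_hours", data=train).fit()
pred = model.predict(test)

r2_val = np.corrcoef(test.cost, pred)[0, 1] ** 2
print(f"Validation R-squared: {r2_val:.3f}")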
[38]:
df = pd.read_csv(dataurl+'costs.csv', header=0, index_col='month')
result = analysis(df, 'cost', ['production_runs','machine_hours'], printlvl=0)
df_validate = pd.read_csv(dataurl+'validation.csv', header=0)
predict = result.get_prediction(df_validate)
df_validate['predict'] = predict.predicted_mean
r2 = np.corrcoef(df_validate.cost,df_validate.predict)[0,1]**2
stderr = np.sqrt(((df_validate.cost - df_validate.predict)**2).sum()/(df_validate.shape[0]-3))  # divide by n - k - 1
print('Validation R-squared:\t\t{:.3f}'.format(r2))
print('Validation StErr of Est:\t{:.3f}'.format(stderr))
Validation R-squared: 0.773
Validation StErr of Est: 5032.718
\(K\)-fold cross-validation can also be used. This partitions the sample dataset into \(K\) subsets with equal size. For each subset, we use it as testing set and the remaining \(K-1\) subsets as training set. We then calculate the sum of squared prediction errors, and combine the \(K\) estimates of prediction error to produce a \(K\)-fold cross-validation estimate.
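A sketch of a manual K-fold cross-validation for the cost model (again assuming df is the costs data frame):
import numpy as np
from statsmodels.formula.api import ols

K = 5
rng = np.random.default_rng(0)
folds = np.array_split(rng.permutation(df.index.values), K)

sse = 0.0
for fold in folds:
    test = df.loc[fold]            # current fold is the testing set
    train = df.drop(fold)          # remaining K-1 folds form the training set
    model = ols("cost ~ production_runs + machine_hours", data=train).fit()
    sse += ((test.cost - model.predict(test)) ** 2).sum()

print(f"{K}-fold CV root mean squared prediction error: {np.sqrt(sse / len(df)):.1f}")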
Multicollinearity¶
Multicollinearity exists when two or more of the predictors in a regression model are moderately or highly correlated with one another. An exact linear relationship among predictors is fairly simple to spot, because regression packages will typically respond with an error message. The more common and serious problem is multicollinearity in which explanatory variables are highly, but not exactly, correlated.
There are two types of multicollinearity:
Structural multicollinearity is a mathematical artifact caused by creating new predictors from other predictors — such as, creating the predictor \(X^2\) from the predictor \(X\).
Data-based multicollinearity, on the other hand, is a result of a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which the data are collected.
The variance inflation factor (VIF) can be used to identify multicollinearity. The VIF for an explanatory variable \(X\) is defined as \(1/(1 - R^2_X)\), where \(R^2_X\) is the usual \(R^2\) value from a regression with that \(X\) as the dependent variable and the other \(X\)’s as the explanatory variables.
[39]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from patsy import dmatrices
df = pd.read_csv(dataurl+'multicollinearity.csv',header=0, index_col='person')
features = "+".join(df.drop(columns=['salary']).columns)
y, X = dmatrices('salary ~' + features, df, return_type='dataframe')
VIF = pd.DataFrame()
VIF["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(1,X.shape[1])]
VIF["Features"] = X.drop(columns=['Intercept']).columns
display(VIF)
def calculate_vif_(X, thresh=5.0):
cols = X.columns
variables = np.arange(X.shape[1])
dropped=True
while dropped:
dropped=False
c = X[cols[variables]].values
vif = [variance_inflation_factor(c, ix) for ix in np.arange(c.shape[1])]
maxloc = vif.index(max(vif))
if max(vif) > thresh:
print('dropping \'' + X[cols[variables]].columns[maxloc] + '\' at index: ' + str(maxloc))
variables = np.delete(variables, maxloc)
dropped=True
print('Remaining variables:')
print(X.columns[variables])
calculate_vif_(X.drop(columns=['Intercept']))
toggle()
VIF Factor | Features | |
---|---|---|
0 | 1.001383 | gender[T.Male] |
1 | 31.503655 | age |
2 | 38.153090 | experience |
3 | 5.594549 | seniority |
dropping 'experience' at index: 2
dropping 'age' at index: 1
Remaining variables:
Index(['gender[T.Male]', 'seniority'], dtype='object')
[39]:
Regression Modeling¶
Categorical Variables¶
The only meaningful way to summarize categorical data is with counts of observations in the categories. We consider a bank’s employee data.
[40]:
df_salary = pd.read_csv(dataurl+'salaries.csv', header=0, index_col='employee')
interact(plot_hist,
column=widgets.Dropdown(options=df_salary.columns[:-1],value=df_salary.columns[0],
description='Column:',disabled=False));
To analyze whether the bank discriminates against females in terms of salary, we consider the regression model
\begin{equation*} \text{salary} = \beta_0 + \beta_1 \times \text{years1} + \beta_2 \times \text{years2} + \beta_3 \times \text{Female}, \end{equation*}
where Female equals 1 for a female employee and 0 for a male employee.
[41]:
df = pd.read_csv(dataurl+'salaries.csv', header=0, index_col='employee')
result = analysis(df, 'salary', ['years1', 'years2', 'C(gender, Treatment(\'Male\'))'], printlvl=5)
Dep. Variable: | salary | R-squared: | 0.492 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.485 |
Method: | Least Squares | F-statistic: | 65.93 |
Date: | Wed, 14 Aug 2019 | Prob (F-statistic): | 7.61e-30 |
Time: | 08:55:10 | Log-Likelihood: | -2164.5 |
No. Observations: | 208 | AIC: | 4337. |
Df Residuals: | 204 | BIC: | 4350. |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 3.549e+04 | 1341.022 | 26.466 | 0.000 | 3.28e+04 | 3.81e+04 |
C(gender, Treatment('Male'))[T.Female] | -8080.2121 | 1198.170 | -6.744 | 0.000 | -1.04e+04 | -5717.827 |
years1 | 987.9937 | 80.928 | 12.208 | 0.000 | 828.431 | 1147.557 |
years2 | 131.3379 | 180.923 | 0.726 | 0.469 | -225.381 | 488.057 |
Omnibus: | 15.654 | Durbin-Watson: | 1.286 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 25.154 |
Skew: | 0.434 | Prob(JB): | 3.45e-06 |
Kurtosis: | 4.465 | Cond. No. | 34.6 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
standard error of estimate:8079.39743
ANOVA Table:
sum_sq | df | F | PR(>F) | |
---|---|---|---|---|
C(gender, Treatment('Male')) | 2.968701e+09 | 1.0 | 45.478753 | 1.556196e-10 |
years1 | 9.728975e+09 | 1.0 | 149.042170 | 4.314965e-26 |
years2 | 3.439940e+07 | 1.0 | 0.526979 | 4.687119e-01 |
Residual | 1.331644e+10 | 204.0 | NaN | NaN |
KstestResult(statistic=0.5240384615384616, pvalue=1.0087139661949263e-53)
The regression is equivalent to fitting the data with two separate regression functions. However, it is usually easier and more efficient to answer research questions concerning the binary predictor variable by using one combined regression function.
[42]:
def update(column):
df = pd.read_csv(dataurl+'salaries.csv', header=0, index_col='employee')
sns.lmplot(x=column, y='salary', data=df, fit_reg=True, hue='gender', legend=True, height=6);
interact(update,
column=widgets.Dropdown(options=['years1', 'years2', 'age', 'education'],value='years1',
description='Column:',disabled=False));
Interaction Variables¶
Interaction with Categorical Variable
Example: consider the salary data with the interaction gender*years1
[43]:
result = ols(formula="salary ~ years1 + C(gender, Treatment('Male'))*years1", data=df_salary).fit()
display(result.summary())
print("\nANOVA Table:\n")
display(sm.stats.anova_lm(result, typ=1))
Dep. Variable: | salary | R-squared: | 0.639 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.633 |
Method: | Least Squares | F-statistic: | 120.2 |
Date: | Wed, 14 Aug 2019 | Prob (F-statistic): | 7.51e-45 |
Time: | 08:55:12 | Log-Likelihood: | -2129.2 |
No. Observations: | 208 | AIC: | 4266. |
Df Residuals: | 204 | BIC: | 4280. |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 3.043e+04 | 1216.574 | 25.013 | 0.000 | 2.8e+04 | 3.28e+04 |
C(gender, Treatment('Male'))[T.Female] | 4098.2519 | 1665.842 | 2.460 | 0.015 | 813.776 | 7382.727 |
years1 | 1527.7617 | 90.460 | 16.889 | 0.000 | 1349.405 | 1706.119 |
C(gender, Treatment('Male'))[T.Female]:years1 | -1247.7984 | 136.676 | -9.130 | 0.000 | -1517.277 | -978.320 |
Omnibus: | 11.523 | Durbin-Watson: | 1.188 |
---|---|---|---|
Prob(Omnibus): | 0.003 | Jarque-Bera (JB): | 15.614 |
Skew: | 0.379 | Prob(JB): | 0.000407 |
Kurtosis: | 4.108 | Cond. No. | 58.3 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
ANOVA Table:
df | sum_sq | mean_sq | F | PR(>F) | |
---|---|---|---|---|---|
C(gender, Treatment('Male')) | 1.0 | 3.149634e+09 | 3.149634e+09 | 67.789572 | 2.150454e-14 |
years1 | 1.0 | 9.726635e+09 | 9.726635e+09 | 209.346370 | 4.078705e-33 |
C(gender, Treatment('Male')):years1 | 1.0 | 3.872606e+09 | 3.872606e+09 | 83.350112 | 6.833545e-17 |
Residual | 204.0 | 9.478232e+09 | 4.646192e+07 | NaN | NaN |
Example: Let’s consider some contrived data
[44]:
np.random.seed(123)
size = 20
x1 = np.linspace(-5,5,size)
group = np.random.randint(0, 2, size=size)
y = 5 + (3-7*group)*x1 + 3*np.random.normal(size=size)
df = pd.DataFrame({'y':y, 'x1':x1, 'group':group})
sns.lmplot(x='x1', y='y', data=df, fit_reg=True, hue='group', legend=True);
result = analysis(df, 'y', ['x1', 'group'], printlvl=4)
Dep. Variable: | y | R-squared: | 0.004 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | -0.113 |
Method: | Least Squares | F-statistic: | 0.03659 |
Date: | Wed, 14 Aug 2019 | Prob (F-statistic): | 0.964 |
Time: | 08:55:13 | Log-Likelihood: | -72.898 |
No. Observations: | 20 | AIC: | 151.8 |
Df Residuals: | 17 | BIC: | 154.8 |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 3.1116 | 3.075 | 1.012 | 0.326 | -3.377 | 9.600 |
x1 | 0.1969 | 0.765 | 0.257 | 0.800 | -1.417 | 1.811 |
group | 0.0733 | 4.667 | 0.016 | 0.988 | -9.773 | 9.920 |
Omnibus: | 0.957 | Durbin-Watson: | 2.131 |
---|---|---|---|
Prob(Omnibus): | 0.620 | Jarque-Bera (JB): | 0.741 |
Skew: | 0.008 | Prob(JB): | 0.690 |
Kurtosis: | 2.057 | Cond. No. | 7.06 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
standard error of estimate:10.04661
KstestResult(statistic=0.44990149282919256, pvalue=0.0003257504098859564)
The scatterplot shows that both explanatory variables x1 and group are related to y. However, the regression results show insufficient evidence to conclude that x1 and group are related to y. The example shows that if we leave an important interaction term out of our model, our analysis can lead us to erroneous conclusions.
Without interaction variables, we assume the effect of each \(X\) on \(Y\) is independent of the value of the other \(X\)’s, which is not the case in this contrived data as shown in the scatterplot.
Now, consider a regression model contains an interaction term between the two predictors.
[45]:
result = analysis(df, 'y', ['x1', 'x1*C(group)'], printlvl=4)
Dep. Variable: | y | R-squared: | 0.897 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.878 |
Method: | Least Squares | F-statistic: | 46.57 |
Date: | Wed, 14 Aug 2019 | Prob (F-statistic): | 3.95e-08 |
Time: | 08:55:14 | Log-Likelihood: | -50.187 |
No. Observations: | 20 | AIC: | 108.4 |
Df Residuals: | 16 | BIC: | 112.4 |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 4.7029 | 1.027 | 4.578 | 0.000 | 2.525 | 6.881 |
C(group)[T.1] | 1.7810 | 1.552 | 1.147 | 0.268 | -1.509 | 5.071 |
x1 | 2.4905 | 0.319 | 7.798 | 0.000 | 1.813 | 3.168 |
x1:C(group)[T.1] | -6.1842 | 0.524 | -11.792 | 0.000 | -7.296 | -5.072 |
Omnibus: | 1.739 | Durbin-Watson: | 1.433 |
---|---|---|---|
Prob(Omnibus): | 0.419 | Jarque-Bera (JB): | 1.403 |
Skew: | 0.607 | Prob(JB): | 0.496 |
Kurtosis: | 2.541 | Cond. No. | 7.68 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
standard error of estimate:3.32671
KstestResult(statistic=0.40384330620002296, pvalue=0.0018511879743272586)
The non-significance of the coefficient of group indicates no observed difference between the intercepts of the two groups, that is, the mean value of y when x1 = 0.
Example: Researchers are interested in comparing the effectiveness of three treatments (A, B, C) for severe depression. They collect data on a random sample of \(n = 36\) severely depressed individuals, including the following variables:
y = measure of the effectiveness of the treatment for each individual
age = age (in years) of each individual
x2 = 1 if an individual received treatment A and 0 if not
x3 = 1 if an individual received treatment B and 0 if not
The regression model:
\begin{align} \text{y} = \beta_0 + \beta_1 \times \text{age} + \beta_2 \times \text{x2} + \beta_3 \times \text{x3} + \beta_{12} \times \text{age} \times \text{x2} + \beta_{13} \times \text{age} \times \text{x3} \end{align}
[46]:
df = pd.read_csv(dataurl+'depression.txt', delim_whitespace=True, header=0)
sns.lmplot(x='age', y='y', data=df, fit_reg=True, hue='TRT', legend=True);
[47]:
result = analysis(df, 'y', ['age', 'age*x2', 'age*x3'], printlvl=4)
sm.stats.anova_lm(result, typ=1)
Dep. Variable: | y | R-squared: | 0.914 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.900 |
Method: | Least Squares | F-statistic: | 64.04 |
Date: | Wed, 14 Aug 2019 | Prob (F-statistic): | 4.26e-15 |
Time: | 08:55:16 | Log-Likelihood: | -97.024 |
No. Observations: | 36 | AIC: | 206.0 |
Df Residuals: | 30 | BIC: | 215.5 |
Df Model: | 5 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 6.2114 | 3.350 | 1.854 | 0.074 | -0.630 | 13.052 |
age | 1.0334 | 0.072 | 14.288 | 0.000 | 0.886 | 1.181 |
x2 | 41.3042 | 5.085 | 8.124 | 0.000 | 30.920 | 51.688 |
age:x2 | -0.7029 | 0.109 | -6.451 | 0.000 | -0.925 | -0.480 |
x3 | 22.7068 | 5.091 | 4.460 | 0.000 | 12.310 | 33.104 |
age:x3 | -0.5097 | 0.110 | -4.617 | 0.000 | -0.735 | -0.284 |
Omnibus: | 2.593 | Durbin-Watson: | 1.633 |
---|---|---|---|
Prob(Omnibus): | 0.273 | Jarque-Bera (JB): | 1.475 |
Skew: | -0.183 | Prob(JB): | 0.478 |
Kurtosis: | 2.079 | Cond. No. | 529. |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
standard error of estimate:3.92491
KstestResult(statistic=0.3553429742849312, pvalue=0.00014233708080337406)
[47]:
df | sum_sq | mean_sq | F | PR(>F) | |
---|---|---|---|---|---|
age | 1.0 | 3424.431786 | 3424.431786 | 222.294615 | 2.058902e-15 |
x2 | 1.0 | 803.804498 | 803.804498 | 52.178412 | 4.856763e-08 |
age:x2 | 1.0 | 375.254832 | 375.254832 | 24.359407 | 2.794727e-05 |
x3 | 1.0 | 0.936671 | 0.936671 | 0.060803 | 8.069102e-01 |
age:x3 | 1.0 | 328.424475 | 328.424475 | 21.319447 | 6.849515e-05 |
Residual | 30.0 | 462.147738 | 15.404925 | NaN | NaN |
For each age, is there a difference in the mean effectiveness for the three treatments?
[48]:
hide_answer()
[48]:
\begin{align*} \text{Treatment A:} & \quad \mu_Y = (\beta_0 + \beta_2) + (\beta_1 + \beta_{12}) \times \text{age} \\ \text{Treatment B:} & \quad \mu_Y = (\beta_0 + \beta_3) + (\beta_1 + \beta_{13}) \times \text{age} \\ \text{Treatment C:} & \quad \mu_Y = \beta_0 + \beta_1 \times \text{age} \end{align*}
The research question is equivalent to ask \begin{align*} \text{Treatment A} - \text{Treatment C} &= 0 = \beta_2 + \beta_{12} \times \text{age} \\ \text{Treatment B} - \text{Treatment C} &= 0 = \beta_3 + \beta_{13} \times \text{age} \end{align*}
Thus, we test hypothesis \begin{align} H_0:& \quad \beta_2 = \beta_3 = \beta_{12} = \beta_{13} = 0 \nonumber\\ H_a:& \quad \text{at least one is nonzero} \nonumber \end{align}
The python code is:
_ =GL_Ftest(df, 'y', ['age', 'x2', 'x3', 'age:x2', 'age:x3'], ['age'], 0.05, typ=1)
\(\blacksquare\)
Does the effect of age on the treatment’s effectiveness depend on treatment?
[49]:
hide_answer()
[49]:
The question asks if the two interaction parameters \(\beta_{12}\) and \(\beta_{13}\) are significant.
The python code is:
_ =GL_Ftest(df, 'y', ['age', 'x2', 'x3', 'age:x2', 'age:x3'], ['age', 'x2', 'x3'], 0.05, typ=1)
\(\blacksquare\)
Interactions Between Continuous Variables
Typically, regression models that include interactions between quantitative predictors adhere to the hierarchy principle, which says that if your model includes an interaction term, \(X_1 X_2\), and \(X_1 X_2\) is shown to be a statistically significant predictor of Y, then your model should also include the “main effects,” \(X_1\) and \(X_2\), whether or not the coefficients for these main effects are significant. Depending on the subject area, there may be circumstances where a main effect could be excluded, but this tends to be the exception.
Transformations¶
Log-transform
Transformations are used widely in regression analysis because they have meaningful interpretations.
\(Y = a \log(X) + \cdots\): for a 1% increase in \(X\), \(Y\) is expected to increase by about \(0.01 a\).
\(\log(Y) = a X + \cdots\): for a one-unit increase in \(X\), \(Y\) is expected to increase by about \((100 a)\%\).
\(\log(Y) = a \log(X) + \cdots\): for a 1% increase in \(X\), \(Y\) is expected to increase by about \(a\%\).
Transforming the \(X\) values is appropriate when non-linearity is the only problem (i.e., the independence, normality, and equal variance conditions are met)
Transforming the \(Y\) values should be considered when non-normality and/or unequal variances are the problems with the model.
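A small sketch showing that log transforms can be applied directly inside a statsmodels formula (synthetic data generated from a power law, so the log-log fit is linear and the slope estimates the elasticity):
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic power-law data: y = 2 * x^1.5 with multiplicative noise
rng = np.random.default_rng(6)
x = np.linspace(1, 10, 50)
y = 2.0 * x ** 1.5 * np.exp(rng.normal(scale=0.1, size=x.size))
df_log = pd.DataFrame({"x": x, "y": y})

# Patsy evaluates np.log inside the formula, so no extra columns are needed
res = ols("np.log(y) ~ np.log(x)", data=df_log).fit()
print(res.params)   # slope near 1.5: a 1% increase in x goes with about a 1.5% increase in y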
[50]:
df = pd.read_csv(dataurl+'shortleaf.txt', delim_whitespace=True, header=0)
result = analysis(df, 'Vol', ['Diam'], printlvl=4)
Dep. Variable: | Vol | R-squared: | 0.893 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.891 |
Method: | Least Squares | F-statistic: | 564.9 |
Date: | Wed, 14 Aug 2019 | Prob (F-statistic): | 1.17e-34 |
Time: | 08:55:18 | Log-Likelihood: | -258.61 |
No. Observations: | 70 | AIC: | 521.2 |
Df Residuals: | 68 | BIC: | 525.7 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -41.5681 | 3.427 | -12.130 | 0.000 | -48.406 | -34.730 |
Diam | 6.8367 | 0.288 | 23.767 | 0.000 | 6.263 | 7.411 |
Omnibus: | 29.509 | Durbin-Watson: | 0.562 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 85.141 |
Skew: | 1.240 | Prob(JB): | 3.25e-19 |
Kurtosis: | 7.800 | Cond. No. | 34.8 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
standard error of estimate:9.87485
KstestResult(statistic=0.4372892531717568, pvalue=1.0701622883725878e-12)
The Q-Q plot suggests that the error terms are not normal. We first try log-transforming \(X\) (Diam).
[51]:
df = pd.read_csv(dataurl+'shortleaf.txt', delim_whitespace=True, header=0)
df['logvol'] = np.log(df.Vol)
df['logdiam'] = np.log(df.Diam)
result = analysis(df, 'Vol', ['logdiam'], printlvl=4)
Dep. Variable: | Vol | R-squared: | 0.746 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.743 |
Method: | Least Squares | F-statistic: | 200.2 |
Date: | Wed, 14 Aug 2019 | Prob (F-statistic): | 6.10e-22 |
Time: | 08:55:20 | Log-Likelihood: | -288.66 |
No. Observations: | 70 | AIC: | 581.3 |
Df Residuals: | 68 | BIC: | 585.8 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -116.1618 | 10.830 | -10.726 | 0.000 | -137.773 | -94.551 |
logdiam | 64.5358 | 4.562 | 14.148 | 0.000 | 55.433 | 73.638 |
Omnibus: | 50.047 | Durbin-Watson: | 0.324 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 219.500 |
Skew: | 2.099 | Prob(JB): | 2.17e-48 |
Kurtosis: | 10.591 | Cond. No. | 16.6 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
standard error of estimate:15.17022
KstestResult(statistic=0.5855934039207538, pvalue=2.338971194588551e-23)
Transforming only the \(X\) values does not remove the non-linearity, nor does it improve the normality of the error terms. We therefore log-transform both \(X\) and \(Y\).
[52]:
result = analysis(df, 'logvol', ['logdiam'], printlvl=4)
Dep. Variable: | logvol | R-squared: | 0.974 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.973 |
Method: | Least Squares | F-statistic: | 2509. |
Date: | Wed, 14 Aug 2019 | Prob (F-statistic): | 2.08e-55 |
Time: | 08:55:21 | Log-Likelihood: | 25.618 |
No. Observations: | 70 | AIC: | -47.24 |
Df Residuals: | 68 | BIC: | -42.74 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -2.8718 | 0.122 | -23.626 | 0.000 | -3.114 | -2.629 |
logdiam | 2.5644 | 0.051 | 50.090 | 0.000 | 2.462 | 2.667 |
Omnibus: | 1.405 | Durbin-Watson: | 1.359 |
---|---|---|---|
Prob(Omnibus): | 0.495 | Jarque-Bera (JB): | 1.119 |
Skew: | -0.051 | Prob(JB): | 0.572 |
Kurtosis: | 2.389 | Cond. No. | 16.6 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
standard error of estimate:0.17026
KstestResult(statistic=0.3711075900985334, pvalue=3.691529639884377e-09)
The residual and Q-Q plots have improved substantially. The fitted log-log coefficients can be interpreted and back-transformed to the original scale as sketched below.
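A minimal interpretation sketch, refitting the log-log model explicitly so its coefficients and predictions are easy to reuse (the diameter value of 10 is arbitrary and purely illustrative; because the errors are modeled on the log scale, the back-transformed prediction estimates the median of Vol rather than its mean):
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
# Refit the log-log model on the columns created above
loglog = ols('logvol ~ logdiam', data=df).fit()
# A 1% increase in Diam multiplies the predicted (median) Vol by about 1.01**2.5644, i.e. roughly +2.56%
print(1.01 ** loglog.params['logdiam'])
# Point prediction on the original Vol scale at Diam = 10 (in the units of Diam)
new = pd.DataFrame({'logdiam': [np.log(10.0)]})
print(np.exp(loglog.predict(new)))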
Polynomial Transformations
[53]:
df = pd.read_csv(dataurl+'costunits.csv', header=0, index_col='month')
result = analysis(df, 'cost', ['units'], printlvl=4)
Dep. Variable: | cost | R-squared: | 0.736 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.728 |
Method: | Least Squares | F-statistic: | 94.75 |
Date: | Wed, 14 Aug 2019 | Prob (F-statistic): | 2.32e-11 |
Time: | 08:55:23 | Log-Likelihood: | -334.94 |
No. Observations: | 36 | AIC: | 673.9 |
Df Residuals: | 34 | BIC: | 677.0 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 2.365e+04 | 1917.137 | 12.337 | 0.000 | 1.98e+04 | 2.75e+04 |
units | 30.5331 | 3.137 | 9.734 | 0.000 | 24.158 | 36.908 |
Omnibus: | 4.527 | Durbin-Watson: | 2.144 |
---|---|---|---|
Prob(Omnibus): | 0.104 | Jarque-Bera (JB): | 1.765 |
Skew: | -0.042 | Prob(JB): | 0.414 |
Kurtosis: | 1.918 | Cond. No. | 2.57e+03 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.57e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
standard error of estimate:2733.74240
KstestResult(statistic=0.5555555555555556, pvalue=5.544489818114514e-11)
[54]:
df = pd.read_csv(dataurl+'costunits.csv', header=0, index_col='month')
# Fit a degree-2 polynomial of cost on units for a quick visual check
fit2 = np.poly1d(np.polyfit(df.units, df.cost, 2))
xp = np.linspace(np.min(df.units), np.max(df.units), 100)
fig = plt.figure(figsize=(8,6))
plt.plot(df.units, df.cost, '.', xp, fit2(xp), '-')
plt.title('cost vs units')
plt.xlabel('units')
plt.ylabel('cost')
plt.show()
toggle()
[54]:
[55]:
df = pd.read_csv(dataurl+'costunits.csv', header=0, index_col='month')
df['units2'] = np.power(df.units,2)  # add the squared term so the quadratic can be fit with OLS
result = ols(formula='cost ~ units + units2', data=df).fit()
display(result.summary())
fig, axes = plt.subplots(1,3,figsize=(18,6))
sns.residplot(result.fittedvalues, result.resid , lowess=False, scatter_kws={"s": 80},
line_kws={'color':'r', 'lw':1}, ax=axes[0])
axes[0].set_title('Residual plot')
axes[0].set_xlabel('Fitted values')
axes[0].set_ylabel('Residuals')
axes[1].relim()
stats.probplot(result.resid, dist='norm', plot=axes[1])
axes[1].set_title('Normal Q-Q plot')
axes[2].relim()
sns.distplot(result.resid, ax=axes[2]);
plt.show()
toggle()
Dep. Variable: | cost | R-squared: | 0.822 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.811 |
Method: | Least Squares | F-statistic: | 75.98 |
Date: | Wed, 14 Aug 2019 | Prob (F-statistic): | 4.45e-13 |
Time: | 08:55:25 | Log-Likelihood: | -327.88 |
No. Observations: | 36 | AIC: | 661.8 |
Df Residuals: | 33 | BIC: | 666.5 |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 5792.7983 | 4763.058 | 1.216 | 0.233 | -3897.717 | 1.55e+04 |
units | 98.3504 | 17.237 | 5.706 | 0.000 | 63.282 | 133.419 |
units2 | -0.0600 | 0.015 | -3.981 | 0.000 | -0.091 | -0.029 |
Omnibus: | 2.245 | Durbin-Watson: | 2.123 |
---|---|---|---|
Prob(Omnibus): | 0.326 | Jarque-Bera (JB): | 2.021 |
Skew: | -0.553 | Prob(JB): | 0.364 |
Kurtosis: | 2.651 | Cond. No. | 5.12e+06 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.12e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
[55]:
[56]:
df = pd.read_csv(dataurl+'carstopping.txt', delim_whitespace=True, header=0)
df['sqrtdist'] = np.sqrt(df.StopDist)  # square-root transform of the response to stabilize its variance
result = ols(formula='sqrtdist ~ Speed', data=df).fit()
fig, axes = plt.subplots(1,3,figsize=(18,5))
sns.residplot(result.fittedvalues, result.resid , lowess=False, scatter_kws={"s": 80},
line_kws={'color':'r', 'lw':1}, ax=axes[0])
axes[0].set_title('Residual plot')
axes[0].set_xlabel('Fitted values')
axes[0].set_ylabel('Residuals')
axes[1].relim()
stats.probplot(result.resid, dist='norm', plot=axes[1])
axes[1].set_title('Normal Q-Q plot')
axes[2].relim()
sns.distplot(result.resid, ax=axes[2]);
plt.show()
toggle()
[56]:
Heteroscedasticity¶
Excessive nonconstant variance can create technical difficulties with a multiple linear regression model. For example, if the residual variance increases with the fitted values, then prediction intervals will tend to be wider than they should be at low fitted values and narrower than they should be at high fitted values. Some remedies for refining a model exhibiting excessive nonconstant variance include the following:
Apply a variance-stabilizing transformation to the response variable, for example a logarithmic transformation (or a square root transformation if a logarithmic transformation is “too strong” or a reciprocal transformation if a logarithmic transformation is “too weak”).
Weight the variances so that they can be different for each set of predictor values. This leads to weighted least squares, in which the data observations are given different weights when estimating the model – see below.
A generalization of weighted least squares is to allow the regression errors to be correlated with one another in addition to having different variances. This leads to generalized least squares, in which various forms of nonconstant variance can be modeled (a minimal sketch appears after this list).
For some applications we can explicitly model the variance as a function of the mean, E(Y). This approach uses the framework of generalized linear models.
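For the generalized least squares option, statsmodels provides sm.GLS, which accepts an error covariance matrix sigma. A minimal sketch (the data and the AR(1)-style covariance below are simulated for illustration and are not estimated from any data set in this notebook):
import numpy as np
import statsmodels.api as sm
# Simulated predictor and response
rng = np.random.default_rng(1)
n = 50
x = np.linspace(0, 10, n)
X = sm.add_constant(x)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)
# Assumed error covariance: AR(1)-style correlation with rho = 0.5 (illustrative only)
rho = 0.5
sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
result_gls = sm.GLS(y, X, sigma=sigma).fit()
print(result_gls.summary())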
[57]:
df = pd.read_csv(dataurl+'ca_learning.txt', sep=r'\s+', header=0)
result = analysis(df, 'cost', ['responses'], printlvl=4)
Dep. Variable: | cost | R-squared: | 0.889 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.878 |
Method: | Least Squares | F-statistic: | 80.19 |
Date: | Wed, 14 Aug 2019 | Prob (F-statistic): | 4.33e-06 |
Time: | 08:55:27 | Log-Likelihood: | -34.242 |
No. Observations: | 12 | AIC: | 72.48 |
Df Residuals: | 10 | BIC: | 73.45 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 19.4727 | 5.516 | 3.530 | 0.005 | 7.182 | 31.764 |
responses | 3.2689 | 0.365 | 8.955 | 0.000 | 2.456 | 4.082 |
Omnibus: | 2.739 | Durbin-Watson: | 1.783 |
---|---|---|---|
Prob(Omnibus): | 0.254 | Jarque-Bera (JB): | 1.001 |
Skew: | 0.038 | Prob(JB): | 0.606 |
Kurtosis: | 1.587 | Cond. No. | 63.1 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
standard error of estimate:4.59830
KstestResult(statistic=0.4553950067071834, pvalue=0.008567457152187175)
A plot of the residuals versus the predictor values indicates possible nonconstant variance since there is a very slight “megaphone” pattern. We therefore fit a simple linear regression model of the absolute residuals on the predictor and calculate weights as 1 over the squared fitted values from this model.
[58]:
result_ols = ols(formula='cost ~ responses', data=df).fit()
df['resid'] = abs(result_ols.resid)                       # absolute OLS residuals
result = ols(formula='resid ~ responses', data=df).fit()  # model the spread as a function of the predictor
X = sm.add_constant(df.responses)
w = 1./(result.fittedvalues ** 2)                         # weights = 1 / (fitted absolute residual)^2
result_wls = sm.WLS(df.cost, X, weights=w).fit()
display(result_wls.summary())
fig, axes = plt.subplots(1,3,figsize=(18,5))
sns.residplot(result_wls.fittedvalues, result_wls.wresid, lowess=False, scatter_kws={"s": 80},
line_kws={'color':'r', 'lw':1}, ax=axes[0])
axes[0].set_title('Residual plot')
axes[0].set_xlabel('Fitted values')
axes[0].set_ylabel('Residuals')
axes[1].relim()
stats.probplot(result_wls.resid, dist='norm', plot=axes[1])
axes[1].set_title('Normal Q-Q plot')
axes[2].relim()
sns.distplot(result_wls.resid, ax=axes[2]);
plt.show()
x = np.linspace(np.min(df.responses), np.max(df.responses), 20)
y_wls = result_wls.params[0] + result_wls.params[1] * x
y_ols = result_ols.params[0] + result_ols.params[1] * x
plt.plot(df.responses, df.cost, 'o', label='data')
plt.plot(x, y_ols, '-', label='ols method')
plt.plot(x, y_wls, '-', label='wls method')
plt.legend()
plt.show()
toggle()
Dep. Variable: | cost | R-squared: | 0.895 |
---|---|---|---|
Model: | WLS | Adj. R-squared: | 0.885 |
Method: | Least Squares | F-statistic: | 85.35 |
Date: | Wed, 14 Aug 2019 | Prob (F-statistic): | 3.27e-06 |
Time: | 08:55:28 | Log-Likelihood: | -33.248 |
No. Observations: | 12 | AIC: | 70.50 |
Df Residuals: | 10 | BIC: | 71.47 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 17.3006 | 4.828 | 3.584 | 0.005 | 6.544 | 28.058 |
responses | 3.4211 | 0.370 | 9.238 | 0.000 | 2.596 | 4.246 |
Omnibus: | 4.602 | Durbin-Watson: | 1.706 |
---|---|---|---|
Prob(Omnibus): | 0.100 | Jarque-Bera (JB): | 1.241 |
Skew: | 0.062 | Prob(JB): | 0.538 |
Kurtosis: | 1.429 | Cond. No. | 56.7 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[58]:
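To see what the weighting changed, the two fits from the cell above can be compared side by side (a short sketch reusing result_ols and result_wls; note that the intercept’s standard error is smaller under WLS, consistent with giving more weight to the less noisy observations):
import pandas as pd
comparison = pd.DataFrame(
    {'OLS coef': result_ols.params.values, 'OLS std err': result_ols.bse.values,
     'WLS coef': result_wls.params.values, 'WLS std err': result_wls.bse.values},
    index=['Intercept', 'responses'])
print(comparison.round(4))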