[1]:
%run ../initscript.py
HTML("""
<div id="popup" style="padding-bottom:5px; display:none;">
<div>Enter Password:</div>
<input id="password" type="password"/>
<button onclick="done()" style="border-radius: 12px;">Submit</button>
</div>
<button onclick="unlock()" style="border-radius: 12px;">Unclock</button>
<a href="#" onclick="code_toggle(this); return false;">show code</a>
""")
[1]:
[2]:
%run loadregfuncs.py
from ipywidgets import *
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
%matplotlib inline
toggle()
[2]:
Regression Modeling¶
Categorical Variables¶
The only meaningful way to summarize categorical data is with counts of observations in the categories. We consider a bank’s employee data.
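Category counts are a one-line summary with pandas; a minimal sketch, using the same dataurl and salaries.csv file that is loaded in the next cell:
import pandas as pd
df_salary = pd.read_csv(dataurl + 'salaries.csv', header=0, index_col='employee')
print(df_salary['gender'].value_counts())   # number of employees in each gender category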
[3]:
df_salary = pd.read_csv(dataurl+'salaries.csv', header=0, index_col='employee')
interact(plot_hist,
         column=widgets.Dropdown(options=df_salary.columns[:-1], value=df_salary.columns[0],
                                 description='Column:', disabled=False));
To analyze whether the bank discriminates against female employees in terms of salary, we consider a regression model with salary as the response and years1, years2, and gender as predictors:
[4]:
df = pd.read_csv(dataurl+'salaries.csv', header=0, index_col='employee')
result = analysis(df, 'salary', ['years1', 'years2', 'C(gender, Treatment(\'Male\'))'], printlvl=5)
Dep. Variable: | salary | R-squared: | 0.492 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.485 |
Method: | Least Squares | F-statistic: | 65.93 |
Date: | Mon, 20 May 2019 | Prob (F-statistic): | 7.61e-30 |
Time: | 13:53:58 | Log-Likelihood: | -2164.5 |
No. Observations: | 208 | AIC: | 4337. |
Df Residuals: | 204 | BIC: | 4350. |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 3.549e+04 | 1341.022 | 26.466 | 0.000 | 3.28e+04 | 3.81e+04 |
C(gender, Treatment('Male'))[T.Female] | -8080.2121 | 1198.170 | -6.744 | 0.000 | -1.04e+04 | -5717.827 |
years1 | 987.9937 | 80.928 | 12.208 | 0.000 | 828.431 | 1147.557 |
years2 | 131.3379 | 180.923 | 0.726 | 0.469 | -225.381 | 488.057 |
Omnibus: | 15.654 | Durbin-Watson: | 1.286 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 25.154 |
Skew: | 0.434 | Prob(JB): | 3.45e-06 |
Kurtosis: | 4.465 | Cond. No. | 34.6 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
standard error of estimate:8079.39743
ANOVA Table:
sum_sq | df | F | PR(>F) | |
---|---|---|---|---|
C(gender, Treatment('Male')) | 2.968701e+09 | 1.0 | 45.478753 | 1.556196e-10 |
years1 | 9.728975e+09 | 1.0 | 149.042170 | 4.314965e-26 |
years2 | 3.439940e+07 | 1.0 | 0.526979 | 4.687119e-01 |
Residual | 1.331644e+10 | 204.0 | NaN | NaN |
This regression corresponds to fitting two separate regression functions, one for each gender (here sharing the same slopes on years1 and years2 but with different intercepts). However, it is usually easier and more efficient to answer research questions concerning the binary predictor variable with one combined regression function.
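For comparison, a sketch of the two-separate-fits view, estimating the same model within each gender (the separate fits also allow the slopes to differ):
from statsmodels.formula.api import ols
for g in ['Male', 'Female']:
    sub = df_salary[df_salary['gender'] == g]   # one gender at a time
    print(g, ols('salary ~ years1 + years2', data=sub).fit().params.round(1).to_dict())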
[5]:
def update(column):
    df = pd.read_csv(dataurl+'salaries.csv', header=0, index_col='employee')
    sns.lmplot(x=column, y='salary', data=df, fit_reg=True, hue='gender', legend=True, height=6);

interact(update,
         column=widgets.Dropdown(options=['years1', 'years2', 'age', 'education'], value='years1',
                                 description='Column:', disabled=False));
Interaction Variables¶
Interaction with Categorical Variable¶
Example: consider the salary data with the interaction gender*years1.
[6]:
result = ols(formula="salary ~ years1 + C(gender, Treatment('Male'))*years1", data=df_salary).fit()
display(result.summary())
print("\nANOVA Table:\n")
display(sm.stats.anova_lm(result, typ=1))
Dep. Variable: | salary | R-squared: | 0.639 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.633 |
Method: | Least Squares | F-statistic: | 120.2 |
Date: | Mon, 20 May 2019 | Prob (F-statistic): | 7.51e-45 |
Time: | 13:54:00 | Log-Likelihood: | -2129.2 |
No. Observations: | 208 | AIC: | 4266. |
Df Residuals: | 204 | BIC: | 4280. |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 3.043e+04 | 1216.574 | 25.013 | 0.000 | 2.8e+04 | 3.28e+04 |
C(gender, Treatment('Male'))[T.Female] | 4098.2519 | 1665.842 | 2.460 | 0.015 | 813.776 | 7382.727 |
years1 | 1527.7617 | 90.460 | 16.889 | 0.000 | 1349.405 | 1706.119 |
C(gender, Treatment('Male'))[T.Female]:years1 | -1247.7984 | 136.676 | -9.130 | 0.000 | -1517.277 | -978.320 |
Omnibus: | 11.523 | Durbin-Watson: | 1.188 |
---|---|---|---|
Prob(Omnibus): | 0.003 | Jarque-Bera (JB): | 15.614 |
Skew: | 0.379 | Prob(JB): | 0.000407 |
Kurtosis: | 4.108 | Cond. No. | 58.3 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
ANOVA Table:
df | sum_sq | mean_sq | F | PR(>F) | |
---|---|---|---|---|---|
C(gender, Treatment('Male')) | 1.0 | 3.149634e+09 | 3.149634e+09 | 67.789572 | 2.150454e-14 |
years1 | 1.0 | 9.726635e+09 | 9.726635e+09 | 209.346370 | 4.078705e-33 |
C(gender, Treatment('Male')):years1 | 1.0 | 3.872606e+09 | 3.872606e+09 | 83.350112 | 6.833545e-17 |
Residual | 204.0 | 9.478232e+09 | 4.646192e+07 | NaN | NaN |
Example: Let’s consider some contrived data
[7]:
np.random.seed(123)
size = 20
x1 = np.linspace(-5,5,size)
group = np.random.randint(0, 2, size=size)
y = 5 + (3-7*group)*x1 + 3*np.random.normal(size=size)
df = pd.DataFrame({'y':y, 'x1':x1, 'group':group})
sns.lmplot(x='x1', y='y', data=df, fit_reg=True, hue='group', legend=True);
result = analysis(df, 'y', ['x1', 'group'], printlvl=4)
Dep. Variable: | y | R-squared: | 0.004 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | -0.113 |
Method: | Least Squares | F-statistic: | 0.03659 |
Date: | Mon, 20 May 2019 | Prob (F-statistic): | 0.964 |
Time: | 13:54:00 | Log-Likelihood: | -72.898 |
No. Observations: | 20 | AIC: | 151.8 |
Df Residuals: | 17 | BIC: | 154.8 |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 3.1116 | 3.075 | 1.012 | 0.326 | -3.377 | 9.600 |
x1 | 0.1969 | 0.765 | 0.257 | 0.800 | -1.417 | 1.811 |
group | 0.0733 | 4.667 | 0.016 | 0.988 | -9.773 | 9.920 |
Omnibus: | 0.957 | Durbin-Watson: | 2.131 |
---|---|---|---|
Prob(Omnibus): | 0.620 | Jarque-Bera (JB): | 0.741 |
Skew: | 0.008 | Prob(JB): | 0.690 |
Kurtosis: | 2.057 | Cond. No. | 7.06 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
standard error of estimate:10.04661
The scatterplot shows that both explanatory variables x1 and group are related to y. However, the regression results provide insufficient evidence to conclude that either x1 or group is related to y. This example shows that leaving an important interaction term out of the model can lead to erroneous conclusions.
Without interaction variables, we assume the effect of each \(X\) on \(Y\) is independent of the values of the other \(X\)’s, which is not the case in these contrived data, as the scatterplot shows.
Now, consider a regression model that contains an interaction term between the two predictors.
[8]:
result = analysis(df, 'y', ['x1', 'x1*C(group)'], printlvl=4)
Dep. Variable: | y | R-squared: | 0.897 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.878 |
Method: | Least Squares | F-statistic: | 46.57 |
Date: | Mon, 20 May 2019 | Prob (F-statistic): | 3.95e-08 |
Time: | 13:54:01 | Log-Likelihood: | -50.187 |
No. Observations: | 20 | AIC: | 108.4 |
Df Residuals: | 16 | BIC: | 112.4 |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 4.7029 | 1.027 | 4.578 | 0.000 | 2.525 | 6.881 |
C(group)[T.1] | 1.7810 | 1.552 | 1.147 | 0.268 | -1.509 | 5.071 |
x1 | 2.4905 | 0.319 | 7.798 | 0.000 | 1.813 | 3.168 |
x1:C(group)[T.1] | -6.1842 | 0.524 | -11.792 | 0.000 | -7.296 | -5.072 |
Omnibus: | 1.739 | Durbin-Watson: | 1.433 |
---|---|---|---|
Prob(Omnibus): | 0.419 | Jarque-Bera (JB): | 1.403 |
Skew: | 0.607 | Prob(JB): | 0.496 |
Kurtosis: | 2.541 | Cond. No. | 7.68 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
standard error of estimate:3.32671
The non-significant coefficient for group indicates no observed difference in the mean value of y between the two groups when x1 = 0; that is, the fitted lines differ in slope but not meaningfully in intercept.
Example: Researchers are interested in comparing the effectiveness of three treatments (A, B, C) for severe depression. They collect data on a random sample of \(n = 36\) severely depressed individuals, including the following variables:
y = a measure of the effectiveness of the treatment for each individual
age = the age (in years) of each individual
x2 = 1 if an individual received treatment A and 0 if not
x3 = 1 if an individual received treatment B and 0 if not
The regression model:
\begin{align} \text{y} = \beta_0 + \beta_1 \times \text{age} + \beta_2 \times \text{x2} + \beta_3 \times \text{x3} + \beta_{12} \times \text{age} \times \text{x2} + \beta_{13} \times \text{age} \times \text{x3} \end{align}
[9]:
df = pd.read_csv(dataurl+'depression.txt', delim_whitespace=True, header=0)
sns.lmplot(x='age', y='y', data=df, fit_reg=True, hue='TRT', legend=True);
[10]:
result = analysis(df, 'y', ['age', 'age*x2', 'age*x3'], printlvl=4)
sm.stats.anova_lm(result, typ=1)
Dep. Variable: | y | R-squared: | 0.914 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.900 |
Method: | Least Squares | F-statistic: | 64.04 |
Date: | Mon, 20 May 2019 | Prob (F-statistic): | 4.26e-15 |
Time: | 13:54:03 | Log-Likelihood: | -97.024 |
No. Observations: | 36 | AIC: | 206.0 |
Df Residuals: | 30 | BIC: | 215.5 |
Df Model: | 5 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 6.2114 | 3.350 | 1.854 | 0.074 | -0.630 | 13.052 |
age | 1.0334 | 0.072 | 14.288 | 0.000 | 0.886 | 1.181 |
x2 | 41.3042 | 5.085 | 8.124 | 0.000 | 30.920 | 51.688 |
age:x2 | -0.7029 | 0.109 | -6.451 | 0.000 | -0.925 | -0.480 |
x3 | 22.7068 | 5.091 | 4.460 | 0.000 | 12.310 | 33.104 |
age:x3 | -0.5097 | 0.110 | -4.617 | 0.000 | -0.735 | -0.284 |
Omnibus: | 2.593 | Durbin-Watson: | 1.633 |
---|---|---|---|
Prob(Omnibus): | 0.273 | Jarque-Bera (JB): | 1.475 |
Skew: | -0.183 | Prob(JB): | 0.478 |
Kurtosis: | 2.079 | Cond. No. | 529. |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
standard error of estimate:3.92491
[10]:
df | sum_sq | mean_sq | F | PR(>F) | |
---|---|---|---|---|---|
age | 1.0 | 3424.431786 | 3424.431786 | 222.294615 | 2.058902e-15 |
x2 | 1.0 | 803.804498 | 803.804498 | 52.178412 | 4.856763e-08 |
age:x2 | 1.0 | 375.254832 | 375.254832 | 24.359407 | 2.794727e-05 |
x3 | 1.0 | 0.936671 | 0.936671 | 0.060803 | 8.069102e-01 |
age:x3 | 1.0 | 328.424475 | 328.424475 | 21.319447 | 6.849515e-05 |
Residual | 30.0 | 462.147738 | 15.404925 | NaN | NaN |
For each age, is there a difference in the mean effectiveness for the three treatments?
[11]:
hide_answer()
[11]:
\begin{align*} \text{Treatment A:} & \quad \mu_Y = (\beta_0 + \beta_2) + (\beta_1 + \beta_{12}) \times \text{age} \\ \text{Treatment B:} & \quad \mu_Y = (\beta_0 + \beta_3) + (\beta_1 + \beta_{13}) \times \text{age} \\ \text{Treatment C:} & \quad \mu_Y = \beta_0 + \beta_1 \times \text{age} \end{align*}
The research question is equivalent to asking whether \begin{align*} \text{Treatment A} - \text{Treatment C} &= \beta_2 + \beta_{12} \times \text{age} = 0 \\ \text{Treatment B} - \text{Treatment C} &= \beta_3 + \beta_{13} \times \text{age} = 0 \end{align*}
Thus, we test hypothesis \begin{align} H_0:& \quad \beta_2 = \beta_3 = \beta_{12} = \beta_{13} = 0 \nonumber\\ H_a:& \quad \text{at least one is nonzero} \nonumber \end{align}
The Python code is:
_ =GL_Ftest(df, 'y', ['age', 'x2', 'x3', 'age:x2', 'age:x3'], ['age'], 0.05, typ=1)
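GL_Ftest is a helper loaded by loadregfuncs.py; as a sketch, the same reduced-versus-full F-test can be run directly with statsmodels (variable names as in the depression data above):
from statsmodels.formula.api import ols
import statsmodels.api as sm
full = ols('y ~ age + x2 + x3 + age:x2 + age:x3', data=df).fit()
reduced = ols('y ~ age', data=df).fit()
print(sm.stats.anova_lm(reduced, full))  # partial F-test of H0: beta2 = beta3 = beta12 = beta13 = 0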
Does the effect of age on the treatment’s effectiveness depend on treatment?
[12]:
hide_answer()
[12]:
The question asks whether the two interaction parameters \(\beta_{12}\) and \(\beta_{13}\) are jointly significant.
The Python code is:
_ =GL_Ftest(df, 'y', ['age', 'x2', 'x3', 'age:x2', 'age:x3'], ['age', 'x2', 'x3'], 0.05, typ=1)
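Again as a statsmodels sketch, this compares the full model against one without the two interaction terms:
from statsmodels.formula.api import ols
import statsmodels.api as sm
full = ols('y ~ age + x2 + x3 + age:x2 + age:x3', data=df).fit()
reduced = ols('y ~ age + x2 + x3', data=df).fit()
print(sm.stats.anova_lm(reduced, full))  # partial F-test of H0: beta12 = beta13 = 0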
Interactions Between Continuous Variables¶
Typically, regression models that include interactions between quantitative predictors adhere to the hierarchy principle, which says that if your model includes an interaction term, \(X_1 X_2\), and \(X_1 X_2\) is shown to be a statistically significant predictor of Y, then your model should also include the “main effects,” \(X_1\) and \(X_2\), whether or not the coefficients for these main effects are significant. Depending on the subject area, there may be circumstances where a main effect could be excluded, but this tends to be the exception.
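As a small illustration with synthetic data (all names below are our own), the formula syntax x1*x2 already respects the hierarchy principle by expanding to x1 + x2 + x1:x2:
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
np.random.seed(0)
x1 = np.random.normal(size=200)
x2 = np.random.normal(size=200)
y = 1 + 2*x1 - x2 + 1.5*x1*x2 + np.random.normal(scale=0.5, size=200)
toy = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2})
print(ols('y ~ x1*x2', data=toy).fit().params)  # Intercept, x1, x2, and x1:x2 are all estimated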
Transformations¶
Log-transform¶
Transformations are used widely in regression analysis because they often have a meaningful interpretation (a quick numerical check of the last rule follows below):
\(Y = a \log(X) + \cdots\): a 1% increase in X is associated with an expected change of about 0.01*a units in Y
\(\log(Y) = a X + \cdots\): a one-unit increase in X is associated with an expected change of about (100*a)% in Y
\(\log(Y) = a \log(X) + \cdots\): a 1% increase in X is associated with an expected change of about a% in Y
Transforming the \(X\) values is appropriate when non-linearity is the only problem (i.e., the independence, normality, and equal variance conditions are met).
Transforming the \(Y\) values should be considered when non-normality and/or unequal variances are the problems with the model.
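A quick numerical check of the log-log rule (the coefficient a below is illustrative, not taken from any fit in this notebook):
a = 2.5                           # illustrative coefficient in log(Y) = a*log(X) + ...
X0 = 10.0
ratio = (1.01 * X0 / X0) ** a     # multiplicative change in Y for a 1% increase in X
print((ratio - 1) * 100)          # about 2.52%, close to a% as claimed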
[13]:
df = pd.read_csv(dataurl+'shortleaf.txt', delim_whitespace=True, header=0)
result = analysis(df, 'Vol', ['Diam'], printlvl=4)
Dep. Variable: | Vol | R-squared: | 0.893 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.891 |
Method: | Least Squares | F-statistic: | 564.9 |
Date: | Mon, 20 May 2019 | Prob (F-statistic): | 1.17e-34 |
Time: | 13:54:11 | Log-Likelihood: | -258.61 |
No. Observations: | 70 | AIC: | 521.2 |
Df Residuals: | 68 | BIC: | 525.7 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -41.5681 | 3.427 | -12.130 | 0.000 | -48.406 | -34.730 |
Diam | 6.8367 | 0.288 | 23.767 | 0.000 | 6.263 | 7.411 |
Omnibus: | 29.509 | Durbin-Watson: | 0.562 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 85.141 |
Skew: | 1.240 | Prob(JB): | 3.25e-19 |
Kurtosis: | 7.800 | Cond. No. | 34.8 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
standard error of estimate:9.87485
The Q-Q plot suggests that the error terms are not normal. We can log-transform \(X\) (Diam).
[14]:
df = pd.read_csv(dataurl+'shortleaf.txt', delim_whitespace=True, header=0)
df['logvol'] = np.log(df.Vol)
df['logdiam'] = np.log(df.Diam)
result = analysis(df, 'Vol', ['logdiam'], printlvl=4)
Dep. Variable: | Vol | R-squared: | 0.746 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.743 |
Method: | Least Squares | F-statistic: | 200.2 |
Date: | Mon, 20 May 2019 | Prob (F-statistic): | 6.10e-22 |
Time: | 13:54:14 | Log-Likelihood: | -288.66 |
No. Observations: | 70 | AIC: | 581.3 |
Df Residuals: | 68 | BIC: | 585.8 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -116.1618 | 10.830 | -10.726 | 0.000 | -137.773 | -94.551 |
logdiam | 64.5358 | 4.562 | 14.148 | 0.000 | 55.433 | 73.638 |
Omnibus: | 50.047 | Durbin-Watson: | 0.324 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 219.500 |
Skew: | 2.099 | Prob(JB): | 2.17e-48 |
Kurtosis: | 10.591 | Cond. No. | 16.6 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
standard error of estimate:15.17022
Transforming only the \(X\) values doesn’t remove the non-linearity, nor does it improve the normality of the error terms. We therefore log-transform both \(X\) and \(Y\).
[15]:
result = analysis(df, 'logvol', ['logdiam'], printlvl=4)
Dep. Variable: | logvol | R-squared: | 0.974 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.973 |
Method: | Least Squares | F-statistic: | 2509. |
Date: | Mon, 20 May 2019 | Prob (F-statistic): | 2.08e-55 |
Time: | 13:54:14 | Log-Likelihood: | 25.618 |
No. Observations: | 70 | AIC: | -47.24 |
Df Residuals: | 68 | BIC: | -42.74 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -2.8718 | 0.122 | -23.626 | 0.000 | -3.114 | -2.629 |
logdiam | 2.5644 | 0.051 | 50.090 | 0.000 | 2.462 | 2.667 |
Omnibus: | 1.405 | Durbin-Watson: | 1.359 |
---|---|---|---|
Prob(Omnibus): | 0.495 | Jarque-Bera (JB): | 1.119 |
Skew: | -0.051 | Prob(JB): | 0.572 |
Kurtosis: | 2.389 | Cond. No. | 16.6 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
standard error of estimate:0.17026
The residual and Q-Q plots have improved substantially.
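Because the final model is on the log scale, predicting Vol requires exponentiating the fitted log-volume. A sketch, assuming (as the earlier anova_lm call on its return value suggests) that analysis() returns the fitted statsmodels results object:
import numpy as np
new_diam = 10.0                                       # an illustrative diameter value
log_vol_hat = result.params['Intercept'] + result.params['logdiam'] * np.log(new_diam)
print(np.exp(log_vol_hat))                            # predicted Vol on the original scale (ignores retransformation bias)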
Polynomial Transformations¶
[16]:
df = pd.read_csv(dataurl+'costunits.csv', header=0, index_col='month')
result = analysis(df, 'cost', ['units'], printlvl=4)
Dep. Variable: | cost | R-squared: | 0.736 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.728 |
Method: | Least Squares | F-statistic: | 94.75 |
Date: | Mon, 20 May 2019 | Prob (F-statistic): | 2.32e-11 |
Time: | 13:54:15 | Log-Likelihood: | -334.94 |
No. Observations: | 36 | AIC: | 673.9 |
Df Residuals: | 34 | BIC: | 677.0 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 2.365e+04 | 1917.137 | 12.337 | 0.000 | 1.98e+04 | 2.75e+04 |
units | 30.5331 | 3.137 | 9.734 | 0.000 | 24.158 | 36.908 |
Omnibus: | 4.527 | Durbin-Watson: | 2.144 |
---|---|---|---|
Prob(Omnibus): | 0.104 | Jarque-Bera (JB): | 1.765 |
Skew: | -0.042 | Prob(JB): | 0.414 |
Kurtosis: | 1.918 | Cond. No. | 2.57e+03 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.57e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
standard error of estimate:2733.74240
[17]:
df = pd.read_csv(dataurl+'costunits.csv', header=0, index_col='month')
fit2 = np.poly1d(np.polyfit(df.units, df.cost, 2))
xp = np.linspace(np.min(df.units), np.max(df.units), 100)
fig = plt.figure(figsize=(8,6))
plt.plot(df.units, df.cost, '.', xp, fit2(xp), '-')
plt.title('cost vs units')
plt.xlabel('units')
plt.ylabel('cost')
plt.show()
toggle()
[17]:
[18]:
df = pd.read_csv(dataurl+'costunits.csv', header=0, index_col='month')
df['units2'] = np.power(df.units,2)
result = ols(formula='cost ~ units + units2', data=df).fit()
display(result.summary())
fig, axes = plt.subplots(1,3,figsize=(18,6))
sns.residplot(result.fittedvalues, result.resid, lowess=False, scatter_kws={"s": 80},
              line_kws={'color':'r', 'lw':1}, ax=axes[0])
axes[0].set_title('Residual plot')
axes[0].set_xlabel('Fitted values')
axes[0].set_ylabel('Residuals')
axes[1].relim()
stats.probplot(result.resid, dist='norm', plot=axes[1])
axes[1].set_title('Normal Q-Q plot')
axes[2].relim()
sns.distplot(result.resid, ax=axes[2]);
plt.show()
toggle()
Dep. Variable: | cost | R-squared: | 0.822 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.811 |
Method: | Least Squares | F-statistic: | 75.98 |
Date: | Mon, 20 May 2019 | Prob (F-statistic): | 4.45e-13 |
Time: | 13:54:18 | Log-Likelihood: | -327.88 |
No. Observations: | 36 | AIC: | 661.8 |
Df Residuals: | 33 | BIC: | 666.5 |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 5792.7983 | 4763.058 | 1.216 | 0.233 | -3897.717 | 1.55e+04 |
units | 98.3504 | 17.237 | 5.706 | 0.000 | 63.282 | 133.419 |
units2 | -0.0600 | 0.015 | -3.981 | 0.000 | -0.091 | -0.029 |
Omnibus: | 2.245 | Durbin-Watson: | 2.123 |
---|---|---|---|
Prob(Omnibus): | 0.326 | Jarque-Bera (JB): | 2.021 |
Skew: | -0.553 | Prob(JB): | 0.364 |
Kurtosis: | 2.651 | Cond. No. | 5.12e+06 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.12e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
[18]:
[19]:
df = pd.read_csv(dataurl+'carstopping.txt', delim_whitespace=True, header=0)
df['sqrtdist'] = np.sqrt(df.StopDist)
result = ols(formula='sqrtdist ~ Speed', data=df).fit()
fig, axes = plt.subplots(1,3,figsize=(18,5))
sns.residplot(result.fittedvalues, result.resid, lowess=False, scatter_kws={"s": 80},
              line_kws={'color':'r', 'lw':1}, ax=axes[0])
axes[0].set_title('Residual plot')
axes[0].set_xlabel('Fitted values')
axes[0].set_ylabel('Residuals')
axes[1].relim()
stats.probplot(result.resid, dist='norm', plot=axes[1])
axes[1].set_title('Normal Q-Q plot')
axes[2].relim()
sns.distplot(result.resid, ax=axes[2]);
plt.show()
toggle()
[19]:
Heteroscedasticity¶
Excessive nonconstant variance can create technical difficulties with a multiple linear regression model. For example, if the residual variance increases with the fitted values, then prediction intervals will tend to be wider than they should be at low fitted values and narrower than they should be at high fitted values. Some remedies for refining a model exhibiting excessive nonconstant variance include the following:
Apply a variance-stabilizing transformation to the response variable, for example a logarithmic transformation (or a square root transformation if a logarithmic transformation is “too strong” or a reciprocal transformation if a logarithmic transformation is “too weak”).
Weight the variances so that they can be different for each set of predictor values. This leads to weighted least squares, in which the data observations are given different weights when estimating the model – see below.
A generalization of weighted least squares is to allow the regression errors to be correlated with one another in addition to having different variances. This leads to generalized least squares, in which various forms of nonconstant variance can be modeled (a minimal sketch follows this list).
For some applications we can explicitly model the variance as a function of the mean, E(Y). This approach uses the framework of generalized linear models.
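As a minimal sketch of the generalized least squares option (synthetic data; the variance function below is an assumption for illustration only):
import numpy as np
import statsmodels.api as sm
np.random.seed(1)
n = 50
x = np.linspace(0, 10, n)
X = sm.add_constant(x)
y = 2 + 3*x + np.random.normal(scale=1 + 0.3*x, size=n)   # error spread grows with x
sd = 1 + 0.3*x                                            # assumed error standard deviations
sigma = np.diag(sd**2)                                    # diagonal covariance; off-diagonal terms could encode correlated errors
print(sm.GLS(y, X, sigma=sigma).fit().params)             # [intercept, slope]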
[20]:
df = pd.read_csv(dataurl+'ca_learning.txt', sep='\s+', header=0)
result = analysis(df, 'cost', ['responses'], printlvl=4)
Dep. Variable: | cost | R-squared: | 0.889 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.878 |
Method: | Least Squares | F-statistic: | 80.19 |
Date: | Mon, 20 May 2019 | Prob (F-statistic): | 4.33e-06 |
Time: | 13:54:27 | Log-Likelihood: | -34.242 |
No. Observations: | 12 | AIC: | 72.48 |
Df Residuals: | 10 | BIC: | 73.45 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 19.4727 | 5.516 | 3.530 | 0.005 | 7.182 | 31.764 |
responses | 3.2689 | 0.365 | 8.955 | 0.000 | 2.456 | 4.082 |
Omnibus: | 2.739 | Durbin-Watson: | 1.783 |
---|---|---|---|
Prob(Omnibus): | 0.254 | Jarque-Bera (JB): | 1.001 |
Skew: | 0.038 | Prob(JB): | 0.606 |
Kurtosis: | 1.587 | Cond. No. | 63.1 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
standard error of estimate:4.59830
A plot of the residuals versus the predictor values indicates possible nonconstant variance since there is a very slight “megaphone” pattern. We therefore fit a simple linear regression model of the absolute residuals on the predictor and calculate weights as 1 over the squared fitted values from this model.
[21]:
result_ols = ols(formula='cost ~ responses', data=df).fit()
df['resid'] = abs(result_ols.resid)
result = ols(formula='resid ~ responses', data=df).fit()
X = sm.add_constant(df.responses)
w = 1./(result.fittedvalues ** 2)
result_wls = sm.WLS(df.cost, X, weights=w).fit()
display(result_wls.summary())
fig, axes = plt.subplots(1,3,figsize=(18,5))
sns.residplot(result_wls.fittedvalues, result_wls.wresid, lowess=False, scatter_kws={"s": 80},
              line_kws={'color':'r', 'lw':1}, ax=axes[0])
axes[0].set_title('Residual plot')
axes[0].set_xlabel('Fitted values')
axes[0].set_ylabel('Residuals')
axes[1].relim()
stats.probplot(result_wls.resid, dist='norm', plot=axes[1])
axes[1].set_title('Normal Q-Q plot')
axes[2].relim()
sns.distplot(result_wls.resid, ax=axes[2]);
plt.show()
x = np.linspace(np.min(df.responses), np.max(df.responses), 20)
y_wls = result_wls.params[0] + result_wls.params[1] * x
y_ols = result_ols.params[0] + result_ols.params[1] * x
plt.plot(df.responses, df.cost, 'o', label='data')
plt.plot(x, y_ols, '-', label='ols method')
plt.plot(x, y_wls, '-', label='wls method')
plt.legend()
plt.show()
toggle()
Dep. Variable: | cost | R-squared: | 0.895 |
---|---|---|---|
Model: | WLS | Adj. R-squared: | 0.885 |
Method: | Least Squares | F-statistic: | 85.35 |
Date: | Mon, 20 May 2019 | Prob (F-statistic): | 3.27e-06 |
Time: | 13:54:27 | Log-Likelihood: | -33.248 |
No. Observations: | 12 | AIC: | 70.50 |
Df Residuals: | 10 | BIC: | 71.47 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 17.3006 | 4.828 | 3.584 | 0.005 | 6.544 | 28.058 |
responses | 3.4211 | 0.370 | 9.238 | 0.000 | 2.596 | 4.246 |
Omnibus: | 4.602 | Durbin-Watson: | 1.706 |
---|---|---|---|
Prob(Omnibus): | 0.100 | Jarque-Bera (JB): | 1.241 |
Skew: | 0.062 | Prob(JB): | 0.538 |
Kurtosis: | 1.429 | Cond. No. | 56.7 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[21]: