[1]:
%run ../initscript.py
HTML("""
<div id="popup" style="padding-bottom:5px; display:none;">
    <div>Enter Password:</div>
    <input id="password" type="password"/>
    <button onclick="done()" style="border-radius: 12px;">Submit</button>
</div>
<button onclick="unlock()" style="border-radius: 12px;">Unclock</button>
<a href="#" onclick="code_toggle(this); return false;">show code</a>
""")
[2]:
%run loadregfuncs.py
from ipywidgets import *
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
toggle()
[2]:

Regression Assumptions

As one of the most important types of data analysis, regression analysis is a way of mathematically sorting out which factors have an impact on an outcome of interest. It addresses questions such as:

  • Which factors matter most?

  • Which can we ignore?

  • How do those factors interact with each other?

  • Perhaps most importantly, how certain are we about all of these factors?

These factors are called variables, and we distinguish two kinds:

  • Dependent (or response or target) variable: the main factor that you’re trying to understand or predict

  • Independent (or explanatory or predictor) variables: the factors you suspect have an impact on your dependent variable

Generally, a regression model finds the line that fits the data “best,” meaning that the \(n\) residuals (one for each observed data point) are as small as possible in some overall sense. One way to achieve this goal is the “ordinary least squares (OLS) criterion,” which says to “minimize the sum of the squared prediction errors.”
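
To make the OLS criterion concrete, here is a minimal sketch using plain numpy with arbitrary illustrative numbers; it does not use this notebook’s loadregfuncs.py helpers.

[ ]:
import numpy as np

# Illustrative data only: x could be years of education, y annual salary
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=50)
y = 10 + 3 * x + rng.normal(0, 2, size=50)

# OLS criterion: choose intercept b0 and slope b1 minimizing
# sum_i (y_i - b0 - b1 * x_i)^2.  np.polyfit solves exactly this
# least-squares problem for a degree-1 polynomial.
b1, b0 = np.polyfit(x, y, deg=1)

residuals = y - (b0 + b1 * x)
print(f"fitted line: y_hat = {b1:.2f} x + {b0:.2f}")
print(f"sum of squared residuals: {np.sum(residuals ** 2):.2f}")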

What does the line estimate?

When looking to summarize the relationship between predictors \(X\) and a response \(Y\), we are interested in knowing the relationship in the population. The only way we could ever know it, though, is to be able to collect data on everybody in the population — most often an impossible task. We have to rely on taking and using a sample of data from the population to estimate the population regression line.

Therefore, several assumptions about the population are required in order to use the sample regression line to estimate the population regression line, that is, to carry out statistical inference in a regression context.

Regression assumptions:

  1. There is a population regression line. It joins the means of the dependent variable for all values of the explanatory variables, and the mean of the errors is zero.

  2. For any values of the explanatory variables, the variance (or standard deviation) of the dependent variable is a constant, the same for all such values.

  3. For any values of the explanatory variables, the dependent variable is normally distributed.

  4. The errors are probabilistically independent.

These assumptions represent an idealization of reality. From a practical point of view, if they are a close approximation to reality, then the analysis is valid. But if the assumptions are grossly violated, statistical inferences that are based on them should be viewed with suspicion.

First Assumption

Let \(Y\) be the dependent variable and let there be \(k\) explanatory variables \(X_1\) through \(X_k\). The first assumption states that, for any fixed values of the \(X\)’s, the mean of all corresponding \(Y\)’s is an exact linear function of those \(X\)’s.
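
Written out with standard notation for the regression coefficients (the \(\beta\)’s below are not used elsewhere in this notebook), the assumption says that the mean of \(Y\) given \(X_1,\dots,X_k\) equals \(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k\), or equivalently \(Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + \epsilon\), where the error \(\epsilon\) has mean zero.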

[3]:
interact(draw_sample,
         flag=widgets.Checkbox(value=False,description='Show Sample',disabled=False),
         id=widgets.Dropdown(options=range(1,4),value=1,description='Sample ID:',disabled=False));

Suppose we have a population as shown in the plot. For example, \(X\) is the years of college education and \(Y\) is the annual salary. Given years of education, there is a range of possible salaries for different individuals.

The first assumption says that, for each \(X\), the mean of the \(Y\)’s (\(=\overline{Y}\)) lies on the maroon line \(\overline{Y} = 3.08 X + 9.87\), which is the population regression line joining the means. For each data point, its \(Y\) value is given by the population regression line plus an error, \(Y = 3.08 X + 9.87 + \epsilon\), where \(\epsilon\) is the vertical deviation of the data point from the maroon line.

Note that since the population is not accessible, the population regression (maroon) line is not observable in reality.

[4]:
interact(draw_sample,
         flag=widgets.Checkbox(value=True,description='Show Sample',disabled=False),
         id=widgets.Dropdown(options=range(1,4),value=1,description='Sample ID:',disabled=False));

The estimated regression line, shown in orange, is based on the sample data and is usually different from the population regression line. The regression lines derived from the first two samples deviate substantially from the population regression line.

Note that an error \(\epsilon\) is not quite the same as a residual \(e\). An error is the vertical distance from a point to the (unobservable) population regression line. A residual is the vertical distance from a point to the estimated regression line.
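
In symbols, for a data point \((X, Y)\) in this example: the error is \(\epsilon = Y - (3.08 X + 9.87)\), the vertical distance to the population (maroon) line, while the residual is \(e = Y - \hat{Y}\), the vertical distance to the estimated (orange) line, where \(\hat{Y}\) is the fitted value.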

Second Assumption

The second assumption requires constant error variance; the technical term is homoscedasticity. This assumption is often questionable because the variation in \(Y\) often increases as \(X\) increases. For example, the variation of spending increases as salary increases. We say the data exhibit heteroscedasticity (nonconstant error variance) if the variability of \(Y\) values is larger for some \(X\) values than for others. The easiest way to detect nonconstant error variance is through visual inspection of a scatterplot.

It is usually sufficient to “visually” interpret a plot of residuals versus fitted values. However, there are also several hypothesis tests on the residuals for checking constant variance (one of these is sketched in code after the list):

  • Brown-Forsythe test (Modified Levene Test)

  • Cook-Weisberg score test (Breusch-Pagan Test)

  • Bartlett’s Test
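
As an illustration, here is a minimal sketch of the Breusch-Pagan (Cook-Weisberg score) test using statsmodels, which is not loaded elsewhere in this notebook; the data are simulated so that the error spread grows with \(X\).

[ ]:
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3 + 2 * x + rng.normal(0, 0.5 + 0.4 * x)   # error spread grows with x

fit = sm.OLS(y, sm.add_constant(x)).fit()

# H0: constant error variance (homoscedasticity)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, fit.model.exog)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")   # small p-value suggests heteroscedasticity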

Third Assumption

The third assumption is equivalent to stating that the errors are normally distributed. We can check this by forming a histogram (or a Q-Q plot) of the residuals.

  • If the assumption holds, the histogram should be approximately symmetric and bell-shaped, and the points of a Q-Q plot should be close to a 45-degree line.

  • If there is obvious skewness or some other nonnormal property, this indicates a violation of assumption 3.

[5]:
interact(draw_qq,
         dist=widgets.Dropdown(options=['normal','student_t','uniform','triangular'],
                               value='student_t',description='Distribution:',disabled=False));

The plot shows the Q-Q plot and histogram for each of four distributions: normal, Student’s t, uniform, and triangular.

Besides graphical methods for assessing residual normality, there are several hypothesis tests in which the null hypothesis is that the errors have a normal distribution, such as:

  • Kolmogorov-Smirnov Test

  • Anderson-Darling Test

  • Shapiro-Wilk Test

  • Ryan-Joiner Test

If those tests are available in software, then a large p-value indicates failure to reject this null hypothesis.
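
For example, here is a minimal sketch of the Shapiro-Wilk test using scipy.stats (an assumed import, not loaded elsewhere in this notebook) applied to simulated residuals.

[ ]:
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(size=200)   # stand-in for model residuals

# H0: the residuals come from a normal distribution
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk statistic = {stat:.3f}, p-value = {p_value:.3f}")
# A large p-value means we fail to reject normality.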

Fourth Assumption

The fourth assumption requires probabilistic independence of the errors.

  • This assumption means that information on some of the errors provides no information on the values of the other errors. However, when the sample data have been collected over time and the regression model fails to effectively capture any time trends, the random errors in the model are often positively correlated over time. This phenomenon is known as autocorrelation (or serial correlation) and can sometimes be detected by plotting the model residuals versus time.

  • For cross-sectional data, this assumption is usually taken for granted.

  • For time-series data, this assumption is often violated because of autocorrelation.

  • The Durbin-Watson statistic is one measure of autocorrelation and thus measures the extent to which assumption 4 is violated (a minimal sketch of computing it follows this list).
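
To illustrate, here is a minimal sketch using statsmodels on simulated time-series data with positively autocorrelated (AR(1)) errors; none of these names come from this notebook’s helpers.

[ ]:
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
n = 200
t = np.arange(n, dtype=float)

# Positively autocorrelated errors: eps[i] = 0.8 * eps[i-1] + noise
eps = np.zeros(n)
for i in range(1, n):
    eps[i] = 0.8 * eps[i - 1] + rng.normal()

y = 5 + 0.3 * t + eps
fit = sm.OLS(y, sm.add_constant(t)).fit()

# Values near 2 are consistent with independent errors;
# values well below 2 suggest positive autocorrelation.
print(f"Durbin-Watson statistic: {durbin_watson(fit.resid):.2f}")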

Summary

We now summarize the four assumptions that underlie the linear regression model:

  • The mean of the response is a Linear function of the explanatory variables.

  • The errors are Independent.

  • The errors are Normally distributed.

  • The errors have Equal variances.

The capitalized first letters spell “LINE”.