OLS: Diagnostics#

Ordinary Least Squares / Linear Regression

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(0)

Outliers#

The example below uses a simulated dataset with one extreme outlier.

n = 500
df = pd.DataFrame({
    "x1":np.random.normal(10,1,n),
    "x2":np.random.normal(2,1,n),
    "e":np.random.normal(0,1,n)
})

df["y"] = 2 + 3*df["x1"] + (-2*df["x2"]) + df["e"]
df.loc[500] = [11, 10, 0, 60]  # append one extreme outlier (x1, x2, e, y)
sns.scatterplot(x=df["x1"], y=df["x2"], hue=df["y"])
<Axes: xlabel='x1', ylabel='x2'>
[figure: scatterplot of x1 vs x2, colored by y]
X = sm.add_constant(df[["x1","x2"]], prepend=False)
mod = sm.OLS(df["y"], X)
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.719
Model:                            OLS   Adj. R-squared:                  0.718
Method:                 Least Squares   F-statistic:                     637.0
Date:                Thu, 01 Jun 2023   Prob (F-statistic):          5.48e-138
Time:                        12:46:00   Log-Likelihood:                -1086.7
No. Observations:                 501   AIC:                             2179.
Df Residuals:                     498   BIC:                             2192.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             3.0629      0.095     32.220      0.000       2.876       3.250
x2            -1.3370      0.091    -14.636      0.000      -1.516      -1.157
const          0.1977      0.974      0.203      0.839      -1.715       2.110
==============================================================================
Omnibus:                      966.837   Durbin-Watson:                   1.318
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          1203045.282
Skew:                          12.899   Prob(JB):                         0.00
Kurtosis:                     241.674   Cond. No.                         106.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Cook’s Distance#

\[D_i = \frac{\sum^n_{j=1}(\hat{y_j}-\hat{y_{j(i)}})^2}{ps^2}\]

where \(\hat{y}_{j(i)}\) is the j-th fitted value from the model trained leaving out the i-th observation, \(p\) is the number of predictors, and \(s^2\) is the mean squared error of the full model.

get_influence() and summary_frame() report the dfbetas (the difference in each parameter estimate with and without the observation), along with Cook's distance, leverage, and several flavors of residuals. Since the dataset has 501 observations, there are 501 rows in the summary_frame. Note that this approach may not be ideal for very large datasets, since leave-one-out quantities are computed for every observation.
influence = res.get_influence()
df_inf = influence.summary_frame()
df_inf.sort_values('cooks_d', ascending=False)
dfb_x1 dfb_x2 dfb_const cooks_d standard_resid hat_diag dffits_internal student_resid dffits
500 2.497732 16.199589 -5.195702 1.875610e+01 19.872939 0.124708 7.501219 43.639717 16.472203
234 0.073639 -0.165384 -0.048103 1.262759e-02 -1.511125 0.016319 -0.194635 -1.513080 -0.194887
494 -0.156402 -0.022125 0.150835 9.350211e-03 -1.269424 0.017109 -0.167483 -1.270205 -0.167586
458 -0.018130 -0.152639 0.039171 9.270042e-03 -1.466662 0.012763 -0.166764 -1.468363 -0.166957
205 0.088814 -0.118305 -0.070447 8.538708e-03 -1.271721 0.015592 -0.160050 -1.272511 -0.160150
... ... ... ... ... ... ... ... ... ...
62 0.000144 -0.000040 -0.000150 1.864204e-08 -0.004085 0.003340 -0.000236 -0.004081 -0.000236
264 0.000125 -0.000046 -0.000120 7.438431e-09 -0.001464 0.010298 -0.000149 -0.001463 -0.000149
322 -0.000018 0.000047 0.000002 2.188690e-09 -0.001407 0.003305 -0.000081 -0.001406 -0.000081
296 -0.000003 -0.000003 0.000001 2.518891e-10 -0.000606 0.002055 -0.000027 -0.000605 -0.000027
472 0.000002 -0.000002 -0.000002 2.455959e-11 -0.000180 0.002257 -0.000009 -0.000180 -0.000009

501 rows × 9 columns
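A common rule of thumb flags observations with \(D_i > 4/n\); the cutoff is arbitrary, but it is a reasonable screen. A minimal sketch using the summary frame above:

# flag observations whose Cook's distance exceeds the common 4/n cutoff
threshold = 4 / len(df)
df_inf[df_inf["cooks_d"] > threshold].index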

Leverage#

OLS predicted values are represented as:

\[\hat{y} = X(X'X)^{-1}X'y\]

Let \(H=X(X'X)^{-1}X'\), then \(\hat{y} = Hy\)

We call \(H\) the hat matrix. The leverage \(h_{ii}\), the i-th diagonal element of \(H\), quantifies the influence that the observed response \(y_i\) has on its own predicted value \(\hat{y_i}\).

influence = res.get_influence()
# equivalent to "hat_diag" column in df_inf
leverage = influence.hat_matrix_diag
df["leverage"] = leverage
df.sort_values('leverage', ascending=False)
x1 x2 e y leverage
500 11.000000 10.000000 0.000000 60.000000 0.124708
327 12.256723 4.225944 -0.025799 30.292483 0.022440
185 8.397942 -0.834555 -0.177813 28.685123 0.021730
89 11.054452 -1.046143 0.051820 37.307462 0.020670
398 11.466579 4.594425 0.768771 27.979658 0.019717
... ... ... ... ... ...
267 10.088422 2.019279 2.243602 30.470309 0.002030
449 9.889611 1.883770 0.436638 28.337930 0.002020
319 9.892695 2.022960 -1.297380 26.334785 0.002019
302 9.881836 1.942530 0.319432 28.079879 0.002014
263 9.947433 2.024612 2.023472 29.816546 0.002008

501 rows × 5 columns
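As a sanity check, the leverage values can be computed directly from the design matrix; a minimal sketch (the diagonal of \(H\) also sums to the number of estimated parameters, here 3):

# leverage computed directly from the hat matrix H = X(X'X)^{-1}X'
X_mat = X.values
H = X_mat @ np.linalg.inv(X_mat.T @ X_mat) @ X_mat.T
print(np.allclose(np.diag(H), leverage))  # True
print(np.diag(H).sum())                   # trace(H) = number of parameters = 3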

# leverage vs studentized residuals plot
stud_res = res.outlier_test()  # studentized residuals with Bonferroni-adjusted p-values
sns.scatterplot(x=leverage, y=stud_res["student_resid"])
<Axes: ylabel='student_resid'>
[figure: leverage vs. studentized residuals]

Note the changes in results after outlier removal.

df = df.drop(500)
X = sm.add_constant(df[["x1","x2"]], prepend=False)
mod = sm.OLS(df["y"], X)
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.933
Model:                            OLS   Adj. R-squared:                  0.933
Method:                 Least Squares   F-statistic:                     3480.
Date:                Thu, 01 Jun 2023   Prob (F-statistic):          5.01e-293
Time:                        12:46:00   Log-Likelihood:                -691.18
No. Observations:                 500   AIC:                             1388.
Df Residuals:                     497   BIC:                             1401.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             2.9548      0.043     68.145      0.000       2.870       3.040
x2            -2.0109      0.044    -45.317      0.000      -2.098      -1.924
const          2.5011      0.446      5.602      0.000       1.624       3.378
==============================================================================
Omnibus:                        0.355   Durbin-Watson:                   2.034
Prob(Omnibus):                  0.837   Jarque-Bera (JB):                0.203
Skew:                           0.010   Prob(JB):                        0.903
Kurtosis:                       3.097   Cond. No.                         106.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Normality of Residuals#

D’Agostino’s K^2 Test (Omnibus) and Jarque-Bera#

In the bottom section of the OLS summary table below, the Omnibus test and the Jarque-Bera (JB) test both test the null hypothesis that the residuals are normally distributed.

Prob(Omnibus) is 0.205 and Prob(JB) is 0.266, so under both tests we do not reject the null hypothesis that the residuals are normally distributed.

We plot the density of the residuals to confirm that the residuals appear to be normally distributed.

n = 500
df = pd.DataFrame({
    "x1":np.random.normal(10,1,n),
    "x2":np.random.normal(2,1,n),
    "e":np.random.normal(0,1,n)
})

df["y"] = 2 + 3*df["x1"] + (-2)*df["x2"] + df["e"]

X = sm.add_constant(df[["x1","x2"]], prepend=False)
mod = sm.OLS(df["y"], X)
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.930
Model:                            OLS   Adj. R-squared:                  0.930
Method:                 Least Squares   F-statistic:                     3296.
Date:                Thu, 01 Jun 2023   Prob (F-statistic):          1.45e-287
Time:                        12:46:00   Log-Likelihood:                -678.36
No. Observations:                 500   AIC:                             1363.
Df Residuals:                     497   BIC:                             1375.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             3.0042      0.043     69.126      0.000       2.919       3.090
x2            -1.9924      0.044    -45.698      0.000      -2.078      -1.907
const          1.8701      0.441      4.240      0.000       1.004       2.737
==============================================================================
Omnibus:                        3.168   Durbin-Watson:                   1.801
Prob(Omnibus):                  0.205   Jarque-Bera (JB):                2.645
Skew:                          -0.070   Prob(JB):                        0.266
Kurtosis:                       2.673   Cond. No.                         108.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
sns.kdeplot(res.resid)
<Axes: ylabel='Density'>
[figure: kernel density plot of the residuals]
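The same statistics can be computed directly with statsmodels.stats.api (this is where the lzip import from the setup cell is handy); a minimal sketch:

# D'Agostino's K^2 (omnibus) test for normality
name = ["Chi^2", "p-value"]
omni = sms.omni_normtest(res.resid)
print(lzip(name, omni))

# Jarque-Bera test for normality
name = ["Jarque-Bera", "p-value", "skew", "kurtosis"]
jb = sms.jarque_bera(res.resid)
print(lzip(name, jb))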

Suppose we generate fake data such that the residuals are not normally distributed. The test results and the residual density plot change accordingly:

df = pd.DataFrame({
    "x1":np.random.normal(10,1,n),
    "x2":np.random.normal(2,1,n),
    "e":np.random.beta(0.1,0.2,n)
})

df["y"] = 2 + 3*df["x1"] + (-2)*df["x2"] + df["e"]

X = sm.add_constant(df[["x1","x2"]], prepend=False)
mod = sm.OLS(df["y"], X)
res_beta = mod.fit()
print(res_beta.summary())
sns.kdeplot(res_beta.resid)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.988
Model:                            OLS   Adj. R-squared:                  0.988
Method:                 Least Squares   F-statistic:                 2.012e+04
Date:                Thu, 01 Jun 2023   Prob (F-statistic):               0.00
Time:                        12:46:00   Log-Likelihood:                -267.08
No. Observations:                 500   AIC:                             540.2
Df Residuals:                     497   BIC:                             552.8
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             2.9867      0.018    165.033      0.000       2.951       3.022
x2            -2.0056      0.018   -109.089      0.000      -2.042      -1.969
const          2.4798      0.186     13.333      0.000       2.114       2.845
==============================================================================
Omnibus:                      661.487   Durbin-Watson:                   2.018
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               75.516
Skew:                           0.704   Prob(JB):                     4.00e-17
Kurtosis:                       1.719   Cond. No.                         104.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
<Axes: ylabel='Density'>
[figure: kernel density plot of the residuals (beta-distributed errors)]

QQ Plots#

A QQ plot compares your sample, sorted in ascending order (here, the sample is the residuals), against the quantiles of a theoretical distribution (by default, the normal distribution).

The number of quantiles is selected to match the size of the sample data.

In each of the two plots below, the red line is a 45-degree line. If the distribution is normal, the values should “hug” the line.

g = sm.graphics.qqplot(res.resid, line="45")
g_beta = sm.graphics.qqplot(res_beta.resid, line="45")
[figures: QQ plots for res.resid and res_beta.resid]
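For intuition, here is a hand-rolled sketch of the same comparison for the normal-residual model, using one common choice of plotting positions; sm.graphics.qqplot handles these details for you:

from scipy import stats

# sorted residuals vs. standard normal quantiles
resid_sorted = np.sort(res.resid)
n_obs = len(resid_sorted)
probs = (np.arange(1, n_obs + 1) - 0.5) / n_obs   # plotting positions
theoretical = stats.norm.ppf(probs)

plt.scatter(theoretical, resid_sorted, alpha=0.5)
lims = [theoretical.min(), theoretical.max()]
plt.plot(lims, lims, color="red", lw=1)  # 45-degree reference line
plt.xlabel("theoretical quantiles")
plt.ylabel("sample quantiles")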

Multicollinearity#

The condition number can be used (high values indicate multicollinearity), but in practice it is more common to inspect the correlation matrix, pairwise plots, and VIFs.

df = pd.DataFrame({
    "x1":np.random.normal(10,1,n),
    "x2":np.random.normal(2,1,n),
    "e":np.random.beta(0.1,0.2,n)
})

df["x3"] = df["x1"] * 1.3 + np.random.normal(0,2,n)
df["x4"] = df["x2"] * 1.3 + np.random.normal(0,1,n)
df["x5"] = df["x1"] * 1.3 + np.random.normal(0,0.5,n)

Condition Number#

# Condition number of the design matrix stored in res
# (x1, x2, and a constant; the new collinear columns are not included)
np.linalg.cond(res.model.exog)
108.14224830979715
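For comparison, a sketch of the condition number for a design matrix that does include the collinear columns generated above; it should be far larger:

# condition number with the collinear columns included
X_full = sm.add_constant(df[["x1", "x2", "x3", "x4", "x5"]], prepend=False)
np.linalg.cond(X_full.values)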

Correlation Matrix#

features = ["x1","x2","x3","x4","x5"]
corr_mat = df[features].corr()
corr_mat
x1 x2 x3 x4 x5
x1 1.000000 0.011713 0.555447 -0.046231 0.936899
x2 0.011713 1.000000 0.097119 0.794966 0.014678
x3 0.555447 0.097119 1.000000 0.051660 0.540569
x4 -0.046231 0.794966 0.051660 1.000000 -0.044700
x5 0.936899 0.014678 0.540569 -0.044700 1.000000
sns.heatmap(corr_mat)
<Axes: >
[figure: correlation heatmap of x1 through x5]
g = sns.PairGrid(df[features])
g.map_diag(sns.histplot)
g.map_offdiag(sns.scatterplot)
<seaborn.axisgrid.PairGrid at 0x7f6b44766970>
[figure: pairwise scatterplot grid of x1 through x5]

VIF#

The variance inflation factor is a measure of the increase in the variance of a parameter estimate when an additional variable, given by exog_idx, is added to the linear regression. It is a measure of multicollinearity of the design matrix.

To get VIF, pick each feature and regress it against all other features. For each regression, the factor is calculated as \(VIF = \frac{1}{1-R^2}\)

Generally, VIF > 5 indicates high multicollinearity

from statsmodels.stats.outliers_influence import variance_inflation_factor
X = df[features]
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                          for i in range(len(X.columns))]
vif_data
feature VIF
0 x1 757.478874
1 x2 14.286991
2 x3 48.562106
3 x4 9.554965
4 x5 734.991722
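Note that variance_inflation_factor regresses each column on the remaining columns of X exactly as given; because X above contains no constant column, the auxiliary regressions run without an intercept, which can inflate the reported VIFs for variables with nonzero means. As a check, a minimal sketch of the formula for x1 with an intercept included:

# manual VIF for x1: regress x1 on the other features (with an intercept)
others = sm.add_constant(df[["x2", "x3", "x4", "x5"]], prepend=False)
aux = sm.OLS(df["x1"], others).fit()
print(1 / (1 - aux.rsquared))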

Heteroskedasticity tests#

Also see the Breusch-Pagan and Goldfeld-Quandt tests; a sketch of both appears after the summary table below.

n = 500
df = pd.DataFrame({
    "x1":np.random.normal(10,1,n),
    "x2":np.random.normal(2,1,n),
    "e":np.random.normal(0,1,n)
})

df["y"] = 2 + 3*df["x1"] + (-2)*df["x2"] + df["e"]

X = sm.add_constant(df[["x1","x2"]], prepend=False)
mod = sm.OLS(df["y"], X)
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.921
Model:                            OLS   Adj. R-squared:                  0.921
Method:                 Least Squares   F-statistic:                     2904.
Date:                Thu, 01 Jun 2023   Prob (F-statistic):          6.61e-275
Time:                        12:46:02   Log-Likelihood:                -715.58
No. Observations:                 500   AIC:                             1437.
Df Residuals:                     497   BIC:                             1450.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             2.9856      0.047     63.018      0.000       2.893       3.079
x2            -2.0059      0.045    -44.356      0.000      -2.095      -1.917
const          2.1361      0.480      4.455      0.000       1.194       3.078
==============================================================================
Omnibus:                        1.417   Durbin-Watson:                   2.108
Prob(Omnibus):                  0.492   Jarque-Bera (JB):                1.355
Skew:                          -0.023   Prob(JB):                        0.508
Kurtosis:                       2.749   Cond. No.                         109.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
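A minimal sketch of the two tests mentioned above; both take homoscedasticity as the null hypothesis:

# Breusch-Pagan test
name = ["LM stat", "LM p-value", "F stat", "F p-value"]
bp = sms.het_breuschpagan(res.resid, res.model.exog)
print(lzip(name, bp))

# Goldfeld-Quandt test
name = ["F stat", "p-value"]
gq = sms.het_goldfeldquandt(res.resid, res.model.exog)
print(lzip(name, gq))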

Plot sqrt(|standardized residuals|) vs. fitted values.

A roughly horizontal trend line suggests homoscedasticity.

model_norm_residuals = res.get_influence().resid_studentized_internal
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))
sns.scatterplot(x=res.predict(), y=model_norm_residuals_abs_sqrt, alpha=0.5)
sns.regplot(x=res.predict(), y=model_norm_residuals_abs_sqrt,
            scatter=False,
            ci=False,
            lowess=True,
            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
<Axes: >
[figure: sqrt(|standardized residuals|) vs. fitted values]

Heteroscedasticity may appear when the model is misspecified:

n = 500
df = pd.DataFrame({
    "x1":np.random.normal(10,1,n),
    "x2":np.random.normal(2,1,n),
    "e":np.random.normal(0,1,n)
})

df["y"] = 2 + 2*df["x1"] + (3)*df["x2"]**2 + df["e"]

X = sm.add_constant(df[["x1","x2"]], prepend=False)
mod = sm.OLS(df["y"], X)
res = mod.fit()
sns.scatterplot(x=df["x1"], y=df["x2"], hue=df["y"])
<Axes: xlabel='x1', ylabel='x2'>
[figure: scatterplot of x1 vs x2, colored by y]
model_norm_residuals = res.get_influence().resid_studentized_internal
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))
sns.scatterplot(x=res.predict(), y=model_norm_residuals_abs_sqrt, alpha=0.5)
sns.regplot(x=res.predict(), y=model_norm_residuals_abs_sqrt,
            scatter=False,
            ci=False,
            lowess=True,
            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
<Axes: >
[figure: sqrt(|standardized residuals|) vs. fitted values]

Heteroscedasticity may also appear when the errors are autocorrelated:

n = 500
df = pd.DataFrame({
    "x1":np.random.normal(0,1,n),
    "time":np.arange(0,n),
    "e":np.random.normal(0,1,n)
})

# build y as an AR(1)-style process: y_t depends on x1, noise, and 0.95 * y_{t-1}
y_list = [0]
for i in range(n):
    y = df.loc[i,"x1"] + 0.95*y_list[i] + df.loc[i,"e"]
    y_list.append(y)

df["y"] = y_list[1:]

X = sm.add_constant(df[["x1","time"]], prepend=False)
mod = sm.OLS(df["y"], X)
res = mod.fit()
sns.scatterplot(x=df["time"], y=df["y"])
<Axes: xlabel='time', ylabel='y'>
[figure: y over time]
model_norm_residuals = res.get_influence().resid_studentized_internal
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))
sns.scatterplot(x=df["time"], y=model_norm_residuals_abs_sqrt, alpha=0.5)
sns.regplot(x=df["time"], y=model_norm_residuals_abs_sqrt,
            scatter=False,
            ci=False,
            lowess=True,
            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
<Axes: xlabel='time'>
[figure: sqrt(|standardized residuals|) vs. time]
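The Durbin-Watson statistic (also reported in the summary table) is near 2 when there is no first-order autocorrelation and falls toward 0 under positive autocorrelation; a minimal sketch on this model's residuals:

# Durbin-Watson statistic: ~2 means no first-order autocorrelation,
# values well below 2 indicate positive autocorrelation
sms.durbin_watson(res.resid)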

Linearity tests#

Also see the Harvey-Collier multiplier test, whose null hypothesis is that the linear specification is correct.
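A minimal sketch, applied to the most recently fit model:

# Harvey-Collier t-test: H0 = the linear specification is correct
sms.linear_harvey_collier(res)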

Plot residuals against fitted values

(or plot observed vs predicted values)

sns.residplot(
    x=res.predict(),
    y=res.resid,
    lowess=True,
    scatter_kws={'alpha': 0.5},
    line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8},
)
<Axes: >
[figure: residuals vs. fitted values with lowess line]