Interactions#

import numpy as np
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
import patsy
from matplotlib import pyplot as plt
from scipy.special import expit as invlogit

OLS#

n = 500
np.random.seed(0)
df = pd.DataFrame({
    "income":np.random.normal(5,1,n),
    # "educ":np.random.choice([1,2,3,4,5], n),
    "e":np.random.normal(0,1,n)
})

df["educ"] = 1 + np.random.binomial(4,df["income"] / df["income"].max())

df["favorability"] = 2 + 1.4*df["income"] + 5*df["educ"] + -0.9*df["income"]*df["educ"] + df["e"]
sns.scatterplot(x="income", y="educ", hue="favorability", data=df)
(Figure: scatter of income vs. educ, points colored by favorability.)
f = 'favorability ~ income + C(educ) + income*C(educ)'
y, X = patsy.dmatrices(f, df, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           favorability   R-squared:                       0.804
Model:                            OLS   Adj. R-squared:                  0.801
Method:                 Least Squares   F-statistic:                     224.0
Date:                Thu, 01 Jun 2023   Prob (F-statistic):          2.19e-167
Time:                        12:45:57   Log-Likelihood:                -694.78
No. Observations:                 500   AIC:                             1410.
Df Residuals:                     490   BIC:                             1452.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept               8.2033      1.758      4.666      0.000       4.749      11.658
C(educ)[T.2]            3.6147      1.851      1.953      0.051      -0.022       7.251
C(educ)[T.3]            8.7862      1.823      4.821      0.000       5.205      12.367
C(educ)[T.4]           13.9905      1.813      7.718      0.000      10.429      17.552
C(educ)[T.5]           18.6545      1.847     10.102      0.000      15.026      22.283
income                  0.1805      0.455      0.397      0.691      -0.713       1.074
income:C(educ)[T.2]    -0.5144      0.473     -1.087      0.278      -1.444       0.416
income:C(educ)[T.3]    -1.4917      0.465     -3.207      0.001      -2.406      -0.578
income:C(educ)[T.4]    -2.4434      0.462     -5.285      0.000      -3.352      -1.535
income:C(educ)[T.5]    -3.2722      0.465     -7.034      0.000      -4.186      -2.358
==============================================================================
Omnibus:                        0.246   Durbin-Watson:                   2.091
Prob(Omnibus):                  0.884   Jarque-Bera (JB):                0.166
Skew:                           0.041   Prob(JB):                        0.920
Kurtosis:                       3.035   Cond. No.                         543.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The above example uses simulated data containing favorability ratings for some politician. Each row is a constituent, and the covariates measure their education level (a categorical variable from 1 to 5) and their income (in units of $10k).

Interpretations:

  • The intercept represents the favorability rating for an individual with education level 1 and 0 income.

  • The coefficient of C(educ)[T.d] represents the average difference in favorability rating between an individual with education level d and an individual with education level 1, for an individual with income = 0.

  • The coefficient of income represents the average change in favorability given a one unit increase ($10k increase) in income for an individual at the reference education level (educ = 1).

    • If there were no interaction term, this coefficient would represent the change in favorability given a one unit increase in income for any individual, regardless of education level. But due to the interactions, this coefficient applies only to individuals at the reference education level (educ = 1).

  • The coefficient of “income:C(educ)[T.2]” represents the average difference in the slope of income on favorability between individuals with education level 2 and education level 1. Equivalently, it is the average change in the educ=2 vs educ=1 favorability gap per one unit increase in income. Here, the coefficient of -0.5144 implies that the slope of income on favorability is 0.1805 for education level 1 individuals and 0.1805 - 0.5144 = -0.3339 for education level 2. Put another way, for every one unit increase in income, the favorability gap between education level 2 and education level 1 (3.6147 at income = 0) shrinks by 0.5144.
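The slope arithmetic in the last bullet can be checked directly from the coefficient table. A minimal sketch, hardcoding the fitted values printed above rather than pulling them from `res.params`:

```python
# Implied slope of income on favorability at each education level:
# the baseline income coefficient plus the matching interaction term.
b_income = 0.1805
interaction = {1: 0.0, 2: -0.5144, 3: -1.4917, 4: -2.4434, 5: -3.2722}

slopes = {educ: round(b_income + delta, 4) for educ, delta in interaction.items()}
print(slopes)  # educ=2 slope is 0.1805 - 0.5144 = -0.3339
```

Note the slope is positive only at educ=1 and grows more negative as education increases.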

def plot_scatter(df, educ):
    sub = df.query(f"educ=={educ}")
    g = sns.scatterplot(x="income", y="favorability", data=sub, alpha=0.2)
    g = sns.regplot(x="income", y="favorability", data=sub, label=educ, scatter=False)
    return g

for educ in range(1,6):
    g1 = plot_scatter(df,educ)
g1.legend()
(Figure: income vs. favorability scatter with one fitted regression line per education level, educ 1–5.)

Centering#

Whether fitting an OLS, GLM, or MLM, centering can help with easier interpretation when interactions are present.
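As a quick illustration with a toy simple regression (synthetic data, `np.polyfit` for brevity): centering the predictor leaves the slope untouched and shifts the intercept to the fitted value at the mean of the predictor.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(5, 1, 200)
y = 2 + 3 * x + rng.normal(0, 1, 200)

b1, a1 = np.polyfit(x, y, 1)             # slope, intercept on raw x
b2, a2 = np.polyfit(x - x.mean(), y, 1)  # same fit on centered x

print(np.isclose(b1, b2))                  # slopes are identical
print(np.isclose(a2, a1 + b1 * x.mean()))  # intercept shifts by slope * mean(x)
```

Centering is a pure reparameterization, so fitted values and residuals are unchanged; only the meaning of "intercept-like" coefficients changes.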

df["income_ctr"] = df["income"] - np.mean(df["income"])
f = 'favorability ~ income_ctr + C(educ) + income_ctr*C(educ)'
y, X = patsy.dmatrices(f, df, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           favorability   R-squared:                       0.804
Model:                            OLS   Adj. R-squared:                  0.801
Method:                 Least Squares   F-statistic:                     224.0
Date:                Thu, 01 Jun 2023   Prob (F-statistic):          2.19e-167
Time:                        12:45:58   Log-Likelihood:                -694.78
No. Observations:                 500   AIC:                             1410.
Df Residuals:                     490   BIC:                             1452.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                   9.1013      0.580     15.697      0.000       7.962      10.241
C(educ)[T.2]                1.0560      0.597      1.768      0.078      -0.118       2.230
C(educ)[T.3]                1.3657      0.586      2.332      0.020       0.215       2.516
C(educ)[T.4]                1.8355      0.585      3.137      0.002       0.686       2.985
C(educ)[T.5]                2.3766      0.592      4.017      0.000       1.214       3.539
income_ctr                  0.1805      0.455      0.397      0.691      -0.713       1.074
income_ctr:C(educ)[T.2]    -0.5144      0.473     -1.087      0.278      -1.444       0.416
income_ctr:C(educ)[T.3]    -1.4917      0.465     -3.207      0.001      -2.406      -0.578
income_ctr:C(educ)[T.4]    -2.4434      0.462     -5.285      0.000      -3.352      -1.535
income_ctr:C(educ)[T.5]    -3.2722      0.465     -7.034      0.000      -4.186      -2.358
==============================================================================
Omnibus:                        0.246   Durbin-Watson:                   2.091
Prob(Omnibus):                  0.884   Jarque-Bera (JB):                0.166
Skew:                           0.041   Prob(JB):                        0.920
Kurtosis:                       3.035   Cond. No.                         42.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Note that the interaction coefficients (and the income_ctr coefficient) are identical to the uncentered fit; only the intercept and the educ main effects change, and the condition number drops from 543 to 42.2.
Now the coefficient for C(educ)[T.2], 1.0560, is the average difference in favorability between individuals with educ=2 and educ=1 at the average income level (as opposed to income = 0).
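The fact that the interaction coefficient is unchanged by centering holds in general, not just in this fit. A small self-contained check with a binary group and `np.linalg.lstsq` (toy data, not the favorability simulation above):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.normal(5, 1, n)            # continuous predictor
g = rng.integers(0, 2, n)          # binary group indicator
y = 1 + 2 * x + 3 * g - 0.5 * x * g + rng.normal(0, 1, n)

def fit(xv):
    # design columns: intercept, x, group, x-by-group interaction
    X = np.column_stack([np.ones(n), xv, g, xv * g])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_raw = fit(x)
b_ctr = fit(x - x.mean())

# slope and interaction coefficients are unchanged;
# the intercept and group main effect absorb the shift
print(np.allclose(b_raw[[1, 3]], b_ctr[[1, 3]]))  # True
```

Algebraically, substituting x - m for x only redistributes m times the slope and interaction into the intercept and group effect, so those two columns' coefficients move while the slope and interaction stay fixed.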

Logistic#

Now, we generate a binary “vote” column.
Then we model the probability of vote using a logit link.
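For reference, `expit` (imported above as `invlogit`) is simply the logistic sigmoid; a hand-rolled version for intuition:

```python
import numpy as np

def invlogit_manual(x):
    # inverse logit (sigmoid): maps a log-odds value to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(invlogit_manual(0.0))  # log-odds of 0 corresponds to probability 0.5
print(invlogit_manual(2.0))  # positive log-odds give probability above 0.5
```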

df["vote"] = np.random.binomial(1, invlogit(df["favorability"]-np.mean(df["favorability"])), n)
sns.scatterplot(x="income", y="educ", hue="vote", data=df, alpha=0.5)
(Figure: scatter of income vs. educ, points colored by vote.)
f = 'vote ~ income + C(educ) + income*C(educ)'
y, X = patsy.dmatrices(f, df, return_type='dataframe')
mod = sm.Logit(y, X)
res_log = mod.fit()
print(res_log.summary())
Optimization terminated successfully.
         Current function value: 0.524547
         Iterations 7
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                   vote   No. Observations:                  500
Model:                          Logit   Df Residuals:                      490
Method:                           MLE   Df Model:                            9
Date:                Thu, 01 Jun 2023   Pseudo R-squ.:                  0.2404
Time:                        12:45:58   Log-Likelihood:                -262.27
converged:                       True   LL-Null:                       -345.28
Covariance Type:            nonrobust   LLR p-value:                 4.187e-31
=======================================================================================
                          coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept              -4.4793      5.175     -0.866      0.387     -14.622       5.663
C(educ)[T.2]            4.5777      5.311      0.862      0.389      -5.831      14.986
C(educ)[T.3]           10.2035      5.343      1.910      0.056      -0.269      20.676
C(educ)[T.4]           14.9313      5.427      2.751      0.006       4.294      25.569
C(educ)[T.5]           18.8533      5.956      3.165      0.002       7.179      30.527
income                  0.7932      1.294      0.613      0.540      -1.743       3.329
income:C(educ)[T.2]    -0.7420      1.322     -0.561      0.575      -3.334       1.849
income:C(educ)[T.3]    -1.8763      1.321     -1.421      0.155      -4.465       0.712
income:C(educ)[T.4]    -2.7696      1.331     -2.081      0.037      -5.378      -0.161
income:C(educ)[T.5]    -3.4311      1.399     -2.453      0.014      -6.173      -0.689
=======================================================================================

And the model after centering:

f = 'vote ~ income_ctr + C(educ) + income_ctr*C(educ)'
y, X = patsy.dmatrices(f, df, return_type='dataframe')
mod = sm.Logit(y, X)
res_log = mod.fit()
print(res_log.summary())
Optimization terminated successfully.
         Current function value: 0.524547
         Iterations 7
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                   vote   No. Observations:                  500
Model:                          Logit   Df Residuals:                      490
Method:                           MLE   Df Model:                            9
Date:                Thu, 01 Jun 2023   Pseudo R-squ.:                  0.2404
Time:                        12:45:58   Log-Likelihood:                -262.27
converged:                       True   LL-Null:                       -345.28
Covariance Type:            nonrobust   LLR p-value:                 4.187e-31
===========================================================================================
                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                  -0.5332      1.458     -0.366      0.715      -3.391       2.324
C(educ)[T.2]                0.8864      1.488      0.596      0.551      -2.030       3.803
C(educ)[T.3]                0.8695      1.469      0.592      0.554      -2.010       3.749
C(educ)[T.4]                1.1536      1.474      0.783      0.434      -1.734       4.042
C(educ)[T.5]                1.7848      1.514      1.179      0.238      -1.182       4.752
income_ctr                  0.7932      1.294      0.613      0.540      -1.743       3.329
income_ctr:C(educ)[T.2]    -0.7420      1.322     -0.561      0.575      -3.334       1.849
income_ctr:C(educ)[T.3]    -1.8763      1.321     -1.421      0.155      -4.465       0.712
income_ctr:C(educ)[T.4]    -2.7696      1.331     -2.081      0.037      -5.378      -0.161
income_ctr:C(educ)[T.5]    -3.4311      1.399     -2.453      0.014      -6.173      -0.689
===========================================================================================

Interpretations:

  • The intercept is -4.4793 before centering and -0.5332 after centering. Before centering, a voter with educ=1 and income=0 has a \(logit^{-1}(-4.4793) \approx 0.01\) probability of voting yes. After centering, a voter with educ=1 and average income has a \(logit^{-1}(-0.5332) \approx 0.37\) probability of voting yes.

  • In the centered model, the coefficient for C(educ)[T.2] is 0.8864. At average income, an individual with educ=2 has a \(logit^{-1}(-0.5332 + 0.8864) \approx 0.59\) probability of voting yes, an increase of ~0.22 over the educ=1 baseline of 0.37.

round(invlogit(-4.4793), 4)
0.0112
round(invlogit(-0.5332), 4)
0.3698
round(invlogit(-0.5332 + 0.8864), 4)
0.5874

To compare an individual with centered income -0.5 (0.5 below the mean, i.e. $5k) and education = 3 against an individual with the same income and education = 5, plug the centered coefficients into the inverse logit.

Then repeat with centered income = 1.

Verify in the plots below.

income = -0.5
# educ=3: intercept + educ=3 main effect + (income slope + educ=3 interaction) * income
print(round(invlogit(-0.5332 + 0.8695 + (0.7932 - 1.8763) * income), 4))
# educ=5: intercept + educ=5 main effect + (income slope + educ=5 interaction) * income
print(round(invlogit(-0.5332 + 1.7848 + (0.7932 - 3.4311) * income), 4))
0.7064
0.9289
income = 1
print(round(invlogit(-0.5332 + 0.8695 + (0.7932 - 1.8763) * income), 4))
print(round(invlogit(-0.5332 + 1.7848 + (0.7932 - 3.4311) * income), 4))
0.3215
0.2
df["pred"] = res_log.predict()
def plot_scatter(df, educ):
    sub = df.query(f"educ=={educ}")
    g = sns.scatterplot(x="income_ctr", y="vote", data=sub, alpha=0.2)
    g = sns.lineplot(x="income_ctr", y="pred", data=sub, label=educ)
    return g

for educ in range(1,6):
    g1 = plot_scatter(df,educ)
g1.legend()
(Figure: vote vs. centered income with fitted probability curves, one per education level.)