Interpretation

Interpretation#

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
import seaborn as sns
import patsy
sns.set(rc={'figure.figsize':(11.7,8.27)})

Fake Data Example: President Support#

Suppose we have the data below, where income (in units of 10,000) is the independent variable. The dependent variable is binary and indicates whether the respondent supports the president.

def inverse_logit(x):
    return(1 / (1 + np.exp(-x)))
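As a quick sanity check on the helper above, the sketch below reimplements it with the standard library `math` module instead of NumPy (a substitution made here so the snippet runs on its own). The `logit` function is not part of the original notebook; it is a hypothetical helper added to show the inverse mapping.

```python
import math

def inverse_logit(x):
    # Map a log-odds value (any real number) to a probability in (0, 1).
    return 1 / (1 + math.exp(-x))

def logit(p):
    # Hypothetical helper (not in the original): map a probability back to log-odds.
    return math.log(p / (1 - p))

# Log-odds of 0 correspond to a probability of one half.
p_zero = inverse_logit(0)
# Applying logit after inverse_logit should recover the input log-odds.
round_trip = logit(inverse_logit(-1.4))
print(p_zero, round_trip)
```

A log-odds of zero maps to a probability of 0.5, which is the point where the logistic curve is steepest.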

n = 500
np.random.seed(0)
df = pd.DataFrame({
    "income":np.random.normal(4,1,n)
})

df["Z"] = -1.4 + 0.33*df["income"] 
df["probs"] = inverse_logit(df["Z"])
df["pres_support"] = np.random.binomial(size=n, n=1, p=df["probs"])
df["pres_support"].value_counts()

sns.regplot(x=df["income"], y=df["pres_support"], lowess=True)

<Axes: xlabel='income', ylabel='pres_support'>

../../../../../_images/4ffd55ec45371cfe257542fc56204d91752833a9b46888dbe001e122b10a15ed.png

f = 'pres_support ~ income'
y, X = patsy.dmatrices(f, df, return_type='dataframe')
mod = sm.Logit(y, X)
res = mod.fit()
res.summary()

Optimization terminated successfully.
         Current function value: 0.666070
         Iterations 5

Logit Regression Results
Dep. Variable:	pres_support	No. Observations:	500
Model:	Logit	Df Residuals:	498
Method:	MLE	Df Model:	1
Date:	Thu, 01 Jun 2023	Pseudo R-squ.:	0.03896
Time:	12:45:49	Log-Likelihood:	-333.03
converged:	True	LL-Null:	-346.54
Covariance Type:	nonrobust	LLR p-value:	2.029e-07

	coef	std err	z	P>\|z\|	[0.025	0.975]
Intercept	-1.9498	0.396	-4.925	0.000	-2.726	-1.174
income	0.4843	0.097	5.002	0.000	0.295	0.674
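Because the data are simulated, the coefficients used to generate them (-1.4 for the intercept and 0.33 for income) are known and can be compared with the fit. The sketch below hard-codes the 95% interval bounds from the table above rather than refitting the model; the variable names are illustrative and not part of the original notebook.

```python
# True values used in the data-generating code (df["Z"] = -1.4 + 0.33 * income).
true_params = {"Intercept": -1.4, "income": 0.33}

# 95% confidence bounds copied from the [0.025, 0.975] columns of the summary table.
conf_int = {"Intercept": (-2.726, -1.174), "income": (0.295, 0.674)}

# Check whether each interval contains the value that generated the data.
covered = {
    name: low <= true_params[name] <= high
    for name, (low, high) in conf_int.items()
}
print(covered)
```

The point estimates differ from the generating values, which is expected with a single sample of 500 observations.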

Predicted probabilities from the fitted model:

sns.lineplot(x=df["income"], y=res.predict())

<Axes: xlabel='income'>

../../../../../_images/39796285ee951268ace8778c23d02655b6a1b809d94e50274d32b0d33c7266f1.png

Getting Probabilities#

As in linear regression, the intercept is interpreted with all other covariates set to zero. Here, that means income = 0, and applying the inverse logit to the intercept gives the predicted probability of support at that income.

print(f"P(pres support) = {inverse_logit(res.params['Intercept'])}")

P(pres support) = 0.12457544469619863

For the average income level:

print(f"mean: {np.mean(df['income'])}")
print(f"P(pres support) = {inverse_logit(res.params['Intercept'] + res.params['income'] * np.mean(df['income']))}")

mean: 3.9746455606675664
P(pres support) = 0.4937388607411525
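Both calculations follow the same pattern: build the linear predictor, then apply the inverse logit. A minimal sketch of that pattern is given here, with the coefficients hard-coded from the summary table (rounded to four decimals, so results agree with the outputs above only to about three decimal places). `predicted_prob` is a hypothetical helper name, and `math` is used in place of NumPy so the snippet is self-contained.

```python
import math

# Coefficients copied from the summary table (rounded to four decimals).
INTERCEPT = -1.9498
BETA_INCOME = 0.4843

def predicted_prob(income):
    # Linear predictor on the log-odds scale, then inverse logit.
    z = INTERCEPT + BETA_INCOME * income
    return 1 / (1 + math.exp(-z))

p_at_zero = predicted_prob(0.0)                 # intercept-only case
p_at_mean = predicted_prob(3.9746455606675664)  # sample mean income printed above
print(f"income = 0:    {p_at_zero:.4f}")
print(f"income = mean: {p_at_mean:.4f}")
```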

Interpreting the Coefficient on Income#

Since the logistic curve is non-linear, the change in probability associated with a one unit increase in income depends on the starting income level.

x = inverse_logit(res.params["Intercept"] + res.params["income"] * 2) - inverse_logit(res.params["Intercept"] + res.params["income"] * 1)
print(f"A one unit increase from 1 to 2 is associated with an increase in P(pres support) of {x}")

A one unit increase from 1 to 2 is associated with an increase in P(pres support) of 0.08501590471118814

x = inverse_logit(res.params["Intercept"] + res.params["income"] * 4.5) - inverse_logit(res.params["Intercept"] + res.params["income"] * 3.5)
print(f"A one unit increase from 3.5 to 4.5 is associated with an increase in P(pres support) of {x}")

A one unit increase from 3.5 to 4.5 is associated with an increase in P(pres support) of 0.12047149989311856

x = inverse_logit(res.params["Intercept"] + res.params["income"] * 6) - inverse_logit(res.params["Intercept"] + res.params["income"] * 5.5)
print(f"A half unit increase from 5.5 to 6 is associated with an increase in P(pres support) of {x}")

A half unit increase from 5.5 to 6 is associated with an increase in P(pres support) of 0.051069298867425617
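To see the dependence on the starting point more systematically, the sketch below tabulates the change in probability for a full one unit increase from several starting incomes. It assumes the rounded coefficients from the summary table; `predicted_prob` and the list of starting values are illustrative choices, not part of the original notebook.

```python
import math

# Coefficients copied from the summary table (rounded to four decimals).
INTERCEPT = -1.9498
BETA_INCOME = 0.4843

def predicted_prob(income):
    # Inverse logit of the linear predictor.
    return 1 / (1 + math.exp(-(INTERCEPT + BETA_INCOME * income)))

# Change in probability for a one unit increase from each starting income.
starts = [1, 2, 3, 3.5, 4, 5, 5.5]
diffs = {s: predicted_prob(s + 1) - predicted_prob(s) for s in starts}

for s, d in diffs.items():
    print(f"{s} -> {s + 1}: change in P(pres support) = {d:.4f}")
```

Among these starting points, the change is largest for the step from 3.5 to 4.5, which straddles the income at which the predicted probability is close to 0.5.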

Divide by 4 Rule#

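The steepest change in probability occurs at P(pres support) = 0.5, where the linear predictor equals zero. The derivative of the logistic curve with respect to income is maximized at this point and attains the value \(\frac{\beta e^0}{(1+e^0)^2} = \frac{\beta}{4}\), where \(\beta\) is the coefficient on income.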
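The value \(\frac{\beta}{4}\) follows from differentiating the logistic curve. As a short derivation, write \(\alpha\) for the intercept (a symbol introduced here, not used in the original) and \(p(x)\) for the predicted probability at income \(x\):

\[
p(x) = \frac{1}{1 + e^{-(\alpha + \beta x)}}, \qquad
\frac{dp}{dx} = \frac{\beta e^{\alpha + \beta x}}{(1 + e^{\alpha + \beta x})^2} = \beta \, p(x) \, (1 - p(x)).
\]

The product \(p(x)(1 - p(x))\) is largest when \(p(x) = 0.5\), that is when \(\alpha + \beta x = 0\), and there it equals \(\frac{1}{4}\).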

So the difference in Pr(pres support) from a one unit change in income is at most \(\frac{\beta}{4}\).

For convenience, we can divide logistic regression coefficients by 4 to get an upper bound on the predictive difference corresponding to a one unit change in income.

res.params["income"] / 4

0.12106435437336137

A one unit change in income is associated with AT MOST a 12.1 percentage point increase in P(pres support).
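The bound can be checked numerically. The sketch below scans a grid of starting incomes and compares the largest one unit difference with the coefficient divided by 4. It assumes the rounded coefficients from the summary table; the grid range (0 to 8) and step (0.01) are arbitrary choices made for illustration.

```python
import math

# Coefficients copied from the summary table (rounded to four decimals).
INTERCEPT = -1.9498
BETA_INCOME = 0.4843

def predicted_prob(income):
    # Inverse logit of the linear predictor.
    return 1 / (1 + math.exp(-(INTERCEPT + BETA_INCOME * income)))

# Divide-by-4 upper bound on the change in probability per unit of income.
bound = BETA_INCOME / 4

# Largest observed one unit difference over starting incomes 0.00, 0.01, ..., 8.00.
grid = [i / 100 for i in range(0, 801)]
max_diff = max(predicted_prob(x + 1) - predicted_prob(x) for x in grid)

print(f"beta / 4          = {bound:.4f}")
print(f"largest one-unit difference on the grid = {max_diff:.4f}")
```

The largest difference on the grid stays just below the bound, since a full unit step averages the slope over an interval rather than taking it at the single steepest point.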

Odds Ratios#

Not recommended, since odds ratios are confusing to interpret.

odds = np.exp(res.params)
odds

Intercept    0.142303
income       1.622969
dtype: float64

print(f"A unit change in income is associated with a multiplicative change of {odds['income']} in the odds")

A unit change in income is associated with a multiplicative change of 1.6229693725926753 in the odds
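Unlike the change in probability, the multiplicative change in the odds is the same at every income level. A minimal sketch of that property, assuming the rounded coefficients from the summary table (`predicted_prob` and `odds` are hypothetical helper names):

```python
import math

# Coefficients copied from the summary table (rounded to four decimals).
INTERCEPT = -1.9498
BETA_INCOME = 0.4843

def predicted_prob(income):
    # Inverse logit of the linear predictor.
    return 1 / (1 + math.exp(-(INTERCEPT + BETA_INCOME * income)))

def odds(income):
    # Odds are p / (1 - p) for the predicted probability p.
    p = predicted_prob(income)
    return p / (1 - p)

# Ratio of odds for a one unit increase, at a low and a high starting income.
ratio_low = odds(2) / odds(1)
ratio_high = odds(6) / odds(5)
print(ratio_low, ratio_high, math.exp(BETA_INCOME))
```

Both ratios equal the exponentiated income coefficient, even though the corresponding changes in probability differ.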