Demo#

Logistic regression is used when the dependent variable is a binary outcome.

It uses a logit link function to relate the continuous \((-\infty, \infty)\) linear component \(\eta\) to the \((0,1)\) probability \(P(y_i=1)\). A Bernoulli trial then converts these probabilities into binary data, 0 or 1.

\[\text{logit}(P(y_i=1)) = X_i\beta\]
\[y_i \sim \text{Binomial}(1, P_i)\]
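
For reference, the logit and its inverse (the logistic function, implemented below as inverse_logit) are:

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\]
\[\text{logit}^{-1}(\eta) = \frac{1}{1 + e^{-\eta}}\]
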
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
import seaborn as sns
import patsy
sns.set(rc={'figure.figsize':(11.7,8.27)})

Generate Fake Data#

We generate two independent variables, x1 and x2. We let \(\eta\) be the linear component, \(X\beta\); then inverse_logit(\(\eta\)), the inverse of the link function (the logistic function), converts \(\eta\) to a probability.

\[P(y_i=1) = \text{logit}^{-1}(X_i\beta)\]

Then y is the outcome of a Bernoulli trial (a binomial with one trial) with these probabilities.

def inverse_logit(x):
    return 1 / (1 + np.exp(-x))

n = 500
np.random.seed(0)
df = pd.DataFrame({
    "x1":np.random.normal(1,1,n),
    "x2":np.random.normal(-2,1,n),
})

df["eta"] = - 3 + 2*df["x1"] - 1.5*df["x2"]
df["probs"] = inverse_logit(df["eta"])
df["y"] = np.random.binomial(size=n, n=1, p=df["probs"])
df["y"].value_counts()
1    370
0    130
Name: y, dtype: int64
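
As a quick sanity check, the mean of the simulated probabilities should be close to the observed share of 1s (here roughly 370/500):

# the average true probability should roughly match the fraction of y == 1
print(df["probs"].mean(), df["y"].mean())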

The plot below shows the inverse_logit relation between \(\eta\) and the probabilities.

g = sns.lineplot(x=df["eta"], y=inverse_logit(df["eta"]))
g.set_xlabel("eta")
g.set_ylabel("prob")
[Figure: inverse logit curve, eta (x-axis) vs. prob (y-axis)]

The plot below shows the binary classification problem, with \(y \in \{0, 1\}\).

sns.scatterplot(x=df["x1"], y=df["x2"], hue=df["y"])
[Figure: scatter of x1 vs. x2, colored by y]

f = 'y ~ x1 + x2'
y, X = patsy.dmatrices(f, df, return_type='dataframe')
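
One thing to keep in mind: patsy's formula adds an Intercept column to X by default, which is why fit_intercept=False is used in the sklearn section below.

# patsy includes an explicit intercept column in the design matrix
print(X.columns.tolist())  # expected: ['Intercept', 'x1', 'x2']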

Statsmodels#

There are a couple of ways to run logistic regression using statsmodels: sm.Logit, or sm.GLM with the Binomial family. In both cases the model uses the logit link (the GLM summary below reports this explicitly as Link Function: Logit).

See the “interpretation” page for more info about interpreting regression results.

mod = sm.Logit(y, X)
res = mod.fit()
print(res.summary())
Optimization terminated successfully.
         Current function value: 0.381869
         Iterations 7
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                  500
Model:                          Logit   Df Residuals:                      497
Method:                           MLE   Df Model:                            2
Date:                Thu, 01 Jun 2023   Pseudo R-squ.:                  0.3336
Time:                        12:45:51   Log-Likelihood:                -190.93
converged:                       True   LL-Null:                       -286.53
Covariance Type:            nonrobust   LLR p-value:                 3.048e-42
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -2.2766      0.351     -6.495      0.000      -2.964      -1.590
x1             1.6063      0.179      8.950      0.000       1.255       1.958
x2            -1.1719      0.156     -7.521      0.000      -1.477      -0.866
==============================================================================
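
The coefficients are on the log-odds scale, so exponentiating them gives odds ratios; a small illustrative step (see the “interpretation” page for details on reading these):

# convert log-odds coefficients to odds ratios
np.exp(res.params)
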
# alternatively, using GLM and specifying the Binomial family for logistic regression
mod = sm.GLM(y, X, family=sm.families.Binomial())
res2 = mod.fit()
print(res2.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                  500
Model:                            GLM   Df Residuals:                      497
Model Family:                Binomial   Df Model:                            2
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -190.93
Date:                Thu, 01 Jun 2023   Deviance:                       381.87
Time:                        12:45:51   Pearson chi2:                     447.
No. Iterations:                     6   Pseudo R-squ. (CS):             0.3178
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -2.2766      0.351     -6.495      0.000      -2.964      -1.590
x1             1.6063      0.179      8.950      0.000       1.255       1.958
x2            -1.1719      0.156     -7.521      0.000      -1.477      -0.866
==============================================================================
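
Since sm.Logit and sm.GLM with a Binomial family fit the same model, their estimates should agree to numerical precision. A quick check:

# the two statsmodels fits should produce (essentially) identical coefficients
np.allclose(res.params, res2.params)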

sklearn#

model = LogisticRegression(fit_intercept=False)
mdl = model.fit(X, np.ravel(y))
model.coef_
array([[-1.95377627,  1.48859512, -1.04496352]])

The default setting applies L2 regularization; turn it off by setting penalty=None (the older penalty='none' spelling is deprecated).

model = LogisticRegression(fit_intercept=False, penalty=None)
mdl = model.fit(X, np.ravel(y))
model.coef_
array([[-2.27659181,  1.60631138, -1.17185961]])
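
With the penalty turned off, sklearn's estimates should line up with the statsmodels MLE results above (a rough check, assuming the coefficient order follows the columns of X):

# compare unregularized sklearn coefficients to the statsmodels MLE
np.allclose(model.coef_.ravel(), res.params.values, atol=1e-3)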