Demo#

A poisson regression is used to model counts and/or rates. A count is zero or any positive integer, such as the number of events, the number of students, etc.

A poisson regression (log-linear model) uses a log link function to relate the linear component (\(\eta\)) to the \((0,\infty)\) count variable, e.g. (0,1,2…)

\[E(Y|X) = log(\mu) = X\beta\]

To model rates, we include an “offset” term:

\[E(Y|X) = log(\mu) = log(n) + X\beta\]

Then Y follows a Poisson distribution parameterized by \(\mu\):

\[Y \sim Poisson(\mu)\]
import numpy as np
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
import patsy
sns.set(rc={'figure.figsize':(11.7,8.27)})

Generate Fake Data#

We generate two independent variables x1 and x2. We let \(\eta\) be the linear component, \(X\beta\).

To get \(\mu\), we exponentiate both sides of \(log(\mu) = \eta\), so we have \(\mu = exp(\eta)\)

Then y (# of students flagged) is poisson distributed, with its single parameter: \(\mu\).

To demonstrate poisson modeling with rates, we create an offset term: n_enrolled, which is simply n_student_flagged plus a random positive integer.

n = 500
np.random.seed(0)
df = pd.DataFrame({
    "x1":np.random.exponential(2,n),
    "x2":np.random.normal(2,1,n),
})

eta = np.exp(2 + 0.25*df["x1"] + 0.25*df["x2"])
df["n_students_flagged"] = np.random.poisson(lam=eta)
df["n_students_flagged"].value_counts()
df["n_enrolled"] = np.round(df["n_students_flagged"],0) + np.random.binomial(200,0.8,n)
sns.kdeplot(df["n_students_flagged"])
<Axes: xlabel='n_students_flagged', ylabel='Density'>
../../../../../_images/c7f3bcbc03b765b123b894ca9a4cfcb4d78ddc900818c36eae7ddec7c893f589.png
g = sns.kdeplot(df["n_students_flagged"]/df["n_enrolled"])
g.set_xlabel("rate (flagged/enrolled)")
Text(0.5, 0, 'rate (flagged/enrolled)')
../../../../../_images/ad078bbfd03839c7628ade7004a8df665bfba05c216050633dc6bf1568b5ff7d.png
f = 'n_students_flagged ~ x1 + x2'
y, X = patsy.dmatrices(f, df, return_type='dataframe')

Statsmodels#

mod = sm.GLM(y, X, family=sm.families.Poisson())
res2 = mod.fit()
print(res2.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:     n_students_flagged   No. Observations:                  500
Model:                            GLM   Df Residuals:                      497
Model Family:                 Poisson   Df Model:                            2
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -1486.9
Date:                Thu, 01 Jun 2023   Deviance:                       571.92
Time:                        12:45:56   Pearson chi2:                     563.
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.0270      0.025     79.737      0.000       1.977       2.077
x1             0.2469      0.003     86.086      0.000       0.241       0.253
x2             0.2447      0.010     25.097      0.000       0.226       0.264
==============================================================================

using n_enrolled as offset to model rate (flagged out of enrolled) instead of count (# enrolled)

including an offset is equivalent to including the offset as another pedictor, except with the coefficient fixed to 1.

\[log(\mu) = log(n) + X\beta\]
\[log(\frac{\mu}{n}) = X\beta\]
mod = sm.GLM(y, X, offset=df["n_enrolled"], family=sm.families.Poisson())
res2 = mod.fit()
print(res2.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:     n_students_flagged   No. Observations:                  500
Model:                            GLM   Df Residuals:                      497
Model Family:                 Poisson   Df Model:                            2
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -2.5985e+35
Date:                Thu, 01 Jun 2023   Deviance:                   5.1970e+35
Time:                        12:45:56   Pearson chi2:                 2.68e+37
No. Iterations:                   100   Pseudo R-squ. (CS):                nan
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -99.7842   2.52e-18  -3.97e+19      0.000     -99.784     -99.784
x1           -24.1414   2.02e-17  -1.19e+18      0.000     -24.141     -24.141
x2            -2.5238   4.29e-18  -5.88e+17      0.000      -2.524      -2.524
==============================================================================
/home/chansoo/projects/statsbook/.venv/lib/python3.8/site-packages/statsmodels/genmod/families/family.py:132: RuntimeWarning: divide by zero encountered in divide
  return 1. / (self.link.deriv(mu)**2 * self.variance(mu))