Estimation#
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
import seaborn as sns
import patsy
sns.set(rc={'figure.figsize':(11.7,8.27)})
n = 100
np.random.seed(0)
df = pd.DataFrame({
    "x1": np.random.normal(10, 2, n),
    "e": np.random.normal(0, 2, n)
})
y = 3 + 8*df["x1"] + df["e"]
OLS / WLS#
OLS and WLS solutions can be directly estimated using linear algebra.
OLS#
Given an OLS model:

\[ y = X\beta + \varepsilon \]

If \(X'X\) is invertible (the columns of \(X\) are linearly independent), the least squares solution \(\hat{\beta}\) is given by:

\[ \hat{\beta} = (X'X)^{-1}X'y \]

There are many ways to show this; one is by minimizing the sum of squared errors \((y - X\beta)'(y - X\beta)\), e.g.:
https://web.stanford.edu/~mrosenfe/soc_meth_proj3/matrix_OLS_NYU_notes.pdf
Differentiating with respect to \(\beta\) and setting the result to zero gives the normal equations:

\[ X'X\hat{\beta} = X'y \]

Then multiply both sides by \((X'X)^{-1}\):

\[ \hat{\beta} = (X'X)^{-1}X'y \]
Least squares solution:#
X = np.array(sm.add_constant(df[["x1"]], prepend=True))
# Least squares solution: returns (coefficients, residual sum of squares, rank, singular values)
np.linalg.lstsq(X, y, rcond=None)
(array([2.00332183, 8.11469843]),
array([422.86874168]),
2,
array([103.64946358, 1.94479008]))
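The same estimate can be computed explicitly from the normal equations. A minimal sketch of the closed form \(\hat{\beta} = (X'X)^{-1}X'y\), solving the linear system rather than forming the inverse:

# Closed-form solution via the normal equations: X'X beta = X'y
np.linalg.solve(X.T @ X, X.T @ np.asarray(y))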
# check
mod = sm.OLS(y, X)
mod.fit().summary()
| Dep. Variable: | y | R-squared: | 0.984 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.984 |
| Method: | Least Squares | F-statistic: | 6201. |
| Date: | Thu, 01 Jun 2023 | Prob (F-statistic): | 2.07e-90 |
| Time: | 12:45:47 | Log-Likelihood: | -213.99 |
| No. Observations: | 100 | AIC: | 432.0 |
| Df Residuals: | 98 | BIC: | 437.2 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.0033 | 1.063 | 1.884 | 0.063 | -0.107 | 4.113 |
| x1 | 8.1147 | 0.103 | 78.745 | 0.000 | 7.910 | 8.319 |

| Omnibus: | 5.184 | Durbin-Watson: | 1.995 |
|---|---|---|---|
| Prob(Omnibus): | 0.075 | Jarque-Bera (JB): | 3.000 |
| Skew: | 0.210 | Prob(JB): | 0.223 |
| Kurtosis: | 2.262 | Cond. No. | 53.3 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Maximum Likelihood solution:#
PDF of the Normal distribution:

\[ f(y_i \mid \mu_i, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \mu_i)^2}{2\sigma^2}\right), \qquad \mu_i = x_i'\beta \]

The likelihood is the product of the individual densities of the data points:

\[ L(\beta, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - x_i'\beta)^2}{2\sigma^2}\right) \]

Then, after some algebra, the log-likelihood is:

\[ \ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - x_i'\beta)^2 \]
import scipy
from scipy.stats import norm
def log_lik(par_vec, y, X):
    # If the standard deviation parameter is negative, return a large value:
    if par_vec[-1] < 0:
        return 1e8
    # Likelihood of each observation: Normal density at y with mean X @ beta and sd sigma
    lik = norm.pdf(y,
                   loc=X.dot(par_vec[0:-1]),
                   scale=par_vec[-1])
    # If all likelihood values are zero, return a large value
    if all(v == 0 for v in lik):
        return 1e8
    # Drop zero likelihoods (log(0) = -Inf) and return the negative log-likelihood
    return -sum(np.log(lik[np.nonzero(lik)]))
res = scipy.optimize.minimize(fun=log_lik, x0=[0,0,10], method = 'BFGS', args=(y,X))
res
message: Optimization terminated successfully.
success: True
status: 0
fun: 213.98843539738618
x: [ 2.003e+00 8.115e+00 2.056e+00]
nit: 19
jac: [ 5.722e-06 9.537e-06 3.815e-06]
hess_inv: [[ 6.975e-01 -6.601e-02 -1.099e-02]
[-6.601e-02 6.650e-03 9.695e-04]
[-1.099e-02 9.695e-04 2.290e-02]]
nfev: 124
njev: 31
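As a quick check, the MLE intercept and slope match the least squares solution above, and the MLE of \(\sigma\) equals \(\sqrt{SSR/n}\), the (biased) maximum likelihood variance estimate. A small sketch:

# MLE coefficients vs. the least squares solution (differences should be ~0)
print(np.round(res.x[0:2] - np.linalg.lstsq(X, y, rcond=None)[0], 3))
# MLE of sigma vs. sqrt(SSR / n)
ssr = np.sum((np.asarray(y) - X @ res.x[0:2]) ** 2)
print(np.round(res.x[-1] - np.sqrt(ssr / len(y)), 3))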
WLS#
WLS relaxes the homoscedasticity assumption by using weights to model a non-constant error variance-covariance matrix. The weights are typically assumed or estimated from the data. Observations with larger weights count for more, so the fitted regression line is pulled closer to those points.

Suppose an OLS-style objective in which each squared error is scaled by a known constant \(w_i\):

\[ \hat{\beta}_{WLS} = \arg\min_{\beta} \sum_{i=1}^{n} w_i (y_i - x_i'\beta)^2 \]

Let \(W\) be the diagonal matrix containing these weights, \(W = \mathrm{diag}(w_1, \ldots, w_n)\).

The weighted least squares estimate is:

\[ \hat{\beta}_{WLS} = (X'WX)^{-1}X'Wy \]

This follows by minimizing the (weighted) sum of squared errors, just like OLS: differentiating \((y - X\beta)'W(y - X\beta)\) with respect to \(\beta\) and setting the result to zero gives the weighted normal equations \(X'WX\hat{\beta} = X'Wy\). A small worked sketch follows.
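A minimal worked sketch on the simulated data above, using purely illustrative weights \(w_i = 1/x_{1i}\) (an arbitrary choice for demonstration, not estimated from a variance model), checked against statsmodels:

# WLS via linear algebra: solve the weighted normal equations X'WX beta = X'Wy
w = 1 / df["x1"].to_numpy()   # illustrative weights (assumption for this example)
W = np.diag(w)
print(np.linalg.solve(X.T @ W @ X, X.T @ W @ np.asarray(y)))

# check: statsmodels weights are inversely proportional to the error variance
print(sm.WLS(y, X, weights=w).fit().params)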
GLM: Generalized Linear Models#
Applied Example: Logistic Regression#
Likelihood, with \(p_i = \frac{1}{1 + e^{-x_i'\beta}}\):

\[ L(\beta) = \prod_{i=1}^{n} p_i^{\,y_i} (1 - p_i)^{1 - y_i} \]

Log-likelihood:

\[ \ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right] = \sum_{i=1}^{n} \left[ y_i x_i'\beta - \log\!\left(1 + e^{x_i'\beta}\right) \right] \]

Derivative of the log-likelihood with respect to \(\beta\):

\[ \frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^{n} x_i (y_i - p_i) = X'(y - p) \]

The goal is to set this derivative to zero and solve for \(\beta\), but there is no closed-form solution, so we approximate the solution numerically using a variety of methods (a direct numerical maximization is sketched after the simulated data below).
def inverse_logit(x):
    return 1 / (1 + np.exp(-x))
n = 500
np.random.seed(0)
df = pd.DataFrame({
    "x1": np.random.normal(1, 1, n),
    "x2": np.random.normal(-2, 1, n),
})
df["eta"] = -3 + 2*df["x1"] - 1.5*df["x2"]
df["probs"] = inverse_logit(df["eta"])
df["y"] = np.random.binomial(size=n, n=1, p=df["probs"])
Optimizers#
Newton’s Method for Numerical Optimization (Newton-Raphson):#
The aim is to find where \(t(x) = 0\). The slope of \(t\) at a value \(x_n\) is given by the first-order (tangent line) approximation:

\[ t'(x_n) \approx \frac{t(x_n) - t(x_{n+1})}{x_n - x_{n+1}} \]

Set \(t(x_{n+1}) = 0\) and re-arrange:

\[ x_{n+1} = x_n - \frac{t(x_n)}{t'(x_n)} \]

Generalized to higher dimensions (here applied to the log-likelihood, so that a stationary point is found):

\[ x^{n+1} = x^n - \nabla^2 t(x^n)^{-1}\, \nabla t(x^n) \]

where \(\nabla^2 t(x^n)^{-1}\) is the inverse of the Hessian matrix of \(t\) at \(x^n\) and \(\nabla t(x^n)\) is the gradient.
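For intuition, a minimal one-dimensional sketch of the update \(x_{n+1} = x_n - t(x_n)/t'(x_n)\), applied to \(t(x) = x^2 - 2\) (whose positive root is \(\sqrt{2}\)); the helper `newton_1d` is illustrative, not from any library:

def newton_1d(t, t_prime, x0, tol=1e-10, max_iter=100):
    x = x0
    for _ in range(max_iter):
        step = t(x) / t_prime(x)   # Newton step t(x_n) / t'(x_n)
        x = x - step
        if abs(step) < tol:
            break
    return x

newton_1d(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)  # converges to ~1.41421356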
IRLS#
For logistic regression, each Newton step is equivalent to a weighted least squares regression of a "working response" on \(X\), which is why the procedure is called iteratively reweighted least squares; a sketch of that form appears after the comparison below.

see: https://ocw.mit.edu/courses/mathematics/18-650-statistics-for-applications-fall-2016/lecture-slides/MIT18_650F16_GLM.pdf

also: https://ee227c.github.io/code/lecture24.html
# Newton's optimization:
from numpy.linalg import inv

constant = np.repeat(1, n)
X = np.vstack((constant, df["x1"], df["x2"])).T
y = df["y"]

def newton_raphson(y, X):
    beta = np.array([0, 0, 0])
    delta = np.array([1, 1, 1])
    i = 1
    while delta.max() > 0.0000001:
        i += 1
        # Fitted probabilities, gradient and Hessian of the negative log-likelihood
        probs = np.exp(X @ beta) / (1 + np.exp(X @ beta))
        W = np.diag(probs * (1 - probs))
        grad = -X.T @ (y - probs)
        hess = (X.T @ W) @ X
        # Newton step: use the explicit inverse of the Hessian
        delta = inv(hess) @ grad
        beta = beta - delta
        if i == 100000:
            break
    print(beta)
    return beta
def irls(y, X):
    beta = np.array([0, 0, 0])
    delta = np.array([1, 1, 1])
    i = 1
    while delta.max() > 0.0000001:
        i += 1
        probs = np.exp(X @ beta) / (1 + np.exp(X @ beta))
        W = np.diag(probs * (1 - probs))
        grad = -X.T @ (y - probs)
        hess = (X.T @ W) @ X
        # Newton step: solve the linear system by least squares instead of inverting the Hessian
        delta = np.linalg.lstsq(hess, grad, rcond=None)[0]
        beta = beta - delta
        if i == 100000:
            break
    print(beta)
    return beta
np.round(irls(y, X)-newton_raphson(y, X),0)
[-2.27658967 1.60631135 -1.17185893]
[-2.27658967 1.60631135 -1.17185893]
array([0., 0., 0.])
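The same update in the classic IRLS form: regress the working response \(z = X\beta + W^{-1}(y - p)\) on \(X\) with weights \(W\), i.e. \(\beta^{new} = (X'WX)^{-1}X'Wz\), which is algebraically identical to the Newton step above. A minimal sketch (the name `irls_working_response` is illustrative):

def irls_working_response(y, X, tol=1e-7, max_iter=100):
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        p = 1 / (1 + np.exp(-eta))
        w = p * (1 - p)                     # diagonal of W
        z = eta + (y - p) / w               # working response
        # Solve the weighted normal equations X'WX beta = X'Wz
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

np.round(irls_working_response(np.asarray(y), X), 4)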
# check against statsmodels
f = 'y ~ x1 + x2'
y, X = patsy.dmatrices(f, df, return_type='dataframe')
mod = sm.Logit(y, X)
mod.fit().summary()
Optimization terminated successfully.
Current function value: 0.381869
Iterations 7
| Dep. Variable: | y | No. Observations: | 500 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 497 |
| Method: | MLE | Df Model: | 2 |
| Date: | Thu, 01 Jun 2023 | Pseudo R-squ.: | 0.3336 |
| Time: | 12:45:48 | Log-Likelihood: | -190.93 |
| converged: | True | LL-Null: | -286.53 |
| Covariance Type: | nonrobust | LLR p-value: | 3.048e-42 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -2.2766 | 0.351 | -6.495 | 0.000 | -2.964 | -1.590 |
| x1 | 1.6063 | 0.179 | 8.950 | 0.000 | 1.255 | 1.958 |
| x2 | -1.1719 | 0.156 | -7.521 | 0.000 | -1.477 | -0.866 |