Chapter 17: Practical statistical modeling
Contents
Chapter 17: Practical statistical modeling#
Note: The results may differ slightly between Python and R implementations, due to difference in random seeds as well as slight differences in data filtering between the two notebooks.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm, t, binom, scoreatpercentile, pearsonr, chi2
from scipy.spatial.distance import cdist
import pingouin as pg
import matplotlib
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import KFold
from sklearn.preprocessing import scale
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
from scipy.cluster import hierarchy
from scipy.special import logit
from sklearn.decomposition import PCA
import yellowbrick as yb
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM
%matplotlib inline
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
pandas2ri.activate()
base = importr('lme4')
base = importr('tidyverse')
base = importr('broom')
%load_ext rpy2.ipython
rng = np.random.default_rng(1234567)
Figure 17.1 Outliers and influential observations#
npts = 24
outlier_df = pd.DataFrame({'Y': rng.normal(size=npts)})
outlier_df['X'] = outlier_df.Y * 0.4 + rng.normal(size=npts)
def get_cooksD(X, Y):
X = sm.add_constant(X)
model = sm.OLS(Y, X).fit()
influence = model.get_influence()
return(influence.cooks_distance[0])
#sns.set_context("talk", font_scale=1.2)
plt.figure(figsize=(8,8))
g = sns.JointGrid(data=outlier_df,
x="X",
y="Y")
g.plot_joint(sns.regplot, ci=None, scatter_kws={'s': 20 + get_cooksD(outlier_df.X, outlier_df.Y)*500})
g.plot_marginals(sns.boxplot)
<seaborn.axisgrid.JointGrid at 0x7f308864fd90>
<Figure size 800x800 with 0 Axes>
Figure 17.2: Collider bias#
# https://observablehq.com/@herbps10/collider-bias
npts = 1000
conf_df = pd.DataFrame({'height': rng.normal(175, 20, npts),
'speed': rng.normal(50, 10, npts)})
conf_df['zheight'] = scale(conf_df.height)
conf_df['zspeed'] = scale(conf_df.speed)
conf_df['speedXheight'] = conf_df.zheight * conf_df.zspeed
conf_df['NBA'] = (conf_df.speedXheight > scoreatpercentile(conf_df.speedXheight, 90)) & \
(conf_df.speed > conf_df.speed.mean()) & (conf_df.height > conf_df.height.mean())
sns.scatterplot(data=conf_df, x='height', y='speed', hue='NBA')
<Axes: xlabel='height', ylabel='speed'>
print(smf.ols(formula='speed ~ 1 + height', data=conf_df).fit().summary())
OLS Regression Results
==============================================================================
Dep. Variable: speed R-squared: 0.001
Model: OLS Adj. R-squared: 0.000
Method: Least Squares F-statistic: 1.122
Date: Tue, 21 Feb 2023 Prob (F-statistic): 0.290
Time: 21:53:20 Log-Likelihood: -3693.9
No. Observations: 1000 AIC: 7392.
Df Residuals: 998 BIC: 7402.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 52.8286 2.667 19.811 0.000 47.596 58.062
height -0.0160 0.015 -1.059 0.290 -0.046 0.014
==============================================================================
Omnibus: 6.119 Durbin-Watson: 1.994
Prob(Omnibus): 0.047 Jarque-Bera (JB): 6.187
Skew: 0.191 Prob(JB): 0.0453
Kurtosis: 2.954 Cond. No. 1.53e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.53e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Model with NBA player status (collider)
print(smf.ols(formula='speed ~ 1 + height + NBA', data=conf_df).fit().summary())
OLS Regression Results
==============================================================================
Dep. Variable: speed R-squared: 0.121
Model: OLS Adj. R-squared: 0.119
Method: Least Squares F-statistic: 68.71
Date: Tue, 21 Feb 2023 Prob (F-statistic): 1.11e-28
Time: 21:53:20 Log-Likelihood: -3629.9
No. Observations: 1000 AIC: 7266.
Df Residuals: 997 BIC: 7281.
Df Model: 2
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept 61.3954 2.608 23.541 0.000 56.277 66.513
NBA[T.True] 17.6543 1.513 11.668 0.000 14.685 20.623
height -0.0692 0.015 -4.640 0.000 -0.098 -0.040
==============================================================================
Omnibus: 3.634 Durbin-Watson: 1.988
Prob(Omnibus): 0.163 Jarque-Bera (JB): 3.491
Skew: 0.127 Prob(JB): 0.175
Kurtosis: 3.137 Cond. No. 1.62e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.62e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
same regression only on the NBA players, where we see a strong negative relationship between speed and height:
print(smf.ols(formula='speed ~ 1 + height', data=conf_df.query('NBA == True')).fit().summary())
OLS Regression Results
==============================================================================
Dep. Variable: speed R-squared: 0.116
Model: OLS Adj. R-squared: 0.094
Method: Least Squares F-statistic: 5.261
Date: Tue, 21 Feb 2023 Prob (F-statistic): 0.0271
Time: 21:53:20 Log-Likelihood: -128.55
No. Observations: 42 AIC: 261.1
Df Residuals: 40 BIC: 264.6
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 90.6515 11.265 8.047 0.000 67.885 113.418
height -0.1258 0.055 -2.294 0.027 -0.237 -0.015
==============================================================================
Omnibus: 4.981 Durbin-Watson: 2.303
Prob(Omnibus): 0.083 Jarque-Bera (JB): 3.737
Skew: 0.678 Prob(JB): 0.154
Kurtosis: 3.544 Cond. No. 2.83e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.83e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Example 1: Self-regulation and arrest#
behavdata = pd.read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-figures-data/main/Eisenberg/meaningful_variables.csv')
behavdata.set_index('subcode', inplace=True)
demoghealthdata = pd.read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-figures-data/main/Eisenberg/demographic_health.csv')
demoghealthdata.set_index('subcode', inplace=True)
sexdict = {1:'Female', 0:'Male'}
demoghealthdata['Sex'] = [sexdict[i] for i in demoghealthdata['Sex']]
alldata = demoghealthdata.join(behavdata, how='inner')
rename_dict = {
'upps_impulsivity_survey': 'UPPS',
'sensation_seeking_survey': 'SSS',
'dickman_survey': 'Dickman',
'bis11_survey': 'BIS11'
}
impulsivity_vars = []
for column in alldata.columns:
for k, v in rename_dict.items():
if k in column:
alldata.rename(columns={column: column.replace(k, v)}, inplace=True)
impulsivity_vars.append(column.replace(k, v))
Table 17.1#
Frequency distribution of number of reported arrests in the Eisenberg et al. dataset
arrestdata = alldata.dropna(subset=['ArrestedChargedLifeCount'])
arrestdata.loc['everArrested'] = arrestdata.ArrestedChargedLifeCount > 0
pd.DataFrame({
'number': arrestdata.ArrestedChargedLifeCount.value_counts(),
'proportion': arrestdata.ArrestedChargedLifeCount.value_counts()/arrestdata.shape[0]})
number | proportion | |
---|---|---|
0.0 | 409 | 0.783525 |
1.0 | 58 | 0.111111 |
2.0 | 33 | 0.063218 |
3.0 | 9 | 0.017241 |
5.0 | 5 | 0.009579 |
4.0 | 5 | 0.009579 |
6.0 | 2 | 0.003831 |
Figure 17.3#
impulsivity_data = alldata.dropna(subset=['ArrestedChargedLifeCount',
'TrafficTicketsLastYearCount',
'TrafficAccidentsLifeCount', 'Age','Sex'])
impulsivity_data['everArrested'] = impulsivity_data.ArrestedChargedLifeCount > 0
cc = impulsivity_data[impulsivity_vars].corr()
sns.clustermap(cc, method='average', annot=True)
/tmp/ipykernel_2300/2323405881.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
impulsivity_data['everArrested'] = impulsivity_data.ArrestedChargedLifeCount > 0
<seaborn.matrix.ClusterGrid at 0x7f3143067dc0>
Figure 17.4#
cls = AgglomerativeClustering(n_clusters=2)
cls.fit(cc)
# 0 is sensation seeking, 1 is impulsivity
varnames = ['mean_senseek', 'mean_impulsivity']
newvars = {}
for cluster in range(2):
vars = [impulsivity_vars[i] for i in range(len(impulsivity_vars)) if cls.labels_[i] == cluster]
newvars[varnames[cluster]] = impulsivity_data.loc[:, vars].mean(axis=1, skipna=True)
for k, v, in newvars.items():
impulsivity_data[k] = scale(v)
g = sns.PairGrid(impulsivity_data[varnames])
def corrfunc(x, y, hue=None, ax=None, **kws):
"""Plot the correlation coefficient in the top left hand corner of a plot."""
r, _ = pearsonr(x, y)
ax = ax or plt.gca()
ax.annotate(f'r = {r:.2f}', xy=(.2, .5), size=24, xycoords=ax.transAxes)
g.map_upper(corrfunc)
g.map_lower(sns.regplot, color='black')
g.map_diag(sns.histplot, kde=True, legend=False)
plt.show()
/tmp/ipykernel_2300/221161768.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
impulsivity_data[k] = scale(v)
/tmp/ipykernel_2300/221161768.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
impulsivity_data[k] = scale(v)
Figure 17.5#
fig, ax = plt.subplots(1, 2, figsize=(12,6))
sns.boxplot(data=impulsivity_data, x='everArrested', y='mean_impulsivity', ax=ax[0])
sns.swarmplot(data=impulsivity_data, x='everArrested', y='mean_impulsivity', ax=ax[0], color='black', alpha=0.5)
sns.boxplot(data=impulsivity_data, x='everArrested', y='mean_senseek', ax=ax[1])
sns.swarmplot(data=impulsivity_data, x='everArrested', y='mean_senseek', ax=ax[1], color='black', alpha=0.5)
<Axes: xlabel='everArrested', ylabel='mean_senseek'>
t-test output for the mean impulsivity scores:
tt = pg.ttest(x=impulsivity_data.query('everArrested == True').mean_impulsivity,
y=impulsivity_data.query('everArrested == False').mean_impulsivity,
alternative='greater')
tt
T | dof | alternative | p-val | CI95% | cohen-d | BF10 | power | |
---|---|---|---|---|---|---|---|---|
T-test | 2.910521 | 160.928447 | greater | 0.00206 | [0.14, inf] | 0.334129 | 13.485 | 0.931108 |
t-test output for the sensation seeking variable:
tt = pg.ttest(x=impulsivity_data.query('everArrested == True').mean_senseek,
y=impulsivity_data.query('everArrested == False').mean_senseek,
alternative='greater')
tt
T | dof | alternative | p-val | CI95% | cohen-d | BF10 | power | |
---|---|---|---|---|---|---|---|---|
T-test | 4.769141 | 182.383008 | greater | 0.000002 | [0.32, inf] | 0.496897 | 1.11e+04 | 0.998687 |
Figure 17.6: Logistic regression#
rng = np.random.default_rng(1234567)
n = 100
b0 = 100
b1 = 5
logreg_df = pd.DataFrame({'x': rng.normal(0, 10, n)})
logreg_df['y'] = b0 + b1 * logreg_df.x + rng.normal(0, 3, n)
logreg_df['y_bin'] = (rng.uniform(size=n) < 1 /(1 + np.exp( -(logreg_df.y - logreg_df.y.mean()) / logreg_df.y.std()))).astype('int')
logreg_df['y_bin_jitter'] = logreg_df.y_bin + rng.uniform(-.01, .01, size=n)
lm = smf.ols(formula='y_bin ~ 1 + x', data=logreg_df)
lm_result = lm.fit()
logreg_df['lm_pred'] = lm_result.predict()
logistic = smf.logit(formula='y_bin ~ 1 + x', data=logreg_df)
logistic_result = logistic.fit()
logreg_df['logistic_pred'] = logistic_result.predict()
fig, ax = plt.subplots(1, 2, figsize=(12,6))
sns.scatterplot(data=logreg_df, x='x', y='y_bin_jitter', ax=ax[0])
sns.lineplot(data=logreg_df, x='x', y='lm_pred', ax=ax[0])
ax[0].set_title('A: Linear regression')
sns.scatterplot(data=logreg_df, x='x', y='y_bin_jitter', ax=ax[1])
sns.lineplot(data=logreg_df, x='x', y='logistic_pred', ax=ax[1])
ax[0].set_title('A: Logistic regression')
Optimization terminated successfully.
Current function value: 0.533560
Iterations 6
Text(0.5, 1.0, 'A: Logistic regression')
Logistic regression output:
print(logistic_result.summary())
Logit Regression Results
==============================================================================
Dep. Variable: y_bin No. Observations: 100
Model: Logit Df Residuals: 98
Method: MLE Df Model: 1
Date: Tue, 21 Feb 2023 Pseudo R-squ.: 0.2300
Time: 21:53:26 Log-Likelihood: -53.356
converged: True LL-Null: -69.295
Covariance Type: nonrobust LLR p-value: 1.642e-08
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.0447 0.238 0.188 0.851 -0.422 0.512
x 0.1369 0.030 4.604 0.000 0.079 0.195
==============================================================================
Logistic regression output for impulsivity dataset
impulsivity_data.loc[:, 'AgeSquared'] = (impulsivity_data.Age - impulsivity_data.Age.mean()) ** 2
impulsivity_data.loc[:, 'everArrestedInt'] = impulsivity_data.everArrested.astype('int')
# the following is for R
impulsivity_data.loc[:, 'SexInt'] = [1 if i == 'Male' else 0 for i in impulsivity_data.Sex]
impulsivity_data = impulsivity_data[['everArrestedInt', 'mean_impulsivity', 'mean_senseek', 'Age', 'AgeSquared', 'Sex', 'SexInt']]
glm_imp_arrest = smf.glm(
formula='everArrestedInt ~ 1 + mean_impulsivity + mean_senseek + Age + Sex',
data=impulsivity_data,
family=sm.families.Binomial())
logistic_result = glm_imp_arrest.fit()
print(logistic_result.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: everArrestedInt No. Observations: 521
Model: GLM Df Residuals: 516
Model Family: Binomial Df Model: 4
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: -250.58
Date: Tue, 21 Feb 2023 Deviance: 501.15
Time: 21:53:26 Pearson chi2: 532.
No. Iterations: 5 Pseudo R-squ. (CS): 0.07599
Covariance Type: nonrobust
====================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------
Intercept -3.3826 0.558 -6.066 0.000 -4.476 -2.290
Sex[T.Male] 0.6510 0.238 2.736 0.006 0.185 1.117
mean_impulsivity 0.2859 0.113 2.528 0.011 0.064 0.508
mean_senseek 0.3778 0.118 3.214 0.001 0.147 0.608
Age 0.0485 0.014 3.372 0.001 0.020 0.077
====================================================================================
/tmp/ipykernel_2300/1360466332.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
impulsivity_data.loc[:, 'AgeSquared'] = (impulsivity_data.Age - impulsivity_data.Age.mean()) ** 2
/tmp/ipykernel_2300/1360466332.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
impulsivity_data.loc[:, 'everArrestedInt'] = impulsivity_data.everArrested.astype('int')
/tmp/ipykernel_2300/1360466332.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
impulsivity_data.loc[:, 'SexInt'] = [1 if i == 'Male' else 0 for i in impulsivity_data.Sex]
Compare with R results
%%R -i impulsivity_data
glm_imp_arrest = glm('everArrestedInt ~ 1 + mean_impulsivity + mean_senseek + Age + SexInt',
data=impulsivity_data,
family='binomial')
print(summary(glm_imp_arrest))
Call:
glm(formula = "everArrestedInt ~ 1 + mean_impulsivity + mean_senseek + Age + SexInt",
family = "binomial", data = impulsivity_data)
Deviance Residuals:
Min
1Q
Median
3Q
Max
-1.3386
-0.7119
-0.5722
-0.3575
2.4726
Coefficients:
Estimate
Std. Error
z value
Pr(>|z|)
(Intercept)
-3.38260
0.55767
-6.066
1.31e-09
***
mean_impulsivity
0.28595
0.11309
2.528
0.011459
*
mean_senseek
0.37779
0.11754
3.214
0.001308
**
Age
0.04849
0.01438
3.372
0.000746
***
SexInt
0.65095
0.23793
2.736
0.006222
**
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for
binomial
family taken to be
1
)
Null deviance: 542.33 on 520 degrees of freedom
Residual deviance: 501.15 on 516 degrees of freedom
AIC:
511.15
Number of Fisher Scoring iterations:
4
Test for overdispersion#
There is no built-in test for overdispersion in Python. However, we can perform a simple chi-squared test for overdispersion, by dividing the Pearson chi-squared statistic by the residual degrees of freedom, which gives the same result as the R overdispersion test using the chi-squared.
chisq_stat = logistic_result.pearson_chi2 / logistic_result.df_resid
print('dispersion = ', chisq_stat)
pval = 1 - chi2.cdf(logistic_result.pearson_chi2, logistic_result.df_resid)
print('pvalue (alternative ="greater") = ', pval)
dispersion = 1.0305993193299552
pvalue (alternative ="greater") = 0.3059811971051235
Compare to R result
%%R
library('DHARMa')
print(testDispersion(glm_imp_arrest, alternative='greater', type='PearsonChisq'))
R[write to console]: This is DHARMa 0.4.6. For overview type '?DHARMa'. For recent changes, type news(package = 'DHARMa')
Parametric dispersion test via mean Pearson-chisq statistic
data:
glm_imp_arrest
dispersion = 1.0306, df = 516, p-value = 0.306
alternative hypothesis:
greater
Figure 17.7#
# the statsmodels lowess smoother returned all zeros, so we are using this
# from https://xavierbourretsicotte.github.io/loess.html
from math import ceil
import numpy as np
from scipy import linalg
def lowess_ag(x, y, f=2. / 3., iter=3):
"""lowess(x, y, f=2./3., iter=3) -> yest
Lowess smoother: Robust locally weighted regression.
The lowess function fits a nonparametric regression curve to a scatterplot.
The arrays x and y contain an equal number of elements; each pair
(x[i], y[i]) defines a data point in the scatterplot. The function returns
the estimated (smooth) values of y.
The smoothing span is given by f. A larger value for f will result in a
smoother curve. The number of robustifying iterations is given by iter. The
function will run faster with a smaller number of iterations.
"""
n = len(x)
r = int(ceil(f * n))
h = [np.sort(np.abs(x - x[i]))[r] for i in range(n)]
w = np.clip(np.abs((x[:, None] - x[None, :]) / h), 0.0, 1.0)
w = (1 - w ** 3) ** 3
yest = np.zeros(n)
delta = np.ones(n)
for iteration in range(iter):
for i in range(n):
weights = delta * w[:, i]
b = np.array([np.sum(weights * y), np.sum(weights * y * x)])
A = np.array([[np.sum(weights), np.sum(weights * x)],
[np.sum(weights * x), np.sum(weights * x * x)]])
beta = linalg.solve(A, b)
yest[i] = beta[0] + beta[1] * x[i]
residuals = y - yest
s = np.median(np.abs(residuals))
delta = np.clip(residuals / (6.0 * s), -1, 1)
delta = (1 - delta ** 2) ** 2
return yest
def cat_logit_est(iv, df, max_bins=18):
ncats = np.min((max_bins, np.round(iv.unique().shape[0] / 2).astype('int')))
iv_cut = pd.cut(iv, ncats)
glm_imp_arrest = smf.glm(
formula='everArrestedInt ~ 1 + iv_cut', data=df,
family=sm.families.Binomial())
logistic_result = glm_imp_arrest.fit()
# rint(logistic_result.summary())
return(logistic_result.predict(linear=True))
def loess_logit_est(iv, df, frac=0.5):
lowess = lowess_ag(iv.values, df.everArrestedInt.values, f=frac, )
lowess[lowess<=0] = 0 #negatives can happen at edges
logit_of_fit = logit(lowess)
return(logit_of_fit)
# using loess as in R example
impulsivity_data['observed_logit_Age'] = loess_logit_est(impulsivity_data.Age, impulsivity_data)
impulsivity_data['observed_logit_senseek'] = loess_logit_est(impulsivity_data.mean_senseek, impulsivity_data)
impulsivity_data['observed_logit_impulsivity'] = loess_logit_est(impulsivity_data.mean_impulsivity, impulsivity_data)
fig, ax = plt.subplots(2, 2, figsize=(8, 8))
sns.scatterplot(data=impulsivity_data, x='Age', y='observed_logit_Age', ax=ax[0][0],
s=25 + logistic_result.get_influence().cooks_distance[0]*5000)
sns.regplot(data=impulsivity_data, x='Age', y='observed_logit_Age', scatter=False, lowess=True,ax=ax[0][0])
sns.scatterplot(data=impulsivity_data, x='mean_senseek', y='observed_logit_senseek', ax=ax[0][1],
s=25 + logistic_result.get_influence().cooks_distance[0]*5000)
sns.regplot(data=impulsivity_data, x='mean_senseek', y='observed_logit_senseek', scatter=False, lowess=True,ax=ax[0][1])
sns.scatterplot(data=impulsivity_data, x='mean_impulsivity', y='observed_logit_impulsivity', ax=ax[1][0],
s=25 + logistic_result.get_influence().cooks_distance[0]*5000)
sns.regplot(data=impulsivity_data, x='mean_impulsivity', y='observed_logit_impulsivity', scatter=False, lowess=True,ax=ax[1][0])
ax[1][1].set_visible(False)
/tmp/ipykernel_2300/3594777561.py:37: RuntimeWarning: divide by zero encountered in divide
delta = np.clip(residuals / (6.0 * s), -1, 1)
/tmp/ipykernel_2300/3594777561.py:37: RuntimeWarning: invalid value encountered in divide
delta = np.clip(residuals / (6.0 * s), -1, 1)
model including \(Age^2\) alongside Age
glm_imp_arrest = smf.glm(
formula='everArrestedInt ~ 1 + mean_impulsivity + mean_senseek + Age + AgeSquared + Sex',
data=impulsivity_data,
family=sm.families.Binomial())
logistic_result2 = glm_imp_arrest.fit()
print(logistic_result2.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: everArrestedInt No. Observations: 521
Model: GLM Df Residuals: 515
Model Family: Binomial Df Model: 5
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: -247.72
Date: Tue, 21 Feb 2023 Deviance: 495.45
Time: 21:53:27 Pearson chi2: 536.
No. Iterations: 5 Pseudo R-squ. (CS): 0.08605
Covariance Type: nonrobust
====================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------
Intercept -3.9824 0.649 -6.135 0.000 -5.255 -2.710
Sex[T.Male] 0.6864 0.238 2.881 0.004 0.219 1.153
mean_impulsivity 0.3067 0.114 2.691 0.007 0.083 0.530
mean_senseek 0.3810 0.118 3.230 0.001 0.150 0.612
Age 0.0726 0.019 3.896 0.000 0.036 0.109
AgeSquared -0.0042 0.002 -2.269 0.023 -0.008 -0.001
====================================================================================
Table 17.2#
params = pd.DataFrame(logistic_result2.params, columns=['estimate'])
params = params.join(logistic_result2.conf_int())
params = np.exp(params.rename(columns={0: '2.5%', 1: '97.5%'}).drop('Intercept'))
params
estimate | 2.5% | 97.5% | |
---|---|---|---|
Sex[T.Male] | 1.986540 | 1.245420 | 3.168685 |
mean_impulsivity | 1.358978 | 1.086920 | 1.699134 |
mean_senseek | 1.463696 | 1.161603 | 1.844354 |
Age | 1.075341 | 1.036751 | 1.115367 |
AgeSquared | 0.995775 | 0.992140 | 0.999424 |
Bayes factor using approximation via BIC
glm_imp_arrest_baseline = smf.glm(
formula='everArrestedInt ~ 1 + Age + AgeSquared + Sex',
data=impulsivity_data,
family=sm.families.Binomial())
logistic_result_baseline = glm_imp_arrest_baseline.fit()
BF_01 = np.exp((logistic_result2.bic - logistic_result_baseline.bic)/2) # BICs to Bayes factor
1/BF_01
/opt/conda/lib/python3.10/site-packages/statsmodels/genmod/generalized_linear_model.py:1799: FutureWarning: The bic value is computed using the deviance formula. After 0.13 this will change to the log-likelihood based formula. This change has no impact on the relative rank of models compared using BIC. You can directly access the log-likelihood version using the `bic_llf` attribute. You can suppress this message by calling statsmodels.genmod.generalized_linear_model.SET_USE_BIC_LLF with True to get the LLF-based version now or False to retainthe deviance version.
warnings.warn(
241.79055403580912
Example 2: Mask-wearing and face-touching#
Figure 17.8#
maskdata = pd.read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-figures-data/main/mask_wearing/DataVersion2/MaskFaceTouchOSF.csv')
maskdata = maskdata.query('face_touching != "Missing"')
maskdata.face_touching = np.where(maskdata.face_touching == 'Yes', 1, 0)
maskdata.mask_front_touching = np.where(maskdata.mask_front_touching == 'Yes', 1, 0)
maskdata.mask_strap_touching = np.where(maskdata.mask_strap_touching == 'Yes', 1, 0)
maskdata['face_touching'] = np.clip(maskdata.face_touching + maskdata.mask_front_touching + maskdata.mask_strap_touching, None, 1)
maskdata['mask_wearing'] = np.where(maskdata.mask_YesNo == 'Yes', 1, 0)
maskdata['age_std'] = scale(maskdata.age)
maskdata['segment_crowding_std'] = scale(maskdata.segment_crowding)
maskdata = maskdata.query('change == 0 and non_covering != "Yes"')
maskdata_study1 = maskdata.query('study == 1')
maskdata_study2 = maskdata.query('study == 2')
# there is no built-in stacked bar plot function in seaborn
fig, ax = plt.subplots(1, 2, figsize=(8,4))
study1_sums = maskdata_study1.groupby('mask_wearing')['face_touching'].value_counts()
study2_sums = maskdata_study2.groupby('mask_wearing')['face_touching'].value_counts()
labels = ['No', 'Yes']
ax[0].bar(labels, study1_sums.loc[:, 1], label='face touching')
ax[0].bar(labels, study1_sums.loc[:, 0], bottom=study1_sums.loc[:, 1], label='no face touching')
ax[1].bar(labels, study2_sums.loc[:, 1], label='face touching')
ax[1].bar(labels, study2_sums.loc[:, 0], bottom=study2_sums.loc[:, 1], label='no face touching')
ax[1].legend()
for i in range(2):
ax[i].set_ylabel('Count')
ax[i].set_title(f'Study {i + 1}')
plt.show()
simple chi-squared test
pg.chi2_independence(maskdata_study1, 'mask_wearing', 'face_touching')[2].loc[0, :]
test pearson
lambda 1.0
chi2 0.029801
dof 1.0
pval 0.862943
cramer 0.008821
power 0.053421
Name: 0, dtype: object
analogous test for the data from the second study
pg.chi2_independence(maskdata_study2, 'mask_wearing', 'face_touching')[2].loc[0, :]
test pearson
lambda 1.0
chi2 1.085429
dof 1.0
pval 0.297486
cramer 0.050776
power 0.180619
Name: 0, dtype: object
Logistic regression model including duration and study along with mask wearing
glm_mask1 = smf.glm(
formula='face_touching ~ mask_wearing + duration_of_observation + study',
data=maskdata,
family=sm.families.Binomial())
logistic_result_mask1 = glm_mask1.fit()
print(logistic_result_mask1.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: face_touching No. Observations: 804
Model: GLM Df Residuals: 800
Model Family: Binomial Df Model: 3
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: -342.50
Date: Tue, 21 Feb 2023 Deviance: 685.00
Time: 21:53:28 Pearson chi2: 806.
No. Iterations: 5 Pseudo R-squ. (CS): 0.04025
Covariance Type: nonrobust
===========================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------
Intercept -3.3426 0.392 -8.538 0.000 -4.110 -2.575
mask_wearing 0.1919 0.198 0.970 0.332 -0.196 0.579
duration_of_observation 0.0290 0.006 4.689 0.000 0.017 0.041
study 0.5717 0.200 2.852 0.004 0.179 0.965
===========================================================================================
# clean up maskdata for use within R
maskdata = maskdata[['face_touching', 'mask_wearing', 'duration_of_observation', 'study', 'unique_situation']]
%%R -i maskdata
glm_result_combined = glm(face_touching ~ mask_wearing + duration_of_observation + study,
family=binomial, data=maskdata)
print(summary(glm_result_combined))
Call:
glm(formula = face_touching ~ mask_wearing + duration_of_observation +
study, family = binomial, data = maskdata)
Deviance Residuals:
Min
1Q
Median
3Q
Max
-1.7465
-0.6011
-0.5299
-0.4732
2.1918
Coefficients:
Estimate
Std. Error
z value
Pr(>|z|)
(Intercept)
-3.342553
0.391507
-8.538
< 2e-16
***
mask_wearing
0.191901
0.197742
0.970
0.33182
duration_of_observation
0.028991
0.006182
4.689
2.74e-06
***
study
0.571684
0.200441
2.852
0.00434
**
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for
binomial
family taken to be
1
)
Null deviance: 718.03 on 803 degrees of freedom
Residual deviance: 685.00 on 800 degrees of freedom
AIC:
693
Number of Fisher Scoring iterations:
4
Mixed effects model#
There is no exact replica of glmer()
within Python. The statsmodels package includes a function for estimating the parameters of a mixed effect logistic regression model using Bayesian estimation, which we show here for comparison.
random = {"a": '0 + C(unique_situation)', "b": '0 + C(unique_situation) * mask_wearing'}
glm_result = BinomialBayesMixedGLM.from_formula('face_touching ~ mask_wearing + duration_of_observation', random, maskdata)
fitted_glm_result = glm_result.fit_vb()
print(fitted_glm_result.summary())
Binomial Mixed GLM Results
======================================================================
Type Post. Mean Post. SD SD SD (LB) SD (UB)
----------------------------------------------------------------------
Intercept M -2.5007 0.0980
mask_wearing M 0.1138 0.1473
duration_of_observation M 0.0293 0.0032
a V -1.0124 0.0624 0.363 0.321 0.412
b V -1.5560 0.0441 0.211 0.193 0.230
======================================================================
Parameter types are mean structure (M) and variance structure (V)
Variance parameters are modeled as log standard deviations
/opt/conda/lib/python3.10/site-packages/statsmodels/genmod/bayes_mixed_glm.py:793: UserWarning: VB fitting did not converge
warnings.warn("VB fitting did not converge")
This function doesn’t give p-values, but we can compute the probability that each of the parameters is greater than zero, using the mean and standard deviation estimates.
p_mask_wearing = 1- norm.cdf(0, loc=-fitted_glm_result.fe_mean[1], scale=fitted_glm_result.fe_sd[1])
p_duration = 1 - norm.cdf(0, loc=-fitted_glm_result.fe_mean[2], scale=fitted_glm_result.fe_sd[2])
print('P(mask_wearing <= 0| data) =', p_mask_wearing)
print('P(duration <= 0| data) =', p_duration)
P(mask_wearing <= 0| data) = 0.21993044746773704
P(duration <= 0| data) = 0.0
Compute in R for comparison
%%R -i maskdata
library(lme4)
library(tidyverse)
glmer_result = glmer(face_touching ~ mask_wearing + duration_of_observation + (1 + mask_wearing |unique_situation), data=maskdata, family=binomial)
glmer_result_baseline = update(glmer_result, formula = ~ . -mask_wearing) # Without mask effect term
maskdata = maskdata %>%
mutate(glmer_resid = residuals(glmer_result),
)
print(summary(glmer_result))
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation)
[
glmerMod
]
Family:
binomial
( logit )
Formula:
face_touching ~ mask_wearing + duration_of_observation + (1 +
mask_wearing | unique_situation)
Data:
maskdata
AIC
BIC
logLik
deviance
df.resid
704.2
732.4
-346.1
692.2
798
Scaled residuals:
Min
1Q
Median
3Q
Max
-1.6847
-0.4289
-0.3888
-0.3335
3.1095
Random effects:
Groups
Name
Variance
Std.Dev.
Corr
unique_situation
(Intercept)
0.1829
0.4277
mask_wearing
0.1599
0.3999
-0.40
Number of obs: 804, groups:
unique_situation, 129
Fixed effects:
Estimate
Std. Error
z value
Pr(>|z|)
(Intercept)
-2.518255
0.245319
-10.265
< 2e-16
***
mask_wearing
0.145705
0.253822
0.574
0.566
duration_of_observation
0.029691
0.006409
4.633
3.61e-06
***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
msk_wr
mask_wearng
-0.471
drtn_f_bsrv
-0.698
-0.011
test for overdispersion:
%%R
print(testDispersion(glmer_result, alternative='greater', type='PearsonChisq'))
Parametric dispersion test via mean Pearson-chisq statistic
data:
glmer_result
dispersion = 0.92427, df = 798, p-value = 0.9377
alternative hypothesis:
greater
Figure 17.9#
We do this in R since otherwise we would have to work with the glmer result in Python, which is difficult.
%%R
library(broom.mixed)
library(psych)
loess_logit_est2 = function(iv, span=1){
fit = loess(face_touching_int ~ iv, data=maskmodeldata, span=span)$fitted
fit[fit<=0] = 0 #negatives can happen at edges
logit_of_fit = logit(fit)
return(logit_of_fit)
}
maskmodeldata = augment(glmer_result) %>%
mutate(face_touching_int = as.integer(face_touching))
maskmodeldata = maskmodeldata %>%
mutate(observed_logit = loess_logit_est2(duration_of_observation))
ggplot(maskmodeldata, aes(duration_of_observation, observed_logit, size=.cooksd))+
geom_jitter() +
geom_smooth(method = "loess", se=FALSE)
R[write to console]:
Attaching package: ‘psych’
R[write to console]: The following objects are masked from ‘package:ggplot2’:
%+%, alpha
`geom_smooth()` using formula = 'y ~ x'
Table 17.3#
here too we use R code
%%R
effect_table = tidy(glmer_result,conf.int=TRUE,exponentiate=TRUE,effects="fixed") %>%
dplyr::select(term, estimate, conf.low, conf.high)
names(effect_table) = c('', 'Odds ratio', '2.5 %', '97.5 %')
effect_table[2:3,]
# A tibble: 2 × 4
`` `Odds ratio` `2.5 %` `97.5 %`
<chr> <dbl> <dbl> <dbl>
1 mask_wearing 1.16 0.703 1.90
2 duration_of_observation 1.03 1.02 1.04
Example 3: Asthma and air pollution#
3. Prepare the data for analysis#
pmdata = pd.read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-figures-data/main/CDC_PM2.5_2014/Daily_Census_Tract-Level_PM2.5_mean_2014.csv').rename(
columns={'ctfips': 'TractFIPS', 'ds_pm_pred': 'pm25_mean'})
asthmadata = pd.read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-figures-data/main/500cities_disease/acsdata_with_censusdata.csv')
asthmadata.set_index('TractFIPS', inplace=True)
asthmadata.drop([i for i in asthmadata.columns if '_Crude95CI' in i], axis=1, inplace=True)
pm_asthma_data = pmdata.join(asthmadata, how='inner', on='TractFIPS', rsuffix='_r').dropna().rename(
columns={'CASTHMA_CrudePrev': 'asthma_prev', 'PlaceFIPS': 'city'})
pm_asthma_data.MedianIncome = pm_asthma_data.MedianIncome / 1000
pm_asthma_data.Population2010 = pm_asthma_data.Population2010 / 1000
pm_asthma_data.reset_index(inplace=True)
%%R
# asthma data
# https://chronicdata.cdc.gov/500-Cities-Places/500-Cities-Census-Tract-level-Data-GIS-Friendly-Fo/5mtz-k78d (for 2014)
#
# PM2.5 data:
#
# filtered for 2014 and averaged ds_pm_pred using:
#
# https://data.cdc.gov/Environmental-Health-Toxicology/Daily-Census-Tract-Level-PM2-5-Concentrations-2011/fcqm-xrf4/data
pmdata = read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-figures-data/main/CDC_PM2.5_2014/Daily_Census_Tract-Level_PM2.5_mean_2014.csv') %>%
rename(TractFIPS=ctfips,
pm25_mean=ds_pm_pred)
asthmadata = read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-figures-data/main/500cities_disease/acsdata_with_censusdata.csv') %>%
dplyr::select(-ends_with('_Crude95CI'))
pm_asthma_data = inner_join(asthmadata, pmdata, by='TractFIPS') %>%
drop_na() %>%
rename(asthma_prev=CASTHMA_CrudePrev,
city=PlaceFIPS) %>%
mutate(MedianIncome = MedianIncome/1000,
Population2010 = Population2010/1000)
# mutate(log_asthma_prev = log(asthma_prev))
dim(pm_asthma_data)
Rows: 72283 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): ctfips, ds_pm_pred
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 26892 Columns: 66
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (33): StateAbbr, PlaceName, Place_TractID, ACCESS2_Crude95CI, ARTHRITIS_...
dbl (33): PlaceFIPS, TractFIPS, Population2010, ACCESS2_CrudePrev, ARTHRITIS...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1]
26681
39
Figure 17.10#
plt.figure(figsize=(12,12))
g = sns.PairGrid(pm_asthma_data[['pm25_mean', 'asthma_prev', 'MedianIncome', 'MedianAge', 'Population2010']])
def corrfunc(x, y, hue=None, ax=None, **kws):
"""Plot the correlation coefficient in the top left hand corner of a plot."""
r, _ = pearsonr(x, y)
ax = ax or plt.gca()
ax.annotate(f'r = {r:.2f}', xy=(.2, .5), size=24, xycoords=ax.transAxes)
g.map_upper(corrfunc)
g.map_lower(sns.regplot, color='black', scatter_kws={'s': 10, 'alpha':.5})
g.map_diag(sns.histplot, kde=True, legend=False)
plt.show()
<Figure size 1200x1200 with 0 Axes>
standard linear regression model
lm_asthma_pm = smf.ols('asthma_prev ~ pm25_mean + MedianIncome + MedianAge + Population2010', data=pm_asthma_data).fit()
print(lm_asthma_pm.summary())
OLS Regression Results
==============================================================================
Dep. Variable: asthma_prev R-squared: 0.137
Model: OLS Adj. R-squared: 0.137
Method: Least Squares F-statistic: 1061.
Date: Tue, 21 Feb 2023 Prob (F-statistic): 0.00
Time: 21:54:14 Log-Likelihood: -54500.
No. Observations: 26681 AIC: 1.090e+05
Df Residuals: 26676 BIC: 1.091e+05
Df Model: 4
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 12.5516 0.150 83.796 0.000 12.258 12.845
pm25_mean 0.0328 0.006 5.294 0.000 0.021 0.045
MedianIncome -0.1096 0.002 -53.969 0.000 -0.114 -0.106
MedianAge 0.0047 0.004 1.202 0.229 -0.003 0.012
Population2010 -0.1136 0.006 -19.105 0.000 -0.125 -0.102
==============================================================================
Omnibus: 2192.456 Durbin-Watson: 1.967
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2804.084
Skew: 0.741 Prob(JB): 0.00
Kurtosis: 3.569 Cond. No. 592.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Figure 17.11#
model_data = pd.DataFrame({'resid': lm_asthma_pm.resid}, index=pm_asthma_data.index)
fig = sm.qqplot(model_data.resid, line="q")
Figure 17.12#
# first fit linear mixed effects model
lme = smf.mixedlm("asthma_prev ~ pm25_mean + MedianAge + MedianIncome + Population2010",
pm_asthma_data, groups=pm_asthma_data["city"],
re_formula="~pm25_mean").fit(method=["lbfgs"])
print(lme.summary())
Mixed Linear Model Regression Results
=================================================================
Model: MixedLM Dependent Variable: asthma_prev
No. Observations: 26681 Method: REML
No. Groups: 494 Scale: 1.9514
Min. group size: 8 Log-Likelihood: -48031.9272
Max. group size: 2110 Converged: Yes
Mean group size: 54.0
-----------------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
-----------------------------------------------------------------
Intercept 6.977 1.146 6.087 0.000 4.730 9.223
pm25_mean 0.360 0.074 4.856 0.000 0.215 0.506
MedianAge 0.050 0.028 1.754 0.079 -0.006 0.105
MedianIncome -0.103 0.014 -7.125 0.000 -0.132 -0.075
Population2010 -0.027 0.005 -5.652 0.000 -0.037 -0.018
Group Var 158.916 15.957
Group x pm25_mean Cov -15.097 1.536
pm25_mean Var 1.458 0.150
=================================================================
model_data['city'] = pm_asthma_data.city
model_data['resid_lme'] = lme.resid
def boxplot_sorted(df, by, column, ax):
df2 = pd.DataFrame({col:vals[column] for col, vals in df.groupby(by)})
meds = df2.median().sort_values()
plot = df2[meds.index].boxplot(rot=90, ax=ax)
plot.get_xaxis().set_ticks([])
fig, ax = plt.subplots(2, 1, figsize=(12,6))
boxplot_sorted(model_data, 'city', 'resid', ax=ax[0])
boxplot_sorted(model_data, 'city', 'resid_lme', ax=ax[1])
R mixed effect model result for comparison
%%R
library(lme4)
lmer_asthma_pm = lmer(asthma_prev ~ pm25_mean + MedianAge + MedianIncome + Population2010 + (1 + pm25_mean|city),
data=pm_asthma_data,
control = lmerControl(
optimizer ='optimx', optCtrl=list(method='nlminb')))
print(summary(lmer_asthma_pm))
R[write to console]: Loading required namespace: optimx
Linear mixed model fit by REML ['lmerMod']
Formula:
asthma_prev ~ pm25_mean + MedianAge + MedianIncome + Population2010 +
(1 + pm25_mean | city)
Data:
pm_asthma_data
Control:
lmerControl(optimizer = "optimx", optCtrl = list(method = "nlminb"))
REML criterion at convergence:
96063.9
Scaled residuals:
Min
1Q
Median
3Q
Max
-3.3329
-0.6246
-0.0829
0.5462
5.4814
Random effects:
Groups
Name
Variance
Std.Dev.
Corr
city
(Intercept)
158.898
12.605
pm25_mean
1.457
1.207
-0.99
Residual
1.951
1.397
Number of obs: 26681, groups:
city, 494
Fixed effects:
Estimate
Std. Error
t value
(Intercept)
6.976828
1.122358
6.216
pm25_mean
0.360445
0.071826
5.018
MedianAge
0.049659
0.028236
1.759
MedianIncome
-0.103198
0.014298
-7.217
Population2010
-0.027124
0.004796
-5.655
Correlation of Fixed Effects:
(Intr)
pm25_m
MednAg
MdnInc
pm25_mean
-0.646
MedianAge
-0.698
0.002
MedianIncom
0.098
-0.005
-0.506
Popultn2010
-0.017
-0.006
0.010
-0.016
optimizer (optimx) convergence code: 0 (OK)
Model failed to converge with max|grad| = 0.00298875 (tol = 0.002, component 1)
Table 17.4#
lme_ci = pd.DataFrame(lme.params, columns=['estimate'])
lme_ci = lme_ci.join(lme.conf_int().rename(columns={0: '2.5%', 1:'97.5%'})).iloc[1:5,:]
lme_ci
estimate | 2.5% | 97.5% | |
---|---|---|---|
pm25_mean | 0.360453 | 0.214966 | 0.505941 |
MedianAge | 0.049660 | -0.005829 | 0.105149 |
MedianIncome | -0.103198 | -0.131586 | -0.074811 |
Population2010 | -0.027124 | -0.036531 | -0.017718 |
Example 4: Response of plants to nitrogen fertilizers and soil tilling#
3. Prepare and visualize the data#
fert_data_all = pd.read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-figures-data/main/fertilizer/Fertsyntraitsall_Jan27_2008.txt', delimiter='\t')
fert_data_all = fert_data_all.query('Rawabundmetric == "Grams biomass" and Site == "KBS"')
fert_data_all['plotID_common'] = fert_data_all.plotID = fert_data_all.Fert
fert_data_all['log_rawabund'] = scale(np.log(fert_data_all.Rawabund))
fert_data_all.shape
(2615, 27)
Figure 17.13#
sns.histplot(fert_data_all.Rawabund, bins=100)
plt.xlabel('Raw crop abundance')
Text(0.5, 0, 'Raw crop abundance')
Simple linear model output:
lm_result_fert = smf.ols('Rawabund ~ Fert*Experiment + Year', data=fert_data_all).fit()
print(lm_result_fert.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Rawabund R-squared: 0.030
Model: OLS Adj. R-squared: 0.029
Method: Least Squares F-statistic: 20.52
Date: Tue, 21 Feb 2023 Prob (F-statistic): 1.16e-16
Time: 21:54:43 Log-Likelihood: -16314.
No. Observations: 2615 AIC: 3.264e+04
Df Residuals: 2610 BIC: 3.267e+04
Df Model: 4
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 1154.5206 1685.230 0.685 0.493 -2150.002 4459.043
Experiment[T.Untilled] -5.3171 6.568 -0.810 0.418 -18.196 7.562
Fert 57.1985 7.608 7.518 0.000 42.280 72.117
Fert:Experiment[T.Untilled] -35.4946 9.933 -3.573 0.000 -54.972 -16.018
Year -0.5602 0.844 -0.664 0.507 -2.215 1.094
==============================================================================
Omnibus: 2428.192 Durbin-Watson: 1.041
Prob(Omnibus): 0.000 Jarque-Bera (JB): 84679.507
Skew: 4.496 Prob(JB): 0.00
Kurtosis: 29.388 Cond. No. 1.39e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.39e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Figure 17.14#
This figure involves fitting a linear mixed model with multiple random slopes. This is in theory possible using statsmodels, but it is not well documented. Instead, we will simply use the R bridge to run lmer on our data.
fert_data_clean = fert_data_all[['log_rawabund', 'Fert', 'Experiment', 'Year', 'plotID_common', 'Species_code']]
%%R -i fert_data_clean
lm.result_fert = lm(log_rawabund ~ Fert*Experiment + Year ,
data=fert_data_clean)
lmer.result_fert = lmer(log_rawabund ~ Fert*Experiment + Year + (1 + Fert|plotID_common) + (1 + Fert|Species_code) ,
data=fert_data_clean)
%%R
library(cowplot)
library(broom)
model.data = augment(lm.result_fert) %>%
mutate(plotID_common = fert_data_clean$plotID_common,
Species_code = fert_data_clean$Species_code)
p1 = ggplot(model.data, aes(sample=.resid)) + geom_qq() + geom_qq_line()
lmer.result_fert = lmer(log_rawabund ~ Fert*Experiment + Year + (1 + Fert|plotID_common) + (1 + Fert|Species_code),
data=fert_data_clean)
fert_data = fert_data_clean %>%
mutate(lmer_resid = resid(lmer.result_fert))
p2 = ggplot(fert_data, aes(sample=lmer_resid)) + geom_qq() + geom_qq_line()
plot_grid(p1, p2)
Figure 17.15#
%%R
model_means_sorted = model.data %>% group_by(Species_code) %>%
summarize_all(mean) %>%
arrange(.resid)
ggplot(model.data, aes(y=.resid, group=factor(Species_code, levels=model_means_sorted$Species_code))) + geom_boxplot()
Linear mixed effects model output:
%%R
print(summary(lmer.result_fert))
Linear mixed model fit by REML ['lmerMod']
Formula:
log_rawabund ~ Fert * Experiment + Year + (1 + Fert | plotID_common) +
(1 + Fert | Species_code)
Data:
fert_data_clean
REML criterion at convergence:
6476.4
Scaled residuals:
Min
1Q
Median
3Q
Max
-3.8583
-0.6309
0.1140
0.7059
2.7394
Random effects:
Groups
Name
Variance
Std.Dev.
Corr
Species_code
(Intercept)
0.32074
0.5663
Fert
0.07087
0.2662
0.01
plotID_common
(Intercept)
0.64447
0.8028
Fert
0.48111
0.6936
0.31
Residual
0.62516
0.7907
Number of obs: 2615, groups:
Species_code, 132; plotID_common, 2
Fixed effects:
Estimate
Std. Error
t value
(Intercept)
13.420304
11.377770
1.180
Fert
0.279636
1.454441
0.192
ExperimentUntilled
0.384860
0.057878
6.649
Year
-0.006994
0.005681
-1.231
Fert:ExperimentUntilled
-0.333301
0.081763
-4.076
Correlation of Fixed Effects:
(Intr)
Fert
ExprmU
Year
Fert
-0.039
ExprmntUntl
-0.083
0.016
Year
-0.997
0.000
0.080
Frt:ExprmnU
-0.008
-0.036
-0.522
0.010
optimizer (nloptwrap) convergence code: 0 (OK)
unable to evaluate scaled gradient
Model failed to converge: degenerate Hessian with 2 negative eigenvalues
Effect sizes
%%R
library(emmeans)
emm = emmeans(lmer.result_fert, pairwise ~ Fert*Experiment)
print(contrast(emm, 'tukey'))
contrast
estimate
SE
df
t.ratio
p.value
Fert0 Tilled - Fert1 Tilled
-0.2796
1.4545
15559535
-0.192
0.9975
Fert0 Tilled - Fert0 Untilled
-0.3849
0.0583
1183
-6.597
<.0001
Fert0 Tilled - Fert1 Untilled
-0.3312
1.4542
41417709
-0.228
0.9958
Fert1 Tilled - Fert0 Untilled
-0.1052
1.4547
29073394
-0.072
0.9999
Fert1 Tilled - Fert1 Untilled
-0.0516
0.0722
759
-0.714
0.8914
Fert0 Untilled - Fert1 Untilled
0.0537
1.4539
38437869
0.037
1.0000
Degrees-of-freedom method: kenward-roger
P value adjustment: tukey method for comparing a family of 4 estimates