Chapter 17: Practical statistical modeling
Contents
Chapter 17: Practical statistical modeling#
library(car)
library(tidyverse)
library(ggplot2)
library(emmeans)
library(cowplot)
library(knitr)
# library(ggfortify)
library(gplots)
# library(matlab)
library(dendextend)
library(psych)
library(MuMIn)
library(lme4)
library(lmerTest)
library(broom)
library(DHARMa)
library(ggbeeswarm)
library(viridis)
library(influence.ME)
library(broom.mixed)
library(ggExtra)
theme_set(theme_minimal(base_size = 14))
set.seed(123456) # set random seed to exactly replicate results
Loading required package: carData
── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1 ✔ purrr 1.0.1
✔ tibble 3.1.8 ✔ dplyr 1.1.0
✔ tidyr 1.3.0 ✔ stringr 1.5.0
✔ readr 2.1.4 ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ dplyr::recode() masks car::recode()
✖ purrr::some() masks car::some()
Attaching package: ‘gplots’
The following object is masked from ‘package:stats’:
lowess
---------------------
Welcome to dendextend version 1.16.0
Type citation('dendextend') for how to cite the package.
Type browseVignettes(package = 'dendextend') for the package vignette.
The github page is: https://github.com/talgalili/dendextend/
Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
You may ask questions at stackoverflow, use the r and dendextend tags:
https://stackoverflow.com/questions/tagged/dendextend
To suppress this message use: suppressPackageStartupMessages(library(dendextend))
---------------------
Attaching package: ‘dendextend’
The following object is masked from ‘package:stats’:
cutree
Attaching package: ‘psych’
The following objects are masked from ‘package:ggplot2’:
%+%, alpha
The following object is masked from ‘package:car’:
logit
Loading required package: Matrix
Attaching package: ‘Matrix’
The following objects are masked from ‘package:tidyr’:
expand, pack, unpack
Attaching package: ‘lmerTest’
The following object is masked from ‘package:lme4’:
lmer
The following object is masked from ‘package:stats’:
step
This is DHARMa 0.4.6. For overview type '?DHARMa'. For recent changes, type news(package = 'DHARMa')
Loading required package: viridisLite
Attaching package: ‘influence.ME’
The following object is masked from ‘package:stats’:
influence
Figure 17.1 Outliers and influential observations#
set.seed(1234)
npts = 24
outlier_df = data.frame(Y = rnorm(npts, 0, 1)) %>%
mutate(X = rnorm(npts, 0, 1) + Y*0.4)
lm.result = lm(Y ~ X, data=outlier_df)
model.data = augment(lm.result)
p2 = ggplot(model.data, aes(X, Y)) +
geom_smooth(method='lm', se=FALSE) +
geom_point(size=model.data$.cooksd*10)
p2 <- ggMarginal(p2, type="boxplot")
p2
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
Warning message:
“Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?”
Warning message:
“Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?”
Figure 17.2: Collider bias#
# https://observablehq.com/@herbps10/collider-bias
npts = 1000
set.seed(123456)
conf_df = data.frame(height=rnorm(npts, 175, 20),
speed = rnorm(npts, 50, 10)) %>%
mutate(zheight = scale(height),
zspeed = scale(speed),
speedXheight = zheight * zspeed,
NBA = as.factor(speedXheight > quantile(speedXheight, .90) & speed > mean(speed) & height & mean(height)))
ggplot(conf_df, aes(height, speed, color=NBA)) + geom_point(size=1.5)
summary(lm(speed ~ height, data=conf_df))
Call:
lm(formula = speed ~ height, data = conf_df)
Residuals:
Min 1Q Median 3Q Max
-28.559 -7.363 -0.107 7.402 36.052
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 49.542457 2.865519 17.289 <2e-16 ***
height 0.002157 0.016250 0.133 0.894
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10.19 on 998 degrees of freedom
Multiple R-squared: 1.765e-05, Adjusted R-squared: -0.0009843
F-statistic: 0.01762 on 1 and 998 DF, p-value: 0.8944
Model with NBA player status (collider)
summary(lm(speed ~ height + NBA, data=conf_df))
Call:
lm(formula = speed ~ height + NBA, data = conf_df)
Residuals:
Min 1Q Median 3Q Max
-27.917 -6.848 -0.118 7.000 36.294
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 58.09588 2.85426 20.354 < 2e-16 ***
height -0.05070 0.01632 -3.107 0.00194 **
NBATRUE 16.08565 1.57750 10.197 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.701 on 997 degrees of freedom
Multiple R-squared: 0.09446, Adjusted R-squared: 0.09264
F-statistic: 52 on 2 and 997 DF, p-value: < 2.2e-16
same regression only on the NBA players, where we see a strong negative relationship between speed and height:
summary(lm(speed ~ height, data=conf_df %>% dplyr::filter(NBA==TRUE)))
Call:
lm(formula = speed ~ height, data = conf_df %>% dplyr::filter(NBA ==
TRUE))
Residuals:
Min 1Q Median 3Q Max
-6.6078 -3.7125 -0.9888 2.5072 12.7771
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 131.6209 12.9493 10.164 6.88e-13 ***
height -0.3315 0.0632 -5.245 4.79e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.785 on 42 degrees of freedom
Multiple R-squared: 0.3958, Adjusted R-squared: 0.3814
F-statistic: 27.51 on 1 and 42 DF, p-value: 4.792e-06
Example 1: Self-regulation and arrest#
behavdata <- read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-figures-data/main/Eisenberg/meaningful_variables.csv',
show_col_types = FALSE)
demoghealthdata <- read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-figures-data/main/Eisenberg/demographic_health.csv',
show_col_types = FALSE)
# dplyr::recode Sex variable from 0/1 to Male/Female
demoghealthdata <- demoghealthdata %>%
mutate(Sex = dplyr::recode_factor(Sex, `0`="Male", `1`="Female"))
# combine the data into a single data frame by subcode
alldata <- merge(behavdata, demoghealthdata, by='subcode')
Table 17.1#
arrestdata <- alldata %>%
drop_na(ArrestedChargedLifeCount) %>%
mutate(everArrested = ArrestedChargedLifeCount > 0)
#ggplot(arrestdata,aes(ArrestedChargedLifeCount)) +
# geom_histogram(bins=max(unique(arrestdata$ArrestedChargedLifeCount)) + 1) # add 1 to account for zeros
arrest_table <- arrestdata %>%
group_by(ArrestedChargedLifeCount) %>%
summarize(number = n(),
proportion = n()/nrow(arrestdata))
kable(arrest_table, caption='Frequency distribution of number of reported arrests in the Eisenberg et al. dataset', digits=3)
Table: Frequency distribution of number of reported arrests in the Eisenberg et al. dataset
| ArrestedChargedLifeCount| number| proportion|
|------------------------:|------:|----------:|
| 0| 409| 0.785|
| 1| 58| 0.111|
| 2| 33| 0.063|
| 3| 9| 0.017|
| 4| 5| 0.010|
| 5| 5| 0.010|
| 6| 2| 0.004|
Figure 17.3#
# First, rename the survey variables so that they are shorter, and find all variables related to impulsivity. We will first rename the variables so taht their names are shorter, for easier display.
rename_list = list('upps_impulsivity_survey' = 'UPPS', 'sensation_seeking_survey' = 'SSS',
'dickman_survey' = 'Dickman', 'bis11_survey' = 'BIS11')
impulsivity_variables = c()
for (potential_match in names(alldata)){
for (n in names(rename_list)){
if (str_detect(potential_match, n)){
# print(sprintf('found match: %s %s', n, potential_match))
replacement_name <- str_replace(potential_match, n, toString(rename_list[n]))
names(alldata)[names(alldata) == potential_match] <- replacement_name
impulsivity_variables <- c(impulsivity_variables, replacement_name)
}
}
}
impulsivity_data <- alldata[,impulsivity_variables] %>%
drop_na()
cc = cor(impulsivity_data, use='pairwise.complete')
colors <- viridis(100)
cellvalues <- format(round(cc, 2), nsmall = 2)
par(mar=c(4,4,4,8)+0.1)
hm <- heatmap.2(cc, trace='none', key=FALSE, dendrogram='row',
cellnote=cellvalues, notecol='black',
col=colors, margins=c(10, 14), revC = TRUE)
Figure 17.4#
# use the clustering from the heatmap to identify the two sets of variables
groups = cutree(hm$rowDendrogram, k=2)
vars_cluster1 <- names(groups[groups == 1])
vars_cluster2 <- names(groups[groups == 2])
# compute the mean impulsivity and sensation seeking values for each individual
# standardize the measures to make interpretation of the model easier
alldata <- alldata %>%
mutate(mean_impulsivity = rowMeans(dplyr::select(., all_of(vars_cluster1)), na.rm = TRUE),
mean_impulsivity = (mean_impulsivity - mean(mean_impulsivity))/sd(mean_impulsivity),
mean_senseek = rowMeans(dplyr::select(., all_of(vars_cluster2)), na.rm = TRUE),
mean_senseek = (mean_senseek - mean(mean_senseek))/sd(mean_senseek))
# create some additional useful variables, and clean up the data frame
# to only include the variables of interest and individuals without missing data
arrestdata <- alldata %>%
mutate(everArrested = ArrestedChargedLifeCount > 0,
anyTickets = TrafficTicketsLastYearCount > 0,
anyAccidents = TrafficAccidentsLifeCount > 0) %>%
dplyr::select(c(mean_impulsivity, mean_senseek, ArrestedChargedLifeCount,
everArrested, anyTickets, TrafficTicketsLastYearCount,
anyAccidents,TrafficAccidentsLifeCount, Age, Sex)) %>%
drop_na() %>%
mutate(
everArrestedInt = as.integer(everArrested),
everArrested = dplyr::recode_factor(everArrestedInt, `1`='Arrested', `0`='NotArrested'),
SexInt = dplyr::recode(Sex, Female=0, Male=1))
pairs.panels(arrestdata[,c('mean_impulsivity', 'mean_senseek')])
Figure 17.5#
plotdata = pivot_longer(arrestdata,
c(mean_impulsivity, mean_senseek),
names_to = 'variable') %>%
mutate(variable = dplyr::recode(variable, mean_impulsivity="Mean impulsivity",
mean_senseek="Mean sens. seeking"))
p1 = ggplot(plotdata, aes(everArrested, value)) +
geom_boxplot() +
geom_beeswarm(alpha=0.1) +
facet_grid(. ~ variable)
p1 = ggplot(arrestdata, aes(everArrested, mean_impulsivity)) +
geom_boxplot() +
geom_beeswarm(alpha=0.1) +
ylab('Mean impulsivity')
p2 = ggplot(arrestdata, aes(everArrested, mean_senseek)) +
geom_boxplot() +
geom_beeswarm(alpha=0.1)+
ylab('Mean sensation seeking')
plot_grid(p1, p2)
t-test output for the mean impulsivity scores:
t.test(mean_impulsivity ~ everArrested, data=arrestdata, alternative='greater')
imp_arrest_d = cohen.d(mean_impulsivity ~ everArrested, data=arrestdata)
Welch Two Sample t-test
data: mean_impulsivity by everArrested
t = 2.9105, df = 160.93, p-value = 0.00206
alternative hypothesis: true difference in means between group Arrested and group NotArrested is greater than 0
95 percent confidence interval:
0.1430254 Inf
sample estimates:
mean in group Arrested mean in group NotArrested
0.2618746 -0.0695193
t-test output for the sensation seeking variable:
t.test(mean_senseek ~ everArrested, data=arrestdata, alternative='greater')
imp_arrest_d = cohen.d(mean_senseek ~ everArrested, data=arrestdata)
Welch Two Sample t-test
data: mean_senseek by everArrested
t = 4.7691, df = 182.38, p-value = 1.887e-06
alternative hypothesis: true difference in means between group Arrested and group NotArrested is greater than 0
95 percent confidence interval:
0.3167728 Inf
sample estimates:
mean in group Arrested mean in group NotArrested
0.3758469 -0.1090015
Figure 17.6: Logistic regression#
set.seed(1234)
n = 100
b0 = 100
b1 = 3
logreg_df = data.frame(x = rnorm(n, 0,10)) %>%
mutate(y = as.integer(b0 + b1*x + rnorm(n, 0, 5)),
y_bin = as.integer(runif(n) < 1/(1 + exp(-(y - mean(y))/sd(y)))))
lm_result_binary = lm(y_bin ~ x, data=logreg_df)
glm_result_binary = glm(y_bin ~ x, family=binomial, data=logreg_df)
p1 = ggplot(logreg_df, aes(x, y_bin)) +
geom_jitter(height=0.01, width=0) +
geom_smooth(method='lm', se=FALSE) +
ggtitle('B) Linear regression')
pred_df = data.frame(x=seq(-25, 25, by=.1))
pred_df = pred_df %>%
mutate(y = predict(glm_result_binary, newdata=., type='response'))
p2= ggplot(logreg_df, aes(x, y_bin)) +
geom_jitter(height=0.01, width=0) +
geom_line(data=pred_df, aes(x,y), color='blue') +
ggtitle('B) Logistic regression')
plot_grid(p1, p2)
`geom_smooth()` using formula = 'y ~ x'
Logistic regression output:
summary(glm_result_binary)
Call:
glm(formula = y_bin ~ x, family = binomial, data = logreg_df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3453 -0.8524 -0.4870 0.8256 2.1367
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.05597 0.25148 0.223 0.824
x 0.15406 0.03363 4.581 4.62e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 137.63 on 99 degrees of freedom
Residual deviance: 103.31 on 98 degrees of freedom
AIC: 107.31
Number of Fisher Scoring iterations: 4
Logistic regression output for impulsivity dataset
impmodeldata <- arrestdata %>%
dplyr::select(everArrestedInt, mean_impulsivity, mean_senseek, Age, Sex) %>%
mutate(AgeSquared = (Age - mean(Age))**2)
# glm requires int or logical variable as Y
glm_imp_arrest = glm(everArrestedInt ~ mean_impulsivity + mean_senseek + Age + AgeSquared + Sex,
data=impmodeldata, family=binomial)
glm_imp_arrest_baseline = glm(everArrestedInt ~ Age + Sex,
data=impmodeldata, family=binomial)
probabilities <- predict(glm_imp_arrest, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
model.data = augment(glm_imp_arrest)
glm_imp_arrest
# summary(glm_imp_arrest)
Call: glm(formula = everArrestedInt ~ mean_impulsivity + mean_senseek +
Age + AgeSquared + Sex, family = binomial, data = impmodeldata)
Coefficients:
(Intercept) mean_impulsivity mean_senseek Age
-3.294753 0.306971 0.383251 0.072638
AgeSquared SexFemale
-0.004233 -0.686395
Degrees of Freedom: 520 Total (i.e. Null); 515 Residual
Null Deviance: 542.3
Residual Deviance: 495.4 AIC: 507.4
Test for overdispersion
testDispersion(glm_imp_arrest, alternative='greater', type='PearsonChisq')
Parametric dispersion test via mean Pearson-chisq statistic
data: glm_imp_arrest
dispersion = 1.0409, df = 515, p-value = 0.2522
alternative hypothesis: greater
Figure 17.7#
cat_logit_est = function(iv){
ncats = min(18, round(length(unique(iv))/2))
iv_cut = cut_number(iv, ncats)
mod = glm(everArrestedInt ~ iv_cut, family=binomial, data=impmodeldata)
return(mod$linear.predictors)
}
loess_logit_est = function(iv, span=1){
fit = loess(everArrestedInt ~ iv, data=impmodeldata, span=span)$fitted
fit[fit<=0] = 0 #negatives can happen at edges
logit_of_fit = logit(fit)
return(logit_of_fit)
}
# pull out continuous predictors
predictors <- colnames(model.data %>% dplyr::select(-everArrestedInt, -Sex, -starts_with('.')))
use_glm = FALSE
if (use_glm){
model_string = "Using glm() to estimate logit"
model.data = model.data %>%
mutate(observed_logit_Age = cat_logit_est(Age),
observed_logit_senseek = cat_logit_est(mean_senseek),
observed_logit_impuls = cat_logit_est(mean_impulsivity))
} else {
model_string = "Using loess() to estimate logit"
model.data = model.data %>%
mutate(observed_logit_Age = loess_logit_est(Age),
observed_logit_senseek = loess_logit_est(mean_senseek),
observed_logit_impuls = loess_logit_est(mean_impulsivity))
}
point_alpha=.3
jitter_width=0
jitter_height=.1
p1 = ggplot(model.data, aes(Age,observed_logit_Age, size=.cooksd)) +
geom_jitter(width=jitter_width, height=jitter_height, alpha=point_alpha) +
geom_smooth(se=FALSE, span=1) +
ggtitle(model_string)+
theme(legend.position="none")
p2 = ggplot(model.data, aes(mean_senseek,observed_logit_senseek, size=.cooksd)) +
geom_jitter(width=jitter_width, height=jitter_height, alpha=point_alpha) +
geom_smooth(se=FALSE, span=1)+
theme(legend.position="none")
p3 = ggplot(model.data, aes(mean_impulsivity,observed_logit_impuls, size=.cooksd)) +
geom_jitter(width=jitter_width, height=jitter_height, alpha=point_alpha) +
geom_smooth(se=FALSE,span=1)
plot_grid(p1, p2, p3)
# ggplot(impmodeldata_long_sample, aes(sample_logit, .fitted)) +
# geom_point() +
# facet_wrap(~predictors, scales = "free_x") +
# geom_smooth(se=FALSE)
#
#
Warning message:
“Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.”
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning message:
“The following aesthetics were dropped during statistical transformation: size
ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?”
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning message:
“The following aesthetics were dropped during statistical transformation: size
ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?”
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning message:
“The following aesthetics were dropped during statistical transformation: size
ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?”
model including \(Age^2\) alongside Age
glm_imp_arrest2 = glm(everArrestedInt ~ mean_impulsivity + mean_senseek + Age + AgeSquared + Sex,
data=impmodeldata, family=binomial)
glm_imp_arrest_baseline2 = glm(everArrestedInt ~ Age + AgeSquared + Sex,
data=impmodeldata, family=binomial)
summary(glm_imp_arrest2)
Call:
glm(formula = everArrestedInt ~ mean_impulsivity + mean_senseek +
Age + AgeSquared + Sex, family = binomial, data = impmodeldata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4676 -0.7270 -0.5532 -0.3318 2.6388
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.294753 0.601139 -5.481 4.23e-08 ***
mean_impulsivity 0.306971 0.114057 2.691 0.00712 **
mean_senseek 0.383251 0.118646 3.230 0.00124 **
Age 0.072638 0.018644 3.896 9.78e-05 ***
AgeSquared -0.004233 0.001866 -2.269 0.02327 *
SexFemale -0.686395 0.238219 -2.881 0.00396 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 542.33 on 520 degrees of freedom
Residual deviance: 495.45 on 515 degrees of freedom
AIC: 507.45
Number of Fisher Scoring iterations: 4
Table 17.2#
oddsratios = exp(cbind("Odds ratio" = coef(glm_imp_arrest), confint.default(glm_imp_arrest, level = 0.95)))[2:6,]
kable(oddsratios, digits=3, caption='Effect sizes for each of the variables in the logistic regression model, expressed as odds ratios, along with 95 percent confidence limits for each odds ratio.')
Table: Effect sizes for each of the variables in the logistic regression model, expressed as odds ratios, along with 95 percent confidence limits for each odds ratio.
| | Odds ratio| 2.5 %| 97.5 %|
|:----------------|----------:|-----:|------:|
|mean_impulsivity | 1.359| 1.087| 1.700|
|mean_senseek | 1.467| 1.163| 1.851|
|Age | 1.075| 1.037| 1.115|
|AgeSquared | 0.996| 0.992| 0.999|
|SexFemale | 0.503| 0.316| 0.803|
Bayes factor using approximation via BIC
BF_01 = exp((BIC(glm_imp_arrest2) - BIC(glm_imp_arrest_baseline2))/2) # BICs to Bayes factor
# 1/BF_01
Example 2: Mask-wearing and face-touching#
Figure 17.8#
# data cleaning based on .do file distributed with data
maskdata = read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-figures-data/main/mask_wearing/DataVersion2/MaskFaceTouchOSF.csv') %>%
filter(face_touching != "Missing") %>%
mutate(face_touching = dplyr::recode(face_touching, 'Yes'=1, 'No'=0),
mask_front_touching = dplyr::recode(mask_front_touching, 'Yes'=1, .default=0),
mask_strap_touching = dplyr::recode(mask_strap_touching, 'Yes'=1,.default=0),
face_touching = face_touching | mask_front_touching | mask_strap_touching,
mask_wearing = dplyr::recode(mask_YesNo, 'Yes'=TRUE, 'No'=FALSE),
age_std = scale(age),
study = as.factor(study),
segment_crowding_std = scale(segment_crowding)) %>%
filter(change==0 & non_covering != 'Yes')
maskdata_study1 = maskdata %>%
filter(study==1)
maskdata_study2 = maskdata %>%
filter(study==2)
Rows: 898 Columns: 20
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (11): gender, mask_YesNo, fully_covering, partially_covering, non_coveri...
dbl (9): age, duration_of_observation, change, segment_crowding, study, loc...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(maskdata %>% mutate(study=dplyr::recode(study, `1`='Study 1', `2`='Study 2'))
, aes(x=mask_wearing, fill=face_touching)) + geom_bar() + facet_grid(. ~ study)
simple chi-squared test
chisq.test(maskdata_study1$mask_wearing, maskdata_study1$face_touching)
Pearson's Chi-squared test with Yates' continuity correction
data: maskdata_study1$mask_wearing and maskdata_study1$face_touching
X-squared = 0.029801, df = 1, p-value = 0.8629
analogous test for the data from the second study
chisq.test(maskdata_study2$mask_wearing, maskdata_study2$face_touching)
Pearson's Chi-squared test with Yates' continuity correction
data: maskdata_study2$mask_wearing and maskdata_study2$face_touching
X-squared = 1.0854, df = 1, p-value = 0.2975
Logistic regression model including duration and study along with mask wearing
glm_result_combined = glm(face_touching ~ mask_wearing + duration_of_observation + study,
family=binomial, data=maskdata)
summary(glm_result_combined)
Call:
glm(formula = face_touching ~ mask_wearing + duration_of_observation +
study, family = binomial, data = maskdata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7465 -0.6011 -0.5299 -0.4732 2.1918
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.770868 0.248228 -11.163 < 2e-16 ***
mask_wearingTRUE 0.191901 0.197742 0.970 0.33182
duration_of_observation 0.028991 0.006182 4.689 2.74e-06 ***
study2 0.571684 0.200441 2.852 0.00434 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 718.03 on 803 degrees of freedom
Residual deviance: 685.00 on 800 degrees of freedom
AIC: 693
Number of Fisher Scoring iterations: 4
Mixed effects model
glmer_result = glmer(face_touching ~ mask_wearing + duration_of_observation + (1 + mask_wearing |unique_situation), data=maskdata, family=binomial)
glmer_result_baseline = update(glmer_result, formula = ~ . -mask_wearing) # Without mask effect term
summary(glmer_result)
maskdata = maskdata %>%
mutate(glmer_resid = residuals(glmer_result),
)
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula: face_touching ~ mask_wearing + duration_of_observation + (1 +
mask_wearing | unique_situation)
Data: maskdata
AIC BIC logLik deviance df.resid
704.2 732.4 -346.1 692.2 798
Scaled residuals:
Min 1Q Median 3Q Max
-1.6847 -0.4289 -0.3888 -0.3335 3.1095
Random effects:
Groups Name Variance Std.Dev. Corr
unique_situation (Intercept) 0.1829 0.4277
mask_wearingTRUE 0.1599 0.3999 -0.40
Number of obs: 804, groups: unique_situation, 129
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.518255 0.245319 -10.265 < 2e-16 ***
mask_wearingTRUE 0.145705 0.253822 0.574 0.566
duration_of_observation 0.029691 0.006409 4.633 3.61e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) m_TRUE
msk_wrnTRUE -0.471
drtn_f_bsrv -0.698 -0.011
test for overdispersion:
testDispersion(glmer_result, alternative='greater', type='PearsonChisq')
Parametric dispersion test via mean Pearson-chisq statistic
data: glmer_result
dispersion = 0.92427, df = 798, p-value = 0.9377
alternative hypothesis: greater
Figure 17.9#
#alt.est.b <- influence(glmer_result, obs=TRUE) #"unique_situation")
#cd = cooks.distance(alt.est.b)
loess_logit_est2 = function(iv, span=1){
fit = loess(face_touching_int ~ iv, data=maskmodeldata, span=span)$fitted
fit[fit<=0] = 0 #negatives can happen at edges
logit_of_fit = logit(fit)
return(logit_of_fit)
}
maskmodeldata = augment(glmer_result) %>%
mutate(face_touching_int = as.integer(face_touching))
maskmodeldata = maskmodeldata %>%
mutate(observed_logit = loess_logit_est2(duration_of_observation))
ggplot(maskmodeldata, aes(duration_of_observation, observed_logit, size=.cooksd))+
geom_jitter() +
geom_smooth(method = "loess", se=FALSE)
Warning message in hatvalues.merMod(model):
“the hat matrix may not make sense for GLMMs”
Warning message in hatvalues.merMod(model):
“the hat matrix may not make sense for GLMMs”
`geom_smooth()` using formula = 'y ~ x'
Warning message:
“The following aesthetics were dropped during statistical transformation: size
ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?”
Table 17.3#
effect_table = tidy(glmer_result,conf.int=TRUE,exponentiate=TRUE,effects="fixed") %>%
dplyr::select(term, estimate, conf.low, conf.high)
names(effect_table) = c('', 'Odds ratio', '2.5 %', '97.5 %')
kable(effect_table[2:3, ], digits=3, caption = 'Odds ratios and confidence intervals for the independent variables in the mask wearing study. Note that odds ratios are only presented for the fixed effects; since video segments were modeled as a random effect, that variable is not included.')
Table: Odds ratios and confidence intervals for the independent variables in the mask wearing study. Note that odds ratios are only presented for the fixed effects; since video segments were modeled as a random effect, that variable is not included.
| | Odds ratio| 2.5 %| 97.5 %|
|:-----------------------|----------:|-----:|------:|
|mask_wearingTRUE | 1.157| 0.703| 1.903|
|duration_of_observation | 1.030| 1.017| 1.043|
Example 3: Asthma and air pollution#
3. Prepare the data for analysis#
# asthma data
# https://chronicdata.cdc.gov/500-Cities-Places/500-Cities-Census-Tract-level-Data-GIS-Friendly-Fo/5mtz-k78d (for 2014)
#
# PM2.5 data:
#
# filtered for 2014 and averaged ds_pm_pred using:
#
# https://data.cdc.gov/Environmental-Health-Toxicology/Daily-Census-Tract-Level-PM2-5-Concentrations-2011/fcqm-xrf4/data
pmdata = read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-figures-data/main/CDC_PM2.5_2014/Daily_Census_Tract-Level_PM2.5_mean_2014.csv') %>%
rename(TractFIPS=ctfips,
pm25_mean=ds_pm_pred)
asthmadata = read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-figures-data/main/500cities_disease/acsdata_with_censusdata.csv') %>%
dplyr::select(-ends_with('_Crude95CI'))
pm_asthma_data = inner_join(asthmadata, pmdata, by='TractFIPS') %>%
drop_na() %>%
rename(asthma_prev=CASTHMA_CrudePrev,
city=PlaceFIPS) %>%
mutate(MedianIncome = MedianIncome/1000,
Population2010 = Population2010/1000)
# mutate(log_asthma_prev = log(asthma_prev))
Rows: 72283 Columns: 2
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): ctfips, ds_pm_pred
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 26892 Columns: 66
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (33): StateAbbr, PlaceName, Place_TractID, ACCESS2_Crude95CI, ARTHRITIS_...
dbl (33): PlaceFIPS, TractFIPS, Population2010, ACCESS2_CrudePrev, ARTHRITIS...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Figure 17.10#
pairs.panels(pm_asthma_data %>% dplyr::select(pm25_mean, asthma_prev, MedianIncome, MedianAge, Population2010))
standard linear regression model
lm_asthma_pm = lm(asthma_prev ~ pm25_mean + MedianIncome + MedianAge + Population2010, data=pm_asthma_data)
summary(lm_asthma_pm)
Call:
lm(formula = asthma_prev ~ pm25_mean + MedianIncome + MedianAge +
Population2010, data = pm_asthma_data)
Residuals:
Min 1Q Median 3Q Max
-5.1237 -1.2997 -0.2894 1.0326 10.3332
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.551556 0.149787 83.796 < 2e-16 ***
pm25_mean 0.032812 0.006198 5.294 1.2e-07 ***
MedianIncome -0.109583 0.002030 -53.969 < 2e-16 ***
MedianAge 0.004697 0.003908 1.202 0.229
Population2010 -0.113642 0.005948 -19.105 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.866 on 26676 degrees of freedom
Multiple R-squared: 0.1372, Adjusted R-squared: 0.1371
F-statistic: 1061 on 4 and 26676 DF, p-value: < 2.2e-16
Figure 17.11#
model.data = augment(lm_asthma_pm)
ggplot(model.data, aes(sample=.resid)) +
geom_qq() +
geom_qq_line()
Figure 17.12#
model.data = model.data %>%
mutate(city = pm_asthma_data$city)
model_means_sorted = model.data %>% group_by(city) %>%
summarize_all(mean) %>%
arrange(.resid)
p1 = ggplot(model.data, aes(y=.resid, group=factor(city, levels=model_means_sorted$city))) +
geom_boxplot(outlier.size=0.5, alpha=.2, size=.1)
lmer_asthma_pm = lmer(asthma_prev ~ pm25_mean + MedianAge + MedianIncome + Population2010 + (1 + pm25_mean|city),
data=pm_asthma_data)
lmer_asthma_pm_baseline = update(lmer_asthma_pm, formula = ~ . -pm25_mean)
pm_asthma_data = pm_asthma_data %>%
mutate(lmer_resid = resid(lmer_asthma_pm))
p2 = ggplot(pm_asthma_data, aes(y=lmer_resid, group=city)) +
geom_boxplot(outlier.size=0.5, alpha=.2, size=.1)
plot_grid(p1, p2, nrow=2)
Warning message in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
“Model failed to converge with max|grad| = 0.0242659 (tol = 0.002, component 1)”
Warning message in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
“Model failed to converge with max|grad| = 0.0102508 (tol = 0.002, component 1)”
linear mixed effect model output:
summary(lmer_asthma_pm)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: asthma_prev ~ pm25_mean + MedianAge + MedianIncome + Population2010 +
(1 + pm25_mean | city)
Data: pm_asthma_data
REML criterion at convergence: 96063.9
Scaled residuals:
Min 1Q Median 3Q Max
-3.3329 -0.6247 -0.0829 0.5462 5.4814
Random effects:
Groups Name Variance Std.Dev. Corr
city (Intercept) 158.825 12.603
pm25_mean 1.457 1.207 -0.99
Residual 1.951 1.397
Number of obs: 26681, groups: city, 494
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 6.978e+00 1.122e+00 1.770e+02 6.218 3.52e-09 ***
pm25_mean 3.604e-01 7.181e-02 1.487e+02 5.018 1.47e-06 ***
MedianAge 4.965e-02 2.823e-02 1.790e+02 1.759 0.0803 .
MedianIncome -1.032e-01 1.430e-02 1.718e+02 -7.219 1.62e-11 ***
Population2010 -2.712e-02 4.796e-03 2.567e+04 -5.655 1.57e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) pm25_m MednAg MdnInc
pm25_mean -0.646
MedianAge -0.698 0.002
MedianIncom 0.098 -0.005 -0.506
Popultn2010 -0.017 -0.006 0.010 -0.016
optimizer (nloptwrap) convergence code: 0 (OK)
Model failed to converge with max|grad| = 0.0242659 (tol = 0.002, component 1)
Table 17.4#
ci = confint(lmer_asthma_pm)
lmer_confint = cbind(fixef(lmer_asthma_pm), ci[5:9,])[2:5, ]
kable(lmer_confint, digits=2, caption='Parameter estimates and confidence intervals for regression parameters from mixed effect model of asthma prevalence.')
Computing profile confidence intervals ...
Table: Parameter estimates and confidence intervals for regression parameters from mixed effect model of asthma prevalence.
| | | 2.5 %| 97.5 %|
|:--------------|-----:|-----:|------:|
|pm25_mean | 0.36| 0.22| 0.50|
|MedianAge | 0.05| -0.01| 0.10|
|MedianIncome | -0.10| -0.13| -0.07|
|Population2010 | -0.03| -0.04| -0.02|
Compute delta r-squared
r2 = r.squaredGLMM(lmer_asthma_pm)
# r2
r2_baseline = r.squaredGLMM(lmer_asthma_pm_baseline)
# r2_baseline
delta_r2 = r2[1] - r2_baseline[1] [1] # marginal r-squared, reflecting fixed effects only
# delta_r2
lmer_asthma_pm_noincome = update(lmer_asthma_pm, formula = ~ . -MedianIncome)
r2_noincome = r.squaredGLMM(lmer_asthma_pm_noincome)
delta_r2_noincome = r2[1] - r2_noincome[1] [1] # marginal r-squared, reflecting fixed effects only
#delta_r2_noincome
Warning message:
“'r.squaredGLMM' now calculates a revised statistic. See the help page.”
Warning message in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
“Model failed to converge with max|grad| = 0.0073298 (tol = 0.002, component 1)”
Example 4: Response of plants to nitrogen fertilizers and soil tilling#
3. Prepare and visualize the data#
fert_data_all = read_delim('https://raw.githubusercontent.com/statsthinking21/statsthinking21-figures-data/main/fertilizer/Fertsyntraitsall_Jan27_2008.txt', delim='\t')
fert_data = fert_data_all %>%
dplyr::filter(Rawabundmetric == "Grams biomass") %>%
dplyr::filter(Site == 'KBS') %>%
mutate(YearFac = as.factor(Year),
plotID_common = plotID - Fert,
Fert = as.factor(Fert),
log_rawabund = as.numeric(scale(log(Rawabund))))
Warning message:
“One or more parsing issues, call `problems()` on your data frame for details, e.g.:
dat <- vroom(...)
problems(dat)”
Rows: 20144 Columns: 25
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (18): Site, Experiment, Abbreviation, Other, Species_code, USDA_symbol, ...
dbl (7): Year, Fert, block, plotID, Rawabund, Relabund, Plotsize
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Figure 17.13#
ggplot(fert_data, aes(Rawabund)) +
geom_histogram(bins=100) +
xlab('Raw crop abundance')
Simple linear model output:
lm.result_fert = lm(Rawabund ~ Fert*Experiment + Year, data=fert_data)
summary(lm.result_fert)
Call:
lm(formula = Rawabund ~ Fert * Experiment + Year, data = fert_data)
Residuals:
Min 1Q Median 3Q Max
-95.64 -48.97 -30.95 -12.65 1284.57
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1154.5206 1685.2301 0.685 0.493353
Fert1 57.1985 7.6080 7.518 7.59e-14 ***
ExperimentUntilled -5.3171 6.5679 -0.810 0.418266
Year -0.5602 0.8438 -0.664 0.506796
Fert1:ExperimentUntilled -35.4946 9.9328 -3.573 0.000359 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 124.1 on 2610 degrees of freedom
Multiple R-squared: 0.03048, Adjusted R-squared: 0.029
F-statistic: 20.52 on 4 and 2610 DF, p-value: < 2.2e-16
Figure 17.14#
model.data = augment(lm.result_fert) %>%
mutate(plotID_common = fert_data$plotID_common,
Species_code = fert_data$Species_code)
p1 = ggplot(model.data, aes(sample=.resid)) + geom_qq() + geom_qq_line()
lmer.result_fert = lmer(log_rawabund ~ Fert*Experiment + Year + (1 + Fert|plotID_common) + (1 + Fert|Species_code) , data=fert_data)
fert_data = fert_data %>%
mutate(lmer_resid = resid(lmer.result_fert))
p2 = ggplot(fert_data, aes(sample=lmer_resid)) + geom_qq() + geom_qq_line()
plot_grid(p1, p2)
Figure 17.15#
model_means_sorted = model.data %>% group_by(Species_code) %>%
summarize_all(mean) %>%
arrange(.resid)
ggplot(model.data, aes(y=.resid, group=factor(Species_code, levels=model_means_sorted$Species_code))) + geom_boxplot()
Warning message:
“There were 264 warnings in `summarise()`.
The first warning was:
ℹ In argument: `Fert = (new("standardGeneric", .Data = function (x, ...) ...`.
ℹ In group 1: `Species_code = "ABUTH"`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA
ℹ Run `dplyr::last_dplyr_warnings()` to see the 263 remaining warnings.”
Linear mixed effects model output:
summary(lmer.result_fert)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula:
log_rawabund ~ Fert * Experiment + Year + (1 + Fert | plotID_common) +
(1 + Fert | Species_code)
Data: fert_data
REML criterion at convergence: 6458.2
Scaled residuals:
Min 1Q Median 3Q Max
-3.9539 -0.6286 0.1066 0.7031 2.7829
Random effects:
Groups Name Variance Std.Dev. Corr
Species_code (Intercept) 0.30960 0.55642
Fert1 0.07358 0.27126 0.01
plotID_common (Intercept) 0.01571 0.12533
Fert1 0.00420 0.06481 -0.71
Residual 0.61678 0.78535
Number of obs: 2615, groups: Species_code, 132; plotID_common, 12
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 1.471e+01 1.129e+01 2.547e+03 1.303 0.192682
Fert1 2.825e-01 7.713e-02 3.155e+01 3.662 0.000908 ***
ExperimentUntilled 3.711e-01 9.278e-02 1.333e+01 4.000 0.001443 **
Year -7.643e-03 5.653e-03 2.546e+03 -1.352 0.176498
Fert1:ExperimentUntilled -3.261e-01 9.019e-02 2.057e+01 -3.615 0.001665 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) Fert1 ExprmU Year
Fert1 -0.012
ExprmntUntl -0.055 0.328
Year -1.000 0.009 0.051
Frt1:ExprmU -0.006 -0.717 -0.529 0.009
Effect sizes
emm = emmeans(lmer.result_fert, pairwise ~ Fert*Experiment)
contrast(emm, 'tukey')
contrast estimate SE df t.ratio p.value
Fert0 Tilled - Fert1 Tilled -0.2825 0.0787 33.1 -3.587 0.0056
Fert0 Tilled - Fert0 Untilled -0.3711 0.0932 14.3 -3.983 0.0063
Fert0 Tilled - Fert1 Untilled -0.3275 0.0931 27.9 -3.516 0.0078
Fert1 Tilled - Fert0 Untilled -0.0886 0.1001 36.4 -0.885 0.8126
Fert1 Tilled - Fert1 Untilled -0.0450 0.0897 23.3 -0.502 0.9577
Fert0 Untilled - Fert1 Untilled 0.0436 0.0650 19.0 0.671 0.9069
Degrees-of-freedom method: kenward-roger
P value adjustment: tukey method for comparing a family of 4 estimates
Compute delta r-squared values
lmer.result_fert_noint = update(lmer.result_fert, formula = ~ . -Fert:Experiment)
lmer.result_fert_null = update(lmer.result_fert, formula = ~ . -Fert*Experiment)
r2 = r.squaredGLMM(lmer.result_fert)
#r2
r2_noint = r.squaredGLMM(lmer.result_fert_noint)
#r2_noint
r2_null = r.squaredGLMM(lmer.result_fert_null)
#r2_null
delta_r2_int = r2[1] - r2_noint[1] [1] # marginal r-squared, reflecting fixed effects only
#delta_r2_int
delta_r2_full = r2[1] - r2_null[1] [1] # marginal r-squared, reflecting fixed effects only
#delta_r2_full