Chapter 17: Practical statistical modeling#

library(car)
library(tidyverse)
library(ggplot2)
library(emmeans)
library(cowplot)
library(knitr)
# library(ggfortify)
library(gplots)
# library(matlab)
library(dendextend)
library(psych)
library(MuMIn)
library(lme4)
library(lmerTest)
library(broom)
library(DHARMa)
library(ggbeeswarm)
library(viridis)
library(influence.ME)
library(broom.mixed)
library(ggExtra)

theme_set(theme_minimal(base_size = 14))

set.seed(123456) # set random seed to exactly replicate results

Loading required package: carData

── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.2 ──

✔ ggplot2 3.4.1     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0

── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ dplyr::recode() masks car::recode()
✖ purrr::some()   masks car::some()

Attaching package: ‘gplots’

The following object is masked from ‘package:stats’:

    lowess

---------------------
Welcome to dendextend version 1.16.0
Type citation('dendextend') for how to cite the package.

Type browseVignettes(package = 'dendextend') for the package vignette.
The github page is: https://github.com/talgalili/dendextend/

Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
You may ask questions at stackoverflow, use the r and dendextend tags: 
	 https://stackoverflow.com/questions/tagged/dendextend

	To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
---------------------

Attaching package: ‘dendextend’

The following object is masked from ‘package:stats’:

    cutree

Attaching package: ‘psych’

The following objects are masked from ‘package:ggplot2’:

    %+%, alpha

The following object is masked from ‘package:car’:

    logit

Loading required package: Matrix

Attaching package: ‘Matrix’

The following objects are masked from ‘package:tidyr’:

    expand, pack, unpack

Attaching package: ‘lmerTest’

The following object is masked from ‘package:lme4’:

    lmer

The following object is masked from ‘package:stats’:

    step

This is DHARMa 0.4.6. For overview type '?DHARMa'. For recent changes, type news(package = 'DHARMa')

Loading required package: viridisLite

Attaching package: ‘influence.ME’

The following object is masked from ‘package:stats’:

    influence

Figure 17.1 Outliers and influential observations#

set.seed(1234)
npts = 24
outlier_df = data.frame(Y = rnorm(npts, 0, 1)) %>%
  mutate(X = rnorm(npts, 0, 1) + Y*0.4)

lm.result = lm(Y ~ X, data=outlier_df)
model.data = augment(lm.result)
p2 = ggplot(model.data, aes(X, Y)) +
  geom_smooth(method='lm', se=FALSE) +
  geom_point(size=model.data$.cooksd*10)
p2 <- ggMarginal(p2, type="boxplot")
p2

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

Warning message:
“Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?”

Warning message:
“Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?”

Figure 17.2: Collider bias#

# https://observablehq.com/@herbps10/collider-bias
npts = 1000
set.seed(123456)

conf_df = data.frame(height=rnorm(npts, 175, 20),
                     speed = rnorm(npts, 50, 10)) %>%
  mutate(zheight = scale(height),
         zspeed = scale(speed),
         speedXheight = zheight * zspeed,
         NBA = as.factor(speedXheight > quantile(speedXheight, .90) & speed > mean(speed) & height & mean(height)))

ggplot(conf_df, aes(height, speed, color=NBA)) + geom_point(size=1.5)

summary(lm(speed ~ height, data=conf_df))

Call:
lm(formula = speed ~ height, data = conf_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.559  -7.363  -0.107   7.402  36.052 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 49.542457   2.865519  17.289   <2e-16 ***
height       0.002157   0.016250   0.133    0.894    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.19 on 998 degrees of freedom
Multiple R-squared:  1.765e-05,	Adjusted R-squared:  -0.0009843 
F-statistic: 0.01762 on 1 and 998 DF,  p-value: 0.8944

Model with NBA player status (collider)

summary(lm(speed ~ height + NBA, data=conf_df))

Call:
lm(formula = speed ~ height + NBA, data = conf_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-27.917  -6.848  -0.118   7.000  36.294 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 58.09588    2.85426  20.354  < 2e-16 ***
height      -0.05070    0.01632  -3.107  0.00194 ** 
NBATRUE     16.08565    1.57750  10.197  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.701 on 997 degrees of freedom
Multiple R-squared:  0.09446,	Adjusted R-squared:  0.09264 
F-statistic:    52 on 2 and 997 DF,  p-value: < 2.2e-16

same regression only on the NBA players, where we see a strong negative relationship between speed and height:

summary(lm(speed ~ height, data=conf_df %>% dplyr::filter(NBA==TRUE)))

Call:
lm(formula = speed ~ height, data = conf_df %>% dplyr::filter(NBA == 
    TRUE))

Residuals:
    Min      1Q  Median      3Q     Max 
-6.6078 -3.7125 -0.9888  2.5072 12.7771 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 131.6209    12.9493  10.164 6.88e-13 ***
height       -0.3315     0.0632  -5.245 4.79e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.785 on 42 degrees of freedom
Multiple R-squared:  0.3958,	Adjusted R-squared:  0.3814 
F-statistic: 27.51 on 1 and 42 DF,  p-value: 4.792e-06

Example 1: Self-regulation and arrest#

behavdata <- read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-figures-data/main/Eisenberg/meaningful_variables.csv',
                      show_col_types = FALSE)
demoghealthdata <- read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-figures-data/main/Eisenberg/demographic_health.csv',
                            show_col_types = FALSE)

# dplyr::recode Sex variable from 0/1 to Male/Female
demoghealthdata <- demoghealthdata %>%
  mutate(Sex = dplyr::recode_factor(Sex, `0`="Male", `1`="Female"))

# combine the data into a single data frame by subcode
alldata <- merge(behavdata, demoghealthdata, by='subcode')

Table 17.1#

arrestdata <- alldata %>%
  drop_na(ArrestedChargedLifeCount) %>%
  mutate(everArrested = ArrestedChargedLifeCount > 0)

#ggplot(arrestdata,aes(ArrestedChargedLifeCount)) +
#  geom_histogram(bins=max(unique(arrestdata$ArrestedChargedLifeCount)) + 1) # add 1 to account for zeros

arrest_table <- arrestdata %>%
  group_by(ArrestedChargedLifeCount) %>%
  summarize(number = n(),
            proportion = n()/nrow(arrestdata))

kable(arrest_table, caption='Frequency distribution of number of reported arrests in the Eisenberg et al. dataset', digits=3)

Table: Frequency distribution of number of reported arrests in the Eisenberg et al. dataset

| ArrestedChargedLifeCount| number| proportion|
|------------------------:|------:|----------:|
|                        0|    409|      0.785|
|                        1|     58|      0.111|
|                        2|     33|      0.063|
|                        3|      9|      0.017|
|                        4|      5|      0.010|
|                        5|      5|      0.010|
|                        6|      2|      0.004|

Figure 17.3#

# First, rename the survey variables so that they are shorter, and find all variables related to impulsivity.  We will first rename the variables so taht their names are shorter, for easier display.

rename_list = list('upps_impulsivity_survey' = 'UPPS', 'sensation_seeking_survey' = 'SSS',
                   'dickman_survey' = 'Dickman',  'bis11_survey' = 'BIS11')
impulsivity_variables = c()

for (potential_match in names(alldata)){
  for (n in names(rename_list)){
    if (str_detect(potential_match, n)){
      # print(sprintf('found match: %s %s', n, potential_match))
      replacement_name <- str_replace(potential_match, n, toString(rename_list[n]))
      names(alldata)[names(alldata) == potential_match] <- replacement_name
      impulsivity_variables <- c(impulsivity_variables, replacement_name)
    }
  }
}

impulsivity_data <- alldata[,impulsivity_variables] %>%
  drop_na()
cc = cor(impulsivity_data, use='pairwise.complete')

colors <- viridis(100)
cellvalues <- format(round(cc, 2), nsmall = 2)
par(mar=c(4,4,4,8)+0.1)
hm <- heatmap.2(cc, trace='none', key=FALSE, dendrogram='row',
                cellnote=cellvalues, notecol='black',
                col=colors, margins=c(10, 14), revC = TRUE)

Figure 17.4#

# use the clustering from the heatmap to identify the two sets of variables
groups = cutree(hm$rowDendrogram, k=2)
vars_cluster1 <- names(groups[groups == 1])
vars_cluster2 <- names(groups[groups == 2])

# compute the mean impulsivity and sensation seeking values for each individual
# standardize the  measures to  make interpretation of the model easier

alldata <- alldata %>%
  mutate(mean_impulsivity = rowMeans(dplyr::select(., all_of(vars_cluster1)), na.rm = TRUE),
         mean_impulsivity = (mean_impulsivity - mean(mean_impulsivity))/sd(mean_impulsivity),
         mean_senseek = rowMeans(dplyr::select(., all_of(vars_cluster2)), na.rm = TRUE),
         mean_senseek = (mean_senseek - mean(mean_senseek))/sd(mean_senseek))


# create some additional useful variables, and clean up the data frame
# to only include the variables of interest and individuals without missing data
arrestdata <- alldata %>%
  mutate(everArrested = ArrestedChargedLifeCount > 0,
         anyTickets = TrafficTicketsLastYearCount > 0,
         anyAccidents = TrafficAccidentsLifeCount > 0) %>%
  dplyr::select(c(mean_impulsivity, mean_senseek, ArrestedChargedLifeCount,
           everArrested, anyTickets, TrafficTicketsLastYearCount,
           anyAccidents,TrafficAccidentsLifeCount, Age, Sex)) %>%
  drop_na() %>%
  mutate(
    everArrestedInt = as.integer(everArrested),
    everArrested = dplyr::recode_factor(everArrestedInt, `1`='Arrested', `0`='NotArrested'),
    SexInt = dplyr::recode(Sex, Female=0, Male=1))

pairs.panels(arrestdata[,c('mean_impulsivity', 'mean_senseek')])

Figure 17.5#

plotdata = pivot_longer(arrestdata,
                        c(mean_impulsivity, mean_senseek),
                        names_to = 'variable') %>%
  mutate(variable = dplyr::recode(variable, mean_impulsivity="Mean impulsivity",
         mean_senseek="Mean sens. seeking"))


p1 = ggplot(plotdata, aes(everArrested, value)) +
  geom_boxplot() +
  geom_beeswarm(alpha=0.1) +
  facet_grid(. ~ variable)

p1 = ggplot(arrestdata, aes(everArrested, mean_impulsivity)) +
  geom_boxplot() +
  geom_beeswarm(alpha=0.1) +
  ylab('Mean impulsivity')
p2 = ggplot(arrestdata, aes(everArrested, mean_senseek)) +
  geom_boxplot() +
  geom_beeswarm(alpha=0.1)+
  ylab('Mean sensation seeking')
plot_grid(p1, p2)

t-test output for the mean impulsivity scores:

t.test(mean_impulsivity ~ everArrested, data=arrestdata, alternative='greater')
imp_arrest_d = cohen.d(mean_impulsivity ~ everArrested, data=arrestdata)

	Welch Two Sample t-test

data:  mean_impulsivity by everArrested
t = 2.9105, df = 160.93, p-value = 0.00206
alternative hypothesis: true difference in means between group Arrested and group NotArrested is greater than 0
95 percent confidence interval:
 0.1430254       Inf
sample estimates:
   mean in group Arrested mean in group NotArrested 
                0.2618746                -0.0695193 

t-test output for the sensation seeking variable:

t.test(mean_senseek ~ everArrested, data=arrestdata, alternative='greater')
imp_arrest_d = cohen.d(mean_senseek ~ everArrested, data=arrestdata)

	Welch Two Sample t-test

data:  mean_senseek by everArrested
t = 4.7691, df = 182.38, p-value = 1.887e-06
alternative hypothesis: true difference in means between group Arrested and group NotArrested is greater than 0
95 percent confidence interval:
 0.3167728       Inf
sample estimates:
   mean in group Arrested mean in group NotArrested 
                0.3758469                -0.1090015 

Figure 17.6: Logistic regression#

set.seed(1234)
n = 100
b0 = 100
b1 = 3

logreg_df = data.frame(x = rnorm(n, 0,10)) %>%
  mutate(y = as.integer(b0 + b1*x + rnorm(n, 0, 5)),
         y_bin = as.integer(runif(n) < 1/(1 + exp(-(y - mean(y))/sd(y)))))
lm_result_binary = lm(y_bin ~ x, data=logreg_df)
glm_result_binary = glm(y_bin ~ x, family=binomial, data=logreg_df)

p1 = ggplot(logreg_df, aes(x, y_bin)) +
  geom_jitter(height=0.01, width=0) +
  geom_smooth(method='lm', se=FALSE) +
  ggtitle('B) Linear regression')

pred_df = data.frame(x=seq(-25, 25, by=.1))
pred_df = pred_df %>%
  mutate(y = predict(glm_result_binary, newdata=., type='response'))

p2= ggplot(logreg_df, aes(x, y_bin)) +
  geom_jitter(height=0.01, width=0) +
  geom_line(data=pred_df, aes(x,y), color='blue') +
  ggtitle('B) Logistic regression')

plot_grid(p1, p2)

`geom_smooth()` using formula = 'y ~ x'

Logistic regression output:

summary(glm_result_binary)

Call:
glm(formula = y_bin ~ x, family = binomial, data = logreg_df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3453  -0.8524  -0.4870   0.8256   2.1367  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.05597    0.25148   0.223    0.824    
x            0.15406    0.03363   4.581 4.62e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 137.63  on 99  degrees of freedom
Residual deviance: 103.31  on 98  degrees of freedom
AIC: 107.31

Number of Fisher Scoring iterations: 4

Logistic regression output for impulsivity dataset

impmodeldata <- arrestdata %>%
  dplyr::select(everArrestedInt, mean_impulsivity, mean_senseek, Age, Sex) %>%
  mutate(AgeSquared = (Age - mean(Age))**2)

# glm requires int or logical variable as Y
glm_imp_arrest = glm(everArrestedInt ~ mean_impulsivity + mean_senseek + Age +  AgeSquared + Sex,
                    data=impmodeldata, family=binomial)
glm_imp_arrest_baseline = glm(everArrestedInt ~ Age + Sex,
                    data=impmodeldata, family=binomial)
probabilities <- predict(glm_imp_arrest, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
model.data = augment(glm_imp_arrest)

glm_imp_arrest
# summary(glm_imp_arrest)

Call:  glm(formula = everArrestedInt ~ mean_impulsivity + mean_senseek + 
    Age + AgeSquared + Sex, family = binomial, data = impmodeldata)

Coefficients:
     (Intercept)  mean_impulsivity      mean_senseek               Age  
       -3.294753          0.306971          0.383251          0.072638  
      AgeSquared         SexFemale  
       -0.004233         -0.686395  

Degrees of Freedom: 520 Total (i.e. Null);  515 Residual
Null Deviance:	    542.3 
Residual Deviance: 495.4 	AIC: 507.4

Test for overdispersion

testDispersion(glm_imp_arrest, alternative='greater', type='PearsonChisq')

	Parametric dispersion test via mean Pearson-chisq statistic

data:  glm_imp_arrest
dispersion = 1.0409, df = 515, p-value = 0.2522
alternative hypothesis: greater

Figure 17.7#

cat_logit_est = function(iv){
  ncats = min(18, round(length(unique(iv))/2))
  iv_cut = cut_number(iv, ncats)
  mod = glm(everArrestedInt ~ iv_cut, family=binomial, data=impmodeldata)
  return(mod$linear.predictors)
}

loess_logit_est = function(iv, span=1){
  fit = loess(everArrestedInt ~ iv, data=impmodeldata, span=span)$fitted
  fit[fit<=0] = 0 #negatives can happen at edges
  logit_of_fit = logit(fit)
  return(logit_of_fit)
}

# pull out continuous predictors
predictors <- colnames(model.data %>% dplyr::select(-everArrestedInt, -Sex, -starts_with('.')))

use_glm = FALSE
if (use_glm){
  model_string = "Using glm() to estimate logit"
  model.data = model.data %>%
    mutate(observed_logit_Age = cat_logit_est(Age),
           observed_logit_senseek = cat_logit_est(mean_senseek),
           observed_logit_impuls =  cat_logit_est(mean_impulsivity))
} else {
  model_string = "Using loess() to estimate logit"
  model.data = model.data %>%
    mutate(observed_logit_Age = loess_logit_est(Age),
           observed_logit_senseek = loess_logit_est(mean_senseek),
           observed_logit_impuls = loess_logit_est(mean_impulsivity))
}


point_alpha=.3
jitter_width=0
jitter_height=.1
p1  = ggplot(model.data, aes(Age,observed_logit_Age,  size=.cooksd)) +
  geom_jitter(width=jitter_width, height=jitter_height, alpha=point_alpha) +
  geom_smooth(se=FALSE, span=1) +
  ggtitle(model_string)+
  theme(legend.position="none")

p2 = ggplot(model.data, aes(mean_senseek,observed_logit_senseek,  size=.cooksd)) +
  geom_jitter(width=jitter_width, height=jitter_height, alpha=point_alpha) +
  geom_smooth(se=FALSE, span=1)+
  theme(legend.position="none")

p3 = ggplot(model.data, aes(mean_impulsivity,observed_logit_impuls, size=.cooksd)) +
  geom_jitter(width=jitter_width, height=jitter_height, alpha=point_alpha) +
  geom_smooth(se=FALSE,span=1)

plot_grid(p1, p2, p3)
# ggplot(impmodeldata_long_sample, aes(sample_logit, .fitted)) +
#   geom_point() +
#   facet_wrap(~predictors, scales = "free_x") +
#   geom_smooth(se=FALSE)
#
#

Warning message:
“Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.”

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Warning message:
“The following aesthetics were dropped during statistical transformation: size
ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?”

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Warning message:
“The following aesthetics were dropped during statistical transformation: size
ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?”

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Warning message:
“The following aesthetics were dropped during statistical transformation: size
ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?”

model including \(Age^2\) alongside Age

glm_imp_arrest2 = glm(everArrestedInt ~ mean_impulsivity + mean_senseek + Age +  AgeSquared + Sex,
                    data=impmodeldata, family=binomial)
glm_imp_arrest_baseline2 = glm(everArrestedInt ~ Age + AgeSquared + Sex,
                    data=impmodeldata, family=binomial)

summary(glm_imp_arrest2)

Call:
glm(formula = everArrestedInt ~ mean_impulsivity + mean_senseek + 
    Age + AgeSquared + Sex, family = binomial, data = impmodeldata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4676  -0.7270  -0.5532  -0.3318   2.6388  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -3.294753   0.601139  -5.481 4.23e-08 ***
mean_impulsivity  0.306971   0.114057   2.691  0.00712 ** 
mean_senseek      0.383251   0.118646   3.230  0.00124 ** 
Age               0.072638   0.018644   3.896 9.78e-05 ***
AgeSquared       -0.004233   0.001866  -2.269  0.02327 *  
SexFemale        -0.686395   0.238219  -2.881  0.00396 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 542.33  on 520  degrees of freedom
Residual deviance: 495.45  on 515  degrees of freedom
AIC: 507.45

Number of Fisher Scoring iterations: 4

Table 17.2#

oddsratios = exp(cbind("Odds ratio" = coef(glm_imp_arrest), confint.default(glm_imp_arrest, level = 0.95)))[2:6,]
kable(oddsratios, digits=3, caption='Effect sizes for each of the variables in the logistic regression model, expressed as odds ratios, along with 95 percent confidence limits for each odds ratio.')

Table: Effect sizes for each of the variables in the logistic regression model, expressed as odds ratios, along with 95 percent confidence limits for each odds ratio.

|                 | Odds ratio| 2.5 %| 97.5 %|
|:----------------|----------:|-----:|------:|
|mean_impulsivity |      1.359| 1.087|  1.700|
|mean_senseek     |      1.467| 1.163|  1.851|
|Age              |      1.075| 1.037|  1.115|
|AgeSquared       |      0.996| 0.992|  0.999|
|SexFemale        |      0.503| 0.316|  0.803|

Bayes factor using approximation via BIC

BF_01 = exp((BIC(glm_imp_arrest2) - BIC(glm_imp_arrest_baseline2))/2)  # BICs to Bayes factor
# 1/BF_01

Example 2: Mask-wearing and face-touching#

Figure 17.8#

# data cleaning based on .do file distributed with data

maskdata = read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-figures-data/main/mask_wearing/DataVersion2/MaskFaceTouchOSF.csv') %>%
  filter(face_touching != "Missing") %>%
  mutate(face_touching = dplyr::recode(face_touching, 'Yes'=1, 'No'=0),
         mask_front_touching = dplyr::recode(mask_front_touching, 'Yes'=1, .default=0),
         mask_strap_touching = dplyr::recode(mask_strap_touching, 'Yes'=1,.default=0),
         face_touching = face_touching | mask_front_touching | mask_strap_touching,
         mask_wearing = dplyr::recode(mask_YesNo, 'Yes'=TRUE, 'No'=FALSE),
         age_std = scale(age),
         study = as.factor(study),
         segment_crowding_std = scale(segment_crowding)) %>%
  filter(change==0 & non_covering != 'Yes')

maskdata_study1 = maskdata %>%
  filter(study==1)

maskdata_study2 = maskdata %>%
  filter(study==2)

Rows: 898 Columns: 20

── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (11): gender, mask_YesNo, fully_covering, partially_covering, non_coveri...
dbl  (9): age, duration_of_observation, change, segment_crowding, study, loc...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

ggplot(maskdata %>% mutate(study=dplyr::recode(study, `1`='Study 1', `2`='Study 2'))
       , aes(x=mask_wearing, fill=face_touching)) + geom_bar() + facet_grid(. ~ study)

simple chi-squared test

chisq.test(maskdata_study1$mask_wearing, maskdata_study1$face_touching)

	Pearson's Chi-squared test with Yates' continuity correction

data:  maskdata_study1$mask_wearing and maskdata_study1$face_touching
X-squared = 0.029801, df = 1, p-value = 0.8629

analogous test for the data from the second study

chisq.test(maskdata_study2$mask_wearing, maskdata_study2$face_touching)

	Pearson's Chi-squared test with Yates' continuity correction

data:  maskdata_study2$mask_wearing and maskdata_study2$face_touching
X-squared = 1.0854, df = 1, p-value = 0.2975

Logistic regression model including duration and study along with mask wearing

glm_result_combined = glm(face_touching ~ mask_wearing + duration_of_observation + study,
                          family=binomial, data=maskdata)
summary(glm_result_combined)

Call:
glm(formula = face_touching ~ mask_wearing + duration_of_observation + 
    study, family = binomial, data = maskdata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7465  -0.6011  -0.5299  -0.4732   2.1918  

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -2.770868   0.248228 -11.163  < 2e-16 ***
mask_wearingTRUE         0.191901   0.197742   0.970  0.33182    
duration_of_observation  0.028991   0.006182   4.689 2.74e-06 ***
study2                   0.571684   0.200441   2.852  0.00434 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 718.03  on 803  degrees of freedom
Residual deviance: 685.00  on 800  degrees of freedom
AIC: 693

Number of Fisher Scoring iterations: 4

Mixed effects model

glmer_result = glmer(face_touching ~ mask_wearing + duration_of_observation + (1 + mask_wearing  |unique_situation), data=maskdata, family=binomial)
glmer_result_baseline = update(glmer_result, formula = ~ . -mask_wearing)  # Without mask effect term
summary(glmer_result)

maskdata = maskdata %>%
  mutate(glmer_resid = residuals(glmer_result),
         )

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: face_touching ~ mask_wearing + duration_of_observation + (1 +  
    mask_wearing | unique_situation)
   Data: maskdata

     AIC      BIC   logLik deviance df.resid 
   704.2    732.4   -346.1    692.2      798 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.6847 -0.4289 -0.3888 -0.3335  3.1095 

Random effects:
 Groups           Name             Variance Std.Dev. Corr 
 unique_situation (Intercept)      0.1829   0.4277        
                  mask_wearingTRUE 0.1599   0.3999   -0.40
Number of obs: 804, groups:  unique_situation, 129

Fixed effects:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -2.518255   0.245319 -10.265  < 2e-16 ***
mask_wearingTRUE         0.145705   0.253822   0.574    0.566    
duration_of_observation  0.029691   0.006409   4.633 3.61e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) m_TRUE
msk_wrnTRUE -0.471       
drtn_f_bsrv -0.698 -0.011

test for overdispersion:

testDispersion(glmer_result, alternative='greater', type='PearsonChisq')

	Parametric dispersion test via mean Pearson-chisq statistic

data:  glmer_result
dispersion = 0.92427, df = 798, p-value = 0.9377
alternative hypothesis: greater

Figure 17.9#

#alt.est.b <- influence(glmer_result, obs=TRUE) #"unique_situation")
#cd = cooks.distance(alt.est.b)

loess_logit_est2 = function(iv, span=1){
  fit = loess(face_touching_int ~ iv, data=maskmodeldata, span=span)$fitted
  fit[fit<=0] = 0 #negatives can happen at edges
  logit_of_fit = logit(fit)
  return(logit_of_fit)
}


maskmodeldata = augment(glmer_result) %>%
  mutate(face_touching_int = as.integer(face_touching))
maskmodeldata = maskmodeldata %>%
  mutate(observed_logit = loess_logit_est2(duration_of_observation))

ggplot(maskmodeldata, aes(duration_of_observation, observed_logit, size=.cooksd))+
  geom_jitter() +
  geom_smooth(method = "loess", se=FALSE)

Warning message in hatvalues.merMod(model):
“the hat matrix may not make sense for GLMMs”

Warning message in hatvalues.merMod(model):
“the hat matrix may not make sense for GLMMs”

`geom_smooth()` using formula = 'y ~ x'

Warning message:
“The following aesthetics were dropped during statistical transformation: size
ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?”

Table 17.3#

effect_table = tidy(glmer_result,conf.int=TRUE,exponentiate=TRUE,effects="fixed") %>%
  dplyr::select(term, estimate, conf.low, conf.high)

names(effect_table) = c('', 'Odds ratio', '2.5 %', '97.5 %')
kable(effect_table[2:3, ], digits=3, caption = 'Odds ratios and confidence intervals for the independent variables in the mask wearing study. Note that odds ratios are only presented for the fixed effects; since video segments were modeled as a random effect, that variable is not included.')

Table: Odds ratios and confidence intervals for the independent variables in the mask wearing study. Note that odds ratios are only presented for the fixed effects; since video segments were modeled as a random effect, that variable is not included.

|                        | Odds ratio| 2.5 %| 97.5 %|
|:-----------------------|----------:|-----:|------:|
|mask_wearingTRUE        |      1.157| 0.703|  1.903|
|duration_of_observation |      1.030| 1.017|  1.043|

Example 3: Asthma and air pollution#

3. Prepare the data for analysis#

# asthma data
# https://chronicdata.cdc.gov/500-Cities-Places/500-Cities-Census-Tract-level-Data-GIS-Friendly-Fo/5mtz-k78d (for 2014)
#
# PM2.5 data:
#
# filtered for 2014 and averaged ds_pm_pred using:
#
# https://data.cdc.gov/Environmental-Health-Toxicology/Daily-Census-Tract-Level-PM2-5-Concentrations-2011/fcqm-xrf4/data


pmdata = read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-figures-data/main/CDC_PM2.5_2014/Daily_Census_Tract-Level_PM2.5_mean_2014.csv') %>%
  rename(TractFIPS=ctfips,
         pm25_mean=ds_pm_pred)

asthmadata = read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-figures-data/main/500cities_disease/acsdata_with_censusdata.csv') %>%
  dplyr::select(-ends_with('_Crude95CI'))

pm_asthma_data = inner_join(asthmadata, pmdata, by='TractFIPS') %>%
  drop_na() %>%
  rename(asthma_prev=CASTHMA_CrudePrev,
         city=PlaceFIPS) %>%
  mutate(MedianIncome = MedianIncome/1000,
         Population2010 = Population2010/1000)
  # mutate(log_asthma_prev = log(asthma_prev))

Rows: 72283 Columns: 2

── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): ctfips, ds_pm_pred

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Rows: 26892 Columns: 66

── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (33): StateAbbr, PlaceName, Place_TractID, ACCESS2_Crude95CI, ARTHRITIS_...
dbl (33): PlaceFIPS, TractFIPS, Population2010, ACCESS2_CrudePrev, ARTHRITIS...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Figure 17.10#

pairs.panels(pm_asthma_data %>% dplyr::select(pm25_mean, asthma_prev, MedianIncome, MedianAge, Population2010))

standard linear regression model

lm_asthma_pm = lm(asthma_prev ~ pm25_mean + MedianIncome +  MedianAge + Population2010, data=pm_asthma_data)
summary(lm_asthma_pm)

Call:
lm(formula = asthma_prev ~ pm25_mean + MedianIncome + MedianAge + 
    Population2010, data = pm_asthma_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1237 -1.2997 -0.2894  1.0326 10.3332 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    12.551556   0.149787  83.796  < 2e-16 ***
pm25_mean       0.032812   0.006198   5.294  1.2e-07 ***
MedianIncome   -0.109583   0.002030 -53.969  < 2e-16 ***
MedianAge       0.004697   0.003908   1.202    0.229    
Population2010 -0.113642   0.005948 -19.105  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.866 on 26676 degrees of freedom
Multiple R-squared:  0.1372,	Adjusted R-squared:  0.1371 
F-statistic:  1061 on 4 and 26676 DF,  p-value: < 2.2e-16

Figure 17.11#

model.data = augment(lm_asthma_pm)

ggplot(model.data, aes(sample=.resid)) +
  geom_qq() +
  geom_qq_line()

Figure 17.12#

model.data = model.data %>%
  mutate(city = pm_asthma_data$city)

model_means_sorted = model.data %>% group_by(city) %>%
  summarize_all(mean) %>%
  arrange(.resid)

p1 = ggplot(model.data, aes(y=.resid, group=factor(city, levels=model_means_sorted$city))) +
  geom_boxplot(outlier.size=0.5, alpha=.2, size=.1)

lmer_asthma_pm = lmer(asthma_prev ~ pm25_mean + MedianAge + MedianIncome + Population2010 + (1 + pm25_mean|city),
                    data=pm_asthma_data)
lmer_asthma_pm_baseline = update(lmer_asthma_pm, formula = ~ . -pm25_mean)

pm_asthma_data = pm_asthma_data %>%
  mutate(lmer_resid = resid(lmer_asthma_pm))

p2 = ggplot(pm_asthma_data, aes(y=lmer_resid, group=city)) +
  geom_boxplot(outlier.size=0.5, alpha=.2, size=.1)

plot_grid(p1, p2, nrow=2)

Warning message in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
“Model failed to converge with max|grad| = 0.0242659 (tol = 0.002, component 1)”

Warning message in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
“Model failed to converge with max|grad| = 0.0102508 (tol = 0.002, component 1)”

linear mixed effect model output:

summary(lmer_asthma_pm)

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: asthma_prev ~ pm25_mean + MedianAge + MedianIncome + Population2010 +  
    (1 + pm25_mean | city)
   Data: pm_asthma_data

REML criterion at convergence: 96063.9

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.3329 -0.6247 -0.0829  0.5462  5.4814 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 city     (Intercept) 158.825  12.603        
          pm25_mean     1.457   1.207   -0.99
 Residual               1.951   1.397        
Number of obs: 26681, groups:  city, 494

Fixed effects:
                 Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)     6.978e+00  1.122e+00  1.770e+02   6.218 3.52e-09 ***
pm25_mean       3.604e-01  7.181e-02  1.487e+02   5.018 1.47e-06 ***
MedianAge       4.965e-02  2.823e-02  1.790e+02   1.759   0.0803 .  
MedianIncome   -1.032e-01  1.430e-02  1.718e+02  -7.219 1.62e-11 ***
Population2010 -2.712e-02  4.796e-03  2.567e+04  -5.655 1.57e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) pm25_m MednAg MdnInc
pm25_mean   -0.646                     
MedianAge   -0.698  0.002              
MedianIncom  0.098 -0.005 -0.506       
Popultn2010 -0.017 -0.006  0.010 -0.016
optimizer (nloptwrap) convergence code: 0 (OK)
Model failed to converge with max|grad| = 0.0242659 (tol = 0.002, component 1)

Table 17.4#

ci = confint(lmer_asthma_pm)
lmer_confint = cbind(fixef(lmer_asthma_pm), ci[5:9,])[2:5, ]

kable(lmer_confint, digits=2, caption='Parameter estimates and confidence intervals for regression parameters from mixed effect model of asthma prevalence.')

Computing profile confidence intervals ...

Table: Parameter estimates and confidence intervals for regression parameters from mixed effect model of asthma prevalence.

|               |      | 2.5 %| 97.5 %|
|:--------------|-----:|-----:|------:|
|pm25_mean      |  0.36|  0.22|   0.50|
|MedianAge      |  0.05| -0.01|   0.10|
|MedianIncome   | -0.10| -0.13|  -0.07|
|Population2010 | -0.03| -0.04|  -0.02|

Compute delta r-squared

r2 = r.squaredGLMM(lmer_asthma_pm)
# r2
r2_baseline = r.squaredGLMM(lmer_asthma_pm_baseline)
# r2_baseline

delta_r2 = r2[1] - r2_baseline[1] [1] # marginal r-squared, reflecting fixed effects only
# delta_r2

lmer_asthma_pm_noincome = update(lmer_asthma_pm, formula = ~ . -MedianIncome)
r2_noincome = r.squaredGLMM(lmer_asthma_pm_noincome)
delta_r2_noincome = r2[1] - r2_noincome[1] [1] # marginal r-squared, reflecting fixed effects only
#delta_r2_noincome

Warning message:
“'r.squaredGLMM' now calculates a revised statistic. See the help page.”

Warning message in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
“Model failed to converge with max|grad| = 0.0073298 (tol = 0.002, component 1)”

Example 4: Response of plants to nitrogen fertilizers and soil tilling#

3. Prepare and visualize the data#

fert_data_all = read_delim('https://raw.githubusercontent.com/statsthinking21/statsthinking21-figures-data/main/fertilizer/Fertsyntraitsall_Jan27_2008.txt', delim='\t')

fert_data = fert_data_all %>%
  dplyr::filter(Rawabundmetric == "Grams biomass") %>%
  dplyr::filter(Site == 'KBS') %>%
  mutate(YearFac = as.factor(Year),
         plotID_common = plotID - Fert,
         Fert = as.factor(Fert),
         log_rawabund = as.numeric(scale(log(Rawabund))))

Warning message:
“One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)”

Rows: 20144 Columns: 25

── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (18): Site, Experiment, Abbreviation, Other, Species_code, USDA_symbol, ...
dbl  (7): Year, Fert, block, plotID, Rawabund, Relabund, Plotsize

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Figure 17.13#

ggplot(fert_data, aes(Rawabund)) +
  geom_histogram(bins=100) +
  xlab('Raw crop abundance')

Simple linear model output:

lm.result_fert = lm(Rawabund ~ Fert*Experiment + Year, data=fert_data)
summary(lm.result_fert)

Call:
lm(formula = Rawabund ~ Fert * Experiment + Year, data = fert_data)

Residuals:
    Min      1Q  Median      3Q     Max 
 -95.64  -48.97  -30.95  -12.65 1284.57 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)              1154.5206  1685.2301   0.685 0.493353    
Fert1                      57.1985     7.6080   7.518 7.59e-14 ***
ExperimentUntilled         -5.3171     6.5679  -0.810 0.418266    
Year                       -0.5602     0.8438  -0.664 0.506796    
Fert1:ExperimentUntilled  -35.4946     9.9328  -3.573 0.000359 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 124.1 on 2610 degrees of freedom
Multiple R-squared:  0.03048,	Adjusted R-squared:  0.029 
F-statistic: 20.52 on 4 and 2610 DF,  p-value: < 2.2e-16

Figure 17.14#

model.data = augment(lm.result_fert) %>%
  mutate(plotID_common = fert_data$plotID_common,
         Species_code = fert_data$Species_code)

p1 = ggplot(model.data, aes(sample=.resid)) + geom_qq() + geom_qq_line()

lmer.result_fert = lmer(log_rawabund ~ Fert*Experiment + Year + (1 + Fert|plotID_common) + (1 + Fert|Species_code) , data=fert_data)

fert_data = fert_data %>%
  mutate(lmer_resid = resid(lmer.result_fert))

p2 = ggplot(fert_data, aes(sample=lmer_resid)) + geom_qq() + geom_qq_line()
plot_grid(p1, p2)

Figure 17.15#

model_means_sorted = model.data %>% group_by(Species_code) %>%
  summarize_all(mean) %>%
  arrange(.resid)
ggplot(model.data, aes(y=.resid, group=factor(Species_code, levels=model_means_sorted$Species_code))) + geom_boxplot()

Warning message:
“There were 264 warnings in `summarise()`.
The first warning was:
ℹ In argument: `Fert = (new("standardGeneric", .Data = function (x, ...) ...`.
ℹ In group 1: `Species_code = "ABUTH"`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA
ℹ Run `dplyr::last_dplyr_warnings()` to see the 263 remaining warnings.”

Linear mixed effects model output:

summary(lmer.result_fert)

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: 
log_rawabund ~ Fert * Experiment + Year + (1 + Fert | plotID_common) +  
    (1 + Fert | Species_code)
   Data: fert_data

REML criterion at convergence: 6458.2

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.9539 -0.6286  0.1066  0.7031  2.7829 

Random effects:
 Groups        Name        Variance Std.Dev. Corr 
 Species_code  (Intercept) 0.30960  0.55642       
               Fert1       0.07358  0.27126  0.01 
 plotID_common (Intercept) 0.01571  0.12533       
               Fert1       0.00420  0.06481  -0.71
 Residual                  0.61678  0.78535       
Number of obs: 2615, groups:  Species_code, 132; plotID_common, 12

Fixed effects:
                           Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)               1.471e+01  1.129e+01  2.547e+03   1.303 0.192682    
Fert1                     2.825e-01  7.713e-02  3.155e+01   3.662 0.000908 ***
ExperimentUntilled        3.711e-01  9.278e-02  1.333e+01   4.000 0.001443 ** 
Year                     -7.643e-03  5.653e-03  2.546e+03  -1.352 0.176498    
Fert1:ExperimentUntilled -3.261e-01  9.019e-02  2.057e+01  -3.615 0.001665 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) Fert1  ExprmU Year  
Fert1       -0.012                     
ExprmntUntl -0.055  0.328              
Year        -1.000  0.009  0.051       
Frt1:ExprmU -0.006 -0.717 -0.529  0.009

Effect sizes

emm = emmeans(lmer.result_fert, pairwise ~ Fert*Experiment)
contrast(emm, 'tukey')

 contrast                        estimate     SE   df t.ratio p.value
 Fert0 Tilled - Fert1 Tilled      -0.2825 0.0787 33.1  -3.587  0.0056
 Fert0 Tilled - Fert0 Untilled    -0.3711 0.0932 14.3  -3.983  0.0063
 Fert0 Tilled - Fert1 Untilled    -0.3275 0.0931 27.9  -3.516  0.0078
 Fert1 Tilled - Fert0 Untilled    -0.0886 0.1001 36.4  -0.885  0.8126
 Fert1 Tilled - Fert1 Untilled    -0.0450 0.0897 23.3  -0.502  0.9577
 Fert0 Untilled - Fert1 Untilled   0.0436 0.0650 19.0   0.671  0.9069

Degrees-of-freedom method: kenward-roger 
P value adjustment: tukey method for comparing a family of 4 estimates 

Compute delta r-squared values

lmer.result_fert_noint = update(lmer.result_fert, formula = ~ . -Fert:Experiment)
lmer.result_fert_null = update(lmer.result_fert, formula = ~ . -Fert*Experiment)

r2 = r.squaredGLMM(lmer.result_fert)
#r2
r2_noint = r.squaredGLMM(lmer.result_fert_noint)
#r2_noint
r2_null = r.squaredGLMM(lmer.result_fert_null)
#r2_null

delta_r2_int = r2[1] - r2_noint[1] [1] # marginal r-squared, reflecting fixed effects only
#delta_r2_int

delta_r2_full = r2[1] - r2_null[1] [1] # marginal r-squared, reflecting fixed effects only
#delta_r2_full

R Code and figures for Statistical Thinking: Analyzing Data in an Uncertain World

Chapter 17: Practical statistical modeling

Contents