Chapter 14 Comparing means in R

14.1 Testing the value of a single mean (Section 14.1)

In this example, we will show multiple ways to test a hypothesis about the value of a single mean. As an example, let’s test whether the mean systolic blood pressure (BP) in the NHANES dataset (averaged over the three measurements that were taken for each person) is greater than 120 mm, which is the standard value for normal systolic BP.

First let’s perform a power analysis to see how large our sample would need to be in order to detect a small difference (Cohen’s d = .2).

pwr.result <- pwr.t.test(d=0.2, power=0.8, 
           type='one.sample', 
           alternative='greater')
pwr.result
## 
##      One-sample t test power calculation 
## 
##               n = 155.9256
##               d = 0.2
##       sig.level = 0.05
##           power = 0.8
##     alternative = greater

Based on this, we take a sample of 156 individuals from the dataset.

NHANES_BP_sample <- NHANES_adult %>%
  drop_na(BPSysAve) %>%
  dplyr::select(BPSysAve) %>%
  sample_n(pwr.result$n)

print('Mean BP:')
## [1] "Mean BP:"
meanBP <- NHANES_BP_sample %>%
  summarize(meanBP=mean(BPSysAve)) %>%
  pull()
meanBP
## [1] 123.1097

First let’s perform a sign test to see whether the observed mean of 123.11 is significantly different from zero. To do this, we count the number of values that are greater than the hypothesized mean, and then use a binomial test to ask how surprising that number is if the true proportion is 0.5 (as it would be if the distribution were centered at the hypothesized mean).

NHANES_BP_sample <- NHANES_BP_sample %>%
  mutate(BPover120=BPSysAve>120)

nOver120 <- NHANES_BP_sample %>%
  summarize(nOver120=sum(BPover120)) %>%
  pull()

binom.test(nOver120, nrow(NHANES_BP_sample), alternative='greater')
## 
##  Exact binomial test
## 
## data:  nOver120 and nrow(NHANES_BP_sample)
## number of successes = 84, number of trials = 155, p-value = 0.1676
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
##  0.4727095 1.0000000
## sample estimates:
## probability of success 
##              0.5419355

This shows no significant difference. Next let’s perform a one-sample t-test:

t.test(NHANES_BP_sample$BPSysAve, mu=120, alternative='greater')
## 
##  One Sample t-test
## 
## data:  NHANES_BP_sample$BPSysAve
## t = 2.2045, df = 154, p-value = 0.01449
## alternative hypothesis: true mean is greater than 120
## 95 percent confidence interval:
##  120.7754      Inf
## sample estimates:
## mean of x 
##  123.1097

Here we see that the difference is not statistically signficant. Finally, we can perform a randomization test to test the hypothesis. Under the null hypothesis we would expect roughly half of the differences from the expected mean to be positive and half to be negative (assuming the distribution is centered around the mean), so we can cause the null hypothesis to be true on average by randomly flipping the signs of the differences.

nruns = 5000

# create a function to compute the 
# t on the shuffled values 
shuffleOneSample <- function(x,mu) {
  # randomly flip signs
  flip <- runif(length(x))>0.5
  diff <- x - mu
  diff[flip]=-1*diff[flip]
  # compute and return correlation 
  # with shuffled variable
  return(tibble(meanDiff=mean(diff)))
}

index_df <- tibble(id=seq(nruns)) %>%
  group_by(id)

shuffle_results <- index_df %>%
  do(shuffleOneSample(NHANES_BP_sample$BPSysAve,120))

observed_diff <- mean(NHANES_BP_sample$BPSysAve-120)
p_shuffle <- mean(shuffle_results$meanDiff>observed_diff)
p_shuffle
## [1] 0.014

This gives us a very similar p value to the one observed with the standard t-test.

We might also want to quantify the degree of evidence in favor of the null hypothesis, which we can do using the Bayes Factor:

ttestBF(NHANES_BP_sample$BPSysAve,
        mu=120,  
        nullInterval = c(-Inf, 0))
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 -Inf<d<0    : 0.02856856 ±0.04%
## [2] Alt., r=0.707 !(-Inf<d<0) : 1.848811   ±0%
## 
## Against denominator:
##   Null, mu = 120 
## ---
## Bayes factor type: BFoneSample, JZS

This tells us that our result doesn’t provide particularly strong evidence for either the null or alternative hypothesis; that is, it’s inconclusive.

14.2 Comparing two means (Section 14.2)

To compare two means from independent samples, we can use the two-sample t-test. Let’s say that we want to compare blood pressure of smokers and non-smokers; we don’t have an expectation for the direction, so we will use a two-sided test. First let’s perform a power analysis, again for a small effect:

power_results_2sample <- pwr.t.test(d=0.2, power=0.8,
                                    type='two.sample'
                                    )
power_results_2sample
## 
##      Two-sample t test power calculation 
## 
##               n = 393.4057
##               d = 0.2
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

This tells us that we need 394 subjects in each group, so let’s sample 394 smokers and 394 nonsmokers from the NHANES dataset, and then put them into a single data frame with a variable denoting their smoking status.

nonsmoker_df <- NHANES_adult %>%
  dplyr::filter(SmokeNow=="Yes") %>%
  drop_na(BPSysAve) %>%
  dplyr::select(BPSysAve,SmokeNow) %>%
  sample_n(power_results_2sample$n)

smoker_df <- NHANES_adult %>%
  dplyr::filter(SmokeNow=="No") %>%
  drop_na(BPSysAve) %>%
  dplyr::select(BPSysAve,SmokeNow) %>%
  sample_n(power_results_2sample$n)

sample_df <- smoker_df %>%
  bind_rows(nonsmoker_df)

Let’s test our hypothesis using a standard two-sample t-test. We can use the formula notation to specify the analysis, just like we would for lm().

t.test(BPSysAve ~ SmokeNow, data=sample_df)
## 
##  Welch Two Sample t-test
## 
## data:  BPSysAve by SmokeNow
## t = 4.2105, df = 774.94, p-value = 2.848e-05
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##  2.850860 7.831074
## sample estimates:
##  mean in group No mean in group Yes 
##          125.1603          119.8193

This shows us that there is a significant difference, though the direction is surprising: Smokers have lower blood pressure!

Let’s look at the Bayes factor to quantify the evidence:

sample_df <- sample_df %>%
  mutate(SmokeNowInt=as.integer(SmokeNow))
ttestBF(formula=BPSysAve ~ SmokeNowInt, 
        data=sample_df)
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 : 440.269 ±0%
## 
## Against denominator:
##   Null, mu1-mu2 = 0 
## ---
## Bayes factor type: BFindepSample, JZS

This shows that there is very strong evidence against the null hypothesis of no difference.

14.3 The t-test as a linear model (Section 14.3)

We can also use lm() to implement these t-tests.

The one-sample t-test is basically a test for whether the intercept is different from zero, so we use a model with only an intercept and apply this to the data after subtracting the null hypothesis mean (so that the expectation under the null hypothesis is an intercept of zero):

NHANES_BP_sample <- NHANES_BP_sample %>%
  mutate(BPSysAveDiff = BPSysAve-120)
lm_result <- lm(BPSysAveDiff ~ 1, data=NHANES_BP_sample)
summary(lm_result)
## 
## Call:
## lm(formula = BPSysAveDiff ~ 1, data = NHANES_BP_sample)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -36.11 -13.11  -1.11   9.39  67.89 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    3.110      1.411   2.205    0.029 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.56 on 154 degrees of freedom

You will notice that this p-value is twice as big as the one obtained from the one-sample t-test above; this is because that was a one-tailed test, while lm() is performing a two-tailed test.

We can also run the two-sample t-test using lm():

lm_ttest_result <- lm(BPSysAve ~ SmokeNow, data=sample_df)
summary(lm_ttest_result)
## 
## Call:
## lm(formula = BPSysAve ~ SmokeNow, data = sample_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.16 -11.16  -2.16   8.84 101.18 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  125.160      0.897  139.54  < 2e-16 ***
## SmokeNowYes   -5.341      1.268   -4.21 2.84e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.78 on 784 degrees of freedom
## Multiple R-squared:  0.02211,    Adjusted R-squared:  0.02086 
## F-statistic: 17.73 on 1 and 784 DF,  p-value: 2.844e-05

This gives the same p-value for the SmokeNowYes variable as it did for the two-sample t-test above.

14.4 Comparing paired observations (Section 14.4)

Let’s look at how to perform a paired t-test in R. In this case, let’s generate some data for a set of individuals on two tests, where each indivdual varies in their overall ability, but there is also a practice effect such that performance on the second test is generally better than the first.

First, let’s see how big of a sample we will require to find a medium (d=0.5) sized effect. Let’s say that we want to be extra sure in our results, so we will find the sample size that gives us 95% power to find an effect if it’s there:

paired_power <- pwr.t.test(d=0.5, power=0.95, type='paired', alternative='greater')
paired_power
## 
##      Paired t test power calculation 
## 
##               n = 44.67998
##               d = 0.5
##       sig.level = 0.05
##           power = 0.95
##     alternative = greater
## 
## NOTE: n is number of *pairs*

Now let’s generate a dataset with the required number of subjects:

subject_id <- seq(paired_power$n)
# we code the tests as 0/1 so that we can simply
# multiply this by the effect to generate the data
test_id <- c(0,1)
repeat_effect <- 5
noise_sd <- 5

subject_means <- rnorm(paired_power$n, mean=100, sd=15)
paired_data <- crossing(subject_id,test_id) %>%
  mutate(subMean=subject_means[subject_id],
         score=subject_means + 
           test_id*repeat_effect + 
           rnorm(paired_power$n, mean=noise_sd))

Let’s perform a paired t-test on these data. To do that, we need to separate the first and second test data into separate variables, which we can do by converting our long data frame into a wide data frame.

paired_data_wide <- paired_data %>%
  spread(test_id, score) %>%
  rename(test1=`0`,
         test2=`1`)

glimpse(paired_data_wide)
## Rows: 44
## Columns: 4
## $ subject_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
## $ subMean    <dbl> 116.45972, 94.94520, 103.11749, 91.0816…
## $ test1      <dbl> 120.53690, 107.58688, 101.69392, 94.071…
## $ test2      <dbl> 104.44096, 101.07566, 101.86726, 106.67…

Now we can pass those new variables into the t.test() function:

paired_ttest_result <- t.test(paired_data_wide$test1,
                              paired_data_wide$test2,
                              type='paired')
paired_ttest_result 
## 
##  Welch Two Sample t-test
## 
## data:  paired_data_wide$test1 and paired_data_wide$test2
## t = -1.2799, df = 73.406, p-value = 0.2046
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.536522   2.295414
## sample estimates:
## mean of x mean of y 
##  107.5863  111.7068

This analysis is a bit trickier to perform using the linear model, because we need to estimate a separate intercept for each subject in order to account for the overall differences between subjects. We can’t do this using lm() but we can do it using a function called lmer() from the lme4 package. To do this, we need to add (1|subject_id) to the formula, which tells lmer() to add a separate intercept (“1”) for each value of subject_id.

paired_test_lmer <- lmer(score ~ test_id + (1|subject_id),
                         data=paired_data)
summary(paired_test_lmer)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: score ~ test_id + (1 | subject_id)
##    Data: paired_data
## 
## REML criterion at convergence: 718.6
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.54238 -0.62142 -0.09286  0.73490  2.97925 
## 
## Random effects:
##  Groups     Name        Variance Std.Dev.
##  subject_id (Intercept)   0       0.0    
##  Residual               228      15.1    
## Number of obs: 88, groups:  subject_id, 44
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)  107.586      2.277  86.000   47.26   <2e-16 ***
## test_id        4.121      3.220  86.000    1.28    0.204    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##         (Intr)
## test_id -0.707
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')

This gives a similar answer to the standard paired t-test. The advantage is that it’s more flexible, allowing us to perform repeated measures analyses, as we will see below.

14.5 Analysis of variance (Section 14.5)

Often we want to compare several different means, to determine whether any of them are different from the others. In this case, let’s look at the data from NHANES to determine whether Marital Status is related to sleep quality. First we clean up the data:

NHANES_sleep_marriage <- 
  NHANES_adult %>%
  dplyr::select(SleepHrsNight, MaritalStatus, Age) %>%
  drop_na()

In this case we are going to treat the full NHANES dataset as our sample, with the goal of generalizing to the entire US population (from which the NHANES dataset is mean to be a representative sample). First let’s look at the distribution of the different values of the MaritalStatus variable:

NHANES_sleep_marriage %>%
  group_by(MaritalStatus) %>%
  summarize(n=n()) %>%
  kable()
MaritalStatus n
Divorced 437
LivePartner 370
Married 2434
NeverMarried 889
Separated 134
Widowed 329

There are reasonable numbers of most of these categories, but let’s remove the Separated category since it has relatively few members:

NHANES_sleep_marriage <-
  NHANES_sleep_marriage %>%
  dplyr::filter(MaritalStatus!="Separated")

Now let’s use lm() to perform an analysis of variance. Since we also suspect that Age is related to the amount of sleep, we will also include Age in the model.

lm_sleep_marriage <- lm(SleepHrsNight ~ MaritalStatus + Age,
                        data=NHANES_sleep_marriage)
summary(lm_sleep_marriage)
## 
## Call:
## lm(formula = SleepHrsNight ~ MaritalStatus + Age, data = NHANES_sleep_marriage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0160 -0.8797  0.1065  1.0821  5.2821 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               6.517579   0.098024  66.490  < 2e-16 ***
## MaritalStatusLivePartner  0.143733   0.098692   1.456 0.145360    
## MaritalStatusMarried      0.234936   0.070937   3.312 0.000934 ***
## MaritalStatusNeverMarried 0.251721   0.084036   2.995 0.002756 ** 
## MaritalStatusWidowed      0.263035   0.103270   2.547 0.010896 *  
## Age                       0.003180   0.001415   2.248 0.024643 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.365 on 4453 degrees of freedom
## Multiple R-squared:  0.004583,   Adjusted R-squared:  0.003465 
## F-statistic:   4.1 on 5 and 4453 DF,  p-value: 0.001024

This tells us that there is a highly significant effect of marital status (based on the F test), though it accounts for a very small amount of variance (less than 1%).

It’s also useful to look in more detail at which groups differ from which others, which we can do by examining the estimated marginal means for each group using the emmeans() function.

# compute the differences between each of the means
leastsquare <- emmeans(lm_sleep_marriage, 
                      pairwise ~ MaritalStatus,
                      adjust="tukey")
 
# display the results by grouping using letters

cld(leastsquare$emmeans, 
    alpha=.05,  
    Letters=letters)
##  MaritalStatus emmean     SE   df lower.CL upper.CL .group
##  Divorced        6.67 0.0656 4453     6.54     6.80  a    
##  LivePartner     6.81 0.0725 4453     6.67     6.95  ab   
##  Married         6.90 0.0280 4453     6.85     6.96   b   
##  NeverMarried    6.92 0.0501 4453     6.82     7.02   b   
##  Widowed         6.93 0.0823 4453     6.77     7.09  ab   
## 
## Confidence level used: 0.95 
## P value adjustment: tukey method for comparing a family of 5 estimates 
## significance level used: alpha = 0.05 
## NOTE: If two or more means share the same grouping symbol,
##       then we cannot show them to be different.
##       But we also did not show them to be the same.

The letters in the group column tell us which individual conditions differ from which others; any pair of conditions that don’t share a group identifier (in this case, the letters a and b) are significantly different from one another. In this case, we see that Divorced people sleep less than Married or Widowed individuals; no other pairs differ significantly.

14.5.1 Repeated measures analysis of variance

The standard analysis of variance assumes that the observations are independent, which should be true for different people in the NHANES dataset, but may not be true if the data are based on repeated measures of the same individual. For example, the NHANES dataset involves three measurements of blood pressure for each individual. If we want to test whether there are any differences between those, then we would need to use a repeated measures analysis of variance. We can do this using lmer() as we did above. First, we need to create a “long” version of the dataset.

NHANES_bp_all <- NHANES_adult %>%
  drop_na(BPSys1,BPSys2,BPSys3) %>%
  dplyr::select(BPSys1,BPSys2,BPSys3, ID) %>%
  gather(test, BPsys, -ID)

Then we fit a model that includes a separate intercept for each individual.

repeated_lmer <-lmer(BPsys ~ test + (1|ID), data=NHANES_bp_all)
summary(repeated_lmer)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: BPsys ~ test + (1 | ID)
##    Data: NHANES_bp_all
## 
## REML criterion at convergence: 89301
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.5475 -0.5125 -0.0053  0.4948  4.1339 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ID       (Intercept) 280.9    16.761  
##  Residual              16.8     4.099  
## Number of obs: 12810, groups:  ID, 4270
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)  122.0037     0.2641 4605.7049  462.04   <2e-16 ***
## testBPSys2    -0.9283     0.0887 8538.0000  -10.47   <2e-16 ***
## testBPSys3    -1.6216     0.0887 8538.0000  -18.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr) tsBPS2
## testBPSys2 -0.168       
## testBPSys3 -0.168  0.500

This shows us that the second and third tests are significant different from the first test (which was automatically assigned as the baseline by lmer()). We might also want to know whether there is an overall effect of test. We can determine this by comparing the fit of our model to the fit of a model that does not include the test variable, which we will fit here. We then compare the models using the anova() function, which performs an F test to compare the two models.

repeated_lmer_baseline <-lmer(BPsys ~ (1|ID), data=NHANES_bp_all)
anova(repeated_lmer,repeated_lmer_baseline)
## Data: NHANES_bp_all
## Models:
## repeated_lmer_baseline: BPsys ~ (1 | ID)
## repeated_lmer: BPsys ~ test + (1 | ID)
##                        npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)
## repeated_lmer_baseline    3 89630 89652 -44812    89624                     
## repeated_lmer             5 89304 89341 -44647    89294 330.15  2  < 2.2e-16
##                           
## repeated_lmer_baseline    
## repeated_lmer          ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This shows that blood pressure differs significantly across the three tests.