Chapter 12 Modeling continuous relationships in R

12.1 Computing covariance and correlation (Section 12.1)

Let’s first look at our toy example of covariance and correlation. We start by generating a set of X values.

library(tidyverse)  # tibble(), mutate(), %>%, drop_na()
library(knitr)      # kable()

df <-
  tibble(x = c(3, 5, 8, 10, 12))

Then we create a related Y variable by adding some random noise to the X variable:
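The step described above might look like the following minimal sketch. The seed, the Gaussian noise, and its standard deviation (sd = 2) are assumptions made for illustration, so the resulting y values, and therefore the covariance and correlation reported below, will not necessarily match the ones shown in this section.

```r
library(tibble)
library(dplyr)

# X values from the text
df <- tibble(x = c(3, 5, 8, 10, 12))

# Assumed for illustration: fixed seed and Gaussian noise with sd = 2
set.seed(123)
df <- df %>%
  mutate(y = x + rnorm(n = n(), mean = 0, sd = 2))

df
```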

We compute the deviations and multiply them together to get the crossproduct:
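A sketch of this step, assuming the deviations are taken from each variable's mean. The column names x_dev and y_dev are hypothetical; crossproduct is the name used by the code that follows. The toy data are rebuilt here with an assumed seed and noise level so that the block runs on its own.

```r
library(tibble)
library(dplyr)

# Rebuild the toy data so this block runs on its own
# (seed and noise level are illustrative assumptions)
set.seed(123)
df <- tibble(x = c(3, 5, 8, 10, 12)) %>%
  mutate(y = x + rnorm(n = n(), mean = 0, sd = 2))

# Deviation of each value from its variable's mean, then their product
df <- df %>%
  mutate(
    x_dev = x - mean(x),
    y_dev = y - mean(y),
    crossproduct = x_dev * y_dev
  )

# Sanity check: summed crossproducts over (n - 1) should equal cov(x, y)
all.equal(sum(df$crossproduct) / (nrow(df) - 1), cov(df$x, df$y))
```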

And then we compute the covariance and correlation:

results_df <- tibble(
  covXY = sum(df$crossproduct) / (nrow(df) - 1),
  corXY = sum(df$crossproduct) /
    ((nrow(df) - 1) * sd(df$x) * sd(df$y))
)

kable(results_df)

covXY	corXY
17.05	0.894782

12.2 Hate crime example

Now we will look at the hate crime data from the fivethirtyeight package. First we need to prepare the data by getting rid of NA values and creating abbreviations for the states. To do the latter, we use the state.abb and state.name variables that come with R along with the match() function that will match the state names in the hate_crimes variable to those in the list.

library(fivethirtyeight)  # provides the hate_crimes data frame

hateCrimes <-
  hate_crimes %>%
  mutate(state_abb = state.abb[match(state, state.name)]) %>%
  drop_na(avg_hatecrimes_per_100k_fbi, gini_index)

# manually fix the DC abbreviation (DC is not in state.name, so match() returns NA)
hateCrimes$state_abb[hateCrimes$state == "District of Columbia"] <- "DC"
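The output below has the form printed by cor.test(); its data line and its "greater than 0" alternative point to a one-sided test on these two columns. A sketch of a call that yields output of this form follows. Storing the result in corr_results matches the name used later in this section; the exists() guard that rebuilds hateCrimes is an addition so that the block runs on its own.

```r
library(fivethirtyeight)  # provides the hate_crimes data frame
library(dplyr)
library(tidyr)

# Rebuild the prepared data only if the preparation step has not been run
if (!exists("hateCrimes")) {
  hateCrimes <- hate_crimes %>%
    drop_na(avg_hatecrimes_per_100k_fbi, gini_index)
}

# One-sided Pearson correlation test (H1: true correlation > 0)
corr_results <- cor.test(
  hateCrimes$avg_hatecrimes_per_100k_fbi,
  hateCrimes$gini_index,
  alternative = "greater"
)

corr_results
```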

## 
##  Pearson's product-moment correlation
## 
## data:  hateCrimes$avg_hatecrimes_per_100k_fbi and hateCrimes$gini_index
## t = 3.2182, df = 48, p-value = 0.001157
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
##  0.2063067 1.0000000
## sample estimates:
##       cor 
## 0.4212719

Remember that we can also compute the p-value using randomization. To do this, we shuffle the order of one of the variables so that we break the link between the X and Y variables, effectively making the null hypothesis (that the correlation is less than or equal to zero) true. Here we will first create a function that takes in two variables, shuffles the order of one of them (without replacement), and then returns the correlation between that shuffled variable and the original copy of the second variable.
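One way such a function might be written is sketched here. The function name shuffleCorr, the seed, and the number of shuffles (nRuns) are assumptions for illustration; the result is stored as shuffleDist with a column named cor because that is what the comparison later in this section expects. Since the seed and number of shuffles are assumed, the resulting p-value will not exactly reproduce the one printed in this section.

```r
library(fivethirtyeight)  # provides the hate_crimes data frame
library(dplyr)
library(tidyr)
library(tibble)

# Rebuild the prepared data only if the preparation step has not been run
if (!exists("hateCrimes")) {
  hateCrimes <- hate_crimes %>%
    drop_na(avg_hatecrimes_per_100k_fbi, gini_index)
}

# Shuffle x without replacement and correlate it with the unshuffled y
shuffleCorr <- function(x, y) {
  xShuffled <- sample(x)  # sample() with default arguments permutes x
  cor(xShuffled, y)
}

# Assumed for illustration: seed and number of shuffles
set.seed(12345)
nRuns <- 5000
shuffleDist <- tibble(
  cor = replicate(
    nRuns,
    shuffleCorr(hateCrimes$avg_hatecrimes_per_100k_fbi, hateCrimes$gini_index)
  )
)
```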

Now we take the distribution of observed correlations after shuffling and compare them to our observed correlation, in order to obtain the empirical probability of our observed data under the null hypothesis.

mean(shuffleDist$cor > corr_results$estimate)

## [1] 0.0066

This value is somewhat larger than the one obtained using cor.test() (0.0066 versus 0.0012), though both are well below 0.05.

12.3 Robust correlations (Section 12.3)

In the previous chapter we also saw that the hate crime data contained one substantial outlier, which appeared to drive the significant correlation. To compute the Spearman correlation, we first need to convert the data into their ranks, which we can do using the rank() function (order() would instead return the indices that sort the data, which are not ranks):

hateCrimes <- hateCrimes %>%
  mutate(hatecrimes_rank = rank(avg_hatecrimes_per_100k_fbi),
         gini_rank = rank(gini_index))

We can then compute the Spearman correlation by applying the Pearson correlation to the rank variables:

cor(hateCrimes$hatecrimes_rank,
    hateCrimes$gini_rank)

## [1] 0.03261836

We see that this is much smaller than the value obtained using the Pearson correlation on the original data. We can assess its statistical significance using randomization:
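A minimal sketch of such a randomization test is given here. It assumes that ranks are computed with rank() and that the shuffled correlations are compared against the rank correlation itself rather than the Pearson estimate. The variable names, the seed, and the number of shuffles are hypothetical choices for illustration.

```r
library(fivethirtyeight)  # provides the hate_crimes data frame
library(dplyr)
library(tidyr)

# Rebuild the prepared data only if the preparation step has not been run
if (!exists("hateCrimes")) {
  hateCrimes <- hate_crimes %>%
    drop_na(avg_hatecrimes_per_100k_fbi, gini_index)
}

# Ranks of each variable (tied values receive the average rank)
hc_rank <- rank(hateCrimes$avg_hatecrimes_per_100k_fbi)
gini_rank <- rank(hateCrimes$gini_index)

# The Pearson correlation of the ranks is the Spearman correlation
spearman_r <- cor(hc_rank, gini_rank)
all.equal(
  spearman_r,
  cor(hateCrimes$avg_hatecrimes_per_100k_fbi, hateCrimes$gini_index,
      method = "spearman")
)

# Null distribution: shuffle one set of ranks and recompute the correlation
set.seed(12345)  # assumed seed
nRuns <- 5000    # assumed number of shuffles
shuffled_r <- replicate(nRuns, cor(sample(hc_rank), gini_rank))

# One-sided empirical p-value for the rank correlation
p_rank <- mean(shuffled_r >= spearman_r)
p_rank
```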

## [1] 0.0014

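Because the rank correlation is close to zero, we would expect its p-value to be substantially larger than the one for the Pearson correlation and far from significance. The value printed above does not fit that expectation; a p-value this small would be consistent with the shuffled correlations having been compared against the Pearson estimate (corr_results$estimate) rather than against the rank correlation.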