Chapter 6 Sampling in R

First we load the necessary libraries and set up the NHANES adult dataset.
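A minimal sketch of that setup is shown here. It assumes the NHANES, dplyr, and ggplot2 packages are installed. The name NHANES_adult, the cutoff of 18 years, and the choice to keep one row per participant ID are assumptions made for illustration, not necessarily the exact preprocessing used elsewhere in the book.

```r
library(dplyr)
library(ggplot2)
library(NHANES)

# Assumption: the NHANES data frame contains repeated rows for some
# participants, so keep only the first row for each ID.
# Assumption: "adult" means 18 years or older.
NHANES_adult <- NHANES %>%
  distinct(ID, .keep_all = TRUE) %>%
  filter(Age >= 18)

# Quick check of how many adults remain
nrow(NHANES_adult)
```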

6.2 Central limit theorem

The central limit theorem tells us that the sampling distribution of the mean becomes approximately normal as the sample size grows. Let’s test this by sampling a clearly non-normal variable and looking at the normality of the resulting sample means using a Q-Q plot. We saw in Figure ?? that the variable AlcoholYear is distributed in a very non-normal way. Let’s first look at the Q-Q plot for these data, to see what it looks like. We will use the stat_qq() function from ggplot2 to create the plot for us.
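One way this plot might be produced is sketched below. The adult subset is rebuilt inside the block so that it runs on its own. The object names alcohol_df and qq_raw are hypothetical, and stat_qq_line() is added here only to draw a reference line for comparison.

```r
library(dplyr)
library(ggplot2)
library(NHANES)

# Assumption: adults are participants aged 18+, one row per ID,
# and missing AlcoholYear values are dropped before plotting.
alcohol_df <- NHANES %>%
  distinct(ID, .keep_all = TRUE) %>%
  filter(Age >= 18, !is.na(AlcoholYear))

# Q-Q plot of the raw data against a normal distribution
qq_raw <- ggplot(alcohol_df, aes(sample = AlcoholYear)) +
  stat_qq() +
  stat_qq_line() +
  labs(x = "Theoretical normal quantiles",
       y = "Observed AlcoholYear quantiles")

print(qq_raw)
```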

We can see from this figure that the distribution is highly non-normal, as the points in the Q-Q plot diverge substantially from the straight reference line.

Now let’s repeatedly sample and compute the mean, and look at the resulting Q-Q plot. We will take samples of various sizes to see the effect of sample size. We will use a function from the dplyr package called do(), which applies a function to each group of a data frame and so can run a large number of analyses at once.
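The repeated sampling could be sketched as follows. The sample sizes, the number of runs per size, the seed, and the helper get_sample_mean() are all assumptions chosen for illustration. Each row of input_df describes one simulated sample, and do() calls the helper once per row.

```r
library(dplyr)
library(NHANES)

set.seed(123456)  # arbitrary seed so the sketch is reproducible

# Assumption: adults are aged 18+, one row per ID, missing values dropped
alcohol <- NHANES %>%
  distinct(ID, .keep_all = TRUE) %>%
  filter(Age >= 18, !is.na(AlcoholYear)) %>%
  pull(AlcoholYear)

sample_sizes <- c(16, 32, 64, 128)  # hypothetical sample sizes
n_runs <- 500                       # hypothetical number of samples per size

# Hypothetical helper: draw one sample of size n (without replacement)
# and return its mean in a one-row data frame
get_sample_mean <- function(n, values) {
  tibble(sample_size = n, mean_alcohol = mean(sample(values, n)))
}

# One row per simulated sample
input_df <- tibble(
  run_id = seq_len(length(sample_sizes) * n_runs),
  sample_size = rep(sample_sizes, each = n_runs)
)

# do() runs the helper once for each group, here one group per run_id
sample_means <- input_df %>%
  group_by(run_id) %>%
  do(get_sample_mean(.$sample_size, alcohol)) %>%
  ungroup()

# The spread of the sample means should shrink as sample size grows
sample_means %>%
  group_by(sample_size) %>%
  summarise(n_samples = n(), sd_of_means = sd(mean_alcohol))
```

Note that do() is marked as superseded in recent versions of dplyr. It still works, but the same loop could also be written with rowwise() and mutate().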

Now let’s create separate Q-Q plots for the different sample sizes.
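A sketch of the faceted plot follows. To keep the block runnable on its own, the sample means are regenerated here in compact form with rowwise() rather than do(). The sample sizes, number of runs, and object names are again assumptions for illustration.

```r
library(dplyr)
library(ggplot2)
library(NHANES)

set.seed(123456)  # arbitrary seed so the sketch is reproducible

# Assumption: adults are aged 18+, one row per ID, missing values dropped
alcohol <- NHANES %>%
  distinct(ID, .keep_all = TRUE) %>%
  filter(Age >= 18, !is.na(AlcoholYear)) %>%
  pull(AlcoholYear)

# Regenerate the sample means: 500 samples at each hypothetical size
sample_means <- expand.grid(run = 1:500, sample_size = c(16, 32, 64, 128)) %>%
  rowwise() %>%
  mutate(mean_alcohol = mean(sample(alcohol, sample_size))) %>%
  ungroup()

# One Q-Q panel per sample size; free scales let each panel
# use its own axis range
qq_by_size <- ggplot(sample_means, aes(sample = mean_alcohol)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~ sample_size, scales = "free") +
  labs(x = "Theoretical normal quantiles",
       y = "Quantiles of sample means")

print(qq_by_size)
```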

This shows that the sample means become more nearly normally distributed (i.e., the points follow the straight line more closely) as the samples get larger.

6.3 Confidence intervals (Section 9.1)

Remember that confidence intervals are intervals that will contain the population parameter in a certain proportion of repeated samples. In this example we will walk through the simulation that was presented in Section 9.1 to show that this actually works properly. Here we will use a function called do() that lets us