Chapter 7 Resampling and simulation in R
In this chapter we will use R to understand how to resample data and perform numerical simulations.
7.1 Generating random samples (Section 7.1)
Here we will generate random samples from a number of different distributions and plot their histograms.
library(tidyverse) # tibble(), %>%, ggplot()
library(cowplot)   # plot_grid()

nsamples <- 10000
nhistbins <- 100
# uniform distribution
p1 <-
  tibble(
    x = runif(nsamples)
  ) %>%
  ggplot(aes(x)) +
  geom_histogram(bins = nhistbins) +
  labs(title = "Uniform")
# normal distribution
p2 <-
  tibble(
    x = rnorm(nsamples)
  ) %>%
  ggplot(aes(x)) +
  geom_histogram(bins = nhistbins) +
  labs(title = "Normal")
# Chi-squared distribution
p3 <-
  tibble(
    x = rchisq(nsamples, df = 1)
  ) %>%
  ggplot(aes(x)) +
  geom_histogram(bins = nhistbins) +
  labs(title = "Chi-squared")
# binomial distribution
p4 <-
  tibble(
    x = rbinom(nsamples, 20, 0.25)
  ) %>%
  ggplot(aes(x)) +
  geom_histogram(bins = nhistbins) +
  labs(title = "Binomial (p=0.25, 20 trials)")
# arrange the four histograms in a 2 x 2 grid
plot_grid(p1, p2, p3, p4, ncol = 2)
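As a quick sanity check on samples like those above, one can compare each sample mean with the theoretical mean of its distribution. The following is a minimal sketch in base R; the seed value and the names `samples`, `theoretical_mean`, and `sample_mean` are illustrative assumptions and are not part of the original code.

```r
# Sanity check: sample means should be close to the theoretical means.
# The seed and all object names here are illustrative assumptions.
set.seed(12345)
nsamples <- 10000

samples <- list(
  uniform  = runif(nsamples),             # theoretical mean 0.5
  normal   = rnorm(nsamples),             # theoretical mean 0
  chisq    = rchisq(nsamples, df = 1),    # theoretical mean = df = 1
  binomial = rbinom(nsamples, 20, 0.25)   # theoretical mean = 20 * 0.25 = 5
)

theoretical_mean <- c(uniform = 0.5, normal = 0, chisq = 1, binomial = 5)
sample_mean <- sapply(samples, mean)

# side-by-side comparison of theoretical and observed means
print(round(cbind(theoretical_mean, sample_mean), 3))
```

With 10,000 draws per distribution, the two columns should agree to roughly the second decimal place; larger gaps would suggest a mistake in the sampling call.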
7.2 Simulating the maximum finishing time
Let’s simulate 5,000 samples of 150 observations each, collect the maximum value from each sample, and then plot the distribution of those maxima along with its 99th percentile.
# sample maximum value 5000 times and compute 99th percentile
nRuns <- 5000
sampSize <- 150
# draw one sample of size sampSize and return its maximum
sampleMax <- function(sampSize = 150) {
  samp <- rnorm(sampSize, mean = 5, sd = 1)
  return(tibble(max = max(samp)))
}
# one group per simulation run
input_df <- tibble(id = seq(nRuns)) %>%
  group_by(id)
maxTime <- input_df %>% do(sampleMax(sampSize))
cutoff <- quantile(maxTime$max, 0.99)
# histogram of the maxima, with the 99th percentile marked in red
ggplot(maxTime, aes(max)) +
  geom_histogram(bins = 100) +
  geom_vline(xintercept = cutoff, color = "red")
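The simulated cutoff can be checked against theory. The maximum of n independent draws is at most x only when every draw is at most x, so P(max ≤ x) = F(x)^n, and the 99th percentile of the maximum solves F(x)^n = 0.99. The sketch below is one way to make that comparison in base R, using `replicate()` in place of the grouped `do()` call; the seed and the names `maxima`, `cutoff_sim`, and `cutoff_theory` are assumptions introduced for illustration.

```r
# Alternative sketch of the maximum simulation using base R only.
# The seed and the names maxima, cutoff_sim, cutoff_theory are assumptions.
set.seed(12345)
nRuns <- 5000
sampSize <- 150

# draw nRuns samples of size sampSize and keep the maximum of each
maxima <- replicate(nRuns, max(rnorm(sampSize, mean = 5, sd = 1)))

# empirical 99th percentile of the simulated maxima
cutoff_sim <- quantile(maxima, 0.99)

# theoretical value: P(max <= x) = pnorm(x, 5, 1)^sampSize, set equal to 0.99
cutoff_theory <- qnorm(0.99^(1 / sampSize), mean = 5, sd = 1)

print(c(simulated = unname(cutoff_sim), theoretical = cutoff_theory))
```

The two values should be close but not identical, since the simulated percentile rests on only the largest 1% of the 5,000 maxima.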
7.3 The bootstrap
The bootstrap is useful for creating confidence intervals in cases where we don’t have a parametric distribution for the statistic of interest. One example is the median; let’s look at how that works. We will implement it by hand to see the procedure more closely: we start by collecting a sample of individuals from the NHANES dataset, and then use the bootstrap to obtain confidence intervals on the median for the Height variable.
Lower CI limit | Median | Upper CI limit |
---|---|---|
161.6 | 167.65 | 171.1 |
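The by-hand procedure described above can be sketched as follows. This is a minimal illustration and not the original analysis: it assumes a percentile bootstrap, and it uses simulated heights as a stand-in for the NHANES sample, so its numbers will not match the table above. The names `heights`, `nBoot`, `bootMedians`, and `ci`, the seed, and the simulation parameters are all hypothetical.

```r
# Hand-rolled percentile bootstrap for the median.
# ASSUMPTION: simulated heights stand in for the NHANES sample used in the
# text, so these results will differ from the table above.
set.seed(12345)
heights <- rnorm(100, mean = 168, sd = 10)  # hypothetical heights in cm

nBoot <- 2500  # number of bootstrap resamples (an arbitrary choice)

# resample with replacement, same size as the original sample,
# and record the median of each resample
bootMedians <- replicate(
  nBoot,
  median(sample(heights, size = length(heights), replace = TRUE))
)

# 95% percentile interval: 2.5th and 97.5th percentiles of the medians
ci <- quantile(bootMedians, c(0.025, 0.975))

print(c(lower = unname(ci[1]), median = median(heights), upper = unname(ci[2])))
```

To run this on real data, `heights` would be replaced by the Height values from the sampled individuals, with missing values removed first.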