Resampling and simulation#

Generating random samples#

Here we will generate random samples from a number of different distributions and plot their histograms. We could write out separate commands to plot each of our functions of interest, but that would involve repeating a lot of code, so instead we will take advantage of the fact that Python allows us to treat modules as variables. We will specify the module that creates each distribution, and then loop through them, each time incrementing the panel number. Some distributions also take specific parameters; for example, the Chi-squared distribution requires specifying the degrees of freedom. We will store those in a separate dictionary and use them as needed.

import scipy.stats
import matplotlib.pyplot as plt

num_samples = 20000

plt.figure(figsize=(8, 8))

generators = {'Uniform': scipy.stats.uniform,
              'Normal': scipy.stats.norm,
              'Exponential': scipy.stats.expon,
              'Chi-squared': scipy.stats.chi2}

generator_parameters = {'Chi-squared': 10}
panel_num = 1
for distribution in generators:
    plt.subplot(2, 2, panel_num)
    if distribution in generator_parameters:
        sample = generators[distribution].rvs(
            generator_parameters[distribution], size=num_samples)
    else:
        sample = generators[distribution].rvs(size=num_samples)
    plt.hist(sample, bins=100, density=True)
    plt.title(distribution)
    plt.xlabel('Value')
    plt.ylabel('Density')
    # the following function prevents the labels from overlapping
    plt.tight_layout()
    panel_num += 1
plt.show()
_images/16a63c8514d96940332722f39dab6eb741ed3fcc673c613c6bc1495d56064bc0.png

Simulating the maximum finishing time#

Let’s simulate 5000 samples of 150 observations, collecting the maximum value from each sample, and then plotting the distribution of maxima.

import numpy as np
import pandas as pd

num_runs = 5000
sample_size = 150


def sample_and_return_max(sample_size,
                          distribution=None):
    """
    function to sample from a distribution and return maximum
    """

    # if distribution is not specified, then use the normal
    if distribution is None:
        distribution = scipy.stats.norm

    sample = distribution.rvs(size=sample_size)
    return(np.max(sample))


sample_max_df = pd.DataFrame({'max': np.zeros(num_runs)})

for i in range(num_runs):
    sample_max_df.loc[i, 'max'] = sample_and_return_max(sample_size, distribution=scipy.stats.norm(loc=5,scale=1))

Now let’s find the 99th percentile of the maximum distriibution. There is a built-in function in the scipy.stats module, called scoreatpercentile that will do this for us:

cutoff = scipy.stats.scoreatpercentile(sample_max_df['max'], 99)

Plot the histogram of the maximum values, along with a vertical line at the 95th percentile.

hist = plt.hist(sample_max_df['max'], bins=100)
plt.ylabel('Count')
plt.xlabel('Maximum value')
_ = plt.axvline(x=cutoff, ymax=np.max(hist[0]), color='k')
plt.show()
_images/a68066a0ff80f5a6ffdfe1be71e074a17dd279c0f204fc5b9e0b8c0c4cc36c15.png

The bootstrap#

The bootstrap is useful for creating confidence intervals in cases where we don’t have a parametric distribution. One example is for the median; let’s look at how that works. We will start by implementing it by hand, to see more closely how it works. We will start by collecting a sample of individuals from the NHANES dataset, and the using the bootstrap to obtain confidence intervals on the median for the Height variable.

#+
! pip install nhanes
from nhanes.load import load_NHANES_data
nhanes_data = load_NHANES_data()
adult_nhanes_data = nhanes_data.query('AgeInYearsAtScreening > 17')
adult_nhanes_data = adult_nhanes_data.dropna(subset=['StandingHeightCm']).rename(columns={'StandingHeightCm': 'Height'})

num_runs = 5000
sample_size = 100

# Take a sample for which we will perform the bootstrap

nhanes_sample = adult_nhanes_data.sample(sample_size)

# Perform the resampling

bootstrap_df = pd.DataFrame({'mean': np.zeros(num_runs)})
for sampling_run in range(num_runs):
    bootstrap_sample = nhanes_sample.sample(sample_size, replace=True)
    bootstrap_df.loc[sampling_run, 'mean'] = bootstrap_sample['Height'].mean()

# Compute the 2.5% and 97.5% percentiles of the distribution


bootstrap_ci = [scipy.stats.scoreatpercentile(bootstrap_df['mean'], 2.5),
                scipy.stats.scoreatpercentile(bootstrap_df['mean'], 97.5)]

#-
Requirement already satisfied: nhanes in /miniconda/lib/python3.12/site-packages (0.5.1)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

Let’s compare the bootstrap distribution to the sampling distribution that we would expect given the sample mean and standard deviation:

hist = plt.hist(bootstrap_df['mean'], 100, density=True)

hist_bin_min = np.min(hist[1])
hist_bin_max = np.max(hist[1])
step_size = 0.01
x_values = np.arange(hist_bin_min, hist_bin_max, step_size)
normal_values = scipy.stats.norm.pdf(
    x_values,
    loc=nhanes_sample['Height'].mean(),
    scale=nhanes_sample['Height'].std()/np.sqrt(sample_size))
plt.plot(x_values, normal_values, color='r')
plt.legend([' Normal distribution based on sample mean and SEM','Means of bootstrap samples'])
plt.xlabel('Height (cm)')
plt.show()
_images/e6c5febbe536bfda9e8d711295656b6fe239292e86fa92dd695fc84027a60ac6.png

This shows that the bootstrap sampling distrbution does a good job of recapitulating the theoretical sampling distribution in this case.