{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fitting simple models\n",
"\n",
"In this chapter we will focus on how to compute the measures of central tendency and variability that were covered in the previous chapter. Most of these can be computed using a built-in Python function, but we will show how to do them manually in order to give some intuition about how they work. First let's load the NHANES data that we will use for our examples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from nhanes.load import load_NHANES_data\n",
"nhanes_data = load_NHANES_data()\n",
"adult_nhanes_data = nhanes_data.query('AgeInYearsAtScreening > 17')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we will be analyzing the `StandingHeightCm` variable, we should exclude any observations that are missing this measurement. We will also recode the variable to be called `Height` in order to simplify the coding later."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"adult_nhanes_data = adult_nhanes_data.dropna(subset=['StandingHeightCm']).rename(columns={'StandingHeightCm': 'Height'})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mean\n",
"The mean is defined as the sum of values divided by the number of values being summed:\n",
"$$\n",
"\\bar{X} = \\frac{\\sum_{i=1}^{n}x_i}{n}\n",
"$$\n",
"Let's say that we want to obtain the mean height for adults in the NHANES database (contained in the data `Height` that we generated above). We would sum the individual heights (using the `.sum()` operator) and then divide by the number of values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"adult_nhanes_data['Height'].sum() / adult_nhanes_data['Height'].shape[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is, of course, a built-in operator for the data frame called `.mean()` that will compute the mean. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"adult_nhanes_data['Height'].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Median\n",
"The median is the middle value after sorting the entire set of values. First we sort the data in order of their values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"height_sorted = adult_nhanes_data['Height'].sort_values()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we find the median value. If there is an odd number of values in the list, then this is just the value in the middle, whereas if the number of values is even then we take the average of the two middle values. We can determine whether the number of items is even by dividing the length by two and seeing if there is a remainder; we do this using the `%%` operator, which is known as the *modulus* and returns the remainder:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"height_length_mod_2 = height_sorted.shape[0] % 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we will test whether the remainder is equal to one; if it is, then we will take the middle value, otherwise we will take the average of the two middle values. We can do this using an if/else structure, which executes different processes depending on which of the arguments are true. Here is a simple example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if 1 > 2:\n",
" print('1 > 2')\n",
"else:\n",
" print('1 is not greater than two!')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For our example, we can use an if statement to determine how to compute the median, depending on whether there is an odd or even number of data points."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"import numpy as np\n",
"if height_length_mod_2 == 1: \n",
" # odd number values - take the single midpoint\n",
" midpoint = int(np.ceil(height_sorted.shape[0] / 2))\n",
" median = height_sorted[midpoint]\n",
"else:\n",
" # even number of values - need to average the two middle points\n",
" midpoints = [int((height_sorted.shape[0] / 2) - 1),\n",
" int(height_sorted.shape[0] / 2)]\n",
" median = height_sorted.iloc[midpoints].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 2
},
"source": [
"There is a lot going on there, so let's unpack it. The first line of the if statement asks whether the remainder is equal to one --- if so, then it executes the lines that are indented below it. Python uses indentation as part of its syntax, so you always need to be very careful about indentation. If the remainder is one, that means that the number of observations is odd, and thus that we can simply take the single middle point. We determine this by dividing the number of observations by two, and then rounding up (which is what the `np.ceil()` function does). Finally, we have to convert this number into an integer using the `int()` function, since we can only use integers to index a data frame. \n",
"If the first test is false --- that is, if the remainder is zero --- then, the second section of code (after the `else` statement) will be executed instead. Here we need to find the two midpoints and average them, so we create a new list containing those two points, and then use that index our data and then take the mean."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mode\n",
"The mode is the most frequent value that occurs in a variable. For example, let's say that we had the following data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"toy_data = pd.DataFrame({'myvar': ['a', 'a', 'b', 'c']})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see by eye that the mode is \"a\" since it occurs more often than the others. To find it computationally, let's use the `.value_counts()` operator to find the frequency of each value:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"myvar_frequencies = toy_data['myvar'].value_counts()\n",
"myvar_frequencies"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's find the highest frequency, using the `.max()` operator:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"max_frequency = myvar_frequencies.max()\n",
"max_frequency"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can find the values that have the maximum frequency:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mode = myvar_frequencies.loc[myvar_frequencies == max_frequency].index.values"
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 2
},
"source": [
"## Creating functions\n",
"It is often useful to create our own custom *function* in order to perform a particular action. Let's do that for our mode function:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"def my_mode_function(input):\n",
" \"\"\"\n",
" A function to compute the mode. \n",
"\n",
" Inputs:\n",
" ------\n",
" input: a pandas Series\n",
"\n",
" Outputs:\n",
" --------\n",
" mode: an array containing the mode values\n",
" \"\"\"\n",
"\n",
" # make sure the input is a pandas series\n",
" input = pd.Series(input)\n",
"\n",
" # compute the frequency distribution\n",
" frequencies = input.value_counts()\n",
"\n",
" # compute the maximum frequency\n",
" max_frequency = frequencies.max()\n",
"\n",
" # find the values matching the maximum frequency (i.e. the mode)\n",
" mode = frequencies.loc[\n",
" frequencies == max_frequency].index.values\n",
"\n",
" return(mode)"
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 2
},
"source": [
"Let's look at this one section at a time.\n",
"The first row tells Python to define a new function, called \"my_mode_function\", which takes in a single variable that will be called \"input\". This variable only exists inside the function; you can't access it from the outside. \n",
"The next section, surrounded by triple-quotes, is known as a *docstring*, and it provides documentation about our function. It's always a good idea to write a docstring that describes what the function does, what kinds of inputs it expects, and what kind of output it produces.\n",
"The next line converts the input to a particular kind of variable called a pandas *Series*; this is the same kind of variable as a column in a data frame. Including this command allows our function to take in various types of variables (including Series and lists) and treat them as if they were a Series, using the operators that are available such as `.value_counts()`.\n",
"The remaining lines perform the computations that we performed above to compute the mean.\n",
"The final line tells Python to return the value of the mode when the function is called. Let's see this in action:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"my_mode_function(['a', 'a', 'b', 'c'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 2
},
"source": [
"Let's also make sure that it works properly if there are multiple modes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"my_mode_function(['a', 'a', 'b', 'c', 'c'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Variability\n",
"Let's first compute the *variance*, which is the average squared difference between each value and the mean. Let's do this with our cleaned-up version of the height data, but instead of working with the entire dataset, let's take a random sample of 150 individuals:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample_size = 150\n",
"height_sample = adult_nhanes_data.sample(sample_size)['Height']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could have simply entered the number 150 into the sample function, but by first creating a new variable called `sample_size` and setting it to 150, we make it clearer to the reader of the code exactly what this number refers to. It's always good practice to create a new variable rather than typing a number directly into a formula.\n",
"\n",
"To compute the variance we need we need to first compute the sum of squared errors from the mean. In Python, we can square a vector using `**2`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"sum_of_squared_errors = np.sum((height_sample - height_sample.mean())**2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we divide by N - 1 to get the estimated variance:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"variance_estimate = sum_of_squared_errors / (height_sample.shape[0] - 1)\n",
"variance_estimate"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can compare this to the built-in `.var()` operator:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"height_sample.var()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can get the *standard deviation* by simply taking the square root of the variance:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"std_dev_estimate = np.sqrt(variance_estimate)\n",
"std_dev_estimate"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Which is the same value obtained using the built-in `.std()` operator:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"height_sample.std()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Z-scores\n",
"A Z-score is obtained by first subtracting the mean and then dividing by the standard deviation of a distribution. Let's do this for the `height_sample` data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"mean_height = height_sample.mean()\n",
"sd_height = height_sample.std()\n",
"\n",
"z_height = (height_sample - mean_height) / sd_height"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's plot the histogram of Z-scores alongside the histogram for the original values. Matplotlib allows us to create a grid of figures using the `plt.subplot()` function. Let's see this in action:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"plt.figure(figsize=(12, 6))\n",
"plt.subplot(1, 2, 1)\n",
"plt.hist(height_sample)\n",
"\n",
"plt.subplot(1, 2, 2)\n",
"plt.hist(z_height)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You will notice that the shapes of the histograms are exactly the same. We can also see this by plotting the two variables against one another in a scatterplot:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.scatter(height_sample, z_height)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You see here that they fall along a straight line, meaning that they are perfectly related to each other exactly --- the only difference is where they are located on the number line."
]
}
],
"metadata": {
"jupytext": {
"formats": "ipynb,py:percent"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}