Summarizing Data

Summarizing Data#

This chapter will introduce you to how to summarize data using data frames in Pandas.

Working with data frames#

In the last chapter you were introduced to the concept of a data frame, which we will use throughout much of this book.
In particular, we will use a dataset known as NHANES for several of our examples, so let’s load the library that provides us access to the data. This is a large dataset collected from a sample of individuals in the United States every two years, which measures many different aspects of their health and lifestyles. To access the data, we will use a Python package called nhanes that contains a function to load a cleaned-up version of the dataset.

First, we need to install the package. The escalation mark ! means that we are running this command in the computer system, instead of within the python environment.

! pip install nhanes

WARNING: The directory '/github/home/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.

Requirement already satisfied: nhanes in /miniconda/lib/python3.12/site-packages (0.5.1)

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

from nhanes.load import load_NHANES_data
nhanes_data = load_NHANES_data()

We can use the .head() method to look inside the data frame:

nhanes_data.head()

	GeneralHealthCondition	EverBreastfedOrFedBreastmilk	AgeStoppedBreastfeedingdays	AgeFirstFedFormuladays	AgeStoppedReceivingFormuladays	AgeStartedOtherFoodbeverage	AgeFirstFedMilkdays	TypeOfMilkFirstFedWholeMilk	TypeOfMilkFirstFed2Milk	TypeOfMilkFirstFed1Milk	...	DaysSmokedCigsDuringPast30Days	AvgCigarettesdayDuringPast30Days	TriedToQuitSmoking	TimesStoppedSmokingCigarettes	HowLongWereYouAbleToStopSmoking	UnitOfMeasureDayweekmonthyear_2_SMQ	CurrentSelfreportedHeightInches	CurrentSelfreportedWeightPounds	TriedToLoseWeightInPastYear	TimesLost10LbsOrMoreToLoseWeight
SEQN
93703.0	NaN	1.0	273.0	1.0	365.0	152.0	365.0	Whole or regular milk	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
93704.0	NaN	1.0	60.0	3.0	365.0	126.0	365.0	Whole or regular milk	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
93705.0	Good	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	63.0	165.0	0.0	11 times or more
93706.0	Very good	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	68.0	145.0	0.0	Never
93707.0	Good	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 197 columns

Let’s look at the structure of this dataset. We can see the shape of the data frame – that is, the number of rows and columns – using the .shape method:

nhanes_data.shape

(8366, 197)

We see that the dataset has many more rows than columns. Let’s look more closely at what is contained in the rows and columns. To obtain the labels for the rows, we use the .index operator:

nhanes_data.index

Index([ 93703.0,  93704.0,  93705.0,  93706.0,  93707.0,  93708.0,  93709.0,
        93711.0,  93712.0,  93713.0,
       ...
       102946.0, 102947.0, 102948.0, 102949.0, 102951.0, 102952.0, 102953.0,
       102954.0, 102955.0, 102956.0],
      dtype='float64', name='SEQN', length=8366)

The index contains a bunch of numbers, each of which refers to one of the individuals in the NHANES data set. In a data frame, the rows always refer to observations, by which mean that each row reflects an individual unit of data. In the case of a dataset like NHANES, the observations would usually refer to individual people, though as we will see later, we sometimes want the rows to be even more specific. We can also look at the content of the columns, which we can access using the .columns operator:

nhanes_data.columns

Index(['GeneralHealthCondition', 'EverBreastfedOrFedBreastmilk',
       'AgeStoppedBreastfeedingdays', 'AgeFirstFedFormuladays',
       'AgeStoppedReceivingFormuladays', 'AgeStartedOtherFoodbeverage',
       'AgeFirstFedMilkdays', 'TypeOfMilkFirstFedWholeMilk',
       'TypeOfMilkFirstFed2Milk', 'TypeOfMilkFirstFed1Milk',
       ...
       'DaysSmokedCigsDuringPast30Days', 'AvgCigarettesdayDuringPast30Days',
       'TriedToQuitSmoking', 'TimesStoppedSmokingCigarettes',
       'HowLongWereYouAbleToStopSmoking',
       'UnitOfMeasureDayweekmonthyear_2_SMQ',
       'CurrentSelfreportedHeightInches', 'CurrentSelfreportedWeightPounds',
       'TriedToLoseWeightInPastYear', 'TimesLost10LbsOrMoreToLoseWeight'],
      dtype='object', length=197)

Each of the columns contains a different variable – that is, a different thing that is measured on each observation.

Selecting rows (observations) and columns (variables)#

We often want to select a subset of rows from a data frame. You saw in the last chapter how we can access specific rows using the .loc operator. This operator requires us to refer to the row names (that is, the index) and column names. For example, if we wanted to know the value of the GeneralHealthCondition variable for the indivdiual labeled 93707, we could use the following:

nhanes_data.loc[93707, 'GeneralHealthCondition']

'Good'

If there were several variables that we wanted to see for this individual, we could include the names of those variables in a list:

nhanes_data.loc[93707, ['GeneralHealthCondition', 'Gender', 'AgeInYearsAtScreening']]

GeneralHealthCondition    Good
Gender                    Male
AgeInYearsAtScreening     13.0
Name: 93707.0, dtype: object

This shows us the values for each of those variables for this individual. We could also do the same if there were several individuals that we were interested in:

nhanes_data.loc[[102951, 102955, 93707], ['GeneralHealthCondition', 'Gender', 'AgeInYearsAtScreening']]

	GeneralHealthCondition	Gender	AgeInYearsAtScreening
SEQN
102951.0	NaN	Male	4.0
102955.0	Very good	Female	14.0
93707.0	Good	Male	13.0

Missing values#

You will notice that the GeneralHealthCondition variable for the first individual in the previous cell contained NaN, which stands for “not a number”. This is generally used to denote that the data are missing for this particular observation; perhaps they declined to answer the question, or the interviewer failed to properly record the answer. Missing data are common when we are working with real data. There are many sophsticated ways to deal with missing data in statistics, but for the moment we will just remove observations that have a missing data on one of our variables of interest, which we can do using the .dropna() operator:

my_subset = nhanes_data.loc[[102951, 102955, 93707], ['GeneralHealthCondition', 'Gender', 'AgeInYearsAtScreening']]
my_subset.dropna()

	GeneralHealthCondition	Gender	AgeInYearsAtScreening
SEQN
102955.0	Very good	Female	14.0
93707.0	Good	Male	13.0

This operator removes any rows that have missing values for any of the variables in the data frame.

Selecting rows by value#

Let’s say that we want to analyze NHANES data, but only for those individuals who are over 50 years of age. We can use the query operator on a data frame to find rows that match particular conditions:

over_50_df = nhanes_data.query('AgeInYearsAtScreening >= 50')
over_50_df.shape

(2898, 197)

This shows that there were 2898 observations that matched our criterion. We can also search for specific values: For example, let’s say that we want to find anyone who reported that their general health condition was “Good”. This one is a bit tricky, because we are searching for a string of text, which we have to embed in our query, which is also a string of text. Fortunately, there are two different quotation marks that we can use (' or ") and Python treats them as distinct operators, so we can surround our text within double quotes, inside a string surrounded by single quotes:

good_health_df = nhanes_data.query('GeneralHealthCondition == "Good"')
good_health_df.shape

(2454, 197)

Creating new variables#

We can add a new variable to a data frame easily, by simply giving it a new name. Let’s say that we wanted to convert the weight value in NHANES (stored in the WeightKg variable) from kilograms to pounds.

nhanes_data['WeightLbs'] = nhanes_data['WeightKg'] * 2.205

This shows another way to refer to a particular variable in a dataframe: simply put its name in square brackets. Pandas also has the ability to replace particular values in a variable. First, let’s look at the values of the Gender variable in the data frame, to see what values it takes. We can see all of the unique values of a variable using the .unique() operator:

nhanes_data['Gender'].unique()

array(['Female', 'Male'], dtype=object)

Now let’s say that we wanted to recode the Gender variable so that instead of “Female” and “Male” its values were “F” and “M”. One way to do this would be to use the .replace() operator on the data frame:

nhanes_data['GenderMF'] = nhanes_data['Gender'].replace({'Female': 'F', 'Male': 'M'})
nhanes_data['GenderMF'].unique()

array(['F', 'M'], dtype=object)

Understanding your data#

Let’s say that we want to learn more about the variable labeled GeneralHealthCondition in the dataset. We can load some information about that variable using the open_variable_page() function from the nhanes package:

from nhanes.load import open_variable_page
open_variable_page('GeneralHealthCondition')

This shows us the question that was asked (“Would you say {your/SP’s} health in general is…”).

Summarizing data using a frequency distribution#

Let’s say that we want to know the frequencies of all of the different answers to the GeneralHealthCondition question. We can do this using the .value_counts() method of the data frame:

nhanes_data['GeneralHealthCondition'].value_counts()

GeneralHealthCondition
Good         2454
Very good    1544
Fair or      1175
Excellent     619
Poor?         172
Name: count, dtype: int64

It’s usually more helpful to present a relative frequency distribution, which shows proportions rather than counts. We can obtain that by simply dividing the frequency distribution by the total number of cases, which we can obtain using the .sum() operator:

GeneralHealthCondition_frequency_dist = nhanes_data['GeneralHealthCondition'].value_counts()
GeneralHealthCondition_frequency_dist / GeneralHealthCondition_frequency_dist.sum()

GeneralHealthCondition
Good         0.411469
Very good    0.258887
Fair or      0.197015
Excellent    0.103789
Poor?        0.028840
Name: count, dtype: float64

Data Cleaning#

When we work with real data, they often have problems that we have to fix before we can analyze them properly. An example is the GeneralHealthCondition variable that we worked with in the previous example. You may have noticed that the values of the variable had some extranous information, which were held over from the way that the question is worded (including “Fair or” and “Poor?”). We can clean these up by replacing the problematic values using the .replace() method:

nhanes_data['GeneralHealthConditionFixed'] = nhanes_data['GeneralHealthCondition'].replace({'Fair or': 'Fair', 'Poor?': 'Poor'})
nhanes_data['GeneralHealthConditionFixed'].unique()

array([nan, 'Good', 'Very good', 'Fair', 'Excellent', 'Poor'],
      dtype=object)

Now let’s look at a more complex example. Let’s say that we want to know who is currently a smokker in the NHANES sample. If we look more closely at the details of the smoking questionnaire, we will see that not all individuals got the same questions; for example, if a person said that they had not smoked more than 100 cigarettes in their life (recorded in the SmokedAtLeast100CigarettesInLife variable), then they were not asked the question about whether they currently smoked cigarettes (stored as DoYouNowSmokeCigarettes). We can see this in the number of respondents to each question:

print(nhanes_data['SmokedAtLeast100CigarettesInLife'].value_counts())

SmokedAtLeast100CigarettesInLife
0.0    3301
1.0    2232
Name: count, dtype: int64

nhanes_data['DoYouNowSmokeCigarettes'].value_counts()

DoYouNowSmokeCigarettes
Not at all    1260
Every day      763
Some days      209
Name: count, dtype: int64

Looking at these two variables, it’s plain to see that there are many more values included in the first question than in the second. Let’s look at bit more closely at the data to see what’s going on:

nhanes_data[['SmokedAtLeast100CigarettesInLife', 'DoYouNowSmokeCigarettes']].head(10)

	SmokedAtLeast100CigarettesInLife	DoYouNowSmokeCigarettes
SEQN
93703.0	NaN	NaN
93704.0	NaN	NaN
93705.0	1.0	Not at all
93706.0	0.0	NaN
93707.0	NaN	NaN
93708.0	0.0	NaN
93709.0	1.0	Every day
93711.0	0.0	NaN
93712.0	1.0	Some days
93713.0	1.0	Every day

Here we can see that anyone who said No to the question about having smoked at least 100 cigarettes in their life (which is coded as zero) has a missing value for the question about current smoking (since that question wasn’t asked to these individuals). To clean up these data, we need to do two things. First, we need to remove the individuals who have NaN for both questions, then we need to recode the NaN’s for the second question for those people who said no on the first question. To do this, let’s first create a new data frame that contains just the variables we are interested in:

smoking_df = nhanes_data[['SmokedAtLeast100CigarettesInLife', 'DoYouNowSmokeCigarettes']]
smoking_df.shape

(8366, 2)

First let’s remove any rows that have NaN in both columns. We can do this using the .dropna() method, which has an option that allows us to specify that we only drop rows that are all NaN (by setting how='all'):

smoking_df = smoking_df.dropna(how='all')
smoking_df.shape

(5533, 2)

Next we need to recode the NaN values for the second question, for those individuals who said no to the first question. We will replace them with the answer ‘Not at all’. To do this, we can use the .loc operator with a test for the value of the first column:

smoking_df.loc[smoking_df['SmokedAtLeast100CigarettesInLife'] == 0, 'DoYouNowSmokeCigarettes'] = 'Not at all'
smoking_df.head()

	SmokedAtLeast100CigarettesInLife	DoYouNowSmokeCigarettes
SEQN
93705.0	1.0	Not at all
93706.0	0.0	Not at all
93708.0	0.0	Not at all
93709.0	1.0	Every day
93711.0	0.0	Not at all

This replaced the NaN values in the second question, but only for those individuals who said no to the first question. Now we can summarize the frequency of smoking across the entire group:

smoking_df['DoYouNowSmokeCigarettes'].value_counts() / smoking_df['DoYouNowSmokeCigarettes'].value_counts().sum()

DoYouNowSmokeCigarettes
Not at all    0.824327
Every day     0.137900
Some days     0.037773
Name: count, dtype: float64

In the next chapter you will learn how to visualize data like these using statistical graphs.