Modeling categorical relationships in Python

So far we have discussed the general concept of statistical modeling and hypothesis testing, and applied them to some simple analyses. In this chapter we will focus on the modeling of categorical relationships, by which we mean relationships between variables that are measured qualitatively. These data are usually expressed in terms of counts; that is, for each value of the variable (or combination of values of multiple variables), how many observations take that value? For example, when we count how many people from each major are in our class, we are fitting a categorical model to the data. As an example, we will use the NHANES dataset to ask whether there is a relationship between being a smoker and having ever had cancer (of any type).

from nhanes.load import load_NHANES_data
nhanes_data = load_NHANES_data()
adult_nhanes_data = nhanes_data.query('AgeInYearsAtScreening > 17')

# clean up smoking variables
adult_nhanes_data.loc[adult_nhanes_data['SmokedAtLeast100CigarettesInLife'] == 0, 'DoYouNowSmokeCigarettes'] = 'Not at all'
adult_nhanes_data.loc[:, 'SmokeNow'] = (adult_nhanes_data['DoYouNowSmokeCigarettes'] != 'Not at all')

categorical_df = adult_nhanes_data[['SmokeNow', 'EverToldYouHadCancerOrMalignancy']].dropna().astype('int').rename(columns={'EverToldYouHadCancerOrMalignancy': 'HadCancer'})
/opt/conda/lib/python3.8/site-packages/pandas/core/indexing.py:966: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
/opt/conda/lib/python3.8/site-packages/pandas/core/indexing.py:845: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)

The Pearson Chi-squared test

The Pearson Chi-squared test is used to test for an association between two categorical variables, against the null hypothesis of independence. We will use the statsmodels.stats.Table function for this, which has a number of useful features.

import statsmodels.api as sm
table = sm.stats.Table.from_data(categorical_df, shift_zeros=False)
table.table_orig
HadCancer 0 1
SmokeNow
0 3836 470
1 876 81

We can also see the predicted frequencies under the null hypothesis of independence, which are stored in the .fittedvalues element:

table.fittedvalues
HadCancer 0 1
SmokeNow
0 3855.191336 450.808664
1 856.808664 100.191336

Using these, we can compute the chi-squared statistic:

import numpy as np
orig_vector = np.ravel(table.table_orig)
independence_vector = np.ravel(table.fittedvalues)
squared_resid = (orig_vector - independence_vector)**2
chi2 = np.sum(squared_resid/independence_vector)
chi2
5.018427710130295

We can confirm this by comparing it to the result from the built-in function to compute the association:

chi2_result = table.test_nominal_association() print(chi2_result)

We can also see the standardized residuals:

table.standardized_resids
HadCancer 0 1
SmokeNow
0 -2.240185 2.240185
1 2.240185 -2.240185

This shows that there is an unexpectedly large number of people who smoke but don’t have cancer, and similarly an unexpectedly low number of smokers who report having had cancer before. Does this tell us that smoking results in lower rates of cancer?