# Modeling categorical relationships in Python

So far we have discussed the general concepts of statistical modeling and hypothesis testing, and applied them to some simple analyses. In this chapter we will focus on the modeling of categorical relationships, by which we mean relationships between variables that are measured qualitatively. These data are usually expressed in terms of counts; that is, for each value of the variable (or combination of values of multiple variables), how many observations take that value? For example, when we count how many people from each major are in our class, we are fitting a categorical model to the data. Here we will use the NHANES dataset to ask whether there is a relationship between being a smoker and having ever had cancer (of any type).

from nhanes.load import load_NHANES_data

# clean up smoking variables




## The Pearson Chi-squared test

The Pearson Chi-squared test is used to test for an association between two categorical variables, against the null hypothesis of independence. We will use the Table class from statsmodels (sm.stats.Table) for this, which has a number of useful features. Its from_data method cross-tabulates a two-column data frame into a contingency table of counts.

import statsmodels.api as sm
table = sm.stats.Table.from_data(categorical_df, shift_zeros=False)
table.table_orig

SmokeNow
0 3836 470
1 876 81

We can also see the predicted frequencies under the null hypothesis of independence, which are stored in the .fittedvalues attribute:

table.fittedvalues

SmokeNow
0 3855.191336 450.808664
1 856.808664 100.191336
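Under independence, the expected count in each cell is the product of its row total and column total divided by the overall total. As a rough check on the fitted values above, the same numbers can be computed directly with NumPy from the observed counts; the variable names here are illustrative only.

```python
import numpy as np

# Observed counts copied from the table above
observed = np.array([[3836, 470],
                     [876, 81]])

row_totals = observed.sum(axis=1)   # total for each row
col_totals = observed.sum(axis=0)   # total for each column
n = observed.sum()                  # total number of observations

# Expected count for cell (i, j) under independence:
# row_totals[i] * col_totals[j] / n
expected = np.outer(row_totals, col_totals) / n
print(expected)
```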

Using these, we can compute the chi-squared statistic:

import numpy as np
orig_vector = np.ravel(table.table_orig)
independence_vector = np.ravel(table.fittedvalues)
squared_resid = (orig_vector - independence_vector)**2
chi2 = np.sum(squared_resid/independence_vector)
chi2

5.018427710130295
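The statistic on its own does not tell us how surprising the data are under the null hypothesis; for that we compare it to a chi-squared distribution with (rows - 1) x (columns - 1) degrees of freedom, which is 1 for a 2 x 2 table. The sketch below is one way to do this with SciPy, which is not used elsewhere in this section and is brought in here only as a cross-check. Note that chi2_contingency applies Yates' continuity correction to 2 x 2 tables by default, so correction=False is needed to reproduce the uncorrected Pearson statistic.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the table above
observed = np.array([[3836, 470],
                     [876, 81]])

# Pearson chi-squared test without the continuity correction
chi2_stat, p_value, dof, expected = stats.chi2_contingency(
    observed, correction=False
)

# The same p-value from the chi-squared survival function
p_manual = stats.chi2.sf(chi2_stat, df=dof)

print(chi2_stat, dof, p_value)
```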


We can confirm this by comparing it to the result from the built-in function to compute the association:

chi2_result = table.test_nominal_association()
print(chi2_result)

We can also see the standardized residuals:

table.standardized_resids
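Standardized residuals show which cells deviate from independence and in which direction. The sketch below computes them by hand, assuming the usual adjusted-residual definition: the Pearson residual (observed - expected) / sqrt(expected), divided by sqrt((1 - row proportion) x (1 - column proportion)). To my understanding this matches what statsmodels reports, but treat that as an assumption and compare against table.standardized_resids if in doubt.

```python
import numpy as np

# Observed counts copied from the table above
observed = np.array([[3836, 470],
                     [876, 81]])

n = observed.sum()
row_prop = observed.sum(axis=1) / n   # marginal row proportions
col_prop = observed.sum(axis=0) / n   # marginal column proportions
expected = np.outer(row_prop, col_prop) * n

# Pearson residuals: deviation scaled by sqrt of the expected count
pearson_resid = (observed - expected) / np.sqrt(expected)

# Adjusted (standardized) residuals: further scaled by the margins
std_resid = pearson_resid / np.sqrt(np.outer(1 - row_prop, 1 - col_prop))
print(std_resid)
```

In a 2 x 2 table all four standardized residuals have the same magnitude, equal to the square root of the chi-squared statistic, so only their signs differ across cells.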