# Modeling categorical relationships in Python¶

So far we have discussed the general concept of statistical modeling and hypothesis testing, and applied them to some simple analyses. In this chapter we will focus on the modeling of *categorical* relationships, by which we mean relationships between variables that are measured qualitatively. These data are usually expressed in terms of counts; that is, for each value of the variable (or combination of values of multiple variables), how many observations take that value? For example, when we count how many people from each major are in our class, we are fitting a categorical model to the data.
As an example, we will use the NHANES dataset to ask whether there is a relationship between being a smoker and having ever had cancer (of any type).

```
from nhanes.load import load_NHANES_data
nhanes_data = load_NHANES_data()
adult_nhanes_data = nhanes_data.query('AgeInYearsAtScreening > 17')
# clean up smoking variables
adult_nhanes_data.loc[adult_nhanes_data['SmokedAtLeast100CigarettesInLife'] == 0, 'DoYouNowSmokeCigarettes'] = 'Not at all'
adult_nhanes_data.loc[:, 'SmokeNow'] = (adult_nhanes_data['DoYouNowSmokeCigarettes'] != 'Not at all')
categorical_df = adult_nhanes_data[['SmokeNow', 'EverToldYouHadCancerOrMalignancy']].dropna().astype('int').rename(columns={'EverToldYouHadCancerOrMalignancy': 'HadCancer'})
```

/opt/conda/lib/python3.8/site-packages/pandas/core/indexing.py:966: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy self.obj[item] = s /opt/conda/lib/python3.8/site-packages/pandas/core/indexing.py:845: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy self.obj[key] = _infer_fill_value(value)

## The Pearson Chi-squared test¶

The Pearson Chi-squared test is used to test for an association between two categorical variables, against the null hypothesis of independence. We will use the `statsmodels.stats.Table`

function for this, which has a number of useful features.

```
import statsmodels.api as sm
table = sm.stats.Table.from_data(categorical_df, shift_zeros=False)
table.table_orig
```

HadCancer | 0 | 1 |
---|---|---|

SmokeNow | ||

0 | 3836 | 470 |

1 | 876 | 81 |

We can also see the predicted frequencies under the null hypothesis of independence, which are stored in the `.fittedvalues`

element:

```
table.fittedvalues
```

HadCancer | 0 | 1 |
---|---|---|

SmokeNow | ||

0 | 3855.191336 | 450.808664 |

1 | 856.808664 | 100.191336 |

Using these, we can compute the chi-squared statistic:

```
import numpy as np
orig_vector = np.ravel(table.table_orig)
independence_vector = np.ravel(table.fittedvalues)
squared_resid = (orig_vector - independence_vector)**2
chi2 = np.sum(squared_resid/independence_vector)
chi2
```

```
5.018427710130295
```

We can confirm this by comparing it to the result from the built-in function to compute the association:

chi2_result = table.test_nominal_association() print(chi2_result)

We can also see the standardized residuals:

```
table.standardized_resids
```

HadCancer | 0 | 1 |
---|---|---|

SmokeNow | ||

0 | -2.240185 | 2.240185 |

1 | 2.240185 | -2.240185 |

This shows that there is an unexpectedly large number of people who smoke but don’t have cancer, and similarly an unexpectedly low number of smokers who report having had cancer before. Does this tell us that smoking results in lower rates of cancer?