# The General Linear Model¶

In this chapter we will explore how to fit general linear models in Python. We will focus on the tools provided by the `statsmodels`

package.

```
from nhanes.load import load_NHANES_data
nhanes_data = load_NHANES_data()
adult_nhanes_data = nhanes_data.query('AgeInYearsAtScreening > 17')
```

## Linear regression¶

To perform linear regression in Python, we use the `OLS()`

function (which stands for *ordinary least squares*) from the `statsmodels`

package. Let’s generate some simulated data and use this function to compute the linear regression solution.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def generate_linear_data(slope, intercept,
noise_sd=1, x=None,
npoints=100, seed=None):
"""
generate data with a given slope and intercept
and add normally distributed noise
if x is passed as an argument then a given x will be used,
otherwise it will be generated randomly
Returns:
--------
a pandas data frame with variables x and y
"""
if seed is not None:
np.random.seed(seed)
if x is None:
x = np.random.randn(npoints)
y = x * slope + intercept + np.random.randn(x.shape[0]) * noise_sd
return(pd.DataFrame({'x': x, 'y': y}))
slope = 1
intercept = 10
noise_sd = 1
simulated_data = generate_linear_data(slope, intercept, noise_sd, seed=1)
plt.scatter(simulated_data['x'], simulated_data['y'])
```

```
<matplotlib.collections.PathCollection at 0x7febd5275be0>
```

We can then perform linear regression on these data using the `ols`

function. This function doesn’t automatically include an intercept in its model, so we need to add one to the design. Fitting the model using this function is a two-step process. First, we set up the model and store it to a variable (which we will call `ols_model`

). Then, we actually fit the model, which generates the results that we store to a different variable called `ols_results`

, and view a summary using the `.summary()`

method of the results variable.

```
from statsmodels.formula.api import ols
ols_model = ols(formula='y ~ x + 1', data=simulated_data)
ols_result = ols_model.fit()
ols_result.summary()
```

Dep. Variable: | y | R-squared: | 0.522 |
---|---|---|---|

Model: | OLS | Adj. R-squared: | 0.517 |

Method: | Least Squares | F-statistic: | 107.0 |

Date: | Thu, 22 Dec 2022 | Prob (F-statistic): | 2.20e-17 |

Time: | 15:26:37 | Log-Likelihood: | -134.44 |

No. Observations: | 100 | AIC: | 272.9 |

Df Residuals: | 98 | BIC: | 278.1 |

Df Model: | 1 | ||

Covariance Type: | nonrobust |

coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|

Intercept | 10.1470 | 0.094 | 107.973 | 0.000 | 9.961 | 10.334 |

x | 1.0954 | 0.106 | 10.342 | 0.000 | 0.885 | 1.306 |

Omnibus: | 0.898 | Durbin-Watson: | 2.157 |
---|---|---|---|

Prob(Omnibus): | 0.638 | Jarque-Bera (JB): | 0.561 |

Skew: | -0.172 | Prob(JB): | 0.755 |

Kurtosis: | 3.127 | Cond. No. | 1.15 |

Warnings:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We should see three things in these results:

The estimate of the Intercept in the model should be very close to the intercept that we specified

The estimate for the x parameter should be very close to the slope that we specified

The residual standard deviation should be roughly similar to the noise standard deviation that we specified. The summary doesn’t report the residual standard deviation directly but we can compute it using the residuals that are stored in the

`.resid`

element in the result output:

```
ols_result.resid.std()
```

```
0.9328350998082221
```

## Model criticism and diagnostics¶

Once we have fitted the model, we want to look at some diagnostics to determine whether the model is actually fitting properly.

The first thing to examine is to make sure that the residuals are (at least roughly) normally distributed. We can do this using a Q-Q plot:

```
import seaborn as sns
import scipy.stats
_ = scipy.stats.probplot(ols_result.resid, plot=sns.mpl.pyplot)
```

This looks pretty good, in the sense that the residual data points fall very close to the unit line. This is not surprising, since we generated the data with normally distributed noise. We should also plot the predicted (or *fitted*) values against the residuals, to make sure that the model does work systematically better for some predicted values versus others.

```
plt.scatter(ols_result.fittedvalues, ols_result.resid)
plt.xlabel('Fitted value')
plt.ylabel('Residual')
```

```
Text(0, 0.5, 'Residual')
```

As expected, we see no clear relationship.

## Examples of problematic model fit¶

Let’s say that there was another variable at play in this dataset, which we were not aware of. This variable causes some of the cases to have much larger values than others, in a way that is unrelated to the X variable. We play a trick here using the `seq()`

function to create a sequence from zero to one, and then threshold those 0.5 (in order to obtain half of the values as zero and the other half as one) and then multiply by the desired effect size:

```
simulated_data.loc[:, 'x2'] = (simulated_data.index < (simulated_data.shape[0] / 2)).astype('int')
hidden_effect_size = 10
simulated_data.loc[:, 'y2'] = simulated_data['y'] + simulated_data['x2'] * hidden_effect_size
```

Now we fit the model again, and examine the residuals:

```
ols_model2 = ols(formula='y2 ~ x + 1', data=simulated_data)
ols_result2 = ols_model2.fit()
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
scipy.stats.probplot(ols_result2.resid, plot=sns.mpl.pyplot)
plt.subplot(1, 2, 2)
plt.scatter(ols_result2.fittedvalues, ols_result2.resid)
plt.xlabel('Fitted value')
plt.ylabel('Residual')
```

```
Text(0, 0.5, 'Residual')
```

The lack of normality is clear from the Q-Q plot, and we can also see that there is obvious structure in the residuals.

Let’s look at another potential problem, in which the y variable is nonlinearly related to the X variable. We can create these data by squaring the X variable when we generate the Y variable:

```
noise_sd = 0.1
simulated_data['y3'] = (simulated_data['x']**2) * slope + intercept + np.random.randn(simulated_data.shape[0]) * noise_sd
plt.scatter(simulated_data['x'], simulated_data['y3'])
```

```
<matplotlib.collections.PathCollection at 0x7febccb2f850>
```

```
ols_model3 = ols(formula='y3 ~ x + 1', data=simulated_data)
ols_result3 = ols_model3.fit()
ols_result3.summary()
```

Dep. Variable: | y3 | R-squared: | 0.008 |
---|---|---|---|

Model: | OLS | Adj. R-squared: | -0.002 |

Method: | Least Squares | F-statistic: | 0.8369 |

Date: | Thu, 22 Dec 2022 | Prob (F-statistic): | 0.363 |

Time: | 15:26:38 | Log-Likelihood: | -149.89 |

No. Observations: | 100 | AIC: | 303.8 |

Df Residuals: | 98 | BIC: | 309.0 |

Df Model: | 1 | ||

Covariance Type: | nonrobust |

coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|

Intercept | 10.7813 | 0.110 | 98.295 | 0.000 | 10.564 | 10.999 |

x | 0.1131 | 0.124 | 0.915 | 0.363 | -0.132 | 0.358 |

Omnibus: | 62.137 | Durbin-Watson: | 1.699 |
---|---|---|---|

Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 203.933 |

Skew: | 2.285 | Prob(JB): | 5.21e-45 |

Kurtosis: | 8.297 | Cond. No. | 1.15 |

Warnings:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Now we see that there is no significant linear relationship between \(X^2\) and Y/ But if we look at the residuals the problem with the model becomes clear:

plt.figure(figsize=(12, 6)) plt.subplot(1, 2, 1) scipy.stats.probplot(ols_result3.resid, plot=sns.mpl.pyplot)

plt.subplot(1, 2, 2) plt.scatter(ols_result3.fittedvalues, ols_result3.resid) plt.xlabel(‘Fitted value’) plt.ylabel(‘Residual’)

In this case we can see the clearly nonlinear relationship between the predicted and residual values, as well as the clear lack of normality in the residuals.

As we noted in the previous chapter, the “linear” in the general linear model doesn’t refer to the shape of the response, but instead refers to the fact that model is linear in its parameters — that is, the predictors in the model only get multiplied the parameters (e.g., rather than being raised to a power of the parameter). Here is how we would build a model that could account for the nonlinear relationship, by using `x**2`

in the model:

```
simulated_data.loc[:, 'x_squared'] = simulated_data['x'] ** 2
ols_model4 = ols(formula='y3 ~ x_squared + 1', data=simulated_data)
ols_result4 = ols_model4.fit()
ols_result4.summary()
```

Dep. Variable: | y3 | R-squared: | 0.992 |
---|---|---|---|

Model: | OLS | Adj. R-squared: | 0.992 |

Method: | Least Squares | F-statistic: | 1.243e+04 |

Date: | Thu, 22 Dec 2022 | Prob (F-statistic): | 4.87e-105 |

Time: | 15:26:38 | Log-Likelihood: | 92.204 |

No. Observations: | 100 | AIC: | -180.4 |

Df Residuals: | 98 | BIC: | -175.2 |

Df Model: | 1 | ||

Covariance Type: | nonrobust |

coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|

Intercept | 10.0215 | 0.012 | 841.568 | 0.000 | 9.998 | 10.045 |

x_squared | 0.9740 | 0.009 | 111.473 | 0.000 | 0.957 | 0.991 |

Omnibus: | 0.981 | Durbin-Watson: | 2.260 |
---|---|---|---|

Prob(Omnibus): | 0.612 | Jarque-Bera (JB): | 1.073 |

Skew: | 0.216 | Prob(JB): | 0.585 |

Kurtosis: | 2.735 | Cond. No. | 2.09 |

Warnings:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Now we see that the effect of \(X^2\) is significant, and if we look at the residual plot we should see that things look much better:

```
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
scipy.stats.probplot(ols_result4.resid, plot=sns.mpl.pyplot)
plt.subplot(1, 2, 2)
plt.scatter(ols_result4.fittedvalues, ols_result4.resid)
plt.xlabel('Fitted value')
plt.ylabel('Residual')
```

```
Text(0, 0.5, 'Residual')
```

Not perfect, but much better than before!

## Extending regression to binary outcomes.¶

```
from statsmodels.formula.api import logit
diabetes_df = adult_nhanes_data.query(
'DoctorToldYouHaveDiabetes != "Borderline"').dropna(
subset=['DoctorToldYouHaveDiabetes', 'AgeInYearsAtScreening', 'BodyMassIndexKgm2']).rename(
columns={'DoctorToldYouHaveDiabetes': 'Diabetes', 'AgeInYearsAtScreening': 'Age', 'BodyMassIndexKgm2': 'BMI'})
diabetes_df.loc[:, 'Diabetes'] = diabetes_df['Diabetes'].astype('int')
```

Now we would like to build a model that allows us to predict who has diabetes, based on their age and Body Mass Index (BMI). However, you may have noticed that the Diabetes variable is a binary variable; because linear regression assumes that the residuals from the model will be normally distributed, and the binary nature of the data will violate this, we instead need to use a different kind of model, known as a *logistic regression* model, which is built to deal with binary outcomes. We can fit this model using the `logit()`

function:

```
logitfit = logit(formula = 'Diabetes ~ Age + BMI', data = diabetes_df).fit(disp=0)
logitfit.summary()
```

Dep. Variable: | Diabetes | No. Observations: | 5267 |
---|---|---|---|

Model: | Logit | Df Residuals: | 5264 |

Method: | MLE | Df Model: | 2 |

Date: | Thu, 22 Dec 2022 | Pseudo R-squ.: | 0.1661 |

Time: | 15:26:38 | Log-Likelihood: | -1895.1 |

converged: | True | LL-Null: | -2272.6 |

Covariance Type: | nonrobust | LLR p-value: | 1.122e-164 |

coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|

Intercept | -7.3101 | 0.276 | -26.478 | 0.000 | -7.851 | -6.769 |

Age | 0.0622 | 0.003 | 21.763 | 0.000 | 0.057 | 0.068 |

BMI | 0.0693 | 0.005 | 12.769 | 0.000 | 0.059 | 0.080 |

This looks very similar to the output from the `ols()`

function, and it shows us that there is a significant relationship between the age, weight, and diabetes. The model provides us with a predicted probability that each individual will have diabetes; if this is greater than 0.5, then that means that the model predicts that the individual is more likely than not to have diabetes.

We can start by simply comparing those predictions to the actual outcomes.

```
diabetes_df.loc[:, 'LogitPrediction'] = (logitfit.predict() > 0.5).astype('int')
pd.crosstab(diabetes_df['Diabetes'], diabetes_df['LogitPrediction'])
```

LogitPrediction | 0 | 1 |
---|---|---|

Diabetes | ||

0 | 4400 | 50 |

1 | 756 | 61 |

This table shows that the model did somewhat well, in that it labeled most non-diabetic people as non-diabetic, and most diabetic people as diabetic. However, it also made a lot of mistakes, mislabeling nearly half of all diabetic people as non-diabetic.

We would often like a single number that tells us how good our prediction is. We could simply ask how many of our predictions are correct on average:

```
(diabetes_df['LogitPrediction'] == diabetes_df['Diabetes']).mean()
```

```
0.8469717106512246
```

This tells us that we are doing fairly well at prediction, with over 80% accuracy. However, this measure is problematic, because most people in the sample don’t have diabetes. This means that we could get relatively high accuracy if we simply said that no one has diabetes:

```
(np.zeros(diabetes_df.shape[0]) == diabetes_df['Diabetes']).mean()
```

```
0.844883235238276
```

One commonly used value when we have a graded prediction (as we do here, with the probabiilty that is predicted by the model) is called the *area under the receiver operating characteristic* or *AUROC*. This is a number that ranges from zero to one, where 0.5 means that we are guessing, and one means that our predictions are perfect. Let’s see what that comes out to for this dataset, using the `roc_auc_score`

from the scikit-learn package:

```
from sklearn.metrics import roc_auc_score
rocscore = roc_auc_score(diabetes_df['Diabetes'], logitfit.predict())
rocscore
```

```
0.7854712362301102
```

Our model performs relatively well according to this score. What if we wanted to know whether this is better than chance? One option would be to create a null model, in which we purposely break the relationship between our variables. We could then ask how likely our observed score would be if there is no true relationship.

```
from sklearn.utils import shuffle
shuffled_df = diabetes_df.copy()
num_runs = 1000
roc_scores = pd.DataFrame({'auc': np.zeros(num_runs)})
for simulation_run in range(num_runs):
# shuffle the diabetes labels in order to break the relationship
shuffled_df.loc[:, 'Diabetes'] = shuffle(shuffled_df['Diabetes'].values)
randomfit = logit(formula = 'Diabetes ~ Age + BMI', data = shuffled_df).fit(disp=0)
roc_scores.loc[simulation_run, 'auc'] = roc_auc_score(shuffled_df['Diabetes'], randomfit.predict())
pvalue = (100 - scipy.stats.percentileofscore(roc_scores['auc'], rocscore))/100
pvalue
```

```
0.0
```

This shows us that our observed score is higher than all of 1000 scores obtained using random permutations. Thus, we can conclude that our accuracy is greater than chance. However, this doesn’t tell us how well we can predict whether a *new* individual will have diabetes. This is what we turn to next.

## Cross-validation¶

Cross-validation is a powerful technique that allows us to estimate how well our results will generalize to a new dataset. Here we will build our own crossvalidation code to see how it works, continuing the logistic regression example from the previous section.
In cross-validation, we want to split the data into several subsets and then iteratively train the model while leaving out each subset (which we usually call *folds*) and then test the model on that held-out fold.
We can use one of the tools from the scikit-learn package to create our cross-validation folds for us. Let’s start by using 10-fold crossvalidation, in which we split the data into 10 parts, and the fit the model while holding out one of those parts and then testing it on the held-out data.

```
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, shuffle=True)
diabetes_df['Predicted'] = np.nan
for train_index, test_index in kf.split(diabetes_df):
train_data = diabetes_df.iloc[train_index, :]
test_data = diabetes_df.iloc[test_index, :]
model = logit(formula = 'Diabetes ~ Age + BMI', data = train_data)
trainfit = model.fit(disp=0)
diabetes_df['Predicted'].iloc[list(test_index)] = trainfit.predict(
test_data[['Age', 'BMI']])
print(pd.crosstab(diabetes_df['Diabetes'], diabetes_df['Predicted']>0.5))
roc_auc_score(diabetes_df['Diabetes'], diabetes_df['Predicted'])
```

```
Predicted False True
Diabetes
0 4397 53
1 759 58
```

/opt/conda/lib/python3.8/site-packages/pandas/core/indexing.py:671: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy self._setitem_with_indexer(indexer, value)

```
0.7841971861977913
```

This result shows that our model is able to generalize to new individuals relatively well — in fact, almost as well as the original model. This is because our sample size is very large; with smaller samples, the generalization performance is usually much less using crossvalidation than using the full sample.