# Probability

In this chapter we will go over how to perform probability computations in Python.

## Basic probability calculations

Let’s create a vector of outcomes from one to 6, using the `np.arange()` function to create such a sequence. This function takes the minimum and maximum values as its inputs, but note that the maximum is not included in the sequence; that is, the sequence goes up to but not including the maximum. Thus, we have to give 1 and 7 as the minimum and maximum in order to get a sequence of numbers from 1 to 6:

```
import numpy as np
outcomes = np.arange(1, 7)
outcomes
```

```
array([1, 2, 3, 4, 5, 6])
```

Now let’s create a vector of logical values based on whether the outcome in each position is equal to 1. Remember that `==` tests for equality of each element in a vector:

```
outcome1isTrue = outcomes == 1
outcome1isTrue
```

```
array([ True, False, False, False, False, False])
```

Remember that the simple probability of an outcome is the number of occurrences of the outcome divided by the total number of events. To compute a probability, we can take advantage of the fact that `True`/`False` are equivalent to 1/0 in Python. The formula for the mean (the sum of values divided by the number of values) is thus exactly the same as the formula for the simple probability! So, we can compute the probability of the event by simply taking the mean of the logical vector.

```
p1isTrue = np.mean(outcome1isTrue)
p1isTrue
```

```
0.16666666666666666
```
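To make the equivalence explicit, we can also count the `True` values directly and divide by the total number of outcomes; this small sketch confirms that both computations give the same answer:

```
import numpy as np

outcomes = np.arange(1, 7)          # faces of a die: 1 through 6
outcome1isTrue = outcomes == 1      # logical vector marking the outcome "1"

# Probability as count of occurrences divided by number of events
p_by_count = np.sum(outcome1isTrue) / len(outcomes)

# Probability as the mean of the logical vector
p_by_mean = np.mean(outcome1isTrue)

print(p_by_count, p_by_mean)  # both are 1/6 = 0.1666...
```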

## Empirical frequency

Let’s walk through how to compute the empirical frequency of rain in San Francisco.

First we load the data:

```
import pandas as pd
SFrain = pd.read_csv('https://raw.githubusercontent.com/statsthinking21/statsthinking21-python/master/notebooks/data/SanFranciscoRain.csv')
# we will remove the STATION and NAME variables,
# since they are identical for all rows
SFrain = SFrain.drop(columns=['STATION', 'NAME'])
SFrain
```

| | DATE | PRCP |
|---|---|---|
| 0 | 2017-01-01 | 0.05 |
| 1 | 2017-01-02 | 0.10 |
| 2 | 2017-01-03 | 0.40 |
| 3 | 2017-01-04 | 0.89 |
| 4 | 2017-01-05 | 0.01 |
| ... | ... | ... |
| 360 | 2017-12-27 | 0.00 |
| 361 | 2017-12-28 | 0.00 |
| 362 | 2017-12-29 | 0.00 |
| 363 | 2017-12-30 | 0.00 |
| 364 | 2017-12-31 | 0.00 |

365 rows × 2 columns

We see that the data frame contains a variable called `PRCP`, which denotes the amount of rain each day. Let’s create a new variable called `rainToday` that denotes whether the amount of precipitation was above zero:

```
SFrain['rainToday'] = SFrain['PRCP'] > 0
SFrain
```

| | DATE | PRCP | rainToday |
|---|---|---|---|
| 0 | 2017-01-01 | 0.05 | True |
| 1 | 2017-01-02 | 0.10 | True |
| 2 | 2017-01-03 | 0.40 | True |
| 3 | 2017-01-04 | 0.89 | True |
| 4 | 2017-01-05 | 0.01 | True |
| ... | ... | ... | ... |
| 360 | 2017-12-27 | 0.00 | False |
| 361 | 2017-12-28 | 0.00 | False |
| 362 | 2017-12-29 | 0.00 | False |
| 363 | 2017-12-30 | 0.00 | False |
| 364 | 2017-12-31 | 0.00 | False |

365 rows × 3 columns

Now we will summarize the data to compute the probability of rain:

```
pRainInSF = SFrain['rainToday'].mean()
pRainInSF
```

```
0.2
```
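An equivalent way to get this result is `value_counts(normalize=True)`, which reports the relative frequency of each value in a series. Here is a small sketch using made-up precipitation values (not the San Francisco data), just to show that the mean of the Boolean series matches the relative frequency of `True`:

```
import pandas as pd

# Hypothetical precipitation values for illustration (inches per day)
prcp = pd.Series([0.05, 0.0, 0.1, 0.0, 0.0])
rain_today = prcp > 0

# Probability of rain as the mean of the Boolean series
p_rain = rain_today.mean()

# ...which matches the relative frequency of True from value_counts
freqs = rain_today.value_counts(normalize=True)
print(p_rain, freqs[True])  # both are 0.4 (2 rainy days out of 5)
```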

## Conditional probability

Let’s determine the conditional probability of someone having hearing problems, given that they are over 70 years of age, using the NHANES dataset. First, let’s create a new variable called `Over70` that denotes whether each individual is over 70 or not.

```
from nhanes.load import load_NHANES_data
nhanes_data = load_NHANES_data()
nhanes_data['Over70'] = nhanes_data['AgeInYearsAtScreening'] > 70
```

Now let’s create a cleaned-up dataset that only includes the `Over70` variable along with the variable called `HaveSeriousDifficultyHearing`, which denotes whether a person reports having serious hearing difficulty (coded as 1 for “yes” and 0 for “no”).

```
hearing_data = nhanes_data[['Over70', 'HaveSeriousDifficultyHearing']].dropna()
hearing_data
```

| SEQN | Over70 | HaveSeriousDifficultyHearing |
|---|---|---|
| 93703.0 | False | 0.0 |
| 93704.0 | False | 0.0 |
| 93705.0 | False | 0.0 |
| 93706.0 | False | 0.0 |
| 93707.0 | False | 0.0 |
| ... | ... | ... |
| 102952.0 | False | 0.0 |
| 102953.0 | False | 0.0 |
| 102954.0 | False | 0.0 |
| 102955.0 | False | 0.0 |
| 102956.0 | False | 0.0 |

8365 rows × 2 columns

First, what’s the probability of being over 70?

```
p_over_70 = hearing_data['Over70'].mean()
p_over_70
```

```
0.10531978481769277
```

Second, what’s the probability of having hearing problems?

```
p_hearing_problem = hearing_data['HaveSeriousDifficultyHearing'].mean()
p_hearing_problem
```

```
0.06276150627615062
```

What’s the probability for each combination of hearing problems/no problems and over 70/not? We can create a table that finds the joint probability for each combination, using the `pd.crosstab()` function:

```
joint_table = pd.crosstab(hearing_data.Over70, hearing_data['HaveSeriousDifficultyHearing'], normalize=True)
joint_table
```

| Over70 | HaveSeriousDifficultyHearing = 0.0 | HaveSeriousDifficultyHearing = 1.0 |
|---|---|---|
| False | 0.859653 | 0.035027 |
| True | 0.077585 | 0.027735 |
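Because `normalize=True` makes the entries of the table sum to 1, summing across rows or columns recovers the marginal probabilities we computed above. A quick sketch using the values printed in the table (rounded to six decimal places):

```
import numpy as np

# Joint probabilities as printed above
# (rows: Over70 False/True; columns: no hearing problem / hearing problem)
joint = np.array([[0.859653, 0.035027],
                  [0.077585, 0.027735]])

p_over_70 = joint[1].sum()             # row sum: marginal P(over 70)
p_hearing_problem = joint[:, 1].sum()  # column sum: marginal P(hearing problem)
print(p_over_70, p_hearing_problem)    # ~0.1053 and ~0.0628, matching the values above
```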

Finally, what’s the probability of someone having hearing problems, given that they are over 70 years of age? To do this, we limit the computation of the probability of having hearing problems to only include those people who are over 70:

```
p_hearingproblem_given_over_70 = hearing_data.query('Over70 == True')['HaveSeriousDifficultyHearing'].mean()
p_hearingproblem_given_over_70
```

```
0.2633371169125993
```
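The same conditional probability can also be read off the joint table: divide the joint probability of being over 70 *and* having a hearing problem by the marginal probability of being over 70. A sketch using the printed (rounded) values:

```
# Joint probabilities from the crosstab above (rounded)
p_joint_over70_hearing = 0.027735     # P(over 70 AND hearing problem)
p_joint_over70_no_hearing = 0.077585  # P(over 70 AND no hearing problem)

# P(hearing problem | over 70) = joint probability / marginal probability
p_over_70 = p_joint_over70_hearing + p_joint_over70_no_hearing
p_hearing_given_over70 = p_joint_over70_hearing / p_over_70
print(p_hearing_given_over70)  # ~0.2633, matching the query-based result
```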

Now compute the opposite: What is the probability of being over 70 given that one has a hearing problem?

```
p_over_70_given_hearingproblem = hearing_data.query('HaveSeriousDifficultyHearing == True')['Over70'].mean()
p_over_70_given_hearingproblem
```

```
0.4419047619047619
```
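These two conditional probabilities are linked by Bayes’ rule: P(over 70 | hearing problem) = P(hearing problem | over 70) × P(over 70) / P(hearing problem). Plugging in the values computed above confirms the relationship:

```
# Quantities computed earlier in this section
p_over_70 = 0.10531978481769277
p_hearing_problem = 0.06276150627615062
p_hearing_given_over70 = 0.2633371169125993

# Bayes' rule: invert the direction of conditioning
p_over70_given_hearing = p_hearing_given_over70 * p_over_70 / p_hearing_problem
print(p_over70_given_hearing)  # ~0.4419, matching the direct computation
```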