Chapter 5 Probability in R (with Lucy King)

In this chapter we will go over probability computations in R.

5.1 Basic probability calculations

Let’s create a vector of outcomes from one to 6, using the seq() function to create such a sequence:

outcomes <- seq(1, 6)
outcomes

## [1] 1 2 3 4 5 6

Now let’s create a vector of logical values based on whether the outcome in each position is equal to 1. Remember that == tests for equality of each element in a vector:

outcome1isTrue <- outcomes == 1 
outcome1isTrue

## [1]  TRUE FALSE FALSE FALSE FALSE FALSE

Remember that the simple probability of an outcome is number of occurrences of the outcome divided by the total number of events. To compute a probability, we can take advantage of the fact that TRUE/FALSE are equivalent to 1/0 in R. The formula for the mean (sum of values divided by the number of values) is thus exactly the same as the formula for the simple probability! So, we can compute the probability of the event by simply taking the mean of the logical vector.

p1isTrue <- mean(outcome1isTrue)
p1isTrue

## [1] 0.1666667

5.2 Empirical frequency (Section 5.2)

Let’s walk through how we computed empirical frequency of rain in San Francisco.

First we load the data:

# we will remove the STATION and NAME variables 
# since they are identical for all rows
SFrain <- read_csv("data/SanFranciscoRain/1329219.csv") %>% 
  dplyr::select(-STATION, -NAME)
  
glimpse(SFrain)

## Rows: 365
## Columns: 2
## $ DATE <date> 2017-01-01, 2017-01-02, 2017-01-03, 2017-01-…
## $ PRCP <dbl> 0.05, 0.10, 0.40, 0.89, 0.01, 0.00, 0.82, 1.4…

We see that the data frame contains a variable called PRCP which denotes the amount of rain each day. Let’s create a new variable called rainToday that denotes whether the amount of precipitation was above zero:

SFrain <- 
  SFrain %>%
  mutate(rainToday = as.integer(PRCP > 0))

glimpse(SFrain)

## Rows: 365
## Columns: 3
## $ DATE      <date> 2017-01-01, 2017-01-02, 2017-01-03, 201…
## $ PRCP      <dbl> 0.05, 0.10, 0.40, 0.89, 0.01, 0.00, 0.82…
## $ rainToday <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0…

Now we will summarize the data to compute the probability of rain:

pRainInSF <- 
  SFrain %>%
  summarize(
    pRainInSF = mean(rainToday)
  ) %>%
  pull()

pRainInSF

## [1] 0.2

5.3 Conditional probability (Section 5.3)

Let’s determine the conditional probability of someone being unhealthy, given that they are over 70 years of age, using the NHANES dataset. Let’s create a new data frame that only contains people over 70 years old.

healthDataFrame <-
  NHANES %>%
  mutate(
    Over70 = Age > 70,
    Unhealthy = DaysPhysHlthBad > 0
  ) %>%
  dplyr::select(Unhealthy, Over70) %>%
  drop_na()

glimpse(healthDataFrame)

## Rows: 4,891
## Columns: 2
## $ Unhealthy <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, …
## $ Over70    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…

First, what’s the probability of being over 70?

pOver70 <- 
  healthDataFrame %>%
  summarise(pOver70 = mean(Over70)) %>% 
  pull()

# to obtain the specific value, we need to extract it from the data frame

pOver70

## [1] 0.1106113

Second, what’s the probability of being unhealthy?

pUnhealthy <- 
  healthDataFrame %>%
  summarise(pUnhealthy = mean(Unhealthy)) %>% 
  pull()

pUnhealthy

## [1] 0.3567778

What’s the probability for each combination of unhealthy/healthy and over 70/ not? We can create a new variable that finds the joint probability by multiplying the two individual binary variables together; since anything times zero is zero, this will only have the value 1 for any case where both are true.

pBoth <- healthDataFrame %>% 
  mutate(
    both = Unhealthy*Over70
  ) %>%
  summarise(
    pBoth = mean(both)) %>% 
  pull()

pBoth

## [1] 0.04252709

Finally, what’s the probability of someone being unhealthy, given that they are over 70 years of age?

pUnhealthyGivenOver70 <-
  healthDataFrame %>%
  filter(Over70 == TRUE) %>% # limit to Over70
  summarise(pUnhealthy = mean(Unhealthy)) %>% 
  pull()

pUnhealthyGivenOver70

## [1] 0.3844732

# compute the opposite:
# what the probability of being over 70 given that 
# one is unhealthy?
pOver70givenUnhealthy <-
  healthDataFrame %>%
  filter(Unhealthy == TRUE) %>% # limit to Unhealthy
  summarise(pOver70 = mean(Over70)) %>% 
  pull()

pOver70givenUnhealthy

## [1] 0.1191977