Preface

This book is a companion to Statistical Thinking for the 21st Century, an open source statistical textbook. It focuses on the use of the R statistical programming language for statistics and data analysis.

0.1 Why R?

In my statistics course, students learn to analyze data hands-on using the R language. The question “Why R?” could be interpreted to mean “Why R instead of a graphical software package like (insert name here)?”. After all, most of the students who enroll in my class have never programmed before, so teaching them to program is going to take away from instruction in statistical concepts. My answer is that I think that the best way to learn statistical tools is to work directly with data, and that working with graphical packages insulates one from the data and methods in way that impedes true understanding. In addition, for many of the students in my class this may be the only course in which they are exposed to programming; given that programming is an essential ability in a growing number of academic fields, I think that providing these students with basic programming literacy is critical to their future success, and will hopefully inspire at least a few of them to learn more.

The question could also be interpreted to mean “Why R instead of (insert language here)?”. On this question I am much more conflicted, because I deeply dislike R as a programming language (I greatly prefer to use Python for my own work). Why then do I use R? The first reason is that R has become the “lingua franca” for statistical analysis. There are a number of tools that I use in this book (such as the linear modeling tools in the lme4 package and the Bayes factor tools in the BayesFactor package) that are simply not available in other languages.

The second reason is that the free Rstudio software makes using R relatively easy for new users. In particular, I like the RMarkdown Notebook feature that allows the mixing of narrative and executable code with integrated output. It’s similar in spirit to the Jupyter notebooks that many of us use for Python programming, but I find it easier to deal with because you edit it as a plain text file, rather than through an HTML interface. In my class, I give students a skeleton RMarkdown file for problem sets, and they submit the file with their solution added, which I then score using a set of automated grading scripts.

That said, there are good reasons to prefer a real programming language over R; my preferred language is Python, and a parallel Python companion to this one is currently under development.

0.2 The golden age of data

Throughout this book I have tried when possible to use examples from real data. This is now very easy because we are swimming in open datasets, as governments, scientists, and companies are increasingly making data freely available. I think that using real datasets is important because it prepares students to work with real data rather than toy datasets, which I think should be one of the major goals of statistical training. It also helps us realize (as we will see at various points throughout the book) that data don’t always come to us ready to analyze, and often need wrangling to help get them into shape. Using real data also shows that the idealized statistical distributions often assumed in statistical methods don’t always hold in the real world – for example, as we will see in Chapter 2, distributions of some real-world quantities (like the number of friends on Facebook) can have very long tails that can break many standard assumptions.

I apologize up front that the datasets are heavily US-centric. This is primarily because the best dataset for many of the demonstrations is the National Health and Nutrition Examination Surveys (NHANES) dataset that is available as an R package, and because many of the other complex datasets included in R (such as those in the fivethirtyeight package) are also based in the US. If you have suggestions for datasets from other regions, please pass them along to me!

0.3 An open source book

This book is meant to be a living document, which is why its source is available online at https://github.com/statsthinking21/statsthinking21-R. If you find any errors in the book or want to make a suggestion for how to improve it, please open an issue on the Github site. Even better, submit a pull request with your suggested change.

This book is licensed using the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License. Please see the terms of that license for more details.

0.4 Acknowledgements

Thanks to everyone who has contributed to this project.

An R companion to Statistical Thinking for the 21st Century