Reproducible Research

A core challenge of science is the (in)ability to reproduce other scholars’ research. Empowering others to reproduce our research helps us identify and correct shortcomings in our own analysis. Incorporating reproducible research in our own analyses encourages us to be rigorous and transparent in our scholarship. As a scholarly community, we are aware of the benefits of reproducible research. Paradoxically, though, we have done little to make it easier for others to attempt replicate our studies. Our inaction has, according to a 2016 study in Nature, led to a “crisis” in scholarship.

One way to mitigate this crisis is to “bake” reproducibility into our analyses. R Markdown does exactly this. R Markdown – the text editor I am using to create this file – allows you to both write your papers and conduct your analysis all in one place. You can choose to include or not include your analysis code in your compiled document. For the final draft of a journal publication, for example, you might not include your code. For your personal website or github page, though, you may want to include a version of the article with the code as well as the data you used in the article. You can also share the R Markdown plain text from which you compiled your article. When you share an R Markdown file along with your data, other scholars can immediately attempt to reproduce your research!

R Markdown has other advantages. It allows you to use LaTeX when you need it but use infinitely easier Markdown for the majority of your writing. I noted in our R / LaTeX lab that I use LaTeX’s slide typesetting class, Beamer, to produce all my slides. This is true… but only part of the story. While it’s relatively easy to write articles in LaTeX, the amount of code required to set up slides in beamer gets tiring. R Markdown allows you to write your slides in the simpler markdown format, but then pass these slides through a beamer template. I have one LaTeX beamer template for my academic presentations – a template I hacked from Professor McGrath–which is, in turn, a template Professor McGrath hacked from Professor Miller.

R Markdown compiles to multiple formats – html, MS Word, and PDF. R Markdown allows you to quickly create updated reports. Imagine, for example, you needed to brief the president every day on country and state-level changes in COVID-19 infections and fatalities. Rather than loading the updated data set into your stats program, rerunning your analysis, and outputting new tables and graphs into your slideware or word processing program, with R Markdown you simply need to run one command, “knit,” and you will have an updated report with the latest COVID-19 data. As long as you have configured R to pull the latest data from a dynamic website – for example, Johns Hopkins Corona Virus Resource Center – your reports will always be up to date.

There are start up costs to R Markdown. R and R Markdown do take time to learn. If you are studying for comprehensive or field exams, trying to make it through your first year in the PhD program, or simultaneously working, going to grad school, and homeschooling kids, you likely have more pressing demands that deserve your attention. If you enjoy coding or think you might enjoy learning how to code – and want to substitute movie entertainment with coding fun – then you may find the dive into R and R Markdown to be rewarding and, down the road, a potential time saver.

There is an overwhelming amount of resources for opensource applications like R and R Markdown. A good place to start is a recent entertaining talk Garrett Grolemund,senior data scientist at R Studio, delivered.



Grolemund is also co-author of the excellent and free R Markdown: The Definitive Guide.

Enough about R Markdown… on to the analysis!

Analysis of Washington Post’s Police Shootings Data – Introduction

We did not have time to cover the Post’s Police Shootings data. I present the analysis of this data here. The Post’s data is available at:

https://github.com/washingtonpost/data-police-shootings/releases/download/v0.1/fatal-police-shootings-data.csv

The codebook can be found in this readme.md file: https://github.com/washingtonpost/data-police-shootings/blob/master/README.md

The overall GitHub repository is here: https://github.com/washingtonpost/data-police-shootings

I load the data into my R session using the following:

fatal <- read.csv("https://github.com/washingtonpost/data-police-shootings/releases/download/v0.1/fatal-police-shootings-data.csv", na.strings=c("","NA"))

A note about the above function – I am reading the csv (comma-separated values) file into R, assigning it to the object, fatal, and telling R that blank cells as well as any cells that have NA values should be treated as NA. NA is what we use to indicate missing data.

I will be using the tidyverse package for analysis. You can find more on tidyverse here: https://www.tidyverse.org/. If you have not installed tidyverse, you can by running the following command: install.packages(“tidyverse”). I have tidyverse already installed. I only need to load it through the command:

library(tidyverse)
library(ggplot2)

Let’s take a look at our data frame. In the Global Environment you will see that this is a large data frame – (5680 observations on 17 variables). We don’t want to look at all of these. The head function will allow us to look at the first few observations.

head(fatal)
##   id               name       date  manner_of_death      armed age gender race
## 1  3         Tim Elliot 2015-01-02             shot        gun  53      M    A
## 2  4   Lewis Lee Lembke 2015-01-02             shot        gun  47      M    W
## 3  5 John Paul Quintero 2015-01-03 shot and Tasered    unarmed  23      M    H
## 4  8    Matthew Hoffman 2015-01-04             shot toy weapon  32      M    W
## 5  9  Michael Rodriguez 2015-01-04             shot   nail gun  39      M    H
## 6 11  Kenneth Joe Brown 2015-01-04             shot        gun  18      M    W
##            city state signs_of_mental_illness threat_level        flee
## 1       Shelton    WA                    True       attack Not fleeing
## 2         Aloha    OR                   False       attack Not fleeing
## 3       Wichita    KS                   False        other Not fleeing
## 4 San Francisco    CA                    True       attack Not fleeing
## 5         Evans    CO                   False       attack Not fleeing
## 6       Guthrie    OK                   False       attack Not fleeing
##   body_camera longitude latitude is_geocoding_exact
## 1       False  -123.122   47.247               True
## 2       False  -122.892   45.487               True
## 3       False   -97.281   37.695               True
## 4       False  -122.422   37.763               True
## 5       False  -104.692   40.384               True
## 6       False   -97.423   35.877               True

Data Analysis

Now that we have a sense of the data frame, we can begin exploring the data. Let’s begin by looking at frequencies for the age of fatal victims:

table(fatal$age)
## 
##   6  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30 
##   2   1   1   3  13  31  51  98  86  97 107 122 134 162 189 160 182 163 177 167 
##  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50 
## 195 176 177 175 163 164 156 143 136 117 125 102  96  85 116  98  98  94  90  88 
##  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70 
##  81  65  70  61  61  63  53  52  62  35  38  39  31  21  26  16  20  16  17  15 
##  71  72  73  74  75  76  77  78  79  80  81  82  83  84  86  88  89  91 
##  10   6   6   5   4  11   5   1   1   2   3   2   3   4   2   1   1   1
table(fatal$age, useNA = "always") 
## 
##    6   12   13   14   15   16   17   18   19   20   21   22   23   24   25   26 
##    2    1    1    3   13   31   51   98   86   97  107  122  134  162  189  160 
##   27   28   29   30   31   32   33   34   35   36   37   38   39   40   41   42 
##  182  163  177  167  195  176  177  175  163  164  156  143  136  117  125  102 
##   43   44   45   46   47   48   49   50   51   52   53   54   55   56   57   58 
##   96   85  116   98   98   94   90   88   81   65   70   61   61   63   53   52 
##   59   60   61   62   63   64   65   66   67   68   69   70   71   72   73   74 
##   62   35   38   39   31   21   26   16   20   16   17   15   10    6    6    5 
##   75   76   77   78   79   80   81   82   83   84   86   88   89   91 <NA> 
##    4   11    5    1    1    2    3    2    3    4    2    1    1    1  259

Look at the two commands above. What is different about the code of these commands? And what is different between the results for the two commands above?

NA means “Not Available” – this indicates missing data. The second command, table(fatal$age, useNA = “always”), tells R to show us the missing data. Here we have 254 cases with no information on the age variable.

Now let’s see what the average victim age is..

mean(fatal$age) 
## [1] NA

This command returns NA b.c. there is missing data in the age variable. R will not calculate averages using mean function if NAs are included. Let’s remove the NAs using the na.rm = TRUE option and try again.

mean(fatal$age, na.rm = TRUE)
## [1] 37.1422

Another way to get descriptive statistics is by using the summary function

summary(fatal$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6.00   27.00   35.00   37.14   46.00   91.00     259

Note in the output that the summary function R automatically handles the NAs

What if we were interested in the mean victim age of not the entire population, but just women? Just men?

mean(fatal$age[fatal$gender=="F"], na.rm = TRUE) # women
## [1] 37.36992
mean(fatal$age[fatal$gender=="M"], na.rm = TRUE) # men 
## [1] 37.13454

What about means for men by race? First we need to know how the race variable is coded. The table function will give us a quick sense:

table(fatal$race, useNA = "always")
## 
##    A    B    H    N    O    W <NA> 
##   94 1344  946   80   47 2593  605

I’m not entirely sure about the coding here. Whenever you are not sure, turn to the codebook! Thankfully, the Post explains the codes in the readme.md file in the repository. The readme.md file explains:

race:

Know that we know how the race variable is coded, we can analyze:

mean(fatal$age[fatal$gender=="M" & fatal$race=="B"], na.rm = TRUE)
## [1] 32.51415
mean(fatal$age[fatal$gender=="M" & fatal$race=="W"], na.rm = TRUE)
## [1] 39.95759
mean(fatal$age[fatal$gender=="M" & fatal$race=="H"], na.rm = TRUE)
## [1] 33.61926
mean(fatal$age[fatal$gender=="M" & fatal$race=="A"], na.rm = TRUE)
## [1] 36.48276
mean(fatal$age[fatal$gender=="M" & fatal$race=="N"], na.rm = TRUE)
## [1] 31.86486

There is some interesting variation here…. maybe a puzzle worth exploring.

Data Visualization

Note: I am not the first person to play with this data. Code for plots here unabashedly scraped and modified from code available on the web, for example, this lab assignment.

Let’s look at fatalities by race

ggplot(data = fatal, aes(y = race)) + 
        geom_bar(aes(fill = ..count..)) +
        theme_minimal(base_size = 13) +
        theme(legend.position = "none") +
        scale_x_continuous(expand=c(0,0)) +
        scale_fill_gradient(low = "yellow", high = "red") +
        labs(y = NULL, x = "Number of Deaths")

Deaths by age:

ggplot(data = fatal, aes(x = age)) + 
        geom_histogram(aes(fill = ..count..), bins = 20) +
        theme_minimal(base_size = 13) +
        theme(legend.position = "none") +
        scale_x_continuous(expand=c(0,0)) +
        scale_fill_gradient(low = "yellow", high = "red") +
        labs(x = "Age at Death", y = "Number of Deaths")
## Warning: Removed 259 rows containing non-finite values (stat_bin).

Deaths by gender:

ggplot(data = fatal, aes(y = gender)) + 
        geom_bar(aes(fill = ..count..)) +
        theme_minimal(base_size = 13) +
        theme(legend.position = "none") +
        scale_x_continuous(expand=c(0,0)) +
        scale_fill_gradient(low = "yellow", high = "red") +
        labs(y = NULL, x = "Number of Deaths")

Fatal shootings of suspects with / without mental illness:

ggplot(data = fatal, aes(y = signs_of_mental_illness)) + 
        geom_bar(aes(fill = ..count..)) +
        theme_minimal(base_size = 13) +
        theme(legend.position = "none") +
        scale_x_continuous(expand=c(0,0)) +
        scale_fill_gradient(low = "yellow", high = "red") +
        labs(y = NULL, x = "Number of Deaths")

Was officer wearing a camera?

ggplot(data = fatal, aes(y = body_camera)) + 
        geom_bar(aes(fill = ..count..)) +
        theme_minimal(base_size = 13) +
        theme(legend.position = "none") +
        scale_x_continuous(expand=c(0,0)) +
        scale_fill_gradient(low = "yellow", high = "red") +
        labs(y = NULL, x = "Number of Deaths")

Fatal Shootings by State

stateinfo <- fatal %>% group_by(state) %>% summarise(n = n()) %>% ## creating new dataframe that summarizes fatal shootings by state
        arrange(desc(n)) %>% top_n(15) %>% 
        mutate(state = factor(state, levels = rev(unique(state))))

stateinfo # take a look at dataframe
## # A tibble: 15 x 2
##    state     n
##    <fct> <int>
##  1 CA      852
##  2 TX      506
##  3 FL      379
##  4 AZ      262
##  5 CO      209
##  6 GA      197
##  7 OK      171
##  8 NC      168
##  9 WA      163
## 10 OH      159
## 11 TN      149
## 12 MO      146
## 13 LA      116
## 14 PA      113
## 15 NM      112
# visualize

ggplot(stateinfo, aes(x = n, y = state)) +        
        geom_bar(stat="identity", aes(fill = n)) +
        #geom_stateface(aes(y=state, x=7, label=as.character(state)), colour="white", size=8) +
        geom_text(aes(x = 17, y = state, label=as.character(state)), color="white", size=4) +
        labs(y = NULL, x = "Number of Deaths") +
        scale_fill_gradient(low = "green", high = "red") +
        theme_minimal(base_size = 13) +
        theme(axis.text.y=element_blank()) +
        theme(legend.position = "none") +
        scale_x_continuous(expand=c(0,0))

How Was the Person Killed?

ggplot(data = fatal, aes(y = manner_of_death)) + 
        geom_bar(aes(fill = ..count..)) +
        theme_minimal(base_size = 13) +
        theme(legend.position = "none") +
        scale_x_continuous(expand=c(0,0)) +
        scale_fill_gradient(low = "yellow", high = "red") +
        labs(y = NULL, x = "Number of Deaths")

Concluding Thoughts

R is an incredibly powerful and flexible platform. It is also a platform that, when combined with R Markdown, greatly facilitates transparency and reproducibility.

R, admittedly, can be overwhelming. The R universe is vast and one can be sucked into black holes. Many find the adventure, though at times disorienting, worth it. I certainly have.

Live long, and prosper.