A core challenge of science is the (in)ability to reproduce other scholars’ research. Empowering others to reproduce our research helps us identify and correct shortcomings in our own analysis. Incorporating reproducible research in our own analyses encourages us to be rigorous and transparent in our scholarship. As a scholarly community, we are aware of the benefits of reproducible research. Paradoxically, though, we have done little to make it easier for others to attempt replicate our studies. Our inaction has, according to a 2016 study in Nature, led to a “crisis” in scholarship.
One way to mitigate this crisis is to “bake” reproducibility into our analyses. R Markdown does exactly this. R Markdown – the text editor I am using to create this file – allows you to both write your papers and conduct your analysis all in one place. You can choose to include or not include your analysis code in your compiled document. For the final draft of a journal publication, for example, you might not include your code. For your personal website or github page, though, you may want to include a version of the article with the code as well as the data you used in the article. You can also share the R Markdown plain text from which you compiled your article. When you share an R Markdown file along with your data, other scholars can immediately attempt to reproduce your research!
R Markdown has other advantages. It allows you to use LaTeX when you need it but use infinitely easier Markdown for the majority of your writing. I noted in our R / LaTeX lab that I use LaTeX’s slide typesetting class, Beamer, to produce all my slides. This is true… but only part of the story. While it’s relatively easy to write articles in LaTeX, the amount of code required to set up slides in beamer gets tiring. R Markdown allows you to write your slides in the simpler markdown format, but then pass these slides through a beamer template. I have one LaTeX beamer template for my academic presentations – a template I hacked from Professor McGrath–which is, in turn, a template Professor McGrath hacked from Professor Miller.
R Markdown compiles to multiple formats – html, MS Word, and PDF. R Markdown allows you to quickly create updated reports. Imagine, for example, you needed to brief the president every day on country and state-level changes in COVID-19 infections and fatalities. Rather than loading the updated data set into your stats program, rerunning your analysis, and outputting new tables and graphs into your slideware or word processing program, with R Markdown you simply need to run one command, “knit,” and you will have an updated report with the latest COVID-19 data. As long as you have configured R to pull the latest data from a dynamic website – for example, Johns Hopkins Corona Virus Resource Center – your reports will always be up to date.
There are start up costs to R Markdown. R and R Markdown do take time to learn. If you are studying for comprehensive or field exams, trying to make it through your first year in the PhD program, or simultaneously working, going to grad school, and homeschooling kids, you likely have more pressing demands that deserve your attention. If you enjoy coding or think you might enjoy learning how to code – and want to substitute movie entertainment with coding fun – then you may find the dive into R and R Markdown to be rewarding and, down the road, a potential time saver.
There is an overwhelming amount of resources for opensource applications like R and R Markdown. A good place to start is a recent entertaining talk Garrett Grolemund,senior data scientist at R Studio, delivered.
Grolemund is also co-author of the excellent and free R Markdown: The Definitive Guide.
Enough about R Markdown… on to the analysis!
We did not have time to cover the Post’s Police Shootings data. I present the analysis of this data here. The Post’s data is available at:
The codebook can be found in this readme.md file: https://github.com/washingtonpost/data-police-shootings/blob/master/README.md
The overall GitHub repository is here: https://github.com/washingtonpost/data-police-shootings
I load the data into my R session using the following:
fatal <- read.csv("https://github.com/washingtonpost/data-police-shootings/releases/download/v0.1/fatal-police-shootings-data.csv", na.strings=c("","NA"))
A note about the above function – I am reading the csv (comma-separated values) file into R, assigning it to the object, fatal, and telling R that blank cells as well as any cells that have NA values should be treated as NA. NA is what we use to indicate missing data.
I will be using the tidyverse package for analysis. You can find more on tidyverse here: https://www.tidyverse.org/. If you have not installed tidyverse, you can by running the following command: install.packages(“tidyverse”). I have tidyverse already installed. I only need to load it through the command:
library(tidyverse)
library(ggplot2)
Let’s take a look at our data frame. In the Global Environment you will see that this is a large data frame – (5680 observations on 17 variables). We don’t want to look at all of these. The head function will allow us to look at the first few observations.
head(fatal)
## id name date manner_of_death armed age gender race
## 1 3 Tim Elliot 2015-01-02 shot gun 53 M A
## 2 4 Lewis Lee Lembke 2015-01-02 shot gun 47 M W
## 3 5 John Paul Quintero 2015-01-03 shot and Tasered unarmed 23 M H
## 4 8 Matthew Hoffman 2015-01-04 shot toy weapon 32 M W
## 5 9 Michael Rodriguez 2015-01-04 shot nail gun 39 M H
## 6 11 Kenneth Joe Brown 2015-01-04 shot gun 18 M W
## city state signs_of_mental_illness threat_level flee
## 1 Shelton WA True attack Not fleeing
## 2 Aloha OR False attack Not fleeing
## 3 Wichita KS False other Not fleeing
## 4 San Francisco CA True attack Not fleeing
## 5 Evans CO False attack Not fleeing
## 6 Guthrie OK False attack Not fleeing
## body_camera longitude latitude is_geocoding_exact
## 1 False -123.122 47.247 True
## 2 False -122.892 45.487 True
## 3 False -97.281 37.695 True
## 4 False -122.422 37.763 True
## 5 False -104.692 40.384 True
## 6 False -97.423 35.877 True
Now that we have a sense of the data frame, we can begin exploring the data. Let’s begin by looking at frequencies for the age of fatal victims:
table(fatal$age)
##
## 6 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 2 1 1 3 13 31 51 98 86 97 107 122 134 162 189 160 182 163 177 167
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 195 176 177 175 163 164 156 143 136 117 125 102 96 85 116 98 98 94 90 88
## 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
## 81 65 70 61 61 63 53 52 62 35 38 39 31 21 26 16 20 16 17 15
## 71 72 73 74 75 76 77 78 79 80 81 82 83 84 86 88 89 91
## 10 6 6 5 4 11 5 1 1 2 3 2 3 4 2 1 1 1
table(fatal$age, useNA = "always")
##
## 6 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 2 1 1 3 13 31 51 98 86 97 107 122 134 162 189 160
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
## 182 163 177 167 195 176 177 175 163 164 156 143 136 117 125 102
## 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
## 96 85 116 98 98 94 90 88 81 65 70 61 61 63 53 52
## 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
## 62 35 38 39 31 21 26 16 20 16 17 15 10 6 6 5
## 75 76 77 78 79 80 81 82 83 84 86 88 89 91 <NA>
## 4 11 5 1 1 2 3 2 3 4 2 1 1 1 259
Look at the two commands above. What is different about the code of these commands? And what is different between the results for the two commands above?
NA means “Not Available” – this indicates missing data. The second command, table(fatal$age, useNA = “always”), tells R to show us the missing data. Here we have 254 cases with no information on the age variable.
Now let’s see what the average victim age is..
mean(fatal$age)
## [1] NA
This command returns NA b.c. there is missing data in the age variable. R will not calculate averages using mean function if NAs are included. Let’s remove the NAs using the na.rm = TRUE option and try again.
mean(fatal$age, na.rm = TRUE)
## [1] 37.1422
Another way to get descriptive statistics is by using the summary function
summary(fatal$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6.00 27.00 35.00 37.14 46.00 91.00 259
Note in the output that the summary function R automatically handles the NAs
What if we were interested in the mean victim age of not the entire population, but just women? Just men?
mean(fatal$age[fatal$gender=="F"], na.rm = TRUE) # women
## [1] 37.36992
mean(fatal$age[fatal$gender=="M"], na.rm = TRUE) # men
## [1] 37.13454
What about means for men by race? First we need to know how the race variable is coded. The table function will give us a quick sense:
table(fatal$race, useNA = "always")
##
## A B H N O W <NA>
## 94 1344 946 80 47 2593 605
I’m not entirely sure about the coding here. Whenever you are not sure, turn to the codebook! Thankfully, the Post explains the codes in the readme.md file in the repository. The readme.md file explains:
race
:
W
: White, non-Hispanic
B
: Black, non-Hispanic
A
: Asian
N
: Native American
H
: Hispanic
O
: Other
None
: unknown
Know that we know how the race variable is coded, we can analyze:
mean(fatal$age[fatal$gender=="M" & fatal$race=="B"], na.rm = TRUE)
## [1] 32.51415
mean(fatal$age[fatal$gender=="M" & fatal$race=="W"], na.rm = TRUE)
## [1] 39.95759
mean(fatal$age[fatal$gender=="M" & fatal$race=="H"], na.rm = TRUE)
## [1] 33.61926
mean(fatal$age[fatal$gender=="M" & fatal$race=="A"], na.rm = TRUE)
## [1] 36.48276
mean(fatal$age[fatal$gender=="M" & fatal$race=="N"], na.rm = TRUE)
## [1] 31.86486
There is some interesting variation here…. maybe a puzzle worth exploring.
Note: I am not the first person to play with this data. Code for plots here unabashedly scraped and modified from code available on the web, for example, this lab assignment.
Let’s look at fatalities by race
ggplot(data = fatal, aes(y = race)) +
geom_bar(aes(fill = ..count..)) +
theme_minimal(base_size = 13) +
theme(legend.position = "none") +
scale_x_continuous(expand=c(0,0)) +
scale_fill_gradient(low = "yellow", high = "red") +
labs(y = NULL, x = "Number of Deaths")
Deaths by age:
ggplot(data = fatal, aes(x = age)) +
geom_histogram(aes(fill = ..count..), bins = 20) +
theme_minimal(base_size = 13) +
theme(legend.position = "none") +
scale_x_continuous(expand=c(0,0)) +
scale_fill_gradient(low = "yellow", high = "red") +
labs(x = "Age at Death", y = "Number of Deaths")
## Warning: Removed 259 rows containing non-finite values (stat_bin).
Deaths by gender:
ggplot(data = fatal, aes(y = gender)) +
geom_bar(aes(fill = ..count..)) +
theme_minimal(base_size = 13) +
theme(legend.position = "none") +
scale_x_continuous(expand=c(0,0)) +
scale_fill_gradient(low = "yellow", high = "red") +
labs(y = NULL, x = "Number of Deaths")
Fatal shootings of suspects with / without mental illness:
ggplot(data = fatal, aes(y = signs_of_mental_illness)) +
geom_bar(aes(fill = ..count..)) +
theme_minimal(base_size = 13) +
theme(legend.position = "none") +
scale_x_continuous(expand=c(0,0)) +
scale_fill_gradient(low = "yellow", high = "red") +
labs(y = NULL, x = "Number of Deaths")
ggplot(data = fatal, aes(y = body_camera)) +
geom_bar(aes(fill = ..count..)) +
theme_minimal(base_size = 13) +
theme(legend.position = "none") +
scale_x_continuous(expand=c(0,0)) +
scale_fill_gradient(low = "yellow", high = "red") +
labs(y = NULL, x = "Number of Deaths")
stateinfo <- fatal %>% group_by(state) %>% summarise(n = n()) %>% ## creating new dataframe that summarizes fatal shootings by state
arrange(desc(n)) %>% top_n(15) %>%
mutate(state = factor(state, levels = rev(unique(state))))
stateinfo # take a look at dataframe
## # A tibble: 15 x 2
## state n
## <fct> <int>
## 1 CA 852
## 2 TX 506
## 3 FL 379
## 4 AZ 262
## 5 CO 209
## 6 GA 197
## 7 OK 171
## 8 NC 168
## 9 WA 163
## 10 OH 159
## 11 TN 149
## 12 MO 146
## 13 LA 116
## 14 PA 113
## 15 NM 112
# visualize
ggplot(stateinfo, aes(x = n, y = state)) +
geom_bar(stat="identity", aes(fill = n)) +
#geom_stateface(aes(y=state, x=7, label=as.character(state)), colour="white", size=8) +
geom_text(aes(x = 17, y = state, label=as.character(state)), color="white", size=4) +
labs(y = NULL, x = "Number of Deaths") +
scale_fill_gradient(low = "green", high = "red") +
theme_minimal(base_size = 13) +
theme(axis.text.y=element_blank()) +
theme(legend.position = "none") +
scale_x_continuous(expand=c(0,0))
ggplot(data = fatal, aes(y = manner_of_death)) +
geom_bar(aes(fill = ..count..)) +
theme_minimal(base_size = 13) +
theme(legend.position = "none") +
scale_x_continuous(expand=c(0,0)) +
scale_fill_gradient(low = "yellow", high = "red") +
labs(y = NULL, x = "Number of Deaths")
R is an incredibly powerful and flexible platform. It is also a platform that, when combined with R Markdown, greatly facilitates transparency and reproducibility.
R, admittedly, can be overwhelming. The R universe is vast and one can be sucked into black holes. Many find the adventure, though at times disorienting, worth it. I certainly have.
Live long, and prosper.