Code: Report on Police Shootings


---
title: "Reproducible Research and Police Shootings Analysis"
author: "Eric McGlinchey"
date: "10/14/2020"
output:
  # pdf_document: default
  html_document: default
subtitle: Discussion, Code, Analysis, and Visualization
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Reproducible Research

A core challenge of science is the (in)ability to reproduce other scholars’ research. Empowering others to reproduce our research helps us identify and correct shortcomings in our own analysis. Incorporating reproducible research in our own analyses encourages us to be rigorous and transparent in our scholarship. As a scholarly community, we are aware of the benefits of reproducible research. Paradoxically, though, we have done little to make it easier for others to attempt to replicate our studies. Our inaction has, according to a [2016 study in Nature](https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970), led to a “crisis” in scholarship.

One way to mitigate this crisis is to “bake” reproducibility into our analyses. R Markdown does exactly this. R Markdown, the authoring format I am using to create this file, allows you to both write your papers and conduct your analysis in one place. You can choose whether or not to include your analysis code in your compiled document. For the final draft of a journal publication, for example, you might not include your code. For your personal website or GitHub page, though, you may want to include a version of the article with the code as well as the data you used in the article. You can also share the R Markdown plain text from which you compiled your article. When you share an R Markdown file along with your data, other scholars can immediately attempt to reproduce your research!

R Markdown has other advantages. It allows you to use LaTeX when you need it but use the far easier Markdown for the majority of your writing. I noted in our R / LaTeX lab that I use LaTeX’s slide typesetting class, [Beamer](https://ctan.org/pkg/beamer?lang=en), to produce all my slides. This is true… but only part of the story. While it’s relatively easy to write articles in LaTeX, the amount of code required to set up slides in Beamer gets tiring. R Markdown allows you to write your slides in the simpler Markdown format and then pass these slides through a Beamer template. I have one LaTeX Beamer template for my academic presentations, a template I hacked from [Professor McGrath](http://mcgrath.gmu.edu/), which is, in turn, a template Professor McGrath hacked from [Professor Miller](http://svmiller.com/).

R Markdown compiles to multiple formats: HTML, MS Word, and PDF. R Markdown also allows you to quickly create updated reports. Imagine, for example, that you needed to brief the president every day on country- and state-level changes in COVID-19 infections and fatalities. Rather than loading the updated data set into your stats program, rerunning your analysis, and outputting new tables and graphs into your slideware or word processing program, with R Markdown you simply need to run one command, “knit,” and you will have an updated report with the latest COVID-19 data. As long as you have configured R to pull the latest data from a dynamic website (for example, the [Johns Hopkins Coronavirus Resource Center](https://coronavirus.jhu.edu/)), your reports will always be up to date.
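The “knit” step can also be triggered from the R console or from a scheduled script. The following is only a sketch, assuming the rmarkdown package and pandoc are installed; the wrapper name `refresh_report` and the file name `daily_report.Rmd` are hypothetical and not part of this analysis.

```r
# Hypothetical helper: re-knit a report so it picks up the latest data.
# rmarkdown::render() is the function RStudio's "Knit" button calls.
refresh_report <- function(rmd_path, out_dir = dirname(rmd_path)) {
  rmarkdown::render(
    input         = rmd_path,
    output_format = "html_document",
    output_dir    = out_dir
  )
}

# Usage (not run here; "daily_report.Rmd" is a placeholder file name):
# refresh_report("daily_report.Rmd")
```

Wrapping the call in a function keeps the output format and destination in one place, so a daily job only needs to call it once.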

There are start-up costs to R Markdown. R and R Markdown do take time to learn. If you are studying for comprehensive or field exams, trying to make it through your first year in the PhD program, or simultaneously working, going to grad school, and homeschooling kids, you likely have more pressing demands that deserve your attention. If you enjoy coding or think you might enjoy learning how to code (and want to replace movie entertainment with coding fun), then you may find the dive into R and R Markdown to be rewarding and, down the road, a potential time saver.

There is an overwhelming number of resources for open-source applications like R and R Markdown. A good place to start is a recent, entertaining talk delivered by Garrett Grolemund, senior data scientist at RStudio.

<br />

<iframe width="560" height="315" src="https://www.youtube.com/embed/s9aWmU0atlQ" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

<br />
Grolemund is also co-author of the excellent and free [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/garrett-grolemund.html).

Enough about R Markdown… on to the analysis!

## Analysis of Washington Post’s Police Shootings Data — Introduction

We did not have time to cover the *Post’s* Police Shootings data. I present the analysis of this data here. The *Post’s* data is available at:

https://github.com/washingtonpost/data-police-shootings/releases/download/v0.1/fatal-police-shootings-data.csv

The codebook can be found in this readme.md file: https://github.com/washingtonpost/data-police-shootings/blob/master/README.md

The overall GitHub repository is here: https://github.com/washingtonpost/data-police-shootings

I load the data into my R session using the following:

```{r, message=FALSE}
fatal <- read.csv("https://github.com/washingtonpost/data-police-shootings/releases/download/v0.1/fatal-police-shootings-data.csv", na.strings = c("", "NA"))
```

A note about the above function — I am reading the csv (comma-separated values) file into R, assigning it to the object, **fatal**, and telling R that *blank* cells as well as any cells that have NA values should be treated as NA. NA is what we use to indicate missing data.
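To see what the `na.strings` argument changes, here is a small sketch using an inline toy CSV rather than the *Post's* file; the column values are made up for illustration.

```r
# Toy CSV text standing in for the real file (hypothetical values)
csv_text <- "name,age,race
A,34,W
B,,B
C,NA,
D,51,H"

# Default behavior: only the literal string "NA" is treated as missing,
# so the blank race cell is read as an empty string
toy_default <- read.csv(text = csv_text)
toy_default$race[3] == ""

# With na.strings, blank cells are treated as missing too
toy <- read.csv(text = csv_text, na.strings = c("", "NA"))
is.na(toy$age)
is.na(toy$race)
```

Without the extra argument, blank text cells would show up as their own empty category in tables rather than as missing data.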

I will be using the tidyverse package for analysis. You can find more on tidyverse here: https://www.tidyverse.org/. If you have not installed tidyverse, you can do so by running the following command: install.packages("tidyverse"). I have tidyverse already installed. I only need to load it through the command:

```{r, message=FALSE}
library(tidyverse)
library(ggplot2) # already attached by tidyverse; loaded explicitly for clarity
```

Let’s take a look at our data frame. In the Global Environment you will see that this is a large data frame (5680 observations on 17 variables). We don’t want to look at all of these. The head function will allow us to look at the first few observations.

```{r}
head(fatal)
```
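The dimensions can also be checked from code rather than from the Global Environment pane. A minimal sketch on a made-up data frame (the `toy` object and its values are illustrative only):

```r
# Hypothetical data frame with 10 rows and 2 columns
toy <- data.frame(
  id  = 1:10,
  age = c(34, 27, NA, 51, 45, 19, 62, 38, NA, 29)
)

dim(toy)             # number of rows, then number of columns
nrow(toy)            # rows only
first_rows <- head(toy, n = 3)  # head() shows six rows unless n is given
first_rows
```

Running `dim(fatal)` in the same way should report the observation and variable counts quoted above.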

## Data Analysis

Now that we have a sense of the data frame, we can begin exploring the data. Let’s begin by looking at frequencies for the age of fatal victims:

```{r}
table(fatal$age)
table(fatal$age, useNA = "always")
```

Look at the two commands above. What is different about the code of these commands? And what is different between the results for the two commands above?

NA means “Not Available”; this indicates missing data. The second command, table(fatal$age, useNA = "always"), tells R to show us the missing data. Here we have 254 cases with no information on the age variable.
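The difference between the two calls is easier to see on a short vector. A sketch with made-up ages:

```r
# Hypothetical ages, two of them missing
ages <- c(25, 30, 30, NA, 41, NA)

counts_default <- table(ages)                    # silently drops the NAs
counts_with_na <- table(ages, useNA = "always")  # adds an <NA> column

counts_default
counts_with_na
sum(is.na(ages))  # another way to count the missing values
```

The default table sums to the number of non-missing values only, which is why missing data is easy to overlook.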

Now let’s see what the average victim age is.
```{r}
mean(fatal$age)
```

This command returns NA because there is missing data in the age variable. R will not calculate averages using the mean function if NAs are included. Let’s remove the NAs using the na.rm = TRUE option and try again.
```{r}
mean(fatal$age, na.rm = TRUE)
```

Another way to get descriptive statistics is by using the summary function:

```{r}
summary(fatal$age)
```

Note in the output that R’s summary function automatically handles the NAs: it computes the statistics on the non-missing values and reports the number of NAs separately.
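The same behavior can be reproduced on a toy vector, which makes it easy to check the arithmetic by hand. The values below are made up:

```r
# Hypothetical ages with one missing value
ages <- c(20, 30, NA, 40)

mean(ages)                               # NA, because one value is missing
mean_without_na <- mean(ages, na.rm = TRUE)  # (20 + 30 + 40) / 3
mean_without_na

age_summary <- summary(ages)             # drops the NA and counts it separately
age_summary
```

The summary output carries an extra `NA's` entry, so the missing cases stay visible even though they are excluded from the statistics.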

What if we were interested in the mean victim age of not the entire population, but just women? Just men?

```{r}
mean(fatal$age[fatal$gender == "F"], na.rm = TRUE) # women
mean(fatal$age[fatal$gender == "M"], na.rm = TRUE) # men
```

What about means for men by race? First we need to know how the race variable is coded. The table function will give us a quick sense:

```{r}
table(fatal$race, useNA = "always")
```

I’m not entirely sure about the coding here. Whenever you are not sure, turn to the codebook! Thankfully, the *Post* explains the codes in the [readme.md](https://github.com/washingtonpost/data-police-shootings/blob/master/README.md) file in the repository. The readme.md file explains:

<p><code>race</code>:</p>
<ul>
<li><code>W</code>: White, non-Hispanic</li>
<li><code>B</code>: Black, non-Hispanic</li>
<li><code>A</code>: Asian</li>
<li><code>N</code>: Native American</li>
<li><code>H</code>: Hispanic</li>
<li><code>O</code>: Other</li>
<li><code>None</code>: unknown</li>
</ul>

Now that we know how the race variable is coded, we can analyze:

```{r}
mean(fatal$age[fatal$gender == "M" & fatal$race == "B"], na.rm = TRUE)
mean(fatal$age[fatal$gender == "M" & fatal$race == "W"], na.rm = TRUE)
mean(fatal$age[fatal$gender == "M" & fatal$race == "H"], na.rm = TRUE)
mean(fatal$age[fatal$gender == "M" & fatal$race == "A"], na.rm = TRUE)
mean(fatal$age[fatal$gender == "M" & fatal$race == "N"], na.rm = TRUE)
```
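The five near-identical lines above can be collapsed into one grouped call. This is an alternative sketch using base R's `tapply` on a made-up data frame; the `toy` object and its values are hypothetical, not drawn from the *Post's* data.

```r
# Hypothetical records: age, gender, and race codes as in the codebook
toy <- data.frame(
  age    = c(25, 35, 45, 30, NA, 40, 50),
  gender = c("M", "M", "M", "M", "M", "M", "F"),
  race   = c("B", "B", "W", "W", "H", "H", "W")
)

# Keep men only, then compute the mean age within each race group;
# na.rm = TRUE is passed through to mean()
men <- toy[toy$gender == "M", ]
means_by_race <- tapply(men$age, men$race, mean, na.rm = TRUE)
means_by_race
```

Grouping this way avoids copy-and-paste errors and automatically picks up every race code present in the data.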

There is some interesting variation here, maybe a puzzle worth exploring.

## Data Visualization

Note: I am not the first person to play with this data. The code for the plots here is unabashedly scraped and modified from code available on the web, for example, this [lab assignment](https://rstudio-pubs-static.s3.amazonaws.com/214093_b106cec9ba354de9acb175cf8693a749.html).

Let’s look at fatalities by race:

```{r}
# after_stat(count) replaces the older ..count.. notation (ggplot2 >= 3.3.0)
ggplot(data = fatal, aes(y = race)) +
  geom_bar(aes(fill = after_stat(count))) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_gradient(low = "yellow", high = "red") +
  labs(y = NULL, x = "Number of Deaths")
```

Deaths by age:

```{r}
# after_stat(count) replaces the older ..count.. notation (ggplot2 >= 3.3.0)
ggplot(data = fatal, aes(x = age)) +
  geom_histogram(aes(fill = after_stat(count)), bins = 20) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_gradient(low = "yellow", high = "red") +
  labs(x = "Age at Death", y = "Number of Deaths")
```

Deaths by gender:

```{r}
# after_stat(count) replaces the older ..count.. notation (ggplot2 >= 3.3.0)
ggplot(data = fatal, aes(y = gender)) +
  geom_bar(aes(fill = after_stat(count))) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_gradient(low = "yellow", high = "red") +
  labs(y = NULL, x = "Number of Deaths")
```

Fatal shootings of suspects with / without mental illness:

```{r}
# after_stat(count) replaces the older ..count.. notation (ggplot2 >= 3.3.0)
ggplot(data = fatal, aes(y = signs_of_mental_illness)) +
  geom_bar(aes(fill = after_stat(count))) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_gradient(low = "yellow", high = "red") +
  labs(y = NULL, x = "Number of Deaths")
```

## Was officer wearing a camera?

```{r}
# after_stat(count) replaces the older ..count.. notation (ggplot2 >= 3.3.0)
ggplot(data = fatal, aes(y = body_camera)) +
  geom_bar(aes(fill = after_stat(count))) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_gradient(low = "yellow", high = "red") +
  labs(y = NULL, x = "Number of Deaths")
```

## Fatal Shootings by State

```{r, message=FALSE}
# Create a new data frame that counts fatal shootings by state
# and keeps the 15 states with the most shootings
stateinfo <- fatal %>%
  group_by(state) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  top_n(15, n) %>%
  # reverse the level order so the largest state is drawn at the top
  mutate(state = factor(state, levels = rev(unique(state))))

stateinfo # take a look at the data frame

# visualize
ggplot(stateinfo, aes(x = n, y = state)) +
  geom_bar(stat = "identity", aes(fill = n)) +
  # geom_stateface(aes(y = state, x = 7, label = as.character(state)), colour = "white", size = 8) +
  geom_text(aes(x = 17, y = state, label = as.character(state)), color = "white", size = 4) +
  labs(y = NULL, x = "Number of Deaths") +
  scale_fill_gradient(low = "green", high = "red") +
  theme_minimal(base_size = 13) +
  theme(axis.text.y = element_blank()) +
  theme(legend.position = "none") +
  scale_x_continuous(expand = c(0, 0))
```
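The counting and level-reversal steps in the chunk above can be sketched in base R on a toy vector. The state values below are made up, and this is an illustration of the idea rather than a replacement for the tidyverse pipeline.

```r
# Hypothetical state codes, one per record
states <- c("CA", "TX", "CA", "FL", "TX", "CA", "AZ")

# Count records per state and sort from most to fewest
counts <- sort(table(states), decreasing = TRUE)
top2 <- counts[1:2]   # keep the two largest, analogous to top_n()
top2

# ggplot2 draws the first factor level at the bottom of a discrete y axis,
# so reversing the levels puts the largest state at the top of the chart
state_factor <- factor(names(top2), levels = rev(names(top2)))
levels(state_factor)
```

Without the reversal, the bar chart would list the largest state at the bottom, which is usually harder to read.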

## How Was the Person Killed?

```{r}
# after_stat(count) replaces the older ..count.. notation (ggplot2 >= 3.3.0)
ggplot(data = fatal, aes(y = manner_of_death)) +
  geom_bar(aes(fill = after_stat(count))) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_gradient(low = "yellow", high = "red") +
  labs(y = NULL, x = "Number of Deaths")
```

## Concluding Thoughts

R is an incredibly powerful and flexible platform. It is also a platform that, when combined with R Markdown, greatly facilitates transparency and reproducibility.

R, admittedly, can be overwhelming. The R universe is vast and one can be sucked into black holes. Many find the adventure, though at times disorienting, worth it. I certainly have.

Live long, and prosper.