Case Studies (1&2)
Case Study 1: Give your Visualization Wings to Fly
Background
You just started your internship at a big firm in New York, and your manager gave you an extensive file of flights that departed from JFK, LGA, or EWR in 2013. From this data (which you can obtain in R) your manager wants you to answer several questions.
Being Readings
The being reading for this case study is
Read the entire article, and come to class with two or three things to share. These could be a favorite quote, a question you had while reading, a thought or idea inspired by the reading, etc.
Resources
Here are quick links to the doing readings and accompanying R cheatsheets that you previously encountered in this unit.
- R4DS Chapter 3: Data Visualization
- R4DS Chapter 5: Data Transformation
- R4DS Chapter 28: Graphics for Communication
- Data visualization with ggplot2 cheatsheet
- Data transformation with dplyr cheat sheet
Feel free to make use of Google searches, Stack Overflow, etc., as you wrangle and visualize the data.
Tasks
- Get the data using this R code:
# Uncomment and run the line below once if you haven't installed this package yet.
# install.packages("nycflights13")
library(nycflights13)
# Run the line below to learn more about the flights data set.
# ?flights
- Address each of the following questions from your manager. Each answer should contain at least one chart and a description where you state the answer.
- If I am leaving before noon, which two airlines do you recommend at each airport (JFK, LGA, EWR) that will have the lowest arrival delay time at the 75th percentile?
- Why the 75th percentile? The minimum and maximum arrival delay times could be skewed by a single flight. The median arrival delay time would help understand what happens with about 50% of flights. By analyzing the 75th percentile, we’re making comparisons that take into account 75% of the flights.
- Which origin airport is best to minimize my probability of a late arrival when I am using Delta Air Lines?
- This will require you to categorize each flight as late or not late.
- Which destination airport is the worst airport for arrival time?
- You decide on the metric for “worst.”
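The first two questions above can be sketched on a toy data set before moving to the real one. This is a minimal illustration, not a solution: `toy_flights` and all of its values are made up, "before noon" is assumed to mean `dep_time < 1200` (you might prefer the scheduled departure time), and "late" is assumed to mean `arr_delay > 0`. The column names mirror those in `flights`.

```r
library(dplyr)

# Toy stand-in for nycflights13::flights; every value below is made up.
toy_flights <- tibble(
  origin    = c("JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "LGA", "LGA"),
  carrier   = c("AA",  "AA",  "DL",  "DL",  "AA",  "AA",  "DL",  "DL"),
  dep_time  = c(700,   1130,  815,   1400,  600,   945,   1015,  1159),
  arr_delay = c(-5,    20,    10,    300,   0,     NA,    -12,   8)
)

# 75th percentile of arrival delay for departures before noon,
# ignoring flights with no recorded arrival delay.
p75_by_carrier <- toy_flights %>%
  filter(dep_time < 1200, !is.na(arr_delay)) %>%
  group_by(origin, carrier) %>%
  summarise(
    p75_arr_delay = quantile(arr_delay, 0.75, names = FALSE),
    n_flights = n(),
    .groups = "drop"
  ) %>%
  arrange(origin, p75_arr_delay)

# Share of late Delta flights by origin.
# Assumption: a flight is "late" when arr_delay > 0.
delta_late <- toy_flights %>%
  filter(carrier == "DL", !is.na(arr_delay)) %>%
  mutate(late = arr_delay > 0) %>%
  group_by(origin) %>%
  summarise(prop_late = mean(late), n_flights = n())

print(p75_by_carrier)
print(delta_late)
```

Swapping `toy_flights` for `flights` should give a starting table for each question; the choice of threshold for "late" is yours to justify in the write-up.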
Adapt one of your visualizations above so that it shows the complexity of the data (i.e. individual flights, and not only broad summaries). Share your new chart, and then discuss the pros and/or cons of including the individual flights in your visualization.
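One possible way to show individual flights alongside a summary is to layer jittered points under a boxplot. The sketch below uses simulated delays (`toy_flights` is hypothetical, not the real data), and the layer choices are only one option among many.

```r
library(ggplot2)

# Simulated arrival delays for three carriers; values are made up.
set.seed(1)
toy_flights <- data.frame(
  carrier   = rep(c("AA", "DL", "UA"), each = 50),
  arr_delay = c(rnorm(50, 5, 20), rnorm(50, 0, 15), rnorm(50, 10, 30))
)

flights_plot <- ggplot(toy_flights, aes(x = carrier, y = arr_delay)) +
  # One semi-transparent point per flight shows the spread of the raw data.
  geom_jitter(width = 0.2, height = 0, alpha = 0.3) +
  # The boxplot on top keeps the summary visible; outliers are hidden here
  # because the jittered points already display them.
  geom_boxplot(outlier.shape = NA, fill = NA, colour = "firebrick") +
  labs(
    x = "Carrier",
    y = "Arrival delay (minutes)",
    title = "Individual flights with a boxplot summary (simulated data)"
  )

print(flights_plot)
```

With several hundred thousand real flights, overplotting becomes a concern, which is one of the trade-offs worth discussing.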
- Create an R Markdown report that has the graphs and discussions mentioned above.
- Write an introduction section that describes your results.
- Have a section for each question.
- Make sure your code is in the report but defaults to hidden.
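One way to keep code in an HTML report but hidden by default is the `code_folding: hide` option of `html_document`. The sketch below writes a tiny R Markdown file and knits it; the file name, title, and chunk contents are placeholders, and your course may expect a different output format.

```r
library(rmarkdown)

# Build the chunk delimiter programmatically so this example does not
# contain nested code fences.
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",
  "title: \"Case study sketch\"",
  "output:",
  "  html_document:",
  "    code_folding: hide",
  "---",
  "",
  "## Introduction",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
)

# Hypothetical file name, written to a temporary directory.
rmd_path <- file.path(tempdir(), "case_study_sketch.Rmd")
writeLines(rmd_lines, rmd_path)

# Rendering needs pandoc, so only knit when it is available.
if (pandoc_available()) {
  html_path <- render(rmd_path, quiet = TRUE)
}
```

In the rendered page each chunk is collapsed behind a button, so readers see the charts first and can reveal the code when they want it.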
Case Study 2: Reducing Gun Deaths
Background
The world is a dangerous place. During 2015 and 2016 there was a lot of discussion in the news about police shootings. FiveThirtyEight reported on gun deaths in 2016. As leaders in data journalism, they have posted a clean version of this data in their GitHub repo called full_data.csv for us to use. Load the data with the following command (you’ll need to assign the data to a variable).
library(tidyverse) # read_csv() comes from readr, which the tidyverse loads
read_csv("https://github.com/fivethirtyeight/guns-data/blob/master/full_data.csv?raw=true")
FiveThirtyEight’s visualizations focused on yearly averages. Your task is broader in scope. You are working for a client who wants to create a marketing campaign that helps reduce gun deaths in the US. The client would like to identify several target audiences that could benefit from such a campaign, as well as identify any seasonal trends in gun deaths in these target audiences. Your challenge is to provide recommendations about target audiences, exploring the variables in this data (intent, sex, age, race, education, etc.), and then summarize and visualize seasonal trends (if any) for these audiences.
Being Readings
The being readings for this case study are:
Read both articles and come to class with two or three things to share. These could be a favorite quote, a question you had while reading, a thought or idea inspired by the reading, etc.
Resources
Here are quick links to the doing readings and accompanying R cheatsheets that you previously encountered in this unit:
- R4DS Chapter 3: Data Visualization
- R4DS Chapter 5: Data Transformation
- R4DS Chapter 28: Graphics for Communication
- Data visualization with ggplot2 cheatsheet
- Data transformation with dplyr cheat sheet
Feel free to make use of Google searches, Stack Overflow, etc., as you wrangle and visualize the data.
Tasks
- Read the FiveThirtyEight article. Create one chart that provides similar insights to the visualization in the article.
- Your chart does not need to look like the one in the article, nor does it need to be interactive.
- Explore the data set to identify multiple target audiences for an ad campaign.
- Your job is to help your client understand as much as you can about their audience. Focusing on just one aspect is insufficient. As an example, you’ll quickly discover that white is the race with the most gun deaths (not surprising, as white is the largest racial group in the U.S.). This isn’t sufficient. Narrow down the target audience a bit more. Explore what happens when you also consider intent, sex, age, race, and/or education. Don’t stop at one variable; rather, try to incorporate 2 or 3 characteristics as you pick a target audience.
- After picking one target audience, look for another target audience that will help reach a large number of those you may have missed in your previous choice. Repeat this process to select a few target audiences where we can focus our campaign efforts and make a large impact.
- Your exploratory code/charts do not need to be included in your final report. Instead, explore the data quite a bit with lots of quick charts before you decide on which target audiences to use.
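The exploration described above often starts with counts over combinations of two or three variables. The following is a minimal sketch on made-up data: `toy_guns`, its values, and the age-group cut points are assumptions chosen for illustration, and the column names are meant to mirror variables named in the background section.

```r
library(dplyr)

# Toy stand-in for the gun deaths data; every row is made up.
toy_guns <- tibble(
  intent = c("Suicide", "Suicide", "Suicide", "Homicide",
             "Homicide", "Accidental", "Suicide", "Homicide"),
  sex    = c("M", "M", "F", "M", "M", "M", "M", "F"),
  age    = c(55, 62, 40, 22, 19, 30, 71, 27)
)

audiences <- toy_guns %>%
  # Bin age so it can be combined with the categorical variables.
  # The cut points here are arbitrary and worth experimenting with.
  mutate(age_group = cut(
    age,
    breaks = c(0, 29, 59, Inf),
    labels = c("under 30", "30-59", "60+")
  )) %>%
  # Count each combination, largest groups first.
  count(intent, sex, age_group, sort = TRUE) %>%
  # Share of all deaths covered by each candidate audience.
  mutate(share = n / sum(n))

print(audiences)
```

After choosing one audience from the top of such a table, filtering it out and recounting is one way to look for the next audience among the deaths not yet covered.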
Construct a presentation-worthy visualization(s) that helps your client understand the rationale behind the target audiences you chose. Clearly state the target audiences your client should focus on.
Provide 2-4 presentation-worthy charts that help your client understand seasonal trends (or lack thereof) in your chosen audiences. These charts should demonstrate the ggplot2 skills you’ve learned thus far, including clear labels and customized themes. Each chart in your report should be accompanied by a written description of how the insights from the chart could benefit the marketing campaign.
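A seasonal chart of the kind described above can be built by counting deaths per month for each audience and plotting the counts as lines. The sketch below uses randomly generated data, so it will show no real seasonality; `toy_guns` is hypothetical, and the real file's column types may differ (for example, `month` may need converting to a number first).

```r
library(dplyr)
library(ggplot2)

# Randomly generated stand-in data; not the real gun deaths file.
set.seed(42)
toy_guns <- tibble(
  month  = sample(1:12, 600, replace = TRUE),
  intent = sample(c("Suicide", "Homicide"), 600, replace = TRUE)
)

# Deaths per month within each audience (here, simply each intent).
monthly <- toy_guns %>%
  count(intent, month, name = "deaths")

seasonal_plot <- ggplot(monthly, aes(x = month, y = deaths, colour = intent)) +
  geom_line() +
  geom_point() +
  # Show month abbreviations instead of the numbers 1 to 12.
  scale_x_continuous(breaks = 1:12, labels = month.abb) +
  labs(
    x = "Month",
    y = "Number of deaths",
    colour = "Intent",
    title = "Monthly gun deaths by intent (simulated data)"
  ) +
  theme_minimal()

print(seasonal_plot)
```

Because the real data spans several years, you may also want to compare years or average across them before drawing conclusions about seasonality.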
Create an R Markdown report that has the charts and descriptions mentioned above. Make sure your code is in the report but defaults to hidden.