Case Studies (5&6)
Case Study 5: Extra, extra, code all about it
Background
You are working for the management consulting agency A.T. Kearney, which produces the Global Cities report. They have put you on a team in charge of developing a new report, the USA Cities report, which identifies the most influential cities in the United States.
Your manager would like to explore using the frequency with which a city appears in news headlines as a contributing factor to their city rankings. You find data from two major news outlets: one in California (KCRA) and one in New York (ABC7NY). The data spans July 18, 2017 - Jan 16, 2018. You will use the headlines to find which cities are mentioned most in the news.
Specifically, you should identify the 15 cities with the highest headline count overall. You are curious if these cities have sustained headlines over time or if there was a singular event that spiked the headline count. You will also use headline counts to compare individual cities.
After completing the tasks, you might like to run this code again with updated information. Make sure you are writing reproducible code that would work with a larger, more up-to-date dataset.
Being Readings
The being reading for this case study is:
Note: The chapter covers pages 21 to 37 in the PDF. You can skip the “On Studies” and “Doing Experiments” sections.
Read the chapter and come to class with two or three things to share. These could be a favorite quote, a question you had while reading, a thought or idea inspired by the reading, etc.
Resources
Here are quick links to the doing readings and accompanying R cheat sheets that you previously encountered in this unit:
- R4DS: Strings
- stringr cheat sheet
- RVerbalExpressions package
- regexr.com
- Regular Expression examples
- Regular Expression support applet
Feel free to make use of Google searches, Stack Overflow, etc., as you wrangle and visualize the data. These readings may also help you complete the tasks below.
- Regular Expressions in R
- Populating missing values (You may find the complete() function helpful for making implicit missing values explicit.)
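As a rough illustration of what "making implicit missing values explicit" can look like, here is a minimal sketch using complete() from the tidyr package. The cities, months, and counts are made-up toy values, and the column names city, month, and n are assumptions chosen only for this example.

```r
library(dplyr)
library(tidyr)

# Toy monthly headline counts (made-up numbers, for illustration only).
# There is no row for Charlotte in 2017-08: that zero is implicit.
counts <- tibble(
  city  = c("Houston", "Houston", "Charlotte"),
  month = c("2017-08", "2017-09", "2017-09"),
  n     = c(12L, 5L, 2L)
)

# complete() adds a row for every city/month combination that is missing.
# fill = list(n = 0L) writes 0 instead of NA in the added rows.
counts_full <- counts %>%
  complete(city, month, fill = list(n = 0L))

counts_full
```

With the zeros written out, a line chart of monthly counts drops to zero in quiet months instead of silently skipping them.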
Tasks
- Load the headlines data from the two news outlets and combine them into one dataset:
  - ABC7NY: https://storybench.org/reinventingtv/abc7ny.csv
  - KCRA: https://storybench.org/reinventingtv/kcra.csv
- For each headline, identify the name of the city mentioned in the headline (if any).
  - Hint: You can get a list of US cities from the maps package. Install the maps package, then use the us.cities dataset within that package. You may have to do some wrangling to clean the names and get them into a format useful for pattern matching.
  - Hint: You may want to consider how to deal with the boroughs of New York City.
- Answer the following questions:
  - Question 1: For the 15 cities with the most mentions overall, create a graphic that summarizes their mentions. Write a paragraph in which you discuss the results. Do they make sense? Do you need to make changes? If something looks wrong, fix your code and run it again to find the new top 15.
  - Question 2: For those same 15 cities, create a graphic to show the headline count for each city for each month. Write a paragraph to discuss meaningful insights from the graph about headlines over time for certain cities and/or other features and trends you notice in the graph.
  - Question 3: Create a graphic specifically dedicated to comparing the headlines generated about Houston, TX and Charlotte, NC over time (by month). What trends do you notice?
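The city-matching step described in the hints might be sketched as follows. This is one possible approach, not the required solution. It assumes that the name column of maps::us.cities stores entries in the form "City ST" (for example "Houston TX"); a short hand-written vector stands in for that column so the example runs without the maps package. The borough list, the example headlines, and the choice to count boroughs as "New York" are all assumptions made for illustration.

```r
library(stringr)

# Stand-in for maps::us.cities$name, assumed to look like "City ST".
# With the maps package installed you could use: raw_names <- maps::us.cities$name
raw_names <- c("Houston TX", "Charlotte NC", "New York NY", "Sacramento CA")

# Drop the trailing two-letter state abbreviation to get bare city names.
city_names <- unique(str_remove(raw_names, " [A-Z]{2}$"))

# Hypothetical borough handling: count boroughs as mentions of New York.
boroughs <- c("Manhattan", "Brooklyn", "Queens", "Bronx", "Staten Island")

# Put longer names first so a long name is tried before any shorter
# name that it starts with.
all_names <- c(city_names, boroughs)
all_names <- all_names[order(nchar(all_names), decreasing = TRUE)]

# One alternation pattern with word boundaries, so a name is not
# matched inside a longer word.
pattern <- str_c("\\b(", str_c(all_names, collapse = "|"), ")\\b")

# Made-up headlines, for illustration only.
headlines <- c(
  "Houston braces for more rain",
  "Fire crews respond in Brooklyn",
  "Local bakery wins award"
)

# First matching name per headline; NA when no listed city appears.
city <- str_extract(headlines, pattern)
city <- ifelse(city %in% boroughs, "New York", city)
city
```

Matching is case-sensitive here, which helps avoid counting ordinary words that happen to be city names. Even so, names that double as common words or surnames may still inflate some counts, which is the kind of problem Question 1 asks you to look for.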
Create an R Markdown report that has the charts and descriptions mentioned above.
Case Study 6: It’s about time
Background
We have transaction data for a few businesses that have been in operation for three months. Each of these companies has come to your investment company for a loan to expand their business. Your boss has asked you to go through the transactions for each business and provide daily, weekly, and monthly gross revenue summaries and comparisons. Your boss would like a short write-up with tables and visualizations that help with the decision of which company did the best over the three-month period. You will also need to provide a short paragraph with your recommendation after building your analysis.
Note: In this course we only try to understand and visualize recorded time series data. We do not forecast. If you would like to learn more about forecasting, I would recommend Forecasting: Principles and Practice, and for a quick introduction see here.
Being Readings
The being readings for this case study are:
Read the article(s) and come to class with two or three things to share. These could be a favorite quote, a question you had while reading, a thought or idea inspired by the reading, etc.
Resources
Here are quick links to the doing readings and accompanying R cheat sheets that you previously encountered in this unit:
- R4DS: Dates and times
- https://lubridate.tidyverse.org/ (includes a lubridate cheatsheet)
- lubridate Vignette
- Time Series Visualization Gallery
Feel free to make use of Google searches, Stack Overflow, etc., as you wrangle and visualize the data.
Tasks
- Read in the data from https://byuistats.github.io/M335/data/sales.csv and format it for visualization and analysis.
  - The data has time recorded in UTC, but comes from businesses in the Mountain Time Zone. Make sure to convert!
  - This is point of sale (POS) data, so you will need to use library(lubridate) to create the correct time aggregations.
  - Check the data for any inaccuracies.
- Help your boss understand which business is the best investment through visualizations.
  - Provide visualizations that show gross revenue over time for each company (Choose if you want to aggregate at the daily, the weekly, or the monthly level).
  - Provide a visualization that gives insight into hours of operation for each company.
  - We don’t have employee numbers, but customer traffic (number of transactions) may be helpful. Provide a visualization on customer traffic for each company.
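The time zone conversion and time aggregation steps above might be sketched like this with lubridate. The toy transactions, the column names Name, Time, and Amount, and the choice of "America/Denver" as the Mountain Time zone are assumptions for illustration; check them against the actual file before reusing any of this.

```r
library(dplyr)
library(lubridate)

# Toy transactions recorded in UTC (made-up values; column names assumed).
sales <- tibble(
  Name   = c("ShopA", "ShopA", "ShopB"),
  Time   = ymd_hms(
    c("2016-05-14 02:30:00", "2016-05-14 18:00:00", "2016-05-15 01:15:00"),
    tz = "UTC"
  ),
  Amount = c(10.5, 4.0, 20.0)
)

sales_local <- sales %>%
  mutate(
    # with_tz() keeps the same instant but shows it on the Mountain Time clock.
    time_mt = with_tz(Time, tzone = "America/Denver"),
    # floor_date() rounds down to the start of the local day;
    # unit = "week" or unit = "month" gives the coarser aggregations.
    day     = floor_date(time_mt, unit = "day"),
    # Local hour of day, useful for an hours-of-operation view.
    hour    = hour(time_mt)
  )

# Daily gross revenue and transaction counts per company.
daily <- sales_local %>%
  group_by(Name, day) %>%
  summarise(revenue = sum(Amount), transactions = n(), .groups = "drop")

daily
```

Note that the first toy transaction falls on the previous calendar day once it is converted to local time. Aggregating on the UTC timestamps would therefore assign some evening sales to the wrong day.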
Write a short paragraph with your final recommendation. Which company do you think performed the best over the three months?
Create an R Markdown report that has the charts and descriptions mentioned above.