Case Studies (3&4)
Case Study 3: Combining Heights
Background
The Scientific American argues that humans have been getting taller over the years. As data scientists in training, we would like to use data to validate this hypothesis. This case study will use many different datasets, each with male heights from different years and countries. Our challenge is to combine the datasets and visualize heights across the centuries.
This project is not as severe as the two quotes below, but it will give you a taste of pulling various data and file formats together into “tidy” data for visualization and analysis.
- “Classroom data are like teddy bears and real data are like a grizzly bear with salmon blood dripping out its mouth.” - Jenny Bryan
- “Up to 80% of data analysis is spent on the process of cleaning and preparing data” - Hadley Wickham
Being Readings
The being reading for this case study is:
Note that reading academic papers is a whole different skill than reading blog posts or news articles. I would recommend reading Section 1 and Section 2 closely to get the main ideas of the paper, then skimming the rest to understand what content the paper contains.
Come to class with two or three things to share. These could be a favorite quote, a question you had while reading, a thought or idea inspired by the reading, etc.
Resources
You might find these resources helpful as you work with the data sets:
- R4DS: Chapter 12 - Tidy Data
- foreign R package and read.dbf()
- R4DS: Chapter 18 - Pipes
- R4DS: Chapter 20 - Vectors
Feel free to make use of Google searches, stack overflow, etc., as you wrangle and visualize the data.
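As a rough sketch of how read.dbf() from the foreign package might be used when a DBF file arrives inside a zip archive, the helper below downloads to a temporary file, unzips into a temporary folder, and reads the first .dbf file it finds. The helper name read_zipped_dbf and its url argument are hypothetical, and base R's download.file() is used here as an assumption; other download functions should work in much the same way.

```r
library(foreign)  # provides read.dbf()

# Hypothetical helper (not part of any package): fetch a zipped DBF file
# from `url`, unzip it, and return its contents as a data frame.
read_zipped_dbf <- function(url) {
  zip_path <- tempfile(fileext = ".zip")  # temporary file for the download
  out_dir  <- tempfile()                  # temporary folder for the unzipped files

  # mode = "wb" keeps the binary zip file intact on Windows
  download.file(url, destfile = zip_path, mode = "wb")
  unzip(zip_path, exdir = out_dir)        # exdir is created if it does not exist

  # Locate the first .dbf file among the extracted files
  dbf_file <- list.files(out_dir, pattern = "\\.dbf$", ignore.case = TRUE,
                         full.names = TRUE, recursive = TRUE)[1]

  # as.is = TRUE keeps text columns as character instead of factor
  read.dbf(dbf_file, as.is = TRUE)
}

# Usage (dbf_url stands for whatever web address the data source gives):
# heights_dbf <- read_zipped_dbf(dbf_url)
```

Calling names() or str() on the result is a quick way to inspect the columns before deciding which ones to keep.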
Tasks
- Use the correct functions from library(haven), library(readr), library(foreign), and library(readxl) (NOT from library(rio)) to load the five data sets listed below. All data sources should be read directly from the web sources given below. Make sure your report shows the code you used to import each data set.
  - German male soldiers in Bavaria, 19th century: Stata (.dta) format
  - Heights of Bavarian male soldiers, 19th century: Stata (.dta) format
  - Heights of south-east and south-west German soldiers born in the 18th century: DBF format
    - This file is zipped. You can download it with download() and tempfile(). Then use unzip() and read.dbf() to load the data into R.
    - Can you tell which column is the birth year? Google translate may be helpful.
  - Bureau of Labor Statistics Height Data: csv format
    - Note: There is no birth year, so just assume mid-20th century and use 1950 as birth year
  - University of Wisconsin National Survey Data: SPSS (.sav) format
    - You’ll want to look at the codebook to understand this dataset and know which columns to use: National Survey Codebook
    - Read the note at the top of the codebook and notice that -1 and -2 values represent missing data.
    - You will have to add 1900 to the variable representing birth year.
- Combine the 5 datasets into one tidy dataset.
  - We want to examine how heights have changed over time. Three of the data sets consist of just men, while two have both men and women. Filter to keep only men.
  - Wrangle each dataset so that it contains the following columns: birth_year, height.in, height.cm, and study. You will have to do some conversions between inches and centimeters. You need to create the “study” column yourself to identify which dataset the rows came from.
  - Appropriately deal with any outliers in your data. It’s impossible for people to have a negative height, or be inhumanly tall.
  - You can use the bind_rows() function to combine your five individual datasets into one dataset.
Write a short paragraph summarizing the data wrangling process you had to go through to create your tidy dataset. Include in that discussion any decisions you had to make about what data to exclude.
- Make two graphs to examine the question of height distribution across centuries.
  - One graph should show individual heights (not summaries) and be faceted by study.
  - The other graph is a graph of your choice that tries to answer the question. You could try creating a “decade” column and showing height summaries by decade.
Write a short paragraph summarizing how/if your charts answer the research question.
Create an R Markdown report that has the code, charts, and descriptions mentioned above. Make sure your report is well organized and clearly labeled.
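The wrangling steps in the tasks above (standardizing columns, converting units, recoding missing values, filtering outliers, combining, and adding a decade column) can be sketched on toy data. Everything in the two small data frames below is invented for illustration: the column names, the sex coding (1 = male), and the 48 to 84 inch plausibility range are assumptions, not values taken from the real data sets or the codebook.

```r
library(dplyr)
library(ggplot2)

# Toy stand-ins for two imported data sets (all values invented)
german_raw <- data.frame(birth = c(1850, 1852, 1855),
                         height_cm = c(165.2, 170.1, 1650))  # 1650 is a typo-style outlier
survey_raw <- data.frame(sex = c(1, 2, 1, 1),          # assumed coding: 1 = male, 2 = female
                         birth_yy = c(45, 50, -1, 62),  # two-digit year; negatives mean missing
                         height_in = c(70, 64, 68, -2))

# Bring each data set to the same four columns
german <- german_raw %>%
  transmute(birth_year = birth,
            height.in  = height_cm / 2.54,  # 1 inch = 2.54 cm
            height.cm  = height_cm,
            study      = "german_19th")

survey <- survey_raw %>%
  filter(sex == 1) %>%                                   # keep men only
  mutate(across(c(birth_yy, height_in),
                ~ if_else(.x < 0, NA_real_, .x))) %>%    # -1 / -2 become NA
  transmute(birth_year = birth_yy + 1900,
            height.in  = height_in,
            height.cm  = height_in * 2.54,
            study      = "survey_20th")

# Combine, drop missing and implausible heights, and add a decade column
heights <- bind_rows(german, survey) %>%
  filter(!is.na(birth_year), !is.na(height.in)) %>%
  filter(between(height.in, 48, 84)) %>%                 # assumed plausible range
  mutate(decade = floor(birth_year / 10) * 10)

# Height summaries by study and decade
decade_summary <- heights %>%
  group_by(study, decade) %>%
  summarise(mean_height.in = mean(height.in), n = n(), .groups = "drop")

# Individual heights, one panel per study
p_individual <- ggplot(heights, aes(x = birth_year, y = height.in)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~ study, scales = "free_x")
```

Where to draw the outlier cutoffs is a judgment call; whatever range you choose is worth stating in the written summary of the wrangling process.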
Case Study 4: Take me out to the ball game
Background
Over the campfire, you and a friend get into a debate about which college in Utah has had the best MLB success. As an avid BYU fan, you want to prove your point and decide to use data to settle the debate.
You need a chart that summarizes the success of BYU college players compared to other Utah college players that have played in the major leagues. It would also be helpful to have a chart showing the success of individual players that you can reference. For both of these charts, you decide to use player salary as a stand-in for “success”.
The library(Lahman) package has a comprehensive set of baseball data. It is great for testing out your relational data skills. You will also need a function to adjust player salaries for inflation, so you’ll use the library(priceR) package.
Being Readings
The being readings for this case study are:
Read the article(s) and come to class with two or three things to share. These could be a favorite quote, a question you had while reading, a thought or idea inspired by the reading, etc.
Resources
Please make use of any of the resources we’ve discussed in this unit.
Feel free to make use of Google searches, stack overflow, etc., as you wrangle and visualize the data.
Tasks
- Install library(Lahman). Use the data sets provided by this package and your wrangling skills to create a new data set with the following properties. Include a preview of the data in your final report.
  - Only include players that attended at least one college in Utah
  - Have a column for the player’s full name
  - Have a column for the full name of the Utah college they attended most recently
  - One row for each year the player earned a salary playing professional baseball.
  - Have a column for salary, and columns for the associated year and league.
Install library(priceR) and use the adjust_for_inflation(price = your_earnings_vector, from_date = your_earnings_year_vector, country = "US", to_date = 2020) function to get all salaries in 2020 dollars.
Make a chart summarizing the success (the salaries) of players from BYU and comparing it to the success of players from other Utah schools.
Make another chart that shows individual salaries of the players. The chart should draw attention to any outliers that help prove your point, using labels with full player names and/or full college names.
Write a paragraph to summarize your findings and explain what conclusions you can draw from your visualizations.
Create an R Markdown report that has the code, charts, and descriptions mentioned above.
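One possible way to assemble the salary data set described in the tasks above is sketched here. It assumes a recent version of Lahman in which the player table is called People and the Schools table has name_full and state columns; older versions may name these differently, so check names() on each table first. The object names utah_schools, utah_players, and utah_salaries are made up for this example.

```r
library(dplyr)
library(Lahman)

# Utah schools with their full names (assumes Schools has `state` and `name_full`)
utah_schools <- Schools %>%
  filter(state == "UT") %>%
  select(schoolID, college = name_full)

# For each player, keep the Utah college they attended most recently
utah_players <- CollegePlaying %>%
  inner_join(utah_schools, by = "schoolID") %>%
  group_by(playerID) %>%
  slice_max(yearID, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(playerID, college)

# Attach full names and salaries; each Salaries row carries its year and league
utah_salaries <- utah_players %>%
  inner_join(select(People, playerID, nameFirst, nameLast), by = "playerID") %>%
  mutate(full_name = paste(nameFirst, nameLast)) %>%
  inner_join(Salaries, by = "playerID") %>%
  select(full_name, college, yearID, lgID, salary)

# The inflation adjustment needs internet access, so it is left commented out:
# utah_salaries <- utah_salaries %>%
#   mutate(salary_2020 = priceR::adjust_for_inflation(
#     price = salary, from_date = yearID, country = "US", to_date = 2020))
```

A player who was paid by more than one team in a season may appear twice for that year, so it is worth checking for duplicate player-year rows and deciding whether to sum them before charting.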