Case Studies (3&4)

Case Study 3: Combining Heights

Background

The Scientific American argues that humans have been getting taller over the years. As data scientists in training, we would like to use data to validate this hypothesis. This case study will use many different datasets, each with male heights from different years and countries. Our challenge is to combine the datasets and visualize heights across the centuries.

This project is not as messy as the two quotes below suggest, but it will give you a taste of pulling various data and file formats together into “tidy” data for visualization and analysis.

  • “Classroom data are like teddy bears and real data are like a grizzly bear with salmon blood dripping out its mouth.” - Jenny Bryan
  • “Up to 80% of data analysis is spent on the process of cleaning and preparing data” - Hadley Wickham

Being Readings

The being reading for this case study is:

Note that reading academic papers is a whole different skill than reading blog posts or news articles. I would recommend reading Section 1 and Section 2 closely to get the main ideas of the paper, then skimming the rest to understand what content the paper contains.

Come to class with two or three things to share. These could be a favorite quote, a question you had while reading, a thought or idea inspired by the reading, etc.

Resources

You might find these resources helpful as you work with the data sets:

Feel free to make use of Google searches, Stack Overflow, etc., as you wrangle and visualize the data.

Tasks

  1. Use the correct functions from library(haven), library(readr), library(foreign), and library(readxl) (NOT from library(rio)) to load the five data sets listed below. All data sources should be read directly from the web sources given below. Make sure your report shows the code you used to import each data set.

  2. Combine the 5 datasets into one tidy dataset.

    • We want to examine how heights have changed over time. Three of the data sets consist of just men, while two have both men and women. Filter to focus on just the men.
    • Wrangle each dataset so that it contains the following columns: birth_year, height.in, height.cm, and study. You will have to do some conversions between inches and centimeters. You need to create the “study” column yourself to identify which dataset the rows came from.
    • Appropriately deal with any outliers in your data. It’s impossible for people to have a negative height, or be inhumanly tall.
    • You can use the bind_rows() function to combine your five individual datasets into one dataset.
  3. Write a short paragraph summarizing the data wrangling process you had to go through to create your tidy dataset. Include in that discussion any decisions you had to make about what data to exclude.

  4. Make two graphs to examine the question of height distribution across centuries.

    • One graph should show individual heights (not summaries) and be faceted by study.
    • The other graph is a graph of your choice that tries to answer the question. You could try creating a “decade” column and showing height summaries by decade.
  5. Write a short paragraph summarizing how/if your charts answer the research question.

  6. Create an R Markdown report that has the code, charts, and descriptions mentioned above. Make sure your report is well organized and clearly labeled.
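
The wrangling steps in Task 2 and the “decade” idea in Task 4 can be sketched on toy data. This is only a minimal illustration, not the expected solution: the two small tables, their column names (year_born, height_cm, sex, birth_yr, height_in), the study labels, and the 48 to 84 inch plausibility range are all assumptions made up for the example. The real data sets will have different columns and may call for different cutoffs.

```r
library(dplyr)

# Toy stand-ins for two studies. All names and values here are invented
# for illustration; they do not come from the real data sets.
study_a <- tibble(
  year_born = c(1850, 1862, 1875),
  height_cm = c(168.0, 171.5, -4.0),        # -4 is a deliberately bad record
  sex       = c("male", "female", "male")
)
study_b <- tibble(
  birth_yr  = c(1950, 1961, 1973),
  height_in = c(70, 68, 120)                # 120 in (10 ft) is a deliberate outlier
)

cm_per_inch <- 2.54                         # 1 inch is exactly 2.54 cm

# Study A records centimeters and both sexes: keep men, derive inches.
tidy_a <- study_a %>%
  filter(sex == "male") %>%
  transmute(
    birth_year = year_born,
    height.in  = height_cm / cm_per_inch,
    height.cm  = height_cm,
    study      = "study_a"
  )

# Study B records inches and only men: derive centimeters.
tidy_b <- study_b %>%
  transmute(
    birth_year = birth_yr,
    height.in  = height_in,
    height.cm  = height_in * cm_per_inch,
    study      = "study_b"
  )

# Stack the studies, then drop implausible heights.
# The 48-84 inch range is an assumed cutoff, not a rule from the assignment.
heights <- bind_rows(tidy_a, tidy_b) %>%
  filter(between(height.in, 48, 84))

# One possible summary: mean height per decade of birth.
by_decade <- heights %>%
  mutate(decade = floor(birth_year / 10) * 10) %>%
  group_by(decade) %>%
  summarise(mean_height.in = mean(height.in), n = n(), .groups = "drop")
```

Giving every study the same four columns before calling bind_rows() keeps the combined table tidy, and filtering outliers after binding means the cutoff is applied the same way to every study. Whatever range you choose, mention it in your Task 3 paragraph.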

Submission

  • Knit your .Rmd and push all knitted files for this case study to your class GitHub repository.
  • Head to I-Learn to complete the rest of this assignment. You’ll submit a link to your .md file, as well as write a short reflection.

Case Study 4: Take me out to the ball game

Background

Over the campfire, you and a friend get into a debate about which college in Utah has had the best MLB success. As an avid BYU fan, you want to prove your point and decide to use data to settle the debate.

You need a chart that summarizes the success of BYU college players compared to other Utah college players who have played in the major leagues. It would also be helpful to have a chart showing the success of individual players that you can reference. For both of these charts, you decide to use player salary as a stand-in for “success”.

The library(Lahman) package has a comprehensive set of baseball data. It is great for testing out your relational data skills. You will also need a function to adjust player salaries for inflation, so you’ll use the library(priceR) package.

Being Readings

The being readings for this case study are:

Read the article(s) and come to class with two or three things to share. These could be a favorite quote, a question you had while reading, a thought or idea inspired by the reading, etc.

Resources

Please make use of any of the resources we’ve discussed in this unit.
Feel free to make use of Google searches, Stack Overflow, etc., as you wrangle and visualize the data.

Tasks

  1. Install library(Lahman). Use the data sets provided by this package and your wrangling skills to create a new data set with the following properties. Include a preview of the data in your final report.

    • Only include players that attended at least one college in Utah
    • Have a column for the player’s full name
    • Have a column for the full name of the Utah college they attended most recently
    • One row for each year the player earned a salary playing professional baseball.
    • Have a column for salary, and columns for the associated year and league.
  2. Install library(priceR) and use the adjust_for_inflation(price = your_earnings_vector, from_date = your_earnings_year_vector, country = "US", to_date = 2020) function to get all salaries in 2020 dollars.

  3. Make a chart summarizing the success (the salaries) of players from BYU and comparing it to the success of players from other Utah schools.

  4. Make another chart that shows individual salaries of the players. The chart should draw attention to any outliers that help prove your point, using labels with full player names and/or full college names.

  5. Write a paragraph to summarize your findings and explain what conclusions you can draw from your visualizations.

  6. Create an R Markdown report that has the code, charts, and descriptions mentioned above.
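
The relational wrangling in Task 1 can be sketched with small made-up tables. This is a rough illustration under stated assumptions rather than the intended solution: the four toy tables stand in for Lahman's People, CollegePlaying, Schools, and Salaries tables, and the column names used (playerID, nameFirst, nameLast, schoolID, yearID, name_full, state, lgID, salary) are assumed to match the package. Check the package help pages before relying on them. The players and colleges are fictional.

```r
library(dplyr)

# Fictional toy tables; column names are ASSUMED to mirror Lahman's tables.
people <- tibble(
  playerID  = c("p1", "p2"),
  nameFirst = c("Ann", "Bob"),
  nameLast  = c("Lee", "Ray")
)
college_playing <- tibble(
  playerID = c("p1", "p1", "p2"),
  schoolID = c("sA", "sB", "sC"),
  yearID   = c(1990, 1992, 1995)
)
schools <- tibble(
  schoolID  = c("sA", "sB", "sC"),
  name_full = c("College A", "College B", "College C"),
  state     = c("UT", "UT", "CA")
)
salaries <- tibble(
  playerID = c("p1", "p1", "p2"),
  yearID   = c(1995, 1996, 1999),
  lgID     = c("NL", "NL", "AL"),
  salary   = c(100000, 150000, 200000)
)

# For each player, keep the Utah college attended most recently.
utah_colleges <- college_playing %>%
  inner_join(schools, by = "schoolID") %>%
  filter(state == "UT") %>%
  group_by(playerID) %>%
  slice_max(yearID, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(playerID, college = name_full)

# Attach full names, then one row per salaried year.
utah_salaries <- utah_colleges %>%
  inner_join(people, by = "playerID") %>%
  mutate(full_name = paste(nameFirst, nameLast)) %>%
  inner_join(salaries, by = "playerID") %>%
  select(full_name, college, yearID, lgID, salary)
```

Dropping the college yearID before joining to the salary table avoids a clash with the salary yearID, which is the year that should feed the from_date argument of adjust_for_inflation() in Task 2. Using inner joins throughout means players with no recorded salary drop out, which matches the “one row for each year the player earned a salary” requirement.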

Submission

  • Knit your .Rmd and push all knitted files for this case study to your class GitHub repository.
  • Head to I-Learn to complete the rest of this assignment. You’ll submit a link to your .md file, as well as write a short reflection.