Data Discovery

Finding data to answer your research questions is non-trivial. Except for your project, this class will shield you from this task. The large projects that data scientists work on often require years to accumulate the data needed to address their questions.

Data Digestion

The 80/20 rule¹ comes into play once you have found the right data to address the research question. Every software package and programming language that data analysts use has to come face to face with data digestion. The data sets below were found at the following three websites.

  1. University of Tubingen Height Data Hub
  2. The R for Data Science GitHub repository
  3. The University of Wisconsin National Survey Data Hub

The University of Tubingen appears to change its download links from time to time. If a link is broken, please post an issue.

Tubingen Height Data

The first file is found under the “Worldwide estimates of height by country and birth decade” heading.

Three other files from their website should also be used for this case study.

R for Data Science Data

  1. “Up to 80% of data analysis is spent on the process of cleaning and preparing data” - Hadley Wickham