12 Combining Height Files
Readings
Scan the reading to learn what types of files these packages are used to read in.
Be sure to do the Readings and Guided Instruction for the previous task if you have not done so yet.
Guided Instruction
Complete the following if you have not yet: Data structures and importing data
The Scientific American argues that humans have been getting taller over the years. As the data scientists that we are becoming, we would like to find data that validates or refutes this concept. Our challenge is to show different male heights across the centuries.
This time, instead of looking at the mean height per country over time like we did for the previous task, we have a few files that contain heights of individuals. Each file represents a different time and/or place from which the individuals are sampled. We will combine the data from these files into one dataset to facilitate our visualization.
Work with these datasets where each row represents an individual. Import these five datasets into R.
- German male conscripts in Bavaria, 19th century: Stata format.
- Heights of Bavarian male conscripts, 19th century: Stata format.
- Heights of south-east and south-west German soldiers born in the 18th century: DBF format.
- This file is zipped. After downloading it with `download()`, try using `unzip()` and `read.dbf()` to load the data into R.
- Can you tell which column is the birth year? HINT: Google Translate may be helpful.
- Bureau of Labor Statistics Height Data: CSV format.
- Note: There is no birth year, so just assume mid-20th century and use 1950 as the birth year.
- Don’t forget to filter for just males.
- University of Wisconsin National Survey Data: SPSS (.sav) format
- You’ll want to look here to understand this dataset and know which columns to use: National Survey Codebook
- There is no gender identifier in this survey, so we will work with the data as is, knowing there is likely a mix of genders.
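A minimal sketch of reading the four file formats listed above. It assumes the `haven`, `foreign`, and `readr` packages are installed. The helper `read_zipped_dbf()` and all file paths are hypothetical, and the toy data frame is written to temporary files only so the sketch runs without downloading anything; the real files come from the links in the task.

```r
library(haven)    # read_dta() for Stata, read_sav() for SPSS
library(foreign)  # read.dbf() for DBF
library(readr)    # read_csv() for CSV

# Hypothetical helper: unzip an archive into a fresh temporary folder and
# read the first .dbf file found inside it.
read_zipped_dbf <- function(zip_path) {
  out_dir <- tempfile("dbf_")
  dir.create(out_dir)
  unzip(zip_path, exdir = out_dir)
  dbf_file <- list.files(out_dir, pattern = "\\.dbf$", ignore.case = TRUE,
                         full.names = TRUE, recursive = TRUE)[1]
  read.dbf(dbf_file)
}

# Toy data (made-up values) written out in each format, then read back in.
toy <- data.frame(birth_year = c(1850, 1851), height_cm = c(165.2, 170.1))

dta_path <- tempfile(fileext = ".dta")
sav_path <- tempfile(fileext = ".sav")
csv_path <- tempfile(fileext = ".csv")
dbf_path <- tempfile(fileext = ".dbf")

write_dta(toy, dta_path)
write_sav(toy, sav_path)
write_csv(toy, csv_path)
write.dbf(toy, dbf_path)

from_stata <- read_dta(dta_path)                         # Stata (.dta)
from_spss  <- read_sav(sav_path)                         # SPSS (.sav)
from_csv   <- read_csv(csv_path, show_col_types = FALSE) # CSV
from_dbf   <- read.dbf(dbf_path)                         # DBF
```

With the real zipped archive saved locally, a call such as `read_zipped_dbf("heights.zip")` (a placeholder file name) would unzip and load it in one step.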
Wrangle each dataset so that it contains the following columns: `birth_year`, `height.in`, `height.cm`, and `study`.
- You will potentially have to do some renaming and conversions between inches and centimeters.
- You need to create the “study” column yourself to identify which dataset the rows came from.
- For each dataset, `select(birth_year, height.in, height.cm, study)`.
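The wrangling steps could look roughly like the sketch below for a single dataset. The input columns `year_born`, `height_inches`, and `sex`, the values, and the study label are all hypothetical stand-ins; the real column names have to be looked up in each file. It assumes `dplyr` is installed.

```r
library(dplyr)

# Toy stand-in for one imported file; column names and values are made up.
raw <- data.frame(
  year_born     = c(1950, 1950, 1950),
  height_inches = c(70, 64, 68),
  sex           = c("male", "female", "male")
)

tidy <- raw %>%
  filter(sex == "male") %>%              # only applies where a sex column exists
  mutate(
    birth_year = year_born,
    height.in  = height_inches,
    height.cm  = height_inches * 2.54,   # 1 inch is exactly 2.54 cm
    study      = "toy_example"           # label identifying the source dataset
  ) %>%
  select(birth_year, height.in, height.cm, study)

tidy
```

For a file that records centimeters instead, the conversion runs the other way: `height.in = height_cm / 2.54`, where `height_cm` is again a placeholder name.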
Use the `bind_rows()` function to combine your five individual datasets into one dataset. Each dataset must have the same column names (and compatible column types) for this to work; keeping the columns in the same order is good practice.

Write a short paragraph summarizing the data wrangling process you had to go through to create your tidy dataset. Include in that discussion any decisions you had to make about what data to exclude.
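The `bind_rows()` step can be sketched on two toy datasets as follows. The data frames `study_a` and `study_b` and their values are made up for illustration; the real task stacks five wrangled datasets in the same way. It assumes `dplyr` is installed.

```r
library(dplyr)

# Two toy datasets already wrangled to the four required columns.
study_a <- data.frame(
  birth_year = c(1850, 1851),
  height.in  = c(65, 66),
  height.cm  = c(165.1, 167.64),
  study      = "study_a"
)
study_b <- data.frame(
  birth_year = 1950,
  height.in  = 70,
  height.cm  = 177.8,
  study      = "study_b"
)

# Stack the rows of both datasets; columns are matched by name.
all_heights <- bind_rows(study_a, study_b)

# Quick check: number of rows contributed by each study.
count(all_heights, study)
```

If a column is stored as text in one dataset and as a number in another, the combine step may fail, so it is worth checking column types with `str()` or `glimpse()` before stacking.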
Make a plot of the five studies containing individual heights to examine the question of height distribution across centuries.
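One possible starting point for the plot is sketched below, assuming `ggplot2` is installed. The data are made up, and a jittered scatter of height against birth year is only one of several reasonable choices; boxplots or density plots per study would also show the distributions.

```r
library(ggplot2)

# Toy combined data; values are made up for illustration only.
all_heights <- data.frame(
  birth_year = c(1750, 1755, 1850, 1855, 1950, 1950),
  height.in  = c(65, 66, 66, 67, 69, 70),
  study      = c("study_a", "study_a", "study_b", "study_b", "study_c", "study_c")
)

# Height against birth year, colored by study; horizontal jitter spreads out
# points that share a birth year (such as an assumed 1950).
p <- ggplot(all_heights, aes(x = birth_year, y = height.in, color = study)) +
  geom_jitter(width = 2, height = 0, alpha = 0.6) +
  labs(x = "Birth year", y = "Height (inches)", color = "Study")

p
```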
Write at least two paragraphs to address the following:
- How does the story told by this data compare to the story told by the data in the previous task? Do they agree or do they contradict? If they contradict, reason through the contradiction and try to make sense of it.
- How would you respond to the assertion that humans are getting taller over time based on the datasets in these two tasks involving height?
- Be sure to provide an overall conclusion about where you stand on the question.
Render the `.qmd` file. Push all the files created in the rendering process into your GitHub repository.
Submit
In I-learn, submit a link to the `.md` file on GitHub.