Background
Survey data is notoriously difficult to handle. Even when the data is recorded cleanly, the mix of question types (‘write in’, ‘choose from multiple answers’, ‘pick all that are right’, and ‘multiple choice’) makes storing the data in a tidy format difficult.
In 2014, FiveThirtyEight surveyed over 1000 people for the article America’s Favorite ‘Star Wars’ Movies (And Least Favorite Characters). They have provided the data on GitHub: <https://github.com/fivethirtyeight/data/tree/master/star-wars-survey>.
A company would like to use this data to figure out whether it can predict a job candidate’s current income based on a few responses about Star Wars movies.
Data:
Download: StarWars.csv
Information: Article
Readings:
- Python for Data Science: Tidy Data
- Python for Data Science: Graphics for Communication
- Python for Data Science: Strings
Grand Questions:
Shorten the column names and clean them up for easier use with pandas.
Filter the dataset to respondents who have seen at least one film.
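One possible way to approach the renaming and filtering steps is sketched below. It runs on a small in-memory table rather than the real StarWars.csv, so the long column names, the short names in `short_names`, and the row values are all assumptions made for illustration. The sketch also assumes that "seen at least one film" means at least one per-film column is filled in; other readings are possible.

```python
import pandas as pd

# Toy stand-in for StarWars.csv. Column names and values are illustrative
# assumptions, not copied from the real file.
raw = pd.DataFrame({
    "RespondentID": [1, 2, 3, 4],
    "Have you seen any of the films?": ["Yes", "No", "Yes", "Yes"],
    "Which films have you seen? (Episode I)": ["Episode I", None, None, None],
    "Which films have you seen? (Episode IV)": ["Episode IV", None, "Episode IV", None],
})

# Hypothetical short names chosen for this sketch.
short_names = {
    "RespondentID": "respondent_id",
    "Have you seen any of the films?": "seen_any",
    "Which films have you seen? (Episode I)": "seen_ep1",
    "Which films have you seen? (Episode IV)": "seen_ep4",
}
df = raw.rename(columns=short_names)

# A respondent counts as having seen a film when that film's column is not missing.
seen_cols = [col for col in df.columns if col.startswith("seen_ep")]
seen_at_least_one = df[seen_cols].notna().any(axis=1)

# Keep only respondents with at least one film marked.
seen = df[seen_at_least_one].reset_index(drop=True)
print(seen)
```

Note that respondent 4 answers "Yes" to the general question but marks no individual film, so this definition drops that row. Whether that is the right call for the real data is a choice worth justifying in the appendix.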
Please validate that the data provided on GitHub lines up with the article by recreating 2 of their visuals and calculating 2 summaries that they report in the article.
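As a rough illustration of the kind of summary this validation involves, the sketch below computes the percentage of respondents who have seen each film. The table is toy data with hypothetical column names (`seen_ep1`, `seen_ep4`, `seen_ep5`), so the resulting numbers say nothing about the article's reported figures.

```python
import pandas as pd

# Toy table of respondents who have seen at least one film.
# Column names and values are assumptions for illustration only.
seen = pd.DataFrame({
    "seen_ep1": ["Episode I", None, "Episode I", None],
    "seen_ep4": ["Episode IV", "Episode IV", None, "Episode IV"],
    "seen_ep5": ["Episode V", "Episode V", "Episode V", "Episode V"],
})

# notna() marks a film as seen; the column mean is the share of respondents.
share_seen = (
    seen.notna()
    .mean()
    .mul(100)
    .round(0)
    .sort_values(ascending=False)
)
print(share_seen)
```

The same series could feed a bar chart to compare against one of the article's visuals.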
Clean and format the data so that it can be used in a machine learning model. Please complete the following requests and, in your report, provide examples of the table with a short description of the changes made.
- Create an additional column that converts the age ranges to a number and drop the age range categorical column.
- Create an additional column that converts the school groupings to a number and drop the school categorical column.
- Create an additional column that converts the income ranges to a number and drop the income range categorical column.
- Create your target (also known as label) column based on the new income range column.
- One-hot encode all remaining categorical columns.
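The conversion steps above can be sketched as follows. This is a minimal example on toy rows: the category labels, the column names, and the ordinal codes in `education_order` are assumptions and may not match the spelling used in StarWars.csv. Using the lower bound of each range as its numeric value is one of several defensible choices (midpoints are another).

```python
import pandas as pd

# Toy rows; labels are illustrative assumptions, not verified against the file.
df = pd.DataFrame({
    "age": ["18-29", "30-44", "> 60", "45-60"],
    "education": ["High school degree", "Bachelor degree",
                  "Graduate degree", "High school degree"],
    "income": ["$0 - $24,999", "$50,000 - $99,999",
               "$150,000+", "$25,000 - $49,999"],
    "fan": ["Yes", "No", "Yes", "No"],
})

# Age: use the first number in each range (its lower bound).
df["age_min"] = df["age"].str.extract(r"(\d+)", expand=False).astype(int)

# Education: hypothetical ordinal codes, lowest to highest.
education_order = {
    "Less than high school degree": 0,
    "High school degree": 1,
    "Some college or Associate degree": 2,
    "Bachelor degree": 3,
    "Graduate degree": 4,
}
df["education_level"] = df["education"].map(education_order)

# Income: strip "$" and ",", then take the lower bound of the range.
df["income_min"] = (
    df["income"]
    .str.replace(r"[$,]", "", regex=True)
    .str.extract(r"(\d+)", expand=False)
    .astype(int)
)

# Target: 1 when the income range starts at $50k or above (an assumption
# about how to read "more than $50k" for ranged data).
df["target"] = (df["income_min"] >= 50000).astype(int)

# Drop the original categorical columns, then one-hot encode what remains.
df = df.drop(columns=["age", "education", "income"])
encoded = pd.get_dummies(df, columns=["fan"], dtype=int)
print(encoded)
```

Because `target` is derived directly from `income_min`, the numeric income column should be left out of the feature set when training, otherwise the model simply reads the answer.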
Build a machine learning model that predicts whether a person makes more than $50k.
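A minimal modelling sketch, assuming scikit-learn is available, might look like the following. The features and target are synthetic stand-ins generated with a made-up rule, so the accuracy printed here reflects only the toy data. The random forest is just one reasonable classifier choice, not one prescribed by the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned survey table (hypothetical columns).
rng = np.random.default_rng(0)
n = 200
features = pd.DataFrame({
    "age_min": rng.choice([18, 30, 45, 60], size=n),
    "education_level": rng.integers(0, 5, size=n),
    "fan_Yes": rng.integers(0, 2, size=n),
})

# Made-up rule plus noise so the toy target is learnable.
noise = rng.normal(0, 1, size=n)
target = ((features["education_level"] + noise) > 2).astype(int)

# Hold out 25% of rows for evaluation, keeping class balance similar.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=42, stratify=target
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Held-out accuracy on synthetic data: {accuracy:.2f}")
```

On the real data, the feature table would be the cleaned, one-hot encoded survey responses with the income columns removed, and the target would be the over-$50k indicator.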
Deliverables:
Use this template to submit your Client Report. The template has three sections (for additional details please see the instructional template):
- A 30-second elevator pitch, as if you were in a job interview, that describes the tools you used in this project.
- An “elevator pitch” that summarizes the entire case study.
- Answers to the grand questions that include text, pictures, and tables.
- An appendix that provides your commented code and justification for any programming that required you to choose an option.