A significant portion of a data scientist’s job is data cleaning. During these two weeks, we will not hide the data munging from you. We will practice data cleaning using a Star Wars survey from FiveThirtyEight. Survey data is notoriously difficult to handle. Even when the data is recorded cleanly, the mix of ‘write-in’, ‘choose from multiple answers’, ‘pick all that are right’, and ‘multiple choice’ questions makes storing the data in a tidy format difficult.
Completed Readings: Python for Data Science: Tidy Data, Python for Data Science: Graphics for Communication, and Python for Data Science: Strings
Use the StarWars.csv file from FiveThirtyEight's GitHub account and read the accompanying article.
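
Loading the file might look like the sketch below. It is a starting point under stated assumptions, not a prescribed method: the URL is an assumption about where the file currently lives in FiveThirtyEight's `data` repository, `load_survey` is a hypothetical helper name, and the `ISO-8859-1` encoding is an assumption made in case the file does not decode cleanly as UTF-8.

```python
import pandas as pd

# Assumed location of the raw file; verify it against the repository before use.
STAR_WARS_URL = (
    "https://raw.githubusercontent.com/fivethirtyeight/data/"
    "master/star-wars-survey/StarWars.csv"
)


def load_survey(source=STAR_WARS_URL, encoding="ISO-8859-1"):
    """Read the survey from a URL, local path, or file-like object."""
    # The encoding is an assumption; change it if the file decodes differently.
    return pd.read_csv(source, encoding=encoding)
```

Calling `load_survey()` with no arguments would fetch the assumed URL, while `load_survey("StarWars.csv")` would read a local copy. Inspect the first few rows with `.head()` before cleaning, since survey exports often carry sub-question labels in the first data row.
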
Grand Questions
- Please validate that the data provided on GitHub lines up with the article by recreating 2 of their visuals and calculating 2 summaries that they report in the article.
- Shorten the column names and clean them up for easier use with pandas.
- Filter the dataset to respondents who have seen at least one film.
- Clean and format the data so that it can be used in a machine learning model. Please achieve the following requests and provide examples of the table with a short description of the changes made in your report.
  - One-hot encode all columns that have categories.
  - Convert all yes/no responses to 1/0 numeric.
  - Create an additional column that converts the income ranges to a number.
  - Create an additional column that converts the age ranges to a number.
  - Create an additional column that converts the school groupings to a number.
- Build a machine learning model that predicts whether a person makes more than $50k.
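
The cleaning requests above can be sketched on a small toy table. Everything here is illustrative: the long column names and answer values are stand-ins that only approximate the survey, the short names and the `education_rank` ordering are assumptions, and taking the lower bound of each range is just one possible way to turn a range into a number.

```python
import pandas as pd

# Toy stand-in for the survey; names and values are illustrative assumptions.
raw = pd.DataFrame({
    "Have you seen any of the 6 films in the Star Wars franchise?":
        ["Yes", "No", "Yes", "Yes"],
    "Do you consider yourself to be a fan of the Star Wars film franchise?":
        ["Yes", None, "No", "Yes"],
    "Age": ["18-29", "30-44", "45-60", "> 60"],
    "Household Income":
        ["$0 - $24,999", "$50,000 - $99,999", "$25,000 - $49,999", "$150,000+"],
    "Education": ["High school degree", "Graduate degree", "Bachelor degree",
                  "Some college or Associate degree"],
    "Location (Census Region)": ["Pacific", "Mountain", "Pacific", "New England"],
})

# 1. Shorten the column names (hypothetical short names).
df = raw.rename(columns={
    "Have you seen any of the 6 films in the Star Wars franchise?": "seen_any",
    "Do you consider yourself to be a fan of the Star Wars film franchise?": "fan",
    "Age": "age",
    "Household Income": "income",
    "Education": "education",
    "Location (Census Region)": "region",
})

# 2. Keep respondents who report having seen a film.
df = df[df["seen_any"] == "Yes"].copy()

# 3. Yes/No -> 1/0; answers outside the mapping (including missing) become NaN.
yes_no = {"Yes": 1, "No": 0}
for col in ["seen_any", "fan"]:
    df[col] = df[col].map(yes_no)

# 4. Ranges -> numbers, using the first number in each range (its lower bound).
df["age_min"] = pd.to_numeric(df["age"].str.extract(r"(\d+)", expand=False))
income_digits = df["income"].str.replace(r"[$,]", "", regex=True)
df["income_min"] = pd.to_numeric(income_digits.str.extract(r"(\d+)", expand=False))

# 5. School groupings -> an ordinal rank (assumed ordering).
education_rank = {
    "Less than high school degree": 0,
    "High school degree": 1,
    "Some college or Associate degree": 2,
    "Bachelor degree": 3,
    "Graduate degree": 4,
}
df["education_rank"] = df["education"].map(education_rank)

# 6. Target for the model. Rows with missing income should be dropped first
#    in real data, because NaN >= 50_000 evaluates to False.
df["income_over_50k"] = (df["income_min"] >= 50_000).astype(int)

# 7. One-hot encode the remaining categorical column.
model_ready = pd.get_dummies(
    df.drop(columns=["age", "income", "education"]),
    columns=["region"],
    dtype=int,
)
```

On the real survey, the same steps would need to handle missing answers and the multi-part questions, and `model_ready` (minus the income-derived columns) could then serve as the feature table for a classifier of your choice.
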