Project 5: The war with Star Wars

Background

Survey data is notoriously difficult to handle. Even when the data is recorded cleanly, the mix of response types (write-in questions, ‘choose from multiple answers’ questions, ‘pick all that are right’ questions, and multiple-choice questions) makes it difficult to store the data in a tidy format.

In 2014, FiveThirtyEight surveyed over 1000 people for the article America’s Favorite ‘Star Wars’ Movies (And Least Favorite Characters). They have provided the data on GitHub.

For this project, your client would like to use the Star Wars survey data to find out whether they can predict the current income of a job candidate they are interviewing, based on a few responses about Star Wars movies.

Data

Download: StarWars.csv
Information: Article

Readings

Grand Questions and Tasks

  1. Shorten the column names and clean them up for easier use with pandas.

  2. Please validate that the data provided on GitHub lines up with the article by recreating 2 of their visuals and calculating 2 summaries that they report in the article.

  3. Clean and format the data so that it can be used in a machine learning model. As you format the data, you should complete each item listed below. In your final report provide example(s) of the reformatted data with a short description of the changes made.

    1. Filter the dataset to respondents who have seen at least one film.
    2. Create a new column that converts the age ranges to a single number. Drop the age range categorical column.
    3. Create a new column that converts the school groupings to a single number. Drop the school categorical column.
    4. Create a new column that converts the income ranges to a single number. Drop the income range categorical column.
    5. One-hot encode all remaining categorical columns.
    6. Create your target (also known as "y" or "label") column based on the new income range column.

  4. Build a machine learning model that predicts whether a person makes more than $50k. Describe your model and report the accuracy.
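
The renaming and cleaning steps in tasks 1 and 3 can be sketched on a small toy table. This is only an illustration under stated assumptions, not a reference solution: the column names, the category labels, and the choice to represent each range by its lower bound are all hypothetical, and the real StarWars.csv may differ. The school column from task 3 would follow the same mapping pattern.

```python
import pandas as pd

# Toy stand-in for the survey data. All column names and values below are
# assumptions for illustration, not the actual contents of StarWars.csv.
raw = pd.DataFrame({
    "Have you seen any of the films?": ["Yes", "Yes", "No", "Yes"],
    "Age": ["18-29", "30-44", "45-60", "> 60"],
    "Household Income": ["$0 - $24,999", "$50,000 - $99,999",
                         "$25,000 - $49,999", "$150,000+"],
    "Gender": ["Male", "Female", "Female", "Male"],
})

# Task 1: shorten the column names (the short names are hypothetical).
df = raw.rename(columns={
    "Have you seen any of the films?": "seen_any",
    "Age": "age",
    "Household Income": "income",
    "Gender": "gender",
})

# Task 3.1: keep respondents who report having seen a film, then drop the
# now-constant filter column.
df = df[df["seen_any"] == "Yes"].drop(columns="seen_any")

# Tasks 3.2 and 3.4: map each range to a single number. Using the lower
# bound is one possible choice; midpoints would work as well.
age_map = {"18-29": 18, "30-44": 30, "45-60": 45, "> 60": 61}
income_map = {
    "$0 - $24,999": 0,
    "$25,000 - $49,999": 25000,
    "$50,000 - $99,999": 50000,
    "$100,000 - $149,999": 100000,
    "$150,000+": 150000,
}
df["age_num"] = df["age"].map(age_map)
df["income_num"] = df["income"].map(income_map)
df = df.drop(columns=["age", "income"])

# Task 3.5: one-hot encode the remaining categorical column(s).
encoded = pd.get_dummies(df, columns=["gender"], dtype=int)

# Task 3.6: binary target. With lower bounds, ">= 50000" selects every
# range that starts at $50k or above.
encoded["over_50k"] = (encoded["income_num"] >= 50000).astype(int)

# Separate features from the target; the numeric income column is left out
# of the features because the target is derived from it.
X = encoded.drop(columns=["income_num", "over_50k"])
y = encoded["over_50k"]
```

Note that `Series.map` returns NaN for any label missing from the dictionary, and a NaN income compares as False in the target step, so rows with missing income are better dropped before the target is created.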

Deliverables

Use the provided template to submit your case study. The template has three sections:

  1. A short summary that describes the results of the project and the tools you used. (Think “elevator pitch”.)
  2. Answers to the grand questions. Each answer should include a written description of your results, and may also include charts or tables.
  3. An appendix that provides your commented code. Your code comments should justify any decisions you had to make while programming.