Project 5: The War with Star Wars

Published

May 1, 2020

Walkthrough

Background

Survey data is notoriously difficult to munge. Even when the data is recorded cleanly, the options for ‘write in questions’, ‘choose from multiple answers’, ‘pick all that are right’, and ‘multiple choice questions’ make storing the data in a tidy format difficult.
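To make the ‘pick all that are right’ problem concrete, here is a minimal sketch on a made-up three-respondent table. The column names and values are hypothetical and are not taken from the survey file. Each option gets its own column in the wide layout, and `melt` reshapes it so that each row is one respondent and one ticked option.

```python
import pandas as pd

# Hypothetical wide layout: one column per option of a "select all" question.
# A cell holds the option text when ticked and is missing otherwise.
wide = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "seen_film_a": ["Film A", None, "Film A"],
    "seen_film_b": ["Film B", "Film B", None],
})

# Reshape to a tidy long layout: one row per (respondent, option ticked).
tidy = (
    wide.melt(id_vars="respondent_id", var_name="question", value_name="answer")
        .dropna(subset=["answer"])
        .sort_values(["respondent_id", "question"])
        .reset_index(drop=True)
)
print(tidy)
```

Whether the long or the wide layout is more useful depends on the task: charts often want the long form, while a machine learning model usually wants one row per respondent.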

In 2014, FiveThirtyEight surveyed over 1000 people to write the article America’s Favorite ‘Star Wars’ Movies (And Least Favorite Characters). They have provided the data on GitHub.

For this project, your client would like to use the Star Wars survey data to figure out if they can predict an interviewing job candidate’s current income based on a few responses about Star Wars movies.

Client Request

The client performed the survey but outsourced the analytics to a third party. They want you to clean up the data so you can:
a. Validate that the data provided on GitHub lines up with the article by recreating 2 of the visuals from the article
b. Determine whether you can predict if a person from the survey makes more than $50k

Data

URL: StarWars.csv
Information: Article

Readings

P4DS: CH6 Tidy Data (Skim)
P4DS: CH14 Graphics for Communication (Skim)
P4DS: CH16 Numbers (Read)
P4DS: CH17 Strings and Text (Read)
P4DS: Ch18 Regular Expressions (Read)
P4DS: CH19 Categorical Data (Read)

Optional References

Why to not use get_dummies

Questions and Tasks (Core)

Note

This section lists the questions and tasks that need to be completed for the project. Your work on the project must be compiled into a report and rendered to an HTML file in your Course Portfolio, with a link to that project page uploaded in Canvas.

There are two types of questions: Core and Stretch. Core questions are required for each project. The course syllabus competencies require a specific number of projects with all the Stretch questions achieved, based on the grade level you are seeking.

Shorten the column names and clean them up for easier use with pandas. Provide a table or list that exemplifies how you fixed the names.
Clean and format the data so that it can be used in a machine learning model. As you format the data, you should complete each item listed below. In your final report provide example(s) of the reformatted data with a short description of the changes made.
1. Filter the dataset to respondents that have seen at least one film
2. Create a new column that converts the age ranges to a single number. Drop the age range categorical column
3. Create a new column that converts the education groupings to a single number. Drop the school categorical column
4. Create a new column that converts the income ranges to a single number. Drop the income range categorical column
5. Create your target (also known as “y” or “label”) column based on the new income range column
6. One-hot encode all remaining categorical columns
Validate that the data provided on GitHub lines up with the article by recreating 2 of the visuals from the article.
Build a machine learning model that predicts whether a person makes more than $50k. Describe your model and report the accuracy.
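The cleaning steps above can be sketched on a toy table as follows. This is only an illustration under stated assumptions: the column names, answer labels, short names, and the education ranking are all hypothetical stand-ins, so check them against the real file before reusing anything. Using the lower bound of each range as the "single number" is also just one possible choice.

```python
import pandas as pd

# Toy stand-in for the survey; every column name and value here is hypothetical.
raw = pd.DataFrame({
    "Have you seen any of the films?": ["Yes", "No", "Yes", "Yes"],
    "Which character shot first?": ["Han", None, "Greedo", "Han"],
    "Age": ["18-29", "30-44", "45-60", "> 60"],
    "Education": ["High school degree", "Bachelor degree",
                  "Graduate degree", "Bachelor degree"],
    "Household Income": ["$0 - $24,999", "$50,000 - $99,999",
                         "$100,000 - $149,999", "$25,000 - $49,999"],
})

# Shorten the column names with an explicit old -> new mapping.
new_names = {
    "Have you seen any of the films?": "seen_any",
    "Which character shot first?": "shot_first",
    "Age": "age",
    "Education": "education",
    "Household Income": "income",
}
df = raw.rename(columns=new_names)

# 1. Keep respondents who have seen at least one film.
df = df[df["seen_any"] == "Yes"].copy()

# 2. Age range -> one number (first number in the label, an arbitrary choice).
df["age_num"] = df["age"].str.extract(r"(\d+)", expand=False).astype(int)
df = df.drop(columns="age")

# 3. Education grouping -> one number via an assumed ordinal ranking.
edu_rank = {"High school degree": 1, "Bachelor degree": 2, "Graduate degree": 3}
df["edu_num"] = df["education"].map(edu_rank)
df = df.drop(columns="education")

# 4. Income range -> one number (lower bound, with "$" and "," stripped first).
df["income_num"] = (
    df["income"].str.replace(r"[$,]", "", regex=True)
                .str.extract(r"(\d+)", expand=False)
                .astype(int)
)
df = df.drop(columns="income")

# 5. Target: 1 when the lower bound of the income range is at least 50,000.
df["over_50k"] = (df["income_num"] >= 50_000).astype(int)

# 6. One-hot encode the remaining text columns. The income number is left out
#    of the features because the target was derived from it.
features = pd.get_dummies(
    df.drop(columns=["income_num", "over_50k"]),
    columns=["seen_any", "shot_first"],
    dtype=int,
)
print(features.join(df["over_50k"]))
```

`pd.get_dummies` is used here only for brevity. The optional reference listed above discusses reasons you may prefer a different encoder. The real data will also contain missing answers, which `astype(int)` does not accept, so decide how to handle them before converting.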

Questions and Tasks (Stretch)

Here are example Stretch questions for this project. Your instructor may assign different Stretch questions. You must comment in Canvas when submitting your project if you completed any of the Stretch questions.

Build a machine learning model that predicts whether a person makes more than $50k with an accuracy of at least 65%. Describe your model and report the accuracy.
Validate that the data provided on GitHub lines up with the article by recreating a 3rd visual from the article.
Create a new column that converts the location groupings to a single number. Drop the location categorical column.
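For the model-building tasks, the sketch below shows one way to train a classifier and report accuracy on held-out rows. It runs on synthetic data, so the feature names, the target rule, and the resulting accuracy say nothing about the survey. `RandomForestClassifier` is just one reasonable choice and is not prescribed by the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned survey features; not real survey data.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age_num": rng.choice([18, 30, 45, 60], size=n),
    "edu_num": rng.integers(1, 6, size=n),
    "fan_Yes": rng.integers(0, 2, size=n),
})
# Hypothetical target loosely tied to education so there is something to learn.
y = ((X["edu_num"] + rng.normal(0, 1, size=n)) > 3).astype(int)

# Hold out 30% of the rows so accuracy is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

When you report accuracy on the real data, it helps to compare it with the share of the most common class in the target, since a model that always predicts that class already reaches that score.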

Submission:

Note

When you have completed the report, you will need to follow this process to submit your work:

Have the Course Work Portfolio open in VS Code and open Projects/Project5.qmd
Click the Preview button in the top right of the VS Code window
1. This will render the project and the entire Course Work Portfolio into HTML files for review
2. Confirm everything displays as you would like it to
3. How you see it will be how it is viewed for grading
4. If there is an error in any cell of the Quarto files, the rendering will stop and you will need to fix the error before rendering again (if you get stuck, post your error in Slack)
Once the report is confirmed, close the preview and open the GitHub Desktop application
Confirm you are in the correct repository in the top left corner of the screen
Confirm you are on the correct branch Main in the top left corner of the screen (Never change off the Main branch)
Type a summary of the changes in the Summary box
Click the blue Commit to main button in the bottom left corner
Click the blue Push origin button in the middle right of the screen
1. This will push all your changes in the project .qmd file to GitHub
2. The publish.yml file will kick off an automated process to render the project into HTML files
3. The HTML files will be published to GitHub pages in the gh-pages branch
4. The URL to the published project will be in the deployment section in GitHub
  1. In GitHub Desktop, click Open in GitHub to navigate to the repository
  2. Click on the Actions tab and make sure there were no errors in the rendering process
  3. Click on the deployment section of the main page of the repository to find the URL
  4. Navigate to the URL and confirm it displays as you intended
  5. Copy the URL and submit it in Canvas

Deliverables:

Use this P5_template to submit your Client Report. The template has two sections:

A short elevator pitch that highlights key values or metrics from the results. Describe these key insights in a way that hooks the reader into wanting to read more about your work. The writing style should be more technical with some creative elements. Do not summarize what you did.
Answers to the questions and tasks. Each should include a written description of your results, code cells with comments, and charts and/or tables.
A short summary of work must be submitted in the comments in Canvas when you submit the URL. Rate your own work on a scale of 1-5, with 1 being poor and 5 being excellent. Include a short description of why you rated your work the way you did.

Feedback:

Note

You will receive feedback and/or coaching notes in the form of a GitHub issue. You will need to address the feedback, re-render and resubmit the project, and mark the GitHub issue as closed.