Day 3: Validating data, cleaning columns

Welcome to class!

Announcements

  • Practice Coding Challenge questions

Spiritual Thought

  • Lotus eaters

Let’s validate some data!

Pick something from the Star Wars article you want to validate (“double check”).


Moving from categories to values.

  1. Create one or more additional columns that convert the income ranges to numbers.
  2. Create one or more additional columns that convert the age ranges to numbers.
  3. Create one or more additional columns that convert the school groupings to numbers.
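
The three steps above can be sketched on a small toy table. This is only an illustration: the column names (household_income, age, education), the range labels, and the choice to use the lower bound of each range are all assumptions, so check them against your own cleaned data before borrowing anything.

```python
import pandas as pd

# Toy data. Column names and range labels are assumptions,
# not necessarily what your survey columns look like.
survey = pd.DataFrame({
    "household_income": ["$0 - $24,999", "$25,000 - $49,999", "$150,000+", None],
    "age": ["18-29", "30-44", "> 60", "45-60"],
    "education": ["High school degree", "Bachelor degree", "Graduate degree", None],
})

# Income: strip "$" and ",", then keep the first number (lower bound of the range).
survey["income_min"] = (
    survey["household_income"]
    .str.replace(r"[$,]", "", regex=True)
    .str.extract(r"(\d+)", expand=False)
    .astype("float")
)

# Age: keep the first number that appears in the label.
survey["age_min"] = (
    survey["age"].str.extract(r"(\d+)", expand=False).astype("float")
)

# Education: map each grouping to an ordered number (hypothetical ordering).
education_order = {
    "Less than high school degree": 0,
    "High school degree": 1,
    "Some college or Associate degree": 2,
    "Bachelor degree": 3,
    "Graduate degree": 4,
}
survey["education_level"] = survey["education"].map(education_order)

print(survey[["income_min", "age_min", "education_level"]])
```

Missing answers stay missing (NaN) in the new columns, which is usually what you want to see before deciding how to handle them. Using the lower bound is one choice among several; the midpoint of each range would be just as defensible.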

Validating visuals

You’re going to make a lot of bar charts!


Getting started on Question 3

One-hot encoding

Project 5 asks you to “one-hot encode all columns that have categories” and “convert all yes/no responses to 1/0 numeric”.

The pd.get_dummies function creates one-hot encoded variables. The pd.get_dummies documentation is a great place to start.

After reading the documentation, study the code below and get started on Grand Question #3.

# %%
# Assumes pandas is imported as pd and `starwars` is the cleaned survey DataFrame.
# When we use machine learning to predict salary,
# let's only look at people who have seen at least
# one Star Wars film
starwars = starwars.query('have_seen_any == "Yes"')

# Discuss - what's a better way to filter out people
# who haven't seen Star Wars?

# %%
# Format columns for machine learning

# Let's try this first: convert categories to "one-hot" encodings
shot_first_onehot = pd.get_dummies(starwars.shot_first)
shot_first_onehot

# What's the difference between the code above
# and this? Which one is better?
shot_first_onehot = pd.get_dummies(starwars.shot_first, drop_first=True)
shot_first_onehot

# %%
# 'get_dummies()' can also be used to convert yes/no answers to 0/1

episode_i = pd.get_dummies(starwars.seen_film_i__the_phantom_menace)
episode_i

# %%
episode_i.value_counts()
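
To see what get_dummies actually returns for a yes/no column, here is a minimal sketch on a toy Series. The name seen_episode_i and its "Yes"/"No" values are assumptions for illustration; your real column may hold different values.

```python
import pandas as pd

# Toy stand-in for one yes/no survey column (name and values are assumptions).
seen = pd.Series(["Yes", "No", "Yes", "Yes"], name="seen_episode_i")

# Plain get_dummies makes one column per category: "No" and "Yes".
both = pd.get_dummies(seen)

# drop_first=True drops the first category ("No") and keeps a single "Yes" column.
# dtype=int asks for 1/0 values instead of True/False.
yes_only = pd.get_dummies(seen, drop_first=True, dtype=int)

# An explicit mapping is another option when the answers are strictly Yes/No.
mapped = seen.map({"Yes": 1, "No": 0})

print(both)
print(yes_only)
print(mapped)
```

With two categories, the second column carries no extra information, so a single 1/0 column is enough. Whether to use get_dummies or an explicit map is a judgment call; the map makes the intended coding visible in the code.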