Day 3: Validating data, cleaning columns

Welcome to class!

Announcements

Gratitude Journal

Validating visuals

You’re going to make a lot of bar charts!

Simple bar chart tutorial.
Make Altair do the counting for you! Tutorials here and here.

Getting started on Grand Question 3

One-hot encoding

Project 5 asks you to “one-hot encode all columns that have categories” and “convert all yes/no responses to 1/0 numeric”.

The pd.get_dummies function can be used to create one-hot encoded variables. The pd.get_dummies documentation is a great place to start.
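
As a rough illustration of what pd.get_dummies produces, here is a minimal sketch on a toy DataFrame. The column names and values are made up and are not the real survey data; the columns, drop_first, and dtype arguments are standard pd.get_dummies parameters.

```python
import pandas as pd

# Toy data; column names and values are invented for illustration.
toy = pd.DataFrame({
    "shot_first": ["Han", "Greedo", "Han", "I don't understand this question"],
    "age": ["18-29", "30-44", "18-29", "45-60"],
})

# One indicator column per category, for every column listed in `columns`.
# dtype=int forces 0/1 integers (recent pandas versions default to booleans).
encoded = pd.get_dummies(toy, columns=["shot_first", "age"], dtype=int)

# drop_first=True drops the first category of each column,
# leaving k-1 indicator columns for a k-category variable.
encoded_dropped = pd.get_dummies(
    toy, columns=["shot_first", "age"], drop_first=True, dtype=int
)

print(encoded.columns.tolist())
print(encoded_dropped.columns.tolist())
```

Comparing the two printed column lists shows which indicator columns drop_first removes.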

After reading the documentation, study the code below and get started on Grand Question 3.

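# %%
# Assumes pandas is imported as pd and `starwars` is the
# survey DataFrame with cleaned column names.
# When we use machine learning to predict salary,
# let's only look at people who have seen at least
# one Star Wars film
starwars = starwars.query('have_seen_any == "Yes"')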

# Discuss: what's a better way to filter out people
# who haven't seen Star Wars?

# %%
# Format columns for machine learning

# Let's try this first: convert categories to "one-hot" encodings
shot_first_onehot = pd.get_dummies(starwars.shot_first)
shot_first_onehot

# What's the difference between the code above
# and the code below? Which one is better?
shot_first_onehot = pd.get_dummies(starwars.shot_first, drop_first=True)
shot_first_onehot

# %%
# 'get_dummies()' can also be used to convert yes/no answers to 0/1

episode_i = pd.get_dummies(starwars.seen_film_i__the_phantom_menace)
episode_i

# %%
episode_i.value_counts()
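
For the yes/no columns, here is a small sketch comparing two possible approaches on a made-up Series. The name seen_any and its values are hypothetical, not taken from the survey. It is meant as a starting point for thinking about missing answers, not as the required solution.

```python
import pandas as pd

# Hypothetical yes/no column; the values are invented.
seen = pd.Series(["Yes", "No", "Yes", None, "Yes"], name="seen_any")

# Option 1: get_dummies with drop_first=True keeps a single 0/1 column ("Yes").
# A missing answer becomes 0 here, because get_dummies ignores NaN by default.
seen_dummy = pd.get_dummies(seen, drop_first=True, dtype=int)

# Option 2: map the answers explicitly; a missing answer stays missing (NaN).
seen_mapped = seen.map({"Yes": 1, "No": 0})

print(seen_dummy["Yes"].tolist())
print(seen_mapped.tolist())
```

The two options differ only in how they treat the missing answer, which matters if "did not respond" should not be counted as "No".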