Welcome to class!
Announcements
Gratitude Journal
Getting the data ready for machine learning.
What are machine learning algorithms expecting to see?
We need to handle missing values and categorical features before feeding the data into a machine learning algorithm, because the mathematics underlying most machine learning models assumes the data is numeric and contains no missing values. Scikit-learn enforces this requirement: models like linear regression and logistic regression will raise an error if you try to train them on data containing missing or non-numeric values.
We have some options when converting categorical features (columns) to numeric.
- If the category contains numeric information (like a range of numbers) we can convert it to a numeric variable by taking the minimum, average, or maximum of the range.
- If the category is an “ordinal” variable (meaning, there is an order to the categories) we can assign each category to an integer. (For example, good = 1, better = 2, best = 3.)
- If the category is a “nominal” variable (without an order) then we need to use one-hot encoding (sometimes called “dummy variable encoding”).
- If the category is some version of True/False or Yes/No then we can simply convert the values to zeros and ones.
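The four options above can be sketched on a toy DataFrame (the column names and values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "age_range": ["18-29", "30-44", "45-60"],  # numeric info inside a category
    "quality": ["good", "better", "best"],     # ordinal
    "color": ["blue", "green", "red"],         # nominal
    "is_fan": ["Yes", "No", "Yes"],            # Yes/No
})

# 1. Range of numbers -> number: here we take the minimum of each range
df["age_min"] = df["age_range"].str.split("-").str[0].astype(int)

# 2. Ordinal -> integer, using an explicit mapping
df["quality_num"] = df["quality"].map({"good": 1, "better": 2, "best": 3})

# 3. Nominal -> one-hot encoding (one binary column per category)
df = pd.concat([df, pd.get_dummies(df["color"], prefix="color")], axis=1)

# 4. Yes/No -> 0/1
df["is_fan_01"] = (df["is_fan"] == "Yes").astype(int)
```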
What’s our game plan for the Star Wars columns?
First: Limit the data to only people who answered “Yes” to the question “Have you seen any of the 6 films in the Star Wars franchise?”.
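The filtering step might look like this; `seen_any_film` is a placeholder name, so substitute whatever your data calls the "Have you seen any of the 6 films" column:

```python
import pandas as pd

# Toy stand-in for the survey data (column name is hypothetical)
dat = pd.DataFrame({
    "seen_any_film": ["Yes", "No", "Yes"],
    "age": ["18-29", "30-44", "45-60"],
})

# Keep only respondents who answered "Yes"
dat = dat[dat["seen_any_film"] == "Yes"]
```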
Then: Use the table below as a guide to prepare your data for machine learning.
Column | Original Format | Convert To |
---|---|---|
age | category (ordinal, age ranges) | number |
income | category (ordinal, income ranges) | number |
education | category (ordinal, name of degree) | number |
shot_first | category (nominal) | one-hot |
gender | category (nominal) | one-hot |
location | category (nominal) | one-hot |
fan_star_wars | Yes/No | 0/1 |
expanded_universe | Yes/No | 0/1 |
fan_exapanded | Yes/No | 0/1 |
fan_star_trek | Yes/No | 0/1 |
seen_i | Yes/No (name of movie/NaN) | 0/1 |
seen_ii | Yes/No (name of movie/NaN) | 0/1 |
seen_iii | Yes/No (name of movie/NaN) | 0/1 |
seen_iv | Yes/No (name of movie/NaN) | 0/1 |
seen_v | Yes/No (name of movie/NaN) | 0/1 |
seen_vi | Yes/No (name of movie/NaN) | 0/1 |
movie rankings | number | - |
character rankings | category (ordinal) | one-hot or factorize |
What functions can we use to convert the categorical columns to numeric?
- Range of numbers: str.split() and astype()
- Ordinal: str.replace()
- Ordinal: pd.factorize() (can also be used for True/False)
- Nominal: pd.get_dummies()
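A quick sketch of each function in action, using made-up values shaped like the survey responses:

```python
import pandas as pd

# Range of numbers: strip symbols with str.replace(), split on the dash,
# keep the low end of the range, then convert with astype()
s_income = pd.Series(["$25,000 - $49,999", "$50,000 - $99,999"])
low = (s_income.str.replace("$", "", regex=False)
               .str.replace(",", "", regex=False)
               .str.split(" - ").str[0]
               .astype(int))

# Ordinal: pd.factorize() assigns an integer to each unique value
# (integers follow order of first appearance, not necessarily the true order)
s_edu = pd.Series(["High school", "Bachelor degree", "High school"])
codes, labels = pd.factorize(s_edu)

# Nominal: pd.get_dummies() creates one binary column per category
s_gender = pd.Series(["Male", "Female"])
dummies = pd.get_dummies(s_gender, prefix="gender")
```

Note that `factorize()` numbers categories in the order they appear, so for a true ordinal column an explicit mapping (e.g. with `str.replace()` or `map()`) gives you control over which integer each category gets.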
Question: When and why would we drop the first column when we convert a category using pd.get_dummies()?
Answer: Whenever your algorithm needs to calculate a matrix inverse.
The one-hot encoding creates one binary variable for each category.
The problem is that this representation includes redundancy. For example, if we know that [1, 0, 0] represents “blue” and [0, 1, 0] represents “green”, we don’t need a third binary variable for “red”: a 0 in both the “blue” and “green” columns, i.e. [0, 0], already identifies it.
This is called a dummy variable encoding, which always represents C categories with C-1 binary variables. In addition to being less redundant, a dummy variable representation is required for some models.
For example, in the case of a linear regression model (and other regression models that have a bias term), a one-hot encoding will cause the matrix of input data to become singular, meaning it cannot be inverted and the linear regression coefficients cannot be calculated using linear algebra. For these types of models a dummy variable encoding must be used instead.
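In pandas, the switch from one-hot to dummy encoding is the `drop_first=True` argument. A small sketch:

```python
import pandas as pd

colors = pd.Series(["blue", "green", "red"])

# One-hot: C binary columns for C categories
one_hot = pd.get_dummies(colors)

# Dummy encoding: C-1 columns; the dropped category ("blue", the first
# alphabetically) is represented by zeros in every remaining column
dummy = pd.get_dummies(colors, drop_first=True)
```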
Predicting income.
Grand Question 4 wants us to “build a machine learning model that predicts whether a person makes more than $50k”.
Aka, what is our “outcome” or “response” that we want to predict?
dat_ml.income > 50000
Remember not to include the answer (income) in your features!
x = dat_ml.drop(columns=['income'])
The response needs to be saved as a 0/1 variable (at least, for binary classification algorithms).
y = (dat_ml.income > 50000) * 1
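Putting the last few steps together on a toy stand-in for `dat_ml` (the feature columns here are assumed, following the table above):

```python
import pandas as pd

# Toy version of dat_ml with income already converted to a number
dat_ml = pd.DataFrame({
    "income": [25000, 75000, 150000],
    "age": [24, 37, 52],
    "fan_star_wars": [1, 0, 1],
})

# Features: everything except the answer we are trying to predict
X = dat_ml.drop(columns=["income"])

# Target: 1 if the person makes more than $50k, else 0
y = (dat_ml.income > 50000) * 1
```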