Day 4: May the ML columns be with you

Welcome to class!

Announcements

Gratitude Journal


Getting the data ready for machine learning.


What are machine learning algorithms expecting to see?

We need to handle missing values and categorical features before feeding the data into a machine learning algorithm, because the mathematics underlying most machine learning models assumes the data are numerical and contain no missing values. scikit-learn enforces this requirement: models like linear regression and logistic regression will raise an error if you try to train them on data that contain missing or non-numeric values. (ref)

We have some options when converting categorical features (columns) to numeric.

  • If the category contains numeric information (like a range of numbers), we can convert it to a numeric variable by taking the minimum, average, or maximum of the range.
  • If the category is an “ordinal” variable (meaning there is an order to the categories), we can assign each category an integer. (For example, good = 1, better = 2, best = 3.)
  • If the category is a “nominal” variable (without an order), then we need to use one-hot encoding (sometimes called “dummy variable encoding”).
  • If the category is some version of True/False or Yes/No, then we can simply convert the values to zeros and ones.
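The four strategies above can be sketched in pandas; the column names and values here are made up for illustration, not taken from the real dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "age_range": ["18-29", "30-44", "18-29"],  # numeric info stored as a range
    "quality": ["good", "best", "better"],     # ordinal
    "color": ["blue", "green", "red"],         # nominal
    "is_fan": ["Yes", "No", "Yes"],            # Yes/No
})

# 1. Range -> number: split the endpoints and take their average.
df["age"] = (df["age_range"]
             .str.split("-", expand=True)
             .astype(float)
             .mean(axis=1))

# 2. Ordinal -> integer via an explicit mapping.
df["quality_num"] = df["quality"].map({"good": 1, "better": 2, "best": 3})

# 3. Nominal -> one-hot encoding.
df = pd.concat([df, pd.get_dummies(df["color"], prefix="color")], axis=1)

# 4. Yes/No -> 0/1.
df["is_fan_01"] = (df["is_fan"] == "Yes").astype(int)
```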

What’s our game plan for the Star Wars columns?

First: Limit the data to only people who answered “Yes” to the question “Have you seen any of the 6 films in the Star Wars franchise?”.
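A minimal filtering sketch; the column name `seen_any` and the example values are assumptions, not the real dataset:

```python
import pandas as pd

# Stand-in for the survey data; `seen_any` holds the screening question.
dat = pd.DataFrame({
    "seen_any": ["Yes", "No", "Yes"],
    "shot_first": ["Han", None, "Greedo"],
})

# Keep only respondents who answered "Yes" to the screening question.
dat = dat[dat["seen_any"] == "Yes"]
```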

Then: Use the table below as a guide to prepare your data for machine learning.

| Column | Original Format | Convert To |
| --- | --- | --- |
| age | category (ordinal, age ranges) | number |
| income | category (ordinal, income ranges) | number |
| education | category (ordinal, name of degree) | number |
| shot_first | category (nominal) | one-hot |
| gender | category (nominal) | one-hot |
| location | category (nominal) | one-hot |
| fan_star_wars | Yes/No | 0/1 |
| expanded_universe | Yes/No | 0/1 |
| fan_exapanded | Yes/No | 0/1 |
| fan_star_trek | Yes/No | 0/1 |
| seen_i | Yes/No (name of movie/NaN) | 0/1 |
| seen_ii | Yes/No (name of movie/NaN) | 0/1 |
| seen_iii | Yes/No (name of movie/NaN) | 0/1 |
| seen_iv | Yes/No (name of movie/NaN) | 0/1 |
| seen_v | Yes/No (name of movie/NaN) | 0/1 |
| seen_vi | Yes/No (name of movie/NaN) | 0/1 |
| movie rankings | number | - |
| character rankings | category (ordinal) | one-hot or factorize |
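For the `seen_*` columns, the table notes that the raw values are either the movie's name (saw it) or NaN (didn't), so "has a value" becomes 1. A sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Stand-in column: movie name if seen, NaN otherwise.
dat = pd.DataFrame({
    "seen_i": ["Star Wars: Episode I", np.nan, np.nan],
})

# Presence of any value -> 1, NaN -> 0.
dat["seen_i"] = dat["seen_i"].notna().astype(int)
```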

What functions can we use to convert the categorical columns to numeric?
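A few pandas functions cover everything in the table: `pd.get_dummies` for one-hot/dummy encoding, `pd.factorize` for mapping categories to integer codes, and `Series.map` for explicit ordinal mappings. A small sketch (example values made up):

```python
import pandas as pd

s = pd.Series(["Han", "Greedo", "Han"])

# One-hot encoding: one binary column per category.
dummies = pd.get_dummies(s, prefix="shot_first")

# Factorize: each unique category gets an integer code,
# with the unique values returned alongside the codes.
codes, uniques = pd.factorize(s)
```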

Question: When and why would we drop the first column when we convert a category using pd.get_dummies()?

Answer: Whenever your algorithm needs to calculate a matrix inverse (for example, regression models that include an intercept).

The one-hot encoding creates one binary variable for each category.


The problem is that this representation includes redundancy. For example, if we know that [1, 0, 0] represents “blue” and [0, 1, 0] represents “green”, we don’t need a third binary variable to represent “red”: we can represent it with 0 values for both “blue” and “green”, i.e. [0, 0].


This is called a dummy variable encoding, and always represents C categories with C-1 binary variables. In addition to being slightly less redundant, a dummy variable representation is required for some models.


For example, in the case of a linear regression model (and other regression models that have a bias term), a one-hot encoding will cause the matrix of input data to become singular, meaning it cannot be inverted and the linear regression coefficients cannot be calculated using linear algebra. For these types of models, a dummy variable encoding must be used instead.

Source
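The singularity argument above can be checked numerically. A sketch with a made-up color column: with a full one-hot encoding the dummy columns sum to 1, duplicating the intercept column, so the design matrix loses full rank; dropping the first column restores it.

```python
import numpy as np
import pandas as pd

color = pd.Series(["blue", "green", "red", "blue"])

full = pd.get_dummies(color).astype(float)                    # 3 columns
dummy = pd.get_dummies(color, drop_first=True).astype(float)  # 2 columns

# Prepend an intercept column of ones, as regression models do.
intercept = np.ones((len(color), 1))
X_full = np.hstack([intercept, full.to_numpy()])    # 4 columns, rank 3
X_dummy = np.hstack([intercept, dummy.to_numpy()])  # 3 columns, rank 3

np.linalg.matrix_rank(X_full)   # fewer than its column count: singular
np.linalg.matrix_rank(X_dummy)  # equals its column count: invertible
```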


Predicting income.

Grand Question 4 wants us to “build a machine learning model that predicts whether a person makes more than $50k”.

Aka, what is our “outcome” or “response” that we want to predict?

dat_ml.income > 50000

Remember not to include the answer (income) in your features!

x = dat_ml.drop(columns=['income'])

The response needs to be saved as a 0/1 variable (at least, for binary classification algorithms).

y = (dat_ml.income > 50000) * 1
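An equivalent, more explicit way to get the 0/1 target is .astype(int), which says what we mean and keeps an integer dtype. The dat_ml values here are a made-up stand-in:

```python
import pandas as pd

# Stand-in data; the real dat_ml.income would come from the survey.
dat_ml = pd.DataFrame({"income": [30000, 80000, 52000, 10000]})

# True/False -> 1/0, stored as integers.
y = (dat_ml["income"] > 50000).astype(int)
```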