Welcome to class!
Announcements
Gratitude Journal
Getting the data ready for machine learning.
What are machine learning algorithms expecting to see?
We need to handle missing values and categorical features before feeding the data into a machine learning algorithm, because the mathematics underlying most machine learning models assumes the data is numeric and contains no missing values. Scikit-learn enforces this requirement: models like linear regression and logistic regression will raise an error if you try to train them on data containing missing or non-numeric values.
We have some options when converting categorical features (columns) to numeric.
- If the category contains numeric information (like a range of numbers) we can convert it to a numeric variable by taking the minimum, average, or maximum of the range.
- If the category is an “ordinal” variable (meaning, there is an order to the categories) we can assign each category to an integer. (For example, good = 1, better = 2, best = 3.)
- If the category is a “nominal” variable (without an order) then we need to use one-hot encoding (sometimes called “dummy variable encoding”).
- If the category is some version of True/False or Yes/No then we can simply convert the values to zeros and ones.
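The four options above can be sketched on a toy DataFrame (the column names and values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "age_range": ["18-29", "30-44", "45-60"],  # numeric info inside a category
    "quality": ["good", "better", "best"],     # ordinal
    "color": ["blue", "green", "red"],         # nominal
    "is_fan": ["Yes", "No", "Yes"],            # Yes/No
})

# 1. Range of numbers -> number: here we take the minimum of each range
df["age_min"] = df["age_range"].str.split("-").str[0].astype(int)

# 2. Ordinal -> integer, using an explicit mapping
df["quality_num"] = df["quality"].map({"good": 1, "better": 2, "best": 3})

# 3. Nominal -> one-hot encoding (one binary column per category)
df = pd.concat([df, pd.get_dummies(df["color"], prefix="color")], axis=1)

# 4. Yes/No -> 0/1
df["is_fan_01"] = (df["is_fan"] == "Yes").astype(int)
```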
What’s our game plan for the Star Wars columns?
First: Limit the data to only people who answered “Yes” to the question “Have you seen any of the 6 films in the Star Wars franchise?”.
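The filtering step might look like this; `seen_any_film` is a placeholder name, so substitute whatever your data calls the "Have you seen any of the 6 films" column:

```python
import pandas as pd

# Toy stand-in for the survey data (column name is hypothetical)
dat = pd.DataFrame({
    "seen_any_film": ["Yes", "No", "Yes"],
    "age": ["18-29", "30-44", "45-60"],
})

# Keep only respondents who answered "Yes"
dat = dat[dat["seen_any_film"] == "Yes"]
```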
Then: Use the table below as a guide to prepare your data for machine learning.
Column | Original Format | Convert To |
---|---|---|
age | category (ordinal, age ranges) | number |
income | category (ordinal, income ranges) | number |
education | category (ordinal, name of degree) | number |
shot_first | category (nominal) | one-hot |
gender | category (nominal) | one-hot |
location | category (nominal) | one-hot |
fan_star_wars | Yes/No | 0/1 |
expanded_universe | Yes/No | 0/1 |
fan_exapanded | Yes/No | 0/1 |
fan_star_trek | Yes/No | 0/1 |
seen_i | Yes/No (name of movie/NaN) | 0/1 |
seen_ii | Yes/No (name of movie/NaN) | 0/1 |
seen_iii | Yes/No (name of movie/NaN) | 0/1 |
seen_iv | Yes/No (name of movie/NaN) | 0/1 |
seen_v | Yes/No (name of movie/NaN) | 0/1 |
seen_vi | Yes/No (name of movie/NaN) | 0/1 |
movie rankings | number | - |
character rankings | category (ordinal) | one-hot or factorize |
What functions can we use to convert the categorical columns to numeric?
- Range of numbers: str.split() and astype()
- Ordinal: str.replace()
- Ordinal: pd.factorize() (can also be used for True/False)
- Nominal: pd.get_dummies()
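A quick sketch of each function in action, using made-up values shaped like the survey responses:

```python
import pandas as pd

# Range of numbers: strip symbols with str.replace(), split on the dash,
# keep the low end of the range, then convert with astype()
s_income = pd.Series(["$25,000 - $49,999", "$50,000 - $99,999"])
low = (s_income.str.replace("$", "", regex=False)
               .str.replace(",", "", regex=False)
               .str.split(" - ").str[0]
               .astype(int))

# Ordinal: pd.factorize() assigns an integer to each unique value
# (integers follow order of first appearance, not necessarily the true order)
s_edu = pd.Series(["High school", "Bachelor degree", "High school"])
codes, labels = pd.factorize(s_edu)

# Nominal: pd.get_dummies() creates one binary column per category
s_gender = pd.Series(["Male", "Female"])
dummies = pd.get_dummies(s_gender, prefix="gender")
```

Note that `factorize()` numbers categories in the order they appear, so for a true ordinal column an explicit mapping (e.g. with `str.replace()` or `map()`) gives you control over which integer each category gets.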
Question: When and why would we drop the first column when we convert a category using pd.get_dummies()?
Answer: Whenever your algorithm needs to calculate a matrix inverse.
The one-hot encoding creates one binary variable for each category.
The problem is that this representation includes redundancy. For example, if we know that [1, 0, 0] represents “blue” and [0, 1, 0] represents “green”, we don’t need a third binary variable for “red”: a 0 in both the “blue” and “green” columns, i.e. [0, 0], already identifies it.
This is called a dummy variable encoding, which always represents C categories with C-1 binary variables. In addition to being less redundant, a dummy variable representation is required for some models.
For example, in the case of a linear regression model (and other regression models that have a bias term), a one-hot encoding will cause the matrix of input data to become singular, meaning it cannot be inverted and the linear regression coefficients cannot be calculated using linear algebra. For these types of models a dummy variable encoding must be used instead.
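In pandas, the switch from one-hot to dummy encoding is the `drop_first=True` argument. A small sketch:

```python
import pandas as pd

colors = pd.Series(["blue", "green", "red"])

# One-hot: C binary columns for C categories
one_hot = pd.get_dummies(colors)

# Dummy encoding: C-1 columns; the dropped category ("blue", the first
# alphabetically) is represented by zeros in every remaining column
dummy = pd.get_dummies(colors, drop_first=True)
```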
Predicting income.
Grand Question 4 wants us to “build a machine learning model that predicts whether a person makes more than $50k”.
Aka, what is our “outcome” or “response” that we want to predict?
dat_ml.income > 50000
Remember not to include the answer (income) in your features!
x = dat_ml.drop(columns=['income'])
The response needs to be saved as a 0/1 variable (at least, for binary classification algorithms).
y = (dat_ml.income > 50000) * 1
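Putting the last few steps together on a toy stand-in for `dat_ml` (the feature columns here are assumed, following the table above):

```python
import pandas as pd

# Toy version of dat_ml with income already converted to a number
dat_ml = pd.DataFrame({
    "income": [25000, 75000, 150000],
    "age": [24, 37, 52],
    "fan_star_wars": [1, 0, 1],
})

# Features: everything except the answer we are trying to predict
X = dat_ml.drop(columns=["income"])

# Target: 1 if the person makes more than $50k, else 0
y = (dat_ml.income > 50000) * 1
```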