Day 16: Using scikit-learn for machine learning

Import dwellings_ml.csv and write a short sentence describing your data. Remember to explain what one observation is and what measurements we have on that observation.

Now try describing the modeling (machine learning) we are going to do in terms of features and targets.

A. Are there any columns that are the target in disguise?
B. Are the observational units unique in every row?
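One way to probe questions A and B is sketched below on a tiny made-up table, not the real dwellings_ml data. The `toy` frame, its values, and the use of `parcel` as the identifier column are assumptions for illustration only; swap in the real columns once the file is loaded.

```python
import pandas as pd

# Made-up stand-in for dwellings_ml; names and values are illustrative only.
toy = pd.DataFrame({
    "parcel": ["p1", "p2", "p2", "p3"],
    "yrbuilt": [1965, 1992, 1992, 1978],
    "livearea": [1200, 2100, 2100, 1500],
    "before1980": [1, 0, 0, 1],
})

# B. Count rows whose identifier already appeared earlier in the table.
n_duplicates = toy["parcel"].duplicated().sum()
print("repeated parcels:", n_duplicates)

# A. Check whether one column alone reproduces the target exactly.
disguised = ((toy["yrbuilt"] < 1980).astype(int) == toy["before1980"]).all()
print("yrbuilt gives away the target:", disguised)
```

If a single column reproduces the target this way, leaving it in the features would let the model "cheat" rather than learn from the other measurements.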

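# the full imports
import pandas as pd
import numpy as np
import seaborn as sns
import altair as alt

# %%
# load the data (assumes dwellings_ml.csv is in the working directory)
dwellings_ml = pd.read_csv("dwellings_ml.csv")

# %%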
# keep a few columns and sample 500 homes so the plots stay readable
h_subset = dwellings_ml.filter(['livearea', 'finbsmnt',
    'basement', 'nocars', 'numbdrm', 'numbaths',
    'stories', 'yrbuilt', 'before1980']).sample(500)

sns.pairplot(h_subset, hue = 'before1980')

# %%
# correlations among the features, with the target left out
corr = h_subset.drop(columns = 'before1980').corr()
sns.heatmap(corr)

What features of homes might have changed a bit over time?

square footage
number of bathrooms
basement size

Let’s create one chart using some of these variables.

What is scikit-learn?

The scikit-learn About page covers the project's history and funding. With that backing, it should stay king of the hill among Python machine learning libraries for a long time.

Simple and efficient tools for predictive data analysis
Accessible to everybody, and reusable in various contexts
Built on NumPy, SciPy, and matplotlib
Open source, commercially usable - BSD license

Should I import scikit-learn?

scikit-learn is very large, with many submodules. To help readers of your .py script see exactly which parts you use, the convention is to import with `from ... import ...` rather than importing the whole package.

from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

What does the `train_test_split()` function do?

X_train, X_test, y_train, y_test = train_test_split(
    X_pred, 
    y_pred, 
    test_size = .34, 
    random_state = 76)

Read the documentation and tell me what is returned.

Function documentation

Why do we use `test_size` and `random_state`?
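The sketch below shows the two arguments at work on made-up arrays. `X_demo` and `y_demo` are hypothetical stand-ins, not the housing data. It prints the sizes of the returned pieces and checks that repeating the call with the same `random_state` gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 3 features, and a binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Four objects come back: train/test features, then train/test targets.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=.34, random_state=76)

# test_size sets the share of rows held out for testing.
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Repeating the call with the same random_state reproduces the split.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X_demo, y_demo, test_size=.34, random_state=76)
same_split = np.array_equal(X_test, X_test2)
print("same split:", same_split)
```

Changing `random_state` to a different integer, or leaving it out, shuffles the rows differently, so results are no longer comparable from run to run.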

What are `X_pred` and `y_pred` in the above function example?

We need to take our data and build the feature and target data objects.

What columns should we remove from our features (X)?

What column should we use as our target (y)?
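As a rough sketch of building the two objects, the example below uses a small made-up table. The `toy` frame and the choice of which columns to drop are assumptions to discuss, not the settled answer for dwellings_ml.

```python
import pandas as pd

# Made-up stand-in for dwellings_ml; names and values are illustrative only.
toy = pd.DataFrame({
    "parcel": ["p1", "p2", "p3", "p4"],
    "yrbuilt": [1965, 1992, 2001, 1978],
    "livearea": [1200, 2100, 2400, 1500],
    "numbaths": [1, 3, 3, 2],
    "before1980": [1, 0, 0, 1],
})

# Features: drop the target, the identifier, and any column that gives the target away.
X_pred = toy.drop(columns=["before1980", "yrbuilt", "parcel"])

# Target: the single column we want to predict.
y_pred = toy["before1980"]

print(X_pred.columns.tolist())
print(y_pred.name)
```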

Let’s try a decision tree

from sklearn import tree

What method do we want from `tree`?

What is our target?
Don’t forget to set your arguments.

After we have our machine learning method, what do we want to do with the method?

fit
predict
evaluate metrics

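# create the classifier with default settings
clf = tree.DecisionTreeClassifier()
# fit: learn the splits from the training data
clf = clf.fit(X_train, y_train)
# predict: class labels for the held-out test rows
# (note: this reuses the name y_pred, which held the full target in the split example above)
y_pred = clf.predict(X_test)
# class probabilities for each test row, one column per class
y_probs = clf.predict_proba(X_test)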
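The third step, evaluating metrics, can be sketched with the `metrics` module imported earlier. The data here is made up (`X_demo`, `y_demo`), and accuracy plus a confusion matrix are just one reasonable pair of metrics to start with, not the only choice.

```python
import numpy as np
from sklearn import metrics, tree
from sklearn.model_selection import train_test_split

# Made-up data: the target depends only on the sign of the first feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=.34, random_state=76)

# fit and predict, as in the steps above
clf = tree.DecisionTreeClassifier(random_state=76)
clf = clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

# evaluate: compare predictions with the true test labels
accuracy = metrics.accuracy_score(y_test, predictions)
confusion = metrics.confusion_matrix(y_test, predictions)
print("accuracy:", accuracy)
print(confusion)
```

The confusion matrix has true classes on the rows and predicted classes on the columns, so the off-diagonal cells count the two kinds of mistakes.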
