Day 16: Using scikit-learn for machine learning

  1. Import dwellings_ml.csv and write a short sentence describing your data. Remember to explain an observation and what measurements we have on that observation.
  2. Now try describing the modeling (machine learning) we are going to do in terms of features and targets. A. Are there any columns that are the target in disguise? B. Are the observational units unique in every row?
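One way to check questions A and B is sketched below. It runs on a tiny made-up table rather than the real file. The `parcel` identifier column and all values are assumptions for illustration, and the sketch assumes `before1980` is derived from `yrbuilt`.

```python
import pandas as pd

# Hypothetical rows; 'parcel' is an assumed identifier column,
# and every value here is made up for illustration.
homes = pd.DataFrame({
    'parcel': ['A1', 'A2', 'A2', 'A3'],
    'yrbuilt': [1955, 2004, 2004, 1972],
    'before1980': [1, 0, 0, 1],
})

# B. Count identifiers that show up on more than one row.
n_repeated = homes['parcel'].duplicated().sum()

# A. A column is the target in disguise if it reproduces the target exactly.
rebuilt_target = (homes['yrbuilt'] < 1980).astype(int)
disguised = (rebuilt_target == homes['before1980']).all()

print(n_repeated, disguised)
```

If `disguised` is true on the real data, the column should not be used as a feature.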
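# the full imports
import pandas as pd
import numpy as np
import seaborn as sns
import altair as alt

# load the data (adjust the path to where you saved the file)
dwellings_ml = pd.read_csv('dwellings_ml.csv')
# %%
# filter() silently skips any listed column that is not in the data
h_subset = dwellings_ml.filter(['livearea', 'finbsmnt',
    'basement', 'yearbuilt', 'nocars', 'numbdrm', 'numbaths',
    'stories', 'yrbuilt', 'before1980']).sample(500)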

sns.pairplot(h_subset, hue = 'before1980')

corr = h_subset.drop(columns = 'before1980').corr()
# %%
sns.heatmap(corr)

  • square footage
  • number of bathrooms
  • basement size

Let’s create one chart using some of these variables.

What is scikit-learn?

The About scikit-learn page describes the project's history and funding. With that backing, it is likely to remain a leading Python machine learning library for a long time.

  • Simple and efficient tools for predictive data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built on NumPy, SciPy, and matplotlib
  • Open source, commercially usable - BSD license

Should I import scikit-learn?

scikit-learn is very large, with many submodules. To help readers of your .py script understand your code, the convention is to import only what you need using from ... import ... rather than importing the whole package.

from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

What does the train_test_split() function do?

X_train, X_test, y_train, y_test = train_test_split(
    X_pred, 
    y_pred, 
    test_size = .34, 
    random_state = 76)   

Read the documentation. What does the function return?

Function documentation

Why do we use test_size and random_state?
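A small experiment can help answer the questions above. This is a minimal sketch on made-up NumPy data, not the dwellings data. The array names and sizes are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows, 3 features, a binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# train_test_split returns four objects, in this order.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.34, random_state=76)

# test_size sets the share of rows held out for testing (about 34% here).
print(X_train.shape, X_test.shape)

# Repeating the call with the same random_state picks the same rows,
# which makes results reproducible from run to run.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=.34, random_state=76)
same_split = np.array_equal(X_test, X_test2)
print(same_split)
```

Changing `random_state` to another integer gives a different, but equally repeatable, split.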

What are X_pred and y_pred in the above function example?

We need to take our data and build the feature and target data objects.

What columns should we remove from our features (X)?

What column should we use as our target (y)?
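One possible answer is sketched below on a tiny stand-in table. The values are made up. The sketch assumes `before1980` is the target and treats `yrbuilt` as a column that gives the target away, so both are dropped from the features.

```python
import pandas as pd

# Hypothetical stand-in for dwellings_ml; values are made up for illustration.
dwellings_ml = pd.DataFrame({
    'livearea': [1200, 2400, 900, 3100],
    'numbaths': [1, 3, 1, 4],
    'yrbuilt': [1955, 2004, 1972, 1999],
    'before1980': [1, 0, 1, 0],
})

# Features (X): drop the target and any column that reveals it.
X_pred = dwellings_ml.drop(columns=['before1980', 'yrbuilt'])

# Target (y): the single column we want to predict.
y_pred = dwellings_ml['before1980']

print(X_pred.columns.tolist())
```

On the real file, any identifier column would likely need to be dropped from the features as well.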

Let’s try a decision tree

from sklearn import tree

What method do we want from tree?

  • What is our target?
  • Don’t forget to set your arguments.

After we have our machine learning method, what do we want to do with the method?

  1. fit
  2. predict
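  3. evaluate metrics

# create the classifier, then fit it on the training data
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
# predicted classes for the held-out rows
# (note: this overwrites the y_pred target passed to train_test_split above)
y_pred = clf.predict(X_test)
# predicted class probabilities for the same rows
y_probs = clf.predict_proba(X_test)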
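The third step, evaluating metrics, might look like the sketch below. It uses made-up data so that it runs on its own. The toy arrays and the name `y_hat` are assumptions for illustration, and the metrics shown are only a few of those available in `sklearn.metrics`.

```python
import numpy as np
from sklearn import metrics, tree
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): the class depends on the sum of two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.34, random_state=76)

# 1. fit  2. predict
clf = tree.DecisionTreeClassifier(random_state=76)
clf = clf.fit(X_train, y_train)
y_hat = clf.predict(X_test)

# 3. evaluate: compare predictions with the held-out truth.
accuracy = metrics.accuracy_score(y_test, y_hat)
confusion = metrics.confusion_matrix(y_test, y_hat)

print(accuracy)
print(confusion)
print(metrics.classification_report(y_test, y_hat))
```

Each metric takes the true values first and the predictions second, so keeping the predictions under their own name avoids mixing them up with the target.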