Day 2: Intro to Machine Learning

Welcome to class!

[Image: Shire Reckoning]

Announcements

  1. Coding Challenge Practice - Thursday, March 7

Spiritual thought

Are facts true?

  • How do you distinguish between truth and error?
  • Joshua and Caleb

Building a Decision Tree

Splitting the Data

1. Start with packages and data set

We’ll be using parts of the scikit-learn (sklearn) package and the Seaborn package.

# If you haven't already, install scikit-learn and seaborn.
# Run this in a terminal (not inside the Python script):
#   pip install scikit-learn seaborn

import pandas as pd
import altair as alt
import numpy as np
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

What is the difference between dwellings_denver.csv and dwellings_ml.csv?
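One quick way to see the difference is to load both files and compare their columns. A minimal sketch, assuming you have downloaded both CSV files into your working directory (adjust the paths to wherever you saved them):

dwellings_denver = pd.read_csv("dwellings_denver.csv")
dwellings_ml = pd.read_csv("dwellings_ml.csv")

# Columns that only appear in the machine-learning version
print(set(dwellings_ml.columns) - set(dwellings_denver.columns))

# Columns that only appear in the original Denver data
print(set(dwellings_denver.columns) - set(dwellings_ml.columns))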

2. Choose which variables to use

How do we know which variables to use out of dwellings_ml.csv?

Question 1 will help you identify patterns (or lack of patterns) in the data.

3. Separate into features and target

Which Features?

# %%
# Look at a random sample of 500 homes so the pairplot stays readable
h_subset = dwellings_ml.filter(['livearea', 'finbsmnt', 
    'basement', 'yearbuilt', 'nocars', 'numbdrm', 'numbaths', 
    'stories', 'yrbuilt', 'before1980']).sample(500)

sns.pairplot(h_subset, hue = 'before1980')

# %%
# Correlation matrix of the features (target column dropped), shown as a heatmap
corr = h_subset.drop(columns = 'before1980').corr()
sns.heatmap(corr)

4. Split into training and testing sets

What does the “train_test_split()” function do?

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = #???, random_state = #???)

Read the documentation and tell me what is returned.

Function documentation

Why do we use “test_size” and “random_state”?

What are “x” and “y” in the above function example?

We need to take our data and build the feature and target data objects.

What columns should we remove from our features (X)?

What column should we use as our target (y)?

x = dwellings_ml.filter([#what variables will you use as "features"?])
y = dwellings_ml[#what variable is the "target"?]
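As a worked sketch only (the feature columns and the 0.3 / 76 values below are illustrative assumptions, not the required answer):

# Illustrative feature choice; pick your own based on the pairplot and heatmap above
x = dwellings_ml.filter(['livearea', 'stories', 'nocars', 'numbaths', 'numbdrm'])
y = dwellings_ml['before1980']

# Hold out 30% of the rows for testing; a fixed random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 76)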


Training a Classifier

Decision Tree Example


#%%
# Create a decision tree
classifier_DT = DecisionTreeClassifier(max_depth = 4)

# Fit the decision tree
classifier_DT.fit(x_train, y_train)

# Test the decision tree (make predictions)
y_predicted_DT = classifier_DT.predict(x_test)

# Evaluate the decision tree
print("Accuracy:", metrics.accuracy_score(y_test, y_predicted_DT))

How to Improve Accuracy

To improve the accuracy of your model, you could:

  • Change what variables are used in the features (x) data set
  • Change what type of model you are using
  • Tune (aka, “change” or “tweak”) the parameters of the model (see the sketch below)
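For example, here is a minimal tuning sketch that compares a few max_depth settings for the decision tree (the specific depth values are just illustrative):

# Compare test accuracy for a few tree depths
for depth in [2, 4, 8, 16]:
    model = DecisionTreeClassifier(max_depth = depth)
    model.fit(x_train, y_train)
    accuracy = metrics.accuracy_score(y_test, model.predict(x_test))
    print("max_depth =", depth, "accuracy =", accuracy)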

Other Classification Models

Here are some other models you could try.

from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
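As a minimal sketch, each of these can be swapped in with the same fit/predict/score pattern used for the decision tree above (the random_state value is just an illustrative choice):

# Try several model types and compare their test accuracy
models = {
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state = 76),
    "Gradient Boosting": GradientBoostingClassifier(random_state = 76),
}

for name, model in models.items():
    model.fit(x_train, y_train)
    accuracy = metrics.accuracy_score(y_test, model.predict(x_test))
    print(name, "accuracy:", accuracy)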


Make Progress on Project 4

Machine Learning Introduction

  • Step-by-step guide (mostly) for training a GaussianNB classifier. (The steps will be the same for any algorithm you use.)

Visual Introduction to Machine Learning

  1. Machine learning identifies patterns using statistical learning and computers by unearthing boundaries in data sets. You can use it to make predictions.
  2. One method for making predictions is called a decision tree, which uses a series of if-then statements to identify boundaries and define patterns in the data.
  3. Overfitting happens when some boundaries are based on distinctions that don’t make a difference. You can see if a model overfits by having test data flow through the model.
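One practical way to check for overfitting, as an illustrative sketch reusing the decision tree trained above, is to compare accuracy on the training data with accuracy on the held-out test data:

# A large gap between training and test accuracy suggests the tree is overfitting
train_accuracy = metrics.accuracy_score(y_train, classifier_DT.predict(x_train))
test_accuracy = metrics.accuracy_score(y_test, classifier_DT.predict(x_test))
print("Train accuracy:", train_accuracy, "Test accuracy:", test_accuracy)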

The goal of Grand Question 1 is to help us with “feature selection”.

  • “Overfitting” happens when some boundaries are based on distinctions that don’t make a difference.
  • More data does not always lead to better models. (Occam’s Razor)

Common questions:

MaxRowsError: How can I plot Large Datasets?

Altair raises a MaxRowsError when a chart contains more than 5,000 rows. The best way around this is to look at a sub-sample of the data for exploratory purposes. For example, you can use “sample(500)”. You may also save the data to a local filesystem and reference it by file path, or tell Altair to disable the max-rows check:

# Option 1: tell Altair not to enforce the 5,000-row limit
alt.data_transformers.disable_max_rows()

# Option 2: work with a sub-sample that stays under the limit
subset_data = denver.sample(n = 4999)