Day 2: Intro to Machine Learning

Welcome to class!

Announcements

Spiritual thought

Don’t wait for inspiration.

What is Machine Learning?

Splitting the Data

1. Start with a data set

What is the difference between dwellings_denver.csv and dwellings_ml.csv?

2. Choose which variables to use

How do we know which variables to use out of dwellings_ml.csv?

Question 1 will help you identify patterns (or lack of patterns) in the data.

3. Separate into features and target

x = dwellings_ml.filter([#what variables will you use as "features"?])
y = dwellings_ml[#what variable is the "target"?]
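
As a minimal sketch of this step, here is the same pattern on a tiny made-up table. The DataFrame `toy` and all of its column names are hypothetical stand-ins, not the real columns of dwellings_ml.csv and not a suggestion of which variables to pick.

```python
import pandas as pd

# Hypothetical stand-in for dwellings_ml; every column name is made up.
toy = pd.DataFrame({
    "sqft": [900, 1500, 2100, 1200],
    "bedrooms": [2, 3, 4, 3],
    "id": ["a", "b", "c", "d"],
    "is_old": [1, 0, 0, 1],
})

# Features: keep only the columns the model is allowed to learn from.
x = toy.filter(["sqft", "bedrooms"])

# Target: the single column the model should predict.
y = toy["is_old"]

print(x.shape)  # (4, 2)
print(y.shape)  # (4,)
```

Note that `x` is a DataFrame (two dimensions) while `y` is a single column (one dimension).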

4. Split into training and testing sets

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = #???, random_state = #???)
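
To see what the two blanks control, here is a small sketch on made-up data. The values `test_size=0.25` and `random_state=42` are arbitrary choices for illustration, not recommended answers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 feature columns, and a 0/1 target.
x = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# test_size=0.25 holds out 25% of the rows for testing.
# random_state fixes the shuffle so the split is the same on every run.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42
)

print(x_train.shape, x_test.shape)  # (15, 2) (5, 2)
```

Running the split again with the same `random_state` returns the same rows in each set, which makes results easier to compare between runs.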

Training a Classifier

Decision Tree Example

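# imports needed for this example
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# create the model
classifier = DecisionTreeClassifier()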

# train the model
classifier.fit(x_train, y_train)

# make predictions
y_predictions = classifier.predict(x_test)

# test how accurate predictions are
metrics.accuracy_score(y_test, y_predictions)

How to Improve Accuracy

To improve the accuracy of your model, you could:

Change what variables are used in the features (x) data set
Change what type of model you are using
Tune (aka, “change” or “tweak”) the parameters of the model
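
The third option can be sketched as a simple loop over one parameter. This example assumes synthetic data from `make_classification` in place of the dwellings data, and it picks `max_depth` as the parameter to vary only for illustration.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the dwellings data set.
x, y = make_classification(n_samples=500, n_features=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

# Try a few values of one parameter and record the accuracy for each.
scores = {}
for depth in [2, 4, 8, None]:  # None means no limit on tree depth
    classifier = DecisionTreeClassifier(max_depth=depth, random_state=0)
    classifier.fit(x_train, y_train)
    y_predictions = classifier.predict(x_test)
    scores[depth] = metrics.accuracy_score(y_test, y_predictions)

for depth, score in scores.items():
    print(f"max_depth={depth}: accuracy={score:.3f}")
```

In practice, a separate validation set or cross-validation is the usual way to choose the value, so that the test set is only used for the final check.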

Other Classification Models

Here are some other models you could try.

from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
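
Because these models share the same `fit` and `predict` methods, swapping one for another takes very little code. Here is a rough sketch, again assuming synthetic data rather than the dwellings data; the default settings are used for every model.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data; not the dwellings data set.
x, y = make_classification(n_samples=300, n_features=6, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=1
)

models = {
    "GaussianNB": GaussianNB(),
    "RandomForest": RandomForestClassifier(random_state=1),
    "GradientBoosting": GradientBoostingClassifier(random_state=1),
}

# Train each model on the same split and score it the same way.
accuracies = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    accuracies[name] = metrics.accuracy_score(y_test, model.predict(x_test))
    print(name, round(accuracies[name], 3))
```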

Make Progress on Project 4

Do the project readings

Machine Learning Introduction

Step-by-step guide (mostly) for training a GaussianNB classifier. (The steps will be the same for any algorithm you use.)

Visual Introduction to Machine Learning

Machine learning identifies patterns using statistical learning and computers by unearthing boundaries in data sets. You can use it to make predictions.
One method for making predictions is called a decision tree, which uses a series of if-then statements to identify boundaries and define patterns in the data.
Overfitting happens when some boundaries are based on distinctions that don’t make a difference. You can see if a model overfits by having test data flow through the model.
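
One way to see the overfitting check in code is to compare accuracy on the training data with accuracy on the test data. This is a minimal sketch assuming noisy synthetic data (`flip_y=0.2` randomly reassigns about 20% of the labels); a large gap between the two scores is a common sign of overfitting.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data; not the dwellings data set.
x, y = make_classification(
    n_samples=400, n_features=10, flip_y=0.2, random_state=0
)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

# A tree with no depth limit can keep splitting until it memorizes the noise.
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(x_train, y_train)

train_accuracy = metrics.accuracy_score(y_train, classifier.predict(x_train))
test_accuracy = metrics.accuracy_score(y_test, classifier.predict(x_test))

print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```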

Start working on Question 1

The goal of Grand Question 1 is to help us with “feature selection”.

“Overfitting” happens when some boundaries are based on distinctions that don’t make a difference.
More data (for example, more features) does not always lead to better models. (Occam’s Razor)

Common questions:

What is the 5000 rows error with Altair?

MaxRowsError: How can I plot Large Datasets?

You may also save data to a local filesystem and reference the data by file path. Altair has a JSON data transformer that will do this transparently when enabled:

alt.data_transformers.enable('json')

scikit-learn resources

Home page
Tutorials
Getting Started: What do you notice about the header portion of each of the script chunks?
- import vs from ... import
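
As a quick illustration of the two import styles, here is a small sketch. The labels in `y_true` and `y_pred` are made up.

```python
# Style 1: import the module, then reach into it with dot notation.
import sklearn.metrics

# Style 2: import one name directly so it can be used without a prefix.
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

# Both lines call the same function and give the same result.
print(sklearn.metrics.accuracy_score(y_true, y_pred))  # 0.75
print(accuracy_score(y_true, y_pred))                  # 0.75
```

The first style makes it obvious where a function comes from; the second keeps later lines shorter.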