Day 15: Understanding the ‘Pavlovian Bell’ in Machine Learning

Read the project overview and the questions for understanding.

The big ML picture

“AI is able to learn ‘rules’ from highly repetitive data.” (Sebastian Thrun)

“The single most important thing for AI to accomplish in the next ten years is to free us from the burden of repetitive work.” (Sebastian Thrun)

Understanding Classification in Machine Learning

As we review the following material, be prepared to address the following questions:

  • What is the difference between a feature and a target?
  • What does it mean to classify?
  • What does it mean to create a machine learning model?
  • What does it mean to find ‘boundaries’ in our variables or features?
  • How does finding ‘boundaries’ help us in ML?
  • What is a histogram?
  • What are false positives?
  • What are false negatives?
  • What is accuracy?
  • What is training data?
  • What is test data?
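
Several of these terms (false positives, false negatives, accuracy) can be pinned down with a toy example. The labels below are invented, and the sketch uses sklearn.metrics, which we import later in this lesson:

# toy labels: counting false positives/negatives and accuracy
from sklearn.metrics import confusion_matrix, accuracy_score

actual    = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]

# confusion_matrix rows = actual, columns = predicted (0 then 1)
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(fp, fn)                              # 1 false positive, 1 false negative
print(accuracy_score(actual, predicted))   # 4 of 6 correct = 0.67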

Visual Introduction to Machine Learning

  1. Machine learning identifies patterns by using statistical learning and computers to unearth boundaries in data sets. You can use those boundaries to make predictions.
  2. One method for making predictions is the decision tree, which uses a series of if-then statements to identify boundaries and define patterns in the data (see the sketch after this list).
  3. Overfitting happens when some boundaries are based on distinctions that don’t make a difference. You can check whether a model overfits by having test data flow through the model.
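
To make the if-then idea concrete, here is a minimal hand-written sketch. The thresholds are invented for illustration (not fitted from any data), and livearea and stories simply echo column names we will meet later in the project data.

# a hand-written "decision tree": each if-then is a boundary
# (thresholds invented for illustration, not fitted from data)
def classify_home(livearea, stories):
    if livearea > 2000:        # boundary on living area
        return "after 1980"
    elif stories >= 2:         # boundary on number of stories
        return "after 1980"
    return "before 1980"

print(classify_home(livearea=2500, stories=1))   # after 1980
print(classify_home(livearea=1200, stories=1))   # before 1980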

Evaluating Machine Learning Models

As we review the following material, be prepared to address the following questions:

  • What is bias?
  • What is variance?
  • What does the ‘minimum node size’ impact?

Bias-Variance Tradeoff

  • The bias error is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
  • The variance is an error from sensitivity to small fluctuations in the training set. High variance may result from an algorithm modeling the random noise in the training data (overfitting).
  1. Models approximate real-life situations using limited data.
  2. In doing so, errors can arise due to assumptions that are overly simple (bias) or overly complex (variance).
  3. Building models is about striking a balance between the two, as the sketch below shows.
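
To see the tradeoff in action, here is a minimal sketch on synthetic data (not the project data). A depth-1 tree underfits (high bias), while an unconstrained tree memorizes the training noise (high variance) and scores worse on held-out rows.

# compare a shallow tree and an unconstrained tree on noisy
# synthetic data: training accuracy vs. test accuracy
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)  # noisy target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

for depth in [1, None]:  # shallow vs. unconstrained
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    print(depth,
          round(model.score(X_train, y_train), 2),  # training accuracy
          round(model.score(X_test, y_test), 2))    # test accuracy

In scikit-learn, min_samples_leaf acts like a minimum node size: raising it prunes away tiny, noise-driven splits, trading variance for bias.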

Reviewing the reward/penalty in Machine Learning

What is the ‘Pavlovian bell’ in the machine learning model?

The ‘bell’ is a mathematical penalty/reward equation: a loss or scoring function that rewards correct predictions and penalizes mistakes, and that the algorithm tries to optimize during training.
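
As a toy illustration (a hand-rolled score, not any particular library’s loss function), one possible ‘bell’ rewards each correct prediction and penalizes each miss:

# a hand-rolled penalty/reward score: +1 for each correct
# prediction, -1 for each miss (one of many possible "bells")
actual    = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 1]

score = sum(1 if a == p else -1 for a, p in zip(actual, predicted))
print(score)  # 3 rewards - 2 penalties = 1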

If your model is near perfect in its predictions, you might be cheating, often because a feature is leaking the target into the model.

Watch out for transactional data!

  • Financial: orders, invoices, payments
  • Work: plans, activity records
  • School: grades

scikit-learn

Using our project data to understand features, targets, and samples.

Getting our packages for project 4

# install packages
import sys
!{sys.executable} -m pip install seaborn scikit-learn
  1. Import dwellings_ml.csv and write a short sentence describing your data. Remember to explain an observation and what measurements we have on that observation.
# the full imports
import pandas as pd 
import numpy as np
import seaborn as sns
import altair as alt
# the from imports
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
# read in the three Denver dwellings tables from GitHub
dwellings_denver = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_denver/dwellings_denver.csv")
dwellings_ml = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_ml/dwellings_ml.csv")
dwellings_neighborhoods_ml = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_neighborhoods_ml/dwellings_neighborhoods_ml.csv")

# store Altair data as json files to avoid the 5,000-row limit
alt.data_transformers.enable('json')

  2. Now try describing the modeling (machine learning) we are going to do in terms of features and targets. A. Are there any columns that are the target in disguise? B. Are the observational units unique in every row? (The sketch below shows one way to start checking both.)
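
One way to start checking both questions is sketched below. It assumes dwellings_ml has a parcel id column and that before1980 was derived from yrbuilt; adjust the names if your data differs.

# %%
# A: yrbuilt is the target in disguise. If before1980 was built
# from it, the two columns should agree on (nearly) every row.
print(((dwellings_ml.yrbuilt < 1980) == (dwellings_ml.before1980 == 1)).mean())

# B: does the observational unit repeat? (assumes 'parcel'
# uniquely identifies a dwelling)
print(dwellings_ml.parcel.duplicated().sum())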

# %%
# sample 500 rows of a few candidate features plus the target
h_subset = dwellings_ml.filter(['livearea', 'finbsmnt',
    'basement', 'nocars', 'numbdrm', 'numbaths',
    'stories', 'yrbuilt', 'before1980']).sample(500)

# pairwise scatterplots, colored by the target
sns.pairplot(h_subset, hue='before1980')

# correlation matrix of the features (target removed)
corr = h_subset.drop(columns='before1980').corr()
# %%
sns.heatmap(corr)
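
With the columns in hand, the from-imports above come into play: split the features (X) from the target (y), hold out test data, and fit a first model. This is a minimal sketch; it assumes dwellings_ml is fully numeric apart from the parcel id, and that parcel and yrbuilt (the target in disguise) should be dropped from the features.

# %%
# features: everything except the target, the id, and the leak
X = dwellings_ml.drop(columns=['before1980', 'parcel', 'yrbuilt'])
y = dwellings_ml.before1980

# hold out a third of the rows as test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, random_state=76)

# fit a simple classifier and score it on the held-out data
model = GaussianNB().fit(X_train, y_train)
print(metrics.accuracy_score(y_test, model.predict(X_test)))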