Read the project overview and Questions for understanding.
- Our 2-week project details
- Questions?
The big ML picture
"AI is able to learn 'rules' from highly repetitive data." (Sebastian Thrun)
"The single most important thing for AI to accomplish in the next ten years is to free us from the burden of repetitive work." (Sebastian Thrun)
Understanding Classification in Machine Learning
As we review the following material, be prepared to address the following questions:
- What is the difference between a feature and a target?
- What does it mean to classify?
- What does it mean to create a machine learning model?
- What does it mean to find ‘boundaries’ in our variables or features?
- How does finding ‘boundaries’ help us in ML?
- What is a histogram?
- What are false positives?
- What are false negatives?
- What is accuracy?
- What is training data?
- What is test data?
- Machine learning identifies patterns by using statistical learning and computers to unearth boundaries in data sets. You can use those boundaries to make predictions.
- One method for making predictions is called a decision tree, which uses a series of if-then statements to identify boundaries and define patterns in the data.
- Overfitting happens when some boundaries are based on distinctions that don’t make a difference. You can see if a model overfits by having test data flow through the model; the sketch below shows this check in code.
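A minimal sketch of these three ideas, using scikit-learn's DecisionTreeClassifier on made-up toy data (the column names only echo the dwellings data we load later; every value below is invented):

```python
# a minimal sketch: a decision tree learns if-then 'boundaries' from data,
# and held-out test data reveals overfitting; all values are made up
import pandas as pd
from sklearn import metrics, tree
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "livearea":   [900, 1200, 1500, 2100, 2600, 3100, 800, 1750, 2900, 1100],
    "numbaths":   [1, 1, 2, 2, 3, 3, 1, 2, 3, 1],
    "before1980": [1, 1, 1, 0, 0, 0, 1, 0, 0, 1],
})

X = df[["livearea", "numbaths"]]  # features
y = df["before1980"]              # target

# hold out test data the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=76)

clf = tree.DecisionTreeClassifier(random_state=76).fit(X_train, y_train)

# accuracy plus the confusion matrix (false positives and false negatives)
y_pred = clf.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))
```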
Evaluating Machine Learning Models
As we review the following material, be prepared to address the following questions:
- What is bias?
- What is variance?
- What does the ‘minimum node size’ impact?
- The bias error is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
- The variance is an error from sensitivity to small fluctuations in the training set. High variance may result from an algorithm modeling the random noise in the training data (overfitting).
- Models approximate real-life situations using limited data.
- In doing so, errors can arise due to assumptions that are overly simple (bias) or overly complex (variance).
- Building models is about making sure there’s a balance between the two; the sketch below illustrates the trade-off using a decision tree’s minimum node size.
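A minimal sketch of that balance, assuming the X_train/X_test/y_train/y_test split from the earlier toy example; min_samples_leaf is scikit-learn's rough analog of "minimum node size":

```python
# a minimal sketch of the bias/variance trade-off; assumes X_train, X_test,
# y_train, y_test from the earlier toy split
from sklearn import metrics, tree

for leaf_size in [1, 5, 50]:  # scikit-learn's analog of 'minimum node size'
    clf = tree.DecisionTreeClassifier(
        min_samples_leaf=leaf_size, random_state=76).fit(X_train, y_train)
    train_acc = metrics.accuracy_score(y_train, clf.predict(X_train))
    test_acc = metrics.accuracy_score(y_test, clf.predict(X_test))
    # a big train/test gap hints at variance (overfitting);
    # low accuracy on both hints at bias (underfitting)
    print(leaf_size, round(train_acc, 3), round(test_acc, 3))
```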
Reviewing the reward/penalty in Machine Learning
What is the ‘Pavlovian bell’ in the machine learning model?
A mathematical penalty/reward equation that scores the model's predictions (see the sketch after this list):
- Variance, RMSE, SD (for continuous targets)
- Proportions correctly classified (for categorical targets)
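A minimal sketch of these penalty/reward calculations; the numbers are invented:

```python
# a minimal sketch of penalty/reward metrics; all numbers are invented
import numpy as np
from sklearn import metrics

# continuous target: penalize distance from the truth with RMSE
y_true = np.array([200000, 310000, 150000])
y_pred = np.array([195000, 325000, 160000])
print(np.sqrt(metrics.mean_squared_error(y_true, y_pred)))

# categorical target: reward the proportion classified correctly
y_true_c = np.array([1, 0, 1, 1, 0])
y_pred_c = np.array([1, 0, 0, 1, 0])
print(metrics.accuracy_score(y_true_c, y_pred_c))  # 4 of 5 correct
```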
If your model’s predictions are near perfect, you might be cheating.
Watch out for transactional data!
- Financial: orders, invoices, payments
- Work: plans, activity records
- School: grades
scikit-learn
- Tutorials
- Getting Started: What do you notice about the header portion of each of the script chunks?
Using our project data to understand features, targets, and samples.
Getting our packages for project 4
# install packages
import sys
!{sys.executable} -m pip install seaborn scikit-learn
- Import `dwellings_ml.csv` and write a short sentence describing your data. Remember to explain an observation and what measurements we have on that observation.
# the full imports
import pandas as pd
import numpy as np
import seaborn as sns
import altair as alt
# the from imports
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
# read in the Denver housing data sets
dwellings_denver = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_denver/dwellings_denver.csv")
dwellings_ml = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_ml/dwellings_ml.csv")
dwellings_neighborhoods_ml = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_neighborhoods_ml/dwellings_neighborhoods_ml.csv")
# let Altair serialize data to json so charts can use more than 5,000 rows
alt.data_transformers.enable('json')
- Now try describing the modeling (machine learning) we are going to do in terms of features and targets (one way to set them up is sketched at the end of this section).
  - A. Are there any columns that are the target in disguise?
  - B. Are the observational units unique in every row?
# %%
# sample 500 rows so the pairplot renders quickly
h_subset = dwellings_ml.filter(['livearea', 'finbsmnt',
    'basement', 'nocars', 'numbdrm', 'numbaths',
    'stories', 'yrbuilt', 'before1980']).sample(500)
sns.pairplot(h_subset, hue='before1980')
corr = h_subset.drop(columns='before1980').corr()

# %%
# heatmap of the pairwise correlations among the sampled features
sns.heatmap(corr)
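One way to set up the features and targets discussed above (a sketch, not the required solution): before1980 is the target, yrbuilt is the target in disguise, and treating parcel as an identifier column to drop is an assumption about this data set.

```python
# %%
# a sketch of setting up features and targets from dwellings_ml; treating
# 'yrbuilt' as the target in disguise and 'parcel' as an identifier column
# is an assumption about this data set
X = dwellings_ml.drop(columns=['parcel', 'yrbuilt', 'before1980'])
y = dwellings_ml['before1980']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, random_state=76)

# a quick baseline classifier using the GaussianNB import above
clf = GaussianNB().fit(X_train, y_train)
print(metrics.accuracy_score(y_test, clf.predict(X_test)))
```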