- Import dwellings_ml.csv and write a short sentence describing your data. Remember to explain what an observation is and what measurements we have on that observation.
- Now try describing the modeling (machine learning) we are going to do in terms of features and targets. A. Are there any columns that are the target in disguise? B. Are the observational units unique in every row?
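Questions A and B can both be checked in code. The sketch below is a minimal illustration on a toy table with made-up values, not the real dwellings data; only the column names `livearea`, `yrbuilt`, and `before1980` come from this section. It tests whether `yrbuilt` reproduces the target on its own and counts fully duplicated rows.

```python
import pandas as pd

# Toy rows with hypothetical values (assumption: not taken from dwellings_ml.csv).
homes = pd.DataFrame({
    "livearea": [1200, 1850, 1200, 2400],
    "yrbuilt": [1965, 1992, 1965, 2004],
    "before1980": [1, 0, 1, 0],
})

# A. A column is a target in disguise if it can rebuild the target by itself.
rebuilt_target = (homes["yrbuilt"] < 1980).astype(int)
is_leak = (rebuilt_target == homes["before1980"]).all()

# B. Count rows that are exact copies of an earlier row.
n_duplicate_rows = homes.duplicated().sum()

print("yrbuilt reproduces the target:", is_leak)
print("duplicated rows:", n_duplicate_rows)
```

On the real data, a row-level check like this is only a starting point; whether two rows describe the same dwelling may depend on an identifier column.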
# the full imports
import pandas as pd
import numpy as np
import seaborn as sns
import altair as alt
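# %%
# Load the data (assumes dwellings_ml.csv is in the working directory)
dwellings_ml = pd.read_csv('dwellings_ml.csv')
# Sample 500 rows of selected columns for a quick look.
# filter() silently skips any listed name that is not a column.
h_subset = dwellings_ml.filter(['livearea', 'finbsmnt',
    'basement', 'yearbuilt', 'nocars', 'numbdrm', 'numbaths',
    'stories', 'yrbuilt', 'before1980']).sample(500)
# Pairwise scatter plots, colored by the target
sns.pairplot(h_subset, hue = 'before1980')
# Correlations among the features, with the target removed
corr = h_subset.drop(columns = 'before1980').corr()
# %%
sns.heatmap(corr)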
- square footage
- number of bathrooms
- basement size
Let’s create one chart using some of these variables.
What is scikit-learn?
The About scikit-learn page describes the project's history and funding. Given that backing, it will likely remain the dominant Python machine learning library for a long time.
- Simple and efficient tools for predictive data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable - BSD license
Should I import scikit-learn?
scikit-learn is very large, with many submodules. To help readers of your .py script see where each function comes from, the convention is to import only what you need with from ... import ... rather than importing the whole package.
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
What does the train_test_split() function do?
X_train, X_test, y_train, y_test = train_test_split(
    X_pred,
    y_pred,
    test_size = .34,
    random_state = 76)
Read the documentation and tell me what is returned.
Why do we use test_size and random_state?
What are X_pred and y_pred in the function example above?
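One way to explore these questions is to run the split on small synthetic arrays and inspect the result. This is a sketch under stated assumptions: `X_demo` and `y_demo` are made-up stand-ins, not the dwellings features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins (assumption): 100 rows, 2 feature columns, alternating labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

# Four objects come back: train/test features, then train/test targets.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.34, random_state=76)

# test_size sets the share of rows held out for testing (about 34% here).
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# random_state fixes the shuffle, so repeating the call gives the same split.
X_train_again, _, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.34, random_state=76)
same_split = np.array_equal(X_train, X_train_again)
print("same split on rerun:", same_split)
```

Fixing `random_state` matters mainly for reproducibility: without it, each run evaluates the model on a different set of test rows.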
We need to take our data and build the feature and target data objects.
What columns should we remove from our features (X)?
What column should we use as our target (y)?
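One possible answer, sketched on a toy table: treat `before1980` as the target and drop both it and `yrbuilt` from the features, since the year built would give the answer away. The values below are made up for illustration, and the real file has more columns that may also need a look.

```python
import pandas as pd

# Toy rows with hypothetical values; column names follow the ones used earlier.
dwellings_demo = pd.DataFrame({
    "livearea": [1200, 1850, 980, 2400],
    "numbaths": [1, 2, 1, 3],
    "yrbuilt": [1965, 1992, 1948, 2004],
    "before1980": [1, 0, 1, 0],
})

# Features: everything except the target and the column that reveals it.
X_pred = dwellings_demo.drop(columns=["before1980", "yrbuilt"])

# Target: the single column we want the model to predict.
y_pred = dwellings_demo["before1980"]

print(X_pred.columns.tolist())
print(y_pred.tolist())
```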
Let’s try a decision tree
from sklearn import tree
What method do we want from tree?
- What is our target?
- Don’t forget to set your arguments.
After we have our machine learning method, what do we want to do with the method?
- fit
- predict
- evaluate metrics
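# Create the classifier, then fit it on the training data
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
# Predicted classes for the test rows (note: this reuses the name y_pred from the split above)
y_pred = clf.predict(X_test)
# Predicted class probabilities, one column per class
y_probs = clf.predict_proba(X_test)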
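The third step, evaluating metrics, might look like the sketch below. It runs the full fit, predict, evaluate sequence on synthetic data from `make_classification`, which is an assumption standing in for the dwellings features. The predictions are stored as `y_hat` (a name chosen here) to keep them separate from the target.

```python
from sklearn import metrics, tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption: stand-in for the real features).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=76)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.34, random_state=76)

# Fit: learn the tree from the training rows.
clf = tree.DecisionTreeClassifier(random_state=76)
clf.fit(X_train, y_train)

# Predict: class labels and class probabilities for the held-out rows.
y_hat = clf.predict(X_test)
y_probs = clf.predict_proba(X_test)

# Evaluate: compare predictions with the true test labels.
accuracy = metrics.accuracy_score(y_test, y_hat)
print("accuracy:", accuracy)
print(metrics.classification_report(y_test, y_hat))
print(metrics.confusion_matrix(y_test, y_hat))
```

Accuracy alone can hide problems when classes are unbalanced, which is why the report and confusion matrix are printed alongside it.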