Welcome to class!
Announcements
Spiritual thought
What is Machine Learning?
Splitting the Data
1. Start with a data set
What is the difference between dwellings_denver.csv and dwellings_ml.csv?
2. Choose which variables to use
How do we know which variables to use out of dwellings_ml.csv?
Question 1 will help you identify patterns (or lack of patterns) in the data.
3. Separate into features and target
x = dwellings_ml.filter([#what variables will you use as "features"?])
y = dwellings_ml[#what variable is the "target"?]
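As a rough illustration of this step, the sketch below separates a small made-up table into features and a target. The DataFrame `dwellings_demo` and every column name in it are hypothetical stand-ins, not the actual columns of dwellings_ml.csv; choosing the real features is still up to you.

```python
import pandas as pd

# Hypothetical stand-in for dwellings_ml; these column names are
# illustrative assumptions, not the real columns of the course file.
dwellings_demo = pd.DataFrame({
    "live_area": [1200, 2400, 1800, 950],
    "num_baths": [1, 3, 2, 1],
    "stories": [1, 2, 2, 1],
    "built_before_1980": [1, 0, 0, 1],
})

# Features: the columns the model is allowed to look at.
feature_columns = ["live_area", "num_baths", "stories"]
x = dwellings_demo.filter(feature_columns)

# Target: the single column the model should learn to predict.
y = dwellings_demo["built_before_1980"]

print(x.shape, y.shape)
```

Note that the target column is left out of `x`; otherwise the model would be handed the answer.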
4. Split into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = #???, random_state = #???)
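One possible way to fill in the blanks is sketched below on synthetic data. The values `test_size=0.25` and `random_state=42` are assumptions chosen for illustration, not required settings, and the columns `feature_a` and `feature_b` are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical columns), 100 rows.
rng = np.random.default_rng(0)
x = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
})
y = pd.Series((x["feature_a"] > 0).astype(int), name="target")

# test_size is the fraction of rows held out for testing;
# random_state fixes the shuffle so the split is reproducible.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42
)

print(len(x_train), len(x_test))
```

Using the same `random_state` each time means everyone who runs the code gets the same rows in the training and testing sets.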
Training a Classifier
Decision Tree Example
# import the model class and the metrics module
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
# create the model
classifier = DecisionTreeClassifier()
# train the model
classifier.fit(x_train, y_train)
# make predictions
y_predictions = classifier.predict(x_test)
# test how accurate predictions are
metrics.accuracy_score(y_test, y_predictions)
How to Improve Accuracy
To improve the accuracy of your model, you could:
- Change what variables are used in the features (x) data set
- Change what type of model you are using
- Tune (that is, “change” or “tweak”) the parameters of the model
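As a minimal sketch of the third option, the code below tries a few values of one decision tree parameter, `max_depth`, and records the test accuracy for each. The data comes from scikit-learn's `make_classification` generator rather than the dwellings data, and the list of depths is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Synthetic data standing in for the real features and target.
x, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# Try several tree depths; None means the tree grows without a limit.
depth_scores = {}
for depth in [2, 4, 8, None]:
    classifier = DecisionTreeClassifier(max_depth=depth, random_state=0)
    classifier.fit(x_train, y_train)
    y_predictions = classifier.predict(x_test)
    depth_scores[depth] = metrics.accuracy_score(y_test, y_predictions)

for depth, score in depth_scores.items():
    print(depth, round(score, 3))
```

The same loop pattern works for other parameters; only the keyword passed to the model changes.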
Other Classification Models
Here are some other models you could try.
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
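Because these classifiers share the same `fit` and `predict` methods, swapping models takes very little code. The sketch below compares the three on synthetic data from `make_classification`; the data set and the dictionary name `accuracy_by_model` are assumptions for illustration, and the scores say nothing about how the models will rank on the dwellings data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics

# Synthetic data standing in for the real features and target.
x, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

models = {
    "GaussianNB": GaussianNB(),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
}

# Same steps for every model: train, predict, score.
accuracy_by_model = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    y_predictions = model.predict(x_test)
    accuracy_by_model[name] = metrics.accuracy_score(y_test, y_predictions)

for name, score in accuracy_by_model.items():
    print(name, round(score, 3))
```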
Make Progress on Project 4
- Step-by-step guide (mostly) for training a GaussianNB classifier. (The steps will be the same for any algorithm you use.)
Visual Introduction to Machine Learning
- Machine learning identifies patterns using statistical learning and computers by unearthing boundaries in data sets. You can use it to make predictions.
- One method for making predictions is called a decision tree, which uses a series of if-then statements to identify boundaries and define patterns in the data.
- Overfitting happens when some boundaries are based on distinctions that don’t make a difference. You can see if a model overfits by having test data flow through the model.
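One rough way to check for overfitting in code is to compare accuracy on the training data with accuracy on the test data. The sketch below does this for a decision tree with no depth limit, using noisy synthetic data from `make_classification`; the data set and the `flip_y=0.2` noise level are assumptions chosen to make the gap easy to see.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Synthetic data where about 20% of labels are reassigned at random,
# so some apparent "patterns" are just noise.
x, y = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    flip_y=0.2, random_state=1
)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=1
)

# A tree with no depth limit keeps splitting until it fits the
# training rows, noise included.
classifier = DecisionTreeClassifier(random_state=1)
classifier.fit(x_train, y_train)

train_accuracy = metrics.accuracy_score(y_train, classifier.predict(x_train))
test_accuracy = metrics.accuracy_score(y_test, classifier.predict(x_test))

print("train:", round(train_accuracy, 3))
print("test: ", round(test_accuracy, 3))
```

A training score far above the test score suggests the model has learned boundaries that do not carry over to new data.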
The goal of Grand Question 1 is to help us with “feature selection”.
- “Overfitting” happens when some boundaries are based on distinctions that don’t make a difference.
- More data does not always lead to better models. (Occam’s Razor)
Common questions:
MaxRowsError: How can I plot Large Datasets?
You may also save data to a local filesystem and reference the data by file path. Altair has a JSON data transformer that will do this transparently when enabled:
alt.data_transformers.enable('json')
- Home page
- Tutorials
- Getting Started: What do you notice about the header portion of each of the script chunks?