Day 4: Evaluating Our Models, Part 2

Announcements

Today:

  1. Continue discussion about evaluating models
  2. Try to understand what models are doing

Evaluating model performance, cont.

Confusion Matrix

Why isn’t accuracy enough?

A confusion matrix is a quick way to see the strengths and weaknesses of your model. A confusion matrix is not a "metric" itself; rather, it gives you the counts needed to calculate multiple metrics such as accuracy, precision, and recall.



Your Turn

With your group, use the links above to find a definition for your assigned metric. Then try using the confusion matrix on the screen to calculate your metric for my model.

  • Group 1: Accuracy
  • Group 2: Sensitivity/Recall
  • Group 3: Precision
  • Group 4: Specificity
  • Group 5: F1 Score
  • Group 6: Balanced Accuracy
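Each of the assigned metrics can be computed directly from the four cells of a confusion matrix. Here is a minimal sketch using hypothetical counts (`tp`, `fp`, `fn`, `tn` are made-up numbers, not the counts from my model):

```python
# Hypothetical confusion-matrix counts (not from the model on the screen)
tp, fp, fn, tn = 80, 10, 20, 90

accuracy = (tp + tn) / (tp + fp + fn + tn)          # Group 1
recall = tp / (tp + fn)                             # Group 2 (sensitivity)
precision = tp / (tp + fp)                          # Group 3
specificity = tn / (tn + fp)                        # Group 4
f1 = 2 * precision * recall / (precision + recall)  # Group 5
balanced_accuracy = (recall + specificity) / 2      # Group 6

print(accuracy, recall, precision, specificity, f1, balanced_accuracy)
```

Swap in the counts from the confusion matrix on the screen to check your group's answer.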

Validation metrics


#%%
# imports used throughout this section
from sklearn import metrics
import pandas as pd

# a confusion matrix
print(metrics.confusion_matrix(y_test, y_predicted_DT))

#%%
# this one might be easier to read
print(pd.crosstab(y_test.before1980, y_predicted_DT, rownames=['True'], colnames=['Predicted'], margins=True))

#%%
# visualize a confusion matrix
# requires '.' to be installed
# note: plot_confusion_matrix was removed in scikit-learn 1.2;
# use ConfusionMatrixDisplay.from_estimator instead
metrics.ConfusionMatrixDisplay.from_estimator(classifier_DT, x_test, y_test)

Some Python code

# Which metric seems better `accuracy_score()` or `balanced_accuracy_score()`? Why?
print("Accuracy:", metrics.accuracy_score(y_test, y_predicted_DT))
print("Balanced Accuracy:", metrics.balanced_accuracy_score(y_test, y_predicted_DT))

# Confusion matrix
print(pd.crosstab(y_test.before1980, y_predicted_DT, rownames=['True'], colnames=['Predicted'], margins=True))
# plot_confusion_matrix was removed in scikit-learn 1.2; use this instead
metrics.ConfusionMatrixDisplay.from_estimator(classifier_DT, x_test, y_test)

# Other metrics
print(metrics.classification_report(y_test, y_predicted_DT))
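To see why `accuracy_score()` and `balanced_accuracy_score()` can disagree, here is a small sketch with made-up labels: 95 of 100 homes are "before 1980", and a lazy model predicts "before 1980" for everything. The metrics are computed by hand so you can see where the difference comes from:

```python
# Made-up labels for illustration: class 1 ("before 1980") dominates
y_true = [1] * 95 + [0] * 5
y_pred = [1] * 100  # a lazy model that always predicts class 1

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# per-class recall, then average -- for two classes this is balanced accuracy
recall_1 = sum(t == p == 1 for t, p in zip(y_true, y_pred)) / y_true.count(1)
recall_0 = sum(t == p == 0 for t, p in zip(y_true, y_pred)) / y_true.count(0)
balanced_accuracy = (recall_1 + recall_0) / 2

print(accuracy)           # 0.95 -- looks great
print(balanced_accuracy)  # 0.5 -- reveals the model never finds class 0
```

When the classes are imbalanced, balanced accuracy punishes a model that ignores the rare class; plain accuracy does not.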

Improving your fit

Using different features

Let’s add Neighborhood to our previous work.

The dwellings_neighborhoods_ml data set contains the neighborhood variable "nbhd" from the original data, transformed to use one-hot encoding. (Image source.)
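As a reminder of what one-hot encoding does, here is a plain-Python sketch with made-up neighborhood names (in pandas you would typically use `pd.get_dummies` on the "nbhd" column instead):

```python
# Made-up neighborhood values, one per dwelling
nbhd = ["baker", "cole", "baker", "hale"]
categories = sorted(set(nbhd))  # ['baker', 'cole', 'hale']

# one row per dwelling, one 0/1 column per neighborhood
one_hot = [{f"nbhd_{c}": int(value == c) for c in categories} for value in nbhd]
print(one_hot[0])  # {'nbhd_baker': 1, 'nbhd_cole': 0, 'nbhd_hale': 0}
```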

If we want to use neighborhoods in our classifier, we need to join dwellings_neighborhoods_ml with the other features we’ve been using.

# what we used last class
x = dwellings_ml.filter(["livearea", "basement", "stories", "numbaths"])
y = dwellings_ml[["before1980"]]

# adding on the neighborhood data (the _ml version with one-hot columns)
x2 = x.join(dwellings_neighborhoods_ml, how='left')

Picking a different model

Gradient Boosting Classifier

from sklearn.ensemble import GradientBoostingClassifier

boost = GradientBoostingClassifier(random_state=42)
boost.fit(x_train, y_train)
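The idea behind gradient boosting: each stage fits a weak learner to the residuals of the ensemble so far, and the predictions improve additively. Here is a minimal sketch where the "weak learner" is just the mean of the residuals, the simplest possible model (real libraries like scikit-learn and catboost use small decision trees):

```python
# Boosting sketch: repeatedly fit residuals and add a scaled correction
y = [3.0, 5.0, 10.0]          # toy regression targets
learning_rate = 0.5
prediction = [0.0] * len(y)   # the ensemble starts by predicting 0

for stage in range(20):
    residuals = [t - p for t, p in zip(y, prediction)]
    weak_prediction = sum(residuals) / len(residuals)  # "fit" the weak learner
    prediction = [p + learning_rate * weak_prediction for p in prediction]

print(prediction)  # every value approaches the mean of y (6.0)
```

Because this toy weak learner can only predict a constant, the ensemble converges to the mean of `y`; trees let each stage correct different examples differently, which is where the real power comes from.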

Category Boosting (catboost)

import catboost as cb

model = cb.CatBoostClassifier(iterations=10,
                              depth=6,
                              learning_rate=1,
                              loss_function='Logloss',
                              verbose=False)
model.fit(x_train, y_train)