Announcements
Today:
- Continue discussion about evaluating models
- Try to understand what models are doing
Evaluating model performance (cont.)
Confusion Matrix
Why isn’t accuracy enough?
A confusion matrix is a quick way to see the strengths and weaknesses of your model. It is not a metric itself; rather, it provides the counts you need to calculate multiple metrics, such as accuracy, precision, and recall.

Your Turn
With your group, use the links above to find a definition for your assigned metric. Then try using the confusion matrix on the screen to calculate your metric for my model.
- Group 1: Accuracy
- Group 2: Sensitivity/Recall
- Group 3: Precision
- Group 4: Specificity
- Group 5: F1 Score
- Group 6: Balanced Accuracy
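To make the calculations concrete, here is a sketch that computes each group's metric from a single hypothetical 2x2 confusion matrix. The counts below are made up for illustration; they are not the numbers from class.

```python
# hypothetical confusion-matrix counts (rows: true class, columns: predicted class)
tn, fp = 85, 5   # true class 0: 85 classified correctly, 5 misclassified
fn, tp = 10, 50  # true class 1: 10 missed, 50 caught

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)              # sensitivity: share of true positives caught
precision = tp / (tp + fp)           # share of positive predictions that are right
specificity = tn / (tn + fp)         # share of true negatives caught
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

print(f"Accuracy:          {accuracy:.3f}")
print(f"Recall:            {recall:.3f}")
print(f"Precision:         {precision:.3f}")
print(f"Specificity:       {specificity:.3f}")
print(f"F1:                {f1:.3f}")
print(f"Balanced accuracy: {balanced_accuracy:.3f}")
```

Notice that accuracy alone hides the asymmetry: the model misses 10 of 60 true positives, which recall and balanced accuracy surface but plain accuracy does not.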
Validation metrics
#%%
from sklearn import metrics
import pandas as pd

# a confusion matrix
print(metrics.confusion_matrix(y_test, y_predicted_DT))
#%%
# this one might be easier to read
print(pd.crosstab(y_test.before1980, y_predicted_DT, rownames=['True'], colnames=['Predicted'], margins=True))
#%%
# visualize a confusion matrix
# requires matplotlib to be installed
# plot_confusion_matrix was removed from newer scikit-learn; use ConfusionMatrixDisplay
metrics.ConfusionMatrixDisplay.from_estimator(classifier_DT, x_test, y_test)
Some Python code
# Which metric seems better `accuracy_score()` or `balanced_accuracy_score()`? Why?
print("Accuracy:", metrics.accuracy_score(y_test, y_predicted_DT))
print("Balanced Accuracy:", metrics.balanced_accuracy_score(y_test, y_predicted_DT))
# Confusion matrix
print(pd.crosstab(y_test.before1980, y_predicted_DT, rownames=['True'], colnames=['Predicted'], margins=True))
metrics.ConfusionMatrixDisplay.from_estimator(classifier_DT, x_test, y_test)
# Other metrics
print(metrics.classification_report(y_test, y_predicted_DT))
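Why prefer `balanced_accuracy_score()` on imbalanced data? A toy sketch with made-up labels (not the housing data): a model that always predicts the majority class looks great on accuracy but collapses on balanced accuracy.

```python
from sklearn import metrics

# made-up labels: 90 negatives, 10 positives; the "model" always predicts 0
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

print("Accuracy:", metrics.accuracy_score(y_true, y_pred))                     # 0.9
print("Balanced Accuracy:", metrics.balanced_accuracy_score(y_true, y_pred))  # 0.5
```

Balanced accuracy averages the per-class recalls (1.0 for class 0, 0.0 for class 1), so it exposes that the minority class is never predicted.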
Improving your fit
Using different features
Let’s add Neighborhood to our previous work.
The dwellings_neighborhoods_ml data set contains the neighborhood variable “nbhd” from the original data, transformed with one-hot encoding.
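One-hot encoding turns a single categorical column into one 0/1 column per category. A minimal sketch with made-up neighborhood labels (the real dwellings_neighborhoods_ml set was built the same way in spirit, though its exact column names may differ):

```python
import pandas as pd

# made-up neighborhood labels for four houses
df = pd.DataFrame({"nbhd": [101, 102, 101, 103]})

# one 0/1 column per neighborhood value
one_hot = pd.get_dummies(df, columns=["nbhd"], dtype=int)
print(one_hot)
```

Each row now has a 1 in exactly one of the `nbhd_*` columns, which lets tree-based models split on individual neighborhoods.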

If we want to use neighborhoods in our classifier, we need to join dwellings_neighborhoods_ml with the other features we’ve been using.
# what we used last class
x = dwellings_ml.filter(["livearea","basement","stories","numbaths"])
y = dwellings_ml[["before1980"]]
# adding on the neighborhood data
x2 = x.join(dwellings_neighborhoods_ml, how='left')
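End to end, the join-then-refit pattern looks like the sketch below. The data here is randomly generated to stand in for dwellings_ml and dwellings_neighborhoods_ml, so the accuracy it prints is meaningless; only the workflow matters.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 200

# stand-in for the original features
x = pd.DataFrame({
    "livearea": rng.integers(500, 5000, n),
    "basement": rng.integers(0, 2, n),
    "stories": rng.integers(1, 4, n),
    "numbaths": rng.integers(1, 5, n),
})
# stand-in for the one-hot neighborhood columns
neighborhoods = pd.get_dummies(
    pd.Series(rng.integers(1, 4, n), name="nbhd"), prefix="nbhd", dtype=int
)
y = pd.Series(rng.integers(0, 2, n), name="before1980")

# join on the shared row index, then refit on the expanded feature set
x2 = x.join(neighborhoods, how="left")
x_train, x_test, y_train, y_test = train_test_split(x2, y, test_size=0.3, random_state=76)
model = DecisionTreeClassifier(random_state=76).fit(x_train, y_train)
print("columns:", list(x2.columns))
print("test accuracy:", model.score(x_test, y_test))
```

The `how='left'` keeps every row of `x` even if a house were missing from the neighborhood table; with real data you would check for NaNs after the join.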
Picking a different model
Gradient Boosting Classifier
- A Gentle Introduction to the Gradient Boosting
- GradientBoostingClassifier documentation
- Gradient boosting wikipedia page
from sklearn.ensemble import GradientBoostingClassifier
boost = GradientBoostingClassifier(random_state=42)
boost.fit(x_train, y_train)
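The same fit/score pattern works as a self-contained sketch on synthetic data; `make_classification` stands in for the housing features here, so the score is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in data, not the dwellings set
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42)

boost = GradientBoostingClassifier(random_state=42)
boost.fit(x_train, y_train)
print("test accuracy:", boost.score(x_test, y_test))
```

Gradient boosting fits many shallow trees in sequence, each one correcting the previous ensemble's errors, so it often beats a single decision tree at the cost of longer training.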
Category Boosting (catboost)
- Video: What is Category Boosting (catboost)?
- CatBoostClassifier documentation
import catboost as cb

model = cb.CatBoostClassifier(
    iterations=10,
    depth=6,
    learning_rate=1,
    loss_function='Logloss',
    verbose=False,
)
model.fit(x_train, y_train)