Announcements
Today:
- Continue discussion about evaluating models
- Try to understand what models are doing
Evaluating model performance (cont.)
Confusion Matrix
Why isn’t accuracy enough?
A confusion matrix is a quick way to see the strengths and weaknesses of your model. A confusion matrix is not a metric itself; rather, it provides an easy way to calculate multiple metrics such as accuracy, precision, and recall.
Your Turn
With your group, use the links above to find a definition for your assigned metric. Then try using the confusion matrix on the screen to calculate your metric for my model.
- Group 1: Accuracy
- Group 2: Sensitivity/Recall
- Group 3: Precision
- Group 4: Specificity
- Group 5: F1 Score
- Group 6: Balanced Accuracy
Validation metrics
#%%
import pandas as pd
from sklearn import metrics

# a confusion matrix
print(metrics.confusion_matrix(y_test, y_predicted_DT))
#%%
# this one might be easier to read
print(pd.crosstab(y_test.before1980, y_predicted_DT, rownames=['True'], colnames=['Predicted'], margins=True))
#%%
# visualize a confusion matrix
# requires matplotlib to be installed
# (note: in newer scikit-learn versions, plot_confusion_matrix was replaced
# by metrics.ConfusionMatrixDisplay.from_estimator)
metrics.plot_confusion_matrix(classifier_DT, x_test, y_test)
Some Python code
# Which metric seems better `accuracy_score()` or `balanced_accuracy_score()`? Why?
print("Accuracy:", metrics.accuracy_score(y_test, y_predicted_DT))
print("Balanced Accuracy:", metrics.balanced_accuracy_score(y_test, y_predicted_DT))
# Confusion matrix
print(pd.crosstab(y_test.before1980, y_predicted_DT, rownames=['True'], colnames=['Predicted'], margins=True))
# (in newer scikit-learn versions, use metrics.ConfusionMatrixDisplay.from_estimator instead)
metrics.plot_confusion_matrix(classifier_DT, x_test, y_test)
# Other metrics
print(metrics.classification_report(y_test, y_predicted_DT))
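To see why `accuracy_score()` and `balanced_accuracy_score()` can disagree, consider a deliberately imbalanced toy example (these labels are made up for illustration): a lazy "model" that always predicts the majority class looks great on accuracy but not on balanced accuracy.

```python
from sklearn import metrics

# made-up labels: 90 examples of class 1, only 10 of class 0
y_true = [1] * 90 + [0] * 10
# a "model" that predicts the majority class every time
y_pred = [1] * 100

print(metrics.accuracy_score(y_true, y_pred))           # 0.9
print(metrics.balanced_accuracy_score(y_true, y_pred))  # 0.5
```

Balanced accuracy averages the recall of each class, so ignoring the minority class entirely drags it down to 0.5, even though plain accuracy is 0.9.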
Improving your fit
Using different features
Let’s add Neighborhood to our previous work.
The dwellings_neighborhoods_ml data set contains the neighborhood variable “nbhd” from the original data transformed to use one hot encoding.
If we want to use neighborhoods in our classifier, we need to join dwellings_neighborhoods_ml with the other features we’ve been using.
# what we used last class
x = dwellings_ml.filter(["livearea","basement","stories","numbaths"])
y = dwellings_ml[["before1980"]]
# adding on the neighborhood data
x2 = x.join(dwellings_neighborhoods_ml, how='left')
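If you ever need to build one-hot columns yourself, pandas' `get_dummies` performs the same transformation the neighborhoods data set already contains. A tiny sketch (the neighborhood labels here are invented):

```python
import pandas as pd

# made-up categorical column standing in for "nbhd"
df = pd.DataFrame({'nbhd': ['A', 'B', 'A', 'C']})

# one hot encoding: one indicator column per category
one_hot = pd.get_dummies(df['nbhd'], prefix='nbhd')
print(one_hot)
```

Each row gets exactly one "hot" column (nbhd_A, nbhd_B, or nbhd_C) marking its category.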
Picking a different model
Gradient Boosting Classifier
- A Gentle Introduction to the Gradient Boosting
- GradientBoostingClassifier documentation
- Gradient boosting wikipedia page
from sklearn.ensemble import GradientBoostingClassifier
boost = GradientBoostingClassifier(random_state=42)
boost.fit(x_train, y_train)
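If you want to experiment with gradient boosting before touching the dwellings data, a self-contained version using scikit-learn's synthetic data generator looks like this (`make_classification` is just a stand-in for our real features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# synthetic stand-in for the dwellings features and target
X, y = make_classification(n_samples=500, n_features=4, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=76)

boost = GradientBoostingClassifier(random_state=42)
boost.fit(x_train, y_train)

y_pred = boost.predict(x_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
```

The same `fit`/`predict` pattern applies unchanged once you swap in the real training split.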
Category Boosting (catboost)
- Video: What is Category Boosting (catboost)?
- CatBoostClassifier documentation
import catboost as cb
model = cb.CatBoostClassifier(iterations=10,
                              depth=6,
                              learning_rate=1,
                              loss_function='Logloss',
                              verbose=False)
model.fit(x_train, y_train)