Evaluating Model Fit

Week 4 Lesson 2.3


After this lesson, you will be able to:

  • Understand the fundamentals of evaluating classifiers
  • Understand precision, recall, accuracy, f1-score, and ROC curves
  • Know how to use sklearn.metrics functions to easily compute these metrics


Before this lesson, you should already be able to:

  • Find and label features and target in a dataset
  • Build and interpret a logistic regression model in statsmodels
  • Explain logit coefficients and odds


Before this lesson, instructors will need to:

  • Gather materials needed for class
  • Complete Prep work required
  • Prepare any specific instructions



Code Along Code Along 2


5 min Opening How to use scikit-learn to run logistic
5 min Introduction Key metrics
30 min Guided Practice Exploring classifier evaluation metrics
15 min Demo AUC and ROC curves
25 min Independent Practice Practice AUC and ROC curves
5 min Conclusion Review & Recap

Opening: Run logistic regression with sklearn (5 mins)

Like statsmodels, sklearn also has a function to run logistic regression. The implementation is different than statsmodels but is easy use.

from sklearn.linear_model import LogisticRegression
import numpy as np

# Set up fake X and Y
X = np.random.randn(100, 3)
Y = np.random.binomial(1, 0.5, 100)

logreg = LogisticRegression()
logreg.fit(X, Y)

# coefficients:

X_new = np.random.randn(100, 3)

# predict class:
y_pred = logreg.predict(X_new)

# predicted probability:
y_pp = logreg.predict_proba(X_new)

The sklearn implementation of logistic regression is focused more on flexibility than the statsmodels implementation, but does not return as many summary statistics. We will be using the sklearn implementation for this lesson.

Introduction: Key metrics (5 mins)

Classification problems and models are evaluated differently than regression models. Whereas regression models predict a continuous variable, classification problems predict probability of belonging to a class of outcome.

Instead of evaluating models based on error like in regression, we evaluate the models based on the correct and incorrect labeling of classes.

Most classification metrics are based on four outcome categories during prediction:

  • True Positives: A positive class observation (1) is correctly classified as positive by the model.
  • False Positive: A negative class observation (0) is incorrectly classified as positive.
  • True Negative: A negative class observation is correctly classified as negative.
  • False Negative: A positive class observation is incorrectly classified as negative.

Fair warning – this is going to be a difficult lecture. There are a lot of concepts covered!

Exploring classifier evaluation metrics (25 min)

For this section, let's say we have predicted Y for a test set from a logistic regression trained on a train set, as outlined in the code below.

You can follow along in the following Jupyter notebook file.

Keep in mind your numbers will be different than the results in this presentation since the data is randomly generated.

import numpy as np
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)

logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)

Classification evaluation metric fundamentals

Confusion matrix

The confusion matrix is very basic, and while may not seem that useful contains all of the information required for calculating more complex evaluation metrics.

A confusion matrix has as rows and columns the classes modeled by your classifier. In the case of logistic regression this will be a 2x2 matrix. Rows indicate the actual class, and columns indicate the predicted class.

Predicted Cancer Predicted Healthy
Has Cancer 168 31
Is Healthy 46 85

In the medical example above, 168 of the people who actually have cancer were correctly identified as having cancer and 85 healthy people were correctly identified as being healthy. However, 46 people who were actually healthy were incorrectly identified as having cancer, and 31 people who had cancer were incorrectly identified as being healthy.

From the 2-variable confusion matrix we can calculate true positives, false positives, true negatives, and false negatives directly from the cells. Tuning your model to adjust these metrics depends on your priorities. In healthcare for example, you may want to minimize false negatives at the expense of more false positives.

  from sklearn.metrics import confusion_matrix

  confusion = np.array(confusion_matrix(Y_test, Y_pred))

  [[168  31]
  [ 46  85]]

  # calculate true positives, the number of 1s correctly predicted to be 1
  TP = confusion[0,0]

  # calculate false positives, the number of 0s incorrectly predicted to be 1
  FP  = confusion[1,0]

  # calculate true negatives, the number of 0s correctly predicted to be 0
  TN = confusion[1,1]

  # calculate false negatives, the number of 1s incorrectly predicted to be 0
  FN = confusion[0,1]

Accuracy is simply the proportion of classes correctly predicted by the model.

  Accuracy = (True Positives + True Negatives) / Total
  from sklearn.metrics import accuracy_score

  acc = accuracy_score(Y_test, Y_pred)

  # This is equivalent to:
  acc = np.sum(Y_test == Y_pred)/len(Y_test)
sklearn classification report

The classification report function returns a detailed printout of metrics about your model.

  from sklearn.metrics import classification_report

  # Example printout:
  print(classification_report(Y_test, Y_pred))

                   precision recall    f1-score   support

  0                 0.79      0.84      0.81       199
  1                 0.73      0.65      0.69       131

  avg / total       0.76      0.77      0.76       330

The 0 and 1 on the left column indicate the two classes predicted by the model. For models with multiple classes there would be more rows and labels.

Each of the columns indicate an important metric for evaluating classification model performance.

Precision is the ability of the classifier to avoid mislabeling when the observation belongs in another class.

  Precision = True Positives / (True Positives + False Positives)

A precision score of 1 indicates that the classifier never mistakenly added observations from another class. A precision score of 0 would mean that the classifier misclassified every instance of the current class.

recall is the ability of the classifier to correctly identify all observations in the current class.

  Recall = True Positives / (True Positives + False Negatives)

A recall score of 1 indicates that the classifier correctly predicted (found) all observations of the current class (by implication, no false negatives, or misclassifications of the current class). A recall score of 0 alternatively means that the classifier missed all observations of the current class.

f1-score is the harmonic mean of the precision and recall. The harmonic mean is used here rather than the more conventional arithmetic mean because the harmonic mean is more appropriate for averaging rates.

  F1-Score = 2 * (Precision * Recall) / (Precision + Recall)

The f1-score's best value is 1 and worst value is 0, like the precision and recall scores. It is a useful metric for taking into account both measures at once.

support is simply the number of observations of the labelled class.

Demo: AUC and ROC curves (15 mins)

Open the Jupyter notebook for the Demo to learn about area under the curve (AUC) and the receiver operating characteristic curve (ROC).

Check for understanding:*

  • What are some reasons that you might change the class assignment threshold in a classifier?
  • How does changing the threshold for assigning a class label affect the confusion matrix?
  • How is the ROC related to the confusion matrix?
  • Why is the ROC curve unaffected by the distribution of classes in the data?

Independent Practice: metrics and ROC with housing data (25 minutes)

Practice classification evaluation metrics and plotting ROC curves with the Sacramento housing data. Load the following AUC housing starter code to begin.

Conclusion (5 mins)

  • Review independent practice deliverable(s)
  • Explain key model metrics and how they interact


results matching ""

    No results matching ""