ASHIVNI SHEKHAWAT
Research scientist at Lyft Inc.

ROC curves are amongst the most widely used characterizations of binary classifiers, and a solid understanding of them is essential for the modern data scientist. This post presents a summary of the important aspects of ROC curves.

Note: This post is essentially a partial summary of An introduction to ROC analysis, Fawcett, T., Pattern Recognition Letters, Vol. 27, Issue 8, 2006, pp. 861-874, ISSN 0167-8655. Please read the paper if you have the time; if not, this post tries to summarize the key points.

1. What is a ROC Curve?

ROC stands for Receiver Operating Characteristic. ROC analysis is typically used to understand the performance of binary classifiers, though there exist generalizations to multi-class classifiers. It is a long-standing technique from signal detection theory, used to understand the trade-off between the hit rate and the false alarm rate of a binary classifier. Formally, the ROC is a two-dimensional graph with TPR (true positive rate) on the y-axis and FPR (false positive rate) on the x-axis, where TPR = TP / (TP + FN) and FPR = FP / (FP + TN). Thus, in a sense, it represents the trade-off between the value (true positive rate) of the classifier and its cost (false positive rate). Figure 1 shows the ROC curves for four different models. Since FPR and TPR are both bounded between 0 and 1, the ROC curve lies within the unit square.
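To make the axes concrete, here is a minimal sketch (Python with NumPy; the label vectors are made-up examples, not taken from this post) that computes TPR and FPR for a classifier that outputs hard class labels:

```python
import numpy as np

# Made-up example labels: 1 = positive class, 0 = negative class
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives

tpr = tp / (tp + fn)  # true positive rate (recall)
fpr = fp / (fp + tn)  # false positive rate

print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```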

Figure 1. ROC curves for four different models

1. A. What is AUC-ROC?

AUC-ROC is the net area under the ROC curve. The highest possible AUC is 1.0, and corresponds to a model that can perfectly separate the positive and negative classes. An AUC of 0.5 corresponds to a model that randomly assigns class labels; a model with an AUC below 0.5 performs worse than chance (and its predictions could simply be inverted). The AUC is often used as a scalar metric to compare models, where models with higher AUC are generally considered to be better.
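As a quick illustration, the AUC can be computed directly from class-membership scores; this sketch uses scikit-learn's roc_auc_score on made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up example: true labels and model scores
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3])

print("AUC =", roc_auc_score(y_true, scores))
```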

2. How is the ROC curve traced?

How does one draw the ROC curve for a classification model? There can be two kinds of classification models: a) those that output a class label {P, N}, and b) those that output a class membership score. It is not possible to compute the entire ROC curve for models that only output a class label. Instead, such a model generates just one point on the ROC graph, corresponding to its (FPR, TPR) tuple.
A full ROC curve can be traced for models that output a class membership score. To do so, one chooses a threshold, t, and takes the class membership as class(X) = N if score(X) < t, and class(X) = P otherwise. Thus, for each value of t, one can compute the (FPR, TPR) tuple, and thereby place a point on the ROC graph. The full curve can be traced by varying t from -infinity to +infinity. There are efficient algorithms (i.e. with complexity O(n log(n)), where n is the number of samples) available for tracing such curves; see the paper for details.
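The following sketch traces the curve in exactly this way, by sweeping the threshold over the observed scores. It uses NumPy on synthetic data, and the straightforward per-threshold sweep is chosen for clarity rather than the efficient O(n log(n)) algorithm described in the paper:

```python
import numpy as np

def roc_points(y_true, scores):
    """Trace the ROC curve by sweeping the decision threshold t over the scores."""
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1], [-np.inf]))
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr, tpr = [], []
    for t in thresholds:
        y_pred = (scores >= t).astype(int)  # class(X) = P if score(X) >= t, else N
        tpr.append(np.sum((y_pred == 1) & (y_true == 1)) / n_pos)
        fpr.append(np.sum((y_pred == 1) & (y_true == 0)) / n_neg)
    return np.array(fpr), np.array(tpr)

# Synthetic example: positives tend to receive higher scores than negatives.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
scores = rng.normal(loc=y_true, scale=1.0)

fpr, tpr = roc_points(y_true, scores)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)  # trapezoidal rule
print(f"AUC = {auc:.3f}")
```

The area is accumulated with the trapezoidal rule, which is also how libraries typically report AUC from a finite set of ROC points.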

2. A. ROC for random guessing model

Figure 1 shows that the ROC for a “random guessing” model is the diagonal line of slope 1. Why is that? Consider the TPR of a random guessing model that assigns the positive class with 80% probability. Clearly, it should be 0.8, since 80% of all condition positives will be assigned to the positive class by the model. By the same argument, the FPR should be 0.8 as well, since 80% of the condition negatives will also be assigned to the positive class. Thus, the random guessing model traces the diagonal as the probability of guessing the positive class takes on various values between 0 and 1.
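A quick simulation (NumPy; the sample size and guessing probabilities are arbitrary) confirms that such a model lands near the point (p, p) for any guessing probability p:

```python
import numpy as np

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=100_000)  # arbitrary labels

for p in [0.2, 0.5, 0.8]:
    y_pred = (rng.random(y_true.size) < p).astype(int)  # guess positive with probability p
    tpr = np.mean(y_pred[y_true == 1])  # fraction of positives flagged positive
    fpr = np.mean(y_pred[y_true == 0])  # fraction of negatives flagged positive
    print(f"p = {p}: TPR = {tpr:.3f}, FPR = {fpr:.3f}")
```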

2. B. Special points on the ROC graph

There are three special points on the ROC graph, corresponding to the lower left corner, the upper left corner, and the upper right corner. The lower left corner corresponds to FPR = TPR = 0. Thus, it corresponds to a model that never outputs the positive class (or, equivalently, the threshold t is set above the highest score). Such a model has a zero type I error rate; however, it also has zero recall. At the other extreme are the models that occupy the upper right corner. Such models always output the positive class. Thus, they have perfect recall, but they also have 100% FPR (or type I error rate). In contrast to these two cases are the models that occupy the upper left corner. These models are perfect classifiers, with zero type I and type II error rates; they have a TPR of 1 and an FPR of 0. As a rule of thumb, any model whose ROC curve comes close to the upper left corner is probably a good classification model.

3. Important properties of ROC curves

  1. A perfect classifier has an AUC = 1.0, and an ROC curve that goes from the lower left corner, to the upper left corner, to the upper right corner.
  2. A worthless classifier has an AUC = 0.5, and an ROC curve that is the diagonal line from the lower left corner to the upper right corner.
  3. The ROC curve is always monotone increasing.
  4. The ROC curve is not sensitive to class imbalance or, more generally, to the population prevalence of the positive class.
  5. If the ROC for classifier A is always higher than the ROC for classifier B (i.e. the ROC of classifier A dominates that of classifier B), then A is an unambiguously better classifier than B.
  6. The AUC is also equal to the probability that the classifier will rank a randomly chosen positive sample higher than a randomly chosen negative sample (see the sketch following this list).
  7. If the AUC for classifier A is greater than that of classifier B, then classifier A has better average performance; however, classifier B might still outperform classifier A in certain parts of the (FPR, TPR) space.
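Property 6 can be checked empirically. The following sketch (NumPy and scikit-learn; the labels and scores are synthetic and purely illustrative) estimates the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative sample, and compares it with the AUC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=2_000)
scores = rng.normal(loc=y_true, scale=1.0)  # synthetic scores

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Probability that a random positive scores higher than a random negative
# (ties count as one half, matching the usual AUC convention).
rank_prob = (np.mean(pos[:, None] > neg[None, :])
             + 0.5 * np.mean(pos[:, None] == neg[None, :]))

print(f"P(score_pos > score_neg) = {rank_prob:.4f}")
print(f"AUC                      = {roc_auc_score(y_true, scores):.4f}")
```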

While some of the properties listed above are self-evident, others deserve explanation. Let us go through these one at a time.

3. A. Why is the ROC curve monotone increasing?

The reason is that as the threshold is decreased, more samples are classified as positive, so the numbers of true positives and false positives can only increase, while the total numbers of condition positives and condition negatives remain fixed. Hence both TPR and FPR are non-decreasing as the threshold is lowered, which makes the curve monotone.

3. B. Why is the ROC curve insensitive to class imbalance?

The reason is that class imbalance only affects the relative proportion of positive to negative samples; the ROC curve, however, is a plot of TPR vs FPR, and neither of those quantities depends on that proportion. For instance, the TPR depends only on the fraction of the condition positives that are actually classified as positive by the classifier, and the FPR depends only on the condition negatives.
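To see this in action, here is a small sketch (NumPy and scikit-learn, with synthetic scores) that computes the AUC on a roughly balanced sample and again after discarding 90% of the negatives; the class proportions change drastically, but the AUC stays the same up to sampling noise:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, size=50_000)
scores = rng.normal(loc=y_true, scale=1.0)  # synthetic scores

# Keep all positives but only 10% of the negatives.
keep = (y_true == 1) | (rng.random(y_true.size) < 0.1)

print("AUC, balanced   :", round(roc_auc_score(y_true, scores), 4))
print("AUC, imbalanced :", round(roc_auc_score(y_true[keep], scores[keep]), 4))
```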

4. How is the ROC used?

The ROC curve and the AUC are used to gauge the value of a classifier, and to compare two (or more) classifiers. As mentioned earlier, if the ROC for classifier A is always higher than the ROC for classifier B (i.e. the ROC of classifier A dominates that of classifier B), then A is unambiguously the better classifier. However, this is rarely the case, and we typically have to deal with more ambiguity. This ambiguity can sometimes be resolved if we are only interested in the average performance of a classifier. In such cases, we can use the AUC as a scalar measure of performance: the higher the AUC-ROC, the better the average performance of the classifier.

However, if we are interested in the performance of the classifier in certain specific regions of the (FPR, TPR) space, then the average performance might not be so relevant. For instance, suppose we are interested in an application where the false positives are extremely expensive, and thus we want to keep the FPR below a certain threshold while achieving the highest possible TPR. In such an application, we would prefer Model 3 over Model 2 in figure 1, even though Model 2 might be on average better than Model 3 (i.e. it might have a higher AUC). Such applications are not uncommon. For instance, if the positive class represents fraudulent user behavior, based on which a user account might be suspended, then we do indeed want to keep the FPR below a threshold to avoid inconvenience to legitimate customers.
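Operationally, this amounts to picking an operating point on the ROC curve subject to an FPR budget. A minimal sketch (scikit-learn on synthetic data; the 1% budget is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(3)
y_true = rng.integers(0, 2, size=20_000)
scores = rng.normal(loc=y_true, scale=1.0)  # synthetic scores

fpr, tpr, thresholds = roc_curve(y_true, scores)

max_fpr = 0.01              # e.g. at most 1% of legitimate users flagged
ok = fpr <= max_fpr         # admissible operating points
best = np.argmax(tpr[ok])   # highest TPR among admissible points

print(f"Threshold = {thresholds[ok][best]:.3f}, "
      f"TPR = {tpr[ok][best]:.3f} at FPR = {fpr[ok][best]:.4f}")
```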

5. Measures of variance

In machine learning, as in statistics, no point measure is meaningful without an associated estimate of variance. Though this principle is often ignored in modern ML, we should pay heed to it whenever possible. There are at least two methodologies that one might follow to get an estimate of the variance in the ROC curves, each with a slightly different goal.

5. A. Variability due to fluctuations in data

Suppose we have N sample points {X1, X2, … , XN} that were classified by the model. Then, we can generate an ensemble of ROC curves by bootstrapping the sample points and generating one ROC curve for each bootstrap sample. This process can also be used to generate an ensemble of AUCs, thereby getting a point estimate (the median) as well as the spread (the 2.5th and 97.5th percentiles). These results tell us what range of outcomes we can expect for different draws of the data sample (assuming that the underlying generating process remains fixed).
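A sketch of this data-level bootstrap (NumPy and scikit-learn; the data is synthetic, and the 1,000 resamples and the 2.5th/97.5th percentiles follow the text):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=5_000)
scores = rng.normal(loc=y_true, scale=1.0)  # synthetic scores from a fixed model

n_boot = 1_000
aucs = []
for _ in range(n_boot):
    idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
    if len(np.unique(y_true[idx])) < 2:                   # need both classes present
        continue
    aucs.append(roc_auc_score(y_true[idx], scores[idx]))

lo, med, hi = np.percentile(aucs, [2.5, 50, 97.5])
print(f"AUC median = {med:.3f}, 95% interval = [{lo:.3f}, {hi:.3f}]")
```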

5. B. Variability due to fluctuations in model

An ML model is also not a deterministic object. Any particular model is just one member of an ensemble of models that could have been obtained had the training data been slightly different, or had a slightly different set of variables been selected in a particular tree of a random forest, etc. Thus, one might wish to understand the properties of a whole family of models as opposed to just one model. This is particularly valuable if the training dataset is small. Such a bootstrap can be run if one has access to the training data, as well as the code for generating the trained model. The bootstrap process itself is fairly simple: one trains an ensemble of models by bootstrapping the training data as well as the other sources of randomness in model training (random seeds, etc.), and computes the corresponding ensemble of ROC curves and AUCs. The drawbacks of this process include the added computational and human effort.
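A sketch of this model-level bootstrap is below; scikit-learn's make_classification and RandomForestClassifier stand in for whatever training data and model one actually has, and 50 retrained models is an arbitrary choice:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
aucs = []
for b in range(50):  # ensemble of retrained models
    idx = rng.integers(0, len(y_train), size=len(y_train))  # bootstrap the training data
    model = RandomForestClassifier(n_estimators=100, random_state=b)  # vary the seed too
    model.fit(X_train[idx], y_train[idx])
    scores = model.predict_proba(X_test)[:, 1]
    aucs.append(roc_auc_score(y_test, scores))

lo, med, hi = np.percentile(aucs, [2.5, 50, 97.5])
print(f"AUC median = {med:.3f}, 95% interval = [{lo:.3f}, {hi:.3f}]")
```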