scikit-learn: ROC curve with confidence intervals

I am able to get a ROC curve using scikit-learn with
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_pred, pos_label=1), where y_true is a list of values based on my gold standard (i.e., 0 for negative and 1 for positive cases) and y_pred is a corresponding list of scores (e.g., 0.053497243, 0.008521122, 0.022781548, 0.101885263, 0.012913795, 0.0, 0.042881547 [...])
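For reference, the call described above can be reproduced with a few lines. This is only a sketch: the four labels and scores below are made-up illustration values, not the asker's data.

```python
import numpy as np
from sklearn import metrics

# Made-up illustration data (an assumption, not the scores from the question).
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve returns one (fpr, tpr) point per score threshold.
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_pred, pos_label=1)

# Area under the curve by the trapezoidal rule.
auc = metrics.auc(fpr, tpr)
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC: {:0.3f}".format(auc))
```

This yields a single point estimate of the curve and its area; nothing in this call quantifies the uncertainty, which is what the question asks about.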

I am trying to figure out how to add confidence intervals to that curve, but I cannot find an easy way to do it with sklearn.

Thanks for the edits! I am a new user and will try to use the right formatting in the future.

Source: user2836189 | 2013-10-01


You can bootstrap the ROC computations: sample with replacement new versions of y_true / y_pred out of the original y_true / y_pred, recompute roc_curve each time, and estimate a confidence interval from the resulting distribution.

To take the variability induced by the train/test split into account, you can also run the ShuffleSplit CV iterator many times, fit a model on each train split, and generate y_pred for each model. This gives you an empirical distribution of roc_curves as well, from which you can finally compute confidence intervals.
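The ShuffleSplit idea could look roughly like the sketch below. The synthetic dataset from make_classification and the LogisticRegression model are placeholders chosen for illustration, and the sketch summarizes each split by its ROC AUC rather than by the full curve.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ShuffleSplit

# Placeholder data and model (assumptions); substitute your own X, y and estimator.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

splitter = ShuffleSplit(n_splits=50, test_size=0.25, random_state=0)
aucs = []
for train_idx, test_idx in splitter.split(X):
    # Fit a fresh model on each random train split.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # Score the held-out split with the predicted probability of class 1.
    scores = model.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))

aucs = np.array(aucs)
# Percentile interval over the empirical distribution of AUCs.
lower, upper = np.percentile(aucs, [2.5, 97.5])
print("Mean AUC {:0.3f}, 95% interval [{:0.3f}, {:0.3f}]".format(
    aucs.mean(), lower, upper))
```

With a small or imbalanced dataset a test split can end up with a single class, in which case roc_auc_score raises an error; StratifiedShuffleSplit is one way to avoid that.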

Edit: bootstrapping in Python

Here is an example of bootstrapping the ROC AUC score from the predictions of a single model. I chose to bootstrap the ROC AUC to keep this easy to follow as a Stack Overflow answer, but it can be adapted to bootstrap the whole curve instead:
```
import numpy as np
from sklearn.metrics import roc_auc_score

y_pred = np.array([0.21, 0.32, 0.63, 0.35, 0.92, 0.79, 0.82, 0.99, 0.04])
y_true = np.array([0,    1,    0,    0,    1,    1,    0,    1,    0   ])

print("Original ROC area: {:0.3f}".format(roc_auc_score(y_true, y_pred)))

n_bootstraps = 1000
rng_seed = 42  # control reproducibility
bootstrapped_scores = []

rng = np.random.RandomState(rng_seed)
for i in range(n_bootstraps):
    # bootstrap by sampling with replacement on the prediction indices
    indices = rng.randint(0, len(y_pred), len(y_pred))
    if len(np.unique(y_true[indices])) < 2:
        # We need at least one positive and one negative sample for ROC AUC
        # to be defined: reject the sample
        continue

    score = roc_auc_score(y_true[indices], y_pred[indices])
    bootstrapped_scores.append(score)
    print("Bootstrap #{} ROC area: {:0.3f}".format(i + 1, score))
```
You can see that we need to reject some invalid resamples. However, on real data with many predictions this is a very rare event and should not affect the confidence interval significantly (you can vary rng_seed to check).

Here is the histogram:
```
import matplotlib.pyplot as plt
plt.hist(bootstrapped_scores, bins=50)
plt.title('Histogram of the bootstrapped ROC AUC scores')
plt.show()
```
Note that the resampled scores are censored to the [0 - 1] range, which causes a high number of scores in the last bin.

To get a confidence interval, one can sort the samples:
```
sorted_scores = np.array(bootstrapped_scores)
sorted_scores.sort()

# Computing the lower and upper bound of the 90% confidence interval
# You can change the bounds percentiles to 0.025 and 0.975 to get
# a 95% confidence interval instead.
confidence_lower = sorted_scores[int(0.05 * len(sorted_scores))]
confidence_upper = sorted_scores[int(0.95 * len(sorted_scores))]
print("Confidence interval for the score: [{:0.3f} - {:0.3}]".format(
    confidence_lower, confidence_upper))
```
gives:
```
Confidence interval for the score: [0.444 - 1.0]
```
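The same bounds can also be read off with np.percentile instead of sorting and indexing by hand. The sketch below repeats the bootstrap loop so that it runs on its own. Because np.percentile interpolates linearly between order statistics by default, its bounds can differ slightly from the index-based ones.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_pred = np.array([0.21, 0.32, 0.63, 0.35, 0.92, 0.79, 0.82, 0.99, 0.04])
y_true = np.array([0,    1,    0,    0,    1,    1,    0,    1,    0   ])

rng = np.random.RandomState(42)
bootstrapped_scores = []
for _ in range(1000):
    indices = rng.randint(0, len(y_pred), len(y_pred))
    if len(np.unique(y_true[indices])) < 2:
        # ROC AUC is undefined with a single class: reject the resample
        continue
    bootstrapped_scores.append(roc_auc_score(y_true[indices], y_pred[indices]))

# 5th and 95th percentiles bound a 90% interval;
# use [2.5, 97.5] for a 95% interval instead.
confidence_lower, confidence_upper = np.percentile(bootstrapped_scores, [5, 95])
print("Confidence interval for the score: [{:0.3f} - {:0.3f}]".format(
    confidence_lower, confidence_upper))
```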
The confidence interval is very wide, but this is probably a consequence of my choice of predictions (3 mistakes out of 9 predictions) and of the total number of predictions being quite small.

Another remark on the plot: the scores are quantized (many empty histogram bins). This is a consequence of the small number of predictions. One could add a bit of Gaussian noise to the scores (or to the y_pred values) to smooth the distribution and make the histogram look better, but then the choice of the smoothing bandwidth is tricky.

Finally, as stated earlier, this confidence interval is specific to your training set. To get a better estimate of the variability of the ROC induced by your model class and parameters, you should do iterated cross-validation instead. However, this is often much more costly, as you need to train a new model for each random train / test split.
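As noted earlier in this answer, the bootstrap can be adapted to cover the whole curve rather than only the AUC. The following is a minimal sketch of one way to do that, not a built-in scikit-learn feature: each resampled curve is interpolated onto a shared FPR grid (fpr_grid, a name introduced here), and pointwise percentiles of the TPR give a band around the curve.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_pred = np.array([0.21, 0.32, 0.63, 0.35, 0.92, 0.79, 0.82, 0.99, 0.04])
y_true = np.array([0,    1,    0,    0,    1,    1,    0,    1,    0   ])

rng = np.random.RandomState(42)
fpr_grid = np.linspace(0.0, 1.0, 101)  # common x-axis for all resampled curves
tpr_samples = []
for _ in range(1000):
    indices = rng.randint(0, len(y_pred), len(y_pred))
    if len(np.unique(y_true[indices])) < 2:
        # The ROC curve needs both classes: reject the resample
        continue
    fpr, tpr, thresholds = roc_curve(y_true[indices], y_pred[indices])
    # Interpolate this curve onto the shared grid and pin its start at (0, 0)
    tpr_interp = np.interp(fpr_grid, fpr, tpr)
    tpr_interp[0] = 0.0
    tpr_samples.append(tpr_interp)

tpr_samples = np.array(tpr_samples)
# Pointwise 90% band: 5th and 95th percentile of TPR at each grid FPR
tpr_lower, tpr_upper = np.percentile(tpr_samples, [5, 95], axis=0)
tpr_median = np.median(tpr_samples, axis=0)
```

The band can then be drawn with plt.fill_between(fpr_grid, tpr_lower, tpr_upper, alpha=0.3). Note that this is a pointwise band: each FPR value is covered separately, which is weaker than a simultaneous band for the whole curve.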
- Thanks for the reply. I guess I was hoping to find the equivalent of pROC and have something ready to use. I know about bootstrapping, I just don't know how to do it in practice in Python (although there are probably tutorials), so I was hoping there was a built-in somewhere.
- Bootstrapping is trivial to implement, for instance with numpy.random.random_integers to sample with replacement. I edited the answer.
- I edited my answer as the original had a mistake.
- Edited to use 'randint' instead of 'random_integers', as the latter is deprecated (and prints 1000 deprecation warnings in Jupyter).
- Thanks @WaylonFlinn.
Source: ogrisel

DeLong solution
[NO bootstrapping]

As some here suggested, a pROC approach would be nice. According to the pROC documentation, confidence intervals are calculated via DeLong:

DeLong is an asymptotically exact method to evaluate the uncertainty
of an AUC (DeLong et al. (1988)). Since version 1.9, pROC uses the
algorithm proposed by Sun and Xu (2014), which has an O(N log N)
complexity and is always faster than bootstrapping. By default, pROC
will choose the DeLong method whenever possible.

Yandex Data School has a fast DeLong implementation in their public repo:

https://github.com/yandexdataschool/roc_comparison

So all credit goes to them for the DeLong implementation used in this example.
Here is how you get a CI via DeLong:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue Nov  6 10:06:52 2018

@author: yandexdataschool

Original Code found in:
https://github.com/yandexdataschool/roc_comparison

updated: Raul Sanchez-Vazquez
"""

import numpy as np
import scipy.stats
from scipy import stats

# AUC comparison adapted from
# https://github.com/Netflix/vmaf/
def compute_midrank(x):
    """Computes midranks.
    Args:
       x - a 1D numpy array
    Returns:
       array of midranks
    """
    J = np.argsort(x)
    Z = x[J]
    N = len(x)
    T = np.zeros(N, dtype=float)
    i = 0
    while i < N:
        j = i
        while j < N and Z[j] == Z[i]:
            j += 1
        T[i:j] = 0.5*(i + j - 1)
        i = j
    T2 = np.empty(N, dtype=float)
    # Note(kazeevn) +1 is due to Python using 0-based indexing
    # instead of 1-based in the AUC formula in the paper
    T2[J] = T + 1
    return T2


def compute_midrank_weight(x, sample_weight):
    """Computes midranks.
    Args:
       x - a 1D numpy array
       sample_weight - a 1D numpy array of per-sample weights, same length as x
    Returns:
       array of midranks
    """
    J = np.argsort(x)
    Z = x[J]
    cumulative_weight = np.cumsum(sample_weight[J])
    N = len(x)
    T = np.zeros(N, dtype=float)
    i = 0
    while i < N:
        j = i
        while j < N and Z[j] == Z[i]:
            j += 1
        T[i:j] = cumulative_weight[i:j].mean()
        i = j
    T2 = np.empty(N, dtype=float)
    T2[J] = T
    return T2


def fastDeLong(predictions_sorted_transposed, label_1_count, sample_weight):
    if sample_weight is None:
        return fastDeLong_no_weights(predictions_sorted_transposed, label_1_count)
    else:
        return fastDeLong_weights(predictions_sorted_transposed, label_1_count, sample_weight)


def fastDeLong_weights(predictions_sorted_transposed, label_1_count, sample_weight):
    """
    The fast version of DeLong's method for computing the covariance of
    unadjusted AUC.
    Args:
       predictions_sorted_transposed: a 2D numpy.array[n_classifiers, n_examples]
          sorted such as the examples with label "1" are first
    Returns:
       (AUC value, DeLong covariance)
    Reference:
     @article{sun2014fast,
       title={Fast Implementation of DeLong's Algorithm for
              Comparing the Areas Under Correlated Receiver Operating Characteristic Curves},
       author={Xu Sun and Weichao Xu},
       journal={IEEE Signal Processing Letters},
       volume={21},
       number={11},
       pages={1389--1393},
       year={2014},
       publisher={IEEE}
     }
    """
    # Short variables are named as they are in the paper
    m = label_1_count
    n = predictions_sorted_transposed.shape[1] - m
    positive_examples = predictions_sorted_transposed[:, :m]
    negative_examples = predictions_sorted_transposed[:, m:]
    k = predictions_sorted_transposed.shape[0]

    tx = np.empty([k, m], dtype=float)
    ty = np.empty([k, n], dtype=float)
    tz = np.empty([k, m + n], dtype=float)
    for r in range(k):
        tx[r, :] = compute_midrank_weight(positive_examples[r, :], sample_weight[:m])
        ty[r, :] = compute_midrank_weight(negative_examples[r, :], sample_weight[m:])
        tz[r, :] = compute_midrank_weight(predictions_sorted_transposed[r, :], sample_weight)
    total_positive_weights = sample_weight[:m].sum()
    total_negative_weights = sample_weight[m:].sum()
    pair_weights = np.dot(sample_weight[:m, np.newaxis], sample_weight[np.newaxis, m:])
    total_pair_weights = pair_weights.sum()
    aucs = (sample_weight[:m]*(tz[:, :m] - tx)).sum(axis=1) / total_pair_weights
    v01 = (tz[:, :m] - tx[:, :]) / total_negative_weights
    v10 = 1. - (tz[:, m:] - ty[:, :]) / total_positive_weights
    sx = np.cov(v01)
    sy = np.cov(v10)
    delongcov = sx / m + sy / n
    return aucs, delongcov


def fastDeLong_no_weights(predictions_sorted_transposed, label_1_count):
    """
    The fast version of DeLong's method for computing the covariance of
    unadjusted AUC.
    Args:
       predictions_sorted_transposed: a 2D numpy.array[n_classifiers, n_examples]
          sorted such as the examples with label "1" are first
    Returns:
       (AUC value, DeLong covariance)
    Reference:
     @article{sun2014fast,
       title={Fast Implementation of DeLong's Algorithm for
              Comparing the Areas Under Correlated Receiver Operating
              Characteristic Curves},
       author={Xu Sun and Weichao Xu},
       journal={IEEE Signal Processing Letters},
       volume={21},
       number={11},
       pages={1389--1393},
       year={2014},
       publisher={IEEE}
     }
    """
    # Short variables are named as they are in the paper
    m = label_1_count
    n = predictions_sorted_transposed.shape[1] - m
    positive_examples = predictions_sorted_transposed[:, :m]
    negative_examples = predictions_sorted_transposed[:, m:]
    k = predictions_sorted_transposed.shape[0]

    tx = np.empty([k, m], dtype=float)
    ty = np.empty([k, n], dtype=float)
    tz = np.empty([k, m + n], dtype=float)
    for r in range(k):
        tx[r, :] = compute_midrank(positive_examples[r, :])
        ty[r, :] = compute_midrank(negative_examples[r, :])
        tz[r, :] = compute_midrank(predictions_sorted_transposed[r, :])
    aucs = tz[:, :m].sum(axis=1) / m / n - float(m + 1.0) / 2.0 / n
    v01 = (tz[:, :m] - tx[:, :]) / n
    v10 = 1.0 - (tz[:, m:] - ty[:, :]) / m
    sx = np.cov(v01)
    sy = np.cov(v10)
    delongcov = sx / m + sy / n
    return aucs, delongcov


def calc_pvalue(aucs, sigma):
    """Computes log(10) of p-values.
    Args:
       aucs: 1D array of AUCs
       sigma: AUC DeLong covariances
    Returns:
       log10(pvalue)
    """
    l = np.array([[1, -1]])
    z = np.abs(np.diff(aucs)) / np.sqrt(np.dot(np.dot(l, sigma), l.T))
    return np.log10(2) + scipy.stats.norm.logsf(z, loc=0, scale=1) / np.log(10)


def compute_ground_truth_statistics(ground_truth, sample_weight):
    assert np.array_equal(np.unique(ground_truth), [0, 1])
    order = (-ground_truth).argsort()
    label_1_count = int(ground_truth.sum())
    if sample_weight is None:
        ordered_sample_weight = None
    else:
        ordered_sample_weight = sample_weight[order]

    return order, label_1_count, ordered_sample_weight


def delong_roc_variance(ground_truth, predictions, sample_weight=None):
    """
    Computes ROC AUC variance for a single set of predictions
    Args:
       ground_truth: np.array of 0 and 1
       predictions: np.array of floats of the probability of being class 1
    """
    order, label_1_count, ordered_sample_weight = compute_ground_truth_statistics(
        ground_truth, sample_weight)
    predictions_sorted_transposed = predictions[np.newaxis, order]
    aucs, delongcov = fastDeLong(predictions_sorted_transposed, label_1_count, ordered_sample_weight)
    assert len(aucs) == 1, "There is a bug in the code, please forward this to the developers"
    return aucs[0], delongcov


alpha = .95
y_pred = np.array([0.21, 0.32, 0.63, 0.35, 0.92, 0.79, 0.82, 0.99, 0.04])
y_true = np.array([0,    1,    0,    0,    1,    1,    0,    1,    0   ])

auc, auc_cov = delong_roc_variance(
    y_true,
    y_pred)

auc_std = np.sqrt(auc_cov)
lower_upper_q = np.abs(np.array([0, 1]) - (1 - alpha) / 2)

ci = stats.norm.ppf(
    lower_upper_q,
    loc=auc,
    scale=auc_std)

ci[ci > 1] = 1

print('AUC:', auc)
print('AUC COV:', auc_cov)
print('95% AUC CI:', ci)

Output:

AUC: 0.8
AUC COV: 0.028749999999999998
95% AUC CI: [0.46767194, 1.]
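
The interval is a normal approximation: the AUC plus or minus the standard normal quantile times the square root of the DeLong variance, with the upper bound capped at 1. As a sanity check, the sketch below recomputes it from the two printed numbers using only the standard library; clipping the lower bound at 0 is a small addition that the code above does not make.

```python
from math import sqrt
from statistics import NormalDist

# Values printed by the DeLong example above
auc = 0.8
auc_var = 0.028749999999999998
alpha = 0.95

# Two-sided quantile of the standard normal, about 1.96 for a 95% interval
z = NormalDist().inv_cdf(1 - (1 - alpha) / 2)
half_width = z * sqrt(auc_var)

# Clip to [0, 1], since an AUC cannot leave that range
ci_lower = max(0.0, auc - half_width)
ci_upper = min(1.0, auc + half_width)
print("95% AUC CI: [{:0.4f}, {:0.4f}]".format(ci_lower, ci_upper))
```

The unclipped upper bound is above 1, which is why the reported interval ends exactly at 1.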

I also checked that this implementation matches the pROC results obtained from R:

library(pROC)

y_true = c(0,    1,    0,    0,    1,    1,    0,    1,    0)
y_pred = c(0.21, 0.32, 0.63, 0.35, 0.92, 0.79, 0.82, 0.99, 0.04)

# Build a ROC object and compute the AUC
roc = roc(y_true, y_pred)
roc

Output:

Call:
roc.default(response = y_true, predictor = y_pred)

Data: y_pred in 5 controls (y_true 0) < 4 cases (y_true 1).
Area under the curve: 0.8

Then

# Compute the Confidence Interval
ci(roc)

Output:

95% CI: 0.4677-1 (DeLong)

This gave me different results on my data than R's pROC package. Has anyone else verified this?
Could you please provide a reproducible example? I will be more than happy to check whether there is a bug.

Source: Raul
