Evaluation Benchmarks#

Three different evaluation benchmarks are implemented in UNIQUE: ranking-based, proper scoring rules, and calibration-based evaluations.

These benchmarks differ in the particular properties of uncertainty that they evaluate and in how they assess how well the UQ method represents the actual uncertainty of the model - e.g., how well the UQ values correlate with the prediction error.

Evaluation benchmarks in UNIQUE.#

| Benchmark | Description | Reference(s) |
|---|---|---|
| RankingBasedEvaluation | Computes evaluation metrics that rank predictions based on their actual prediction error vs. the computed UQ values. Generally speaking, the higher the (positive) correlation between prediction error and computed UQ values, the better/more confident the model that produced the predictions can be considered. Currently implemented metrics are: AUC Difference, Spearman Rank Correlation Coefficient, Increasing/Decreasing Coefficient, and Performance Drop. For more information about the methods, check out Available Evaluation Metrics. | Inspired by Scalia et al. (2020)[1], Hirschfeld et al. (2020)[2] |
| ProperScoringRulesEvaluation | Computes proper scoring rules to evaluate the quality of predictions. Proper scoring rules are functions that assign a scalar summary measure to the performance of distributional predictions, where the optimal expected score is attained when the predicted distribution exactly matches the target one (also known as minimum contrast estimation). Currently implemented metrics are: Negative Log-Likelihood, Interval Score, Check Score, Continuous Ranked Probability Score, and Brier Score. For more information about the metrics, check out Available Evaluation Metrics. | Gneiting et al. (2007)[3] |
| CalibrationBasedEvaluation | Computes the model’s calibration - i.e., whether the model’s predictions are consistent with the underlying target distribution. Currently implemented metrics are: Mean Absolute Calibration Error and Root Mean Squared Calibration Error. For more information about the metrics, check out Available Evaluation Metrics. | Kuleshov et al. (2018)[4] |
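
As the table above suggests, all three benchmark classes work from the same basic ingredients: the true target values, the model’s point predictions, and one UQ value per prediction (e.g., a predicted standard deviation). The snippet below is a minimal, illustrative setup in plain NumPy - not the UNIQUE API - using hypothetical array names (`y_true`, `y_pred`, `uq`); it also computes the absolute prediction error that the ranking-based benchmark compares against the UQ values.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 500

# Hypothetical toy regression setup: targets, point predictions, and
# one UQ value per prediction (here interpreted as a predicted
# standard deviation).
y_true = rng.normal(size=n)
noise_scale = rng.uniform(0.1, 1.0, size=n)       # heteroscedastic noise level
y_pred = y_true + rng.normal(scale=noise_scale)   # predictions with varying error
uq = noise_scale                                  # an "ideal" UQ estimate, for illustration

# The ranking-based benchmark compares the UQ values against the actual
# prediction error; the proper-scoring-rules and calibration benchmarks
# instead treat (y_pred, uq) as a predictive distribution for y_true.
abs_error = np.abs(y_true - y_pred)
```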

Available Evaluation Metrics#

Below you can find an in-depth guide to the evaluation metrics implemented in UNIQUE for each benchmark; minimal, illustrative code sketches for each family of metrics follow the table.

List of available evaluation metrics in UNIQUE.#

| Evaluation Benchmark | Metric Name | Description | Reference(s) |
|---|---|---|---|
| Ranking-based | AUC Difference | The AUC measures the ranking capabilities of a model. The difference between the AUC computed using predictions ranked by the original model’s performance metric (e.g., true prediction error) and by the computed UQ method measures the ranking goodness of the UQ method. Lower values are better. | Yousef et al. (2004)[5] |
| Ranking-based | Spearman Rank Correlation Coefficient (SRCC) | The SRCC indicates how well the computed UQ method is able to rank the predictions with respect to the original model’s performance metric (e.g., true prediction error). Higher values are better. | Marino et al. (2008)[6] |
| Ranking-based | Increasing/Decreasing Coefficient | A coefficient that measures how “correct” the UQ-based ranking is with respect to the performance metric-based one when binning the ranked predictions (in either increasing or decreasing order) – i.e., the predictions are ranked and binned according to the computed UQ values; the coefficient is then the number of consecutive bins with decreasing performance metric values divided by the number of bins. Higher values are better. | |
| Ranking-based | Performance Drop | The drop in the performance metric’s value between either the highest and lowest UQ-binned predictions, or between the original model’s performance metric on all the predictions and the lowest UQ-binned predictions – i.e., the predictions are ranked and binned according to the computed UQ method; the performance metric is computed for the bins associated with the highest and lowest UQ, and for all the predictions being considered; the score corresponds to the difference in computed performance metrics between the highest and lowest UQ-based bins, and between all data and the lowest UQ-based bin. Higher values are better. | |
| Proper Scoring Rules | Negative Log-Likelihood (NLL) | The NLL assesses how well the predicted probability distribution (i.e., predictions and corresponding computed UQ values) fits the observed data or the error distribution. Lower values are better. | Maddox et al. (2019)[7], Lakshminarayanan et al. (2016)[8], Detlefsen et al. (2019)[9], Pearce et al. (2018)[10] |
| Proper Scoring Rules | Interval Score | The interval score evaluates the sharpness and calibration of a specific prediction interval, rewarding narrow, accurate prediction intervals whilst penalizing prediction intervals that do not cover the observation. Lower values are better. | Gneiting et al. (2007)[3] |
| Proper Scoring Rules | Check Score (or Pinball Loss) | The check score measures the distance between the computed UQ values (and associated predictions), interpreted as prediction quantiles, and the true target values. Lower values are better. | Koenker et al. (1978)[11], Chung et al. (2020)[12] |
| Proper Scoring Rules | Continuous Ranked Probability Score (CRPS) | The CRPS quantifies the difference between the predicted probability distribution (i.e., predictions and computed UQ values) and the observed distribution. Lower values are better. | Matheson et al. (1976)[13] |
| Proper Scoring Rules | Brier Score | The Brier Score estimates the accuracy of probabilistic predictions, computed as the mean squared difference between the predicted probabilities and the actual outcomes. Lower values are better. | Brier et al. (1950)[14] |
| Calibration-based | Mean Absolute Calibration Error (MACE) | The MACE assesses the calibration of the predicted probabilities or intervals by averaging the bin-wise absolute calibration errors between the predicted and observed distributions. Lower values are better. | |
| Calibration-based | Root Mean Squared Calibration Error (RMSCE) | The RMSCE assesses the calibration of the predicted probabilities or intervals by computing the root mean square of the bin-wise calibration errors between the predicted and observed distributions. Lower values are better. | |
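
The sketch below shows one way to compute three of the ranking-based metrics from arrays like those in the first snippet, using the mean absolute error of each bin as the performance metric and equally sized, UQ-ordered bins. It is an illustrative reading of the descriptions in the table, not the exact UNIQUE implementation; the helper name `ranking_based_metrics` and the choice of 10 bins are just for this example, and the AUC Difference is omitted.

```python
import numpy as np
from scipy.stats import spearmanr

def ranking_based_metrics(abs_error, uq, n_bins=10):
    """Illustrative ranking-based metrics, with the per-bin mean
    absolute error standing in for the model's performance metric."""
    abs_error, uq = np.asarray(abs_error), np.asarray(uq)

    # Spearman Rank Correlation Coefficient between UQ values and
    # prediction errors (higher is better: large UQ should flag
    # large-error predictions).
    srcc = spearmanr(uq, abs_error)[0]

    # Rank predictions by decreasing UQ and split them into bins.
    order = np.argsort(uq)[::-1]
    bin_errors = np.array(
        [b.mean() for b in np.array_split(abs_error[order], n_bins)]
    )

    # Decreasing coefficient: number of consecutive bins whose mean error
    # decreases (as UQ decreases), divided by the number of bins
    # (one possible reading of the description above).
    decreasing_coefficient = np.sum(np.diff(bin_errors) < 0) / n_bins

    # Performance Drop: difference in mean error between the highest- and
    # lowest-UQ bins, and between all predictions and the lowest-UQ bin.
    drop_high_vs_low = bin_errors[0] - bin_errors[-1]
    drop_all_vs_low = abs_error.mean() - bin_errors[-1]

    return {
        "spearman_rank_correlation": srcc,
        "decreasing_coefficient": decreasing_coefficient,
        "performance_drop_high_vs_low": drop_high_vs_low,
        "performance_drop_all_vs_low": drop_all_vs_low,
    }
```

For a well-behaved UQ method, the Spearman coefficient and both performance drops should be clearly positive.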
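
For the proper scoring rules, a concrete sketch requires choosing a predictive distribution. The hypothetical helper below assumes a Gaussian with mean `y_pred` and standard deviation given by the UQ value, which is one common choice rather than the only possible one. The interval score and the closed-form Gaussian CRPS follow Gneiting et al. (2007)[3]; the Brier score is only noted in a comment, since it applies to probabilistic classification rather than the regression setting used here.

```python
import numpy as np
from scipy.stats import norm

def proper_scoring_rules(y_true, y_pred, sigma, alpha=0.05,
                         taus=(0.05, 0.25, 0.5, 0.75, 0.95)):
    """Illustrative proper scoring rules for a Gaussian predictive
    distribution N(y_pred, sigma^2). Lower values are better for all."""
    y_true, y_pred, sigma = map(np.asarray, (y_true, y_pred, sigma))

    # Negative Log-Likelihood of the observations under the predicted Gaussians.
    nll = -norm.logpdf(y_true, loc=y_pred, scale=sigma).mean()

    # Interval Score for the central (1 - alpha) prediction interval:
    # interval width plus penalties for observations falling outside it.
    lower = norm.ppf(alpha / 2, loc=y_pred, scale=sigma)
    upper = norm.ppf(1 - alpha / 2, loc=y_pred, scale=sigma)
    interval_score = np.mean(
        (upper - lower)
        + (2 / alpha) * (lower - y_true) * (y_true < lower)
        + (2 / alpha) * (y_true - upper) * (y_true > upper)
    )

    # Check score (pinball loss), averaged over a few quantile levels.
    taus = np.asarray(taus)[:, None]
    diff = y_true - norm.ppf(taus, loc=y_pred, scale=sigma)
    check_score = np.mean(np.maximum(taus * diff, (taus - 1) * diff))

    # Closed-form CRPS for a Gaussian predictive distribution.
    z = (y_true - y_pred) / sigma
    crps = np.mean(
        sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))
    )

    # Brier score (classification setting, shown for completeness):
    # brier = np.mean((predicted_probabilities - binary_outcomes) ** 2)

    return {"nll": nll, "interval_score": interval_score,
            "check_score": check_score, "crps": crps}
```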
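
Finally, a sketch of the calibration-based metrics, again assuming a Gaussian predictive distribution: for a grid of expected coverage levels, the observed fraction of targets falling inside the corresponding centred prediction intervals is compared with the expected coverage, and MACE/RMSCE aggregate the resulting calibration errors. This is a common construction for regression calibration curves, not necessarily the exact UNIQUE implementation, and the helper name `calibration_errors` is hypothetical.

```python
import numpy as np
from scipy.stats import norm

def calibration_errors(y_true, y_pred, sigma, n_levels=99):
    """Illustrative MACE / RMSCE: expected vs. observed coverage of
    centred Gaussian prediction intervals over a grid of levels."""
    y_true, y_pred, sigma = map(np.asarray, (y_true, y_pred, sigma))
    expected = np.linspace(0.01, 0.99, n_levels)

    # The centred interval with coverage p under N(y_pred, sigma^2) has
    # half-width z * sigma, where z is the (0.5 + p/2) Gaussian quantile.
    z = norm.ppf(0.5 + expected / 2)
    inside = np.abs(y_true - y_pred)[None, :] <= z[:, None] * sigma[None, :]
    observed = inside.mean(axis=1)

    # Aggregate the per-level calibration errors.
    mace = np.mean(np.abs(observed - expected))
    rmsce = np.sqrt(np.mean((observed - expected) ** 2))
    return {"mace": mace, "rmsce": rmsce}
```

For example, `calibration_errors(y_true, y_pred, uq)` on the toy arrays from the first snippet should yield values close to zero, since the UQ estimate there matches the true noise level; a systematically over- or under-confident UQ method inflates both errors.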

References#