Prepare Your Pipeline#

Once the data has been prepared, the easiest way to run UNIQUE is through the Pipeline object. Pipeline allows you to run the uncertainty quantification benchmark in an end-to-end fashion.

To tell UNIQUE which inputs to use and which UQ methods to evaluate, you need to prepare a configuration file: a yaml file that contains all the specifications needed to set up and run the UNIQUE pipeline.

Available Configuration Arguments#

The Pipeline configuration options can be provided either through a yaml file, which is then loaded using from_config(), or directly to the Pipeline object at initialization.
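As a rough illustration of the yaml route, the sketch below wraps the load-and-run steps in a small function. Only `from_config()` is named on this page; the top-level import path, the `fit()` call, and the `run_unique` helper itself are assumptions made for this example, so check the Usage page before relying on them.

```python
def run_unique(config_path: str):
    """Load a UNIQUE pipeline from a yaml file and run it (hypothetical helper).

    The import is deferred to call time so that this function can be defined
    even on a machine where UNIQUE is not installed.
    """
    # Assumption: `Pipeline` is importable from the top-level `unique` package.
    from unique import Pipeline

    # `from_config()` is the loader mentioned in the text above.
    pipeline = Pipeline.from_config(config_path)

    # Assumption: `fit()` runs the benchmark end to end. The exact return
    # values are not described on this page, so they are passed through as-is.
    return pipeline.fit()
```

Calling `run_unique("/path/to/config.yaml")` would then run the benchmark with the settings stored in that file.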

Full list of available `Pipeline` configuration arguments.#
Argument Name	Type	Description	Additional Information
`data_path`	`str`	Full path to the prepared input dataset. Supported formats: `csv`, `json`, `pkl`.	Check out Prepare Your Data for more details.
`output_path`	`str`	Full path to the output directory where to save `UNIQUE`’s outputs (figures, tables, models, etc.).	Check out Usage and the Examples for more details.
`id_column_name`	`str`	Name of the column containing the unique data IDs. Use `"index"` if there are no such identifiers or column in your dataset.	Check out Prepare Your Data for more details.
`labels_column_name`	`str`	Name of the column containing the target labels/values used to train your predictive model.
`predictions_column_name`	`str`	Name of the column containing the point-wise, final predictions from your predictive model.
`which_set_column_name`	`str`	Name of the column specifying which subset each datapoint belongs to. Each value must be one of: `"TRAIN"`, `"TEST"`, `"CALIBRATION"`.	Check out Prepare Your Data for more details.
`model_name`	`str`	Name of your predictive model. Only used for logging purposes.
`problem_type`	`str`	Predictive task of the model. Must be either `"classification"` or `"regression"`.
`mode`	`str`	Modality by which to sum variances and distance-based UQ methods to compute `SumOfVariances`. Allowed modalities: `"compact"`, `"full"`, `"extended"`.	Check out `AnalyticsMode` for more details.
`inputs_list`	`list[UniqueInputType]` or `list[dict[str, Any]]`	List of pre-initialized `UniqueInputType` (if configuring `Pipeline` directly) or a list of the `UniqueInputType` names and their corresponding arguments as dictionaries (if configuring `Pipeline` via a `yaml` file).	Check out Input Types Specification for more details on `UniqueInputType` and Configuration Template for more details on how to populate the inputs list in the `yaml` file.
`error_models_list`	`list[UniqueErrorModel]` or `list[dict[str, Any]]`	List of pre-initialized `UniqueErrorModel` (if configuring `Pipeline` directly) or a list of all the `UniqueErrorModel` names and their corresponding arguments as dictionaries (if configuring `Pipeline` via a `yaml` file).	Check out Error Models and Available Inputs & UQ Methods for more details on available error models.
`individual_plots`	`bool`	Whether to plot each computed UQ method’s evaluation plots. Note: the plots of the overall best UQ methods are always saved (displaying to screen depends on `display_outputs`).
`summary_plots`	`bool`	Whether to plot the summary plots with all UQ methods. Note: the summary plots are always saved (displaying to screen depends on `display_outputs`).
`save_plots`	`bool`	Whether to save the individual plots (if enabled via `individual_plots`).
`evaluate_test_only`	`bool`	Whether to evaluate the UQ methods against the `"TEST"` set only. If `False`, evaluation will be carried out for the `"TRAIN"` and `"CALIBRATION"` subsets as well.	Check out Input Data Preparation for more details.
`display_outputs`	`bool`	Whether to display the enabled plots and output tables to screen. Only works if running `UNIQUE` in a Jupyter Notebook cell.
`n_bootstrap`	`int`	Number of bootstrapping samples to use. Default is 500 (even if `n_bootstrap` is not specified explicitly). Note: bootstrapping for selecting the overall best UQ methods is always run (unless the `Pipeline._bootstrap` is set to `False`).
`verbose`	`bool`	Whether to enable “DEBUG”-level logging verbosity. “INFO”-level messages are always printed to `stdout`.

You can find below a comprehensive template of a typical yaml configuration file for your Pipeline.

Configuration Template#

UNIQUE configuration template#

#######
# I/O #
#######
# Path to the prepared input dataset
data_path: "/path/to/your/input/dataset.[csv,json,pkl]"
# Path to the output folder where to save UNIQUE's outputs
output_path: "/path/to/output/folder"

########
# Data #
########
# Name of the column containing the unique data IDs
id_column_name: "ID"
# Name of the column containing the labels
labels_column_name: "Labels"
# Name of the column containing the original model's predictions
predictions_column_name: "Predictions"
# Name of the column containing the subset specification ("TRAIN", "TEST", "CALIBRATION")
which_set_column_name: "Subset"
# Name of the original model
model_name: "MyModel"
# Specify which task your model solves: either "regression" or "classification"
problem_type: "regression"
# Modality by which to sum variances and distance-based UQ methods. Check `unique.utils.uncertainty_utils.AnalyticsMode` for more details.
mode: "compact"

#############
# UQ Inputs #
#############
# List of UNIQUE InputTypes specifying the column name of the inputs and the UQ methods to compute for each of them (if none are specified, all supported UQ methods for each InputType will be computed)
# Note: it needs to be a list, even if only one input type is specified (note the hyphens)
inputs_list:
  # FeaturesInputType are features that can have `int` or `float` values and can be represented as a single value or grouped as a list/array of features for each datapoint
  - FeaturesInputType:
    # Name of the column containing the features (for example here we assume a single value for each datapoint)
      column_name: "Feature"
    # Only the "manhattan_distance" and "euclidean_distance" UQ methods will be computed for this input (note that they are specified as a list using the hyphen)
      metrics:
      - "manhattan_distance"
      - "euclidean_distance"
  - FeaturesInputType:
    # Name of the column containing the features (for example here we assume an array of features for each datapoint)
      column_name: "FeaturesArray"
    # Only "euclidean_distance" UQ method will be computed for this input (note that you can also specify the methods as a single value - no hyphens here)
      metrics: "euclidean_distance"
  # ModelInputType is the variance of the ensemble's predictions
  - ModelInputType:
    # Name of the column containing the variance
      column_name: "Variance"
    # No methods are specified here, which means that all supported UQ methods for this input type will be computed

###################
# UQ Error Models #
###################
# List of UNIQUE ErrorModels specifying each model's available hyperparameters as keyword arguments
# You can specify as many error models as you want, even the same type but with different hyperparameters (GridSearch is not yet implemented in UNIQUE)
# Note: it needs to be a list, even if only one error model is specified (note the hyphens)
error_models_list:
  # UniqueRandomForestRegressor is a RF regressor trained to predict the error between the original model's predictions and data labels
  - UniqueRandomForestRegressor:
    # All available arguments to the model can be specified here. See each model's documentation for the full list of arguments. If no hyperparameters are specified, UNIQUE will use the default ones
      max_depth: 10
      n_estimators: 500
    # List of error types to use as target values (note the hyphen). For each error type, a separate model will be built to predict it
    # Supported errors are:
    # "l1" (=absolute error), "l2" (squared error), "unsigned"
      error_type_list:
        - "l1"

#######################
# Evaluation Settings #
#######################
# Whether to plot each UQ method's evaluation plots. Note: the plots of the best UQ methods are always saved (displaying depends on `display_outputs`)
individual_plots: false
# Whether to plot the summary plots with all UQ methods. Note: the summary plots are always saved (displaying depends on `display_outputs`)
summary_plots: true
# Whether to save the enabled plots in the output folder
save_plots: false
# Whether to evaluate the UQ methods against the TEST set only. If "False", evaluation will be carried out for "TRAIN" and "CALIBRATION" sets as well
evaluate_test_only: true
# Whether to display the plots to screen. Only works if running in a Jupyter Notebook cell
display_outputs: true
# Number of bootstrapping samples to run. Note: bootstrapping to determine the best UQ metric is ALWAYS run unless the private attribute `Pipeline._bootstrap` is set to False.
n_bootstrap: 500
# Logging messages levels. If True, logger will output DEBUG level messages.
verbose: false

Tip

Copy and save the above template as a yaml file to use in your project.
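Before saving a customized copy of the template, it can help to sanity-check it. The sketch below mirrors part of the template as a plain Python dictionary and checks it against the allowed values listed in the arguments table. The `check_config` helper and the choice of which keys count as required are hypothetical and not part of UNIQUE; UNIQUE may perform its own, different validation.

```python
from typing import Any

# Allowed values as listed in the configuration arguments table.
ALLOWED_PROBLEM_TYPES = {"classification", "regression"}
ALLOWED_MODES = {"compact", "full", "extended"}

# Hypothetical choice of which keys this sketch treats as mandatory.
REQUIRED_KEYS = (
    "data_path", "output_path", "id_column_name", "labels_column_name",
    "predictions_column_name", "which_set_column_name", "problem_type",
    "inputs_list",
)


def check_config(config: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]

    problem_type = config.get("problem_type")
    if problem_type is not None and problem_type not in ALLOWED_PROBLEM_TYPES:
        problems.append(f"unsupported problem_type: {problem_type!r}")

    mode = config.get("mode")
    if mode is not None and mode not in ALLOWED_MODES:
        problems.append(f"unsupported mode: {mode!r}")

    # The template asks for lists even when there is a single entry, and each
    # entry maps one type name to its arguments.
    for key in ("inputs_list", "error_models_list"):
        value = config.get(key)
        if value is None:
            continue
        if not isinstance(value, list):
            problems.append(f"{key} must be a list")
            continue
        for item in value:
            if not (isinstance(item, dict) and len(item) == 1):
                problems.append(f"{key} entries must be single-key mappings")
    return problems


# A dictionary mirroring part of the template above.
config = {
    "data_path": "/path/to/your/input/dataset.[csv,json,pkl]",
    "output_path": "/path/to/output/folder",
    "id_column_name": "ID",
    "labels_column_name": "Labels",
    "predictions_column_name": "Predictions",
    "which_set_column_name": "Subset",
    "model_name": "MyModel",
    "problem_type": "regression",
    "mode": "compact",
    "inputs_list": [
        {"FeaturesInputType": {"column_name": "Feature",
                               "metrics": ["manhattan_distance", "euclidean_distance"]}},
        {"ModelInputType": {"column_name": "Variance"}},
    ],
    "error_models_list": [
        {"UniqueRandomForestRegressor": {"max_depth": 10, "n_estimators": 500,
                                         "error_type_list": ["l1"]}},
    ],
}

print(check_config(config))
```

The helper returns a list of problems instead of raising on the first one, so that every issue in a configuration can be reported in a single pass.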

Note

Currently supported UQ methods (metrics argument) for FeaturesInputType inputs:

manhattan_distance
euclidean_distance
tanimoto_distance (for integer-only inputs)
gaussian_euclidean_kde
gaussian_manhattan_kde
exponential_manhattan_kde

Currently supported UQ methods (metrics argument) for ModelInputType inputs:

ensemble_variance (for regression tasks only)
probability (for classification tasks only)
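Because `metrics` can be given either as a single value or as a list, and may be omitted entirely, a small normalization step can catch misspelled method names before a run. The following is a minimal sketch using the method names listed above; `normalize_metrics` is a hypothetical helper, not a UNIQUE function.

```python
from typing import Optional, Union

# Method names copied from the lists above.
SUPPORTED_METRICS = {
    "FeaturesInputType": {
        "manhattan_distance",
        "euclidean_distance",
        "tanimoto_distance",
        "gaussian_euclidean_kde",
        "gaussian_manhattan_kde",
        "exponential_manhattan_kde",
    },
    "ModelInputType": {"ensemble_variance", "probability"},
}


def normalize_metrics(
    input_type: str, metrics: Optional[Union[str, list[str]]] = None
) -> Optional[list[str]]:
    """Return `metrics` as a list, or None when no methods were specified.

    None stands for "compute all supported UQ methods for this input type",
    as described in the configuration template.
    """
    if metrics is None:
        return None
    # Accept both a single value and a list of values.
    names = [metrics] if isinstance(metrics, str) else list(metrics)
    unknown = [name for name in names if name not in SUPPORTED_METRICS[input_type]]
    if unknown:
        raise ValueError(f"unsupported metrics for {input_type}: {unknown}")
    return names


print(normalize_metrics("FeaturesInputType", "euclidean_distance"))
```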

Currently supported error types (error_type_list argument) for UniqueErrorModel error models:

"L1" error (defined as \(|y - \hat{y}|\));
"L2" error (defined as \((y - \hat{y})^2\));
"unsigned" error (defined as \(y - \hat{y}\));

where \(y = label\) and \(\hat{y} = prediction\).
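The three definitions above can be written out directly. This is a minimal sketch for a single datapoint; the `prediction_errors` function is illustrative only and is not how UNIQUE computes its error-model targets internally.

```python
def prediction_errors(label: float, prediction: float) -> dict[str, float]:
    """Compute the three error types for one label/prediction pair."""
    diff = label - prediction  # y - y_hat
    return {
        "l1": abs(diff),    # absolute error, |y - y_hat|
        "l2": diff ** 2,    # squared error, (y - y_hat)^2
        "unsigned": diff,   # plain difference, y - y_hat, as defined above
    }


print(prediction_errors(label=3.0, prediction=5.0))
```

The dictionary keys follow the lowercase spelling used for `error_type_list` in the configuration template.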

Deprecated since version 0.1.0: For ModelInputType inputs, it is no longer necessary to specify the metrics argument.

UNIQUE automatically detects whether to treat the model-based inputs as regression-based predictions/ensemble variance or classification-based probabilities depending on the specified problem_type.
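Conceptually, this detection amounts to a lookup from `problem_type` to the model-based UQ method listed earlier for that task. The sketch below illustrates the idea with a hypothetical `model_based_method` helper; the actual logic inside UNIQUE may be organized differently.

```python
def model_based_method(problem_type: str) -> str:
    """Map a problem type to the model-based UQ method listed for it."""
    mapping = {
        "regression": "ensemble_variance",  # listed as regression-only
        "classification": "probability",    # listed as classification-only
    }
    try:
        return mapping[problem_type]
    except KeyError:
        raise ValueError(
            f"problem_type must be 'classification' or 'regression', got {problem_type!r}"
        ) from None


print(model_based_method("regression"))
```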
