Prepare Your Dataset#
Input Data Preparation#
To use UNIQUE, you only need to provide a dataframe containing, at a minimum, the following columns:
| Column | Description |
|---|---|
| IDs | Column containing the unique IDs of the datapoints (can be the dataframe's index). |
| Labels | Column containing the target labels/values associated with each datapoint. |
| Predictions | Column containing the trained ML model's predictions (intended as the final model's, or ensemble of models', single-value output, to be compared with the corresponding label). |
| Subset | Column containing the specification of which subset each datapoint belongs to. The allowed subsets are: TRAIN, TEST, and CALIBRATION. |
Caution
Make sure to use exactly TRAIN, TEST, and CALIBRATION (all upper-case), as these values are hard-coded.
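For instance, a minimal input dataframe with only the required columns could be assembled as follows. This is a sketch: the column names used here (`ID`, `Labels`, `Predictions`, `Subset`) are illustrative placeholders, since the actual names are declared when configuring UNIQUE.

```python
import pandas as pd

# Minimal input dataframe with the four required columns.
# The column names below are illustrative; you declare the actual names
# in your UNIQUE configuration.
df = pd.DataFrame(
    {
        "ID": [1, 2, 3],                              # unique datapoint IDs
        "Labels": [0.12, 0.43, 4.78],                 # target labels/values
        "Predictions": [0.17, 0.87, 5.62],            # final single-value model predictions
        "Subset": ["TRAIN", "TEST", "CALIBRATION"],   # must be exactly these upper-case values
    }
)
```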
Input Types Specification#
Depending on the UQ methods you want to evaluate or use, you can add the following inputs:
| Data Features | Model Outputs |
|---|---|
| Column(s) containing the feature(s) of each datapoint - e.g., the ones used for training the original ML model. | Column(s) containing output(s) related to the original predictive model. |
In UNIQUE, data-based and model-based features are represented by the FeaturesInputType and ModelInputType classes, respectively.
See also
Check out Available Configuration Arguments for more details about how to specify your inputs to UNIQUE.
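As a purely illustrative sketch, declaring a data-based feature and a model-based output could look roughly like the following. The import path and the argument name used below are assumptions, not the documented API; refer to Available Configuration Arguments for the actual signatures of FeaturesInputType and ModelInputType.

```python
# Hypothetical sketch: the import path and the "column_name" argument are
# assumptions; check "Available Configuration Arguments" for the real
# FeaturesInputType and ModelInputType signatures.
from unique.input_type import FeaturesInputType, ModelInputType  # import path is an assumption

data_feature = FeaturesInputType(column_name="Data Feature #1")   # data-based feature column
model_output = ModelInputType(column_name="Ensemble Variance")    # model-based output column
```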
Schematic Example#
For example, an input dataset to UNIQUE could look like this:
| ID | Labels | Predictions | Subset | Data Feature #1 | Data Feature #2 | Ensemble Predictions | Ensemble Variance |
|---|---|---|---|---|---|---|---|
| 1 | 0.12 | 0.17 | TRAIN | 45 | [65,12,0.12,True,…] | [0.10,0.12,0.07,0.25,…] | 0.02 |
| 2 | 0.43 | 0.87 | TEST | 36 | [90,124,15.63,True,…] | [0.43,1.52,0.23,0.45,…] | 0.13 |
| 3 | 4.78 | 5.62 | CALIBRATION | 8 | [0.9,83,-0.4,False,…] | [1.87,7.92,4.32,5.08,…] | 0.81 |
| … | … | … | … | … | […] | […] | … |
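A dataframe matching the schematic example above can be built directly with pandas; the list-valued columns (Data Feature #2, Ensemble Predictions) are simply Python lists stored in a single column:

```python
import pandas as pd

# Reproduce the schematic example: scalar columns plus columns holding a
# list per datapoint (e.g., per-model ensemble predictions).
df = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "Labels": [0.12, 0.43, 4.78],
        "Predictions": [0.17, 0.87, 5.62],
        "Subset": ["TRAIN", "TEST", "CALIBRATION"],
        "Data Feature #1": [45, 36, 8],
        "Data Feature #2": [
            [65, 12, 0.12, True],
            [90, 124, 15.63, True],
            [0.9, 83, -0.4, False],
        ],
        "Ensemble Predictions": [
            [0.10, 0.12, 0.07, 0.25],
            [0.43, 1.52, 0.23, 0.45],
            [1.87, 7.92, 4.32, 5.08],
        ],
        "Ensemble Variance": [0.02, 0.13, 0.81],
    }
)
```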
Tip
When storing long arrays/lists in a single pd.DataFrame column, saving and reloading the dataframe as a csv file will cause issues: in the csv format each array is saved as a string, and it is truncated with ellipses if it exceeds a certain length (typically > 1000 elements per array), making it impossible to correctly parse the original array when loading the csv file again.
To overcome this, consider dumping the input dataframe as a json or pickle file - e.g., with pd.DataFrame.to_json or pd.DataFrame.to_pickle - neither of which suffers from this issue.
UNIQUE supports input dataframes in csv, json, and pickle formats.
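For example, a json or pickle round-trip preserves list-valued columns intact. This is a sketch using standard pandas I/O on the dataframe from the example above; the file names and `orient="records"` are just one convenient choice:

```python
import pandas as pd

# json round-trip: list-valued columns survive intact, unlike with csv.
df.to_json("input_data.json", orient="records")
df_reloaded = pd.read_json("input_data.json", orient="records")

# Alternatively, pickle (only load pickle files you trust):
df.to_pickle("input_data.pkl")
df_reloaded = pd.read_pickle("input_data.pkl")
```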
Caution
Only unpickle files you trust.
Check out the pickle module docs for more information, and consider safer serialization formats such as json.
See also
Check Examples for some practical, hands-on tutorials on data preparation for UNIQUE.