# Prepare Your Dataset
## Input Data Preparation
To use `UNIQUE`, you only need to provide a dataframe containing, at minimum, the following columns (a minimal construction sketch is shown after the table and caution below):
| Column | Description |
|---|---|
| IDs | Column containing the unique IDs of the datapoints (can be the dataframe's index). |
| Labels | Column containing the target labels/values associated with each datapoint. |
| Predictions | Column containing the trained ML model's predictions (intended as the final model's, or ensemble of models', single-value output, to be compared with the corresponding label). |
| Subset | Column containing the specification of which subset each datapoint belongs to. The allowed subsets are: `TRAIN`, `TEST`, and `CALIBRATION`. |
Caution: Make sure to use exactly `TRAIN`, `TEST`, and `CALIBRATION` (all upper-case), as these values are hard-coded.
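As an illustration, a minimal input dataframe with only the required columns could be assembled as follows. This is a hedged sketch using plain pandas: the column names (`ID`, `Labels`, `Predictions`, `Subset`) are placeholders taken from the schematic example further below, and the numeric values are purely illustrative.

```python
import pandas as pd

# Minimal sketch: an input dataframe with only the required columns.
# Column names are illustrative; only the subset values TRAIN/TEST/CALIBRATION
# are hard-coded and must match exactly.
df = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "Labels": [0.12, 0.43, 4.78],
        "Predictions": [0.17, 0.87, 5.62],
        "Subset": ["TRAIN", "TEST", "CALIBRATION"],
    }
)
```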
## Input Types Specification
Depending on the UQ methods one wants to evaluate/use, one can add the following inputs:
| Data Features | Model Outputs |
|---|---|
| Column(s) containing the feature(s) of each datapoint, e.g., the ones used for training the original ML model. | Column(s) containing output(s) related to the original predictive model. |
In `UNIQUE`, data-based and model-based features are represented by the `FeaturesInputType` and `ModelInputType` classes, respectively.
See also: Check out Available Configuration Arguments for more details about how to specify your inputs to `UNIQUE`.
## Schematic Example
For example, an input dataset to `UNIQUE` could look like this (a code sketch for building such a dataframe follows the table):
| ID | Labels | Predictions | Subset | Data Feature #1 | Data Feature #2 | Ensemble Predictions | Ensemble Variance |
|---|---|---|---|---|---|---|---|
| 1 | 0.12 | 0.17 | TRAIN | 45 | [65,12,0.12,True,…] | [0.10,0.12,0.07,0.25,…] | 0.02 |
| 2 | 0.43 | 0.87 | TEST | 36 | [90,124,15.63,True,…] | [0.43,1.52,0.23,0.45,…] | 0.13 |
| 3 | 4.78 | 5.62 | CALIBRATION | 8 | [0.9,83,-0.4,False,…] | [1.87,7.92,4.32,5.08,…] | 0.81 |
| … | … | … | … | … | […] | […] | … |
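As a rough sketch (plain pandas, not a `UNIQUE`-specific API), the table above could be materialized like this. The column names mirror the table headers, the truncated array entries are shortened, and the array-valued columns simply hold Python lists:

```python
import pandas as pd

# Sketch of the schematic example above as a pandas dataframe.
# Array-valued columns ("Data Feature #2", "Ensemble Predictions") hold lists.
df = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "Labels": [0.12, 0.43, 4.78],
        "Predictions": [0.17, 0.87, 5.62],
        "Subset": ["TRAIN", "TEST", "CALIBRATION"],
        "Data Feature #1": [45, 36, 8],
        "Data Feature #2": [
            [65, 12, 0.12, True],
            [90, 124, 15.63, True],
            [0.9, 83, -0.4, False],
        ],
        "Ensemble Predictions": [
            [0.10, 0.12, 0.07, 0.25],
            [0.43, 1.52, 0.23, 0.45],
            [1.87, 7.92, 4.32, 5.08],
        ],
        "Ensemble Variance": [0.02, 0.13, 0.81],
    }
)
```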
Tip: When storing long arrays/lists in a single `pd.DataFrame` column, saving and reloading the dataframe as a `csv` file will cause issues: each array is saved as a string and, beyond a certain length (typically > 1000 elements per array), truncated with ellipses, making it impossible to correctly parse the original array when loading the `csv` file again. To overcome this, consider dumping the input dataframe as a `json` or `pickle` file instead, e.g., with `pd.DataFrame.to_json` or `pd.DataFrame.to_pickle`, which avoids these issues (as sketched below).
`UNIQUE` supports input dataframes in `csv`, `json`, and `pickle` formats.
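For instance, a hedged sketch of a `json` round trip that preserves a list-valued column (the file name `unique_input.json` is just an example):

```python
import pandas as pd

# Sketch: saving/reloading an array-valued column via json keeps the lists intact,
# whereas a csv round trip would store them as (possibly truncated) strings.
df = pd.DataFrame(
    {
        "ID": [1, 2],
        "Ensemble Predictions": [[0.10, 0.12, 0.07], [0.43, 1.52, 0.23]],
    }
)

df.to_json("unique_input.json")            # or: df.to_pickle("unique_input.pkl")
reloaded = pd.read_json("unique_input.json")
print(type(reloaded["Ensemble Predictions"].iloc[0]))  # <class 'list'>
```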
Caution: Only unpickle files you trust. Check out the `pickle` module docs for more information, and consider safer serialization formats such as `json`.
See also: Check Examples for some practical, hands-on tutorials on data preparation for `UNIQUE`.