torchsurv.metrics.brier_score#

Classes

BrierScore([checks])

Compute the Brier Score for survival models.

class torchsurv.metrics.brier_score.BrierScore(checks: bool = True)[source]#

Compute the Brier Score for survival models.

__init__(checks: bool = True)[source]#

Initialize a BrierScore for survival class model evaluation.

Parameters:

checks (bool) – Whether to perform input format checks. Enabling checks can help catch potential issues in the input data. Defaults to True.

Examples

>>> _ = torch.manual_seed(52)
>>> n = 10
>>> time = torch.randint(low=5, high=250, size=(n,)).float()
>>> event = torch.randint(low=0, high=2, size=(n,)).bool()
>>> estimate = torch.rand((n,len(time)))
>>> brier_score = BrierScore()
>>> brier_score(estimate, event, time)
tensor([0.2463, 0.2740, 0.3899, 0.1964, 0.3608, 0.2821, 0.1932, 0.2978, 0.1950,
        0.1668])
>>> brier_score.integral() # integrated brier score
tensor(0.2862)
>>> brier_score.confidence_interval() # default: parametric, two-sided
tensor([[0.1061, 0.0604, 0.2360, 0.0533, 0.1252, 0.0795, 0.0000, 0.1512, 0.0381,
         0.0051],
        [0.3866, 0.4876, 0.5437, 0.3394, 0.5965, 0.4847, 0.4137, 0.4443, 0.3520,
         0.3285]])
>>> brier_score.p_value() # default: bootstrap permutation test, two-sided
tensor([1.0000, 0.7860, 1.0000, 0.3840, 1.0000, 1.0000, 0.3840, 1.0000, 0.7000,
        0.2380])
__call__(estimate: Tensor, event: Tensor, time: Tensor, new_time: Tensor | None = None, weight: Tensor | None = None, weight_new_time: Tensor | None = None, instate: bool = True) Tensor[source]#

Compute the Brier Score.

Parameters:
  • estimate (torch.Tensor) – Estimated probability of remaining event-free (i.e., survival function). Can be of shape = (n_samples, n_samples) if subject-specific survival is evaluated at time, or of shape = (n_samples, n_times) if subject-specific survival is evaluated at new_time.

  • event (torch.Tensor, boolean) – Event indicator of size n_samples (= True if event occured)

  • time (torch.Tensor, float) – Time-to-event or censoring of size n_samples.

  • new_time (torch.Tensor, float, optional) – Time points at which to evaluate the Brier score of size n_times. Defaults to unique time.

  • weight (torch.Tensor, optional) – Optional sample weight evaluated at time of size n_samples. Defaults to 1.

  • weight_new_time (torch.tensor, optional) – Optional sample weight evaluated at new_time of size n_times. Defaults to 1.

Returns:

Brier score evaluated at new_time.

Return type:

torch.Tensor

Note

The function evaluates the time-dependent Brier score at time \(t \in \{t_1, \cdots, t_T\}\) (argument new_time).

For each subject \(i \in \{1, \cdots, N\}\), denote \(X_i\) as the survival time and \(D_i\) as the censoring time. Survival data consist of the event indicator, \(\delta_i=(X_i\leq D_i)\) (argument event) and the time-to-event or censoring, \(T_i = \min(\{ X_i,D_i \})\) (argument time).

The survival function, of subject \(i\) is specified through \(S_i: [0, \infty) \rightarrow [0,1]\). The argument estimate is the estimated survival function. If new_time is specified, it should be of shape = (N,T) (\((i,k)\) th element is \(\hat{S}_i(t_k)\)); if new_time is not specified, it should be of shape = (N,N) (\((i,j)\) th element is \(\hat{S}_i(T_j)\)).

The time-dependent Brier score [GSSS99] at time \(t\) is the mean squared error of the event status

\[BS(t) = \mathbb{E}\left[\left(1\left(X > t\right) - \hat{S}(t)\right)^2\right]\]

The default Brier score estimate is

\[\hat{BS}(t) = \frac{1}{n}\sum_i 1(T_i \leq t, \delta_i = 1) (0 - \hat{S}_i(t))^2 + 1(T_1 > t) (1- \hat{S}_i(t))^2\]

To account for the fact that the event time are censored, the inverse probability weighting technique can be used. In this context, each subject associated with time \(t\) is weighted by the inverse probability of censoring \(\omega(t) = 1 / \hat{D}(t)\), where \(\hat{D}(t)\) is the Kaplan-Meier estimate of the censoring distribution, \(P(D>t)\). The censoring-adjusted Brier score is

\[\hat{BS}(t) = \frac{1}{n}\sum_i \omega(T_i) 1(T_i \leq t, \delta_i = 1) (0 - \hat{S}_i(t))^2 + \omega(t) 1(T_1 > t) (1- \hat{S}_i(t))^2\]

The censoring-adjusted Brier score can be obtained by specifying the argument weight, the weights evaluated at each time (\(\omega(T_1), \cdots, \omega(T_N)\)). If new_time is specified, the argument weight_new_time should also be specified accordingly, the weights evaluated at each new_time (\(\omega(t_1), \cdots, \omega(t_K)\)). In the context of train/test split, the weights should be derived from the censoring distribution estimated in the training data.

Examples

>>> from torchsurv.stats.ipcw import get_ipcw
>>> _ = torch.manual_seed(52)
>>> n = 10
>>> time = torch.randint(low=5, high=250, size=(n,)).float()
>>> event = torch.randint(low=0, high=2, size=(n,)).bool()
>>> estimate = torch.rand((n,len(time)))
>>> brier_score = BrierScore()
>>> brier_score(estimate, event, time)
tensor([0.2463, 0.2740, 0.3899, 0.1964, 0.3608, 0.2821, 0.1932, 0.2978, 0.1950,
        0.1668])
>>> ipcw = get_ipcw(event, time) # ipcw at time
>>> brier_score(estimate, event, time, weight=ipcw) # censoring-adjusted brier-score
tensor([0.2463, 0.2740, 0.4282, 0.2163, 0.4465, 0.3826, 0.2630, 0.3888, 0.2219,
        0.1882])
>>> new_time = torch.unique(torch.randint(low=5, high=time.max().int(), size=(n*2,)).float())
>>> ipcw_new_time = get_ipcw(event, time, new_time) # ipcw at new_time
>>> estimate = torch.rand((n,len(new_time)))
>>> brier_score(estimate, event, time, new_time, ipcw, ipcw_new_time) # censoring-adjusted brier-score at new time
tensor([0.4036, 0.3014, 0.2517, 0.3947, 0.4200, 0.3908, 0.3766, 0.3737, 0.3596,
        0.2088, 0.4922, 0.3237, 0.2255, 0.1841, 0.3029, 0.6919, 0.2357, 0.3507,
        0.4364, 0.3312])

References

[GSSS99] (1,2)

Erika Graf, Claudia Schmoor, Willi Sauerbrei, and Martin Schumacher. Assessment and comparison of prognostic classification schemes for survival data. Statistics in Medicine, 18(17–18):2529–2545, September 1999.

integral()[source]#

Compute the integrated Brier Score.

Returns:

Integrated Brier Score.

Return type:

torch.Tensor

Examples

>>> _ = torch.manual_seed(52)
>>> n = 10
>>> time = torch.randint(low=5, high=250, size=(n,)).float()
>>> event = torch.randint(low=0, high=2, size=(n,)).bool()
>>> estimate = torch.rand((n,len(time)))
>>> brier_score = BrierScore()
>>> brier_score(estimate, event, time)
tensor([0.2463, 0.2740, 0.3899, 0.1964, 0.3608, 0.2821, 0.1932, 0.2978, 0.1950,
        0.1668])
>>> brier_score.integral() # integrated brier score
tensor(0.2862)

Note

The integrated Brier score is the integral of the time-dependent Brier score over the interval \([t_1, t_2]\), where \(t_1 = \min\left(\{T_i\}_{i = 1, \cdots, N}\right)\) and \(t_2 = \max\left(\{T_i\}_{i = 1, \cdots, N}\right)\). It is defined by [GSSS99]

\[\hat{IBS} = \int_{t_1}^{t_2} \hat{BS}(t) dW(t)\]

where \(W(t) = t / t_2\).

The integral is estimated with the trapzoidal rule.

confidence_interval(method: str = 'parametric', alpha: float = 0.05, alternative: str = 'two_sided', n_bootstraps: int = 999) Tensor[source]#

Compute the confidence interval of the Brier Score.

This function calculates either the pointwise confidence interval or the bootstrap confidence interval for the Brier Score. The pointwise confidence interval is computed assuming that the Brier score is normally distributed and using empirical standard errors. The bootstrap confidence interval is constructed based on the distribution of bootstrap samples.

Parameters:
  • method (str) – Method for computing confidence interval. Defaults to “parametric”. Must be one of the following: “parametric”, “bootstrap”.

  • alpha (float) – Significance level. Defaults to 0.05.

  • alternative (str) – Alternative hypothesis. Defaults to “two_sided”. Must be one of the following: “two_sided”, “greater”, “less”.

  • n_bootstraps (int) – Number of bootstrap samples. Defaults to 999. Ignored if method is not “bootstrap”.

Returns:

Lower and upper bounds of the confidence interval.

Return type:

torch.Tensor([lower,upper])

Examples

>>> _ = torch.manual_seed(52)
>>> n = 10
>>> time = torch.randint(low=5, high=250, size=(n,)).float()
>>> event = torch.randint(low=0, high=2, size=(n,)).bool()
>>> estimate = torch.rand((n,len(time)))
>>> brier_score = BrierScore()
>>> brier_score(estimate, event, time)
tensor([0.2463, 0.2740, 0.3899, 0.1964, 0.3608, 0.2821, 0.1932, 0.2978, 0.1950,
        0.1668])
>>> brier_score.confidence_interval() # default: parametric, two-sided
tensor([[0.1061, 0.0604, 0.2360, 0.0533, 0.1252, 0.0795, 0.0000, 0.1512, 0.0381,
         0.0051],
        [0.3866, 0.4876, 0.5437, 0.3394, 0.5965, 0.4847, 0.4137, 0.4443, 0.3520,
         0.3285]])
>>> brier_score.confidence_interval(method = "bootstrap", alternative = "greater")
tensor([[0.1455, 0.1155, 0.2741, 0.0903, 0.1985, 0.1323, 0.0245, 0.1938, 0.0788,
         0.0440],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
         1.0000]])
p_value(method: str = 'bootstrap', alternative: str = 'two_sided', n_bootstraps: int = 999, null_value: float = None) Tensor[source]#

Perform a one-sample hypothesis test on the Brier score.

This function calculates either the pointwise p-value or the bootstrap p-value for testing the null hypothesis that the estimated brier score is equal to bs0, where bs0 is the brier score that would be expected if the survival model was not providing accurate predictions beyond random chance. The pointwise p-value is computed assuming that the Brier score is normally distributed and using the empirical standard errors. To obtain the pointwise p-value, the Brier score under the null, bs0, must be provided. The bootstrap p-value is derived by permuting survival function’s predictions to estimate the the sampling distribution under the null hypothesis.

Parameters:
  • method (str) – Method for computing p-value. Defaults to “bootstrap”. Must be one of the following: “parametric”, “bootstrap”.

  • alternative (str) – Alternative hypothesis. Defaults to “two_sided”. Must be one of the following: “two_sided” (Brier score is not equal to bs0), “greater” (Brier score is greater than bs0), “less” (Brier score is less than bs0).

  • n_bootstraps (int) – Number of bootstrap samples. Defaults to 999. Ignored if `method` is not “bootstrap”.

  • null_value (float) – The Brier score expected if the survival model was not providing accurate predictions beyond what would be beyond by random chance alone, i.e., bs0.

Returns:

p-value of the statistical test.

Return type:

torch.Tensor

Examples

>>> _ = torch.manual_seed(52)
>>> n = 10
>>> time = torch.randint(low=5, high=250, size=(n,)).float()
>>> event = torch.randint(low=0, high=2, size=(n,)).bool()
>>> new_time = torch.unique(time)
>>> estimate = torch.rand((n,len(new_time)))
>>> brier_score = BrierScore()
>>> brier_score(estimate, event, time, new_time)
tensor([0.3465, 0.5310, 0.4222, 0.4582, 0.3601, 0.3395, 0.2285, 0.1975, 0.3120,
        0.3883])
>>> brier_score.p_value() # Default: bootstrap, two_sided
tensor([1.0000, 0.0560, 1.0000, 1.0000, 1.0000, 1.0000, 0.8320, 0.8620, 1.0000,
        1.0000])
>>> brier_score.p_value(method = "parametric", alternative = "less", null_value = 0.3) # H0: bs = 0.3, Ha: bs < 0.3
tensor([0.7130, 0.9964, 0.8658, 0.8935, 0.6900, 0.6630, 0.1277, 0.1128, 0.5383,
        0.8041])
compare(other, method: str = 'parametric', n_bootstraps: int = 999) tensor[source]#

Compare two Brier scores.

This function compares two Brier scores computed on the same data with different risk scores. The statistical hypotheses are formulated as follows, null hypothesis: brierscore1 = brierscore2 and alternative hypothesis: brierscore1 < brierscore2. The statistical test is either a Student t-test for paired samples or a two-sample bootstrap test. The Student t-test for paired samples assumes that the Brier Scores are normally distributed and uses the Brier scores’ empirical standard errors.

Parameters:
  • other (BrierScore) – Another instance of the BrierScore class representing brierscore2.

  • method (str) – Statistical test used to perform the hypothesis test. Defaults to “parametric”. Must be one of the following: “parametric”, “bootstrap”.

  • n_bootstraps (int) – Number of bootstrap samples. Defaults to 999. Ignored if method is not “bootstrap”.

Returns:

p-value of the statistical test.

Return type:

torch.tensor

Examples

>>> _ = torch.manual_seed(52)
>>> n = 10
>>> time = torch.randint(low=5, high=250, size=(n,)).float()
>>> event = torch.randint(low=0, high=2, size=(n,)).bool()
>>> brier_score = BrierScore()
>>> brier_score(torch.rand((n,len(time))), event, time)
tensor([0.2463, 0.2740, 0.3899, 0.1964, 0.3608, 0.2821, 0.1932, 0.2978, 0.1950,
        0.1668])
>>> brier_score2 = BrierScore()
>>> brier_score2(torch.rand((n,len(time))), event, time)
tensor([0.4136, 0.2750, 0.3002, 0.2826, 0.2030, 0.2643, 0.2525, 0.2964, 0.1804,
        0.3109])
>>> brier_score.compare(brier_score2) # default: parametric
tensor([0.1793, 0.4972, 0.7105, 0.1985, 0.9254, 0.5591, 0.3455, 0.5060, 0.5437,
        0.0674])
>>> brier_score.compare(brier_score2, method = "bootstrap")
tensor([0.1360, 0.5030, 0.7310, 0.2090, 0.8630, 0.5490, 0.3120, 0.5110, 0.5460,
        0.1030])