The simdata dataset
Matthias Kormaksson, Kostas Sechidis
March 04, 2024
simdata.Rmd
Introduction
The knockofftools package comes with a simulated data set called
simdata
. This short vignette demonstrates how the data set
was generated.
Data generation with generate_simdata
The function generate_simdata
was used to generate the
data set simdata
that comes with the R-package.
generate_simdata <- function() {
RNGkind("L'Ecuyer-CMRG")
set.seed(56969)
N <- 2000
p <- 30
p_b = 10
p_nn <- 10
# Generate a 2000 x 30 Gaussian data.frame under equi-correlation(rho=0.5) structure,
# with 10 of the columns dichotomized
X <- generate_X(n=N, p=p, p_b=p_b, cov_type = "cov_equi", rho=0.5)
# Generate linear predictor lp = X%*%beta where first 10 beta-coefficients are = a, all other = 0.
lp <- generate_lp(X=X, p_nn=p_nn, a=1)
# simulate gaussian response with mean = lp and sd = 1.
Yg <- lp + rnorm(N)
# simulate bernoulli response with mean = exp(lp)/(1+exp(lp)).
Yb <- factor(rbinom(N, size=1, prob=exp(lp)/(1+exp(lp))))
# simulate censored survival times from Cox regression with linear predictor lp:
Tc <- simulWeib(N=N, lambda = 0.01, rho = 1, lp = lp)
dat <- data.frame(Yg, Yb, Tc, X)
return(dat)
}
This function simulates a toy data set with 30 covariates \(X_1, \dots, X_{30}\), one continuous response \(y_g\), one binary response \(y_b\) and one set of censored survival times \(T_c\). Let’s now go through the individual components of the function one by one.
Simulation of \(X\): The
generate_X
function simulates the rows of an \(n \times p\) data frame \(X\) independently from a multivariate
Gaussian distribution with mean \(0\)
and \(p \times p\) covariance matrix
\[
\Sigma_{ij} = \left \{
\begin{array}{lr}
1\{i = j\}, & \text{Independent,} \\
\rho^{1\{i \neq j\}}, & \text{Equicorrelated,} \\
\rho^{|i-j|}, & \text{AR1},
\end{array}
\right.
\] where \(p_b\) randomly
selected columns are then dichotomized with the indicator function \(\delta(x)=1(x > 0)\).
X <- generate_X(n=N, p=p, p_b=p_b, cov_type = "cov_equi", rho=0.5)
The covariance type is specified with the parameter
cov_type
and the correlation coefficient with
rho
. Each column of the resulting data.frame is either of
class "numeric"
(for the continuous columns) or
"factor"
(for the binary columns).
Calculation of linear predictor: The
generate_lp
function calculates the linear predictor \(\ell_p=X\beta\) under sparsity, where the
first p_nn
regression coefficients are non-zero, all other
are set to zero. The (common) amplitude of the non-zero regression
coefficients is specified with a
. Here we generate \(\ell_p\) that implies association with the
first 10 covariates, each with amplitude \(a=1\).
lp <- generate_lp(X=X, p_nn=p_nn, a=1)
Note that inside generate_lp
the model.matrix of
X
is first scaled.
Simulation of Gaussian response: \(Y_g\) is Gaussian with mean \(\mu = \ell_p\) and standard deviation \(\sigma = 1\):
Yg <- lp + rnorm(N)
Simulation of Bernoulli response: \(Y_b\) is Bernoulli with success probability \(\mu = exp(\ell_p)/(1+exp(\ell_p))\):
Simulation of event and censoring times: The final command:
Tc <- simulWeib(N=N, lambda = 0.01, rho = 1, lp = lp)
generates censored survival times \(T_c\) from a Cox regression model \[
\lambda(t) = \lambda_0(t) \exp \left(\ell_p \right),
\] with Weibull baseline hazard: \[
\lambda_0(t) = \lambda_0 \rho t^{\rho-1}
\] where \(\lambda_0 > 0\)
and \(\rho > 0\) are scale and shape
parameters, respectively. The censoring times \(C\) are randomly drawn from an exponential
distribution with a small (fixed) rate \(\lambda_C=0.0005\), which results in very
mild censoring. Once \(T\) and \(C\) have been simulated the function
returns a survival object (Surv
) with
time = min(T, C)
and event = 1{T < C}
:
The data set simdata
Now let’s have a look at the first few columns of the data set
simdata
data(simdata)
head(simdata[,1:9])
#> Yg Yb Tc.time Tc.status X1 X2 X3 X4 X5 X6
#> 1 -8.616078 0 2961.163522 0.000000 0 -1.34958449 1 -2.06459293 0 0
#> 2 2.145781 1 14.818694 1.000000 0 1.17359743 1 0.84017766 1 0
#> 3 2.855925 1 5.662159 1.000000 0 0.06617883 1 0.56828582 1 0
#> 4 -7.393736 0 780.165789 0.000000 0 0.60006023 0 -0.88344294 0 0
#> 5 -5.357337 0 1290.810345 0.000000 0 0.68195033 0 -0.07230342 0 1
#> 6 3.720681 1 3.590458 1.000000 1 -0.65766424 1 0.46974365 0 1
and finally confirm that generate_simdata()
indeed
reproduces the simdata
dataset:
all.equal(simdata, generate_simdata())
#> [1] TRUE