Knockoff (feature) statistics:
knockoff.statistics.Rd
This function calculates M >= 1 independent knockoff (feature) statistics (W) given input response vector y and covariate data.frame X. The function first calculates M independent knockoff copies (Xk1, ..., XkM) of the covariate matrix (X) and then calculates the knockoff feature statistics W1, ..., WM. By default each feature statistic is calculated via the parameter statistic=knockofftools::stat_glmnet, but user may write and supply their own feature statistics functions (e.g. random forest variable importance difference). The user may additionally supply fixed effects (X.fixed) that should always be included in the underlying model (e.g. covariates to adjust for).
Usage
knockoff.statistics(
y,
X,
type = "regression",
M = 1,
knockoff.method = "seq",
statistic = "stat_glmnet",
trt = NULL,
gcm = TRUE,
...
)
Arguments
- y
response vector with
length(y) = nrow(X)
. Accepts "numeric", binary "factor", or survival ("Surv") object.- X
data.frame (or tibble) with "numeric" and "factor" columns only. The number of columns, ncol(X) needs to be > 2.
- type
should be "regression" if y is numeric, "classification" if y is a binary factor variable or "survival" if y is a survival object.
- M
the number of independent knockoff feature statistics that should be calculated.
- knockoff.method
what type of knockoffs to calculate. Defaults to sequential knockoffs, knockoff.method="seq", with other options: knockoff.method="sparseseq" and knockoff.method="mx". The "mx" method only works if all columns of the X matrix are continuous.
- statistic
knockoff feature statistic function, defaults to glmnet coefficient difference (statistic="stat_glmnet"; see ?stat_glmnet). Other options include statistic="stat_random_forest" (see ?stat_random_forest), statistic="stat_predictive_glmnet" (see ?stat_predictive_glmnet) or statistic="stat_predictive_causal_forest" (see ?stat_predictive_causal_forest).
- trt
binary treatment (factor) variable required if statistic involves a predictive knockoff filter (i.e. if statistic="stat_predictive_glmnet" or statistic="stat_predictive_causal_forest")
- gcm
logical indicator for whether a Gaussian Copula Model should be applied. Defaults to TRUE since the underlying knockoff generation mechanism for numeric variables is based on multivariate Gaussian variables. When gcm=TRUE each numeric variable is normal score transformed resulting in marginal standard normal variables. The knockoff filter then acts in this transformed variable space. User is advised not to change this parameter unless he/she understands the consequences.
- ...
additional parameters passed to the "statistic" function (note that the knockoffs parameter X_k should not be entered by user; it is already calculated inside the knockoff.statistics function).
Value
data.frame with knockoff statistics W as column. The number of rows matches the number of columns (variables) of the data.frame X and the variable names are recorded in rownames(W).
Details
If multiple knockoffs are desired (M > 1) then the method utilizes the clustermq package for parallel distribution of jobs on HPC scheduler. See clustermq-userguide for further details on how to configure (differently from defaults) clustermq scheduler and batch templates.
Examples
library(knockofftools)
set.seed(1)
# Simulate 10 Gaussian covariate predictors:
X <- generate_X(n=100, p=10, p_b=0, cov_type="cov_equi", rho=0.2)
# create linear predictor with first 5 beta-coefficients = 1 (all other zero)
lp <- generate_lp(X, p_nn = 5, a=1)
# Gaussian
# Simulate response from a linear model y = lp + epsilon, where epsilon ~ N(0,1):
y <- lp + rnorm(100)
# Calculate M independent knockoff feature statistics:
W <- knockoff.statistics(y=y, X=X, type="regression", M=5)
#> Running sequentially ('LOCAL') ...
print(variable.selections(W, error.type = "pfer", level = 2)$stable.variables)
#> [1] "X1" "X2" "X3" "X4" "X5" "X7" "X10"
if (FALSE) {
W <- knockoff.statistics(y=y, X=X, type="regression", M=5, statistic = "stat_random_forest")
print(variable.selections(W, error.type = "pfer", level = 2)$stable.variables)
# Cox
# Simulate from Weibull hazard with with baseline hazard h0(t) = lambda*rho*t^(rho-1)
# and linear predictor lp:
y <- simulWeib(N=nrow(X), lambda0=0.01, rho=1, lp=lp)
W <- knockoff.statistics(y=y, X=X, type="survival", M=5)
print(variable.selections(W, error.type = "pfer", level = 2)$stable.variables)
W <- knockoff.statistics(y=y, X=X, type="survival", M=5, statistic = "stat_random_forest")
print(variable.selections(W, error.type = "pfer", level = 2)$stable.variables)
# Check quickly predictive filters
# Generate a binary treatment variable
trt = sample(c(1,0), nrow(X), replace=TRUE)
lp.pred = lp + 1*trt*( as.integer(X[,6]>0) + as.integer(X[,7]>0))
# Gaussian
y <- lp.pred + rnorm(nrow(X))
W <- knockoff.statistics(y=y, X=X, type="regression", statistic = "stat_predictive_glmnet", trt=trt, M=5)
print(variable.selections(W, error.type = "pfer", level = 2)$stable.variables)
W <- knockoff.statistics(y=y, X=X, type="regression", statistic = "stat_predictive_causal_forest", trt=trt, M=5)
print(variable.selections(W, error.type = "pfer", level = 2)$stable.variables)
# Cox
# Simulate from Weibull hazard with with baseline hazard h0(t) = lambda*rho*t^(rho-1)
# and linear predictor lp:
y <- simulWeib(N=nrow(X), lambda0=0.01, rho=1, lp=lp.pred)
W <- knockoff.statistics(y=y, X=X, type="survival", statistic = "stat_predictive_glmnet", trt=trt, M=5)
print(variable.selections(W, error.type = "pfer", level = 2)$stable.variables)
W <- knockoff.statistics(y=y, X=X, type="survival",
statistic = "stat_predictive_causal_forest",
trt=trt, M=5)
print(variable.selections(W, error.type = "pfer", level = 2)$stable.variables)}