Knockoff (feature) statistics that captues the predictive strength: Absolute coefficient differences between treatment original variables interaction terms and treatment knockoff variables interaction terms

This function follows the implementation of the stat_glmnet function, but modifies it to focus on the coefficients of the interaction terms with the treatment. A first version of this filter was suggested in Sechidis et al. (2021), but with this function we also allow the user to adjust to variables, or to fix some variables as prognostic in the model.

Usage

stat_predictive_glmnet(
  X,
  X_k,
  y,
  trt,
  type = "regression",
  X.fixed = NULL,
  penalty.fixed = rep(0, length(X.fixed)),
  fixed.prognostic = FALSE,
  ...
)

Arguments

X: original data.frame (or tibble) with "numeric" and "factor" columns only. The number of columns, ncol(X) needs to be > 2.
X_k: knockoff data.frame (or tibble) with "numeric" and "factor" columns only obtained e.g. by X_k = knockoff(X). The dimensions and column classes must match those of the original X.
y: response vector with length(y) = nrow(X). Accepts "numeric" (type="regression") or binary "factor" (type="classification"). Can also be a survival object of class "Surv" (type="survival") as obtained from y = survival::Surv(time, status).
trt: a binary treatment indicator variable (should be numeric with 0/1 entries)
type: should be "regression" if y is numeric, "classification" if y is a binary factor variable or "survival" if y is a survival object.
X.fixed: a data.frame (or tibble) with "numeric" and "factor" columns corresponding to covariates or terms that should be treated as fixed effects in the model.
penalty.fixed: a numeric vector of length equal to number of columns of X.fixed indicating which fixed effects should be estimated with glmnet penalty and which not (1 corresponds to covariates that should be penalized and 0 corresponds to covariates that are not penalized; if X.fixed is supplied, all elements of penalty.fixed are set to zero as default)
fixed.prognostic: a parameters that describes weather the user fix some prognostic variable in the model (TRUE) or not (FALSE)
...: additional parameters passed to glmnet::cv.glmnet

Value

data.frame with knockoff statistics W as column that capture the predictive strength of the variables. The number of rows matches the number of columns (variables) of the data.frame X and the variable names are recorded in rownames(W).

Details

If there are factor covariates with multiple levels among columns of X then there will be more columns in model.matrix than in the corresponding data.frame (both for original X and its knockoff X_k). In this case, let W_j be the difference between the two maximum absolute signals of coefficients of model.matrix associated with covariate j. I.e. if j-th variable is factor with K levels then W_j is: max(|beta_j1|, ... , |beta_j,K-1|) - max(|beta.tilde_j1|, ..., |beta.tilde_j,K-1|) where (beta_j1, ..., beta_j,K-1) and (beta.tilde_j1, ..., beta.tilde_j,K-1) are the coefficients associated with dummy variables of original j-th factor and its knockoff, respectively.

Sechidis, K., Kormaksson, M., & Ohlssen, D. (2021). Using knockoffs for controlled predictive biomarker identification. Statistics in Medicine, 40(25), 5453-5473.

Examples

if (FALSE) {
library(knockofftools)

set.seed(1)

# Simulate 10 Gaussian covariate predictors and 1 factor with 4 levels:
X <- generate_X(n=500, p=10, p_b=0, cov_type="cov_diag", rho=0.2)
X$X11 <- factor(sample(c("A","B","C","D"), nrow(X), replace=TRUE))

 # Calculate the knockoff copy of X:
X_k <- knockoff(X)

# Generate a binary treatment variable
trt = sample(c(1,0), nrow(X), replace=TRUE)

# Simulate a fixed "treatment" effect:
X.fixed <- data.frame(SEX = factor(sample(c("male", "female"), nrow(X), replace=TRUE)), trt = trt)
penalty.fixed = rep(0, length(X.fixed))

# create linear predictor with first 3 beta-coefficients = 1 (all other zero) and a treatment effect of size 1
lp <- X.fixed$trt+ as.numeric(X.fixed$SEX == 'male') + (X$X1 + X$X2 + X$X3) + (X$X4 + 2*as.integer(X$X11=='A'))*trt

# Gaussian

# Simulate response from a linear model y = lp + epsilon, where epsilon ~ N(0,1):
y <- lp + rnorm(nrow(X))

# We present three different ways of using that filter
# (a) the filter suggested in Sechidis et al. (2021) - where there is no fixed prognostic part
W <- stat_predictive_glmnet(y=y, trt=trt, X=X, X_k=X_k, type="regression", X.fixed=X.fixed, penalty.fixed = penalty.fixed, fixed.prognostic = FALSE)
# (b) a variation where the user can fix some prognostic variables in the model and allows the model to penalise them
X.fixed.prognostic <- data.frame( X1 = X$X1, X2 = X$X2, X3 = X$X3)
penalty.fixed.prognostic = rep(1, length(X.fixed.prognostic))
W <- stat_predictive_glmnet(y=y, trt=trt, X=X, X_k=X_k, type="regression", X.fixed=cbind(X.fixed, X.fixed.prognostic),  penalty.fixed = c(penalty.fixed,penalty.fixed.prognostic), fixed.prognostic = TRUE)

# Cox

# Simulate from Weibull hazard with with baseline hazard h0(t) = lambda*rho*t^(rho-1) and linear predictor lp:
y <- simulWeib(N=nrow(X), lambda0=0.01, rho=1, lp=lp)

# Explore the three scenarios above with survival outcomes
# (a) the filter suggested in Sechidis et al. (2021) - where there is no fixed prognostic part
W <- stat_predictive_glmnet(y=y, trt=trt, X=X, X_k=X_k, type="survival", X.fixed=X.fixed, penalty.fixed = penalty.fixed, fixed.prognostic = FALSE)
# (b) a variation where the user can fix some prognostic variables in the model and allows the model to penalise them
X.fixed.prognostic <- data.frame( X1 = X$X1, X2 = X$X2, X3 = X$X3)
penalty.fixed.prognostic = rep(1, length(X.fixed.prognostic))
W <- stat_predictive_glmnet(y=y, trt=trt, X=X, X_k=X_k, type="survival", X.fixed=cbind(X.fixed, X.fixed.prognostic),  penalty.fixed = c(penalty.fixed,penalty.fixed.prognostic), fixed.prognostic = TRUE)
}