Knockoff (feature) statistics: Absolute elastic-net coefficient differences between original and knockoff variables

This function mostly follows the implementation of knockoff::glmnet.stat_coefdiff. The input data.frames (X, X_k) and X.fixed (if supplied) are first converted to design matrices with the function model.matrix. This means that if the input features contain factor variables, the associated dummy variables are determined by the model.matrix contrasts (by default, indicator dummy variables with a reference level). glmnet::cv.glmnet is then called with response y and x = cbind(X, X_k, X.fixed), and the penalty is applied only to cbind(X, X_k). If the user also wishes to penalize the parameters of X.fixed, the penalty.fixed parameter can be adjusted accordingly.
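
The steps described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's internal code: the toy data, the helper to_design, the "_k" column suffix and the decision to drop the intercept column are all choices made here for the example. X.fixed is kept numeric so that each of its columns maps to exactly one design-matrix column.

```r
# Toy inputs; all names and values below are illustrative.
set.seed(1)
n <- 6
X <- data.frame(x1 = rnorm(n), g = factor(c("a", "b", "c", "a", "b", "c")))
X_k <- data.frame(x1 = rnorm(n), g = factor(c("b", "c", "a", "c", "a", "b")))
X.fixed <- data.frame(age = rnorm(n))
penalty.fixed <- rep(0, ncol(X.fixed))  # 0 = unpenalized fixed effect

# Hypothetical helper: expand factors into dummy columns using the default
# contrasts and drop the intercept column (an assumption of this sketch).
to_design <- function(df) model.matrix(~ ., data = df)[, -1, drop = FALSE]

mm_X <- to_design(X)          # columns: x1, gb, gc
mm_Xk <- to_design(X_k)       # same layout for the knockoffs
mm_fixed <- to_design(X.fixed)
colnames(mm_Xk) <- paste0(colnames(mm_Xk), "_k")

# Combined design matrix in the order cbind(X, X_k, X.fixed).
x <- cbind(mm_X, mm_Xk, mm_fixed)

# Penalty 1 for original and knockoff columns, user-chosen for fixed effects.
penalty.factor <- c(rep(1, ncol(mm_X) + ncol(mm_Xk)), penalty.fixed)
```

A matrix and vector built this way could then be handed to glmnet::cv.glmnet(x, y, penalty.factor = penalty.factor); penalty.factor is a documented glmnet argument, but whether stat_glmnet passes it in exactly this form is an assumption.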

Usage

stat_glmnet(
  y,
  X,
  X_k,
  type = "regression",
  X.fixed = NULL,
  penalty.fixed = rep(0, length(X.fixed)),
  ...
)

Arguments

y: response vector with length(y) = nrow(X). Accepts "numeric" (type="regression") or binary "factor" (type="classification"). Can also be a survival object of class "Surv" (type="survival") as obtained from y = survival::Surv(time, status).
X: original data.frame (or tibble) with "numeric" and "factor" columns only. The number of columns, ncol(X), must be greater than 2.
X_k: knockoff data.frame (or tibble) with "numeric" and "factor" columns only, obtained, e.g., by X_k = knockoff(X). The dimensions and column classes must match those of the original X.
type: should be "regression" if y is numeric, "classification" if y is a binary factor variable or "survival" if y is a survival object.
X.fixed: a data.frame (or tibble) with "numeric" and "factor" columns corresponding to covariates or terms that should be treated as fixed effects in the model.
penalty.fixed: a numeric vector of length equal to the number of columns of X.fixed, indicating which fixed effects are estimated with the glmnet penalty (1 = penalized, 0 = not penalized). If X.fixed is supplied, all elements of penalty.fixed default to zero.
...: additional parameters passed to glmnet::cv.glmnet
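
The requirements on X and X_k listed above can be checked before computing statistics. The helper below is hypothetical (check_X_pair is not part of the package) and is only a minimal sketch of such a check, treating every non-factor column as numeric.

```r
# Hypothetical helper, not part of the package: verifies the documented
# requirements on the original data X and its knockoff copy X_k.
check_X_pair <- function(X, X_k) {
  col_ok <- function(df) {
    all(vapply(df, function(v) is.numeric(v) || is.factor(v), logical(1)))
  }
  if (!col_ok(X) || !col_ok(X_k)) {
    stop("X and X_k may contain numeric and factor columns only")
  }
  if (!identical(dim(X), dim(X_k))) {
    stop("X and X_k must have the same dimensions")
  }
  # Classify each column as "factor" or "numeric" and compare position-wise.
  kind <- function(df) {
    vapply(df, function(v) if (is.factor(v)) "factor" else "numeric", character(1))
  }
  if (!all(kind(X) == kind(X_k))) {
    stop("column classes of X and X_k must match")
  }
  if (ncol(X) <= 2) {
    stop("ncol(X) must be greater than 2")
  }
  invisible(TRUE)
}

# Usage with toy data (values are illustrative only).
X_demo <- data.frame(a = c(1, 2, 3), b = c(0.5, 0.1, 0.2), g = factor(c("u", "v", "u")))
X_k_demo <- data.frame(a = c(3, 1, 2), b = c(0.2, 0.5, 0.1), g = factor(c("v", "u", "u")))
ok <- check_X_pair(X_demo, X_k_demo)
```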

Value

A data.frame with the knockoff statistics W as a column. The number of rows equals the number of columns (variables) of the data.frame X, and the variable names are stored in rownames(W).

Details

If the columns of X include factor covariates with multiple levels, the model.matrix has more columns than the corresponding data.frame (both for the original X and for its knockoff X_k). In this case, W_j is the difference between the largest absolute coefficient among the model.matrix columns associated with covariate j and the largest absolute coefficient among the columns associated with its knockoff. That is, if the j-th variable is a factor with K levels, then W_j = max(|beta_j1|, ..., |beta_j,K-1|) - max(|beta.tilde_j1|, ..., |beta.tilde_j,K-1|), where (beta_j1, ..., beta_j,K-1) and (beta.tilde_j1, ..., beta.tilde_j,K-1) are the coefficients of the dummy variables of the original j-th factor and of its knockoff, respectively. For a numeric covariate this reduces to |beta_j| - |beta.tilde_j|.
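
The grouping of dummy-variable coefficients can be illustrated with a short base R sketch. The coefficient values below are made up for the example rather than fitted, and the use of the "assign" attribute of model.matrix to map dummy columns back to covariates is an assumption of this sketch, not necessarily how the package does it.

```r
# Toy data: one numeric covariate and one factor with K = 4 levels.
set.seed(1)
n <- 8
X <- data.frame(
  x1 = rnorm(n),
  g = factor(rep(c("a", "b", "c", "d"), length.out = n))
)

# The "assign" attribute maps each design-matrix column to its source
# covariate; the first entry (intercept) is dropped.
full <- model.matrix(~ ., data = X)
grp <- attr(full, "assign")[-1]  # 1 for x1; 2 for gb, gc, gd

# Made-up coefficients for the original and the knockoff dummy columns.
beta <- c(x1 = 0.8, gb = -1.5, gc = 0.3, gd = 0)
beta_tilde <- c(x1 = 0.1, gb = 0.2, gc = -0.4, gd = 0)

# Largest absolute coefficient within each covariate's group of columns.
group_max <- function(b, grp) as.numeric(tapply(abs(b), grp, max))

W <- data.frame(
  W = group_max(beta, grp) - group_max(beta_tilde, grp),
  row.names = names(X)
)
```

Here W has one row per column of X: 0.8 - 0.1 = 0.7 for x1 and 1.5 - 0.4 = 1.1 for the factor g.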

Examples

library(knockofftools)

set.seed(1)

# Simulate 10 Gaussian covariate predictors (p_b=0, i.e. no binary covariates):
X <- generate_X(n=100, p=10, p_b=0, cov_type="cov_equi", rho=0.2)

# Simulate response from a linear model y = X%*%beta + epsilon, where epsilon ~ N(0,1) with
# first 3 beta-coefficients = 1 (all other zero):
y <- (X$X1 + X$X2 + X$X3) + rnorm(100)

# Calculate M independent knockoff feature statistics:
W <- knockoff.statistics(y=y, X=X)