Knockoff variable selection: Select the variables by controlling a user-specified error rate
variable.selections.Rd
This is the main function for knockoff-based variable selection; it takes the knockoff statistics W as input. With multiple knockoffs (ncol(W) > 1), the function performs variable selection for each knockoff and additionally stabilizes the selections by combining their outcomes.
Arguments
- W
a data.frame of knockoff W-statistics (feature statistics); columns correspond to different knockoffs and rows correspond to the underlying variables. row.names(W) records the variable names.
- level
the nominal level at which the chosen error rate is to be controlled
- error.type
the error rate to control; currently one of "fdr", "pfer", or "kfwer"
- k
a positive integer corresponding to k-FWER (the probability of making at least k false discoveries); to be used only with error.type = 'kfwer'
- thres
threshold parameter for stabilizing the selections (eta parameter for derandomized knockoffs, trims parameter for multi_select). A natural choice is thres = 0.5.
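For reference, the three error rates accepted by error.type have the following standard definitions. The symbols V (number of falsely selected variables) and R (total number of selected variables) are notation introduced here for illustration and do not appear elsewhere on this page.

```latex
\mathrm{FDR} = \mathbb{E}\!\left[\frac{V}{\max(R,\,1)}\right], \qquad
\mathrm{PFER} = \mathbb{E}[V], \qquad
k\text{-}\mathrm{FWER} = \mathbb{P}(V \ge k)
```

Under these definitions, level is a bound on a proportion for "fdr", on an expected count for "pfer", and on a probability for "kfwer".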
Value
an object of class "variable.selections" that is essentially a list with two elements: 1) $selections = a (p x M) binary data.frame where rows correspond to variables and columns correspond to different knockoffs; a value of 1 means the given variable was selected for that particular knockoff simulation, 0 otherwise; 2) $stable.selection = a character vector with the variables selected by stability selection (as described in Details). The second element is only meaningful if the user specifies multiple knockoffs (say M > 5). If M = 1, then $stable.selection simply returns the indices of $selections that are equal to 1.
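As a rough illustration of the structure described above, the following base-R sketch builds a toy list with the two documented elements and shows how they might be inspected. The variable names, column names (W1, W2, W3), 0/1 values, and the contents of stable.selection are all invented for illustration; no package function is called.

```r
# Toy stand-in for the returned object (all values are invented).
# Rows are variables, columns are M = 3 knockoff runs; 1 = selected, 0 = not.
selections <- data.frame(
  W1 = c(1, 1, 0, 0),
  W2 = c(1, 0, 0, 1),
  W3 = c(1, 1, 0, 0),
  row.names = c("X1", "X2", "X3", "X4")
)
S_toy <- list(selections = selections,
              stable.selection = c("X1", "X2"))

# Variables selected in the second knockoff run:
selected_run2 <- row.names(S_toy$selections)[S_toy$selections$W2 == 1]

# Number of selected variables in each knockoff run:
n_per_run <- colSums(S_toy$selections)
```

Indexing by row.names works because, as documented for W, the variable names are carried in the row names.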
Details
Knockoffs is a randomized procedure that relies on the construction of synthetic (knockoff) variables. This function performs variable selection for multiple knockoffs and then stabilizes the selections by combining their outcomes. When the pfer or kfwer error is controlled, the derandomized knockoffs procedure is used, which was introduced by Ren et al. (2021) and provably controls these errors. When the fdr is controlled, the heuristic multiple selection algorithm is used, which was introduced by Kormaksson et al. (2021).
Z. Ren, Y. Wei, & E. Candès (2021). Derandomizing knockoffs. Journal of the American Statistical Association, 1-11.
M. Kormaksson, L. J. Kelly, X. Zhu, S. Haemmerle, L. Pricop, & D. Ohlssen (2021). Sequential knockoffs for continuous and categorical predictors: With application to a large psoriatic arthritis clinical trial pool. Statistics in Medicine, 40(14), 3313-3328.
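The stabilization step for pfer and kfwer can be pictured as a frequency rule: in derandomized knockoffs, a variable is kept when the fraction of knockoff runs that selected it reaches the threshold eta (the thres argument). The sketch below applies that rule to a made-up binary selection table in base R. The helper name stabilize_by_frequency is hypothetical, and this is an illustration of the idea under those assumptions, not the package's implementation; it does not cover the multiple selection heuristic used for fdr.

```r
# Hypothetical helper: keep variables selected in at least a fraction
# `thres` of the knockoff runs.
stabilize_by_frequency <- function(selections, thres = 0.5) {
  freq <- rowMeans(as.matrix(selections))  # selection frequency per variable
  names(freq)[freq >= thres]
}

# Invented selections for 4 variables across M = 4 knockoff runs.
sel <- data.frame(
  W1 = c(1, 1, 0, 0),
  W2 = c(1, 0, 0, 1),
  W3 = c(1, 1, 0, 0),
  W4 = c(1, 0, 0, 0),
  row.names = c("X1", "X2", "X3", "X4")
)

# Frequencies are 1, 0.5, 0 and 0.25, so X1 and X2 pass thres = 0.5.
stable <- stabilize_by_frequency(sel, thres = 0.5)
```

Raising thres makes the rule stricter: with thres = 1, only variables selected in every run are retained.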
See also
plot.variable.selections
for plotting an organized heatmap of the selections.
Examples
library(knockofftools)
set.seed(1)
# Simulate 10 Gaussian covariate predictors:
X <- generate_X(n=100, p=10, p_b=0, cov_type="cov_equi", rho=0.2)
# Create linear predictor with first 5 beta-coefficients = 1 (all others zero)
lp <- generate_lp(X, p_nn = 5, a=1)
# Gaussian
# Simulate response from a linear model y = lp + epsilon, where epsilon ~ N(0,1):
y <- lp + rnorm(100)
# Calculate M independent knockoff feature statistics:
W <- knockoff.statistics(y=y, X=X, type="regression", M=5)
#> Running sequentially ('LOCAL') ...
S <- variable.selections(W, error.type = "pfer", level = 1)
# Selections under alternative error control:
S <- variable.selections(W, error.type = "kfwer", k = 1, level = 0.50)
S <- variable.selections(W, error.type = "fdr", level = 0.5)