
This function uses the randomForestSRC package to estimate variable importance scores from a random forest.

Usage

stat_random_forest(X, X_k, y, type = "regression", ...)

Arguments

X

original data.frame (or tibble) with "numeric" and "factor" columns only. The number of columns, ncol(X), must be greater than 2.

X_k

knockoff data.frame (or tibble) with "numeric" and "factor" columns only, obtained e.g. by X_k = knockoff(X). The dimensions and column classes must match those of the original X.

y

response vector with length(y) = nrow(X). Accepts "numeric" (type="regression") or binary "factor" (type="classification"). Can also be a survival object of class "Surv" (type="survival") as obtained from y = survival::Surv(time, status).

type

should be "regression" if y is numeric, "classification" if y is a binary factor variable or "survival" if y is a survival object.

...

Value

data.frame with knockoff statistics W as column. The number of rows matches the number of columns (variables) of the data.frame X and the variable names are recorded in rownames(W).
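
As an illustration of how statistics like W are commonly turned into a variable selection, the following is a minimal base-R sketch of the "knockoff+" threshold rule from the knockoff literature. It is not taken from this package: the values of W, the target level q, and the names candidates, fdp_hat, tau and selected are all hypothetical, and the package may provide its own selection routine.

```r
# Hypothetical knockoff statistics for 8 variables (not output of this package).
W <- c(3.2, 2.5, -0.4, 1.8, 0.3, -0.2, 2.9, 0.1)
q <- 0.25  # hypothetical target false discovery rate

# Candidate thresholds are the distinct nonzero absolute values of W.
candidates <- sort(unique(abs(W[W != 0])))

# Estimated false discovery proportion at each candidate threshold t
# ("knockoff+" form, with the +1 in the numerator).
fdp_hat <- sapply(candidates, function(t) {
  (1 + sum(W <= -t)) / max(1, sum(W >= t))
})

# Smallest threshold whose estimate is at most q; Inf means nothing is selected.
ok <- candidates[fdp_hat <= q]
tau <- if (length(ok) > 0) min(ok) else Inf

# Indices of the variables whose statistic reaches the threshold.
selected <- which(W >= tau)
```

Large positive values of W indicate that the original variable was more important than its knockoff, so only variables with W at or above the threshold are kept.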

Details

If X contains factor covariates with multiple levels, the model.matrix will have more columns than the corresponding data.frame (for both the original X and its knockoff X_k). In this case, W_j is the difference between two sums of absolute variable importance (VI) scores associated with covariate j: one over the columns derived from the original variable and one over the columns derived from its knockoff. That is, if the j-th variable is a factor with K levels, then W_j is: sum(|VI_j,1|, ..., |VI_j,K|) - sum(|VI~_j,1|, ..., |VI~_j,K|), where VI_j,k and VI~_j,k denote the importance scores of the k-th level for the original variable and for its knockoff, respectively.
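
The aggregation described above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's internal code: the importance scores, the expanded column names (such as "X3.A") and the helper vector covariate are hypothetical.

```r
# Hypothetical importance scores for the expanded columns of the original data.
# X1 and X2 are numeric; X3 is a factor contributing one score per level.
vi_orig <- c(X1 = 0.50, X2 = 0.10, X3.A = 0.20, X3.B = -0.05, X3.C = 0.10)

# Hypothetical importance scores for the matching knockoff columns.
vi_knock <- c(X1 = 0.05, X2 = 0.12, X3.A = 0.02, X3.B = 0.01, X3.C = -0.03)

# Map each expanded column back to the covariate it came from.
covariate <- c("X1", "X2", "X3", "X3", "X3")

# Sum the absolute scores within each covariate, separately for the
# original and the knockoff columns.
sum_orig <- tapply(abs(vi_orig), covariate, sum)
sum_knock <- tapply(abs(vi_knock), covariate, sum)

# W_j is the difference of the two sums; variable names go into the row names.
W <- data.frame(W = as.numeric(sum_orig - sum_knock),
                row.names = names(sum_orig))
```

With these made-up numbers, the factor X3 receives a single statistic even though it contributes three columns, so W has one row per column of the original data.frame.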

Examples

library(knockofftools)

set.seed(1)

# Simulate 10 Gaussian covariate predictors and 1 factor with 4 levels:
X <- generate_X(n=500, p=10, p_b=0, cov_type="cov_diag", rho=0.2)
X$X11 <- factor(sample(c("A","B","C","D"), nrow(X), replace=TRUE))

# Calculate the knockoff copy of X:
X_k <- knockoff(X)

# Create linear predictor with first 3 beta-coefficients = 1 (all others zero):
lp <- (X$X1 + X$X2 + X$X3)

# Gaussian

# Simulate response from a linear model y = lp + epsilon, where epsilon ~ N(0,1):
y <- lp + rnorm(nrow(X))

W <- stat_random_forest(X, X_k, y, type = "regression")

# Survival

# Simulate from Weibull hazard with baseline hazard h0(t) = lambda*rho*t^(rho-1) and linear predictor lp:
y <- simulWeib(N=nrow(X), lambda0=0.01, rho=1, lp=lp)

# Calculate knockoff feature statistics:
W <- stat_random_forest(X, X_k, y, type = "survival")