rsf.main: Univariate Minimal Depth of a Maximal Subtree (MDMS)

Description

Ranking of individual and noise variables main effects by univariate Minimal Depth of a Maximal Subtree (MDMS)

Usage

rsf.main(X,
             ntree = 1000,
             method = "mdms",
             splitrule = "logrank",
             importance = "random",
             B = 1000,
             ci = 90,
             parallel = FALSE,
             conf = NULL,
             verbose = TRUE,
             seed = NULL)

Arguments

data.frame or numeric matrix of input covariates. Dataset X assumes that: - all variables are in columns - the observed times to event and censoring variables are in the first two columns: "stime": numeric vector of observed times. "status": numeric vector of observed status (censoring) indicator variable. - each variable has a unique name, excluding the word "noise"

ntree

Number of trees in the forest. Defaults to 1000.

method

Method for ranking of individual and noise variables. character string "mdms" (default) that stands for Univariate Minimal Depth of a Maximal Subtree (MDMS).

splitrule

Splitting rule used to grow trees. For time-to-event analysis, use "logrank" (default), which implements log-rank splitting (Segal, 1988; Leblanc and Crowley, 1993).

importance

Method for computing variable importance. Defaults to Character string "random". See details below.

Postitive integer of the number of replications of the cross-validation procedure. Defaults to 1000.

Confidence Interval for inferences of individual and noise variables. numeric scalar between 50 and 100. Defaults to 90.

parallel

logical. Is parallel computing to be performed? Defaults to FALSE.

conf

list of 5 fields containing the parameters values needed for creating the parallel backend (cluster configuration). See details below for usage. Optional, defaults to NULL, but all fields are required if used:

type : character vector specifying the cluster type ("SOCKET", "MPI").
spec : A specification (character vector or integer scalar) appropriate to the type of cluster.
homogeneous : logical scalar to be set to FALSE for inhomogeneous clusters.
verbose : logical scalar to be set to FALSE for quiet mode.
outfile : character vector of an output log file name to direct the stdout and stderr connection output from the workernodes. "" indicates no redirection.

verbose

logical scalar. Is the output to be verbose? Optional, defaults to TRUE.

seed

Positive integer scalar of the user seed to reproduce the results. Defaults to NULL.

Value

data.frame containing the following columns:

"obs.mean" observed mean of covariates importances
"obs.se" observed standard error of covariates importances
"obs.LBCI" observed Lower Bound Confidence Interval of covariates importances
"obs.UBCI" observed Upper Bound Confidence Interval of covariates importances
"noise.mean" observed mean of noise covariates importances
"noise.se" observed standard error of noise covariates ranks
"noise.LBCI" observed Lower Bound Confidence Interval of noise covariates importances
"noise.UBCI" observed Upper Bound Confidence Interval of noise covariates importances
"signif.1SE" calls of covariates importances significance using the 1SE rule
"signif.CI" calls of covariates importances significance using the CI rule at ci% confidence level

Acknowledgments

This work made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University. We are thankful to Ms. Janet Schollenberger, Senior Project Coordinator, CAMACS, as well as Dr. Jeremy J. Martinson, Sudhir Penugonda, Shehnaz K. Hussain, Jay H. Bream, and Priya Duggal, for providing us the data related to the samples analyzed in the present study. Data in this manuscript were collected by the Multicenter AIDS Cohort Study (MACS) at (https://www.statepi.jhsph.edu/macs/macs.html) with centers at Baltimore, Chicago, Los Angeles, Pittsburgh, and the Data Coordinating Center: The Johns Hopkins University Bloomberg School of Public Health. The MACS is funded primarily by the National Institute of Allergy and Infectious Diseases (NIAID), with additional co-funding from the National Cancer Institute (NCI), the National Heart, Lung, and Blood Institute (NHLBI), and the National Institute on Deafness and Communication Disorders (NIDCD). MACS data collection is also supported by Johns Hopkins University CTSA. This study was supported by two grants from the National Institute of Health: NIDCR P01DE019759 (Aaron Weinberg, Peter Zimmerman, Richard J. Jurevic, Mark Chance) and NCI R01CA163739 (Hemant Ishwaran). The work was also partly supported by the National Science Foundation grant DMS 1148991 (Hemant Ishwaran) and the Center for AIDS Research grant P30AI036219 (Mark Chance).

Details

The option importance allows several ways to calculate Variable Importance (VIMP). The default "permute" returns Breiman-Cutler permutation VIMP as described in Breiman (2001). For each tree, the prediction error on the out-of-bag (OOB) data is recorded. Then for a given variable x, OOB cases are randomly permuted in x and the prediction error is recorded. The VIMP for x is defined as the difference between the perturbed and unperturbed error rate, averaged over all trees. If "random" is used, then x is not permuted, but rather an OOB case is assigned a daughter node randomly whenever a split on x is encountered in the in-bag tree. If "anti" is used, then x is assigned to the opposite node whenever a split on x is encountered in the in-bag tree.

The function rsf.main relies on the R package parallel to create a parallel backend within an R session, enabling access to a clusterof compute cores and/or nodes on a local and/or remote machine(s) and scaling-up with the number of CPU cores available and efficient parallel execution. To run a procedure in parallel (with parallel RNG), argument parallel is to be set to TRUE and argument conf is to be specified (i.e. non NULL). Argument conf uses the options described in function makeCluster of the R packages parallel and snow. IRSF supports two types of communication mechanisms between master and worker processes: 'Socket' or 'Message-Passing Interface' ('MPI'). In IRSF, parallel 'Socket' clusters use sockets communication mechanisms only (no forking) and are therefore available on all platforms, including Windows, while parallel 'MPI' clusters use high-speed interconnects mechanism in networks of computers (with distributed memory) and are therefore available only in these architectures. A parallel 'MPI' cluster also requires R package Rmpi to be installed. Value type is used to setup a cluster of type 'Socket' ("SOCKET") or 'MPI' ("MPI"), respectively. Depending on this type, values of spec are to be used alternatively:

For 'Socket' clusters (conf$type="SOCKET"), spec should be a character vector naming the hosts on which to run the job; it can default to a unique local machine, in which case, one may use the unique host name "localhost". Each host name can potentially be repeated to the number of CPU cores available on the local machine. It can also be an integer scalar specifying the number of processes to spawn on the local machine; or a list of machine specifications if you have ssh installed (a character value named host specifying the name or address of the host to use).
For 'MPI' clusters (conf$type="MPI"), spec should be an integer scalar specifying the total number of processes to be spawned across the network of available nodes, counting the workernodes and masternode.

The actual creation of the cluster, its initialization, and closing are all done internally. For more details, see the reference manual of R package snow and examples below.

When random number generation is needed, the creation of separate streams of parallel RNG per node is done internally by distributing the stream states to the nodes. For more details, see the vignette of R package parallel. The use of a seed allows to reproduce the results within the same type of session: the same seed will reproduce the same results within a non-parallel session or within a parallel session, but it will not necessarily give the exact same results (up to sampling variability) between a non-parallelized and parallelized session due to the difference of management of the seed between the two (see parallel RNG and value of returned seed below).

References

Dazard J-E., Ishwaran H., Mehlotra R.K., Weinberg A. and Zimmerman P.A. (2018). "Ensemble Survival Tree Models to Reveal Pairwise Interactions of Variables with Time-to-Events Outcomes in Low-Dimensional Setting" Statistical Applications in Genetics and Molecular Biology, 17(1):20170038.
Ishwaran, H. and Kogalur, U.B. (2007). "Random Survival Forests for R". R News, 7(2):25-31.
Ishwaran, H. and Kogalur, U.B. (2013). "Contributed R Package randomForestSRC: Random Forests for Survival, Regression and Classification (RF-SRC)" CRAN.

Examples

Run this code

# NOT RUN {
#===================================================
# Loading the library and its dependencies
#===================================================
library("IRSF")

# }
# NOT RUN {
    #===================================================
    # IRSF package news
    #===================================================
    IRSF.news()

    #================================================
    # MVR package citation
    #================================================
    citation("IRSF")

    #===================================================
    # Loading of the Synthetic and Real datasets
    # Use help for descriptions
    #===================================================
    data("MACS", package="IRSF")
    ?MACS

    head(MACS)

    #===================================================
    # Synthetic dataset
    # Continuous case:
    # All variables xj, j in {1,...,p}, are iid 
    # from a multivariate uniform distribution
    # with parmeters  a=1, b=5, i.e. on [1, 5].
    # rho = 0.50
    # Regression model: X1 + X2 + X1X2
    #===================================================
    seed <- 1234567
    set.seed(seed)
    n <- 200
    p <- 5
    x <- matrix(data=runif(n=n*p, min=1, max=5),
                nrow=n, ncol=p, byrow=FALSE,
                dimnames=list(1:n, paste("X", 1:p, sep="")))

    beta <- c(rep(1,2), rep(0,p-2), 1)
    covar <- cbind(x, "X1X2"=x[,1]*x[,2])
    eta <- covar %*% beta                     # regression function

    seed <- 1234567
    set.seed(seed)
    lambda0 <- 1
    lambda <- lambda0 * exp(eta - mean(eta))  # hazards function
    tt <- rexp(n=n, rate=lambda)              # true (uncensored) event times
    tc <- runif(n=n, min=0, max=3.9)          # true (censored) event times
    stime <- pmin(tt, tc)                     # observed event times
    status <- 1 * (tt <= tc)                  # observed event indicator
    X <- data.frame(stime, status, x)
        
    #===================================================
    # Synthetic dataset
    # Ranking of individual and noise variables by univariate 
    # Minimal Depth of a Maximal Subtree (MDMS)
    # Serial mode
    #===================================================
    X.main.mdms <- rsf.main(X=X,
                            ntree=1000,
                            method="mdms",
                            splitrule="logrank",
                            importance="random",
                            B=1000,
                            ci=90,
                            parallel=FALSE,
                            conf=NULL,
                            verbose=TRUE,
                            seed=seed)

    #===================================================
    # Examples of parallel backend parametrization 
    #===================================================
    if (require("parallel")) {
       cat("'parallel' is attached correctly \n")
    } else {
       stop("'parallel' must be attached first \n")
    }
    #===================================================
    # Ex. #1 - Multicore PC
    # Running WINDOWS
    # SOCKET communication cluster
    # Shared memory parallelization
    #===================================================
    cpus <- parallel::detectCores(logical = TRUE)
    conf <- list("spec" = rep("localhost", cpus),
                 "type" = "SOCKET",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
    #===================================================
    # Ex. #2 - Master node + 3 Worker nodes cluster
    # All nodes equipped with identical setups of multicores 
    # (8 core CPUs per machine for a total of 32)
    # SOCKET communication cluster
    # Distributed memory parallelization
    #===================================================
    masterhost <- Sys.getenv("HOSTNAME")
    slavehosts <- c("compute-0-0", "compute-0-1", "compute-0-2")
    nodes <- length(slavehosts) + 1
    cpus <- 8
    conf <- list("spec" = c(rep(masterhost, cpus),
                            rep(slavehosts, cpus)),
                 "type" = "SOCKET",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
    #===================================================
    # Ex. #3 - Enterprise Multinode Cluster w/ multicore/node  
    # Running LINUX with SLURM scheduler
    # MPI communication cluster
    # Distributed memory parallelization
    # Below, variable 'cpus' is the total number of requested 
    # taks (threads/CPUs), which is specified from within a 
    # SLURM script.
    #==================================================
    if (require("Rmpi")) {
        print("Rmpi is loaded correctly \n")
    } else {
        stop("Rmpi must be installed first to use MPI\n")
    }
    cpus <- as.numeric(Sys.getenv("SLURM_NTASKS"))
    conf <- list("spec" = cpus,
                 "type" = "MPI",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")

    #===================================================
    # Real dataset
    #===================================================
    seed <- 1234567
    data("MACS", package="IRSF")

    X <- MACS[,c("TTX","EventX","Race","Group3",
                 "DEFB.CNV3","CCR2.SNP","CCR5.SNP2",
                 "CCR5.ORF","CXCL12.SNP2")]

    #===================================================
    # Real dataset
    # Ranking of individual and noise variables by univariate 
    # Minimal Depth of a Maximal Subtree (MDMS)
    # Parallel mode
    #===================================================
    MACS.main.mdms <- rsf.main(X=X,
                               ntree=1000, 
                               method="mdms", 
                               splitrule="logrank", 
                               importance="random", 
                               B=1000, 
                               ci=80,
                               parallel=TRUE, 
                               conf=conf, 
                               verbose=TRUE, 
                               seed=seed)
    
# }

Run the code above in your browser using DataLab