Returns an object of class "mvr". Offers the option of parallel computation for improved efficiency.
mvr(data,
block = rep(1,nrow(data)),
tolog = FALSE,
nc.min = 1,
nc.max = 30,
probs = seq(0, 1, 0.01),
B = 100,
parallel = FALSE,
conf = NULL,
verbose = TRUE)
data: numeric matrix of untransformed (raw) data, where samples are by rows and variables (to be clustered) are by columns, or an object that can be coerced to such a matrix (such as a num…).
block: character or numeric vector, or factor, grouping/blocking variable of length the sample size. Defaults to single group situation (see details).
tolog: logical scalar. Is the data to be log2-transformed first? Optional, defaults to FALSE. Note that negative or null values will be changed to 1 before taking log2-transformation.
nc.min: integer scalar of the minimum number of clusters, defaults to 1.
nc.max: integer scalar of the maximum number of clusters, defaults to 30.
probs: numeric vector of probabilities for quantile diagnostic plots. Defaults to seq(0, 1, 0.01).
B: integer scalar of the number of Monte Carlo replicates of the inner loop of the sim statistic function (see details).
parallel: logical scalar. Is parallel computing to be performed? Optional, defaults to FALSE.
conf: list of parameters for cluster configuration.
Inputs for function makeCluster (R package …). Defaults to NULL.
verbose: logical scalar. Is the output to be verbose? Optional, defaults to TRUE.
The returned object of class "mvr" contains:
numeric matrix of original data.
numeric matrix of MVR-transformed data.
numeric vector of centering values for standardization (cluster mean of pooled sample mean).
numeric vector of scaling values for standardization (cluster mean of pooled sample std dev).
list (of size the number of groups) containing for each group:
  numeric vector of cluster membership of each variable
  integer scalar of number of clusters found in optimal cluster configuration
  numeric vector of the similarity statistic values
  numeric vector of the standard errors of the similarity statistic values
  numeric matrix (K x p) of the vector of standardized means by groups (rows), where K = #groups and p = #variables
  numeric matrix (K x p) of the vector of standardized standard deviations by groups (rows), where K = #groups and p = #variables
  numeric matrix (nc.max - nc.min + 1) x (length(probs)) of quantiles of means
  numeric matrix (nc.max - nc.min + 1) x (length(probs)) of quantiles of standard deviations
block.
tolog.
nc.min.
nc.max.
probs.
makeCluster (R package …), justvsn (R package …)
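The pre-processing described for argument tolog (negative or null values changed to 1 before the log2-transformation) can be illustrated with a small sketch. The helper name log2_with_floor is hypothetical and not part of MVR, and reading "null" as "zero" is an assumption; the package's internal code may differ.

```r
# Hypothetical helper (not part of MVR): mimics the pre-processing
# described for tolog = TRUE, under the assumption that "null" means zero.
log2_with_floor <- function(x) {
  x <- as.matrix(x)
  # Replace non-positive entries by 1, so that their log2 is 0
  x[!is.na(x) & x <= 0] <- 1
  log2(x)
}

# Toy 2 x 2 matrix (filled column-wise) with one zero and one negative entry
toy <- matrix(c(8, 0, -3, 2), nrow = 2)
out <- log2_with_floor(toy)
# out has rows (3, 0) and (0, 1)
```

With this convention, non-positive entries end up at 0 on the log2 scale instead of producing -Inf or NaN.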
#===================================================
# MVR package news
#===================================================
MVR.news()

#================================================
# MVR package citation
#================================================
citation("MVR")

#===================================================
# Loading of the Synthetic and Real datasets
# (see description of datasets)
#===================================================
data("Synthetic", "Real", package="MVR")
?Synthetic
?Real

#===================================================
# Mean-Variance Regularization (Synthetic dataset)
# Single-Group Assumption
# Assuming equal variance between groups
# Without cluster usage
#===================================================
nc.min <- 1
nc.max <- 10
probs <- seq(0, 1, 0.01)
n <- 10
mvr.obj <- mvr(data = Synthetic,
               block = rep(1,n),
               tolog = FALSE,
               nc.min = nc.min,
               nc.max = nc.max,
               probs = probs,
               B = 100,
               parallel = FALSE,
               conf = NULL,
               verbose = TRUE)

#===================================================
# Examples of parallelization below with
# a SOCKET or MPI cluster configuration
#===================================================
# 1- WINDOWS multicores PC with SOCKET communication
#    With a 2-Quad (8-CPUs) PC
#===================================================
if (.Platform$OS.type == "windows") {
    # detectCores() is provided by the R package parallel
    cpus <- parallel::detectCores()
    conf <- list("names" = rep("localhost", cpus),
                 "cpus" = cpus,
                 "type" = "SOCK",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
}
#===================================================
# 2- LINUX multinodes cluster with SOCKET communication
#    with 4-nodes (32-CPUs) cluster
#    with 1 masternode and 3 workernodes
#    All hosts run identical setups
#    Same number of core CPUs (8) per node
#===================================================
if (.Platform$OS.type == "unix") {
    masterhost <- Sys.getenv("HOSTNAME")
    slavehosts <- c("compute-0-0", "compute-0-1", "compute-0-2")
    nodes <- length(slavehosts) + 1
    cpus <- 8
    conf <- list("names" = c(rep(masterhost, cpus),
                             rep(slavehosts, cpus)),
                 "cpus" = nodes * cpus,
                 "type" = "SOCK",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
}
#===================================================
# 3- LINUX multinodes cluster with MPI communication
#    Here, a file named ".nodes" (e.g. in the home directory)
#    must contain the list of nodes of the cluster
#===================================================
if (.Platform$OS.type == "unix") {
    hosts <- scan(file=paste(Sys.getenv("HOME"), "/.nodes", sep=""),
                  what="",
                  sep="\n")
    hostnames <- unique(hosts)
    nodes <- length(hostnames)
    cpus <- length(hosts)/length(hostnames)
    conf <- list("cpus" = nodes * cpus,
                 "type" = "MPI",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
}
#===================================================
# Run:
# Mean-Variance Regularization (Real dataset)
# Multi-Group Assumption
# Assuming unequal variance between groups
#===================================================
nc.min <- 1
nc.max <- 30
probs <- seq(0, 1, 0.01)
n <- 6
GF <- factor(gl(n = 2, k = n/2, len = n),
             ordered = FALSE,
             labels = c("M", "S"))
mvr.obj <- mvr(data = Real,
               block = GF,
               tolog = FALSE,
               nc.min = nc.min,
               nc.max = nc.max,
               probs = probs,
               B = 100,
               parallel = TRUE,
               conf = conf,
               verbose = TRUE)

Argument block is a vector or a factor used as grouping/blocking variable. Its length must equal the sample size,
and it must take as many different character or numeric values as there are levels or sample groups.
It defaults to the single group situation, i.e. under the assumption of equal variance between sample groups.
All group sample sizes must be greater than 1, otherwise the program will stop.
Note that argument B is internally reset to conf$cpus*ceiling(B/conf$cpus) when parallelization
is used (i.e. conf is non-NULL), where conf$cpus denotes the total number of CPUs to be used (see below).
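As a worked illustration of the reset formula conf$cpus*ceiling(B/conf$cpus) stated above, here is a minimal sketch with made-up values. The variable names cpus, B_eff and per_cpu are illustrative only, and the per-CPU interpretation is an inference from the formula rather than something this page states.

```r
B <- 100      # requested number of Monte Carlo replicates
cpus <- 8     # stands in for conf$cpus (illustrative value)

# Round B up to the next multiple of the number of CPUs
B_eff <- cpus * ceiling(B / cpus)   # 8 * ceiling(12.5) = 8 * 13 = 104

# Presumably each CPU then handles the same number of replicates
per_cpu <- B_eff / cpus             # 13
```

So a request of B = 100 on 8 CPUs would effectively run 104 replicates; when B is already a multiple of conf$cpus, the formula leaves it unchanged.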
Argument nc.max currently defaults to 30. Empirically, we found that this is enough for most datasets tested.
Whether it suffices depends on (i) the dimensionality/sample size ratio $\frac{p}{n}$, (ii) the signal/noise ratio, and
(iii) whether a pre-transformation has been applied (see Dazard, J-E. and J. S. Rao (2012) for more details).
See the cluster diagnostic function cluster.diagnostic to check whether larger values of nc.max may be required.
To run a parallel session (and parallel RNG) of the MVR procedures (parallel=TRUE), argument conf
must be specified (i.e. non-NULL). It must list the following parameters for cluster configuration:
"names", "cpus", "type", "homo", "verbose", "outfile". These match the arguments described in function makeCluster
(R package …):
"names" (names): character vector specifying the host names on which to run the job. Could default to a unique local machine, in which case one may use the unique host name "localhost". Each host name can potentially be repeated to the number of CPU cores available on the corresponding machine.
"cpus" (spec): integer scalar specifying the total number of CPU cores to be used across the network of available nodes, counting the workernodes and masternode.
"type" (type): character vector specifying the cluster type ("SOCK", "PVM", "MPI").
"homo" (homogeneous): logical scalar to be set to FALSE for inhomogeneous clusters.
"verbose" (verbose): logical scalar to be set to FALSE for quiet mode.
"outfile" (outfile): character vector of the output log file name for the workernodes.
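Before calling mvr with parallel = TRUE, it may help to check which of the six entries listed above are present in conf. The sketch below uses a hypothetical helper, absent_entries, which is not part of MVR; it only compares names and does not validate the values.

```r
# Entry names as listed in the details above
conf_entries <- c("names", "cpus", "type", "homo", "verbose", "outfile")

# Hypothetical helper (not part of MVR): returns the listed entries
# that are not present in a conf list
absent_entries <- function(conf) {
  setdiff(conf_entries, names(conf))
}

# Illustrative two-core local SOCKET configuration
conf <- list("names" = rep("localhost", 2),
             "cpus" = 2,
             "type" = "SOCK",
             "homo" = TRUE,
             "verbose" = TRUE,
             "outfile" = "")

absent_entries(conf)               # character(0): nothing is absent
absent_entries(list("cpus" = 2))   # all entries except "cpus"
```

The helper reports rather than stops, because the MPI example on this page builds conf without a "names" entry, so an absent entry is not necessarily an error.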