Returns an object of class "mvr". Offers the option of parallel computation for improved efficiency.
mvr(data,
block = rep(1,nrow(data)),
tolog = FALSE,
nc.min = 1,
nc.max = 30,
probs = seq(0, 1, 0.01),
B = 100,
parallel = FALSE,
conf = NULL,
verbose = TRUE)
data: numeric matrix of untransformed (raw) data, where samples are by rows and variables (to be clustered) are by columns, or an object that can be coerced to such a matrix (such as a num
block: character or numeric vector or factor grouping/blocking variable of length the sample size. Defaults to single group situation (see details).
tolog: logical scalar. Is the data to be log2-transformed first? Optional, defaults to FALSE. Note that negative or null values will be changed to 1 before taking log2-transformation.
nc.min: integer scalar of the minimum number of clusters, defaults to 1.
nc.max: integer scalar of the maximum number of clusters, defaults to 30.
probs: numeric vector of probabilities for quantile diagnostic plots. Defaults to seq(0, 1, 0.01).
B: integer scalar of the number of Monte Carlo replicates of the inner loop of the sim statistic function (see details).
parallel: logical scalar. Is parallel computing to be performed? Optional, defaults to FALSE.
conf: list of parameters for cluster configuration. Inputs for function makeCluster. Optional, defaults to NULL. See details.
verbose: logical scalar. Is the output to be verbose? Optional, defaults to TRUE.
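The pre-transformation rule stated for tolog can be illustrated with a minimal sketch in base R. It assumes that "null values" means zeros; log2_pretransform is a hypothetical helper written for illustration and is not the package's internal code.

```r
# Hypothetical helper (not part of the MVR package): mimics the documented
# rule that negative or null (assumed: zero) values become 1 before log2.
log2_pretransform <- function(x) {
  x <- as.matrix(x)
  x[x <= 0] <- 1   # log2(1) is 0, so these entries map to 0
  log2(x)
}

# Two samples (rows) by three variables (columns)
x <- matrix(c(-3, 0, 1, 2, 4, 8), nrow = 2)
log2_pretransform(x)
```

Replacing non-positive entries by 1 keeps the result finite, since log2 of zero or of a negative number would otherwise give -Inf or NaN.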
numeric matrix of original data.
numeric matrix of MVR-transformed data.
numeric vector of centering values for standardization (cluster mean of pooled sample mean).
numeric vector of scaling values for standardization (cluster mean of pooled sample std dev).
list (of size the number of groups) containing for each group:
    numeric vector of cluster membership of each variable
    integer scalar of number of clusters found in optimal cluster configuration
    numeric vector of the similarity statistic values
    numeric vector of the standard errors of the similarity statistic values
    numeric matrix (K x p) of the vector of standardized means by groups (rows), where K = #groups and p = #variables
    numeric matrix (K x p) of the vector of standardized standard deviations by groups (rows), where K = #groups and p = #variables
    numeric matrix (nc.max - nc.min + 1) x (length(probs)) of quantiles of means
    numeric matrix (nc.max - nc.min + 1) x (length(probs)) of quantiles of standard deviations
block, tolog, nc.min, nc.max, probs
makeCluster, justvsn
#===================================================
# MVR package news
#===================================================
MVR.news()

#================================================
# MVR package citation
#================================================
citation("MVR")

#===================================================
# Loading of the Synthetic and Real datasets
# (see description of datasets)
#===================================================
data("Synthetic", "Real", package="MVR")
?Synthetic
?Real

#===================================================
# Mean-Variance Regularization (Synthetic dataset)
# Single-Group Assumption
# Assuming equal variance between groups
# Without cluster usage
#===================================================
nc.min <- 1
nc.max <- 10
probs <- seq(0, 1, 0.01)
n <- 10
mvr.obj <- mvr(data = Synthetic,
               block = rep(1,n),
               tolog = FALSE,
               nc.min = nc.min,
               nc.max = nc.max,
               probs = probs,
               B = 100,
               parallel = FALSE,
               conf = NULL,
               verbose = TRUE)
#===================================================
# Examples of parallelization below with
# a SOCKET or MPI cluster configuration
#===================================================
# 1- WINDOWS multicores PC with SOCKET communication
#    With a 2-Quad (8-CPUs) PC
#===================================================
if (.Platform$OS.type == "windows") {
    # detectCores() is provided by the base R package parallel
    cpus <- parallel::detectCores()
    conf <- list("names" = rep("localhost", cpus),
                 "cpus" = cpus,
                 "type" = "SOCK",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
}

#===================================================
# 2- LINUX multinodes cluster with SOCKET communication
#    with 4-nodes (32-CPUs) cluster
#    with 1 masternode and 3 workernodes
#    All hosts run identical setups
#    Same number of core CPUs (8) per node
#===================================================
if (.Platform$OS.type == "unix") {
    masterhost <- Sys.getenv("HOSTNAME")
    slavehosts <- c("compute-0-0", "compute-0-1", "compute-0-2")
    nodes <- length(slavehosts) + 1
    cpus <- 8
    conf <- list("names" = c(rep(masterhost, cpus),
                             rep(slavehosts, cpus)),
                 "cpus" = nodes * cpus,
                 "type" = "SOCK",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
}

#===================================================
# 3- LINUX multinodes cluster with MPI communication
#    Here, a file named ".nodes" (e.g. in the home directory)
#    must contain the list of nodes of the cluster
#===================================================
if (.Platform$OS.type == "unix") {
    hosts <- scan(file=paste(Sys.getenv("HOME"), "/.nodes", sep=""),
                  what="",
                  sep="\n")
    hostnames <- unique(hosts)
    nodes <- length(hostnames)
    cpus <- length(hosts)/length(hostnames)
    conf <- list("cpus" = nodes * cpus,
                 "type" = "MPI",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
}

#===================================================
# Run:
# Mean-Variance Regularization (Real dataset)
# Multi-Group Assumption
# Assuming unequal variance between groups
#===================================================
nc.min <- 1
nc.max <- 30
probs <- seq(0, 1, 0.01)
n <- 6
GF <- factor(gl(n = 2, k = n/2, length = n),
             ordered = FALSE,
             labels = c("M", "S"))
mvr.obj <- mvr(data = Real,
               block = GF,
               tolog = FALSE,
               nc.min = nc.min,
               nc.max = nc.max,
               probs = probs,
               B = 100,
               parallel = TRUE,
               conf = conf,
               verbose = TRUE)
Argument block is a vector or a factor grouping/blocking variable. It must be of length the sample size, with as many different character or numeric values as the number of levels or sample groups. It defaults to the single-group situation, i.e. under the assumption of equal variance between sample groups. All group sample sizes must be greater than 1, otherwise the program will stop.
Note that argument B is internally reset to conf$cpus * ceiling(B / conf$cpus) when parallelization is used (i.e. conf is non-NULL), where conf$cpus denotes the total number of CPUs to be used (see below).
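As a worked illustration of the reset rule for B, the following base R sketch evaluates the documented expression for two CPU counts. The helper name reset_B is hypothetical and is not part of the MVR package; only the formula itself comes from the text above.

```r
# Hypothetical helper (not part of the MVR package): evaluates the documented
# adjustment B -> cpus * ceiling(B / cpus), i.e. B rounded up to the next
# multiple of the number of CPUs.
reset_B <- function(B, cpus) {
  cpus * ceiling(B / cpus)
}

reset_B(B = 100, cpus = 8)   # 8 * ceiling(12.5) = 8 * 13 = 104
reset_B(B = 100, cpus = 4)   # 4 * 25 = 100, already a multiple of 4
```

Under this rule the effective number of Monte Carlo replicates is never smaller than the requested B, and it is always a multiple of conf$cpus.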
Argument nc.max currently defaults to 30. Empirically, we found that this is enough for most datasets tested. This depends on (i) the dimensionality/sample size ratio $\frac{p}{n}$, (ii) the signal/noise ratio, and (iii) whether a pre-transformation has been applied (see Dazard, J-E. and J. S. Rao (2012) for more details). See the cluster diagnostic function cluster.diagnostic to determine whether larger values of nc.max may be required.
To run a parallel session (and parallel RNG) of the MVR procedures (parallel = TRUE), argument conf must be specified (i.e. non-NULL). It must list the specifications of the following parameters for cluster configuration: "names", "cpus", "type", "homo", "verbose", "outfile". These match the arguments described in function makeCluster:
names: character vector specifying the host names on which to run the job. Could default to a unique local machine, in which case one may use the unique host name "localhost". Each host name can potentially be repeated to the number of CPU cores available on the corresponding machine.
spec: integer scalar specifying the total number of CPU cores to be used across the network of available nodes, counting the workernodes and masternode.
type: character vector specifying the cluster type ("SOCK", "PVM", "MPI").
homogeneous: logical scalar to be set to FALSE for inhomogeneous clusters.
verbose: logical scalar to be set to FALSE for quiet mode.
outfile: character vector of the output log file name for the workernodes.
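Since conf is expected to list the six fields "names", "cpus", "type", "homo", "verbose" and "outfile", it can help to check a configuration list before passing it to mvr. The sketch below is one possible check in base R; missing_conf_fields is a hypothetical helper, not a function of the MVR package, and the two-CPU local configuration is only an assumed example.

```r
# Hypothetical helper (not part of the MVR package): returns the names of the
# documented configuration fields that are absent from a conf list.
missing_conf_fields <- function(conf) {
  required <- c("names", "cpus", "type", "homo", "verbose", "outfile")
  setdiff(required, names(conf))
}

# Assumed example: a single local machine using 2 CPU cores over sockets
cpus <- 2
conf <- list("names" = rep("localhost", cpus),
             "cpus" = cpus,
             "type" = "SOCK",
             "homo" = TRUE,
             "verbose" = FALSE,
             "outfile" = "")

missing_conf_fields(conf)                # no field is missing
missing_conf_fields(list(type = "MPI"))  # every field except "type"
```

The check only verifies that the fields are present; it does not validate their values or start a cluster.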