R2GUESS (version 2.0)

R2GUESS: Wrapper function that reads the input files and parameter values required by GUESS, runs the C++ code from R and stores the main GUESS output in an ESS object

Description

The R2GUESS function reads and compiles the data, input files and parameters required to run the GUESS source code. It runs GUESS automatically (with or without GPU support) and saves the results and summary files as text files. For portability, R2GUESS returns an ESS object which compiles information about the inputs and parameters used to run GUESS, together with the outputs detailed in as.ESS.object.

Usage

R2GUESS(dataY, dataX, path.input, path.output, path.par,
    path.init = NULL, file.par, file.init = NULL,
    file.log = NULL, nsweep, burn.in, Egam,
    Sgam, root.file.output, time = TRUE, top = 100,
    history = TRUE, label.X = NULL, label.Y = NULL,
    choice.Y = NULL, nb.chain, conf = NULL, cuda = TRUE,
    MAP.file = NULL, time.limit=NULL,seed=NULL)

Arguments

dataY

either a one element character vector (such as 'dataY.txt') or a data frame. If dataY is entered as a character vector, it specifies, assuming that data are in the path.input folder, the location of the response matrix. In the corresponding file observations are presented in rows, and the (possibly multivariate) outcome(s) in columns. The first two rows (single integers) represent the number of rows (n) and columns (q) in the matrix. If a data frame argument is passed, it links to a nxq numerical matrix compiling the observed responses.

dataX

either a one element character vector (such as 'dataX.txt') or a data frame. If dataX is entered as a character vector, it specifies, assuming that data are in the path.input folder, the location of the predictor matrix. In the corresponding file observations are presented in rows, and the predictors in columns. The first two rows (single integers) represent the number of rows (n) and columns (p) in the matrix. If a data frame argument is passed, it links to a nxp numerical matrix compiling the observed predictors.

path.input

path linking to the directory containing the data (dataX and dataY). If dataX or/and dataY have been entered as data frame(s), the function will generate the corresponding text files required to run GUESS in the path.input folder.

path.output

path indicating the directory in which output files will be saved.

path.par

path indicating the directory in which to find the parameter file needed to run GUESS.

path.init

path indicating the location of the file describing the initial guess of the MCMC procedure (i.e. the variables to include in the initial model).

file.par

name of the parameter file containing all user-specified parameters required to set up the run and the features of the moves. This file is located in path.par and contains fields that are extensively described in http://www.bgx.org.uk/software/GUESS_Doc_short.pdf. These parameters are not mandatory; any parameter that is not specified is set to its default value, which is also given in that documentation. An example of this file is provided in the package.

file.init

name of the file specifying which variables to include at the first iteration of the MCMC run. The first row of the file is a single scalar representing the number of rows (# variables to include). Subsequent rows indicate the position of the covariates to include. This file is optional and if not specified (default=NULL), initial guesses of the MCMC algorithm will be derived from a step-wise regression approach.
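As an illustration of the file.init layout described above, the following minimal sketch writes such a file by hand. The file name and the predictor positions are hypothetical, and the text above does not state whether positions are counted from 0 or from 1, so check the GUESS documentation before relying on the values.

```r
# Hypothetical positions of the predictors to include at the first iteration.
# Assumption: whether GUESS counts positions from 0 or 1 is not stated here.
init.vars <- c(12, 57, 203)

# Hypothetical file name; the file would then be passed through
# path.init (its directory) and file.init (its name).
init.file <- file.path(tempdir(), "init_example.txt")

# First row: number of variables to include; following rows: their positions.
writeLines(as.character(c(length(init.vars), init.vars)), con = init.file)

readLines(init.file)
```

The directory (here tempdir()) and the file name would be supplied as path.init and file.init respectively.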

file.log

name of the log file. This file compiles in real time summary information describing the initial parameters, the computational time and state of the run. This file will also contain information about moves sampled at each sweep. By default (=NULL), the name is given by the argument root.file.output extended by '_log' and for computational efficiency (especially when GPU is enabled), a minimal amount of information is returned.

nsweep

integer specifying the number of sweeps for the MCMC run (including the burn-in).

burn.in

integer specifying the number of sweeps to be discarded to account for the burn-in.

Egam

numeric representing the 'a priori' average model size.

Sgam

numeric representing the 'a priori' standard deviation of the model size.

root.file.output

name specifying the file stem for writing the output files in the directory specified by path.output.

time

Boolean value. When time=TRUE (default value) a file recording the time each sweep took will be created and saved in path.output directory.

top

number of top models to be reported in the output. The default value is 100.

history

Boolean value. When history=TRUE (default value), a number of additional output files that record the history of each move is provided. See section 5 of http://www.bgx.org.uk/software/GUESS_Doc_short.pdf for more details.

label.X

a character vector specifying the names of the predictors. If not specified (=NULL), variables are labelled by their position in the matrix. In the case of SNP data, predictor names and information are given in the MAP.file (field SNPName).

label.Y

a character vector specifying the names of the outcomes. If not specified (=NULL), the outcomes are labelled Y1,...,Yq, where q is the number of columns in the outcome matrix, or are named after the columns of dataY (if it is specified as a data frame).

choice.Y

a character vector or a numeric vector specifying which phenotypes in the response matrix dataY to analyse in a joint model. By default, all phenotypes in the response matrix will be considered.

nb.chain

an integer specifying the number of chains to consider in the evolutionary procedure.

conf

either a one element character vector (such as 'conf.txt') or a data frame. If conf is entered as a character vector, it specifies, assuming that data are in the path.input folder, the location of the confounder matrix. In the corresponding file observations are presented in rows, and the values for the confounders in columns. The first two rows (single integers) represent the number of rows (n) and columns (k) in the matrix. If a data frame argument is passed, it links to a nxk numerical matrix compiling the observed confounders. If specified, the function will replace the response matrix with the residuals from the linear model regressing the outcomes on the confounders.
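The adjustment described for conf can be pictured with base R alone. The sketch below is not the code used inside R2GUESS: it uses simulated toy data with hypothetical variable names, and assumes an ordinary least-squares fit with an intercept.

```r
set.seed(42)
n <- 50

# Toy confounders (k = 2) and toy outcomes (q = 2); all names are hypothetical.
conf.toy <- data.frame(age = rnorm(n, mean = 50, sd = 10),
                       sex = rbinom(n, size = 1, prob = 0.5))
Y.toy <- cbind(Y1 = 0.1 * conf.toy$age + rnorm(n),
               Y2 = 2 * conf.toy$sex + rnorm(n))

# Regress every outcome on the confounders in one multivariate fit.
fit <- lm(Y.toy ~ age + sex, data = conf.toy)

# The n x q residual matrix plays the role of the adjusted response matrix.
Y.adj <- residuals(fit)
dim(Y.adj)
```

By construction, the residuals are uncorrelated with the confounders, so any remaining association with the predictors is not explained by them.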

cuda

a boolean value. cuda=TRUE redirects linear algebra operations towards the GPU. On non-CULA compatible platforms, this option will be ignored.

MAP.file

either a one element character vector or a data frame. If a character vector is used, it specifies, assuming that data are in the path.input folder, the location of the annotation file. In the corresponding file, predictors are presented in rows, and are described as a MAP.file. If a data frame argument is passed, it links to a px3 matrix.

time.limit

a numerical value specifying the maximum computing time (in hours) for the run. If the run exceeds that value, the modelling options, parameter values, state of the pseudo-random number generator, and state of each chain will be saved so that the run can be resumed exactly at the point where it was interrupted (using the resume option). By default (=NULL) the run will go on until its completion.

seed

an integer specifying the random seed used to initialise the pseudo-random number generator. If not specified, the seed will be initialised using the CPU clock.

Value

An object of class ESS containing information listed in as.ESS.object. The object can subsequently be used to post-process the results using provided R functions (such as summary.ESS, plotMPPI, plot.ESS).

Details

For either of the dataX and dataY arguments, if a data frame is passed, a text file labelled data-*-C-CODE.txt will be created in the path.input directory. If conf is specified, additional files containing the adjusted responses will be created following the same file labelling system. These files are formatted to have the structure expected by the C++ code: individuals in rows and variables in columns, with the first two rows indicating the number of rows and columns in the matrix. The returned ESS object will include all result files produced by the code. The number and type of outputs produced depend on the running options chosen. A full description of the available output can be found in http://www.bgx.org.uk/software/GUESS_Doc_short.pdf
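The text-file structure described above (two header rows followed by the matrix) can be reproduced by hand, for instance to prepare an input file without going through a data frame. The helper below is a hypothetical sketch, not a function of the R2GUESS package, and it assumes that values are separated by spaces, which is not stated above.

```r
# Hypothetical helper (not part of R2GUESS): write a numeric matrix with the
# number of rows on line 1, the number of columns on line 2, then the data
# with one individual per row.
# Assumption: space-separated values are accepted by the C++ code.
write_guess_matrix <- function(x, file) {
  x <- as.matrix(x)
  writeLines(c(as.character(nrow(x)), as.character(ncol(x))), con = file)
  write.table(x, file = file, append = TRUE, quote = FALSE,
              row.names = FALSE, col.names = FALSE)
  invisible(file)
}

set.seed(1)
toy.X <- matrix(rnorm(12), nrow = 4, ncol = 3)      # n = 4, p = 3
toy.file <- file.path(tempdir(), "toy-dataX.txt")   # hypothetical file name
write_guess_matrix(toy.X, toy.file)

# Round trip: skip the two header rows and rebuild the matrix.
back <- matrix(scan(toy.file, skip = 2, quiet = TRUE),
               nrow = nrow(toy.X), byrow = TRUE)
all.equal(back, toy.X)
```

A file written this way would be referenced by name (for example dataX = "toy-dataX.txt") with path.input pointing to its directory.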

See Also

as.ESS.object, summary.ESS, plotMPPI, plot.ESS

Examples

# NOT RUN {
path.input <- system.file("Input", package="R2GUESS")
path.output <- tempdir()
path.par <- system.file("extdata", package="R2GUESS")
file.par.Hopx <- "Par_file_example_Hopx.xml"
# You can have a look at the parameter file located at
print(file.path(path.par, file.par.Hopx))
## To reach convergence you may need to increase nsweep (e.g. to 110000)
## and burn.in (e.g. to 10000)
## Running time is approximately 5 minutes
root.file.output.Hopx <- "Example-GUESS-Y-Hopx"
label.Y <- c("ADR","Fat","Heart","Kidney")
data(data.Y.Hopx)
data(data.X)
data(MAP.file)

# Run GUESS on the four phenotypes jointly, on the CPU (cuda = FALSE)
modelY_Hopx <- R2GUESS(dataY = data.Y.Hopx, dataX = data.X, choice.Y = 1:4,
    label.Y = label.Y, MAP.file = MAP.file, file.par = file.par.Hopx,
    file.init = NULL, file.log = NULL,
    root.file.output = root.file.output.Hopx, path.input = path.input,
    path.output = path.output, path.par = path.par, path.init = NULL,
    nsweep = 11000, burn.in = 1000, Egam = 5, Sgam = 5, top = 100,
    history = TRUE, time = TRUE, nb.chain = 3, conf = NULL, cuda = FALSE)

summary(modelY_Hopx,20) # 20 best models

print(modelY_Hopx)
# }
