This function checks data for missing entries and zero variance across multiple data sets and returns a list of samples and genes that pass criteria on the maximum number of missing values. If necessary, the filtering is iterated.
goodSamplesGenesMS(
multiExpr,
minFraction = 1/2,
minNSamples = ..minNSamples,
minNGenes = ..minNGenes,
tol = NULL,
verbose = 2, indent = 0)
expression data in the multi-set format (see checkSets). A vector of lists, one per set. Each set must contain a component data that contains the expression data, with rows corresponding to samples and columns to genes or probes.
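As a rough illustration of the multi-set format described above, the following sketch builds a two-set input by hand in base R. The set names (setA, setB), the dimensions, and the random values are assumptions made only for this example.

```r
# Illustrative only: two sets measured on the same 20 genes,
# with 10 and 8 samples respectively (rows = samples, columns = genes).
set.seed(1)
nGenes <- 20
multiExpr <- list(
  setA = list(data = matrix(rnorm(10 * nGenes), nrow = 10, ncol = nGenes)),
  setB = list(data = matrix(rnorm(8 * nGenes), nrow = 8, ncol = nGenes))
)

# Each set is a list with a component named 'data'.
sapply(multiExpr, function(set) dim(set$data))
```

The sets may differ in the number of samples, but the columns (genes) are expected to correspond across sets.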
minimum fraction of non-missing samples for a gene to be considered good.
minimum number of non-missing samples for a gene to be considered good.
minimum number of good genes for the data set to be considered fit for analysis. If the actual number of good genes falls below this threshold, an error will be issued.
an optional 'small' number to compare the variance against. For each set in multiExpr, the default value is 1e-10 * max(abs(multiExpr[[set]]$data), na.rm = TRUE). The reason for comparing the variance to this number, rather than zero, is that the fast way of computing variance used by this function sometimes causes small numerical overflow errors which make the variance of constant vectors slightly non-zero; comparing the variance to tol rather than zero prevents such genes from being retained as 'good genes'.
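The default tolerance and the comparison it enables can be sketched in base R as follows. The one-pass variance formula and the strict comparison (variance > tol) are assumptions made for illustration; the package's internal computation may differ.

```r
# Illustrative only: one set with a constant gene and a varying gene.
set <- list(data = cbind(constant = rep(5, 6), varying = c(1, 2, 3, 4, 5, 6)))

# Default tolerance as documented: 1e-10 times the largest absolute value.
tol <- 1e-10 * max(abs(set$data), na.rm = TRUE)

# A fast one-pass variance (assumed form): n/(n-1) * (mean(x^2) - mean(x)^2).
# Formulas of this kind can return tiny non-zero values for constant vectors.
fastVar <- function(x) {
  n <- length(x)
  (mean(x^2) - mean(x)^2) * n / (n - 1)
}
v <- apply(set$data, 2, fastVar)

# Genes are kept only if their variance exceeds the tolerance.
keep <- v > tol
```

Here keep marks the constant column as FALSE and the varying column as TRUE.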
integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.
indentation for diagnostic messages. Zero means no indentation, each unit adds two spaces.
A list with the following components:
A list with one component per given set. Each component is a logical vector with one entry per sample in the corresponding set that is TRUE if the sample is considered good and FALSE otherwise.
A logical vector with one entry per gene that is TRUE if the gene is considered good and FALSE otherwise.
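One way the returned logical vectors might be used is to subset each set to its good samples and the shared good genes. The sketch below uses a hand-built result list with component names goodSamples and goodGenes (taken from the usual naming of these components; treat them as an assumption here) so that it runs without the package.

```r
# Illustrative only: tiny multi-set input with 4 genes.
multiExpr <- list(
  list(data = matrix(1:12, nrow = 3, ncol = 4)),
  list(data = matrix(1:8, nrow = 2, ncol = 4))
)

# Hand-built stand-in for the function's return value (assumed names).
gsg <- list(
  goodSamples = list(c(TRUE, TRUE, FALSE), c(TRUE, TRUE)),
  goodGenes = c(TRUE, FALSE, TRUE, TRUE)
)

# Keep only good samples (rows) and good genes (columns) in every set.
cleaned <- multiExpr
for (set in seq_along(multiExpr)) {
  cleaned[[set]]$data <-
    multiExpr[[set]]$data[gsg$goodSamples[[set]], gsg$goodGenes, drop = FALSE]
}
```

Using drop = FALSE keeps the result a matrix even if only one sample or gene remains.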
This function iteratively identifies samples and genes with too many missing entries, and genes with
zero variance. Iterations may be
required since excluding samples effectively changes criteria on genes and vice versa. The process is
repeated until the lists of good samples and genes are stable.
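The iteration described above can be sketched in base R. This is a simplified illustration, not the package implementation: the helper name filterMultiSet is hypothetical, and the exact criteria (a gene must pass in every set, and a sample must have at least minFraction non-missing values among the good genes) are assumptions.

```r
# Hypothetical helper sketching the iterative filtering; not the package code.
filterMultiSet <- function(multiExpr, minFraction = 1/2,
                           minNSamples = 4, minNGenes = 4) {
  nGenes <- ncol(multiExpr[[1]]$data)
  goodGenes <- rep(TRUE, nGenes)
  goodSamples <- lapply(multiExpr, function(s) rep(TRUE, nrow(s$data)))
  repeat {
    oldGenes <- goodGenes
    oldSamples <- goodSamples
    # Gene step: judge genes using only the currently good samples.
    for (set in seq_along(multiExpr)) {
      x <- multiExpr[[set]]$data[goodSamples[[set]], , drop = FALSE]
      tol <- 1e-10 * max(abs(multiExpr[[set]]$data), na.rm = TRUE)
      nPresent <- colSums(!is.na(x))
      v <- apply(x, 2, var, na.rm = TRUE)
      ok <- nPresent >= minNSamples & nPresent >= minFraction * nrow(x) &
        !is.na(v) & v > tol
      goodGenes <- goodGenes & ok  # assumed: a gene must pass in every set
    }
    if (sum(goodGenes) < minNGenes) stop("Too few good genes remain.")
    # Sample step: judge samples using only the currently good genes.
    for (set in seq_along(multiExpr)) {
      x <- multiExpr[[set]]$data[, goodGenes, drop = FALSE]
      nPresent <- rowSums(!is.na(x))
      goodSamples[[set]] <- goodSamples[[set]] & nPresent >= minFraction * ncol(x)
    }
    # Stop once neither list changed during this pass.
    if (identical(goodGenes, oldGenes) && identical(goodSamples, oldSamples)) break
  }
  list(goodSamples = goodSamples, goodGenes = goodGenes)
}

# Small demonstration with made-up data.
set.seed(42)
a <- matrix(rnorm(6 * 8), nrow = 6, ncol = 8)
b <- matrix(rnorm(8 * 8), nrow = 8, ncol = 8)
a[, 2] <- NA   # gene 2 is entirely missing in the first set
b[, 4] <- 3    # gene 4 is constant in the second set
b[1, ] <- NA   # sample 1 is entirely missing in the second set
res <- filterMultiSet(list(list(data = a), list(data = b)))
```

In this demonstration the first pass flags genes 2 and 4 and the first sample of the second set; the second pass changes nothing, so the loop stops.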
The constants ..minNSamples and ..minNGenes are both set to the value 4.
goodGenes, goodSamples, goodSamplesGenes for cleaning individual sets separately;
goodSamplesMS, goodGenesMS for additional cleaning of multiple data sets together.