This function checks data for missing entries and returns a logical vector marking the genes that have non-zero variance and pass two criteria on missing values: the fraction of non-missing samples must reach a given threshold (minFraction), and the total number of non-missing samples must reach a given threshold (minNSamples).
goodGenes(datExpr,
          useSamples = NULL,
          useGenes = NULL,
          minFraction = 1/2,
          minNSamples = ..minNSamples,
          minNGenes = ..minNGenes,
          tol = NULL,
          verbose = 1, indent = 0)
datExpr: expression data. A data frame in which columns are genes and rows are samples.
useSamples: optional specification of which samples to use for the check. Should be a logical vector; samples whose entries are FALSE will be ignored for the missing value counts. Defaults to using all samples.
useGenes: optional specification of genes for which to perform the check. Should be a logical vector; genes whose entries are FALSE will be ignored. Defaults to using all genes.
minFraction: minimum fraction of non-missing samples for a gene to be considered good.
minNSamples: minimum number of non-missing samples for a gene to be considered good.
minNGenes: minimum number of good genes for the data set to be considered fit for analysis. If the actual number of good genes falls below this threshold, an error will be issued.
tol: an optional 'small' number to compare the variance against. Defaults to the square of 1e-10 * max(abs(datExpr), na.rm = TRUE). The reason for comparing the variance to this number, rather than to zero, is that the fast way of computing variance used by this function sometimes produces small numerical round-off errors that make the variance of constant vectors slightly non-zero; comparing the variance to tol rather than to zero prevents such genes from being retained as 'good genes'.
verbose: integer level of verbosity. Zero means silent; higher values make the output progressively more verbose.
indent: indentation for diagnostic messages. Zero means no indentation; each unit adds two spaces.
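The role of tol can be illustrated with a small base R sketch. This is not the function's actual code: the one-pass variance formula, the toy data, and the names fastVar and hasVariance are assumptions made for illustration only.

```r
# Toy expression data: 6 samples (rows) by 3 genes (columns).
datExpr <- data.frame(
  geneA = c(1.2, 3.4, 2.2, 5.1, 4.0, 2.9),        # varies across samples
  geneB = rep(0.1, 6),                             # constant across samples
  geneC = c(1000, 1200, 1100, 1300, 900, 1400)     # varies across samples
)
x <- as.matrix(datExpr)
n <- nrow(x)

# One-pass ("fast") variance: (sum of squares - n * mean^2) / (n - 1).
# For a constant gene the two terms may not cancel exactly in floating point.
fastVar <- (colSums(x^2) - n * colMeans(x)^2) / (n - 1)

# Default tolerance as described for tol: the square of
# 1e-10 * max(abs(datExpr), na.rm = TRUE).
tolSquared <- (1e-10 * max(abs(x), na.rm = TRUE))^2

# A gene counts as having variance only if it clears the tolerance.
hasVariance <- fastVar > tolSquared
print(hasVariance)
```

In this sketch any round-off residue in the variance of geneB is many orders of magnitude below the tolerance, so geneB is flagged as constant whether or not the subtraction happens to cancel exactly.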
A logical vector with one entry per gene that is TRUE if the gene is considered good and FALSE otherwise. Note that all genes excluded by useGenes are automatically assigned FALSE.
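As a rough illustration of how such a logical vector comes about and how it is typically used, the following base R sketch applies the three conditions to toy data and then keeps only the good columns. It does not call goodGenes itself; the toy data, the variable names, and the strictness of each comparison are assumptions.

```r
# Toy data: 8 samples (rows) by 4 genes (columns), with some missing values.
datExpr <- data.frame(
  g1 = c(1, 2, 3, 4, 5, 6, 7, 8),             # complete, varies
  g2 = c(1, NA, 3, NA, 5, 6, 7, 8),           # 6 of 8 present
  g3 = c(NA, NA, NA, NA, NA, 6, 7, 8),        # only 3 of 8 present
  g4 = c(2, 2, 2, 2, 2, 2, 2, 2)              # complete, but constant
)
minFraction <- 1/2
minNSamples <- 4

nSamples <- nrow(datExpr)
nPresent <- colSums(!is.na(datExpr))            # non-missing samples per gene
geneVar  <- sapply(datExpr, var, na.rm = TRUE)  # plain two-pass variance

# One logical entry per gene: TRUE only if all three conditions hold.
good <- nPresent / nSamples > minFraction &
  nPresent >= minNSamples &
  geneVar > 0

# Use the logical vector to keep only the good genes (columns).
goodExpr <- datExpr[, good, drop = FALSE]
print(colnames(goodExpr))
```

With the real function, the same subsetting pattern would apply to its return value, for example datExpr[, goodGenes(datExpr)]. Note that this toy data set has fewer good genes than the default minNGenes, so it only suits the sketch above.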
The constants ..minNSamples and ..minNGenes are both set to the value 4.
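For most data sets, the criterion on the fraction of non-missing samples will be much more stringent than the criterion on the absolute number of non-missing samples: with the defaults (minFraction = 1/2, minNSamples = 4), the fraction criterion demands more non-missing samples as soon as the data set has more than about 8 samples.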
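The comparison between the two criteria comes down to simple arithmetic, sketched below in base R for a few data set sizes. The sample counts are made up, and treating the fraction criterion as "at least minFraction of samples present" is an assumption about the exact comparison.

```r
minFraction <- 1/2
minNSamples <- 4

# Hypothetical data set sizes (number of samples).
nSamples <- c(6, 10, 20, 100)

# Non-missing samples required by each criterion.
neededByFraction <- ceiling(minFraction * nSamples)
neededByCount    <- rep(minNSamples, length(nSamples))

# Whichever criterion requires more non-missing samples is the binding one.
binding <- ifelse(neededByFraction > neededByCount, "fraction", "count")

print(data.frame(nSamples, neededByFraction, neededByCount, binding))
```

Only the smallest data set here is governed by the absolute count; for the others the fraction criterion requires more non-missing samples.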