Given multiple data sets corresponding to the same variables and a grouping of variables into groups, the function selects a representative variable for each group using a variety of possible selection approaches. Typical uses include selecting a representative probe for each gene in microarray data.
consensusRepresentatives(
mdx,
group,
colID,
consensusQuantile = 0,
method = "MaxMean",
useGroupHubs = TRUE,
calibration = c("none", "full quantile"),
selectionStatisticFnc = NULL,
connectivityPower = 1,
minProportionPresent = 1,
getRepresentativeData = TRUE,
statisticFncArguments = list(),
adjacencyArguments = list(),
verbose = 2, indent = 0)
A named vector giving, for each group, the selected representative (input colID or the variable (column) name in mdx). Names correspond to groups.
A logical vector with one entry per variable (column) in input mdx (possibly after restriction to variables occurring in colID), TRUE if the column was selected as a representative.
Only present if getRepresentativeData is TRUE; the input mdx restricted to the representative variables, with column names changed to the corresponding groups.
A multiData structure. All sets must have the same columns.
Character vector whose components contain the group label (e.g. a character string) for each entry of colID. This vector must be of the same length as the vector colID. In gene expression applications, this vector could contain the gene symbol (or a co-expression module label).
Character vector of column identifiers. This must include all the column names from mdx, but can include other values as well. Its entries must be unique (no duplicates) and no missing values are permitted.
A number between 0 and 1 giving the quantile probability for consensus calculation. 0 means the minimum value (true consensus) will be used.
Character string determining which method is used to choose the representative (when useGroupHubs is TRUE, this method is only used for groups with 2 variables).
The following values can be used:
"MaxMean" (default) or "MinMean" return the variable with the highest or lowest mean value, respectively;
"maxRowVariance" returns the variable with the highest variance;
"absMaxMean" or "absMinMean" return the variable with the highest or lowest mean absolute value; and
"function" will call a user-input function (see the description of the argument selectionStatisticFnc). The built-in functions can be instructed to use robust analogs (median and median absolute deviation) by also specifying statisticFncArguments=list(robust = TRUE).
Logical: if TRUE, groups with 3 or more variables will be represented by the variable with the highest connectivity according to a signed weighted correlation network adjacency matrix among the corresponding variables. The connectivity is defined as the row sum of the adjacency matrix. The signed weighted adjacency matrix is defined as A=(0.5+0.5*COR)^power where power is determined by the argument connectivityPower and COR denotes the matrix of pairwise correlation coefficients among the corresponding variables. Additional arguments to the underlying function adjacency can be specified using the argument adjacencyArguments below.
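As an illustration of the connectivity described above, the following minimal sketch computes the signed adjacency and the row-sum connectivity for one group in a single data set, using base R only. The object names (x, power, adj, connectivity, hub) and the simulated data are illustrative assumptions; the package computes the adjacency through the function adjacency, which may differ in details such as the handling of missing data.

```r
# Illustrative data: 30 samples (rows) by 4 variables (columns) of one group.
set.seed(1)
shared <- rnorm(30)
x <- cbind(v1 = shared + rnorm(30, sd = 0.3),
           v2 = shared + rnorm(30, sd = 0.3),
           v3 = shared + rnorm(30, sd = 0.3),
           v4 = rnorm(30))            # v4 is unrelated to the others

power <- 1                            # plays the role of connectivityPower

# Signed weighted adjacency: A = (0.5 + 0.5 * COR)^power
adj <- (0.5 + 0.5 * cor(x, use = "pairwise.complete.obs"))^power

# Connectivity is the row sum of the adjacency matrix. The diagonal adds
# the same constant (1) to every variable, so it does not change the ranking.
connectivity <- rowSums(adj)

# The hub is the variable with the highest connectivity.
hub <- names(which.max(connectivity))
print(connectivity)
print(hub)
```

Because the signed adjacency maps correlations from [-1, 1] to [0, 1], a variable that is negatively correlated with the rest of its group receives low connectivity and is unlikely to be chosen as the hub.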
Character string describing the method of calibration of the selection statistic among the data sets. Recognized values are "none" (no calibration) and "full quantile" (quantile normalization).
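To make the idea of quantile normalization concrete, here is a minimal base R sketch that forces the selection statistics of all data sets onto a common distribution. The helper quantileNormalize and the numbers are hypothetical; the sketch assumes no missing values and breaks ties by order of appearance, whereas the "full quantile" calibration in the package may treat such cases differently.

```r
# Minimal quantile normalization of the columns of a matrix.
# Assumes no missing values; ties are broken by order of appearance.
quantileNormalize <- function(stat) {
  ranks <- apply(stat, 2, rank, ties.method = "first")
  # Reference distribution: mean of the sorted values across columns.
  sortedMeans <- rowMeans(apply(stat, 2, sort))
  out <- apply(ranks, 2, function(r) sortedMeans[r])
  dimnames(out) <- dimnames(stat)
  out
}

# Selection statistics: 4 variables (rows) in 2 data sets (columns).
stat <- cbind(set1 = c(5, 2, 3, 4),
              set2 = c(40, 10, 30, 20))
rownames(stat) <- paste0("v", 1:4)

calibrated <- quantileNormalize(stat)
print(calibrated)
```

After calibration every column contains the same set of values, so only the ranking of variables within each data set influences the consensus.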
User-supplied function used to calculate the selection statistic when method above equals "function". The function must take arguments x (a matrix) and possibly other arguments that can be specified using statisticFncArguments below. The return value must be a vector with one component per column of x giving the selection statistic for each column.
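A function that satisfies this contract could look like the following sketch. The name colIQR and the choice of the interquartile range as the statistic are hypothetical and serve only to show the expected input and output shapes.

```r
# Hypothetical selection statistic: interquartile range of each column.
# x is a matrix (samples in rows, variables in columns); the result has
# one entry per column of x.
colIQR <- function(x, na.rm = TRUE) {
  apply(x, 2, IQR, na.rm = na.rm)
}

# Small check of the contract on made-up data.
x <- cbind(a = c(1, 2, 3, 4, 5),
           b = c(10, 20, 30, 40, NA))
stat <- colIQR(x)
print(stat)
```

Such a function would be supplied through selectionStatisticFnc together with method = "function"; extra arguments such as na.rm would go into statisticFncArguments.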
Positive number (typically integer) for specifying the soft-thresholding power used to construct the signed weighted adjacency matrix, see the description of useGroupHubs. This option is only used if useGroupHubs is TRUE.
A number between 0 and 1 specifying a filter of candidate probes. Specifically, for each group, the variable with the maximum consensus proportion of present data is found. Only variables whose consensus proportion of present data is at least minProportionPresent times the maximum consensus proportion are retained as candidates for being a representative.
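The filter can be sketched in base R as follows. The numbers are made up, and the sketch assumes that the consensus proportion is obtained with the same quantile rule as the consensus selection statistic (with consensusQuantile = 0 this is the minimum across data sets); the source does not spell this out, so treat it as an assumption.

```r
# Proportion of present (non-missing) data: 3 variables of one group (rows)
# in 2 data sets (columns). Values are made up for illustration.
propPresent <- cbind(set1 = c(v1 = 1.00, v2 = 0.90, v3 = 0.60),
                     set2 = c(v1 = 0.95, v2 = 0.80, v3 = 0.70))

consensusQuantile <- 0        # 0 means the minimum across sets
minProportionPresent <- 0.9

# Consensus proportion per variable (assumed to use the same quantile rule
# as the consensus selection statistic).
consProp <- apply(propPresent, 1, quantile, probs = consensusQuantile,
                  names = FALSE)

# Keep variables within the required fraction of the group's best variable.
keep <- consProp >= minProportionPresent * max(consProp)
print(keep)
```

With the default minProportionPresent = 1, only variables that tie for the maximum consensus proportion in their group remain candidates.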
Logical: should the representative data, i.e., mdx restricted to the representative variables, be returned?
A list giving further arguments to the selection statistic function. Can be used to supply additional arguments to the user-specified selectionStatisticFnc; the value list(robust = TRUE) can be used with the built-in functions to use their robust variants.
Further arguments to the function adjacency, e.g. adjacencyArguments=list(corFnc = "bicor", corOptions = "use = 'p', maxPOutliers = 0.05") will select the robust correlation bicor with a good set of options. Note that the adjacency arguments type and power cannot be changed.
Level of verbosity; 0 means silent, larger values will cause progress messages to be printed.
Indent for the diagnostic messages; each unit equals two spaces.
Peter Langfelder, based on code by Jeremy Miller
This function was inspired by collapseRows, but there are also important differences. This function focuses on selecting representatives; when summarization is more important, collapseRows provides more flexibility since it does not require that a single representative be selected.
This function and collapseRows use different input and output conventions; user-specified functions need to be tailored differently for collapseRows than for consensusRepresentatives.
Missing data are allowed and are treated as missing at random. If colID is NULL, it is replaced by the variable names in mdx.
All groups with a single variable are represented by that variable, unless the consensus proportion of present data in the variable is lower than minProportionPresent, in which case the variable and the group are excluded from the output.
For all variables belonging to groups with 2 variables (when useGroupHubs=TRUE) or with at least 2 variables (when useGroupHubs=FALSE), selection statistics are calculated in each set (e.g., the selection statistic may be the mean, variance, etc.). This results in a matrix of selection statistics (one entry per variable per data set). The selection statistics are next optionally calibrated (normalized) between sets to make them comparable; currently the only implemented calibration method is quantile normalization.
For each variable, the consensus selection statistic is calculated as the consensus of the (calibrated) selection statistics across the data sets. The 'consensus' of a vector (say 'x') is simply defined as the quantile with probability consensusQuantile of the vector x. Important exception: for the "MinMean" and "absMinMean" methods, the consensus is the quantile with probability 1-consensusQuantile, since the idea of the consensus is to select the worst (or close to worst) value across the data sets.
For each group, the representative is selected as the variable with the best (typically highest, but for "MinMean" and "absMinMean" methods the lowest) consensus selection statistic.
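The consensus and selection steps described above can be sketched in base R. The statistics below are made up and all object names are illustrative; the sketch only mirrors the quantile rule stated in the text and is not the package's internal code.

```r
# Calibrated selection statistics (e.g. means): 5 variables (rows) in
# 3 data sets (columns). Values are made up for illustration.
stat <- rbind(p1 = c(5.0, 6.0, 4.0),
              p2 = c(5.5, 3.0, 6.5),
              p3 = c(2.0, 2.5, 3.0),
              p4 = c(7.0, 7.5, 8.0),
              p5 = c(9.0, 1.0, 9.5))
colnames(stat) <- c("set1", "set2", "set3")
group <- c(p1 = "A", p2 = "A", p3 = "B", p4 = "B", p5 = "B")

consensusQuantile <- 0

# Consensus for "higher is better" methods such as "MaxMean".
consMax <- apply(stat, 1, quantile, probs = consensusQuantile, names = FALSE)
# Consensus for "MinMean" / "absMinMean": quantile 1 - consensusQuantile.
consMin <- apply(stat, 1, quantile, probs = 1 - consensusQuantile,
                 names = FALSE)

# Representative per group: highest consensus for "MaxMean",
# lowest consensus for "MinMean".
repMax <- sapply(split(consMax, group), function(s) names(which.max(s)))
repMin <- sapply(split(consMin, group), function(s) names(which.min(s)))
print(repMax)
print(repMin)
```

In this toy data p5 has the largest values in two of the three sets, but its low value in set2 drags its consensus down, so under the "MaxMean" rule the more consistent p4 represents group B.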
If useGroupHubs=TRUE, the intra-group connectivity is calculated for all variables in each set. The intra-group connectivities are optionally calibrated (normalized) between sets, and consensus intra-group connectivity is calculated similarly to the consensus selection statistic above. In each group, the variable with the highest consensus intra-group connectivity is chosen as the representative.
multiData for a description of the multiData structures;
collapseRows that solves a related but different problem. Please note the differences in input and output!
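A small end-to-end call might look like the sketch below. The simulated data, the probe and gene labels, and the helper makeSet are made up for illustration. The call is only attempted when the WGCNA package is installed, and the sketch assumes that the WGCNA function multiData can be used to assemble the input from named matrices. The result is inspected with str rather than by assuming particular component names.

```r
# Simulated data: two sets with the same 6 variables (columns).
set.seed(10)
nSamples <- 20
colID <- paste0("probe", 1:6)
group <- c("geneA", "geneA", "geneA", "geneB", "geneB", "geneC")

# Hypothetical helper that simulates one data set.
makeSet <- function() {
  x <- matrix(rnorm(nSamples * length(colID)), nSamples, length(colID))
  colnames(x) <- colID
  x
}
set1 <- makeSet()
set2 <- makeSet()

# Run only if WGCNA is available.
if (requireNamespace("WGCNA", quietly = TRUE)) {
  suppressPackageStartupMessages(library(WGCNA))
  mdx <- multiData(Set1 = set1, Set2 = set2)
  res <- consensusRepresentatives(mdx, group = group, colID = colID,
                                  method = "MaxMean", useGroupHubs = TRUE,
                                  verbose = 0)
  str(res)
}
```

With these settings geneA (3 variables) would be handled by the hub rule, geneB (2 variables) by the "MaxMean" rule, and geneC (a single variable) would be represented by its only variable.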