mv_ag: Multi Value Trait Aggregation function

Description

Earth mover's distance (EMD) calculations can get very heavy with large datasets. For example, in a lemnatech dataset filtered to images from every 5th day, there are 6332^2 = 40,094,224 pairwise EMD values. In long format that is a dataframe of roughly 40 million rows, which is unwieldy. This function helps reduce the size of a dataset before comparing histograms and moving on to matrix methods or network analysis.
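
The arithmetic above can be checked with a short base R sketch. The 6332 images come from the example in the description; the reduced count of 600 rows is a made-up number used only to show how quickly the pairwise table shrinks once observations are aggregated.

```r
# Number of images in the example from the description
n_images <- 6332

# Rows in a long-format table holding every ordered pair of images
n_pairs_full <- n_images^2

# Hypothetical number of rows left after aggregation (illustrative only)
n_reduced <- 600
n_pairs_reduced <- n_reduced^2

# Print both counts with thousands separators
format(c(full = n_pairs_full, reduced = n_pairs_reduced), big.mark = ",")

# Factor by which the pairwise table shrinks
shrink_factor <- n_pairs_full / n_pairs_reduced
shrink_factor
```

Because the number of pairs grows with the square of the number of rows, cutting the rows by a factor of k cuts the pairwise table by roughly k^2.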

Usage

mv_ag(
  df,
  group,
  mvCols = "frequencies",
  n_per_group = 1,
  outRows = NULL,
  keep = NULL,
  parallel = getOption("mc.cores", 1),
  traitCol = "trait",
  labelCol = "label",
  valueCol = "value",
  id = "image"
)

Value

Returns a dataframe summarized by the specified groups over the multi-value traits.

Arguments

df: A dataframe with multi value traits. This can be in wide or long format. Data is assumed to be long if traitCol, valueCol, and labelCol are all present.
group: Vector of column names for the variables that uniquely identify the groups to summarize over. Typically these would be the design variables and a time variable.
mvCols: Either a vector of column names/positions representing multi value traits or a character string that identifies the multi value trait columns as a regex pattern. Defaults to "frequencies".
n_per_group: Number of rows to return for each group.
outRows: Optional alternative way to specify how many rows to return, given as an approximate total. The result will often not match this number exactly, because each group is given the same number of observations.
keep: A vector of single value traits to also average over groups, if there is a mix of single and multi value traits in your data.
parallel: Optionally, the groups can be run in parallel with this number of cores. Defaults to 1 if the "mc.cores" option is not set globally.
traitCol: Column with phenotype names, defaults to "trait".
labelCol: Column with phenotype labels (units), defaults to "label".
valueCol: Column with phenotype values, defaults to "value".
id: Column that uniquely identifies images if the data is in long format. This is ignored when data is in wide format.
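
Two of the rules described in the arguments above, the long-format check and regex matching of mvCols, can be sketched in base R. This is only an illustration of the documented behavior, not the package's internal code; the data frame and its column names (area, sim_1, sim_2) are hypothetical.

```r
# Hypothetical wide-format data frame; all names and values are made up.
df <- data.frame(
  group = c("a", "a", "b"),
  area = c(10.2, 11.5, 9.8),
  sim_1 = c(3, 5, 2),
  sim_2 = c(7, 4, 6)
)

# Default column names from the Usage section
traitCol <- "trait"
labelCol <- "label"
valueCol <- "value"

# Long format is assumed only when all three columns are present.
is_long <- all(c(traitCol, labelCol, valueCol) %in% colnames(df))

# A single string given to mvCols is treated as a regex on the column names.
mvCols <- "sim_"
mv_col_names <- grep(mvCols, colnames(df), value = TRUE)

is_long
mv_col_names
```

Here the data frame has none of the trait, label, or value columns, so it would be treated as wide, and the pattern "sim_" picks out only the two histogram bin columns.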

Examples

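library(pcvr)

# Simulate wide-format multi value trait data from a uniform distribution
s1 <- mvSim(
  dists = list(runif = list(min = 15, max = 150)),
  n_samples = 10,
  counts = 1000,
  min_bin = 1,
  max_bin = 180,
  wide = TRUE
)
# Aggregate to 2 rows per group; "sim_" is a regex matching the trait columns
mv_ag(s1, group = "group", mvCols = "sim_", n_per_group = 2)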
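
To give a feel for what aggregating multi value traits to a fixed number of rows per group can look like, here is a minimal base R sketch on toy data. It is not the implementation used by mv_ag: the source does not say how rows are assigned to subsets within a group, so the round-robin assignment below is an assumption, as are the toy data and the subset_id column.

```r
# Toy wide-format data: 2 groups, 4 images each, 5 histogram bins.
# All names and values here are made up for illustration.
set.seed(1)
toy <- data.frame(
  group = rep(c("a", "b"), each = 4),
  matrix(rpois(8 * 5, lambda = 20), nrow = 8,
         dimnames = list(NULL, paste0("sim_", 1:5)))
)
mv_cols <- grep("sim_", colnames(toy), value = TRUE)
n_per_group <- 2

# Assumption: rows are assigned round-robin to n_per_group subsets per group.
toy$subset_id <- ave(
  seq_len(nrow(toy)), toy$group,
  FUN = function(i) rep_len(seq_len(n_per_group), length(i))
)

# Average each histogram bin within every group/subset combination.
agg <- aggregate(
  toy[, mv_cols],
  by = toy[, c("group", "subset_id")],
  FUN = mean
)
nrow(agg)  # 2 groups x 2 rows per group = 4
```

Eight input rows become four, so the pairwise comparison table that follows would hold 16 entries instead of 64.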
