phyloseq (version 1.16.2)

rarefy_even_depth: Resample an OTU table such that all samples have the same library size.

Description

Please note that the authors of phyloseq do not advocate using this as a normalization procedure, despite its recent popularity. Our justifications for using alternative approaches to address disparities in library sizes have been made available as an article in PLoS Computational Biology (http://dx.plos.org/10.1371/journal.pcbi.1003531). See phyloseq_to_deseq2 for a recommended alternative to rarefying that is directly supported in the phyloseq package, as well as the supplemental materials for the PLoS-CB article (http://joey711.github.io/waste-not-supplemental/) and the phyloseq extensions repository on GitHub (http://joey711.github.io/phyloseq-extensions).

Nevertheless, for comparison and demonstration, the rarefying procedure is implemented here in good faith and with options we hope are useful. This function uses the standard R sample function to resample from the abundance values in the otu_table component of the first argument, physeq. Often one of the major goals of this procedure is to achieve parity in the total number of counts between samples, as an alternative to other formal normalization procedures, which is why a single value for sample.size is expected. This kind of resampling can be performed with or without replacement; sampling with replacement is the more computationally efficient option and the default. See the replace parameter documentation for more details.

We recommend that you explicitly select a random number generator seed before invoking this function or, alternatively, that you explicitly provide a single positive integer argument as rngseed.
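
The idea of resampling every sample to one common depth can be sketched in base R on a plain count matrix. This is an illustration only, not phyloseq's implementation: the helper name resample_to_depth and the toy counts are hypothetical, and the matrix stands in for an otu_table with OTUs as rows.

```r
# Toy OTU-by-sample count matrix (made-up values; OTUs are rows).
otu <- matrix(
  c(40,  0, 10,
     1,  0,  0,
    59, 20,  5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("OTU1", "OTU2", "OTU3"), c("S1", "S2", "S3"))
)

# Hypothetical helper (not part of phyloseq): resample one sample's counts
# to a fixed depth, drawing OTUs with probability proportional to abundance.
resample_to_depth <- function(counts, depth) {
  draws <- sample(seq_along(counts), size = depth, replace = TRUE, prob = counts)
  tabulate(draws, nbins = length(counts))  # nbins keeps zero-count OTUs
}

set.seed(711)                 # fix the seed so the sketch is repeatable
depth <- min(colSums(otu))    # analogous to min(sample_sums(physeq))
rarefied <- apply(otu, 2, resample_to_depth, depth = depth)
rownames(rarefied) <- rownames(otu)

colSums(otu)       # unequal library sizes: 100, 20, 15
colSums(rarefied)  # every sample now sums to depth (15)
```

Using the smallest library size as the depth mirrors the default sample.size shown in Usage; any single value could be substituted.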

Usage

rarefy_even_depth(physeq, sample.size = min(sample_sums(physeq)),
  rngseed = FALSE, replace = TRUE, trimOTUs = TRUE, verbose = TRUE)

Arguments

physeq
(Required). A phyloseq-class object that you want to trim/filter.
sample.size
(Optional). A single integer value equal to the number of reads being simulated, also known as the depth, and also equal to each value returned by sample_sums on the output.
rngseed
(Optional). A single integer value (for example, 711) passed to set.seed, which is used to fix a seed for reproducible random number generation (in this case, reproducible random subsampling). The default shown in Usage is FALSE. If set to FALSE, then no fiddling with the RNG seed is performed, and it is up to the user to appropriately call set.seed beforehand to achieve reproducible results.
replace
(Optional). Logical. Whether to sample with replacement (TRUE) or without replacement (FALSE). The default is with replacement (replace=TRUE). Two implications to consider are that (1) sampling with replacement is faster and more memory efficient as currently implemented; and (2) sampling with replacement means that the number of reads for a given OTU in a given sample could be larger than the original count value, whereas with sampling without replacement the original count value is the maximum possible. Prior to phyloseq package version 1.5.20, this parameter did not exist and sampling with replacement was the only random subsampling implemented in the rarefy_even_depth function. Note that this default behavior was selected for computational efficiency, but differs from analogous functions in related packages (e.g. subsampling in QIIME).
trimOTUs
(Optional). logical(1). Whether to trim OTUs from the dataset that are no longer observed in any sample (have a count of zero in every sample). The number of OTUs trimmed, if any, is printed to standard out as a reminder.
verbose
(Optional). Logical. Default is TRUE. If TRUE, extra non-warning, non-error messages are printed to standard out, describing steps in the rarefying process, the OTUs and samples removed, etc. This can be useful the first few times the function is executed, but can be set to FALSE as needed once behavior has been verified as expected.
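
The two implications listed for the replace argument can be illustrated with a base-R sketch on a single sample. This is an assumption-laden illustration rather than the package's code: the counts are made up, and expanding counts into one entry per read is just one simple way to sample without replacement.

```r
# Made-up counts for a single sample (not from a real dataset).
counts <- c(OTU1 = 5, OTU2 = 3, OTU3 = 2)
depth <- 6
set.seed(711)

# With replacement: draw OTU labels with probability proportional to count.
# A resampled count may exceed the original count for that OTU.
with_repl <- tabulate(
  sample(seq_along(counts), size = depth, replace = TRUE, prob = counts),
  nbins = length(counts)
)

# Without replacement: expand counts into one entry per read, then draw reads.
# Each read can be picked at most once, so originals are an upper bound.
reads <- rep(seq_along(counts), times = counts)
without_repl <- tabulate(
  sample(reads, size = depth, replace = FALSE),
  nbins = length(counts)
)

# Both reach the requested depth; only the second is bounded by the originals.
sum(with_repl)
sum(without_repl)
all(without_repl <= counts)
```

In this sketch the without-replacement path builds a vector as long as the total read count, which hints at why such an approach can cost more time and memory on large libraries.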

Value

  • An object of class phyloseq. Only the otu_table component is modified.

Details

This approach is sometimes mistakenly called "rarefaction", which in physics refers to a form of wave decompression (http://en.wikipedia.org/wiki/Rarefaction); in this context, ecology, the term refers to a repeated sampling procedure to assess species richness (http://en.wikipedia.org/wiki/Rarefaction_(ecology)), first proposed in 1968 by Howard Sanders. In contrast, the procedure implemented here is used as an ad hoc means to normalize microbiome counts that have resulted from libraries of widely differing sizes. Here we have intentionally adopted an alternative name, rarefy, that has also been used recently to describe this process and, to our knowledge, has not previously been used in ecology.

Make sure to use set.seed for exactly-reproducible results of the random subsampling.
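
As a minimal base-R illustration of why the seed matters, the sketch below repeats the same random draw after resetting the seed. The counts are made up, and the seed value 711 is simply the one mentioned in this documentation.

```r
counts <- c(OTU1 = 50, OTU2 = 30, OTU3 = 20)  # made-up counts

# Same seed before each draw, so both draws are identical.
set.seed(711)
first <- sample(seq_along(counts), 25, replace = TRUE, prob = counts)
set.seed(711)
second <- sample(seq_along(counts), 25, replace = TRUE, prob = counts)

identical(first, second)  # TRUE
```

Without the second set.seed call, the two draws would generally differ, and so would any table built from them.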

See Also

sample

set.seed

Examples

# Test with esophagus dataset
library("phyloseq")
data("esophagus")
esorepT = rarefy_even_depth(esophagus, replace=TRUE)
esorepF = rarefy_even_depth(esophagus, replace=FALSE)
sample_sums(esophagus)
sample_sums(esorepT)
sample_sums(esorepF)
## Not run (run manually): too slow!
# data("GlobalPatterns")
# GPrepT = rarefy_even_depth(GlobalPatterns, 1E5, replace=TRUE)
## Actually just this one is slow
# system.time(GPrepF <- rarefy_even_depth(GlobalPatterns, 1E5, replace=FALSE))
