This method uses simulated sequencing datasets to estimate the RDI values for datasets
with a known true deviation.
Briefly, a baseline probability vector (either randomly generated or supplied by the
baseVects parameter) is randomly perturbed, and the difference between the baseline
vector and the perturbed vector is calculated. Then, nSample sequencing datasets of
size n are randomly drawn from both the baseline vector and the perturbed vector, and
the RDI distance between all datasets is calculated. This process is repeated nIter
times, resulting in a dataset of RDI values and matched true differences.
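The following is a minimal sketch of one simulation iteration, written in base R and
not taken from the package source. The names baseline, perturb_sd, n, and nSample are
illustrative, and the Euclidean distance on log-scaled counts is only a stand-in for
the actual RDI calculation performed by calcRDI.

  # Conceptual sketch of one simulation iteration (illustrative only)
  set.seed(1)
  baseline <- c(0.4, 0.3, 0.2, 0.1)          # baseline probability vector

  # Randomly perturb the baseline on the log scale and renormalize
  perturb_sd <- 0.5
  perturbed  <- baseline * exp(rnorm(length(baseline), sd = perturb_sd))
  perturbed  <- perturbed / sum(perturbed)

  # True deviation between the baseline and perturbed vectors
  true_diff <- sqrt(sum((log2(perturbed) - log2(baseline))^2))

  # Draw nSample sequencing datasets of size n from each vector
  n <- 1000; nSample <- 5
  base_counts <- rmultinom(nSample, size = n, prob = baseline)
  pert_counts <- rmultinom(nSample, size = n, prob = perturbed)

  # Distance between each baseline/perturbed pair of samples
  # (a simple stand-in for the RDI distance)
  dists <- outer(seq_len(nSample), seq_len(nSample),
                 Vectorize(function(i, j) {
                   sqrt(sum((log2(base_counts[, i] + 1) -
                             log2(pert_counts[, j] + 1))^2))
                 }))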
A set of spline models is then fit to the data: one mapping RDI to true difference and
another mapping true difference to RDI, allowing for bi-directional conversion.
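To illustrate the idea of this bi-directional fit, the sketch below uses
stats::smooth.spline on hypothetical simulated pairs of RDI values and true
differences; the spline model actually fit by the package may differ.

  # Hypothetical pairs of true differences and simulated RDI values
  true_diffs <- runif(200, min = 0, max = 4)
  rdi_vals   <- 2 + 1.5 * true_diffs + rnorm(200, sd = 0.3)

  # One spline per direction, enabling conversion either way
  rdi_to_true <- smooth.spline(x = rdi_vals,   y = true_diffs)
  true_to_rdi <- smooth.spline(x = true_diffs, y = rdi_vals)

  # Convert an observed RDI value into an estimated true difference
  predict(rdi_to_true, x = 3.5)$y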
If a baseline probability vector is not provided, one will be generated from an
empirical model of gene segment prevalence. However, relying on this default is not
recommended for best performance. Estimates of true fold change are very sensitive to
the distribution of features in your count dataset, so it is important that your
baseline vector match your overall dataset as accurately as possible. The best
baseline vector is almost always the average feature prevalence across all repertoires
in a dataset, although manually generated baseline vectors may also work well.
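As a sketch of computing such a baseline, assume counts is a features-by-repertoires
count matrix (the object name and dimensions are illustrative, not part of the package
API): normalize each repertoire to proportions, then average across repertoires.

  # counts: features x repertoires matrix of segment counts (illustrative)
  counts <- matrix(rpois(4 * 3, lambda = 50), nrow = 4,
                   dimnames = list(paste0("V", 1:4), paste0("rep", 1:3)))

  # Normalize each repertoire to proportions, then average them
  props    <- sweep(counts, 2, colSums(counts), "/")
  baseline <- rowMeans(props)   # average feature prevalence across repertoires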
The units used for the RDI model should always match the units used to generate your
RDI values. For more details on units, refer to the details of calcRDI.