A sleuth is a group of kallistos. Borrowing this terminology, a 'sleuth' object stores a group of kallisto results, and can then operate on them while accounting for covariates, sequencing depth, and technical and biological variance.
sleuth_prep(sample_to_covariates, full_model = NULL, target_mapping = NULL,
aggregation_column = NULL, num_cores = max(1L, parallel::detectCores() -
1L), ...)
sample_to_covariates: a data.frame which contains a mapping from sample (a required column) to some set of experimental conditions or covariates. The column path is also required: a character vector where each element points to the corresponding kallisto output directory. The entries in the sample column should be in the same order as the corresponding entries in path.
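For instance, a minimal sample_to_covariates table might be built as below; the sample names, covariate values, and kallisto output directories are hypothetical placeholders, not values required by sleuth.

```r
# A hypothetical metadata table; replace the sample names, covariates,
# and kallisto output directories with your own.
s2c <- data.frame(
  sample   = c("s1", "s2", "s3", "s4"),
  genotype = c("WT", "WT", "KO", "KO"),
  drug     = c("ctrl", "treated", "ctrl", "treated"),
  path     = file.path("kallisto_out", c("s1", "s2", "s3", "s4")),
  stringsAsFactors = FALSE
)
```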
full_model: an R formula which explains the full model (design) of the experiment, OR a design matrix. It must be consistent with the data.frame supplied in sample_to_covariates. You can fit multiple covariates by joining them with '+' (see example).
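As a sketch, the formula and design-matrix forms of the same two-covariate model can be written as follows (assuming a metadata data.frame s2c with genotype and drug columns):

```r
# Formula interface: two covariates joined with '+'
full_model <- ~genotype + drug

# Equivalent design matrix built from the same metadata with base R
design <- model.matrix(~genotype + drug, data = s2c)
```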
target_mapping: a data.frame that has at least one column, 'target_id', and others that denote the mapping for each target. If it is not NULL, target_mapping is joined with many outputs where it might be useful. For example, you might have columns 'target_id', 'ensembl_gene', and 'entrez_gene' to denote different transcript-to-gene mappings. Note that sleuth_prep will treat all columns as having the 'character' data type.
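A target_mapping might look like the following sketch; all transcript and gene identifiers here are invented placeholders (in practice such a table is typically fetched from an annotation resource):

```r
# Hypothetical transcript-to-gene mapping; identifiers are placeholders.
t2g <- data.frame(
  target_id    = c("ENST0001", "ENST0002", "ENST0003"),
  ensembl_gene = c("ENSG0001", "ENSG0001", "ENSG0002"),
  entrez_gene  = c("111", "111", "222"),
  stringsAsFactors = FALSE
)
so <- sleuth_prep(s2c, ~genotype + drug, target_mapping = t2g)
```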
aggregation_column: a string naming the column in target_mapping used to aggregate targets (typically to summarize the data at the gene level). The aggregation is done using a p-value aggregation method when generating the results table. See sleuth_results for more information.
num_cores: an integer giving the number of computer cores mclapply should use to speed up sleuth preparation.
...: any of several other arguments that can be used as advanced options for sleuth preparation. See details.
Value: a sleuth object containing all kallisto samples, metadata, and summary statistics.
This method takes a list of samples with kallisto results and returns a sleuth object with the defined normalization of the data across samples (the default is the DESeq method; see norm_factors), and then the defined transformation of the data (the default is log(x + 0.5)). It also collects all of the bootstraps for the modeling done using sleuth_fit. This function takes several advanced options that can be used to customize your analysis. Here are the advanced options for sleuth_prep:
Extra arguments related to Bootstrap Summarizing:
extra_bootstrap_summary: if TRUE, compute extra summary statistics for estimated counts. This is not necessary for typical analyses; it is only needed for certain plots (e.g. plot_bootstrap). Default is FALSE.
read_bootstrap_tpm: read and compute summary statistics on bootstraps of the TPM. This is not necessary for typical analyses; it is only needed for some plots (e.g. plot_bootstrap) and if TPM values are used for sleuth_fit. Default is FALSE.
max_bootstrap: the maximum number of bootstrap values to read for each transcript. Setting this lower than the total number of bootstraps available will save some time, but will likely decrease the accuracy of the estimation of the inferential noise.
Advanced Options for Filtering:
filter_fun: the function to use when filtering. This function will be applied to the raw counts on a row-wise basis, meaning that each feature will be considered individually. The default is to filter out any features that do not have at least 5 estimated counts in at least 47% of the samples (see basic_filter for more information). If the preferred filtering method requires a matrix-wide transformation or otherwise needs to consider multiple features simultaneously instead of independently, please consider using filter_target_id below.
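As a sketch, a stricter row-wise filter could be supplied like this; the threshold values below are arbitrary examples, not recommendations:

```r
# Require at least 10 estimated counts in at least half of the samples.
# 'row' is the vector of raw estimated counts for a single feature.
strict_filter <- function(row, ...) {
  mean(row >= 10) >= 0.5
}
so <- sleuth_prep(s2c, ~genotype + drug, filter_fun = strict_filter)
```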
filter_target_id: a character vector of target_ids to filter using methods that can't be implemented using filter_fun. If non-NULL, this will override filter_fun.
Advanced Options for the Normalization Step: (NOTE: Be sure you know what you're doing before you use these options)
normalize: boolean for whether normalization and other steps should be performed. If this is set to FALSE, bootstraps will not be read and transformation of the data will not be done. This should only be set to FALSE if one desires to do a quick check of the raw data. The default is TRUE.
norm_fun_counts: a function to perform between-sample normalization on the estimated counts. The default is the DESeq method. See norm_factors for details.
norm_fun_tpm: a function to perform between-sample normalization on the TPM. The default is the DESeq method. See norm_factors for details.
Advanced Options for the Transformation Step: (NOTE: Be sure you know what you're doing before you use these options)
transform_fun_counts: the transformation that should be applied to the normalized counts. Default is 'log(x + 0.5)' (i.e. natural log with a 0.5 offset).
transform_fun_tpm: the transformation that should be applied to the TPM values. Default is 'x' (i.e. the identity function / no transformation).
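For example, to model on a log2 scale instead of the natural-log default, one might pass a custom transform; this is a sketch under the assumption that a base-2 scale is wanted, not a recommendation:

```r
# Use log2 with the same 0.5 offset instead of the natural-log default,
# e.g. so that downstream effect sizes are on the log2 scale.
so <- sleuth_prep(s2c, ~genotype + drug,
                  transform_fun_counts = function(x) log2(x + 0.5))
```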
Advanced Options for Gene Aggregation:
gene_mode: set this to TRUE to get the old counts-aggregation method for doing gene-level analysis. This requires aggregation_column to be set. If TRUE, this will override the p-value aggregation mode, but will allow for gene-centric modeling, plotting, and results.
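A gene-level run using this counts-aggregation method might look like the sketch below, assuming a hypothetical transcript-to-gene mapping t2g that contains an 'ensembl_gene' column:

```r
# Gene-centric preparation; t2g and its 'ensembl_gene' column are
# illustrative assumptions, not fixed names required by sleuth.
so <- sleuth_prep(s2c, ~genotype + drug,
                  target_mapping = t2g,
                  aggregation_column = "ensembl_gene",
                  gene_mode = TRUE)
```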
See also: sleuth_fit to fit a model, and sleuth_wt or sleuth_lrt to perform hypothesis testing.
# NOT RUN {
# Assume we have run kallisto on a set of samples, and have two treatments,
# genotype and drug.
colnames(s2c)
# [1] "sample" "genotype" "drug" "path"
so <- sleuth_prep(s2c, ~genotype + drug)
# }