UPC_RNASeq: Universal exPression Codes (UPC) for RNA-Seq data

Description

This function is used to derive UPC values for RNA-Seq data. It requires at least one input file that specifies a read count for each genomic region (e.g., gene). This file should list a unique identifier for each region in the first column and corresponding read counts (not RPKM/FPKM values) in the second column.
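As a rough illustration of the layout described above, the following sketch writes a small read-count file with base R and reads it back. The gene identifiers, the counts, and the temporary file location are invented for illustration, and tab separation for the read-count file is an assumption (only the annotation file is explicitly described as tab-separated).

```r
# Hypothetical read counts for three genes (identifiers and values are invented).
counts <- data.frame(
  id    = c("GeneA", "GeneB", "GeneC"),
  count = c(120L, 0L, 4587L),
  stringsAsFactors = FALSE
)

# Write identifiers in column 1 and raw read counts in column 2, with no
# header row (consistent with the default numDataHeaderRows = 0).
# Tab separation is an assumption made for this sketch.
count_file <- file.path(tempdir(), "ReadCounts.txt")
write.table(counts, count_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Read the file back to confirm the two-column layout.
check <- read.delim(count_file, header = FALSE, stringsAsFactors = FALSE)
print(check)
```

A file written this way could then be passed as the inFilePattern argument.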

This function can also correct for the GC content and length of each genomic region. Users who wish to enable this correction must provide a separate annotation file. This tab-separated file should contain a row for each genomic region. The first column should contain a unique identifier that matches the identifiers in the read-count input file, the second column should give the length of the genomic region, and the third column should give the number of G or C bases in the region. The ParseMetaFromGtfFile function can be used to generate annotation files.
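To make the three-column annotation layout concrete, the sketch below builds such a file by hand with base R. The sequences and identifiers are invented, and in practice ParseMetaFromGtfFile would normally be used instead; this is only meant to show what each column holds.

```r
# Hypothetical sequences for three regions (invented for illustration).
seqs <- c(GeneA = "ATGCGCTA", GeneB = "GGGCCCAT", GeneC = "ATATATAT")

# Length of each region, and the number of G or C bases it contains
# (counted by deleting every character that is not G or C).
region_length <- nchar(seqs)
gc_count <- nchar(gsub("[^GCgc]", "", seqs))

# Column 1: identifier, column 2: region length, column 3: G/C base count.
annotation <- data.frame(
  id     = names(seqs),
  length = unname(region_length),
  gc     = unname(gc_count),
  stringsAsFactors = FALSE
)

# Write a tab-separated file with no header row (consistent with the
# default numAnnotationHeaderRows = 0).
annotation_file <- file.path(tempdir(), "Annotation.txt")
write.table(annotation, annotation_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
print(annotation)
```

The identifiers in the first column must match those used in the read-count file for the correction to line up region by region.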

Usage

UPC_RNASeq(inFilePattern, annotationFilePath = NA, outFilePath = NA, modelType = "nn", convThreshold = 0.01, ignoreZeroes = FALSE, numDataHeaderRows = 0, numAnnotationHeaderRows = 0, batchFilePath = NA, verbose = TRUE)

Arguments

inFilePattern

Absolute or relative path to the input file(s) to be processed. The input file(s) can contain one or more columns, where each column contains data for a given sample. To process multiple files, wildcard characters can be used (e.g., "*.txt"). Required.

annotationFilePath

Absolute or relative path where the annotation file is located. This parameter is optional.

outFilePath

Absolute or relative path where the output file will be saved. This parameter is optional.

modelType

Various models can be used for the mixture model to differentiate between active and inactive probes. The default is the normal-normal model ("nn"), which uses the normal distribution. Other available options are log-normal ("ln"), negative-binomial ("nb"), and normal-normal Bayes ("nn_bayes").

convThreshold

Convergence threshold that determines at what point the mixture-model parameters have stabilized. The default value should be suitable in most cases. However, if the model fails to converge (or converges too quickly), it may be useful to adjust this value. (This parameter is optional.)

ignoreZeroes

Whether to ignore read counts equal to zero when performing UPC calculations. Default is FALSE.

numDataHeaderRows

The number of header rows present in the input data file(s). If a header is present, the column names will be used as sample IDs.

numAnnotationHeaderRows

The number of header rows present in the annotation data file (if one has been specified).

batchFilePath

Absolute or relative path to a tab-separated text file that indicates batch (and optionally, covariate information) for each sample. Optional.

verbose

Whether to output more detailed status information as files are normalized. Default is TRUE.
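For the wildcard form of inFilePattern described above, it can help to preview which files a pattern matches before running the function. The sketch below does this with Sys.glob from base R. The directory and file names are invented, and whether UPC_RNASeq expands wildcards in exactly the same way as Sys.glob is an assumption.

```r
# Create a few empty files in a temporary directory (names are invented).
demo_dir <- file.path(tempdir(), "upc_glob_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("sample1.txt", "sample2.txt", "notes.csv")))

# List the files that the wildcard pattern "*.txt" matches in that directory.
# Assumption: this mirrors how a pattern passed as inFilePattern is expanded.
pattern <- file.path(demo_dir, "*.txt")
matched <- basename(Sys.glob(pattern))
print(matched)
```

Only the two .txt files are matched here; notes.csv is left out.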

Value

An ExpressionSet object that contains a row for each probeset/gene/transcript and a column for each input file.

References

Piccolo SR, Withers MR, Francis OE, Bild AH, and Johnson WE. Multi-platform single-sample estimates of transcriptional activation. Proceedings of the National Academy of Sciences of the United States of America, 2013, 110(44):17778-17783.

Examples

## Not run:
result = UPC_RNASeq("ReadCounts.txt", "Annotation.txt")
## End(Not run)
