TCGA2STAT (version 1.2)

getTCGA: Get TCGA Data.

Description

Obtain TCGA data from the Broad GDAC Firehose and process the data into a format ready for statistical analysis.

Usage

getTCGA(disease = "GBM", data.type = "RNASeq2", type = "", filter = "Y", p = getOption("mc.cores", 2L), clinical = FALSE, cvars = "OS")

Arguments

disease
acronym for the cancer type; defaults to "GBM" for glioblastoma multiforme.
data.type
genomic data profiling platform; defaults to "RNASeq2" for gene-level RNA-Seq data from the second pipeline (RNASeqV2).
type
specific type of measurement produced by certain platforms; see Details for the accepted values.
filter
chromosome to be filtered out during data import; only applicable to CNA or CNV data.
p
maximum number of processing cores used in parallel processing; defaults to the value set in the "mc.cores" global option, or 2 if that option is unset.
clinical
logical value indicating whether clinical data are to be imported; defaults to FALSE.
cvars
clinical covariates to be merged with genomic data; defaults to "OS" for overall survival.
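
The default for p is written as getOption("mc.cores", 2L) in the Usage line. The sketch below uses only base R to show how that expression resolves; it does not call getTCGA, and the value 4L is an arbitrary choice for illustration.

```r
# Save the current setting and clear the option so the fallback is visible.
old <- options(mc.cores = NULL)

# With the option unset, getOption() returns the supplied default.
p_unset <- getOption("mc.cores", 2L)

# Once the option is set, its value takes precedence over the default.
options(mc.cores = 4L)
p_set <- getOption("mc.cores", 2L)

# Restore whatever was configured before running this sketch.
options(old)
```

Setting options(mc.cores = ...) once per session would therefore change the number of cores requested by every later call that relies on the default.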

Value

A list containing:
dat
a matrix of dimension gene x sample.
clinical
a matrix of dimension sample x clinical covariates; NULL if clinical=FALSE
merged.dat
a matrix of dimension sample x (cvars + gene), obtained by merging dat with the clinical covariates specified by cvars; NULL if clinical=FALSE or if cvars is not a valid name for a clinical covariate.
and for methylation data, an additional element:
cpgs
a matrix of dimension cpg sites x 3. The three columns are gene symbol, chromosome, and genomic coordinate for each CpG site. The order of CpG sites in this matrix is the same as the order in dat.
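
The orientations described above can be easy to mix up, so here is a minimal sketch built entirely by hand. The gene names, sample names, and values are made up, and the merge is done with base R only; it illustrates the documented dimensions and is not the package's own merging code. The merged.dat returned by getTCGA may contain additional columns.

```r
set.seed(1)
genes   <- c("GENE_A", "GENE_B", "GENE_C")   # hypothetical gene names
samples <- c("S1", "S2", "S3", "S4")         # hypothetical sample IDs

# dat: gene x sample
dat <- matrix(rnorm(12), nrow = 3, dimnames = list(genes, samples))

# clinical: sample x clinical covariates (a single made-up "OS" column here)
clinical <- matrix(c(120, 340, 95, 410), ncol = 1,
                   dimnames = list(samples, "OS"))

# merged: sample x (cvars + gene); dat is transposed so that rows are samples,
# and clinical rows are reordered to match the sample order in dat.
merged <- cbind(clinical[colnames(dat), "OS", drop = FALSE], t(dat))

dim(dat)
dim(clinical)
dim(merged)
```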

Details

Values for disease include "ACC", "BLCA", "BRCA", "CESC", "CHOL", "COAD", "COADREAD", "DLBC", "ESCA", "FPPP", "GBM", "GBMLGG", "HNSC", "KICH", "KIPAN", "KIRC", "KIRP", "LAML", "LGG", "LIHC", "LUAD", "LUSC", "MESO", "OV", "PAAD", "PCPG", "PRAD", "READ", "SARC", "SKCM", "STAD", "TGCT", "THCA", "THYM", "UCEC", "UCS", and "UVM". Values for data.type include "RNASeq2", "RNASeq", "miRNASeq", "CNA_SNP", "CNV_SNP", "CNA_CGH", "Methylation", "Mutation", "mRNA_Array", and "miRNA_Array". Note that not all combinations are permitted; Appendix A of the package vignette outlines all values of disease and data.type accommodated by TCGA2STAT.

The type parameter should only be used along with these data.type parameters:

  • RNASeq - "count" for raw read counts (default); "RPKM" for normalized read counts (reads per kilobase per million mapped reads).
  • miRNASeq - "count" for raw read counts (default); "rpmmm" for normalized read counts.
  • Mutation - "somatic" for non-silent somatic mutations (default); "all" for all mutations.
  • Methylation - "27K" platform (default); "450K" platform.
  • CNA_CGH - "415K" for CGH Custom Microarray 2x415K (default); "244A" for CGH Microarray.
  • mRNA_Array - "G450" for Agilent 244K Custom Gene Expression G4502A (default); "U133" for Affymetrix Human Genome U133A 2.0 Array; "Huex" for Affymetrix Human Exon 1.0 ST Array.
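
The pairings in the list above can be written down as a lookup table. The helper below, resolve_type, is hypothetical and not part of TCGA2STAT; it only checks a type value against that list before a call is made. Raising an error when type is supplied for any other data.type is a choice made for this sketch, not documented behaviour of getTCGA.

```r
# Accepted `type` values per `data.type`, transcribed from the list above.
# The first entry of each vector is the documented default.
type_options <- list(
  RNASeq      = c("count", "RPKM"),
  miRNASeq    = c("count", "rpmmm"),
  Mutation    = c("somatic", "all"),
  Methylation = c("27K", "450K"),
  CNA_CGH     = c("415K", "244A"),
  mRNA_Array  = c("G450", "U133", "Huex")
)

# Hypothetical helper: return the `type` to use for a given `data.type`.
resolve_type <- function(data.type, type = "") {
  opts <- type_options[[data.type]]
  if (is.null(opts)) {
    # This data.type has no `type` variants in the list above.
    if (nzchar(type)) stop("`type` is not used with data.type = ", data.type)
    return("")
  }
  if (!nzchar(type)) return(opts[1])   # fall back to the documented default
  if (!type %in% opts) {
    stop("`type` must be one of: ", paste(opts, collapse = ", "))
  }
  type
}

resolve_type("RNASeq")                # documented default for RNASeq
resolve_type("Methylation", "450K")   # explicit, valid choice
resolve_type("RNASeq2")               # no `type` variants listed
```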

The Level III RNA-Seq, miRNA-Seq, mRNA-array, and miRNA-array data imported are at gene level, but the mutation, copy number alteration/variation (CNA/CNV), and methylation data are not. Our package processes and aggregates the mutation and CNA/CNV data at the gene level. The mutation data imported are in MAF files, where each file contains the mutations found for a particular patient, and the number of mutations differs across patients. We filter the mutation data based on status and variant classification and then aggregate the filtered data at the gene level. The Level III CNA/CNV data imported are in segments; therefore we employ the CNTools package to merge the segmented data into gene-level data. The methylation data imported are at probe level, where each probe represents a CpG site. Because methylation profiles at different CpG sites within the same gene can vary considerably, it would not be biologically meaningful to aggregate the probe-level methylation data into gene-level data. We therefore return the methylation data at probe level.
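
The filter-then-aggregate step described for mutation data can be illustrated with a toy table. This is a sketch under stated assumptions, not the package's implementation: the records are invented, the column names follow common MAF field names, the filter (somatic status, non-silent classification) is one plausible reading of the text, and the 0/1 gene x sample output is an assumed encoding.

```r
# Toy mutation records; all values are made up for illustration.
maf <- data.frame(
  Hugo_Symbol            = c("TP53", "TP53", "EGFR", "BRCA1", "EGFR"),
  Tumor_Sample_Barcode   = c("S1", "S2", "S1", "S2", "S3"),
  Variant_Classification = c("Missense_Mutation", "Silent", "Nonsense_Mutation",
                             "Missense_Mutation", "Missense_Mutation"),
  Mutation_Status        = c("Somatic", "Somatic", "Somatic", "Germline", "Somatic"),
  stringsAsFactors = FALSE
)

# Step 1: filter on status and variant classification (assumed criteria).
keep <- maf$Mutation_Status == "Somatic" & maf$Variant_Classification != "Silent"
filtered <- maf[keep, ]

# Step 2: aggregate at gene level into a gene x sample indicator matrix.
genes   <- sort(unique(maf$Hugo_Symbol))
samples <- sort(unique(maf$Tumor_Sample_Barcode))
mut <- matrix(0L, nrow = length(genes), ncol = length(samples),
              dimnames = list(genes, samples))

# A two-column character matrix indexes cells by (row name, column name).
mut[cbind(filtered$Hugo_Symbol, filtered$Tumor_Sample_Barcode)] <- 1L

mut
```

Because every patient contributes a different number of records, fixing the gene and sample sets up front keeps the result rectangular even when a gene or sample has no record left after filtering.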

Examples

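library(TCGA2STAT)
# Gene-level RNA-Seq data from the second pipeline (RNASeqV2) for ovarian cancer
rsem.ov <- getTCGA(disease="OV", data.type="RNASeq2")
# RPKM-normalized RNA-Seq data from the first pipeline
rnaseq.ov <- getTCGA(disease="OV", data.type="RNASeq", type="RPKM")
# Same data, also importing clinical data and merging the default covariate (cvars="OS")
rnaseq_os.ov <- getTCGA(disease="OV", data.type="RNASeq", type="RPKM", clinical=TRUE)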