rangeBasedAccessors: Extract genomic features from an object

Description

Generic functions to extract genomic features from an object. This page documents the methods for OrganismDb objects only.

Usage

"transcripts"(x, columns=c("TXID", "TXNAME"), filter=NULL)
"exons"(x, columns="EXONID", filter=NULL)
"cds"(x, columns="CDSID", filter=NULL)
"genes"(x, columns="GENEID", filter=NULL)
"transcriptsBy"(x, by, columns, use.names=FALSE, outerMcols=FALSE)
"exonsBy"(x, by, columns, use.names=FALSE, outerMcols=FALSE)
"cdsBy"(x, by, columns, use.names=FALSE, outerMcols=FALSE)
"getTxDbIfAvailable"(x, ...)

"asBED"(x)
"asGFF"(x)
"disjointExons"(x, aggregateGenes=FALSE, includeTranscripts=TRUE, ...)
"microRNAs"(x)
"tRNAs"(x)
"promoters"(x, upstream=2000, downstream=200, ...)
"distance"(x, y, ignore.strand=FALSE, ..., id, type=c("gene", "tx", "exon", "cds"))
"extractTranscriptSeqs"(x, transcripts, strand = "+")
"extractUpstreamSeqs"(x, genes, width=1000, exclude.seqlevels=NULL)
"intronsByTranscript"(x, use.names=FALSE)
"fiveUTRsByTranscript"(x, use.names=FALSE)
"threeUTRsByTranscript"(x, use.names=FALSE)
"isActiveSeq"(x)

Arguments

x

A MultiDb object, except for the extractTranscriptSeqs method, where x is a BSgenome object and the second argument is a MultiDb object.

...

Arguments to be passed to or from methods.

by

One of "gene", "exon", "cds" or "tx". Determines the grouping.

columns

The columns or kinds of metadata that can be retrieved from the database. All possible columns are returned by using the columns method.

filter

Either NULL or a named list of vectors to be used to restrict the output. Valid names for this list are: "gene_id", "tx_id", "tx_name", "tx_chrom", "tx_strand", "exon_id", "exon_name", "exon_chrom", "exon_strand", "cds_id", "cds_name", "cds_chrom", "cds_strand" and "exon_rank".
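As a minimal sketch of the filter argument, the call below restricts the result to a single chromosome. It assumes the Homo.sapiens package used in the Examples section and UCSC-style sequence names such as "chr1"; the variable name chr1_tx is illustrative only.

```r
library(Homo.sapiens)

## Keep only transcripts located on chromosome 1.
## "chr1" assumes UCSC-style sequence names in the underlying TxDb.
chr1_tx <- transcripts(Homo.sapiens,
                       columns = c("TXNAME", "SYMBOL"),
                       filter = list(tx_chrom = "chr1"))

## Tabulate the sequence names of the returned ranges.
table(as.character(seqnames(chr1_tx)))
```

Several names can be combined in the same list, for example list(tx_chrom="chr1", tx_strand="+"), to narrow the result further.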

use.names

Controls how to set the names of the returned GRangesList object. These functions return all the features of a given type (e.g. all the exons) grouped by another feature type (e.g. grouped by transcript) in a GRangesList object. By default (i.e. if use.names is FALSE), the names of this GRangesList object (aka the group names) are the internal ids of the features used for grouping (aka the grouping features), which are guaranteed to be unique. If use.names is TRUE, then the names of the grouping features are used instead of their internal ids. For example, when grouping by transcript (by="tx"), the default group names are the transcript internal ids ("tx_id"). But, if use.names=TRUE, the group names are the transcript names ("tx_name"). Note that, unlike the feature ids, the feature names are not guaranteed to be unique or even defined (they could be all NAs). A warning is issued when this happens. See ?id2name for more information about feature internal ids and feature external names and how to map the former to the latter.

Finally, use.names=TRUE cannot be used when grouping by gene (by="gene"). This is because, unlike for the other features, the gene ids are external ids (e.g. Entrez Gene or Ensembl ids), so the db doesn't have a "gene_name" column for storing alternate gene names.
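The effect of use.names can be seen by comparing the group names of two otherwise identical calls. This is a sketch that assumes the Homo.sapiens package from the Examples section; the variable names by_id and by_name are illustrative only.

```r
library(Homo.sapiens)

## Default: groups are named by internal transcript ids ("tx_id").
by_id <- exonsBy(Homo.sapiens, by = "tx")

## use.names=TRUE: groups are named by transcript names ("tx_name").
## A warning is expected if those names are missing or not unique.
by_name <- exonsBy(Homo.sapiens, by = "tx", use.names = TRUE)

head(names(by_id))
head(names(by_name))
```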

upstream

For promoters: An integer(1) value indicating the number of bases upstream from the transcription start site. For additional details see ?`promoters,GRanges-method`.

downstream

For promoters: An integer(1) value indicating the number of bases downstream from the transcription start site. For additional details see ?`promoters,GRanges-method`.
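A short sketch of overriding the default promoter window, assuming the Homo.sapiens package from the Examples section. The values 500 and 100 and the variable name narrow are arbitrary illustrations, not recommended settings.

```r
library(Homo.sapiens)

## Promoter windows of 500 bases upstream and 100 bases downstream
## of each transcription start site (the defaults are 2000 and 200).
narrow <- promoters(Homo.sapiens, upstream = 500, downstream = 100)

## Each range should span upstream + downstream bases.
table(width(narrow))
```

Ranges near the ends of a chromosome may extend past its boundaries, in which case a warning may be issued.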

aggregateGenes

For disjointExons: A logical. When FALSE (default), exon fragments that overlap multiple genes are dropped. When TRUE, all fragments are kept and the gene_id metadata column includes all gene ids that overlap the exon fragment.

includeTranscripts

For disjointExons: A logical. When TRUE (default), a tx_name metadata column is included that lists all transcript names that overlap the exon fragment.

y

For distance, a MultiDb instance. The id is used to extract ranges from the MultiDb, which are then used to compute the distance from x.

id

A character vector the same length as x. The values of id must be identifiers in the MultiDb object. type indicates what type of identifier id is.

type

A character(1) describing the id. Must be one of "gene", "tx", "exon" or "cds".
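The following sketch shows how x, y, id and type fit together in a distance call. It assumes the Homo.sapiens package from the Examples section. The gene ids are taken from the underlying TxDb so that they are known to exist, and the query ranges are hypothetical windows constructed purely for illustration.

```r
library(Homo.sapiens)

## Pick two gene ids that are present in the database.
txdb <- getTxDbIfAvailable(Homo.sapiens)
gn   <- genes(txdb)[1:2]
ids  <- names(gn)

## Hypothetical query ranges: 100-base windows placed past the end
## coordinate of each gene, on the same chromosome and strand.
query <- GRanges(seqnames(gn),
                 IRanges(end(gn) + 1001L, width = 100L),
                 strand(gn))

## id must be a character vector with one entry per range in the query.
d <- distance(query, Homo.sapiens, id = ids, type = "gene")
d
```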

ignore.strand

A logical indicating if the strand of the ranges should be ignored. When TRUE, strand is set to '+'.

outerMcols

A logical indicating if the 'outer' mcols (metadata columns) should be populated for those range-based accessors that return a GRangesList object. By default this is FALSE. If TRUE, the outer list object will also have its metadata columns (mcols) populated, in addition to the mcols of the 'inner' GRanges objects.
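As a sketch of the difference, the outer metadata columns of the two results below can be compared directly. It assumes the Homo.sapiens package from the Examples section; the variable names inner_only and with_outer are illustrative only.

```r
library(Homo.sapiens)

cols <- c("TXNAME", "SYMBOL")

## Default: only the inner GRanges objects carry the requested columns.
inner_only <- transcriptsBy(Homo.sapiens, by = "gene", columns = cols)

## outerMcols=TRUE: the GRangesList itself also gets metadata columns.
with_outer <- transcriptsBy(Homo.sapiens, by = "gene", columns = cols,
                            outerMcols = TRUE)

## Outer (per-group) metadata columns of each result.
mcols(inner_only)
mcols(with_outer)

## Inner (per-range) metadata columns of the first group.
mcols(with_outer[[1]])
```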

transcripts

An object representing the exon ranges of each transcript to extract. It must be a GRangesList or MultiDb object, while x is a BSgenome object. Internally, it is turned into a GRangesList object with exonsBy(transcripts, by="tx", use.names=TRUE).

strand

Only supported when x is a DNAString object.

Can be an atomic vector, a factor, or an Rle object, in which case it indicates the strand of each transcript (i.e. all the exons in a transcript are considered to be on the same strand). More precisely: it's turned into a factor (or factor-Rle) that has the "standard strand levels" (this is done by calling the strand function on it). Then it's recycled to the length of RangesList object transcripts if needed. In the resulting object, the i-th element is interpreted as the strand of all the exons in the i-th transcript.

strand can also be a list-like object, in which case it indicates the strand of each exon, individually. Thus it must have the same shape as RangesList object transcripts (i.e. same length plus strand[[i]] must have the same length as transcripts[[i]] for all i).

strand can only contain "+" and/or "-" values. "*" is not allowed.

genes

An object containing the locations (i.e. chromosome name, start, end, and strand) of the genes or transcripts with respect to the reference genome. Only GenomicRanges and MultiDb objects are supported at the moment. If the latter, the gene locations are obtained by calling the genes function on the MultiDb object internally.

width

How many bases to extract upstream of each TSS (transcription start site).

exclude.seqlevels

A character vector containing the chromosome names (a.k.a. sequence levels) to exclude when the genes are obtained from a MultiDb object.

Value

Details

These are the range based functions for extracting transcript information from a MultiDb object.

Examples


## extracting all transcripts from Homo.sapiens with some extra metadata
library(Homo.sapiens)
cols <- c("TXNAME","SYMBOL")
res <- transcripts(Homo.sapiens, columns=cols)

## extracting all transcripts from Homo.sapiens, grouped by gene and
## with extra metadata
res <- transcriptsBy(Homo.sapiens, by="gene", columns=cols)

## list possible values for columns argument:
columns(Homo.sapiens)

## Get the TxDb from a MultiDb object (if it's available)
getTxDbIfAvailable(Homo.sapiens)

## Other functions listed above should work in a way similar to their TxDb
## counterparts. So for example:
promoters(Homo.sapiens)
## Should give the same value as:
promoters(getTxDbIfAvailable(Homo.sapiens))
