align(index,readfile1,readfile2=NULL,type="rna",input_format="gzFASTQ",output_format="BAM",
output_file=paste(readfile1,"subread",output_format,sep="."),nsubreads=10,
TH1=3,TH2=1,maxMismatches=3,nthreads=1,indels=5,complexIndels=FALSE,phredOffset=33,
unique=TRUE,nBestLocations=1,minFragLength=50,maxFragLength=600,PE_orientation="fr",
nTrim5=0,nTrim3=0,readGroupID=NULL,readGroup=NULL,color2base=FALSE,
DP_GapOpenPenalty=-1,DP_GapExtPenalty=0,DP_MismatchPenalty=0,DP_MatchScore=2,
detectSV=FALSE)
subjunc(index,readfile1,readfile2=NULL,input_format="gzFASTQ",output_format="BAM",
output_file=paste(readfile1,"subjunc",output_format,sep="."),nsubreads=14,
TH1=1,TH2=1,maxMismatches=3,nthreads=1,indels=5,complexIndels=FALSE,phredOffset=33,
unique=TRUE,nBestLocations=1,minFragLength=50,maxFragLength=600,PE_orientation="fr",
nTrim5=0,nTrim3=0,readGroupID=NULL,readGroup=NULL,color2base=FALSE,
reportAllJunctions=FALSE)
input_format
option for more supported formats.readfile2
should be in the same order as their mate files included in readfile1
.NULL
by default.rna
(or 0
; RNA-seq data) and dna
(or 1
; genomic DNA-seq data such as WGS, WES, ChIP-seq data etc.). Character strings are case insensitive.gzFASTQ
by default (this also includes FASTQ, FASTA and gzipped FASTA formats). Other supported formats include SAM
and BAM
. Character values are case insensitive.BAM
by default. Acceptable formats include SAM
and BAM
.readfile1
added with an suffix string.3
by default. Mis-matches found in soft-clipped bases are not counted.1
by default.5
by default.TRUE
, the program will try to detect multiple short indels that occurs concurrently in a small genomic region (indels could be as close as 1bp apart).33
and 64
. By default, 33
is used.TRUE
by default. Uniquely mapped reads must have one (1) mapping location that has less mis-matched bases than other candidate locations.unique
argument takes precedence over nBestLocations
argument.fr
, ff
and rf
. Letter f
denotes the forward strand and letter r
the reverse strand. fr
by default (ie. the first read in the pair is on the forward strand and the second read on the reverse strand).0
by default.0
by default.NULL
by default.tag:value
. This string will be added to the read group (RG) header in the mapping output. NULL
by default.TRUE
, color-space read bases will be converted to base-space bases in the mapping output. Note that the mapping itself will still be performed at color-space. FALSE
by default.-1
by default.0
by default.0
by default.2
by default.Subjunc
, BED files including discovered exon-exon junctions are also written to the current working directory.
align
function implements the Subread
aligner (Liao et al., 2013) that uses a new mapping paradigm called ``seed-and-vote".
Subread
is general-purpose aligner that can be used to align both genomic DNA-seq reads and RNA-seq reads.Subjunc
is designed for mapping RNA-seq reads.
The major difference between Subjunc
and Subread
is that Subjunc
reports discovered exon-exon junctions and it also performs full alignments for every read including exon-spanning reads.
Subread
uses the largest mappable regions in the reads to find their mapping locations.
The seed-and-vote paradigm has been found to be not only more accurate than the conventional seed-and-extend (adopted by most aligners) in read mapping, but it is a lot more efficient (Liao et al., 2013).
Both Subread
and Subjunc
can be used to align reads generated from major sequencing platforms including Illumina GA/HiSeq, ABI SOLiD, Roche 454 and Ion Torrent sequencers.
Note that to map color-space reads (e.g. SOLiD reads), a color-space index should be built for the reference genome (see buildindex
for details).
Subread
and Subjunc
have adjustable memory usage.
See buildindex
function for more details.
Mapping performance is largely determined by the number of subreads extracted from each read nsubreads
and the consensus threshold TH1
(also TH2
for paired-end read data).
Default settings are recommended for most of the read mapping tasks.
Subjunc
requires donor/receptor sites to be present when detecting exon-exon junctions.
It can detect up to four junction locations in each exon-spanning read.
detectSV
option should be used for SV detection in genomic DNA sequencing data.
For RNA-seq data, users may use subjunc
with the reportAllJunctions
option to detect SVs (and also junctions).
For each library, breakpoints detected from SV events will be saved to a file with suffix name `.breakpoints.txt', which includes chromosomal coordinates of SV breakpoints and numbers of supporting reads.
The BAM/SAM output includes extra fields to describe the complete alignments of breakpoint-containing reads.
For a breakpoint-containing read, mapping of its major sequence segment is described in the main fields of BAM/SAM ouptut whereas mapping of its minor sequence segment, which does not map along with the major segment due to the presence of a breakpoint, is described in the extra fields including `CC'(Chr), `CP'(Position),`CG'(CIGAR) and `CT'(strand).
Note that each breakpoint-containing read occupies only one row in BAM/SAM output.
# Build an index for the artificial sequence included in file 'reference.fa'.
library(Rsubread)
ref <- system.file("extdata","reference.fa",package="Rsubread")
buildindex(basename="./reference_index",reference=ref)
# align a sample read dataset ('reads.txt') to the sample reference
reads <- system.file("extdata","reads.txt.gz",package="Rsubread")
align(index="./reference_index",readfile1=reads,output_file="./Rsubread_alignment.BAM",phredOffset=64)
Run the code above in your browser using DataLab