Import files containing aligned reads into an internal representation of the alignments, sequences, and quality scores. Most methods (see details for exceptions) read all files into a single R object.
readAligned(dirPath, pattern=character(0), ...)
pattern: The (grep-style) pattern describing file names to be read. The default (character(0)) results in (attempted) input of all files in the directory.
...: Additional arguments used by methods. When dirPath is a character vector, the argument type must be provided. Possible values for type and their meaning are described below. Most methods implement filter=srFilter(), allowing objects of class SRFilter to selectively return aligned reads.
Returns a single R object (e.g., AlignedRead) containing alignments, sequences and qualities of all files in dirPath matching pattern. The order in which files are read is not guaranteed.
There is no standard aligned read file format; methods parse particular file types.
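Because pattern is a grep-style regular expression rather than a shell glob, it can help to preview which files it selects before reading them. The following is a minimal base R sketch, assuming file selection behaves like list.files(); the scratch directory and file names are hypothetical.

```r
# Create a scratch directory holding hypothetical (empty) alignment files.
dir <- tempfile("aligned_")
dir.create(dir)
fls <- c("s_1_export.txt", "s_2_export.txt", "s_5_0001_realign.txt")
invisible(file.create(file.path(dir, fls)))

# Regular expression, not a glob: "." matches any character, "\\." a literal dot.
export_files <- list.files(dir, pattern = "s_[0-9]+_export\\.txt$")
realign_files <- list.files(dir, pattern = "s_5_.*_realign.txt")

# No pattern: every file in the directory is a candidate for input.
all_files <- list.files(dir)
```

Checking the matched names first guards against a pattern that silently matches more files than intended.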
The readAligned,character-method interprets file types based on an additional type argument. Supported types are:
type="SolexaExport"
This type parses .*_export.txt files following the documentation in the Solexa Genome Alignment software manual, version 0.3.0. These files consist of the following columns; consult Solexa documentation for precise descriptions. If parsed, values can be retrieved from AlignedRead as follows:
alignData
alignData
alignData
alignData
alignData
sread
quality
chromosome
alignData
position
strand
alignQuality
alignData
The following optional arguments, set to FALSE by default, influence data input:
withMultiplexIndex: When TRUE, include the multiplex index as a column multiplexIndex in alignData.
withPairedReadNumber: When TRUE, include the paired read number as a column pairedReadNumber in alignData.
withId: When TRUE, construct an identifier string as Machine_Run:Lane:Tile:X:Y#multiplexIndex/pairedReadNumber. The substrings #multiplexIndex and /pairedReadNumber are not present if withMultiplexIndex=FALSE or withPairedReadNumber=FALSE, respectively.
withAll: When TRUE, sets all with* values to TRUE.
Note that not all paired read columns are interpreted. Different interfaces to reading alignment files are described in SolexaPath and SolexaSet.
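The identifier layout Machine_Run:Lane:Tile:X:Y#multiplexIndex/pairedReadNumber described for SolexaExport files can be illustrated in base R. This is only a sketch of the string format: make_id is a hypothetical helper, not a ShortRead function, and the field values are invented.

```r
# Hypothetical helper: assemble an identifier of the form
# Machine_Run:Lane:Tile:X:Y#multiplexIndex/pairedReadNumber.
# The "#..." and "/..." suffixes are appended only when supplied.
make_id <- function(machine, run, lane, tile, x, y,
                    multiplexIndex = NULL, pairedReadNumber = NULL) {
    id <- sprintf("%s_%s:%s:%s:%s:%s", machine, run, lane, tile, x, y)
    if (!is.null(multiplexIndex))
        id <- paste0(id, "#", multiplexIndex)
    if (!is.null(pairedReadNumber))
        id <- paste0(id, "/", pairedReadNumber)
    id
}

# Invented example values.
id_full <- make_id("HWI-EAS88", 3, 2, 1, 451, 945,
                   multiplexIndex = "ATCACG", pairedReadNumber = 1)
id_bare <- make_id("HWI-EAS88", 3, 2, 1, 451, 945)
```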
type="SolexaPrealign"
type="SolexaAlign"
type="SolexaRealign"
These types parse s_L_TTTT_prealign.txt, s_L_TTTT_align.txt or s_L_TTTT_realign.txt files produced by default and eland analyses. From the Solexa documentation, align corresponds to unfiltered first-pass alignments, prealign adjusts alignments for error rates (when available), and realign filters alignments to exclude clusters failing to pass quality criteria.
Because base quality scores are not stored with alignments, the object returned by readAligned scores all base qualities as -32.
If parsed, values can be retrieved from AlignedRead as follows:
sread
alignQuality
alignData
position
strand
readXStringColumns
alignData
type="SolexaResult"
This parses s_L_eland_results.txt files, an intermediate format that does not contain read or alignment quality scores.
Because base quality scores are not stored with alignments, the object returned by readAligned scores all base qualities as -32.
Columns of this file type can be retrieved from AlignedRead as follows (description of columns is from Table 19, Genome Analyzer Pipeline Software User Guide, Revision A, January 2008):
sread
alignData as matchCode. Codes are (from the Eland manual): NM (no match); QC (no match due to quality control failure); RM (no match due to repeat masking); U0 (best match was unique and exact); U1 (best match was unique, with 1 mismatch); U2 (best match was unique, with 2 mismatches); R0 (multiple exact matches found); R1 (multiple 1 mismatch matches found, no exact matches); R2 (multiple 2 mismatch matches found, no exact or 1-mismatch matches).
alignData as nExactMatch
alignData as nOneMismatch
alignData as nTwoMismatch
chromosome
position
strand
alignData, as NCharacterTreatment. "." indicates treatment of "N" was not applicable; "D" indicates treatment as deletion; "|" indicates treatment as insertion.
alignData as mismatchDetailOne and mismatchDetailTwo. Present only for unique inexact matches at one or two positions. Position and type of first substitution error, e.g., 11A represents 11 matches with 12th base an A in reference but not read. The reference manual cited below lists only one field (mismatchDetailOne), but two are present in files seen in the wild.
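The Eland match codes listed above lend themselves to a simple lookup. The base R sketch below takes the descriptions from that list; the codes vector is invented and stands in for a matchCode column extracted from alignData.

```r
# Lookup table built from the match codes documented above.
match_codes <- c(
    NM = "no match",
    QC = "no match due to quality control failure",
    RM = "no match due to repeat masking",
    U0 = "best match was unique and exact",
    U1 = "best match was unique, with 1 mismatch",
    U2 = "best match was unique, with 2 mismatches",
    R0 = "multiple exact matches found",
    R1 = "multiple 1 mismatch matches found, no exact matches",
    R2 = "multiple 2 mismatch matches found, no exact or 1-mismatch matches"
)

# Invented values standing in for a matchCode column.
codes <- c("U0", "R1", "NM", "U2")

descriptions <- unname(match_codes[codes])

# Unique hits start with "U"; the digit gives the mismatch count of the best hit.
is_unique <- grepl("^U", codes)
n_mismatch <- rep(NA_integer_, length(codes))
hit <- grepl("^[UR][0-2]$", codes)
n_mismatch[hit] <- as.integer(substr(codes[hit], 2, 2))
```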
type="MAQMap", records=-1L
Parses map files produced by MAQ. See details in the next section. The records option determines how many records are read; -1L (the default) means that all records are input. For type="MAQMap", dirPath and pattern must match a single file.
type="MAQMapShort", records=-1L
As type="MAQMap", but for map files made with Maq prior to version 0.7.0. (These files use a different maximum read length [64 instead of 128], and are hence incompatible with newer Maq map files.) For type="MAQMapShort", dirPath and pattern must match a single file.
type="MAQMapview"
Parse alignment files created by MAQ's mapview command. Interpretation of columns is based on the description in the MAQ manual, specifically:
...each line consists of read name, chromosome, position, strand, insert size from the outer coordinates of a pair, paired flag, mapping quality, single-end mapping quality, alternative mapping quality, number of mismatches of the best hit, sum of qualities of mismatched bases of the best hit, number of 0-mismatch hits of the first 24bp, number of 1-mismatch hits of the first 24bp on the reference, length of the read, read sequence and its quality.
The read name, read sequence, and quality are read as XStringSet objects. Chromosome and strand are read as factors. Position and mapping quality are read as numeric. These fields are mapped to their corresponding representation in AlignedRead objects.
Number of mismatches of the best hit, sum of qualities of mismatched bases of the best hit, number of 0-mismatch hits of the first 24bp, and number of 1-mismatch hits of the first 24bp are represented in the AlignedRead object as components of alignData.
Remaining fields are currently ignored.
type="Bowtie"
Parse alignment files created with the Bowtie alignment algorithm. Parsed columns can be retrieved from AlignedRead as follows:
id
strand
chromosome
position; see comment below
sread; see comment below
quality; see comments below
alignData, similar column; Bowtie v. 0.9.9.3 (12 May, 2009) documents this as the number of other instances where the same read aligns against the same reference characters as were aligned against in this alignment. Previous versions marked this as Reserved.
alignData, mismatch column
NOTE: the default quality encoding changes to FastqQuality with ShortRead version 1.3.24.
This method includes the argument qualityType to specify how quality scores are encoded. Bowtie quality scores are Phred-like by default, with qualityType='FastqQuality', but can be specified as Solexa-like, with qualityType='SFastqQuality'.
Bowtie outputs positions that are 0-offset from the left-most end of the + strand. ShortRead parses position information to be 1-offset from the left-most end of the + strand.
Bowtie outputs reads aligned to the - strand as their reverse complement, and reverses the quality score string of these reads. ShortRead parses these to their original sequence and orientation.
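The two Bowtie adjustments described above (shifting positions from 0-offset to 1-offset, and undoing the reverse complement of reads aligned to the - strand) can be illustrated in base R. This sketch is not how ShortRead implements them; the record values and the reverse_string helper are hypothetical.

```r
# Hypothetical Bowtie-style record for a read aligned to the "-" strand.
bowtie_pos <- 199L        # 0-offset position as written by Bowtie
bowtie_seq <- "TTGCA"     # reverse complement of the original read
bowtie_qual <- "IIHG#"    # quality string, reversed to match

# 0-offset to 1-offset.
position <- bowtie_pos + 1L

# Hypothetical helper: reverse a single string character by character.
reverse_string <- function(x) paste(rev(strsplit(x, "")[[1]]), collapse = "")

# Undo the reverse complement: complement each base, then reverse.
read <- reverse_string(chartr("ACGT", "TGCA", bowtie_seq))

# Qualities only need reversing so they line up with the restored read.
qual <- reverse_string(bowtie_qual)
```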
type="SOAP"
Parse alignment files created with the SOAP alignment algorithm. Parsed columns can be retrieved from AlignedRead as follows:
id
sread; see comment below
quality; see comment below
alignData
alignData (pairedEnd)
alignData (alignedLength)
strand
chromosome
position; see comment below
alignData (typeOfHit: integer portion; hitDetail: text portion)
This method includes the argument qualityType to specify how quality scores are encoded. It is unclear from SOAP documentation what the quality score is; the default is Solexa-like, with qualityType='SFastqQuality', but can be specified as Phred-like, with qualityType='FastqQuality'.
SOAP outputs positions that are 1-offset from the left-most end of the + strand. ShortRead preserves this representation.
SOAP reads aligned to the - strand are reported by SOAP as their reverse complement, with the quality string of these reads reversed. ShortRead parses these to their original sequence and orientation.
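The choice of qualityType matters because the same quality string decodes to different numeric scores under each encoding. The base R sketch below assumes that the Phred-like encoding uses an ASCII offset of 33 and the Solexa-like encoding an offset of 64; these offsets, the decode_quality helper and the quality string are assumptions for illustration, not taken from the text above.

```r
# Hypothetical helper: convert a quality string to integer scores by
# subtracting an ASCII offset from each character code.
decode_quality <- function(x, offset) utf8ToInt(x) - offset

qual <- "hhhhTTB"   # invented quality string

# Assumed offsets: 33 for Phred-like, 64 for Solexa-like encodings.
phred_like <- decode_quality(qual, 33L)
solexa_like <- decode_quality(qual, 64L)
```

Under these assumptions the two interpretations differ by a constant 31 per base, so a wrong qualityType shifts every score rather than raising an error.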
The AlignedRead class.
Genome Analyzer Pipeline Software User Guide, Revision A, January 2008.
The MAQ reference manual, http://maq.sourceforge.net/maq-manpage.shtml#5, 3 May, 2008.
The Bowtie reference manual, http://bowtie-bio.sourceforge.net, 28 October, 2008.
The SOAP reference manual, http://soap.genomics.org.cn/soap1, 16 December, 2008.
library(ShortRead)
sp <- SolexaPath(system.file("extdata", package="ShortRead"))
ap <- analysisPath(sp)
## ELAND_EXTENDED
(aln0 <- readAligned(ap, "s_2_export.txt", "SolexaExport"))
## PhageAlign
(aln1 <- readAligned(ap, "s_5_.*_realign.txt", "SolexaRealign"))
## MAQ
dirPath <- system.file('extdata', 'maq', package='ShortRead')
list.files(dirPath)
## First line
readLines(list.files(dirPath, full.names=TRUE)[[1]], 1)
countLines(dirPath)
## two files collapse into one
(aln2 <- readAligned(dirPath, type="MAQMapview"))
## select only chr1-5.fa, '+' strand
filt <- compose(chromosomeFilter("chr[1-5].fa"),
                strandFilter("+"))
(aln3 <- readAligned(sp, "s_2_export.txt", filter=filt))