ShortRead (version 1.30.0)

readAligned: (Legacy) Read aligned reads and their quality scores into R representations

Description

Import files containing aligned reads into an internal representation of the alignments, sequences, and quality scores. Most methods (see ‘details’ for exceptions) read all files into a single R object.

Usage

readAligned(dirPath, pattern=character(0), ...)

Arguments

dirPath
A character vector (or other object; see methods defined on this generic) giving the directory path (relative or absolute; some methods also accept a character vector of file names) of aligned read files to be input.
pattern
The (grep-style) pattern describing file names to be read. The default (character(0)) results in (attempted) input of all files in the directory.
...
Additional arguments, used by methods. When dirPath is a character vector, the argument type must be provided. Possible values for type and their meaning are described below. Most methods implement filter=srFilter(), allowing objects of class SRFilter to selectively return aligned reads.
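
As a rough illustration of how a grep-style pattern selects file names, the sketch below applies a regular expression to a vector of file names using base R only. The file names are invented for the example, and the sketch merely assumes that readAligned matches names in a comparable way; it does not call ShortRead.

```r
# Hypothetical file names, for illustration only
files <- c("s_1_export.txt", "s_2_export.txt", "s_2_0001_realign.txt")

# A grep-style (regular expression) pattern, as in the 'pattern' argument
pattern <- "s_2_.*\\.txt$"

# Keep only the names that match the pattern
selected <- files[grepl(pattern, files)]
selected
```

A narrower pattern such as "s_2_export.txt" would select a single file, which matters for types that require exactly one match.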

Value

A single R object (e.g., AlignedRead) containing alignments, sequences and qualities of all files in dirPath matching pattern. There is no guarantee of order in which files are read.

Details

There is no standard aligned read file format; methods parse particular file types.

The readAligned,character-method interprets file types based on an additional type argument. Supported types are:

type="SolexaExport"

This type parses .*_export.txt files following the documentation in the Solexa Genome Alignment software manual, version 0.3.0. These files consist of the following columns; consult Solexa documentation for precise descriptions. If parsed, values can be retrieved from AlignedRead as follows:

Machine
see below

Run number
stored in alignData

Lane
stored in alignData

Tile
stored in alignData

X
stored in alignData

Y
stored in alignData

Multiplex index
see below

Paired read number
see below

Read
sread

Quality
quality

Match chromosome
chromosome

Match contig
alignData

Match position
position

Match strand
strand

Match description
Ignored

Single-read alignment score
alignQuality

Paired-read alignment score
Ignored

Partner chromosome
Ignored

Partner contig
Ignored

Partner offset
Ignored

Partner strand
Ignored

Filtering
alignData

The following optional arguments, set to FALSE by default, influence data input:

withMultiplexIndex
When TRUE, include the multiplex index as a column multiplexIndex in alignData.

withPairedReadNumber
When TRUE, include the paired read number as a column pairedReadNumber in alignData.

withId
When TRUE, construct an identifier string as ‘Machine_Run:Lane:Tile:X:Y#multiplexIndex/pairedReadNumber’. The substrings ‘#multiplexIndex’ and ‘/pairedReadNumber’ are omitted when withMultiplexIndex=FALSE or withPairedReadNumber=FALSE, respectively.

withAll
A convenience which, when TRUE, sets all with* values to TRUE.
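
The identifier layout described for withId can be sketched in base R as follows. The helper make_id() and all field values are hypothetical and exist only for this illustration; this is not how ShortRead builds the string internally.

```r
# Sketch of the identifier layout described for withId=TRUE.
# make_id() is a hypothetical helper, not part of ShortRead.
make_id <- function(machine, run, lane, tile, x, y,
                    multiplexIndex = NULL, pairedReadNumber = NULL) {
    id <- sprintf("%s_%s:%s:%s:%s:%s", machine, run, lane, tile, x, y)
    # Optional suffixes are appended only when the value is supplied
    if (!is.null(multiplexIndex))
        id <- paste0(id, "#", multiplexIndex)
    if (!is.null(pairedReadNumber))
        id <- paste0(id, "/", pairedReadNumber)
    id
}

# Invented field values
id_plain <- make_id("HWI-EAS88", 3, 2, 1, 451, 945)
id_full <- make_id("HWI-EAS88", 3, 2, 1, 451, 945,
                   multiplexIndex = "0", pairedReadNumber = 1)
```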

Note that not all paired read columns are interpreted. Different interfaces to reading alignment files are described in SolexaPath and SolexaSet.

type="SolexaPrealign"
See SolexaRealign

type="SolexaAlign"
See SolexaRealign

type="SolexaRealign"

These types parse s_L_TTTT_prealign.txt, s_L_TTTT_align.txt or s_L_TTTT_realign.txt files produced by default and eland analyses. From the Solexa documentation: align corresponds to unfiltered first-pass alignments; prealign adjusts alignments for error rates (when available); and realign filters alignments to exclude clusters failing to pass quality criteria.

Because base quality scores are not stored with alignments, the object returned by readAligned scores all base qualities as -32.

If parsed, values can be retrieved from AlignedRead as follows:

Sequence
stored in sread

Best score
stored in alignQuality

Number of hits
stored in alignData

Target position
stored in position

Strand
stored in strand

Target sequence
Ignored; parse using readXStringColumns

Next best score
stored in alignData

type="SolexaResult"

This parses s_L_eland_results.txt files, an intermediate format that does not contain read or alignment quality scores.

Because base quality scores are not stored with alignments, the object returned by readAligned scores all base qualities as -32.

Columns of this file type can be retrieved from AlignedRead as follows (description of columns is from Table 19, Genome Analyzer Pipeline Software User Guide, Revision A, January 2008):

Id
Not parsed

Sequence
stored in sread

Type of match code
Stored in alignData as matchCode. Codes are (from the Eland manual): NM (no match); QC (no match due to quality control failure); RM (no match due to repeat masking); U0 (best match was unique and exact); U1 (best match was unique, with 1 mismatch); U2 (best match was unique, with 2 mismatches); R0 (multiple exact matches found); R1 (multiple 1 mismatch matches found, no exact matches); R2 (multiple 2 mismatch matches found, no exact or 1-mismatch matches).

Number of exact matches
stored in alignData as nExactMatch

Number of 1-error mismatches
stored in alignData as nOneMismatch

Number of 2-error mismatches
stored in alignData as nTwoMismatch

Genome file of match
stored in chromosome

Position
stored in position

Strand
(direction of match) stored in strand

‘N’ treatment
stored in alignData, as NCharacterTreatment. ‘.’ indicates treatment of ‘N’ was not applicable; ‘D’ indicates treatment as deletion; ‘|’ indicates treatment as insertion

Substitution error
stored in alignData as mismatchDetailOne and mismatchDetailTwo. Present only for unique inexact matches at one or two positions. Position and type of first substitution error, e.g., 11A represents 11 matches with 12th base an A in reference but not read. The reference manual cited below lists only one field (mismatchDetailOne), but two are present in files seen in the wild.
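
The match codes and the substitution-error notation above can be decoded with a few lines of base R. This is a minimal sketch: the vector of codes and the detail string stand in for columns of alignData and are invented for the example.

```r
# Lookup table built from the match codes listed above
matchCodeMeaning <- c(
    NM = "no match",
    QC = "no match due to quality control failure",
    RM = "no match due to repeat masking",
    U0 = "best match was unique and exact",
    U1 = "best match was unique, with 1 mismatch",
    U2 = "best match was unique, with 2 mismatches",
    R0 = "multiple exact matches found",
    R1 = "multiple 1 mismatch matches found, no exact matches",
    R2 = "multiple 2 mismatch matches found, no exact or 1-mismatch matches"
)

# Invented matchCode values, standing in for a column of alignData
codes <- c("U0", "R1", "NM", "U2")
meanings <- unname(matchCodeMeaning[codes])

# Flag unique matches (codes starting with "U")
isUnique <- grepl("^U", codes)

# Decode a substitution-error string such as "11A":
# 11 matching bases, then a mismatch at base 12 where the reference has A
detail <- "11A"
nMatches <- as.integer(sub("[A-Za-z]$", "", detail))
mismatchPos <- nMatches + 1L
refBase <- sub("^[0-9]+", "", detail)
```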

type="MAQMap", records=-1L
Parse binary map files produced by MAQ. See details in the next section. The records option determines how many records are read; -1L (the default) means that all records are input. For type="MAQMap", dirPath and pattern must match a single file.

type="MAQMapShort", records=-1L
The same as type="MAQMap" but for map files made with Maq prior to version 0.7.0. (These files use a different maximum read length [64 instead of 128], and are hence incompatible with newer Maq map files.) For type="MAQMapShort", dirPath and pattern must match a single file.

type="MAQMapview"

Parse alignment files created by MAQ's ‘mapview’ command. Interpretation of columns is based on the description in the MAQ manual, specifically

        ...each line consists of read name, chromosome, position,
        strand, insert size from the outer coordinates of a pair,
        paired flag, mapping quality, single-end mapping quality,
        alternative mapping quality, number of mismatches of the
        best hit, sum of qualities of mismatched bases of the best
        hit, number of 0-mismatch hits of the first 24bp, number
        of 1-mismatch hits of the first 24bp on the reference,
        length of the read, read sequence and its quality.
      

The read name, read sequence, and quality are read as XStringSet objects. Chromosome and strand are read as factors. Position and mapping quality are numeric. These fields are mapped to their corresponding representation in AlignedRead objects.

Number of mismatches of the best hit, sum of qualities of mismatched bases of the best hit, number of 0-mismatch hits of the first 24bp, number of 1-mismatch hits of the first 24bp are represented in the AlignedRead object as components of alignData.

Remaining fields are currently ignored.
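
To make the column layout concrete, the following base R sketch splits one record into the sixteen fields quoted from the MAQ manual. The record is invented, the column names are hypothetical labels chosen for this example, and tab-separated input is an assumption; the real parser in ShortRead is not shown here.

```r
# Column names follow the order quoted from the MAQ manual above;
# the names themselves are hypothetical labels chosen for this sketch.
cols <- c("readName", "chromosome", "position", "strand", "insertSize",
          "pairedFlag", "mapQ", "singleEndMapQ", "altMapQ",
          "nMismatchBestHit", "mismatchQualSum", "nExactMatch24",
          "nOneMismatch24", "readLength", "sequence", "quality")

# One invented record, assumed to be tab-separated
line <- paste("read_1", "chr1.fa", "1001", "+", "0", "0", "30", "30", "0",
              "1", "20", "1", "0", "8", "ACGTACGT", "IIIIIIII", sep = "\t")

# Read every field as character first, so nothing is silently converted
rec <- read.table(text = line, sep = "\t", col.names = cols,
                  colClasses = "character", quote = "", comment.char = "",
                  stringsAsFactors = FALSE)

# Coerce as described in the text: chromosome and strand as factors,
# position and mapping quality as numeric
rec$chromosome <- factor(rec$chromosome)
rec$strand <- factor(rec$strand, levels = c("+", "-"))
rec$position <- as.integer(rec$position)
rec$mapQ <- as.integer(rec$mapQ)
```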

type="Bowtie"

Parse alignment files created with the Bowtie alignment algorithm. Parsed columns can be retrieved from AlignedRead as follows:

Identifier
id

Strand
strand

Chromosome
chromosome

Position
position; see comment below

Read
sread; see comment below

Read quality
quality; see comments below

Similar alignments
alignData, ‘similar’ column; Bowtie v. 0.9.9.3 (12 May, 2009) documents this as the number of other instances where the same read aligns against the same reference characters as were aligned against in this alignment. Previous versions marked this as ‘Reserved’.

Alignment mismatch locations
alignData, ‘mismatch’ column

NOTE: the default quality encoding changes to FastqQuality with ShortRead version 1.3.24.

This method includes the argument qualityType to specify how quality scores are encoded. Bowtie quality scores are ‘Phred’-like by default, with qualityType='FastqQuality', but can be specified as ‘Solexa’-like, with qualityType='SFastqQuality'.
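
The practical effect of the quality encoding can be seen by decoding the same string under two ASCII offsets. The sketch below assumes the conventional offsets of 33 for ‘Phred’-like and 64 for ‘Solexa’-like strings; the helper decode_quality() and the quality string are hypothetical, and the sketch ignores the different score scales behind the two conventions.

```r
# Decode one quality string under two ASCII offsets.
# The offsets (33 and 64) are the conventional ones and are an assumption here.
decode_quality <- function(q, offset) utf8ToInt(q) - offset

q <- "hhhI"   # invented quality string

asPhred <- decode_quality(q, 33L)    # interpreted with offset 33
asSolexa <- decode_quality(q, 64L)   # interpreted with offset 64
```

Choosing the wrong qualityType therefore shifts every score by a constant, which is why the argument should match the aligner's input.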

Bowtie outputs positions that are 0-offset from the left-most end of the + strand. ShortRead parses position information to be 1-offset from the left-most end of the + strand.

Bowtie outputs reads aligned to the - strand as their reverse complement, and reverses the quality score string of these reads. ShortRead parses these to their original sequence and orientation.
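
The two conversions described above (0-offset to 1-offset positions, and restoring the original orientation of reads reported on the - strand) can be sketched in base R. The helpers revcomp() and revstr() and the record are hypothetical; this illustrates the transformation rather than ShortRead's own code.

```r
# Hypothetical helper: reverse-complement a DNA string with base R
revcomp <- function(s) {
    comp <- chartr("ACGTacgt", "TGCAtgca", s)
    paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# Hypothetical helper: reverse a quality string
revstr <- function(s) paste(rev(strsplit(s, "")[[1]]), collapse = "")

# One invented '-' strand record as an aligner might report it
reported <- list(strand = "-", position0 = 99L,
                 read = "AACGT", quality = "ABCDE")

position1 <- reported$position0 + 1L          # 0-offset to 1-offset
original_read <- revcomp(reported$read)       # undo the reverse complement
original_quality <- revstr(reported$quality)  # undo the quality reversal
```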

type="SOAP"

Parse alignment files created with the SOAP alignment algorithm. Parsed columns can be retrieved from AlignedRead as follows:

id
id

seq
sread; see comment below

qual
quality; see comment below

number of hits
alignData

a/b
alignData (pairedEnd)

length
alignData (alignedLength)

+/-
strand

chr
chromosome

location
position; see comment below

types
alignData (typeOfHit: integer portion; hitDetail: text portion)

This method includes the argument qualityType to specify how quality scores are encoded. It is unclear from SOAP documentation what the quality score is; the default is ‘Solexa’-like, with qualityType='SFastqQuality', but can be specified as ‘Phred’-like, with qualityType='FastqQuality'.

SOAP outputs positions that are 1-offset from the left-most end of the + strand. ShortRead preserves this representation.

SOAP reads aligned to the - strand are reported by SOAP as their reverse complement, with the quality string of these reads reversed. ShortRead parses these to their original sequence and orientation.

See Also

The AlignedRead class.

Genome Analyzer Pipeline Software User Guide, Revision A, January 2008.

The MAQ reference manual, http://maq.sourceforge.net/maq-manpage.shtml#5, 3 May, 2008.

The Bowtie reference manual, http://bowtie-bio.sourceforge.net, 28 October, 2008.

The SOAP reference manual, http://soap.genomics.org.cn/soap1, 16 December, 2008.

Examples

sp <- SolexaPath(system.file("extdata", package="ShortRead"))
ap <- analysisPath(sp)
## ELAND_EXTENDED
(aln0 <- readAligned(ap, "s_2_export.txt", "SolexaExport"))
## PhageAlign
(aln1 <- readAligned(ap, "s_5_.*_realign.txt", "SolexaRealign"))

## MAQ
dirPath <- system.file('extdata', 'maq', package='ShortRead')
list.files(dirPath)
## First line
readLines(list.files(dirPath, full.names=TRUE)[[1]], 1)
countLines(dirPath)
## two files collapse into one
(aln2 <- readAligned(dirPath, type="MAQMapview"))

## select only chr1-5.fa, '+' strand
filt <- compose(chromosomeFilter("chr[1-5].fa"),
                strandFilter("+"))
(aln3 <- readAligned(sp, "s_2_export.txt", filter=filt))