preprocessReads: Preprocess Short Reads

Description

Truncate sequences, remove parts matching to adapters and filter out low quality or low complexity sequences from (compressed) 'fasta' or 'fastq' files.

Usage

preprocessReads(filename, outputFilename=NULL,
             filenameMate=NULL, outputFilenameMate=NULL,
             truncateStartBases=NULL, truncateEndBases=NULL, 
             Lpattern="", Rpattern="",
             max.Lmismatch=rep(0:2, c(6,3,100)), max.Rmismatch=rep(0:2, c(6,3,100)),
             with.Lindels=FALSE, with.Rindels=FALSE,
             minLength=14L, nBases=2L, complexity=NULL,
             nrec=1000000L, clObj=NULL)

Arguments

filename

the name(s) of the input sequence file(s).

outputFilename

the name(s) of the output sequence file(s).

filenameMate

for paired-end experiments, the name(s) of the input sequence file(s) containing the second read (mate) of each pair.

outputFilenameMate

for paired-end experiments, the name(s) of the output sequence file(s) containing the second read (mate) of each pair.

truncateStartBases

integer(1): the number of bases to be truncated (removed) from the begining of each sequence.

truncateEndBases

integer(1): the number of bases to be truncated (removed) from the end of each sequence.

Lpattern

character(1): the left (5'-end) adapter sequence.

Rpattern

character(1): the right (3'-end) adapter sequence.

max.Lmismatch

mismatch tolerance when searching for matches of Lpattern (see Details).

max.Rmismatch

mismatch tolerance when searching for matches of Rpattern (see Details).

with.Lindels

if TRUE, indels are allowed in the alignments of the suffixes of Lpattern with the subject, at its beginning (see Details).

with.Rindels

same as with.Lindels but for alignments of the prefixes of Rpattern with the subject, at its end (see Details).

minLength

integer(1): the minimal allowed sequence length.

nBases

integer(1): the maximal number of Ns allowed per sequence.

complexity

NULL (default) or numeric(1): If not NULL, the minimal sequence complexity, as a fraction of the average complexity in the human genome (~3.9bits). For example, complexity = 0.5 will filter out sequences that do not have at least half the complexity of the human genome. See Details on how the complexity is calculated.

nrec

integer(1): the number of sequence records to read at a time.

clObj

a cluster object to be used for parallel processing of multiple files (see Details).

Value

A matrix with summary statistics on the processed sequences, containing:
- One column per input file (or pair of input files for paired-end experiments).
- The number of sequences or sequence pairs in rows:
  - totalSequences- the total number in the input
  - matchTo5pAdaptor- matching toLpattern
  - matchTo3pAdaptor- matching toRpattern
  - tooShort- shorter thanminLength
  - tooManyN- more thannBasesNs
  - lowComplexity- relative complexity belowcomplexity
  - totalPassed- the number of sequences/sequence pairs that pass all filtering criteria and were written to the output file(s).

Details

Sequence files can be in fasta or fastq format, and can be compressed by either gzip, bzip2 or xz (extensions .gz, .bz2 or .xz). Multiple files can be processed by a single call to preprocessReads; in that case all sequence file vectors must have identical lengths.

nrec can be used to limit the memory usage when processing large input files. preprocessReads iteratively loads chunks of nrec sequences from the input until all data been processed.

Sequence pairs from paired-end experiments can be processed by specifying pairs of input and output files (filenameMate and outputFilenameMate arguments). In that case, it is assumed that pairs appear in the same order in the two input files, and only pairs in which both reads pass all filtering criteria are written to the output files, maintaining the consistent ordering.

If output files are compressed, the processed sequences are first written to temporary files (created in the same directory as the final output file), and the output files are generated at the end by compressing the temporary files.

For the trimming of left and/or right flanking sequences (adapters) from sequence reads, the trimLRPatterns function from package Biostrings is used, and the arguments Lpattern, Rpattern, max.Lmismatch, max.Rmismatch, with.Lindels and with.Rindels are used in the call to trimLRPatterns. Lfixed and Rfixed arguments of trimLRPatterns are set to TRUE, thus only fixed patterns (without IUPAC codes for ambigous bases) can be used. Currently, trimming of adapters is only supported for single read experiments.

Sequence complexity ($H$) is calculated based on the dinucleotide composition using the formula (Shannon entropy): $$H = -\sum_i {f_i \log_2 f_i},$$ where $f_i$ is the fraction of dinucleotide $i$ from all dinucleotides in the sequence. Sequence reads that fulfill the condition $H/H_r \ge c$ are retained (not filtered out), where $H_r = 3.908$ is the reference complexity in bits obtained from the human genome, and $c$ is the value given to the argument complexity.

If an object that inherits from class cluster is provided to the clObj argument, for example an object returned by makeCluster from package parallel, multiple files will be processed in parallel using clusterMap from package parallel.

Examples

Run this code

# sample files
infiles <- system.file(package="QuasR", "extdata",
                       c("rna_1_1.fq.bz2","rna_1_2.fq.bz2"))
outfiles <- paste(tempfile(pattern=c("output_1_","output_2_")),".fastq",sep="")

# single read example
preprocessReads(infiles, outfiles, nBases=0, complexity=0.6)
unlink(outfiles)

# paired-end example
preprocessReads(filename=infiles[1],
                outputFilename=outfiles[1],
                filenameMate=infiles[2],
                outputFilenameMate=outfiles[2],
                nBases=0, complexity=0.6)
unlink(outfiles)