Learn R Programming

QuasR (version 1.12.0)

preprocessReads: Preprocess Short Reads

Description

Truncate sequences, remove parts matching to adapters and filter out low quality or low complexity sequences from (compressed) 'fasta' or 'fastq' files.

Usage

preprocessReads(filename, outputFilename=NULL,
             filenameMate=NULL, outputFilenameMate=NULL,
             truncateStartBases=NULL, truncateEndBases=NULL, 
             Lpattern="", Rpattern="",
             max.Lmismatch=rep(0:2, c(6,3,100)), max.Rmismatch=rep(0:2, c(6,3,100)),
             with.Lindels=FALSE, with.Rindels=FALSE,
             minLength=14L, nBases=2L, complexity=NULL,
             nrec=1000000L, clObj=NULL)

Arguments

filename
the name(s) of the input sequence file(s).
outputFilename
the name(s) of the output sequence file(s).
filenameMate
for paired-end experiments, the name(s) of the input sequence file(s) containing the second read (mate) of each pair.
outputFilenameMate
for paired-end experiments, the name(s) of the output sequence file(s) containing the second read (mate) of each pair.
truncateStartBases
integer(1): the number of bases to be truncated (removed) from the begining of each sequence.
truncateEndBases
integer(1): the number of bases to be truncated (removed) from the end of each sequence.
Lpattern
character(1): the left (5'-end) adapter sequence.
Rpattern
character(1): the right (3'-end) adapter sequence.
max.Lmismatch
mismatch tolerance when searching for matches of Lpattern (see Details).
max.Rmismatch
mismatch tolerance when searching for matches of Rpattern (see Details).
with.Lindels
if TRUE, indels are allowed in the alignments of the suffixes of Lpattern with the subject, at its beginning (see Details).
with.Rindels
same as with.Lindels but for alignments of the prefixes of Rpattern with the subject, at its end (see Details).
minLength
integer(1): the minimal allowed sequence length.
nBases
integer(1): the maximal number of Ns allowed per sequence.
complexity
NULL (default) or numeric(1): If not NULL, the minimal sequence complexity, as a fraction of the average complexity in the human genome (~3.9bits). For example, complexity = 0.5 will filter out sequences that do not have at least half the complexity of the human genome. See Details on how the complexity is calculated.
nrec
integer(1): the number of sequence records to read at a time.
clObj
a cluster object to be used for parallel processing of multiple files (see Details).

Value

  • A matrix with summary statistics on the processed sequences, containing:
    • One column per input file (or pair of input files for paired-end experiments).
    • The number of sequences or sequence pairs in rows:
      • totalSequences- the total number in the input
      • matchTo5pAdaptor- matching toLpattern
      • matchTo3pAdaptor- matching toRpattern
      • tooShort- shorter thanminLength
      • tooManyN- more thannBasesNs
      • lowComplexity- relative complexity belowcomplexity
      • totalPassed- the number of sequences/sequence pairs that pass all filtering criteria and were written to the output file(s).

Details

Sequence files can be in fasta or fastq format, and can be compressed by either gzip, bzip2 or xz (extensions .gz, .bz2 or .xz). Multiple files can be processed by a single call to preprocessReads; in that case all sequence file vectors must have identical lengths.

nrec can be used to limit the memory usage when processing large input files. preprocessReads iteratively loads chunks of nrec sequences from the input until all data been processed.

Sequence pairs from paired-end experiments can be processed by specifying pairs of input and output files (filenameMate and outputFilenameMate arguments). In that case, it is assumed that pairs appear in the same order in the two input files, and only pairs in which both reads pass all filtering criteria are written to the output files, maintaining the consistent ordering.

If output files are compressed, the processed sequences are first written to temporary files (created in the same directory as the final output file), and the output files are generated at the end by compressing the temporary files.

For the trimming of left and/or right flanking sequences (adapters) from sequence reads, the trimLRPatterns function from package Biostrings is used, and the arguments Lpattern, Rpattern, max.Lmismatch, max.Rmismatch, with.Lindels and with.Rindels are used in the call to trimLRPatterns. Lfixed and Rfixed arguments of trimLRPatterns are set to TRUE, thus only fixed patterns (without IUPAC codes for ambigous bases) can be used. Currently, trimming of adapters is only supported for single read experiments.

Sequence complexity ($H$) is calculated based on the dinucleotide composition using the formula (Shannon entropy): $$H = -\sum_i {f_i \log_2 f_i},$$ where $f_i$ is the fraction of dinucleotide $i$ from all dinucleotides in the sequence. Sequence reads that fulfill the condition $H/H_r \ge c$ are retained (not filtered out), where $H_r = 3.908$ is the reference complexity in bits obtained from the human genome, and $c$ is the value given to the argument complexity.

If an object that inherits from class cluster is provided to the clObj argument, for example an object returned by makeCluster from package parallel, multiple files will be processed in parallel using clusterMap from package parallel.

See Also

trimLRPatterns from package Biostrings, makeCluster from package parallel

Examples

Run this code
# sample files
infiles <- system.file(package="QuasR", "extdata",
                       c("rna_1_1.fq.bz2","rna_1_2.fq.bz2"))
outfiles <- paste(tempfile(pattern=c("output_1_","output_2_")),".fastq",sep="")

# single read example
preprocessReads(infiles, outfiles, nBases=0, complexity=0.6)
unlink(outfiles)

# paired-end example
preprocessReads(filename=infiles[1],
                outputFilename=outfiles[1],
                filenameMate=infiles[2],
                outputFilenameMate=outfiles[2],
                nBases=0, complexity=0.6)
unlink(outfiles)

Run the code above in your browser using DataLab