preprocessReads(filename, outputFilename=NULL,
                filenameMate=NULL, outputFilenameMate=NULL,
                truncateStartBases=NULL, truncateEndBases=NULL,
                Lpattern="", Rpattern="",
                max.Lmismatch=rep(0:2, c(6,3,100)), max.Rmismatch=rep(0:2, c(6,3,100)),
                with.Lindels=FALSE, with.Rindels=FALSE,
                minLength=14L, nBases=2L, complexity=NULL,
                nrec=1000000L, clObj=NULL)
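max.Lmismatch - mismatch tolerance when searching for matches of Lpattern (see the description of adapter trimming below).
max.Rmismatch - mismatch tolerance when searching for matches of Rpattern (see the description of adapter trimming below).
with.Lindels - if TRUE, indels are allowed in the alignments of the suffixes of Lpattern with the subject, at its beginning (see the description of adapter trimming below).
with.Rindels - same as with.Lindels, but for alignments of the prefixes of Rpattern with the subject, at its end (see the description of adapter trimming below).
complexity - NULL (default) or numeric(1): if not NULL, the minimal sequence complexity, as a fraction of the average complexity in the human genome (~3.9 bits). For example, complexity = 0.5 will filter out sequences that do not have at least half the complexity of the human genome. See the description of the complexity calculation below.
The function reports the number of sequences (or sequence pairs) in each of the following categories:
totalSequences - the total number in the input
matchTo5pAdaptor - matching to Lpattern
matchTo3pAdaptor - matching to Rpattern
tooShort - shorter than minLength
tooManyN - more than nBases Ns
lowComplexity - relative complexity below complexity
totalPassed - the number of sequences/sequence pairs that pass all filtering criteria and were written to the output file(s).
Multiple files can be processed by a single call to preprocessReads; in that case all sequence file vectors must have identical lengths.
nrec can be used to limit the memory usage when processing large input files. preprocessReads iteratively loads chunks of nrec sequences from the input until all data has been processed.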
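The chunk-wise loading controlled by nrec can be pictured with a small base R sketch. This is only an illustration of the idea, not the code used inside preprocessReads: the toy FASTQ records, the textConnection input and the chunkSizes variable are all assumptions made for the example.

```r
# Toy FASTQ content: 5 records of 4 lines each (made-up data).
fq <- unlist(lapply(1:5, function(i)
  c(paste0("@read", i), "ACGTACGTACGTACGT", "+", "IIIIIIIIIIIIIIII")))

nrec <- 2L                      # records per chunk (the usage default is 1000000L)
con <- textConnection(fq)       # stands in for a large input file
chunkSizes <- numeric(0)
repeat {
  lines <- readLines(con, n = 4L * nrec)   # one chunk = up to nrec records
  if (length(lines) == 0L) break           # input exhausted
  chunkSizes <- c(chunkSizes, length(lines) / 4)
  # ... filtering of this chunk and appending to the output would go here ...
}
close(con)
chunkSizes                      # number of records seen in each chunk
```

Only one chunk is held in memory at a time, which is why a smaller nrec lowers peak memory use at the cost of more iterations.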
Sequence pairs from paired-end experiments can be processed by specifying pairs of input and output files (filenameMate and outputFilenameMate arguments). In that case, it is assumed that pairs appear in the same order in the two input files, and only pairs in which both reads pass all filtering criteria are written to the output files, maintaining the consistent ordering.
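The pair-wise rule described above amounts to a logical AND over the two mates. The following base R sketch uses made-up filter results and read identifiers (passRead1, passRead2, mate1 and mate2 are hypothetical names, not QuasR objects) to show why both outputs stay in the same order.

```r
# Hypothetical per-read filter results for five read pairs
# (TRUE = the read passed all filtering criteria).
passRead1 <- c(TRUE, TRUE, FALSE, TRUE, TRUE)
passRead2 <- c(TRUE, FALSE, TRUE, TRUE, TRUE)

# A pair is kept only if both mates pass.
keepPair <- passRead1 & passRead2

ids   <- paste0("pair", 1:5)
mate1 <- paste0(ids, "/1")      # reads as ordered in the first input file
mate2 <- paste0(ids, "/2")      # reads as ordered in the second input file

# Subsetting both files with the same index preserves pairing and order.
out1 <- mate1[keepPair]
out2 <- mate2[keepPair]
```

Because the same index is applied to both files, read k of the first output always corresponds to read k of the second.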
If output files are compressed, the processed sequences are first written to temporary files (created in the same directory as the final output file), and the output files are generated at the end by compressing the temporary files.
For the trimming of left and/or right flanking sequences (adapters) from sequence reads, the trimLRPatterns function from package Biostrings is used. Lpattern, Rpattern, max.Lmismatch, max.Rmismatch, with.Lindels and with.Rindels are used in the call to trimLRPatterns. The Lfixed and Rfixed arguments of trimLRPatterns are set to TRUE, thus only fixed patterns (without IUPAC codes for ambiguous bases) can be used. Currently, trimming of adapters is only supported for single read experiments.
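The default mismatch tolerance, rep(0:2, c(6,3,100)), is easier to read once expanded. The sketch below uses base R only and assumes the interpretation given in the trimLRPatterns documentation, namely that entry i of the vector is the number of mismatches tolerated when i bases of the adapter overlap the read end; the variable names are chosen for illustration.

```r
# Default mismatch tolerance from the usage: rep(0:2, c(6, 3, 100))
maxMismatch <- rep(0:2, c(6, 3, 100))

length(maxMismatch)   # total number of entries (6 + 3 + 100)
maxMismatch[1:12]     # six 0s, three 1s, then 2s

# Assumed interpretation: entry i applies to an adapter/read overlap of i bases.
overlap <- c(4, 6, 7, 9, 10, 25)
tolerance <- data.frame(overlap = overlap,
                        allowedMismatches = maxMismatch[overlap])
tolerance
```

Under that reading, short overlaps of up to 6 bases must match exactly, overlaps of 7 to 9 bases may contain one mismatch, and longer overlaps may contain two.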
Sequence complexity ($H$) is calculated based on the dinucleotide composition using the formula (Shannon entropy):
$$H = -\sum_i {f_i \log_2 f_i},$$
where $f_i$ is the fraction of dinucleotide $i$ from all dinucleotides in the sequence. Sequence reads that fulfill the condition $H/H_r \ge c$ are retained (not filtered out), where $H_r = 3.908$ is the reference complexity in bits obtained from the human genome, and $c$ is the value given to the argument complexity.
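The complexity filter can be reproduced for individual sequences in a few lines of base R. This is a minimal sketch, assuming that overlapping dinucleotides are counted; dinucEntropy is a hypothetical helper written for this example and is not a function exported by QuasR, and the two reads are made up.

```r
# Reference complexity (bits) of the human genome, as stated in the text.
H_REF <- 3.908

# Shannon entropy (bits) of the dinucleotide composition of one sequence.
# Assumption: overlapping dinucleotides are counted.
dinucEntropy <- function(s) {
  n <- nchar(s)
  if (n < 2L) return(0)
  di <- substring(s, 1:(n - 1L), 2:n)        # all overlapping dinucleotides
  f <- as.numeric(table(di)) / length(di)    # observed fractions f_i (all > 0)
  -sum(f * log2(f))
}

reads <- c(lowComplexity = "AAAAAAAAAAAAAAAAAAAA",
           mixed         = "ACGTTGCAGGCTAAGCTTCA")

H        <- vapply(reads, dinucEntropy, numeric(1))
relative <- H / H_REF
keep     <- relative >= 0.6   # c = 0.6, the value used in the examples
```

A homopolymer contains a single dinucleotide type and therefore has $H = 0$, so it is filtered out for any positive $c$. With 16 possible dinucleotides, the largest attainable value is $\log_2 16 = 4$ bits.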
If an object that inherits from class cluster is provided to the clObj argument, for example an object returned by makeCluster from package parallel, multiple files will be processed in parallel using clusterMap from package parallel.
See also trimLRPatterns from package Biostrings and makeCluster from package parallel.
# sample files
infiles <- system.file(package="QuasR", "extdata",
                       c("rna_1_1.fq.bz2","rna_1_2.fq.bz2"))
outfiles <- paste(tempfile(pattern=c("output_1_","output_2_")),".fastq",sep="")
# single read example
preprocessReads(infiles, outfiles, nBases=0, complexity=0.6)
unlink(outfiles)
# paired-end example
preprocessReads(filename=infiles[1],
                outputFilename=outfiles[1],
                filenameMate=infiles[2],
                outputFilenameMate=outfiles[2],
                nBases=0, complexity=0.6)
unlink(outfiles)