fastqPairedFilter: Filters and trims paired forward and reverse fastq files.

Description

fastqPairedFilter takes two input fastq files (which can be compressed), filters them based on several user-definable criteria, and writes the reads that pass the filter in both directions, along with their associated qualities, to two new fastq files (which can also be compressed). Several functions in the ShortRead package are leveraged to do this filtering. The filtered forward/reverse reads remain identically ordered.

Usage

fastqPairedFilter(fn, fout, maxN = c(0, 0), truncQ = c(2, 2), truncLen = c(0, 0), trimLeft = c(0, 0), minQ = c(0, 0), maxEE = c(Inf, Inf), rm.phix = c(FALSE, FALSE), matchIDs = FALSE, id.sep = "\\s", id.field = NULL, n = 1e+06, compress = TRUE, verbose = FALSE, ...)

Arguments

fn

(Required). A character(2) naming the paths to the (forward,reverse) fastq files.

fout

(Required). A character(2) naming the paths to the (forward,reverse) output files. Note that by default (compress=TRUE) the output fastq files are gzipped.

FILTERING AND TRIMMING ARGUMENTS that follow can be provided as length 1 or length 2 vectors. If a length 1 vector is provided, the same parameter value is used for the forward and reverse sequence files. If a length 2 vector is provided, the first value is used for the forward reads, and the second for the reverse reads.
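The length 1 versus length 2 rule can be illustrated with a small base R sketch. The helper expand_pair is hypothetical and is not part of dada2; it only shows how a single value could be expanded to one value per read direction.

```r
# Hypothetical helper (not a dada2 function): expand a filtering argument
# to one value per read direction (forward, reverse).
expand_pair <- function(x) {
  stopifnot(length(x) %in% c(1, 2))
  rep(x, length.out = 2)
}

same_both <- expand_pair(2)            # one value, used for both directions
per_dir   <- expand_pair(c(240, 200))  # forward = 240, reverse = 200
```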

maxN

(Optional). Default 0. After truncation, sequences with more than maxN Ns will be discarded. Note that dada currently does not allow Ns.

truncQ

(Optional). Default 2. Truncate reads at the first instance of a quality score less than or equal to truncQ. The default value of 2 is a special quality score indicating the end of good quality sequence in Illumina 1.8+.
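As a rough illustration of this rule, the base R sketch below truncates a vector of quality scores at the first score less than or equal to truncQ. The helper trunc_at_q is hypothetical, and the sketch assumes that the offending position itself is dropped along with everything after it; the package performs the real truncation through ShortRead.

```r
# Hypothetical helper (not a dada2 function): keep only the positions
# before the first quality score <= truncQ.
trunc_at_q <- function(quals, truncQ = 2) {
  low <- which(quals <= truncQ)
  if (length(low) == 0) return(quals)  # no low-quality position: keep all
  quals[seq_len(low[1] - 1)]           # keep positions before the first hit
}

kept_quals <- trunc_at_q(c(35, 34, 30, 2, 2, 2))
```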

truncLen

(Optional). Default 0 (no truncation). Truncate reads after truncLen bases. Reads shorter than this are discarded. Note that dada currently requires all sequences to be the same length.

trimLeft

(Optional). Default 0. The number of nucleotides to remove from the start of each read. If both truncLen and trimLeft are provided, filtered reads will have length truncLen-trimLeft.
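The resulting read length can be checked with a toy example in base R. This is only a sketch of the arithmetic on a made-up 240-nt string, assuming the read is at least truncLen bases long; it does not call any dada2 code.

```r
# Toy 240-nt read (made-up sequence, for illustration only)
read <- paste(rep("ACGT", 60), collapse = "")

truncLen <- 200
trimLeft <- 10

# Drop the first trimLeft bases and everything after position truncLen
kept <- substr(read, trimLeft + 1, truncLen)
kept_len <- nchar(kept)  # truncLen - trimLeft = 190
```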

minQ

(Optional). Default 0. After truncation, reads that contain a quality score below minQ will be discarded.

maxEE

(Optional). Default Inf (no EE filtering). After truncation, reads with more than maxEE "expected errors" will be discarded. Expected errors are calculated from the nominal definition of the quality score: EE = sum(10^(-Q/10))
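The expected-errors formula can be evaluated by hand in base R. The function name expected_errors and the quality scores below are made up for illustration; only the formula itself comes from the description above.

```r
# Hypothetical helper: expected errors from a vector of Phred quality scores
expected_errors <- function(quals) sum(10^(-quals / 10))

q  <- c(30, 20, 10)       # made-up quality scores for a 3-base read
ee <- expected_errors(q)  # 0.001 + 0.01 + 0.1 = 0.111

# With maxEE = 2 this read would be kept, because ee <= 2
passes <- ee <= 2
```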

rm.phix

(Optional). Default FALSE. If TRUE, discard reads that match against the phiX genome, as determined by isPhiX.

ID MATCHING ARGUMENTS that follow enforce matching between the sequence identification strings in the forward and reverse reads. The function can automatically detect and match ID fields in Illumina format, e.g. EAS139:136:FC706VJ:2:2104:15343:197393

matchIDs

(Optional). Default FALSE. Whether to enforce matching between the id-line sequence identifiers of the forward and reverse fastq files. If TRUE, only paired reads that share id fields (see below) are output. If FALSE, no read ID checking is done. Note: matchIDs=FALSE essentially assumes that forward and reverse reads are in matching order. If they are not, later processing steps may break (in particular mergePairs).

id.sep

(Optional). Default "\s" (whitespace). The separator between fields in the id-line of the input fastq files. Passed to strsplit.

id.field

(Optional). Default NULL (automatic detection). The field of the id-line containing the sequence identifier. If NULL (the default) and matchIDs is TRUE, the function attempts to automatically detect the sequence identifier field under the assumption of Illumina formatted output.
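The idea behind ID matching can be sketched in base R with strsplit. The helper extract_id and the id-lines below are hypothetical (the text after the space is made-up toy data), and the sketch assumes the sequence identifier is the first whitespace-separated field; it does not reproduce the automatic detection performed by the function.

```r
# Hypothetical helper (not a dada2 function): split each id-line on id.sep
# and return the field assumed to hold the sequence identifier.
extract_id <- function(id_lines, id.sep = "\\s", id.field = 1) {
  vapply(strsplit(id_lines, id.sep),
         function(fields) fields[id.field],
         character(1), USE.NAMES = FALSE)
}

# Made-up id-lines in an Illumina-like format
fwd_ids <- c("EAS139:136:FC706VJ:2:2104:15343:197393 1:N:0:ATCACG",
             "EAS139:136:FC706VJ:2:2104:15343:197394 1:N:0:ATCACG")
rev_ids <- c("EAS139:136:FC706VJ:2:2104:15343:197394 2:N:0:ATCACG")

# Keep only reads whose identifier occurs in both files
shared   <- intersect(extract_id(fwd_ids), extract_id(rev_ids))
keep_fwd <- extract_id(fwd_ids) %in% shared
```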

n

(Optional). Default 1e6 (one million reads). The number of records (reads) to read in and filter at any one time. This controls the peak memory requirement so that very large fastq files are supported. See FastqStreamer for details.

compress

(Optional). Default TRUE. Whether the output fastq files should be gzip compressed.

verbose

(Optional). Default FALSE. Whether to output status messages.

...

(Optional). Arguments passed on to isPhiX.

Value

NULL.

Details

fastqPairedFilter replicates most of the functionality of the fastq_filter command in usearch (http://www.drive5.com/usearch/manual/cmd_fastq_filter.html), but retains only those read pairs in which both reads pass the filter. An added feature is the option to remove contaminating phiX sequences as part of the filtering process.

Examples

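library(dada2)

# Locate the example paired-end fastq files shipped with dada2
testFastqF <- system.file("extdata", "sam1F.fastq.gz", package="dada2")
testFastqR <- system.file("extdata", "sam1R.fastq.gz", package="dada2")
# Write the filtered reads to temporary gzipped fastq files
filtFastqF <- tempfile(fileext=".fastq.gz")
filtFastqR <- tempfile(fileext=".fastq.gz")
# Discard read pairs containing Ns or with more than 2 expected errors
fastqPairedFilter(c(testFastqF, testFastqR), c(filtFastqF, filtFastqR), maxN=0, maxEE=2)
# Use per-direction trimming and truncation, remove phiX reads, and print status messages
fastqPairedFilter(c(testFastqF, testFastqR), c(filtFastqF, filtFastqR), trimLeft=c(10, 20),
                    truncLen=c(240, 200), maxEE=2, rm.phix=TRUE, verbose=TRUE)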