fastqFilter: Filter and trim a fastq file.

Description

fastqFilter takes an input fastq file (optionally compressed), filters it according to several user-definable criteria, and writes the reads that pass the filter, together with their associated qualities, to a new fastq file (also optionally compressed). The filtering relies on several functions from the ShortRead package.

Usage

fastqFilter(fn, fout, truncQ = 2, truncLen = 0, trimLeft = 0, maxN = 0, minQ = 0, maxEE = Inf, rm.phix = FALSE, n = 1e+06, compress = TRUE, verbose = FALSE, ...)

Arguments

fn

(Required). The path to the input fastq file, or an R connection to that file.

fout

(Required). The path to the output file, or an R connection to that file. Note that by default (compress=TRUE) the output fastq file is gzipped.

truncQ

(Optional). Default 2. Truncate reads at the first instance of a quality score less than or equal to truncQ. The default value of 2 is a special quality score indicating the end of good quality sequence in Illumina 1.8+.
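As a rough illustration of the truncQ rule, the base R sketch below cuts a single read just before the first quality score that is less than or equal to truncQ. The helper truncate_at_q and the example read are hypothetical and are not part of dada2; the sketch also assumes that the offending base itself is dropped along with everything after it.

```r
# Hypothetical helper (not a dada2 function): truncate one read at the
# first quality score <= truncQ. Assumes the low-quality base is removed too.
truncate_at_q <- function(bases, quals, truncQ = 2) {
  stopifnot(nchar(bases) == length(quals))
  hit <- which(quals <= truncQ)
  # Keep everything before the first low-quality position, or the whole read.
  keep <- if (length(hit) > 0) hit[1] - 1 else length(quals)
  list(bases = substr(bases, 1, keep), quals = quals[seq_len(keep)])
}

# Made-up read whose last four scores are the special value 2.
read <- truncate_at_q("ACGTACGT", c(30, 32, 31, 28, 2, 2, 2, 2))
read$bases  # "ACGT"
```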

truncLen

(Optional). Default 0 (no truncation). Truncate reads after truncLen bases. Reads shorter than this are discarded. Note that dada currently requires all sequences to be the same length.

trimLeft

(Optional). Default 0. The number of nucleotides to remove from the start of each read. If both truncLen and trimLeft are provided, filtered reads will have length truncLen-trimLeft.
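The length arithmetic can be checked with a small base R sketch. The helper trim_read is hypothetical (it is not a dada2 function) and works on a plain character string; returning NA for a discarded read is an assumption made for illustration only.

```r
# Hypothetical helper (not a dada2 function): apply truncLen and trimLeft
# to a single read given as a character string.
trim_read <- function(bases, truncLen = 0, trimLeft = 0) {
  # Reads shorter than truncLen are discarded (represented here as NA).
  if (truncLen > 0 && nchar(bases) < truncLen) return(NA_character_)
  end <- if (truncLen > 0) truncLen else nchar(bases)
  substr(bases, trimLeft + 1, end)
}

read <- paste(rep("ACGT", 60), collapse = "")  # a made-up 240-nt read
kept <- trim_read(read, truncLen = 200, trimLeft = 10)
nchar(kept)  # 190, i.e. truncLen - trimLeft
```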

maxN

(Optional). Default 0. After truncation, sequences with more than maxN Ns will be discarded. Note that dada currently does not allow Ns.

minQ

(Optional). Default 0. After truncation, reads that contain a quality score below minQ will be discarded.

maxEE

(Optional). Default Inf (no EE filtering). After truncation, reads with more than maxEE "expected errors" will be discarded. Expected errors are calculated from the nominal definition of the quality score: EE = sum(10^(-Q/10))
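The three post-truncation checks (maxN, minQ and maxEE) can be sketched for a single read in base R. The helper passes_filter and the example scores are hypothetical and are not how dada2 implements the filter; they only illustrate the criteria as described here.

```r
# Hypothetical helper (not a dada2 function): apply the three checks to one read.
passes_filter <- function(bases, quals, maxN = 0, minQ = 0, maxEE = Inf) {
  n_count <- sum(strsplit(bases, "")[[1]] == "N")  # number of ambiguous bases
  ee <- sum(10^(-quals / 10))                       # expected errors
  n_count <= maxN && all(quals >= minQ) && ee <= maxEE
}

quals <- c(40, 40, 30, 20, 10)
ee <- sum(10^(-quals / 10))  # 1e-4 + 1e-4 + 1e-3 + 1e-2 + 1e-1 = 0.1112

ok_read   <- passes_filter("ACGTA", quals, maxEE = 2)  # TRUE
n_read    <- passes_filter("ACNTA", quals, maxEE = 2)  # FALSE: one N, maxN = 0
lowq_read <- passes_filter("ACGTA", quals, minQ = 15)  # FALSE: a score of 10 < 15
```

Note how a single low score dominates the sum: the base with Q = 10 contributes 0.1 of the 0.1112 expected errors.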

rm.phix

(Optional). Default FALSE. If TRUE, discard reads that match against the phiX genome, as determined by isPhiX.

n

(Optional). Default 1e6 (one million reads). The number of records (reads) to read in and filter at any one time. This controls the peak memory requirement so that very large fastq files are supported. See FastqStreamer for details.

compress

(Optional). Default TRUE. Whether the output fastq file should be gzip compressed.

verbose

(Optional). Default FALSE. Whether to output status messages.

...

(Optional). Arguments passed on to isPhiX.

Value

NULL.

Details

fastqFilter replicates most of the functionality of the fastq_filter command in usearch (http://www.drive5.com/usearch/manual/cmd_fastq_filter.html). It adds the ability to remove contaminating phiX sequences as part of the filtering process.

Examples

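# Example fastq file shipped with the dada2 package
testFastq <- system.file("extdata", "sam1F.fastq.gz", package="dada2")
# Temporary gzipped output file
filtFastq <- tempfile(fileext=".fastq.gz")
# Discard reads containing Ns or with more than 2 expected errors
fastqFilter(testFastq, filtFastq, maxN=0, maxEE=2)
# Also trim 10 nt from the start and truncate to 200 nt, printing status messages
fastqFilter(testFastq, filtFastq, trimLeft=10, truncLen=200, maxEE=2, verbose=TRUE)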
