
qrqc (version 1.26.0)

readSeqFile: Read and Summarize a Sequence (FASTA or FASTQ) File

Description

readSeqFile reads a FASTQ or FASTA file, summarizing the nucleotide distribution across positions (cycles) and the sequence length distribution. If type is 'fastq', the distribution of qualities across positions is also recorded. If hash is TRUE, the unique sequences are hashed along with counts of their frequency. By default, only 10% of the reads are hashed; this proportion can be controlled with hash.prop. If kmer is TRUE, k-mers of length k are hashed by position, with the sampling proportion also controlled by hash.prop.
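
To make the hashing and k-mer options more concrete, the base R sketch below mimics the kind of tallies described above: counts of unique sequences among a sampled fraction of reads, and counts of k-mers at each start position. It is only an illustration of the idea and not qrqc's C implementation; the helpers count_sequences and count_kmers_by_position and the toy reads are hypothetical.

```r
# Illustrative only: base R stand-ins for the tallies described above.
# These helpers are hypothetical and are not part of qrqc.

# Count unique sequences among a randomly sampled proportion of reads.
count_sequences <- function(reads, prop = 0.1) {
  stopifnot(prop > 0, prop <= 1)
  keep <- runif(length(reads)) < prop   # each read is kept with probability prop
  table(reads[keep])
}

# Count k-mers by their 1-based start position within each read.
count_kmers_by_position <- function(reads, k = 6L) {
  rows <- lapply(reads, function(read) {
    n <- nchar(read) - k + 1L
    if (n < 1L) return(NULL)            # read is shorter than k, nothing to count
    starts <- seq_len(n)
    data.frame(position = starts,
               kmer = substring(read, starts, starts + k - 1L),
               stringsAsFactors = FALSE)
  })
  rows <- Filter(Negate(is.null), rows)
  if (length(rows) == 0L) {
    return(data.frame(position = integer(0), kmer = character(0),
                      count = integer(0)))
  }
  all_kmers <- do.call(rbind, rows)
  all_kmers$count <- 1L
  # Sum the per-occurrence counts for each (position, k-mer) pair.
  aggregate(count ~ position + kmer, data = all_kmers, FUN = sum)
}

# Toy reads (made up for this sketch).
reads <- c("ACGTAC", "ACGTTT", "ACGTAC", "ACG")
seq_counts <- count_sequences(reads, prop = 1)        # prop = 1 keeps every read
kmer_counts <- count_kmers_by_position(reads, k = 3L)
```

With prop below 1 the sequence counts vary from run to run, which is the trade-off of sampling: less memory and time in exchange for approximate frequencies.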

Usage

readSeqFile(filename, type=c("fastq", "fasta"), max.length=1000, quality=c("sanger", "solexa", "illumina"), hash=TRUE, hash.prop=0.1, kmer=TRUE, k=6L, verbose=FALSE)

Arguments

filename
the name of the file which the sequences are to be read from.
type
either 'fastq' or 'fasta', representing the type of the file. FASTQ files will have the quality distribution by position summarized.
max.length
the largest sequence length likely to be encountered. For efficiency, a matrix of this size is allocated in C, populated, and then trimmed in R, so the value should be at least as large as the longest sequence. Specifying a value that is too small will lead to an error, and the function will need to be re-run.
quality
either 'illumina', 'sanger', or 'solexa'; this determines the quality offsets and range. See the values of QUALITY.CONSTANTS for more information.
hash
a logical value indicating whether to hash sequences.
hash.prop
a numeric value in (0, 1] giving the proportion of reads to hash.
kmer
a logical value indicating whether to hash k-mers by position.
k
an integer value indicating the k-mer size.
verbose
a logical value indicating whether to be verbose (in the C backend).
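
The quality argument matters because the encodings map ASCII characters to scores differently. As a rough sketch based on the commonly documented conventions (Sanger: Phred scores with ASCII offset 33; Illumina 1.3+: Phred scores with offset 64; Solexa: offset 64 with Solexa-scaled scores), the base R helper below decodes a quality string. The names quality_offsets, decode_quality, and phred_to_prob are hypothetical, and the offsets are assumptions that are not taken from QUALITY.CONSTANTS, which remains the authoritative source for the values qrqc uses.

```r
# Hypothetical helpers, not part of qrqc: decode a FASTQ quality string
# using the conventional ASCII offset of each encoding (assumed values).
quality_offsets <- c(sanger = 33L, solexa = 64L, illumina = 64L)

decode_quality <- function(qual, quality = c("sanger", "solexa", "illumina")) {
  quality <- match.arg(quality)
  # utf8ToInt gives the ASCII code of each character in a single string.
  utf8ToInt(qual) - quality_offsets[[quality]]
}

# Convert a Phred score Q to an error probability, p = 10^(-Q/10).
# This applies to Phred-scaled scores (sanger, illumina), not Solexa scores.
phred_to_prob <- function(q) 10^(-q / 10)

sanger_scores <- decode_quality("!5I", "sanger")
illumina_scores <- decode_quality("@Th", "illumina")
```

Decoding the same string with the wrong encoding shifts every score by 31, which is why a mismatched quality setting tends to produce implausible or out-of-range values.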

Value

An S4 object of class FASTQSummary or FASTASummary containing the summary statistics.

See Also

FASTQSummary and FASTASummary are the classes of the objects returned by readSeqFile.

basePlot is a function that plots the distribution of bases over sequence length for a particular FASTASummary or FASTQSummary object. gcPlot combines and plots the GC proportion. qualPlot is a function that plots the distribution of qualities over sequence length for a particular FASTQSummary object (qualities are recorded only for FASTQ files).

seqlenPlot is a function that plots a histogram of sequence lengths for a particular FASTASummary or FASTQSummary object.

kmerKLPlot is a function that plots the Kullback-Leibler (K-L) divergence of k-mers to look for possible bias in reads.

Examples

  ## Load a FASTQ file, with sequence hashing.
  s.fastq <- readSeqFile(system.file('extdata', 'test.fastq', package='qrqc'))

  ## Load a FASTA file, without sequence hashing.
  s.fasta <- readSeqFile(system.file('extdata', 'test.fasta', package='qrqc'),
                         type='fasta', hash=FALSE)
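
Because a max.length that is too small leads to an error and a re-run, one possible convenience is a small wrapper that retries with a larger value. The sketch below is an assumption-laden illustration: read_with_retry and fake_reader are hypothetical names, the wrapper assumes the failure surfaces as an ordinary R error, and it retries on any error (including unrelated ones such as a missing file), so it is not a substitute for choosing a sensible max.length.

```r
# Hypothetical wrapper, not part of qrqc: call a reader and, if it signals
# an error, double max.length and try again, up to max.tries attempts.
read_with_retry <- function(read_fun, filename, max.length = 1000,
                            max.tries = 4L, ...) {
  for (i in seq_len(max.tries)) {
    result <- tryCatch(read_fun(filename, max.length = max.length, ...),
                       error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", i, " failed with max.length=", max.length,
            "; doubling and retrying.")
    max.length <- max.length * 2
  }
  stop("Reading failed after ", max.tries, " attempts: ",
       conditionMessage(result))
}

# Stand-in reader used only to exercise the wrapper without qrqc installed.
# It fails unless max.length is at least 3000.
fake_reader <- function(filename, max.length, ...) {
  if (max.length < 3000) stop("sequence longer than max.length")
  list(filename = filename, max.length = max.length)
}

res <- read_with_retry(fake_reader, "reads.fastq")
```

With qrqc loaded, the same wrapper could be called as read_with_retry(readSeqFile, filename, type='fastq'), since readSeqFile accepts filename and max.length as shown in Usage.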
