DECIPHER (version 2.0.2)

Codec: Compression/Decompression of Character Vectors

Description

Compresses character vectors into raw vectors, or decompresses raw vectors into character vectors using a variety of codecs.

Usage

Codec(x, compression = "auto", compressRepeats = FALSE, processors = 1)

Arguments

x
Either a character vector to be compressed, or a list of raw vectors to be decompressed.
compression
The type of compression algorithm to use when x is a character vector. This should be (an unambiguous abbreviation of) one of "auto", "nbit", "gzip", "bzip2", or "xz". Decompression type is determined automatically. (See details section below.)
compressRepeats
Logical specifying whether to compress exact repeats and reverse complement repeats in a character vector input (x). Only applicable when compression is "auto" or "nbit". Repeat compression in long DNA sequences generally increases compression by about 2% while requiring three-fold more compression time.
processors
The number of processors to use, or NULL to automatically detect and use all available processors.
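The arguments above can be combined as in the following minimal sketch. The toy sequence is a hypothetical input built only for illustration (a random 500-nucleotide unit repeated twice so that it contains an exact repeat). The size difference, if any, will depend on the data, so no particular result is claimed here.

```r
library(DECIPHER)

# Hypothetical toy input: a DNA string that contains one exact repeat
set.seed(1)
unit <- paste(sample(c("A", "C", "G", "T"), 500, replace = TRUE), collapse = "")
dna <- paste0(unit, unit)

# "nbit" compression without and with repeat compression
plain <- Codec(dna, compression = "nbit")
repeats <- Codec(dna, compression = "nbit", compressRepeats = TRUE)

# Compressed sizes in bytes, for comparison
lengths(plain)
lengths(repeats)

# processors = NULL requests automatic detection of available processors
multi <- Codec(dna, compression = "nbit", processors = NULL)

# Every variant should decompress back to the original string
stopifnot(Codec(plain) == dna, Codec(repeats) == dna, Codec(multi) == dna)
```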

Value

If x is a character vector to be compressed, the output is a list containing one raw vector per character string. If x is a list of raw vectors to be decompressed, the output is a character vector with one string per list element.
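A short sketch of how the return types described above might be checked. The two input strings are hypothetical and chosen only for illustration.

```r
library(DECIPHER)

# Hypothetical input: two short nucleotide strings
s <- c(paste(rep("ACGT", 50), collapse = ""),
       paste(rep("GGCATTA", 30), collapse = ""))

x <- Codec(s)            # compress
is.list(x)               # documented as a list ...
length(x) == length(s)   # ... with one element per input string
is.raw(x[[1]])           # each element is documented as a raw vector

y <- Codec(x)            # decompress
is.character(y)          # documented as a character vector
stopifnot(y == s)        # round trip recovers the input
```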

Details

Codec compresses and decompresses character vectors with different algorithms. The default compression algorithm, "auto", applies "nbit", an encoding optimized for efficient compression of nucleotide sequences. The "auto" method automatically falls back to "gzip" compression when a character string is incompressible with "nbit" encoding (e.g., amino acid sequences). In contrast, setting compression to "nbit" retains the original character encoding when the input is incompressible with "nbit" encoding.
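The difference between "auto" and "nbit" on non-nucleotide input can be explored with a sketch like the one below. The random amino acid string is a hypothetical input for illustration; the sizes printed will depend on the data, so none are asserted here.

```r
library(DECIPHER)

# Hypothetical input: a random amino acid sequence (not nucleotide data)
set.seed(1)
letters20 <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]
aa <- paste(sample(letters20, 2000, replace = TRUE), collapse = "")

auto <- Codec(aa, compression = "auto")  # may fall back to "gzip"
nbit <- Codec(aa, compression = "nbit")  # no fallback to "gzip"

# Original length in characters versus stored size in bytes
c(original = nchar(aa), auto = lengths(auto), nbit = lengths(nbit))

# Both results should still decompress to the original string
stopifnot(Codec(auto) == aa, Codec(nbit) == aa)
```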

When performing the reverse operation, decompression, the type of compression is automatically detected based on the "magic header" added by each compression algorithm.
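As a rough illustration of automatic detection, the sketch below compresses the same hypothetical string with each explicit codec, prints the leading bytes of each result, and decompresses without naming the codec. The exact header bytes are not documented on this page, so they are only printed, not interpreted.

```r
library(DECIPHER)

# Hypothetical input: a repetitive nucleotide string
dna <- paste(rep("ACGTTGCA", 100), collapse = "")

for (type in c("nbit", "gzip", "bzip2", "xz")) {
  x <- Codec(dna, compression = type)
  # Show the first four bytes of the compressed raw vector as hex
  cat(type, ":", as.character(head(x[[1]], 4)), "\n")
  # Decompression takes no compression argument; the type is detected
  stopifnot(Codec(x) == dna)
}
```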

Examples

library(DECIPHER) # provides Codec and loads readDNAStringSet
fas <- system.file("extdata", "Bacteria_175seqs.fas", package="DECIPHER")
dna <- as.character(readDNAStringSet(fas)) # aligned sequences
object.size(dna)

# compression
system.time(x <- Codec(dna, compression="auto"))
object.size(x)/sum(nchar(dna)) # bytes per position

system.time(g <- Codec(dna, compression="gzip"))
object.size(g)/sum(nchar(dna)) # bytes per position

# decompression
system.time(y <- Codec(x))
stopifnot(dna==y)

system.time(z <- Codec(g))
stopifnot(dna==z)