demultiplex: Demultiplex a set of reads.

Description

The function demultiplex takes a set of reads that start with a barcode and assigns those reads to a reference barcode while possibly correcting errors.

Choose the metric that matches the errors you expect: metric = "hamming" corrects substitution errors only, while metric = "seqlev" corrects insertion, deletion, and substitution errors.

Usage

demultiplex(reads, barcodes, metric = c("hamming", "seqlev", "levenshtein", "phaseshift"), cost_sub = 1, cost_indel = 1)

Arguments

reads

The reads coming from your sequencing machines that start with a barcode. For metric = "seqlev" please provide some context after the (supposed) barcode, at least as many bases as errors that you want to correct.

barcodes

The reference barcodes that you used during library preparation and that you want to correct in your reads.

metric

The distance metric used to assign reads to reference barcodes: one of "hamming", "seqlev", "levenshtein", or "phaseshift".

cost_sub

The cost weight given to a substitution.

cost_indel

The cost weight given to insertions and deletions.

Value

For each input read, the reference barcode assigned to it, that is, the corrected version of the barcode found at the start of that read.

Details

Reads are matched to their correct reference barcodes by calculating the distances between each read and each reference barcode. The reference barcode with the smallest distance to the read is assumed to be the correct original barcode of that read.

For metric = "hamming", only the first n (with n being the length of the reference barcodes) bases of the read are used for these comparisons and no bases afterwards. Reads with fewer than n bases cannot be matched.
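The nearest-barcode rule for the Hamming case can be sketched in base R. This is only an illustration of the idea described above, not the package's implementation: hamming_assign is a hypothetical helper, it assumes all reference barcodes have the same length, and it resolves ties by taking the first barcode with the smallest distance, which may differ from what demultiplex does.

```r
# Hypothetical helper (not part of the package): assign each read to the
# reference barcode with the smallest Hamming distance to its first n bases.
hamming_assign <- function(reads, barcodes) {
  n <- nchar(barcodes[1])
  stopifnot(all(nchar(barcodes) == n))           # assumes equal-length barcodes
  bc <- do.call(rbind, strsplit(barcodes, ""))   # one row per barcode
  vapply(reads, function(read) {
    if (nchar(read) < n) return(NA_character_)   # too short to be matched
    prefix <- strsplit(substr(read, 1, n), "")[[1]]
    # Count mismatching positions against every reference barcode
    d <- rowSums(bc != matrix(prefix, nrow = nrow(bc), ncol = n, byrow = TRUE))
    barcodes[which.min(d)]                       # ties: first minimum wins
  }, character(1), USE.NAMES = FALSE)
}

# "AGGA..." has one substitution relative to "AGGT"; "CT" is shorter than n
hamming_assign(c("AGGAACGCAGG", "TTCCACGCAGG", "CT"),
               c("AGGT", "TTCC", "CTGA", "GCAA"))
# "AGGT" "TTCC" NA
```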

For metric = "seqlev", the whole read is compared with the reference barcodes. The Sequence Levenshtein distance was developed specifically for barcodes in a DNA context and copes with the ambiguities that arise when insertions or deletions change the length of the barcode.
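To illustrate why trailing context matters, here is a minimal base R sketch of a Sequence Levenshtein style distance. It assumes the distance can be computed as the minimum over the last row and last column of the ordinary edit-distance matrix, so that truncating or extending the end of the sequence is free. seqlev_dist is a hypothetical name, and the way cost_sub and cost_indel are applied here is an assumption rather than the package's documented behaviour.

```r
# Hypothetical sketch (not the package's code): edit distance between a
# barcode and a read in which differences at the end of the read are free.
seqlev_dist <- function(barcode, read, cost_sub = 1, cost_indel = 1) {
  a <- strsplit(barcode, "")[[1]]
  b <- strsplit(read, "")[[1]]
  m <- length(a)
  n <- length(b)
  D <- matrix(0, nrow = m + 1, ncol = n + 1)
  D[, 1] <- (0:m) * cost_indel                   # delete barcode bases
  D[1, ] <- (0:n) * cost_indel                   # insert read bases
  for (i in seq_len(m)) {
    for (j in seq_len(n)) {
      D[i + 1, j + 1] <- min(
        D[i, j] + cost_sub * (a[i] != b[j]),     # match or substitution
        D[i, j + 1] + cost_indel,                # deletion
        D[i + 1, j] + cost_indel                 # insertion
      )
    }
  }
  # Assumption: take the best value over the last row and the last column
  min(D[m + 1, ], D[, n + 1])
}

seqlev_dist("AGGT", "AGGTACGC")    # 0: exact barcode followed by insert
seqlev_dist("AGGT", "AGTACGC")     # 1: one G deleted from the barcode
seqlev_dist("AGGT", "AGCGTACGC")   # 1: one C inserted into the barcode
```

In the deletion and insertion cases the barcode occupies three or five bases of the read instead of four, and the sketch still finds a distance of 1 because it can choose where the barcode ends.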

The Levenshtein distance (metric = "levenshtein") is largely undefined in a DNA context and should be avoided. It only works if the lengths of both the reference barcode and the barcode in the read are known, and possible insertions and deletions make the latter unknown. For this reason, the Levenshtein distance is always calculated between the whole read and the whole reference barcode, without compensating for potential side effects.
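One such side effect can be seen with utils::adist from base R, which computes a generalized Levenshtein distance. This is a rough illustration under the assumption that the whole read is compared with the whole barcode as described above; it does not call the package, and the read is made up for the example.

```r
barcodes <- c("AGGT", "TTCC")
read <- "AGGTACGCAGG"   # exact barcode "AGGT" followed by 7 insert bases

# Whole read vs. whole barcode: every trailing insert base costs one deletion
d_full <- utils::adist(read, barcodes)

# Barcode-length prefix only, for comparison
d_prefix <- utils::adist(substr(read, 1, 4), barcodes)

d_full[1, 1]     # 7, although the read starts with an error-free "AGGT"
d_prefix[1, 1]   # 0
```

The insert inflates every distance by roughly its own length, so the separation between the correct barcode and the others becomes small relative to the total.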

Examples


# Define some barcodes and inserts
barcodes <- c("AGGT", "TTCC", "CTGA", "GCAA")
insert <- 'ACGCAGGTTGCATATTTTAGGAAGTGAGGAGGAGGCACGGGCTCGAGCTGCGGCTGGGTCTGGGGCGCGG'

# Choose 10,000 barcodes at random and mutate one position in each
used_barcodes <- sample(barcodes, 10000, replace = TRUE)
mutated_barcodes <- unlist(lapply(strsplit(used_barcodes, ""), function(x) {
  pos <- sample(seq_along(x), 1)                # position to mutate
  x[pos] <- sample(c("C", "G", "A", "T"), 1)    # may redraw the original base
  paste(x, collapse = "")
}))

# The mutated set now contains barcodes outside the reference set
show(setequal(mutated_barcodes, used_barcodes)) # FALSE

# Construct reads (= barcodes + insert)
reads <- paste(mutated_barcodes, insert, sep='')

# Demultiplex
demultiplexed <- demultiplex(reads,barcodes,metric="hamming")

# Show correctness (compares the sets of barcodes, not read by read)
show(setequal(demultiplexed, used_barcodes)) # TRUE