duplicateDiscordance: Duplicate discordance

Description

Find discordance rate for duplicate sample pairs

Usage

duplicateDiscordance(gdsobj, match.samples.on="subject.id", check.phase=FALSE, verbose=TRUE)
duplicateDiscordance(gdsobj, obj2, match.samples.on=c("subject.id", "subject.id"), match.variants.on=c("alleles", "position"), discordance.type=c("genotype", "hethom"), by.variant=FALSE, verbose=TRUE)

Arguments

gdsobj

A SeqVarData object with VCF data.

obj2

A second SeqVarData object with VCF data, supplied when comparing two datasets.

match.samples.on

Character string or vector of strings indicating which column should be used for matching samples. See details.

match.variants.on

Character string of length one indicating how to match variants. See details.

discordance.type

Character string describing how discordances should be calculated. See details.

check.phase

A logical indicating whether phase should be considered when calculating discordance.

by.variant

A logical indicating whether to calculate discordance by variant (TRUE) or by sample (FALSE).

verbose

A logical indicating whether to print progress messages.

Value

For calls that involve one gds file, a list with the following elements:
by.variant: A data.frame with the number of discordances for each variant, the number of sample pairs with non-missing data, and the discordance rate (num.discord / num.pair). Row names are variant ids.
by.subject: A data.frame with the sample ids for each pair, the number of discordances, the number of non-missing variants, and the discordance rate (num.discord / num.var). Row names are subject.id (as given in samples).

For calls that involve two gds files, a data.frame with the following columns, depending on the value of by.variant:
subjectID: currently, this is the sample ID (by.variant=FALSE only)
sample.id.1/variant.id.1: sample id or variant id in the first gds file
sample.id.2/variant.id.2: sample id or variant id in the second gds file
n.variants/n.samples: the number of non-missing variants or samples that were compared
n.concordant: the number of concordant variants
n.alt: the number of variants involving the alternate allele in either sample
n.alt.conc: the number of concordant variants involving the alternate allele in either sample
n.het.ref: the number of mismatches where one call is a heterozygote and the other is a reference homozygote
n.het.alt: the number of mismatches where one call is a heterozygote and the other is an alternate homozygote
n.ref.alt: the number of mismatches where the calls are opposite homozygotes
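
The count columns above can be illustrated with a small base R sketch. It is not the package's implementation: `classify_pair` is a hypothetical helper, the dosage vectors are made up, and genotypes are assumed to be coded as alternate-allele dosage (0, 1, 2, or NA for missing).

```r
# Hypothetical helper (not part of SeqVarTools): tally the count columns
# for one sample pair, given alternate-allele dosages for each sample.
# Assumed coding: 0 = ref homozygote, 1 = heterozygote, 2 = alt homozygote.
classify_pair <- function(d1, d2) {
  ok <- !is.na(d1) & !is.na(d2)      # keep variants called in both samples
  d1 <- d1[ok]
  d2 <- d2[ok]
  lo <- pmin(d1, d2)                 # smaller dosage of the pair
  hi <- pmax(d1, d2)                 # larger dosage of the pair
  c(n.variants   = length(d1),
    n.concordant = sum(d1 == d2),
    n.alt        = sum(hi > 0),               # alt allele in either sample
    n.alt.conc   = sum(hi > 0 & d1 == d2),
    n.het.ref    = sum(lo == 0 & hi == 1),    # het vs. ref homozygote
    n.het.alt    = sum(lo == 1 & hi == 2),    # het vs. alt homozygote
    n.ref.alt    = sum(lo == 0 & hi == 2))    # opposite homozygotes
}

# Made-up dosages for two samples from the same subject
d1 <- c(0, 1, 2, 1, 0, NA, 2)
d2 <- c(0, 1, 2, 0, 2, 1, 1)
counts <- classify_pair(d1, d2)
print(counts)
```

In this sketch the three mismatch classes always sum to n.variants minus n.concordant, which is a convenient sanity check on real output.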

Details

For calls that involve only one gds file, duplicate discordance is calculated by sample pair and by variant. If there are more than two samples per subject in samples, only the first two samples are used and a warning message is printed. If check.phase=TRUE, variants with mismatched phase are considered discordant. If check.phase=FALSE, phase is ignored.

For calls that involve two gds files, duplicate discordance is calculated by matching sample pairs and variants between the two data sets. Only biallelic SNVs are considered in the comparison. Variants can be matched using chromosome and position only (match.variants.on="position") or by using chromosome, position, and alleles (match.variants.on="alleles"). If matching on alleles and the reference allele in the first dataset is the alternate allele in the second dataset, the genotype dosage will be recoded so the same allele is counted before making the comparison. If a variant in one dataset maps to multiple variants in the other dataset, only the first pair is considered for the comparison.

Discordances can be calculated using either genotypes (discordance.type = "genotype") or heterozygote/homozygote status (discordance.type = "hethom"). The latter is a method to calculate discordance that does not require alleles to be measured on the same strand in both datasets, so it is probably best to also set match.variants.on = "position" if using the "hethom" option.
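
The allele recoding and the two discordance types can be sketched in base R. This is only an illustration of the idea under assumed inputs, not the package's internal code: the dosage vectors and the `swapped` flag are made up, and dosage is assumed to count the alternate allele as labelled in each dataset.

```r
# Made-up alternate-allele dosages for one sample pair at four variants
dos1 <- c(0, 1, 2, 0)
dos2 <- c(2, 1, 1, 2)

# Assumed flag: TRUE where the reference allele in dataset 1 is the
# alternate allele in dataset 2
swapped <- c(TRUE, FALSE, FALSE, FALSE)

# Recode dataset 2 so that both datasets count the same allele
dos2_recoded <- ifelse(swapped, 2 - dos2, dos2)

# "genotype": calls are discordant when the recoded dosages differ
genotype_discordant <- dos1 != dos2_recoded

# "hethom": compare only heterozygote vs. homozygote status, which does
# not depend on which allele is counted
hethom_discordant <- (dos1 == 1) != (dos2 == 1)

print(data.frame(dos1, dos2, dos2_recoded,
                 genotype_discordant, hethom_discordant))
```

The last variant shows the difference between the two types: opposite homozygotes are discordant by genotype but concordant by heterozygote/homozygote status.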

The argument match.samples.on selects which column in the sampleData of the input SeqVarData object is used for matching samples. For one gds file, match.samples.on should be a single string. For two gds files, match.samples.on should be a length-2 vector of character strings, where the first element is the column to use for the first gds file and the second element is the column to use for the second gds file.
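
As a rough illustration of matching samples on a differently named column in each dataset, the base R sketch below joins two made-up sample tables. The tables, the column name `subj`, and the use of `merge` are assumptions for the example, not a description of how the package matches samples internally.

```r
# Made-up sample annotation for two datasets; the subject identifier is
# stored under a different column name in each
samples1 <- data.frame(sample.id = c("a1", "a2", "a3"),
                       subject.id = c("s1", "s2", "s3"),
                       stringsAsFactors = FALSE)
samples2 <- data.frame(sample.id = c("b1", "b2", "b3"),
                       subj = c("s2", "s3", "s4"),
                       stringsAsFactors = FALSE)

# First element applies to the first dataset, second to the second
match.samples.on <- c("subject.id", "subj")

# Keep only subjects present in both datasets
pairs <- merge(samples1, samples2,
               by.x = match.samples.on[1],
               by.y = match.samples.on[2],
               suffixes = c(".1", ".2"))
print(pairs)
```

Only subjects s2 and s3 appear in both tables, so the result has two rows, with the sample ids from each dataset in sample.id.1 and sample.id.2.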

To exclude certain variants or samples from the calculation, use seqSetFilter to set appropriate filters on each gds object.

Examples

library(SeqVarTools)
library(Biobase)

gds <- seqOpen(seqExampleFileName("gds"))

## the example file has one sample per subject, but we
## will match the first four samples into pairs as an example
sample.id <- seqGetData(gds, "sample.id")
samples <- AnnotatedDataFrame(data.frame(subject.id=rep(c("subj1", "subj2"), times=45),
                      sample.id=sample.id,
                      stringsAsFactors=FALSE))
seqData <- SeqVarData(gds, sampleData=samples)

# set a filter on the first four samples
seqSetFilter(seqData, sample.id=sample.id[1:4])

disc <- duplicateDiscordance(seqData)
head(disc$by.variant)
disc$by.subject
seqClose(gds)