GWASTools (version 1.18.0)

anomIdentifyLowQuality: Identify low quality samples

Description

Identifies low-quality samples for which the false positive rate for anomaly detection is likely to be high. Two measures are used: noise (high variance) and high segmentation.

Usage

anomIdentifyLowQuality(snp.annot, med.sd, seg.info, sd.thresh, sng.seg.thresh, auto.seg.thresh)

Arguments

snp.annot
SnpAnnotationDataFrame with column "eligible", where "eligible" is a logical vector indicating whether a SNP is eligible for consideration in anomaly detection (usually FALSE for HLA and XTR regions, failed SNPs, and intensity-only SNPs). See HLA and pseudoautosomal.
med.sd
data.frame of median standard deviation of BAlleleFrequency (BAF) or LogRRatio (LRR) values across autosomes for each scan, with columns "scanID" and "med.sd". Usually the result of medianSdOverAutosomes. Usually only eligible SNPs are used in these computations. In addition, for BAF, homozygous SNPs are excluded.
seg.info
data.frame with segmentation information from anomDetectBAF or anomDetectLOH. Columns must include "scanID", "chromosome", and "num.segs". (For anomDetectBAF, segmentation information is found in $seg.info from output. For anomDetectLOH, segmentation information is found in $base.info from output.)
sd.thresh
Threshold for med.sd above which a scan is identified as low quality. Suggested values are 0.1 for BAF and 0.25 for LOH.
sng.seg.thresh
Threshold for segmentation factor for a given chromosome, above which the chromosome is said to be highly segmented. See Details. Suggested values are 0.0008 for BAF and 0.0048 for LOH.
auto.seg.thresh
Threshold for segmentation factor across autosomes, above which the scan is said to be highly segmented. See Details. Suggested values are 0.0001 for BAF and 0.0006 for LOH.
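The suggested threshold values listed above can be collected in one place so that the BAF and LOH runs each pick up a consistent set. The following is only a convenience sketch: the object name suggested.thresh is hypothetical and is not part of GWASTools; the numbers are the suggested values from the argument descriptions.

```r
# Suggested thresholds from the argument descriptions, grouped by method.
# 'suggested.thresh' is a hypothetical helper object, not a GWASTools object.
suggested.thresh <- list(
  BAF = c(sd.thresh = 0.1,  sng.seg.thresh = 0.0008, auto.seg.thresh = 0.0001),
  LOH = c(sd.thresh = 0.25, sng.seg.thresh = 0.0048, auto.seg.thresh = 0.0006)
)

# Look up the set for one method
baf <- suggested.thresh$BAF
print(baf)
```

The individual values can then be passed by name, for example sd.thresh=baf[["sd.thresh"]].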

Value

A data.frame with the following columns:
scanID
integer id for the scan
chrX.num.segs
number of segments for chromosome X
chrX.fac
segmentation factor for chromosome X
max.autosome
autosome with highest single segmentation factor
max.auto.fac
segmentation factor for chromosome = max.autosome
max.auto.num.segs
number of segments for chromosome = max.autosome
num.ch.segd
number of chromosomes segmented, i.e. for which change points were found
fac.all.auto
segmentation factor across all autosomes
med.sd
median standard deviation of BAF (or LRR) values across autosomes. See med.sd in Arguments section.
type
one of the following, indicating reason for identification as low quality:
  • auto.seg: segmentation factor fac.all.auto above auto.seg.thresh but med.sd acceptable
  • sd: standard deviation factor med.sd above sd.thresh but fac.all.auto acceptable
  • both.sd.seg: both high variance and high segmentation factors, fac.all.auto and med.sd, are above respective thresholds
  • sng.seg: segmentation factor max.auto.fac is above sng.seg.thresh but other measures acceptable
  • sng.seg.X: segmentation factor chrX.fac is above sng.seg.thresh but other measures acceptable
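The type categories above can be read as a set of threshold comparisons. The sketch below illustrates that reading on made-up per-scan values; it is not the GWASTools implementation. The function name classify.low.quality and the toy data are hypothetical, and the order in which overlapping categories are resolved (for example, sng.seg taking precedence over sng.seg.X) is an assumption.

```r
# Hypothetical helper: assign a 'type' label from per-scan summary columns.
# Later assignments overwrite earlier ones, so the more general reasons
# (sd, auto.seg, both.sd.seg) take precedence over single-chromosome ones.
classify.low.quality <- function(x, sd.thresh, sng.seg.thresh, auto.seg.thresh) {
  high.sd   <- x$med.sd > sd.thresh
  high.auto <- x$fac.all.auto > auto.seg.thresh
  high.sng  <- x$max.auto.fac > sng.seg.thresh
  high.X    <- x$chrX.fac > sng.seg.thresh

  type <- rep(NA_character_, nrow(x))   # NA = not flagged as low quality
  type[high.X]              <- "sng.seg.X"
  type[high.sng]            <- "sng.seg"
  type[high.auto]           <- "auto.seg"
  type[high.sd]             <- "sd"
  type[high.sd & high.auto] <- "both.sd.seg"
  type
}

# Toy per-scan summaries (made-up values)
toy <- data.frame(
  scanID       = 1:6,
  med.sd       = c(0.05, 0.15, 0.05, 0.15, 0.05, 0.05),
  fac.all.auto = c(5e-5, 5e-5, 3e-4, 3e-4, 5e-5, 5e-5),
  max.auto.fac = c(3e-4, 3e-4, 3e-4, 3e-4, 1e-3, 3e-4),
  chrX.fac     = c(2e-4, 2e-4, 2e-4, 2e-4, 2e-4, 2e-3)
)

# Thresholds are the suggested BAF values
toy$type <- classify.low.quality(toy, sd.thresh = 0.1,
                                 sng.seg.thresh = 0.0008,
                                 auto.seg.thresh = 0.0001)
print(toy[, c("scanID", "type")])
```

In this sketch, scans that pass every check are left as NA; which scans appear in the data.frame returned by the actual function should be checked against its output.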

Details

Low quality samples are determined separately with regard to each of the two methods of segmentation, anomDetectBAF and anomDetectLOH. BAF anomalies (respectively LOH anomalies) found for samples identified as low quality for BAF (respectively LOH) tend to have a high false positive rate.

A scan is identified as low quality due to high variance (noise) if its med.sd is above the threshold sd.thresh.

High segmentation is often an indication of artifactual patterns in the B Allele Frequency (BAF) or Log R Ratio (LRR) values that are not always captured by high variance. Here segmentation information is determined by anomDetectBAF or anomDetectLOH, which use circular binary segmentation as implemented in the R package DNAcopy. The measure for high segmentation is a "segmentation factor" = (number of segments)/(number of eligible SNPs). A single-chromosome segmentation factor uses information for one chromosome. A segmentation factor across autosomes uses the total number of segments and eligible SNPs across all autosomes. See med.sd, sd.thresh, sng.seg.thresh, and auto.seg.thresh.
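As a worked illustration of the segmentation factor definition above, the following base R sketch computes single-chromosome and autosome-wide factors from made-up inputs. The segment counts and eligible SNP counts are hypothetical, and the code is a plain reading of the formula rather than the function's internal code; chromosome X is assumed to be coded as 23.

```r
# Toy segmentation summary in the shape of seg.info (made-up values)
seg.info <- data.frame(
  scanID     = c(1, 1, 1, 2, 2, 2),
  chromosome = c(1, 2, 23, 1, 2, 23),   # 23 assumed to denote X
  num.segs   = c(1, 3, 1, 12, 9, 2)
)

# Hypothetical eligible SNP counts per chromosome; in practice these would
# be tabulated from the "eligible" column of the SNP annotation.
n.eligible <- c("1" = 10000, "2" = 9000, "23" = 4000)

# Single-chromosome factor: segments / eligible SNPs on that chromosome
seg.info$fac <- seg.info$num.segs /
  unname(n.eligible[as.character(seg.info$chromosome)])

# Autosome-wide factor: total segments / total eligible SNPs over autosomes
auto <- seg.info[seg.info$chromosome <= 22, ]
total.segs <- tapply(auto$num.segs, auto$scanID, sum)
total.snps <- sum(n.eligible[as.numeric(names(n.eligible)) <= 22])
fac.all.auto <- total.segs / total.snps

print(seg.info)
print(fac.all.auto)
```

With the suggested BAF thresholds, scan 2 in this toy data exceeds both sng.seg.thresh (0.0012 on chromosome 1 versus 0.0008) and auto.seg.thresh (21/19000 versus 0.0001).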

See Also

findBAFvariance, anomDetectBAF, anomDetectLOH

Examples

library(GWASdata)
data(illuminaScanADF, illuminaSnpADF)

blfile <- system.file("extdata", "illumina_bl.gds", package="GWASdata")
bl <- GdsIntensityReader(blfile)
blData <-  IntensityData(bl, scanAnnot=illuminaScanADF, snpAnnot=illuminaSnpADF)

genofile <- system.file("extdata", "illumina_geno.gds", package="GWASdata")
geno <- GdsGenotypeReader(genofile)
genoData <-  GenotypeData(geno, scanAnnot=illuminaScanADF, snpAnnot=illuminaSnpADF)

# initial scan for low quality with median SD
baf.sd <- sdByScanChromWindow(blData, genoData)
med.baf.sd <- medianSdOverAutosomes(baf.sd)
low.qual.ids <- med.baf.sd$scanID[med.baf.sd$med.sd > 0.05]

# segment and filter BAF
scan.ids <- illuminaScanADF$scanID[1:2]
chrom.ids <- unique(illuminaSnpADF$chromosome)
snp.ids <- illuminaSnpADF$snpID[illuminaSnpADF$missing.n1 < 1]
data(centromeres.hg18)
anom <- anomDetectBAF(blData, genoData, scan.ids=scan.ids, chrom.ids=chrom.ids,
  snp.ids=snp.ids, centromere=centromeres.hg18, low.qual.ids=low.qual.ids)

# further screen for low quality scans
snp.annot <- illuminaSnpADF
snp.annot$eligible <- snp.annot$missing.n1 < 1
low.qual <- anomIdentifyLowQuality(snp.annot, med.baf.sd, anom$seg.info,
  sd.thresh=0.1, sng.seg.thresh=0.0008, auto.seg.thresh=0.0001)

close(blData)
close(genoData)