anomDetectBAF: BAF Method for Chromosome Anomaly Detection

Description

anomSegmentBAF for each sample and chromosome, breaks the chromosome up into segments marked by change points of a metric based on B Allele Frequency (BAF) values.

anomFilterBAF selects segments which are likely to be anomalous.

anomDetectBAF is a wrapper to run anomSegmentBAF and anomFilterBAF in one step.

Usage

anomSegmentBAF(intenData, genoData, scan.ids, chrom.ids, snp.ids, smooth = 50, min.width = 5, nperm = 10000, alpha = 0.001, verbose = TRUE)
anomFilterBAF(intenData, genoData, segments, snp.ids, centromere, low.qual.ids = NULL, num.mark.thresh = 15, long.num.mark.thresh = 200, sd.reg = 2, sd.long = 1, low.frac.used = 0.1, run.size = 10, inter.size = 2, low.frac.used.num.mark = 30, very.low.frac.used = 0.01,  low.qual.frac.num.mark = 150, lrr.cut = -2, ct.thresh = 10, frac.thresh = 0.1, verbose=TRUE, small.thresh=2.5, dev.sim.thresh=0.1, centSpan.fac=1.25, centSpan.nmark=50)
anomDetectBAF(intenData, genoData, scan.ids, chrom.ids, snp.ids, centromere, low.qual.ids = NULL, ...)

Arguments

intenData

An IntensityData object containing the B Allele Frequency. The order of the rows of intenData and the snp annotation are expected to be by chromosome and then by position within chromosome. The scan annotation should contain sex, coded as "M" for male and "F" for female.

genoData

A GenotypeData object. The order of the rows of genoData and the snp annotation are expected to be by chromosome and then by position within chromosome.

scan.ids

vector of scan ids (sample numbers) to process

chrom.ids

vector of (unique) chromosomes to process. Should correspond to integer chromosome codes in intenData. Recommended to include all autosomes, and optionally X (males will be ignored) and the pseudoautosomal (XY) region.

snp.ids

vector of eligible snp ids. Usually exclude failed and intensity-only SNPs. Also recommended to exclude an HLA region on chromosome 6 and XTR region on X chromosome. See HLA and pseudoautosomal. If there are SNPs annotated in the centromere gap, exclude these as well (see centromeres).

smooth

number of markers for smoothing region. See smooth.CNA in the DNAcopy package.

min.width

minimum number of markers for a segment. See segment in the DNAcopy package.

nperm

number of permutations for deciding significance in segmentation. See segment in the DNAcopy package.

alpha

significance level. See segment in the DNAcopy package.

verbose

logical indicator whether to print information about the scan id currently being processed. anomSegmentBAF prints each scan id; anomFilterBAF prints a message after every 10 samples: "processing ith scan id out of n" where "ith" with be 10, 10, etc. and "n" is the total number of samples

segments

data.frame of segments from anomSegmentBAF. Names must include "scanID", "chromosome", "num.mark", "left.index", "right.index", "seg.mean". Here "left.index" and "right.index" are row indices of intenData. Left and right refer to start and end of anomaly,respectively, in position order.

centromere

data.frame with centromere position information. Names must include "chrom", "left.base", "right.base". Valid values for "chrom" are 1:22, "X", "Y", "XY". Here "left.base" and "right.base" are base positions of start and end of centromere location in position order. Centromere data tables are provided in centromeres.

low.qual.ids

scan ids determined to be low quality for which some segments are filtered based on more stringent criteria. Default is NULL. Usual choice are scan ids for which median BAF across autosomes > 0.05. See sdByScanChromWindow and medianSdOverAutosomes.

num.mark.thresh

minimum number of SNP markers in a segment to be considered for anomaly

long.num.mark.thresh

min number of markers for "long" segment to be considered for anomaly for which significance threshold criterion is allowed to be less stringent

sd.reg

number of baseline standard deviations of segment mean from a baseline mean for "normal" needed to declare segment anomalous. This number is given by abs(mean of segment - baseline mean)/(baseline standard deviation)

sd.long

same meaning as sd.reg but applied to "long" segments

low.frac.used

if fraction of heterozygous or missing SNP markers compared with number of eligible SNP markers in segment is below this, more stringent criteria are applied to declare them anomalous.

run.size

min length of run of missing or heterozygous SNP markers for possible determination of homozygous deletions

inter.size

number of homozygotes allowed to "interrupt" run for possible determination of homozygous deletions

low.frac.used.num.mark

number of markers threshold for low.frac.used segments (which are not declared homozygous deletions

very.low.frac.used

any segments with (num.mark)/(number of markers in interval) less than this are filtered out since they tend to be false positives

low.qual.frac.num.mark

minimum num.mark threshold for low quality scans (low.qual.ids) for segments that are also below low.frac.used threshold

lrr.cut

look for runs of LRR values below lrr.cut to adjust homozygous deletion endpoints

ct.thresh

minimum number of LRR values below lrr.cut needed in order to adjust

frac.thresh

investigate interval for homozygous deletion only if lrr.cut and ct.thresh thresholds met and (# LRR values below lrr.cut)/(# eligible SNPs in segment) > frac.thresh

small.thresh

sd.fac threshold use in making merge decisions involving small num.mark segments

dev.sim.thresh

relative error threshold for determining similarity in BAF deviations; used in merge decisions

centSpan.fac

thresholds increased by this factor when considering filtering/keeping together left and right halves of centromere spanning segments

centSpan.nmark

minimum number of markers under which centromere spanning segments are automatically filtered out

...

arguments to pass to anomFilterBAF

Value

scanID

anomSegmentBAFinteger id of scan

chromosome

chromosome as integer code

left.index

row index of intenData indicating left endpoint of segment

right.index

row index of intenData indicating right endpoint of segment

num.mark

number of heterozygous or missing SNPs in the segment

seg.mean

mean of the BAF metric over the segment

raw

anomFilterBAFanomDetectBAFdata.frame of raw segmentation data, with same output as anomSegmentBAF as well as:

left.base: base position of left endpoint of segment
right.base: base position of right endpoint of segment
sex: sex of scan.id coded as "M" or "F"
sd.fac: measure of deviation from baseline equal to abs(mean of segment - baseline mean)/(baseline standard deviation); used in determining anomalous segments

filtered

data.frame of the segments identified as anomalies, with the same columns as raw as well as:

merge: TRUE if segment was a result of merging. Consecutive segments from output of anomSegmentBAF that meet certain criteria are merged.
homodel.adjust: TRUE if original segment was adjusted to narrow in on a homozygous deletion
frac.used: fraction of (eligible) heterozygous or missing SNP markers compared with total number of eligible SNP markers in segment

base.info

data frame with columns:

scanID: integer id of scan
base.mean: mean of non-anomalous baseline. This is the mean of the BAF metric for heterozygous and missing SNPs over all unsegmented autosomes that were considered.
base.sd: standard deviation of non-anomalous baseline
chr.ct: number of unsegmented chromosomes used in determining the non-anomalous baseline

seg.info

data frame with columns:

scanID: integer id of scan
chromosome: chromosome as integer
num.segs: number of segments produced by anomSegmentBAF

Details

anomSegmentBAF uses the function segment from the DNAcopy package to perform circular binary segmentation on a metric based on BAF values. The metric for a given sample/chromosome is sqrt(min(BAF,1-BAF,abs(BAF-median(BAF))) where the median is across BAF values on the chromosome. Only BAF values for heterozygous or missing SNPs are used.

anomFilterBAF determines anomalous segments based on a combination of thresholds for number of SNP markers in the segment and on deviation from a "normal" baseline. (See num.mark.thresh,long.num.mark.thresh, sd.reg, and sd.long.) The "normal" baseline metric mean and standard deviation are found across all autosomes not segmented by anomSegmentBAF. This is why it is recommended to include all autosomes for the argument chrom.ids to ensure a more accurate baseline.

Some initial filtering is done, including possible merging of consecutive segments meeting sd.reg threshold along with other criteria (such as not spanning the centromere) and adjustment for accurate break points for possible homozygous deletions (see lrr.cut, ct.thresh, frac.thresh, run.size, and inter.size). Male samples for X chromosome are not processed.

More stringent criteria are applied to some segments (see low.frac.used,low.frac.used.num.mark, very.low.frac.used, low.qual.ids, and low.qual.frac.num.mark).

anomDetectBAF runs anomSegmentBAF with default values and then runs anomFilterBAF. Additional parameters for anomFilterBAF may be passed as arguments.

References

See references in segment in the package DNAcopy. The BAF metric used is modified from Itsara,A., et.al (2009) Population Analysis of Large Copy Number Variants and Hotspots of Human Genetic Disease. American Journal of Human Genetics, 84, 148--161.

Examples

Run this code

library(GWASdata)
data(illuminaScanADF, illuminaSnpADF)

blfile <- system.file("extdata", "illumina_bl.gds", package="GWASdata")
bl <- GdsIntensityReader(blfile)
blData <-  IntensityData(bl, scanAnnot=illuminaScanADF, snpAnnot=illuminaSnpADF)

genofile <- system.file("extdata", "illumina_geno.gds", package="GWASdata")
geno <- GdsGenotypeReader(genofile)
genoData <-  GenotypeData(geno, scanAnnot=illuminaScanADF, snpAnnot=illuminaSnpADF)

# segment BAF
scan.ids <- illuminaScanADF$scanID[1:2]
chrom.ids <- unique(illuminaSnpADF$chromosome)
snp.ids <- illuminaSnpADF$snpID[illuminaSnpADF$missing.n1 < 1]
seg <- anomSegmentBAF(blData, genoData, scan.ids=scan.ids,
                      chrom.ids=chrom.ids, snp.ids=snp.ids)

# filter segments to detect anomalies
data(centromeres.hg18)
filt <- anomFilterBAF(blData, genoData, segments=seg, snp.ids=snp.ids,
                      centromere=centromeres.hg18)

# alternatively, run both steps at once
anom <- anomDetectBAF(blData, genoData, scan.ids=scan.ids, chrom.ids=chrom.ids,
                      snp.ids=snp.ids, centromere=centromeres.hg18)

close(blData)
close(genoData)

Run the code above in your browser using DataLab