Quality control is a critical step when working with GWAS summary statistics. Processing and quality control of summary statistics include the following steps:
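- map marker ids (rsids or cpra, i.e. chr, pos, ref, alt) to the LD reference panel data
- check the effect allele against the reference and, where the alleles are swapped, flip the effect allele (EA), effect allele frequency (EAF) and sign of the effect
- check the effect allele frequency against the reference allele frequency
- apply thresholds for minor allele frequency (MAF) and Hardy-Weinberg equilibrium (HWE)
- exclude INDELs, ambiguous CG/AT markers and markers in the MHC region
- remove duplicated marker ids
- check the genome build version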
- check for concordance between marker effect and LD data
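The effect allele check in the list above can be sketched in base R. This is a minimal illustration on made-up data, not the code used inside qcStat: the data frames `ref` and `stat` and their contents are hypothetical, and only the column names (rsids, a1, a2, af, b) are taken from the internal format described on this page. Markers whose alleles are swapped relative to the reference get their alleles exchanged, their frequency mirrored and their effect negated; markers that match in neither orientation are dropped.

```r
# Hypothetical reference panel alleles and summary statistics (toy data)
ref <- data.frame(rsids = c("rs1", "rs2", "rs3"),
                  a1 = c("A", "C", "G"),
                  a2 = c("G", "T", "T"),
                  stringsAsFactors = FALSE)
stat <- data.frame(rsids = c("rs1", "rs2", "rs3"),
                   a1 = c("A", "T", "G"),
                   a2 = c("G", "C", "A"),
                   af = c(0.20, 0.70, 0.40),
                   b = c(0.10, -0.05, 0.02),
                   stringsAsFactors = FALSE)

# Look up each marker of the summary statistics in the reference
idx <- match(stat$rsids, ref$rsids)
same <- stat$a1 == ref$a1[idx] & stat$a2 == ref$a2[idx]
swapped <- stat$a1 == ref$a2[idx] & stat$a2 == ref$a1[idx]

# Flip swapped markers: exchange alleles, mirror frequency, negate effect
aligned <- stat
aligned$a1[swapped] <- stat$a2[swapped]
aligned$a2[swapped] <- stat$a1[swapped]
aligned$af[swapped] <- 1 - stat$af[swapped]
aligned$b[swapped] <- -stat$b[swapped]

# Keep only markers that match the reference in one of the two orientations
aligned <- aligned[same | swapped, ]
```

In this toy example rs1 is already aligned, rs2 is flipped, and rs3 is removed because its alleles do not match the reference in either orientation. Strand flips (complementary alleles) are not handled in this sketch.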
Required headers for external summary statistics: marker, chr, pos, effect_allele, non_effect_allele, effect_allele_freq, effect, effect_se, stat, p, n
Required headers for internal summary statistics: rsids, chr, pos, a1, a2, af, b, seb, stat, p, n
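As an illustration of how the two header sets relate, the sketch below renames external headers to internal ones in base R. It assumes the two lists above correspond position by position (marker to rsids, effect_allele to a1, and so on); that correspondence is an assumption made here, and `header_map`, `to_internal` and the example data frame `ext` are hypothetical names that are not part of the package.

```r
# Assumed one-to-one mapping: external header name -> internal header name
header_map <- c(marker = "rsids", chr = "chr", pos = "pos",
                effect_allele = "a1", non_effect_allele = "a2",
                effect_allele_freq = "af", effect = "b", effect_se = "seb",
                stat = "stat", p = "p", n = "n")

# Hypothetical helper: check required headers, then reorder and rename columns
to_internal <- function(ext) {
  absent <- setdiff(names(header_map), names(ext))
  if (length(absent) > 0) {
    stop("missing required headers: ", paste(absent, collapse = ", "))
  }
  out <- ext[, names(header_map), drop = FALSE]
  names(out) <- unname(header_map)
  out
}

# Toy external summary statistics with a single marker
ext <- data.frame(marker = "rs1", chr = 1, pos = 12345,
                  effect_allele = "A", non_effect_allele = "G",
                  effect_allele_freq = 0.2, effect = 0.1, effect_se = 0.02,
                  stat = 5, p = 5.7e-7, n = 10000,
                  stringsAsFactors = FALSE)
internal <- to_internal(ext)
```

Stopping early on a missing header keeps a malformed input from failing later with a less informative error.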
qcStat(
Glist = NULL,
stat = NULL,
excludeMAF = 0.01,
excludeMAFDIFF = 0.05,
excludeINFO = 0.8,
excludeCGAT = TRUE,
excludeINDEL = TRUE,
excludeDUPS = TRUE,
excludeMHC = FALSE,
excludeMISS = 0.05,
excludeHWE = 1e-12
)
Glist: list of information about genotype matrix stored on disk
stat: data frame with marker summary statistics (see required format above)
excludeMAF: exclude marker if minor allele frequency (MAF) is below threshold (0.01 is default)
excludeMAFDIFF: exclude marker if minor allele frequency difference (MAFDIFF) between Glist$af and stat$af is above threshold (0.05 is default)
excludeINFO: exclude marker if info score (INFO) is below threshold (0.8 is default)
excludeCGAT: exclude marker if alleles are ambiguous (CG or AT) (TRUE is default)
excludeINDEL: exclude marker if it is an insertion/deletion (TRUE is default)
excludeDUPS: exclude marker id if duplicated (TRUE is default)
excludeMHC: exclude marker if located in MHC region (FALSE is default)
excludeMISS: exclude marker if sample missingness (MISS) is above threshold (0.05 is default)
excludeHWE: exclude marker if p-value for Hardy-Weinberg equilibrium test is below threshold (1e-12 is default)
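Some of the exclusion rules described above can be expressed directly in base R. The sketch below is illustrative only and is not the implementation inside qcStat: the data frame `stat` is made up, the thresholds simply reuse the defaults listed above, and dropping every copy of a duplicated marker id (rather than keeping the first) is an assumption made here.

```r
# Toy summary statistics in the internal format (hypothetical values)
stat <- data.frame(
  rsids = c("rs1", "rs2", "rs3", "rs4", "rs4", "rs5"),
  a1 = c("A", "C", "A", "AT", "G", "G"),
  a2 = c("G", "G", "T", "A", "T", "A"),
  af = c(0.20, 0.30, 0.45, 0.25, 0.10, 0.995),
  stringsAsFactors = FALSE)

# MAF filter: minor allele frequency below 0.01
maf <- pmin(stat$af, 1 - stat$af)
low_maf <- maf < 0.01

# Ambiguous markers: allele pairs that are their own strand complement
cgat <- paste0(stat$a1, stat$a2) %in% c("CG", "GC", "AT", "TA")

# INDELs: any allele longer or shorter than a single base
indel <- nchar(stat$a1) != 1 | nchar(stat$a2) != 1

# Duplicates: assumption here is that all copies of a repeated id are dropped
dups <- stat$rsids %in% stat$rsids[duplicated(stat$rsids)]

keep <- !(low_maf | cgat | indel | dups)
clean <- stat[keep, ]
```

Only rs1 survives in this toy example: rs2 and rs3 are ambiguous, both rs4 rows are duplicates (one is also an INDEL), and rs5 falls below the MAF threshold. The INFO, MISS, HWE, MAFDIFF and MHC filters need information that is not in this toy data frame and are left out.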
Peter Soerensen