Quality control is a critical step when working with GWAS summary statistics, particularly those obtained from external sources. Processing and quality control of GWAS summary statistics includes:
- map marker ids (rsids or cpra: chr, pos, ref, alt) to the LD reference panel data
- check the effect allele (flip EA, EAF and effect where needed)
- check the effect allele frequency
- apply thresholds for MAF and HWE
- exclude INDELs, ambiguous CG/AT markers and the MHC region
- remove duplicated marker ids
- check the genome build version
- check for concordance between marker effects and LD data
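The mapping and effect-allele steps listed above can be illustrated with a minimal base R sketch. It is not the code used inside mapStat; the toy data and the column names (rsids, ea, nea, eaf, b for the summary statistics, a1 and a2 for the reference panel) are assumptions made for illustration, and strand flips are not handled.

```r
# Toy summary statistics (hypothetical column names)
stat <- data.frame(
  rsids = c("rs1", "rs2", "rs3", "rs4"),
  ea    = c("A", "G", "T", "C"),         # effect allele
  nea   = c("G", "A", "C", "A"),         # non-effect allele
  eaf   = c(0.20, 0.70, 0.40, 0.10),     # effect allele frequency
  b     = c(0.05, -0.10, 0.02, 0.30),    # marker effect
  stringsAsFactors = FALSE
)

# Toy reference panel: a1 is the allele counted in the panel, a2 the other allele
ref <- data.frame(
  rsids = c("rs1", "rs2", "rs3", "rs5"),
  a1    = c("A", "A", "T", "G"),
  a2    = c("G", "G", "G", "T"),
  stringsAsFactors = FALSE
)

# Map marker ids: keep only markers that are present in the reference panel
idx   <- match(stat$rsids, ref$rsids)
stat  <- stat[!is.na(idx), ]
ref_m <- ref[idx[!is.na(idx)], ]

# Classify each mapped marker by allele orientation
aligned <- stat$ea == ref_m$a1 & stat$nea == ref_m$a2
flipped <- stat$ea == ref_m$a2 & stat$nea == ref_m$a1

# Flip EA, EAF and effect for markers coded on the opposite allele
stat$b[flipped]   <- -stat$b[flipped]
stat$eaf[flipped] <- 1 - stat$eaf[flipped]
tmp               <- stat$ea[flipped]
stat$ea[flipped]  <- stat$nea[flipped]
stat$nea[flipped] <- tmp

# Drop markers whose alleles match the panel in neither orientation
stat <- stat[aligned | flipped, ]
```

In this toy example rs4 is dropped because it is absent from the panel, rs3 is dropped because its alleles do not match, and rs2 has its effect sign, frequency and alleles flipped.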
mapStat(
Glist = NULL,
stat = NULL,
excludeMAF = 0.01,
excludeMAFDIFF = 0.05,
excludeINFO = 0.8,
excludeCGAT = TRUE,
excludeINDEL = TRUE,
excludeDUPS = TRUE,
excludeMHC = FALSE,
excludeMISS = 0.05,
excludeHWE = 1e-12
)
Glist: list of information about genotype matrix stored on disk
stat: dataframe with marker summary statistics
excludeMAF: exclude marker if minor allele frequency (MAF) is below threshold (0.01 is default)
excludeMAFDIFF: exclude marker if minor allele frequency difference (MAFDIFF) between Glist$af and stat$af is above threshold (0.05 is default)
excludeINFO: exclude marker if info score (INFO) is below threshold (0.8 is default)
excludeCGAT: exclude marker if alleles are ambiguous (CG or AT)
excludeINDEL: exclude marker if it is an insertion/deletion (INDEL)
excludeDUPS: exclude marker id if duplicated
excludeMHC: exclude marker if located in MHC region
excludeMISS: exclude marker if missingness (MISS) is above threshold (0.05 is default)
excludeHWE: exclude marker if p-value for Hardy-Weinberg Equilibrium test is below threshold (1e-12 is default)
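To make the exclusion rules above concrete, here is a small base R sketch that applies the MAF, MAFDIFF, INFO, CG/AT, INDEL and duplicate filters to a toy data frame. It only illustrates the criteria and is not the internal implementation of mapStat. The column names (eaf, af_ref, info) are hypothetical, and removing every copy of a duplicated id is an assumption. The MISS, HWE and MHC filters rely on genotype or position information and are not shown.

```r
# Toy summary statistics with hypothetical column names
stat <- data.frame(
  rsids  = c("rs1", "rs2", "rs3", "rs3", "rs4", "rs5", "rs6", "rs7", "rs8"),
  ea     = c("A", "C", "G", "G", "A", "AT", "G", "A", "T"),
  nea    = c("G", "T", "T", "T", "T", "A", "A", "C", "C"),
  eaf    = c(0.30, 0.45, 0.15, 0.15, 0.20, 0.25, 0.005, 0.40, 0.30),
  af_ref = c(0.32, 0.44, 0.15, 0.15, 0.21, 0.25, 0.004, 0.41, 0.45),
  info   = c(0.95, 0.90, 0.90, 0.90, 0.99, 0.97, 0.92, 0.60, 0.90),
  stringsAsFactors = FALSE
)

# Thresholds mirroring the defaults in the usage section
excludeMAF     <- 0.01
excludeMAFDIFF <- 0.05
excludeINFO    <- 0.8

# Minor allele frequency below threshold
maf    <- pmin(stat$eaf, 1 - stat$eaf)
is_maf <- maf < excludeMAF

# Frequency difference between summary statistics and reference above threshold
is_mafdiff <- abs(stat$eaf - stat$af_ref) > excludeMAFDIFF

# Imputation info score below threshold
is_info <- stat$info < excludeINFO

# Ambiguous markers: the two alleles are complementary (A/T or C/G)
is_cgat <- paste0(stat$ea, stat$nea) %in% c("AT", "TA", "CG", "GC")

# Insertions/deletions: either allele is longer than one base
is_indel <- nchar(stat$ea) > 1 | nchar(stat$nea) > 1

# Duplicated marker ids (assumption: every copy is removed)
is_dup <- duplicated(stat$rsids) | duplicated(stat$rsids, fromLast = TRUE)

# Keep markers that pass all filters
exclude <- is_maf | is_mafdiff | is_info | is_cgat | is_indel | is_dup
stat_qc <- stat[!exclude, ]
```

Keeping one logical vector per criterion makes it easy to tabulate how many markers each filter removes before they are combined.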
Peter Soerensen