Quality control is a critical step when working with summary statistics, particularly those obtained from external sources. Processing and quality control of GWAS summary statistics includes:
- map marker ids (rsids or cpra: chr, pos, ref, alt) to LD reference panel data
- check effect allele (flip EA, EAF, Effect)
- check effect allele frequency
- thresholds for MAF and HWE
- exclude INDELs, CG/AT and MHC region
- remove duplicated marker ids
- check which build version
- check for concordance between marker effect and LD data
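Several of the steps above can be sketched in base R on a toy data set. This is only an illustration and not the code used by gfilter: the data values, the reference panel table `ref`, and the choice to drop every copy of a duplicated marker id are assumptions made for the example.

```r
# Toy external summary statistics (values are made up for illustration).
stat <- data.frame(
  marker = c("rs1", "rs2", "rs3", "rs3", "rs4", "rs5"),
  effect_allele = c("A", "A", "C", "C", "AT", "G"),
  non_effect_allele = c("G", "T", "T", "T", "A", "A"),
  effect_allele_freq = c(0.20, 0.40, 0.10, 0.10, 0.30, 0.70),
  effect = c(0.05, -0.02, 0.10, 0.10, 0.01, -0.03),
  stringsAsFactors = FALSE
)

# Toy reference panel alleles (hypothetical; a1 is the allele the panel counts).
ref <- data.frame(
  marker = c("rs1", "rs2", "rs3", "rs4", "rs5"),
  a1 = c("A", "A", "C", "AT", "A"),
  a2 = c("G", "T", "T", "A", "G"),
  stringsAsFactors = FALSE
)

# 1. Remove duplicated marker ids (assumption: drop every copy).
dup_ids <- stat$marker[duplicated(stat$marker)]
stat <- stat[!stat$marker %in% dup_ids, ]

# 2. Exclude INDELs: keep markers where both alleles are a single base.
is_snp <- nchar(stat$effect_allele) == 1 & nchar(stat$non_effect_allele) == 1
stat <- stat[is_snp, ]

# 3. Exclude strand-ambiguous allele pairs (A/T and C/G).
pair <- paste0(stat$effect_allele, stat$non_effect_allele)
stat <- stat[!pair %in% c("AT", "TA", "CG", "GC"), ]

# 4. Map markers to the reference panel and drop those not found in it.
idx <- match(stat$marker, ref$marker)
stat <- stat[!is.na(idx), ]
idx <- idx[!is.na(idx)]

# 5. Check the effect allele: flip EA, EAF and Effect where the alleles are swapped.
same <- stat$effect_allele == ref$a1[idx] & stat$non_effect_allele == ref$a2[idx]
flip <- stat$effect_allele == ref$a2[idx] & stat$non_effect_allele == ref$a1[idx]
stat$effect[flip] <- -stat$effect[flip]
stat$effect_allele_freq[flip] <- 1 - stat$effect_allele_freq[flip]
swapped <- stat$effect_allele[flip]
stat$effect_allele[flip] <- stat$non_effect_allele[flip]
stat$non_effect_allele[flip] <- swapped

# Keep only markers whose alleles agree with the panel, directly or after flipping.
stat <- stat[same | flip, ]
print(stat)
```

In this toy example rs3 is dropped as a duplicate, rs4 as an INDEL and rs2 as an A/T marker, while rs5 is flipped so that its effect allele matches the reference panel.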
External summary statistics format: marker, chr, pos, effect_allele, non_effect_allele, effect_allele_freq, effect, effect_se, stat, p, n
Internal summary statistics format: rsids, chr, pos, a1, a2, af, b, seb, stat, p, n
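As a minimal sketch, the two formats can be related by a simple column rename. The mapping below is inferred from the order in which the column names are listed above and should be treated as an assumption; `col_map` and `to_internal` are hypothetical names introduced for this example and are not part of the package.

```r
# Assumed mapping: external column name -> internal column name.
col_map <- c(
  marker = "rsids", chr = "chr", pos = "pos",
  effect_allele = "a1", non_effect_allele = "a2",
  effect_allele_freq = "af", effect = "b", effect_se = "seb",
  stat = "stat", p = "p", n = "n"
)

# Hypothetical helper: check that all external columns exist, then rename them.
to_internal <- function(external) {
  missing_cols <- setdiff(names(col_map), names(external))
  if (length(missing_cols) > 0) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  }
  out <- external[, names(col_map), drop = FALSE]
  names(out) <- unname(col_map)
  out
}

# One made-up marker in the external format.
external <- data.frame(
  marker = "rs1", chr = 1, pos = 12345,
  effect_allele = "A", non_effect_allele = "G",
  effect_allele_freq = 0.2, effect = 0.05, effect_se = 0.01,
  stat = 5, p = 5.7e-7, n = 10000,
  stringsAsFactors = FALSE
)

internal <- to_internal(external)
print(names(internal))
```

Failing early on missing columns makes it easier to spot external files that use a different header convention before any filtering is applied.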
gfilter(
Glist = NULL,
excludeMAF = 0.01,
excludeMISS = 0.05,
excludeINFO = NULL,
excludeCGAT = TRUE,
excludeINDEL = TRUE,
excludeDUPS = TRUE,
excludeHWE = 1e-12,
excludeMHC = FALSE,
assembly = "GRCh37"
)
Glist: A list containing information about the genotype matrix stored on disk.
excludeMAF: A scalar threshold. Exclude markers with a minor allele frequency (MAF) below this threshold. Default is 0.01.
excludeMISS: A scalar threshold. Exclude markers with missingness (MISS) above this threshold. Default is 0.05.
excludeINFO: A scalar threshold. Exclude markers with an info score (INFO) below this threshold. Default is NULL.
excludeCGAT: A logical value; if TRUE, exclude markers whose alleles are strand-ambiguous (i.e., C/G or A/T combinations). Default is TRUE.
excludeINDEL: A logical value; if TRUE, exclude markers that are insertions or deletions (INDELs). Default is TRUE.
excludeDUPS: A logical value; if TRUE, exclude markers whose identifiers are duplicated. Default is TRUE.
excludeHWE: A scalar threshold. Exclude markers where the p-value for the Hardy-Weinberg Equilibrium test is below this threshold. Default is 1e-12.
excludeMHC: A logical value; if TRUE, exclude markers located within the MHC region. Default is FALSE.
assembly: A character string indicating the name of the genome assembly (e.g., "GRCh38"). Default is "GRCh37".
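The direction of each threshold (below or above) can be illustrated with a small base R sketch. It is not the gfilter implementation: the `markers` table, its column names, and the helper `filter_markers` are hypothetical, and treating a NULL threshold as "skip this filter" is an assumption made for the example.

```r
# Toy per-marker statistics (made-up values).
markers <- data.frame(
  rsids = c("rs1", "rs2", "rs3", "rs4"),
  maf   = c(0.200, 0.005, 0.300, 0.150),  # minor allele frequency
  miss  = c(0.01, 0.02, 0.10, 0.00),      # proportion of missing genotypes
  hwe   = c(0.50, 0.30, 0.20, 1e-15),     # HWE test p-value
  stringsAsFactors = FALSE
)

# Hypothetical helper mirroring three of the thresholds described above.
# Assumption: a NULL threshold means the corresponding filter is skipped.
filter_markers <- function(markers, excludeMAF = 0.01,
                           excludeMISS = 0.05, excludeHWE = 1e-12) {
  keep <- rep(TRUE, nrow(markers))
  # MAF below the threshold is excluded.
  if (!is.null(excludeMAF)) keep <- keep & markers$maf >= excludeMAF
  # Missingness above the threshold is excluded.
  if (!is.null(excludeMISS)) keep <- keep & markers$miss <= excludeMISS
  # HWE p-value below the threshold is excluded.
  if (!is.null(excludeHWE)) keep <- keep & markers$hwe >= excludeHWE
  markers$rsids[keep]
}

kept_default <- filter_markers(markers)
kept_no_hwe <- filter_markers(markers, excludeHWE = NULL)
print(kept_default)
print(kept_no_hwe)
```

With the default thresholds only rs1 passes: rs2 fails on MAF, rs3 on missingness and rs4 on the HWE p-value. Setting `excludeHWE = NULL` in this sketch retains rs4 as well.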
Peter Soerensen