plinkQC (version 0.2.2)

perIndividualQC: Quality control for all individuals in plink-dataset

Description

perIndividualQC checks the samples in the plink dataset for their total missingness and heterozygosity rates, the concordance of their assigned sex to their SNP sex, their relatedness to other study individuals and their genetic ancestry.

Usage

perIndividualQC(indir, name, qcdir = indir, dont.check_sex = FALSE,
  do.run_check_sex = TRUE, do.evaluate_check_sex = TRUE,
  maleTh = 0.8, femaleTh = 0.2, externalSex = NULL,
  externalMale = "M", externalSexSex = "Sex", externalSexID = "IID",
  externalFemale = "F", fixMixup = FALSE,
  dont.check_het_and_miss = FALSE, do.run_check_het_and_miss = TRUE,
  do.evaluate_check_het_and_miss = TRUE, imissTh = 0.03, hetTh = 3,
  dont.check_relatedness = FALSE, do.run_check_relatedness = TRUE,
  do.evaluate_check_relatedness = TRUE, highIBDTh = 0.1875,
  dont.check_ancestry = FALSE, do.run_check_ancestry = TRUE,
  do.evaluate_check_ancestry = TRUE, prefixMergedDataset,
  europeanTh = 1.5, refSamples = NULL, refColors = NULL,
  refSamplesFile = NULL, refColorsFile = NULL, refSamplesIID = "IID",
  refSamplesPop = "Pop", refColorsColor = "Color",
  refColorsPop = "Pop", studyColor = "#2c7bb6", label = TRUE,
  interactive = FALSE, verbose = TRUE, path2plink = NULL,
  showPlinkOutput = TRUE)

Arguments

indir

[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed and name.fam.

name

[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam.

qcdir

[character] /path/to/directory where results will be saved. By default, qcdir=indir. If do.evaluate_[analysis] is set to TRUE and do.run_[analysis] is FALSE, perIndividualQC expects the analysis-specific plink output files in qcdir, i.e. do.evaluate_check_sex expects name.sexcheck, do.evaluate_check_het_and_miss expects name.het and name.imiss, do.evaluate_check_relatedness expects name.genome and name.imiss and do.evaluate_check_ancestry expects prefixMergedDataset.eigenvec. If these files are not present, perIndividualQC will fail with a missing file error. Setting do.run_[analysis] to TRUE will execute the checks and create the required files. The user needs write permission to qcdir.
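Before calling perIndividualQC with do.run_[analysis]=FALSE, it can help to confirm that the expected files are in place. The following is a minimal base-R sketch that uses the file names listed above; the helper expected_qc_files is hypothetical and not part of plinkQC, and the prefix values are only illustrative.

```r
# Hypothetical helper (not part of plinkQC): build the paths of the PLINK
# output files that the evaluate steps expect to find in qcdir.
expected_qc_files <- function(qcdir, name, prefixMergedDataset) {
  c(
    sex      = file.path(qcdir, paste0(name, ".sexcheck")),
    het      = file.path(qcdir, paste0(name, ".het")),
    imiss    = file.path(qcdir, paste0(name, ".imiss")),
    genome   = file.path(qcdir, paste0(name, ".genome")),
    ancestry = file.path(qcdir, paste0(prefixMergedDataset, ".eigenvec"))
  )
}

# Illustrative values for qcdir, name and prefixMergedDataset
files <- expected_qc_files(tempdir(), "data", "data.HapMapIII")

# Report any file that is not yet present
missing_files <- files[!file.exists(files)]
if (length(missing_files) > 0) {
  message("Missing in qcdir: ",
          paste(basename(missing_files), collapse = ", "))
}
```

If any file is reported as missing, the corresponding do.run_[analysis] argument can be left at TRUE so that the file is created.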

dont.check_sex

[logical] If TRUE, no sex check will be conducted; short for do.run_check_sex=FALSE and do.evaluate_check_sex=FALSE. Takes precedence over do.run_check_sex and do.evaluate_check_sex.

do.run_check_sex

[logical] If TRUE, run run_check_sex.

do.evaluate_check_sex

[logical] If TRUE, run evaluate_check_sex.

maleTh

[double] Threshold of X-chromosomal heterozygosity rate for males.

femaleTh

[double] Threshold of X-chromosomal heterozygosity rate for females.

externalSex

[data.frame, optional] Dataframe with sample IDs [externalSexID] and sex [externalSexSex] to double check if external and PEDSEX data (often processed at different centers) match.

externalMale

[integer/character] Identifier for 'male' in externalSex.

externalSexSex

[character] Column identifier for column containing sex information in externalSex.

externalSexID

[character] Column identifier for column containing ID information in externalSex.

externalFemale

[integer/character] Identifier for 'female' in externalSex.
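As an illustration of the externalSex arguments, the sketch below builds a data.frame whose column names and sex codes match the defaults shown in Usage (externalSexID = "IID", externalSexSex = "Sex", externalMale = "M", externalFemale = "F"). The sample IDs and sex assignments are made up.

```r
# Illustrative data only: IDs and sex codes are invented for this sketch.
externalSex <- data.frame(
  IID = c("ID_1", "ID_2", "ID_3"),
  Sex = c("M", "F", "F"),
  stringsAsFactors = FALSE
)

# Sanity checks against the default column names and identifiers
stopifnot(all(c("IID", "Sex") %in% names(externalSex)))
stopifnot(all(externalSex$Sex %in% c("M", "F")))
```

If the external file uses other codes (for instance 1 and 2), externalMale and externalFemale would be set accordingly.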

fixMixup

[logical] Should the PEDSEX of individuals with a mismatch between PEDSEX and Sex, while Sex==SNPSEX, be corrected automatically? This will directly change the name.bim/.bed/.fam files!

dont.check_het_and_miss

[logical] If TRUE, no heterozygosity and missingness check will be conducted; short for do.run_check_het_and_miss=FALSE and do.evaluate_check_het_and_miss=FALSE. Takes precedence over do.run_check_het_and_miss and do.evaluate_check_het_and_miss.

do.run_check_het_and_miss

[logical] If TRUE, run run_check_heterozygosity and run_check_missingness.

do.evaluate_check_het_and_miss

[logical] If TRUE, run evaluate_check_het_and_miss.

imissTh

[double] Threshold for acceptable missing genotype rate in any individual; has to be a proportion in the interval (0,1).

hetTh

[double] Threshold for acceptable deviation from mean heterozygosity per individual. Expressed as multiples of standard deviation of heterozygosity (het), i.e. individuals outside mean(het) +/- hetTh*sd(het) will be returned as failing heterozygosity check; has to be larger than 0.
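The hetTh rule can be sketched in base R as follows. This is only an illustration of the mean(het) +/- hetTh*sd(het) criterion on made-up heterozygosity rates; it does not reproduce how plinkQC derives het from the PLINK output.

```r
# Made-up heterozygosity rates for 51 samples; the last one is an outlier.
het <- c(rep(c(0.29, 0.31), 25), 0.40)
names(het) <- paste0("ID_", seq_along(het))

hetTh <- 3  # default from Usage

# Acceptance interval: mean(het) +/- hetTh * sd(het)
lower <- mean(het) - hetTh * sd(het)
upper <- mean(het) + hetTh * sd(het)

# Samples outside the interval fail the heterozygosity check
fail_het <- names(het)[het < lower | het > upper]
print(fail_het)
```

Lowering hetTh narrows the interval and flags more samples; raising it does the opposite.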

dont.check_relatedness

[logical] If TRUE, no relatedness check will be conducted; short for do.run_check_relatedness=FALSE and do.evaluate_check_relatedness=FALSE. Takes precedence over do.run_check_relatedness and do.evaluate_check_relatedness.

do.run_check_relatedness

[logical] If TRUE, run run_check_relatedness.

do.evaluate_check_relatedness

[logical] If TRUE, run evaluate_check_relatedness.

highIBDTh

[double] Threshold for acceptable proportion of IBD between pair of individuals.

dont.check_ancestry

[logical] If TRUE, no ancestry check will be conducted; short for do.run_check_ancestry=FALSE and do.evaluate_check_ancestry=FALSE. Takes precedence over do.run_check_ancestry and do.evaluate_check_ancestry.

do.run_check_ancestry

[logical] If TRUE, run run_check_ancestry.

do.evaluate_check_ancestry

[logical] If TRUE, run evaluate_check_ancestry.

prefixMergedDataset

[character] Prefix of merged dataset (study and reference samples) used in plink --pca, resulting in prefixMergedDataset.eigenvec.

europeanTh

[double] Scaling factor of radius to be drawn around center of European reference samples, with study samples inside this radius considered to be of European descent and samples outside this radius of non-European descent. The radius is computed as the maximum Euclidean distance of European reference samples to the centre of European reference samples.
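The radius rule described for europeanTh can be sketched on simulated coordinates. Two assumptions are made here that the text above does not confirm: the centre is taken as the mean of the European reference samples, and only PC1 and PC2 are used. All values are illustrative.

```r
# Simulated principal component coordinates (illustrative only)
ref_eur <- data.frame(PC1 = c(0.00, 0.02, -0.02, 0.00),
                      PC2 = c(0.02, 0.00, 0.00, -0.02))
study <- data.frame(IID = c("ID_1", "ID_2"),
                    PC1 = c(0.01, 0.05),
                    PC2 = c(0.01, 0.00),
                    stringsAsFactors = FALSE)
europeanTh <- 1.5  # default from Usage

# Assumption: centre = mean position of the European reference samples
centre <- colMeans(ref_eur[, c("PC1", "PC2")])

# Euclidean distance of each row of d to the centre
dist_to_centre <- function(d) {
  sqrt((d$PC1 - centre[["PC1"]])^2 + (d$PC2 - centre[["PC2"]])^2)
}

# Radius: maximum reference distance, scaled by europeanTh
radius <- max(dist_to_centre(ref_eur)) * europeanTh

# Study samples outside the radius are treated as non-European
fail_ancestry <- study$IID[dist_to_centre(study) > radius]
print(fail_ancestry)
```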

refSamples

[data.frame] Dataframe with sample identifiers [refSamplesIID] corresponding to IIDs in prefixMergedDataset.eigenvec and population identifier [refSamplesPop] corresponding to population IDs [refColorsPop] in refColorsFile/refColors. Either refSamples or refSamplesFile has to be specified.

refColors

[data.frame, optional] Dataframe with population IDs in column [refColorsPop] and corresponding colour-code for PCA plot in column [refColorsColor]. If not provided and is.null(refColorsFile), default colors are used.

refSamplesFile

[character] /path/to/File/with/reference samples. Needs columns with sample identifiers [refSamplesIID] corresponding to IIDs in prefixMergedDataset.eigenvec and population identifier [refSamplesPop] corresponding to population IDs [refColorsPop] in refColorsFile/refColors.

refColorsFile

[character, optional] /path/to/File/with/Population/Colors containing population IDs in column [refColorsPop] and corresponding colour-code for PCA plot in column [refColorsColor]. If not provided and is.null(refColors), default colors are used.

refSamplesIID

[character] Column name of reference sample IDs in refSamples/refSamplesFile.

refSamplesPop

[character] Column name of reference sample population IDs in refSamples/refSamplesFile.

refColorsColor

[character] Column name of population colors in refColors/refColorsFile.

refColorsPop

[character] Column name of reference sample population IDs in refColors/refColorsFile.
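To show how the refSamples and refColors arguments relate, the sketch below builds both data.frames with the default column names from Usage (refSamplesIID = "IID", refSamplesPop = "Pop", refColorsPop = "Pop", refColorsColor = "Color"). The sample IDs, population labels and colours are invented for illustration.

```r
# Illustrative reference samples: IDs and population labels are made up.
refSamples <- data.frame(
  IID = c("REF_1", "REF_2", "REF_3"),
  Pop = c("CEU", "CEU", "YRI"),
  stringsAsFactors = FALSE
)

# One colour per population, keyed by the same population IDs
refColors <- data.frame(
  Pop   = c("CEU", "YRI"),
  Color = c("#1b9e77", "#d95f02"),
  stringsAsFactors = FALSE
)

# Every population in refSamples should have a colour in refColors
pops_without_color <- setdiff(refSamples$Pop, refColors$Pop)
stopifnot(length(pops_without_color) == 0)
```

The same columns would be expected in the files passed via refSamplesFile and refColorsFile.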

studyColor

[character] Colour to be used for study population.

label

[logical] Set to TRUE to add fail IDs as text labels in the scatter plot.

interactive

[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_sampleQC) via ggplot2::ggsave(plot=p_sampleQC, other_arguments) or pdf(outfile); print(p_sampleQC); dev.off(). If TRUE, i) depicts the X-chromosomal heterozygosity (SNPSEX) of the samples split by their PEDSEX (if do.evaluate_check_sex is TRUE), ii) creates a scatter plot with samples' missingness rates on the x-axis and their heterozygosity rates on the y-axis (if do.evaluate_check_het_and_miss is TRUE), iii) depicts all pair-wise IBD-estimates as a histogram (if do.evaluate_check_relatedness is TRUE) and iv) creates a scatter plot of PC1 versus PC2 color-coded for samples of reference populations and study population (if do.evaluate_check_ancestry is TRUE).

verbose

[logical] If TRUE, progress info is printed to standard out.

path2plink

[character] Absolute path to PLINK executable (https://www.cog-genomics.org/plink/1.9/), i.e. plink should be accessible as path2plink -h. The full name of the executable should be specified: for Windows, this means path/plink.exe; for unix platforms, this is path/plink. If not provided, it is assumed that the PATH set-up works and PLINK will be found by exec_wait('plink').
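Whether the PATH set-up works can be checked from R before leaving path2plink at its default. This is a minimal sketch using base R's Sys.which; it assumes the executable is called plink and is not something perIndividualQC does itself.

```r
# Look up 'plink' on the PATH; returns "" if it is not found.
plink_on_path <- Sys.which("plink")

if (nzchar(plink_on_path)) {
  # PLINK is discoverable, so path2plink can stay NULL
  message("PLINK found at: ", plink_on_path)
} else {
  # Otherwise pass the full path to the executable via path2plink
  message("PLINK not found on PATH; set path2plink explicitly.")
}
```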

showPlinkOutput

[logical] If TRUE, plink log and error messages are printed to standard out.

Value

Named [list] with i) fail_list, a named [list] with 1. sample_missingness containing a [vector] with sample IIDs failing the missingness threshold imissTh, 2. highIBD containing a [vector] with sample IIDs failing the relatedness threshold highIBDTh, 3. outlying_heterozygosity containing a [vector] with sample IIDs failing the heterozygosity threshold hetTh, 4. mismatched_sex containing a [vector] with the sample IIDs failing the sexcheck based on SNPSEX and femaleTh/maleTh and 5. ancestry containing a vector with sample IIDs failing the ancestry check based on europeanTh and ii) p_sampleQC, a ggplot2-object 'containing' a sub-paneled plot with the QC-plots of check_sex, check_het_and_miss, check_relatedness and check_ancestry, which can be shown by print(p_sampleQC). List entries contain NULL if that specific check was not chosen.
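The structure of fail_list lends itself to simple post-processing. The sketch below works on a mock object that mimics the documented return value (the IDs are made up and p_sampleQC is omitted); it is not output from an actual perIndividualQC run.

```r
# Mock of the documented return structure; IDs are invented.
fail_individuals <- list(
  fail_list = list(
    sample_missingness      = c("ID_3"),
    highIBD                 = c("ID_7", "ID_9"),
    outlying_heterozygosity = c("ID_3"),
    mismatched_sex          = NULL,        # check not chosen
    ancestry                = c("ID_12")
  )
)

# Number of failing samples per check (0 for checks that were skipped)
n_fail <- lengths(fail_individuals$fail_list)
print(n_fail)

# Unique IDs failing at least one check
all_fail <- unique(unlist(fail_individuals$fail_list, use.names = FALSE))
print(all_fail)
```

A sample that fails several checks, such as ID_3 here, appears only once in all_fail.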

Details

perIndividualQC wraps around the individual QC functions check_sex, check_het_and_miss, check_relatedness and check_ancestry. For details on the parameters and outputs, see the documentation of these functions. For detailed output on the failing IIDs (instead of simple IID lists), run each function individually.

Examples

# NOT RUN {
indir <- system.file("extdata", package="plinkQC")
qcdir <- tempdir()
name <- "data"
# All quality control checks
# In this example, run_check_* has already been conducted and the outcome
# files are present in qcdir, hence do.run_check_* are all set to FALSE
# }
# NOT RUN {
fail_individuals <- perIndividualQC(indir=indir, qcdir=qcdir, name=name,
refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""),
refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""),
prefixMergedDataset="data.HapMapIII", interactive=FALSE, verbose=FALSE,
do.run_check_het_and_miss=FALSE, do.run_check_relatedness=FALSE,
do.run_check_sex=FALSE, do.run_check_ancestry=FALSE)

# Only check sex and missingness/heterozygosity
fail_sex_het_miss <- perIndividualQC(indir=indir, qcdir=qcdir, name=name,
dont.check_ancestry=TRUE, dont.check_relatedness=TRUE,
interactive=FALSE, verbose=FALSE, do.run_check_het_and_miss=FALSE,
do.run_check_sex=FALSE)
# }