plinkQC (version 0.2.2)

check_relatedness: Identification of related individuals

Description

Runs plink --genome and evaluates its results. plink --genome calculates identity by state (IBS) for each pair of individuals based on the average proportion of alleles shared at genotyped SNPs. The degree of recent shared ancestry, i.e. the identity by descent (IBD), can be estimated from the genome-wide IBS. The proportion of IBD between two individuals is returned by plink --genome as PI_HAT.

check_relatedness finds pairs of samples whose proportion of IBD is larger than the specified highIBDTh. For pairs of individuals that do not have additional relatives in the dataset, the individual with the greater genotype missingness rate is selected and returned as the individual failing the relatedness check. For more complex family structures, the unrelated individuals per family are selected (e.g. in a parents-offspring trio, the offspring will be marked as fail, while the parents will be kept in the analysis).

check_relatedness depicts all pair-wise IBD estimates as histograms stratified by value of PI_HAT.
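
The selection rule for simple pairs described above can be sketched in base R. This is only an illustration on made-up data, not the package's implementation: the data frames genome and imiss are toy stand-ins for name.genome and name.imiss, and the handling of ties and of larger families is deliberately left out.

```r
# Toy stand-ins for name.genome and name.imiss (all values are made up).
genome <- data.frame(
  IID1   = c("A", "A", "C"),
  IID2   = c("B", "C", "D"),
  PI_HAT = c(0.50, 0.02, 0.25),
  stringsAsFactors = FALSE
)
imiss <- data.frame(
  IID    = c("A", "B", "C", "D"),
  F_MISS = c(0.010, 0.020, 0.005, 0.001),
  stringsAsFactors = FALSE
)
highIBDTh <- 0.1875

# Keep only pairs whose estimated IBD proportion exceeds the threshold.
related <- genome[genome$PI_HAT > highIBDTh, ]

# Look up the genotype missingness rate of each member of a related pair.
miss1 <- imiss$F_MISS[match(related$IID1, imiss$IID)]
miss2 <- imiss$F_MISS[match(related$IID2, imiss$IID)]

# Each related pair here is isolated (no further relatives), so flag the
# member with the greater missingness; on a tie this sketch picks IID2.
fail_ids <- ifelse(miss1 > miss2, related$IID1, related$IID2)
print(fail_ids)
```

In this toy data, the pairs A-B and C-D exceed the threshold, and one individual per pair is flagged based on F_MISS.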

Usage

check_relatedness(indir, name, qcdir = indir, highIBDTh = 0.1875,
  imissTh = 0.03, run.check_relatedness = TRUE, interactive = FALSE,
  verbose = FALSE, path2plink = NULL, showPlinkOutput = TRUE)

Arguments

indir

[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed and name.fam.

name

[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam, name.genome and name.imiss.

qcdir

[character] /path/to/directory where name.genome as returned by plink --genome will be saved. By default, qcdir=indir. If run.check_relatedness is FALSE, it is assumed that plink --missing and plink --genome have been run and qcdir/name.imiss and qcdir/name.genome exist. The user needs write permission for qcdir.

highIBDTh

[double] Threshold for acceptable proportion of IBD between a pair of individuals; pairs with PI_HAT larger than this value are treated as related.

imissTh

[double] Threshold for acceptable missing genotype rate in any individual; has to be a proportion in (0,1).

run.check_relatedness

[logical] Should plink --genome be run to determine pairwise IBD of individuals? If FALSE, it is assumed that plink --genome and plink --missing have been run and qcdir/name.imiss and qcdir/name.genome are present; check_relatedness will fail with a missing file error otherwise.

interactive

[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding or a graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_IBD) via ggplot2::ggsave(plot=p_IBD, other_arguments) or pdf(outfile); print(p_IBD); dev.off().
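
As a minimal sketch of the two non-interactive options, assuming ggplot2 is installed: the p_IBD object below is a small stand-in histogram built from made-up values, used in place of the p_IBD element of the list returned by check_relatedness, and outfile is a hypothetical path in a temporary directory.

```r
library(ggplot2)

# Stand-in for the p_IBD element returned by check_relatedness;
# any ggplot object is saved the same way.
p_IBD <- ggplot(data.frame(PI_HAT = c(0.01, 0.02, 0.25, 0.5)),
                aes(x = PI_HAT)) +
  geom_histogram(bins = 10)

outfile <- file.path(tempdir(), "p_IBD.pdf")

# Option 1: let ggplot2 open and close the graphics device.
ggsave(filename = outfile, plot = p_IBD, width = 6, height = 4)

# Option 2: open a pdf device explicitly, print the plot, close the device.
pdf(outfile, width = 6, height = 4)
print(p_IBD)
dev.off()
```

Both options write the same file here; in practice one of them is enough.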

verbose

[logical] If TRUE, progress info is printed to standard out.

path2plink

[character] Absolute path to the PLINK executable (https://www.cog-genomics.org/plink/1.9/), i.e. plink should be accessible as path2plink -h. The full name of the executable should be specified: on Windows, this means path/plink.exe; on Unix platforms, this is path/plink. If not provided, it is assumed that the PATH set-up works and PLINK will be found by exec_wait('plink').
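
One way to check whether PLINK is discoverable before relying on the default is base R's Sys.which. This is only a sketch: the variable name plink_path is illustrative, and whether the resolved path is the right value to pass as path2plink is an assumption based on the argument description above.

```r
# Look up the PLINK executable on the PATH; Sys.which returns "" if
# no executable of that name is found.
plink_path <- unname(Sys.which("plink"))

if (nzchar(plink_path)) {
  message("PLINK found at: ", plink_path)
} else {
  message("PLINK not found on PATH; supply path2plink explicitly.")
}
```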

showPlinkOutput

[logical] If TRUE, plink log and error messages are printed to standard out.

Value

Named [list] with:

i) fail_high_IBD, a [data.frame] containing the IIDs and FIDs of individuals who fail the highIBDTh in columns FID1 and IID1. In addition, the following columns are returned (as originally obtained by plink --genome): FID2 (Family ID for second sample), IID2 (Individual ID for second sample), RT (Relationship type inferred from .fam/.ped file), EZ (IBD sharing expected value, based on just .fam/.ped relationship), Z0 (P(IBD=0)), Z1 (P(IBD=1)), Z2 (P(IBD=2)), PI_HAT (Proportion IBD, i.e. P(IBD=2) + 0.5*P(IBD=1)), PHE (Pairwise phenotypic code (1, 0, -1 = AA, AU, and UU pairs, respectively)), DST (IBS distance, i.e. (IBS2 + 0.5*IBS1) / (IBS0 + IBS1 + IBS2)), PPC (IBS binomial test), RATIO (HETHET : IBS0 SNP ratio (expected value 2)).

ii) failIDs, a [data.frame] with individual IDs [IID] and family IDs [FID] of individuals failing the highIBDTh.

iii) p_IBD, a ggplot2 object containing all pair-wise IBD estimates as histograms stratified by value of PI_HAT, which can be shown by print(p_IBD).
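
The PI_HAT definition above, P(IBD=2) + 0.5*P(IBD=1), can be worked through on the theoretical expected values of Z0, Z1 and Z2 for a few relationship types. These are textbook expectations used for illustration, not PLINK output, and the data frame z is a hypothetical name.

```r
# Theoretical expected IBD state probabilities per relationship type.
z <- data.frame(
  pair = c("parent-offspring", "full siblings", "half siblings", "unrelated"),
  Z0   = c(0.00, 0.25, 0.50, 1.00),
  Z1   = c(1.00, 0.50, 0.50, 0.00),
  Z2   = c(0.00, 0.25, 0.00, 0.00)
)

# PI_HAT as defined for the .genome output: P(IBD=2) + 0.5 * P(IBD=1).
z$PI_HAT <- z$Z2 + 0.5 * z$Z1

# Compare against the default highIBDTh of check_relatedness.
z$fails <- z$PI_HAT > 0.1875
print(z)
```

The default of 0.1875 lies halfway between the expected PI_HAT of second-degree relatives (0.25) and third-degree relatives (0.125), so first- and second-degree pairs in this table exceed it while the unrelated pair does not.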

Details

check_relatedness wraps around run_check_relatedness and evaluate_check_relatedness. If run.check_relatedness is TRUE, run_check_relatedness is executed; otherwise it is assumed that plink --genome has been run externally and qcdir/name.genome exists. check_relatedness will fail with a missing file error otherwise.

For details on the output data.frame fail_high_IBD, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#genome.

Examples

# NOT RUN {
indir <- system.file("extdata", package="plinkQC")
name <- 'data'
relatednessQC <- check_relatedness(indir=indir, name=name, interactive=FALSE,
                                   run.check_relatedness=FALSE)
# }