Run LD pruning on the dataset with plink --exclude range highldfile --indep-pairwise 50 5 0.2, where highldfile contains regions of high LD as provided by Anderson et al. (2010) Nature Protocols. Subsequently, plink --genome is run on the LD-pruned, MAF-filtered data. plink --genome calculates identity by state (IBS) for each pair of individuals based on the average proportion of alleles shared at genotyped SNPs. The degree of recent shared ancestry, i.e. the identity by descent (IBD), can be estimated from the genome-wide IBS. The proportion of IBD between two individuals is returned by --genome as PI_HAT.
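The two PLINK calls described above can be sketched as argument vectors assembled in R. This is a minimal illustration, not the package's internal code: the helper name build_relatedness_args and the file name high-LD-regions.txt are hypothetical, and the sketch assumes that PLINK 1.9 writes the variants retained by pruning to <out>.prune.in. Only flags documented for PLINK 1.9 are used.

```r
# Hypothetical helper (not part of plinkQC): assemble PLINK 1.9 arguments
# for the LD pruning step and the subsequent IBD estimation step.
build_relatedness_args <- function(bfile, out, highldfile, maf = 0.1) {
  # Step 1: exclude high LD regions, then prune (window 50, step 5, r^2 0.2).
  prune <- c("--bfile", bfile,
             "--exclude", "range", highldfile,
             "--indep-pairwise", "50", "5", "0.2",
             "--out", out)
  # Step 2: IBS/IBD estimation on the pruned, MAF-filtered variants.
  genome <- c("--bfile", bfile,
              "--extract", paste0(out, ".prune.in"),
              "--maf", as.character(maf),
              "--genome",
              "--out", out)
  list(prune = prune, genome = genome)
}

# "high-LD-regions.txt" is a placeholder file name.
args <- build_relatedness_args("data", "data", "high-LD-regions.txt")
cat("plink", args$prune, "\n")
cat("plink", args$genome, "\n")

# To execute a step, pass the vector to system2() together with the PLINK
# executable on your system, e.g.
# system2("/path/to/plink", args$prune)
```

Keeping the arguments as a character vector rather than a single pasted string avoids quoting problems when paths contain spaces.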
run_check_relatedness(
  indir,
  name,
  qcdir = indir,
  highIBDTh = 0.185,
  mafThRelatedness = 0.1,
  path2plink = NULL,
  filter_high_ldregion = TRUE,
  high_ldregion_file = NULL,
  genomebuild = "hg19",
  showPlinkOutput = TRUE,
  keep_individuals = NULL,
  remove_individuals = NULL,
  exclude_markers = NULL,
  extract_markers = NULL,
  verbose = FALSE
)
indir: [character] /path/to/directory containing the basic PLINK data files name.bim, name.bed and name.fam.
name: [character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam.
qcdir: [character] /path/to/directory to save name.genome as returned by plink --genome. The user needs write permission for qcdir. By default, qcdir=indir.
highIBDTh: [double] Threshold for the acceptable proportion of IBD between a pair of individuals; only pairwise relationship estimates larger than this threshold will be recorded.
mafThRelatedness: [double] Threshold of the minor allele frequency filter for selecting variants for IBD estimation.
path2plink: [character] Absolute path to the PLINK executable (https://www.cog-genomics.org/plink/1.9/), i.e. plink should be accessible as path2plink -h. The full name of the executable should be specified: for Windows, this means path/plink.exe; for Unix platforms, path/plink. If not provided, it is assumed that the PATH is set up so that PLINK will be found by exec('plink').
filter_high_ldregion: [logical] Should high LD regions be filtered before IBD estimation? Carried out by default, with high LD regions for hg19 provided as default via genomebuild. For genome builds not provided or for non-human data, high LD region files can be provided via high_ldregion_file.
high_ldregion_file: [character] Path to file with high LD regions used for filtering before IBD estimation if filter_high_ldregion == TRUE, otherwise ignored; for human genome data, high LD region files are provided and can simply be chosen via genomebuild. Files have to be space-delimited, without column names, and with the following columns: chromosome, region-start, region-end, region number. Chromosomes are specified without the 'chr' prefix. For instance:
1 48000000 52000000 1
2 86000000 100500000 2
genomebuild: [character] Name of the genome build of the PLINK file annotations, i.e. the mappings in the name.bim file. Will be used to remove high LD regions based on the coordinates of the respective build. Options are hg18, hg19 and hg38. See Details.
showPlinkOutput: [logical] If TRUE, PLINK log and error messages are printed to standard out.
keep_individuals: [character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals.
remove_individuals: [character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals.
exclude_markers: [character] Path to file with markers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but they may also be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers.
extract_markers: [character] Path to file with markers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but they may also be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers.
verbose: [logical] If TRUE, progress info is printed to standard out.
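The file formats expected by high_ldregion_file and keep_individuals can be illustrated by writing small example files from R. This is a sketch only: the two region rows are the example rows given in the argument description, while the family and individual IDs (FAM1, IND1, ...) are made up for illustration.

```r
# High LD regions file: chromosome, region start, region end, region number;
# space-delimited, no column names, chromosomes without 'chr' prefix.
region_file <- tempfile(fileext = ".txt")
writeLines(c("1 48000000 52000000 1",
             "2 86000000 100500000 2"), region_file)

# Individuals file: family ID in column 1, within-family ID in column 2.
# These IDs are invented placeholders.
keep <- data.frame(FID = c("FAM1", "FAM2"), IID = c("IND1", "IND2"))
keep_file <- tempfile(fileext = ".txt")
write.table(keep, keep_file, quote = FALSE, row.names = FALSE,
            col.names = FALSE)

# Read the regions back to confirm the four-column layout.
regions <- read.table(region_file, header = FALSE,
                      col.names = c("chr", "start", "end", "region"))
str(regions)

# The paths could then be passed on, for example:
# run_check_relatedness(indir = indir, name = name,
#                       high_ldregion_file = region_file,
#                       keep_individuals = keep_file,
#                       path2plink = path2plink)
```

The region lines are written as literal strings so that large coordinates are not converted to scientific notation on output.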
Both run_check_relatedness and its evaluation via evaluate_check_relatedness can simply be invoked by check_relatedness.
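To illustrate what thresholding on highIBDTh means, the sketch below applies the cut-off to a toy table shaped like part of a PLINK .genome file. It is not the logic of evaluate_check_relatedness, and the pairs and IBD probabilities are invented for illustration. It relies on the PLINK definition PI_HAT = P(IBD=2) + 0.5 * P(IBD=1), with the IBD probabilities stored in the Z0, Z1 and Z2 columns.

```r
# Toy stand-in for a PLINK .genome table; all values are invented.
genome <- data.frame(
  IID1 = c("A", "A", "B"),
  IID2 = c("B", "C", "C"),
  Z0 = c(0.00, 0.75, 1.00),  # P(IBD = 0)
  Z1 = c(1.00, 0.25, 0.00),  # P(IBD = 1)
  Z2 = c(0.00, 0.00, 0.00)   # P(IBD = 2)
)

# Proportion of IBD as reported by PLINK.
genome$PI_HAT <- genome$Z2 + 0.5 * genome$Z1

# Keep only pairs whose estimated IBD proportion exceeds the threshold.
highIBDTh <- 0.185
related <- genome[genome$PI_HAT > highIBDTh, ]
print(related)

# With real output, the table could be read from qcdir instead, e.g.
# genome <- read.table(file.path(qcdir, paste0(name, ".genome")),
#                      header = TRUE)
```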
The IBD estimation is conducted on LD-pruned data; in a first step, high LD regions are excluded. The regions were derived from the high-LD-regions file provided by Anderson et al. (2010) Nature Protocols. These regions are in NCBI36 (hg18) coordinates and were lifted to GRCh37 (hg19) and GRCh38 (hg38) coordinates using the liftOver tool available here: https://genome.ucsc.edu/cgi-bin/hgLiftOver. The 'Minimum ratio of bases that must remap' was set to 0.5 and the 'Allow multiple output regions' box was ticked; for all other parameters, the default options were selected. LiftOver files were generated on July 9, 2019. The commands for formatting the files are provided in system.file("extdata", 'liftOver.cmd', package="plinkQC").
# NOT RUN {
indir <- system.file("extdata", package="plinkQC")
name <- 'data'
qcdir <- tempdir()
path2plink <- '/path/to/plink'
# the following code is not run on package build, as the path2plink on the
# user system is not known.
# }
# NOT RUN {
# Relatedness estimation based on all markers in the dataset
run <- run_check_relatedness(indir=indir, qcdir=qcdir, name=name,
                             path2plink=path2plink)
# Relatedness estimation on a subset of the dataset
keep_individuals_file <- system.file("extdata", "keep_individuals",
                                     package="plinkQC")
run <- run_check_relatedness(indir=indir, qcdir=qcdir, name=name,
                             keep_individuals=keep_individuals_file,
                             path2plink=path2plink)
# }