Evaluates and depicts results of plink --pca (via run_check_ancestry or an
externally conducted PCA) on merged genotypes from the individuals to be QCed
and individuals of a reference population of known genotypes. Currently,
check ancestry only supports automatic selection of individuals of European
descent. It uses principal components 1 and 2 returned by plink --pca to find
the centre of the European reference samples, (mean(PC1_europeanRef),
mean(PC2_europeanRef)). It then computes the maximum Euclidean distance
(maxDist) of the European reference samples from this centre. All study
samples whose Euclidean distance from the centre falls outside the circle of
radius r = europeanTh * maxDist are considered non-European, and their IDs
are returned as failing the ancestry check.
check_ancestry creates a scatter plot of PC1 versus PC2 colour-coded for
samples of the reference populations and the study population.
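The distance rule described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the package's internal code: the PC scores are made-up values, and the names ref_eur, study and dist_from_centre are hypothetical.

```r
# Made-up PC scores for European reference samples (illustration only)
ref_eur <- data.frame(PC1 = c(-0.01, 0.01, 0.00, 0.00, 0.00),
                      PC2 = c(0.00, 0.00, -0.01, 0.01, 0.00))

# Made-up PC scores for four study samples
study <- data.frame(IID = c("S1", "S2", "S3", "S4"),
                    PC1 = c(0.005, -0.012, 0.2, 0.0),
                    PC2 = c(0.005, 0.0, -0.15, 0.02),
                    stringsAsFactors = FALSE)

europeanTh <- 1.5

# Centre of the European reference samples on PC1 and PC2
centre <- c(mean(ref_eur$PC1), mean(ref_eur$PC2))

# Euclidean distance of each (PC1, PC2) point from the centre
dist_from_centre <- function(pc1, pc2, centre) {
  sqrt((pc1 - centre[1])^2 + (pc2 - centre[2])^2)
}

# Largest distance of any European reference sample from the centre
maxDist <- max(dist_from_centre(ref_eur$PC1, ref_eur$PC2, centre))
radius <- europeanTh * maxDist

# Study samples outside the circle fail the ancestry check
study$dist <- dist_from_centre(study$PC1, study$PC2, centre)
fail_ids <- study$IID[study$dist > radius]
```

In this toy data, S2 lies further from the centre than any reference sample but still inside the scaled radius, which shows how europeanTh > 1 loosens the cut-off.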
evaluate_check_ancestry(indir, name, prefixMergedDataset, qcdir = indir,
europeanTh = 1.5, refSamples = NULL, refColors = NULL,
refSamplesFile = NULL, refColorsFile = NULL, refSamplesIID = "IID",
refSamplesPop = "Pop", refColorsColor = "Color",
refColorsPop = "Pop", studyColor = "#2c7bb6", interactive = FALSE)
indir: [character] /path/to/directory containing the basic PLINK data files name.bim, name.bed and name.fam.
name: [character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam.
prefixMergedDataset: [character] Prefix of merged dataset (study and reference samples) used in plink --pca, resulting in prefixMergedDataset.eigenvec.
qcdir: [character] /path/to/directory/with/QC/results containing prefixMergedDataset.eigenvec results as returned by plink --pca. Per default qcdir=indir.
europeanTh: [double] Scaling factor of the radius to be drawn around the centre of the European reference samples. Study samples inside this radius are considered to be of European descent and samples outside it of non-European descent. The unscaled radius is the maximum Euclidean distance of the European reference samples to their centre.
refSamples: [data.frame] Dataframe with sample identifiers [refSamplesIID] corresponding to IIDs in prefixMergedDataset.eigenvec and population identifier [refSamplesPop] corresponding to population IDs [refColorsPop] in refColorsFile/refColors. Either refSamples or refSamplesFile has to be specified.
refColors: [data.frame, optional] Dataframe with population IDs in column [refColorsPop] and corresponding colour-code for PCA plot in column [refColorsColor]. If not provided and is.null(refColorsFile), default colors are used.
refSamplesFile: [character] /path/to/File/with/reference samples. Needs columns with sample identifiers [refSamplesIID] corresponding to IIDs in prefixMergedDataset.eigenvec and population identifier [refSamplesPop] corresponding to population IDs [refColorsPop] in refColorsFile/refColors. If both refSamplesFile and refSamples are not NULL, refSamplesFile information is used.
refColorsFile: [character, optional] /path/to/File/with/Population/Colors containing population IDs in column [refColorsPop] and corresponding colour-code for PCA plot in column [refColorsColor]. If not provided and is.null(refColors), default colors are used. If both refColorsFile and refColors are not NULL, refColorsFile information is used.
refSamplesIID: [character] Column name of reference sample IDs in refSamples/refSamplesFile.
refSamplesPop: [character] Column name of reference sample population IDs in refSamples/refSamplesFile.
refColorsColor: [character] Column name of population colors in refColors/refColorsFile.
refColorsPop: [character] Column name of reference sample population IDs in refColors/refColorsFile.
studyColor: [character] Colour to be used for the study population in the PCA plot.
interactive: [logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_ancestry) via ggplot2::ggsave(plot=p_ancestry, other_arguments) or pdf(outfile); print(p_ancestry); dev.off().
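As a rough illustration of the in-memory alternative to refSamplesFile and refColorsFile, the sketch below builds two data frames with the default column names ("IID", "Pop", "Color"). The sample IDs and colours are placeholders, and the final consistency check is an assumption about what is sensible to verify beforehand, not something the function is documented to do.

```r
# Hypothetical reference samples; in practice the IDs must match the
# IIDs in prefixMergedDataset.eigenvec
refSamples <- data.frame(IID = c("REF001", "REF002", "REF003"),
                         Pop = c("CEU", "CEU", "YRI"),
                         stringsAsFactors = FALSE)

# One colour per population ID used in refSamples$Pop (placeholder colours)
refColors <- data.frame(Pop = c("CEU", "YRI"),
                        Color = c("#d7191c", "#fdae61"),
                        stringsAsFactors = FALSE)

# Assumed sanity check: every reference population has a colour assigned
missing_pops <- setdiff(unique(refSamples$Pop), refColors$Pop)
stopifnot(length(missing_pops) == 0)
```

These objects would then be passed as refSamples=refSamples and refColors=refColors, leaving refSamplesFile and refColorsFile at NULL so that the data frames are not overridden.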
Named [list] with i) fail_ancestry, containing a [data.frame] with FID and IID of non-European individuals and ii) p_ancestry, a ggplot2-object 'containing' a scatter plot of PC1 versus PC2 colour-coded for samples of the reference populations and the study population, which can be shown by print(p_ancestry).
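One possible way to use the returned fail_ancestry element is to write the FID and IID columns to a text file for later sample removal. The sketch below uses a stand-in list with made-up IDs rather than a real function call, and the output file name is hypothetical.

```r
# Stand-in for the list returned by evaluate_check_ancestry();
# the IDs are made up and p_ancestry is omitted here
res <- list(fail_ancestry = data.frame(FID = c("F1", "F2"),
                                       IID = c("S3", "S4"),
                                       stringsAsFactors = FALSE))

# Hypothetical output file in the session's temporary directory
out_file <- file.path(tempdir(), "fail_ancestry.txt")

# Write FID and IID as two space-separated columns without a header
write.table(res$fail_ancestry[, c("FID", "IID")], file = out_file,
            quote = FALSE, row.names = FALSE, col.names = FALSE)

n_fail <- nrow(res$fail_ancestry)
```

A two-column FID/IID file of this shape is the format that plink --remove reads, so it could be used to exclude the failing samples in a subsequent PLINK call.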
Both run_check_ancestry and evaluate_check_ancestry can simply be invoked by check_ancestry.
# NOT RUN {
library(plinkQC)
# Example data shipped with the plinkQC package
qcdir <- system.file("extdata", package="plinkQC")
name <- "data"
# Returns a list with fail_ancestry (data.frame) and p_ancestry (ggplot2 object)
fail_ancestry <- evaluate_check_ancestry(indir=qcdir, name=name,
    refSamplesFile=file.path(qcdir, "HapMap_ID2Pop.txt"),
    refColorsFile=file.path(qcdir, "HapMap_PopColors.txt"),
    prefixMergedDataset="data.HapMapIII", interactive=FALSE)
# }