pcrelate
is used to estimate kinship coefficients, IBD sharing probabilities, and inbreeding coefficients using genome-wide SNP data. PC-Relate accounts for population structure (ancestry) among sample individuals through the use of ancestry representative principal components (PCs) to provide accurate relatedness estimates due only to recent family (pedigree) structure.pcrelate(genoData, pcMat = NULL, freq.type = "individual", scale = "overall", ibd.probs = TRUE, scan.include = NULL, training.set = NULL, scan.block.size = 5000,
snp.include = NULL, chromosome = NULL, snp.block.size = 10000,
MAF = 0.01, write.to.gds = FALSE, gds.prefix = NULL,
correct = TRUE, verbose = TRUE)
GenotypeData
from the package GWASTools
containing the genotype data for SNPs and samples to be used for the analysis. This object can easily be created from a matrix of SNP genotype data, PLINK files, or GDS files. Alternatively, this could be an object of class SeqVarData
from the package SeqVarTools
containing the genotype data for the sequencing variants and samples to be used for the analysis.genoData
are included.chromosome
for further details.snp.include
is NULL; if chromosome
is also NULL, then all SNPs are included.freq.type
is 'individual', if an individual's estimated individual-specific minor allele frequency at a SNP is less than this value, that SNP will be excluded from the analysis for that individual. When freq.type
is 'population', any SNP with a population minor allele frequency less than this value will be excluded from the analysis. The default value is 0.01.write.to.gds = TRUE
. If NULL, the prefix 'tmp' is used. See 'Details' for more information.pcrelate
'. A list including:
sample.id
.sample.id
. This matrix is returned only if ibd.probs = TRUE
in the input.sample.id
.pcrelate
.pcMat = NULL
and corresponds to an assumption of population homogeneity. It is important that the PCs used in pcMat
to adjust for ancestry are representative of ancestry and NOT family structure, so we recommend using PCs calculated with PC-AiR.
It is important that the order of individuals in the matrix pcMat
matches the order of individuals in genoData
.
In order to perform relatedness estimation, allele frequency estimates are required for centering and scaling genotype values. When freq.type
is 'individual', individual-specific allele frequencies calculated for each individual at each SNP using the PCs specified in pcMat
are used. When freq.type
is 'population', population average allele frequencies calculated at each SNP are used for all individuals. (Note that when freq.type
is set to 'population' there is no ancestry adjustment and the relatedness estimates will be confounded with population structure (ancestry)). There are muliple choices for how genotype values are scaled. When scale
is 'variant', centered genotype values at each SNP are divided by their expected variance under Hardy-Weinberg equilibrium. When scale
is 'overall', centered genotype values at all SNPs are divided by the average across all SNPs of their expected variances under Hardy-Weinberg equilibrium; this scaling leads to more stable behavior when using low frequency variants. When scale
is 'none', genotype values are only centered and not scaled; this won't provide accurate kinship coefficient estimates but may be useful for other purposes. At a particular SNP, the variance used for scaling is either calculated separately for each individual using their individual-specific allele frequncies (when freq.type
is 'individual') or once for all individuals using the population average allele frequency (when freq.type
is 'population'). Set freq.type
to 'individual' and scale
to 'overall' to perform a standard PC-Relate analysis; these are the defaults. If freq.type
is set to 'individual' and scale
is set to 'variant', the estimators are very similar to REAP. If freq.type
is set to 'population' and scale
is set to 'variant', the estimators are very similar to EIGENSOFT.
The optional input training.set
allows the user to specify which samples are used to estimate the ancestry effect when estimating individual-specific allele frequencies (if freq.type
is 'individual') or to estimate the population allele frequency (if freq.type
is 'population'. Ideally, training.set
is a set of mutually unrelated individuals. If prior information regarding pedigree structure is available, this can be used to select training.set
, or if pcair
was used to obtain the PCs, then the individuals in the PC-AiR 'unrelated subset' can be used. If no prior information is available, all individuals should be used.
The scan.block.size
can be specified to alleviate memory issues when working with very large data sets. If scan.block.size
is smaller than the number of individuals included in the analysis, then individuals will be analyzed in separate blocks. This reduces the memory required for the analysis, but genotype data must be read in multiple times for each block (to analyze all pairs), which increases the number of computations required. NOTE: if individuals are broken up into more than 1 block, write.to.gds
must be TRUE (see below).
If write.to.gds = TRUE
, then the output is written to two GDS files rather than returned to the R console. Use of this option requires the gdsfmt
package. The first GDS file, named ``freq.type
is 'individual') or the population allele frequency estimates at each SNP (when freq.type
is 'population'. The second GDS file, named ``
pcrelateReadKinship
, pcrelateReadInbreed
, and pcrelateMakeGRM
for functions that can be used to read in the results output by pcrelate
.
GWASTools
for a description of the package containing the following functions: GenotypeData
for a description of creating a GenotypeData
class object for storing sample and SNP genotype data, MatrixGenotypeReader
for a description of reading in genotype data stored as a matrix, and GdsGenotypeReader
for a description of reading in genotype data stored as a GDS file. Also see snpgdsBED2GDS
in the SNPRelate
package for a description of converting binary PLINK files to GDS.
# file path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
# read in GDS data
HapMap_geno <- GdsGenotypeReader(filename = gdsfile)
# create a GenotypeData class object
HapMap_genoData <- GenotypeData(HapMap_geno)
# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")
# run PC-AiR
mypcair <- pcair(genoData = HapMap_genoData, kinMat = HapMap_ASW_MXL_KINGmat,
divMat = HapMap_ASW_MXL_KINGmat)
# run PC-Relate
mypcrel <- pcrelate(genoData = HapMap_genoData, pcMat = mypcair$vectors[,1],
training.set = mypcair$unrels)
close(HapMap_genoData)
Run the code above in your browser using DataLab