popkin: Estimate kinship from a genotype matrix and subpopulation assignments

Description

Given the biallelic genotypes of \(n\) individuals, this function returns the \(n \times n\) kinship matrix \(\Phi^T\) such that the kinship estimate between the most distant subpopulations is zero on average (this sets the ancestral population \(T\) to the most recent common ancestor population).

Usage

popkin(X, subpops = NULL, n = NA, lociOnCols = FALSE, memLim = NA)

Arguments

X

Genotype matrix, BEDMatrix object, or a function \(X(m)\) that returns the genotypes of all individuals at \(m\) successive locus blocks each time it is called, and NULL when no loci are left.

subpops

The length-\(n\) vector of subpopulation assignments for each individual. If missing, every individual is effectively treated as a different population.

n

Number of individuals (required only when \(X\) is a function, ignored otherwise). If \(n\) is missing but subpops is not, \(n\) is taken to be the length of subpops.

lociOnCols

If true, \(X\) has loci on columns and individuals on rows; if false (the default), loci are on rows and individuals on columns. Has no effect if \(X\) is a function. If \(X\) is a BEDMatrix object, lociOnCols=TRUE is set automatically.

memLim

Memory limit in GB, used to break up genotype data into chunks for very large datasets. Note that memory usage is somewhat underestimated and is not strictly controlled. The default on Linux and Windows is 70% of the free system memory; on OSX and other systems it is 1 GB.
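
When the genotypes cannot be held in memory at once, \(X\) may be given as a function, as described for the X argument. The following is a minimal base-R sketch of such a function. It assumes (this is an assumption, not stated in this page) that each call should return a matrix with loci on rows and individuals on columns. The names makeGenotypeReader and readBlock are hypothetical and used only for illustration.

```r
# Hypothetical helper (not part of popkin): wraps an in-memory matrix G
# (loci on rows, individuals on columns) in a function that hands out
# up to m loci per call and NULL once all loci have been returned.
makeGenotypeReader <- function(G) {
  nextLocus <- 1L                            # first locus not yet returned
  function(m) {
    if (nextLocus > nrow(G)) return(NULL)    # no loci left
    last <- min(nextLocus + m - 1L, nrow(G)) # last locus of this block
    block <- G[nextLocus:last, , drop = FALSE]
    nextLocus <<- last + 1L                  # remember position for next call
    block
  }
}

G <- matrix(c(0,1,2, 1,0,1, 1,0,2), nrow = 3, byrow = TRUE)
readBlock <- makeGenotypeReader(G)
first  <- readBlock(2)   # loci 1 and 2
second <- readBlock(2)   # locus 3 only
done   <- readBlock(2)   # NULL: all loci consumed
```

A function of this kind would be passed in place of the matrix, together with n or subpops so that the number of individuals is known.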

Value

The estimated \(n \times n\) kinship matrix \(\Phi^T\). If \(X\) has names for the individuals, they will be copied to the rows and columns of this kinship matrix.

Details

The subpopulation assignments are only used to estimate the baseline kinship (the zero value). If the user wants to re-estimate \(\Phi^T\) using different subpopulation labels, it suffices to rescale the given \(\Phi^T\) using rescalePopkin (as opposed to starting from the genotypes again, which gives the same answer less efficiently).
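
The idea of rescaling an existing estimate can be illustrated with a short base-R sketch. It assumes that the rescaling takes the form \((\Phi - \phi_{min}) / (1 - \phi_{min})\), where \(\phi_{min}\) is the smallest mean kinship between two different subpopulations. This formula and the function name rescaleSketch are assumptions made for illustration, not the code of rescalePopkin.

```r
# Hypothetical base-R sketch; NOT the package's rescalePopkin.
# Requires at least two distinct subpopulation labels.
rescaleSketch <- function(Phi, subpops) {
  labels <- unique(subpops)
  phiMin <- Inf
  # mean kinship for every pair of distinct subpopulations
  for (a in labels) {
    for (b in labels) {
      if (a == b) next
      blockMean <- mean(Phi[subpops == a, subpops == b])
      phiMin <- min(phiMin, blockMean)
    }
  }
  # shift and scale so the most distant pair averages to zero
  (Phi - phiMin) / (1 - phiMin)
}

# Toy kinship matrix for three individuals (made-up values)
Phi <- matrix(c(0.6, 0.3, 0.1,
                0.3, 0.6, 0.1,
                0.1, 0.1, 0.5), nrow = 3, byrow = TRUE)
PhiNew <- rescaleSketch(Phi, c(1, 1, 2))
```

Under these assumptions, the kinship between individuals in subpopulations 1 and 2 becomes zero, and all other entries are expressed relative to that new baseline.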

The matrix \(X\) must have values only in c(0,1,2,NA), encoded to count the number of reference alleles at the locus, or NA for missing data.
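
The encoding requirement above can be checked before calling the function. The following base-R sketch shows one way to do so; checkGenotypes is a hypothetical helper and is not part of popkin.

```r
# Hypothetical pre-check (not part of popkin): TRUE if every entry of X
# is 0, 1, 2, or NA.
checkGenotypes <- function(X) {
  ok <- is.na(X) | X %in% c(0, 1, 2)
  all(ok)
}

X <- matrix(c(0,1,2, 1,0,NA, 1,0,2), nrow = 3, byrow = TRUE)
valid <- checkGenotypes(X)                 # all entries allowed
bad   <- checkGenotypes(replace(X, 1, 3))  # first entry set to 3
```

Here valid is TRUE and bad is FALSE, since 3 is not an allowed genotype value.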

Examples

library(popkin)

## Construct toy data
X <- matrix(c(0,1,2,1,0,1,1,0,2), nrow=3, byrow=TRUE) # genotype matrix (loci on rows, individuals on columns)
subpops <- c(1,1,2) # subpopulation assignments for individuals

## NOTE: for BED-formatted input, use BEDMatrix!
## "file" is path to BED file (excluding .bed extension)
# library(BEDMatrix)
# X <- BEDMatrix(file) # load genotype matrix object

Phi <- popkin(X, subpops) # calculate kinship from genotypes and subpopulation labels
