FitAdmixedModelFindUnknowns: Fit the OriGen model and place unknown individuals who may be admixed

Description

This function fits the OriGen model and places individuals of unknown origin who may be admixed. Rather than the probability that an individual comes from each location, it estimates the admixture fraction at each location.

Usage

FitAdmixedModelFindUnknowns(DataArray,SampleCoordinates,UnknownData,
	MaxGridLength=20,RhoParameter=10,LambdaParameter=100,MaskWater=TRUE)

Arguments

DataArray

An array giving the number of major/minor alleles (the major allele being the one occurring most often in the dataset) for each SNP, grouped by sample site. The dimension of this array is [2,SampleSites,NumberSNPs].

SampleCoordinates

This is an array which gives the longitude and latitude of each of the found sample sites. The dimension of this array is [SampleSites,2], where the second dimension represents longitude and latitude respectively.

UnknownData

An array showing the unknown individuals' genetic data. The dimension of this array is [NumberUnknowns,NumberSNPs].

MaxGridLength

An integer giving the maximum number of boxes to fill the longer side of the region. Note that computation time increases quadratically as this number increases, but this number should also be high enough to separate different sample sites; otherwise they will be binned together as a single site.

RhoParameter

This is a real precision parameter weighting the amount of smoothing in the allele frequency surface. A higher value flattens out the surface while a lower value allows for more fluctuations. The default value of 10 was used in our analysis.

LambdaParameter

This is a real precision parameter weighting the admixture fractions algorithm. For the most part, this does not need to be changed as it seems to only affect the time to convergence.

MaskWater

Logical value; if TRUE, water is removed from the plotted regions.
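
The dimension requirements above can be checked before fitting. The following is a minimal sketch, not part of OriGen: the helper name CheckAdmixedInputs is hypothetical, and it only verifies that the three arrays agree on the number of sample sites and SNPs as described in the argument list.

```r
# Hypothetical helper (not an OriGen function): verifies the documented shapes
# DataArray [2,SampleSites,NumberSNPs], SampleCoordinates [SampleSites,2],
# UnknownData [NumberUnknowns,NumberSNPs].
CheckAdmixedInputs <- function(DataArray, SampleCoordinates, UnknownData) {
  d <- dim(DataArray)
  stopifnot(
    length(d) == 3,                                  # three-dimensional array
    d[1] == 2,                                       # major/minor counts
    identical(dim(SampleCoordinates), c(d[2], 2L)),  # one longitude/latitude pair per site
    length(dim(UnknownData)) == 2,
    ncol(UnknownData) == d[3]                        # same SNPs as DataArray
  )
  invisible(TRUE)
}

# Invented inputs with consistent shapes: 10 sites, 4 SNPs, 2 unknowns
DataArray <- array(1L, dim = c(2, 10, 4))
SampleCoordinates <- array(0, dim = c(10, 2))
UnknownData <- array(0L, dim = c(2, 4))
CheckAdmixedInputs(DataArray, SampleCoordinates, UnknownData)
```

A mismatch, such as an UnknownData array with fewer SNP columns than DataArray, stops with an error before any model fitting is attempted.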

Value

List with the following components:

AdmixtureFractions

An array giving the admixture fraction from each location, that is, the fractional contribution of that location to the unknown individual's genetic data. The dimension of this array is [NumberLongitudeDivisions, NumberLatitudeDivisions, NumberUnknowns], where either NumberLongitudeDivisions or NumberLatitudeDivisions is equal to MaxGridLength (an input to this function) and the other is scaled so that the geodesic distance between points horizontally and vertically is equal.

DataArray

An array giving the number of major/minor alleles (the major allele being the one occurring most often in the dataset) for each SNP, grouped by sample site. The dimension of this array is [2, SampleSites, NumberSNPs].

NumberSNPs

This shows the integer number of SNPs found.

GridLength

An array giving the number of longitudinal and latitudinal divisions. The dimension of this array is [2], where the first number is longitude and the second is latitude.

RhoParameter

A real value showing the inputted RhoParameter value.

SampleSites

This shows the integer number of sample sites found.

MaxGridLength

An integer giving the maximum number of boxes to fill the longer side of the region. Note that computation time increases quadratically as this number increases, but this number should also be high enough to separate different sample sites; otherwise they will be binned together as a single site. This number was part of the inputs.

SampleCoordinates

This is an array which gives the longitude and latitude of each of the found sample sites. The dimension of this array is [SampleSites,2], where the second dimension represents longitude and latitude respectively.

GridCoordinates

An array showing the corresponding coordinates for each longitude and latitude division. The dimension of this array is [2,MaxGridLength], with longitude coordinates coming first and latitude second. Note that one of these rows may not be filled entirely. The associated output GridLength should be used to find the lengths of the two rows. Rows not filled in entirely will contain zeroes at the end.

NumberUnknowns

This is an integer value showing the number of unknowns found in UnknownData.

UnknownData

An array showing the unknown individuals' genetic data. The dimension of this array is [NumberUnknowns,NumberSNPs].

IsLand

This is a logical valued array that is TRUE when the given coordinates are over land and FALSE when over water. The dimension of this array is [GridLength[1],GridLength[2]].
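
As an illustration of how these components fit together, the sketch below trims the zero padding from GridCoordinates using GridLength and then finds the grid cell with the largest admixture fraction for each unknown. The list `fit` is a hand-built mock that only mimics the documented layout; its values are invented and it is not output from FitAdmixedModelFindUnknowns.

```r
# Mock result following the documented layout (all values invented)
fit <- list(
  GridLength = c(4, 3),                      # 4 longitude, 3 latitude divisions
  MaxGridLength = 4,
  GridCoordinates = array(0, dim = c(2, 4)),
  NumberUnknowns = 2,
  AdmixtureFractions = array(0, dim = c(4, 3, 2))
)
fit$GridCoordinates[1, 1:4] <- c(-5, 5, 15, 25)  # longitudes fill the whole row
fit$GridCoordinates[2, 1:3] <- c(36, 46, 56)     # latitudes; last entry stays 0
fit$AdmixtureFractions[2, 1, 1] <- 0.7
fit$AdmixtureFractions[4, 3, 1] <- 0.3
fit$AdmixtureFractions[1, 2, 2] <- 1

# Use GridLength to drop the trailing zeroes from each coordinate row
lon <- fit$GridCoordinates[1, seq_len(fit$GridLength[1])]
lat <- fit$GridCoordinates[2, seq_len(fit$GridLength[2])]

# For each unknown, locate the cell with the largest admixture fraction
top_cells <- t(sapply(seq_len(fit$NumberUnknowns), function(k) {
  m <- fit$AdmixtureFractions[, , k]
  idx <- arrayInd(which.max(m), dim(m))      # (longitude index, latitude index)
  c(Longitude = lon[idx[1]], Latitude = lat[idx[2]], Fraction = m[idx[1], idx[2]])
}))
print(top_cells)
```

Indexing with GridLength rather than MaxGridLength matters here because a padded zero in the shorter row would otherwise be read as a real coordinate.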

References

Ranola J, Novembre J, Lange K (2014) Fast Spatial Ancestry via Flexible Allele Frequency Surfaces. Bioinformatics, in press.

Examples

#this example not run because it takes longer than 5 secs
#note - type example(FunctionName, run.dontrun=TRUE) to run the example where FunctionName is
#the name of the function
#Data generation
	SampleSites=10
	NumberSNPs=4
	TestData=array(sample(2*(1:30),2*SampleSites*NumberSNPs,replace=TRUE),
		dim=c(2,SampleSites,NumberSNPs))
	#Europe spans roughly longitude -9 to 38 and latitude 34 to 60
	TestCoordinates=array(0,dim=c(SampleSites,2))
	TestCoordinates[,1]=runif(SampleSites,-9,38)
	TestCoordinates[,2]=runif(SampleSites,34,60)

	#This code simulates the number of major alleles the unknown individuals have.
	NumberUnknowns=2
	TestUnknowns=array(sample(0:2,NumberUnknowns*NumberSNPs,replace=TRUE),
		dim=c(NumberUnknowns,NumberSNPs))

	#Fitting the admixed model
	#MaxGridLength is the maximum number of boxes allowed to span the region in either direction
	#Note that MaxGridLength is reduced here to allow the example to run in less than 5 secs
	#RhoParameter is a tuning constant
	print(paste("MaxGridLength is intentionally set really low for fast examples.",
		"Meaningful results will most likely require a higher value."))
	trials6=FitAdmixedModelFindUnknowns(TestData,TestCoordinates,
		TestUnknowns,MaxGridLength=8,RhoParameter=10)

	#Plots the admixed surface disregarding fractions less than 0.01
	PlotAdmixedSurface(trials6)
