This function is used to perform genome-wide association mapping. The phenotypic and SNP data should already be read in prior to running this function
(see below for examples). AM
builds the linear mixed model iteratively, via forward selection. It is through this model building process that we
identify the SNP-trait associations. We use the extended BIC (extBIC) to decide on the 'best' model and when to stop looking for a better model.
The conservativeness of extBIC can be adjusted. If the lambda
parameter is left at is default setting, then AM
is run in its most
conservative state (i.e. false positives are minimized but this also decreases the chance of true positives).
When interested in running AM
at a certain false positive rate, use FPR4AM
. This function uses permutation to
find the lambda value for a desired false positive rate for AM
.
Below are some examples of how to use AM
for genome-wide association mapping of data.
How to perform a basic AM analysis
Suppose,
the snp data are contained in the file geno.txt which is a plain space separated
text file with no column headings. The file is located in the current working directory.
It contains numeric genotype values 0, 1, and 2 for snp genotypes
AA, AB, and BB, respectively. It also contains the value X for a missing genotype.
the phenotype data is contained in the file pheno.txt which is a plain space
separated text file containing a single column with the trait data. The first row of the file
has the column heading 'y'. This file does not contain any missing data.
The file is located in the current working directory.
there is no map data.
To analyse these data, we would use the following three functions (the parameters can be specified in any order, as well as the
functions as long as AM is run last):
geno_obj <- ReadMarker(filename='geno.txt', AA=0, AB=1, BB=2, type="text", missing='X')
pheno_obj <- ReadPheno(filename='pheno.txt')
# since lambda is not specified, this will run AM conservatively (where the false positive rate is lowest).
res <- AM(trait='y', geno=geno_obj, pheno=pheno_obj)
A table of results is printed to the screen and saved in the R object res
.
How to perform a more complicated AM analysis where the false positive rate is 5%
Suppose,
the snp data are contained in the file geno.ped which is a 'PLINK' ped file. See
ReadMarker
for details. The file is located in /my/dir. Let's assume
the file is large, say 50 gigabytes, and our computer only has 32 gigabytes of RAM.
the phenotype data is contained in the file pheno.txt which is a plain space
separated text file with six columns. The first row of the file contains the column headings.
The first column is a trait and is labeled y1.
The second column is another trait and is labeled y2. The third and fourth columns
are nuisance variables and are labeled cov1 and cov2. The fifth and sixth columns
are the first two principal components to account for population substructure and are
labeled pc1 and pc2. The file contains missing data that are coded as 99.
The file is located in /my/dir.
the map data is contained in the file map.txt, is also located in
/my/dir, and the first row has the column headings.
An 'AM' analysis is performed where the trait of interest is y2, and
the fixed effects part of the model is cov1 + cov2 + pc1 + pc2,
To analyse these data, we would run the following:
geno_obj <- ReadMarker(filename='/my/dir/geno.ped', type='PLINK', availmemGb=32)
pheno_obj <- ReadPheno(filename='/my/dir/pheno.txt', missing=99)
map_obj <- ReadMap(filename='/my/dir/map.txt')
# FPR4AM calculates the lambda value corresponding to a desired false positive rate of 5%
ans <- FPR4AM(falseposrate=0.05, numreps=200, trait='y2', fformula=c('cov1 + cov2 + pc1 + pc2'),
geno=geno_obj, pheno=pheno_obj, map=map_obj)
# performs association mapping with a 5% false positive rate
res <- AM(trait='y2', fformula=c('cov1 + cov2 + pc1 + pc2'),
geno=geno_obj, pheno=pheno_obj, map=map_obj, lambda=ans$setlambda)
A table of results is printed to the screen and saved in the R object res
.
How to perform an analysis where individuals have multiple observations
Suppose,
the snp data are contained in the file geno.ped which is a 'PLINK' ped file. See
ReadMarker
for details. The file is located in /my/dir. Let's assume
the file is large, say 50 gigabytes, and our computer only has 32 gigabytes of RAM.
the phenotype data is contained in the file pheno.txt which is a plain space
separated text file with six columns. The first row of the file contains the column headings.
The first column is a trait and is labeled y1.
The second column is another trait and is labeled y2. The third and fourth columns
are nuisance variables and are labeled cov1 and cov2. The fifth and sixth columns
are the first two principal components to account for population substructure and are
labeled pc1 and pc2. The file contains missing data that are coded as 99.
The file is located in /my/dir.
the Z matrix data are contained in the file Zmatrix.txt. The file is located in /my/dir.This file is
a design matrix that only contains zeros and ones where each row must contain only a single one in the column that matches
the individual's trait value to their corresponding genotype.
the map data is contained in the file map.txt, is also located in
/my/dir, and the first row has the column headings.
An 'AM' analysis is performed where the trait of interest is y2, and
the fixed effects part of the model is cov1 + cov2 + pc1 + pc2.
To analyse these data, we would run the following:
geno_obj <- ReadMarker(filename='/my/dir/geno.ped', type='PLINK', availmemGb=32)
pheno_obj <- ReadPheno(filename='/my/dir/pheno.txt', missing=99)
map_obj <- ReadMap(filename='/my/dir/map.txt')
Zmat_obj <- ReadZmat(filename='/my/dir/Zmatrix.txt')
res <- AM(trait='y2', fformula=c('cov1 + cov2 + pc1 + pc2'),
geno=geno_obj, pheno=pheno_obj, map=map_obj, Zmat=Zmat_obj )
A table of results is printed to the screen and saved in the R object res
.
Dealing with missing marker data
AM
can tolerate some missing marker data. However, ideally,
a specialized genotype imputation program such as 'BEAGLE' or 'PHASE2', should be
used to impute the missing marker data before being read into 'Eagle'.
Dealing with missing trait data
AM
deals automatically with individuals with missing trait data.
These individuals are removed from the analysis and a warning message is generated.
Dealing with missing explanatory variable values
AM
deals automatically with individuals with missing explanatory variable values.
These individuals are removed from the analysis and a warning message is generated
Error Checking
Most errors occur when reading in the data. However, as an extra precaution, if quiet=FALSE
, then additional
output is printed during the running of AM
. If AM
is failing, then this output can be useful for diagnosing
the problem.