Reads data for a QTL experiment from a set of files and converts it into an object of class cross. The comma-delimited format (csv) is recommended. All formats require chromosome assignments for the genetic markers, and assume that markers are in their correct order.
read.cross(format=c("csv", "csvr", "csvs", "csvsr", "mm", "qtx",
                    "qtlcart", "gary", "karl", "mapqtl", "tidy"),
           dir="", file, genfile, mapfile, phefile, chridfile,
           mnamesfile, pnamesfile, na.strings=c("-","NA"),
           genotypes=c("A","H","B","D","C"), alleles=c("A","B"),
           estimate.map=TRUE, convertXdata=TRUE, error.prob=0.0001,
           map.function=c("haldane", "kosambi", "c-f", "morgan"),
           BC.gen=0, F.gen=0, crosstype, ...)

format: Specifies the format of the data.

dir: Directory in which the data files will be found. In Windows, use forward slashes ("/") or double backslashes ("\\") to specify directory trees.

file: The main input file for formats csv, csvr and mm.

genfile: File with genotype data (formats csvs, csvsr, karl, gary and mapqtl only).

mapfile: File with marker position information (all except the csv formats).

phefile: File with phenotype data (formats csvs, csvsr, karl, gary and mapqtl only).

chridfile: File with the chromosome ID for each marker (gary format only).

mnamesfile: File with marker names (gary format only).

pnamesfile: File with phenotype names (gary format only).

na.strings: A vector of strings which are to be interpreted as missing values (csv and gary formats only). For the csv formats, these are interpreted globally for the entire file, so missing value codes in phenotypes must not be valid genotypes, and vice versa. For the gary format, these are used only for the phenotype data.

genotypes: A vector of character strings specifying the genotype codes (csv formats only). Generally this is a vector of length 5, with the elements corresponding to AA, AB, BB, not BB (i.e., AA or AB), and not AA (i.e., AB or BB). Note: Pay careful attention to the third and fourth of these; the order of these can be confusing. If you are trying to read 4-way cross data, your file must have genotypes coded as described below, and you need to set genotypes=NULL so that no re-coding gets done.
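alleles: A vector of one-letter character strings to be used as labels for the alleles.

estimate.map: For all formats but qtlcart, mapqtl, and karl: if TRUE and marker positions are not included in the input files, the genetic map is estimated using the function est.map.

convertXdata: If TRUE, any X chromosome genotype data is converted to the internal standard, using columns sex and pgm in the phenotype data if they are available or by inference if they are not. If FALSE, the X chromosome data is read as is.

error.prob: Assumed genotyping error rate, used if the genetic map must be estimated.

map.function: Map function (Haldane, Kosambi, Carter-Falconer, or Morgan) to use if the genetic map must be estimated.

BC.gen: Used only for cross type "bcsft".

F.gen: Used only for cross type "bcsft".

crosstype: Optional character string to force a particular cross type.

...: Additional arguments, passed to the function read.table in the case of csv and csvr formats. In particular, one may use the argument sep to specify the field separator (the default is a comma), dec to specify the character used for the decimal point (the default is a period), and comment.char to specify a character to indicate comment lines.

The function returns an object of class cross, which is a list with two components, geno and pheno.

geno is a list with elements corresponding to chromosomes. names(geno)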
contains the names of the chromosomes. Each chromosome is itself a list, and is given class A or X according to whether it is autosomal or the X chromosome.

There are two components for each chromosome: data, a matrix whose rows are individuals and whose columns are markers, and map, either a vector of marker positions (in cM) or a matrix of dim (2 x n.mar) where the rows correspond to marker positions in female and male genetic distance, respectively.

The genotype data gets converted into numeric codes, as follows.

The genotype data for a backcross is coded as NA = missing, 1 = AA, 2 = AB.

For an F2 intercross, the coding is NA = missing, 1 = AA, 2 = AB, 3 = BB, 4 = not BB (i.e. AA or AB; D in Mapmaker/qtl), 5 = not AA (i.e. AB or BB; C in Mapmaker/qtl).

For a 4-way cross, the mother and father are assumed to have genotypes AB and CD, respectively. The genotype data for the progeny is assumed to be phase-known, with the following coding scheme: NA = missing, 1 = AC, 2 = BC, 3 = AD, 4 = BD, 5 = A = AC or AD, 6 = B = BC or BD, 7 = C = AC or BC, 8 = D = AD or BD, 9 = AC or BD, 10 = AD or BC, 11 = not AC, 12 = not BC, 13 = not AD, 14 = not BD.
pheno is a data.frame of size (n.ind x n.phe) containing the phenotypes. If a phenotype with the name id or ID is included, these identifiers will be used in top.errorlod, plotErrorlod, and plotGeno as identifiers for the individual.

A number of functions, such as subset.cross, are available to assist in pulling out portions of the data.
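
As a rough illustration of the F2 intercross coding described above, the following base-R sketch maps genotype strings to numeric codes by their position in the genotypes vector. The helper name encode_f2 and the sample calls are made up for this example; this is not the code that read.cross uses internally.

```r
# Default genotype codes for the csv formats, in the order
# AA, AB, BB, not BB, not AA (see the genotypes argument).
genotypes <- c("A", "H", "B", "D", "C")

# Hypothetical helper: the position of a call in `genotypes` gives its
# numeric code; anything listed in `na.strings` becomes NA.
encode_f2 <- function(calls, genotypes, na.strings = c("-", "NA")) {
  calls[calls %in% na.strings] <- NA
  match(calls, genotypes)
}

calls <- c("A", "H", "B", "-", "D", "C")
codes <- encode_f2(calls, genotypes)
print(codes)
```

Under this scheme "D" maps to 4 (not BB) and "C" maps to 5 (not AA), which is the ordering the genotypes argument asks you to check carefully.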
The X chromosome requires special care. It should be given the chromosome identifier X or x. If it is labeled by a number or by Xchr, it will be interpreted as an autosome.

The phenotype data should contain a column named "sex" which indicates the sex of each individual, either coded as 0=female and 1=male, or as a factor with levels female/male or f/m. Case will be ignored both in the name and in the factor levels. If no such phenotype column is included, it will be assumed that all individuals are of the same sex.

In the case of an intercross, the phenotype data may also contain a column named "pgm" (for "paternal grandmother") indicating the direction of the cross. It should be coded as 0/1 with 0 indicating the cross (AxB)x(AxB) or (BxA)x(AxB) and 1 indicating the cross (AxB)x(BxA) or (BxA)x(BxA). If no such phenotype column is included, it will be assumed that all individuals come from the same direction of cross.

The internal storage of X chromosome data is quite different from that of autosomal data. Males are coded 1=AA and 2=BB; females with pgm==0 are coded 1=AA and 2=AB; and females with pgm==1 are coded 1=BB and 2=AB. If the argument convertXdata is TRUE, conversion to this format is made automatically; if FALSE, no conversion is done, summary.cross will likely return a warning, and most analyses will not work properly. Use of convertXdata=FALSE (in which case the X chromosome genotypes will not be converted to our internal standard) can be useful for diagnosing problems in the data, but will require some serious mucking about in the internal data structure.

For the csv format, the input file is a text file with a specified field delimiter (sep, which will be passed to the function read.table). For example, in
Europe, it is common to use a comma in place of the decimal point in numbers and so a semi-colon in place of a comma as the field separator; such data may be read by using sep=";" and dec=",".

The first line should contain the phenotype names followed by the marker names. At least one phenotype must be included; for example, include a numerical index for each individual.

The second line should contain blanks in the phenotype columns, followed by chromosome identifiers for each marker in all other columns. If a chromosome has the identifier X or x, it is assumed to be the X chromosome; otherwise, it is assumed to be an autosome.

An optional third line should contain blanks in the phenotype columns, followed by marker positions, in cM.

Marker order is taken from the cM positions, if provided; otherwise, it is taken from the column order.

Subsequent lines should give the data, with one line for each individual, and with phenotypes followed by genotypes. If possible, phenotypes are made numeric; otherwise they are converted to factors.

The genotype codes must be the same across all markers. For example, you can't have one marker coded AA/AB/BB and another coded A/H/B. This includes genotypes for the X chromosome, for which hemizygous individuals should be coded as if they were homozygous.

The cross is determined to be a backcross if only the first two elements of the genotypes string are found; otherwise, it is assumed to be an intercross.

The csvr format is just like the csv
format, but rotated (or really
transposed), so that rows are columns and columns are rows.

The csvs format is just like the csv format, but with separate files for the genotype and phenotype data.

The first column in the genotype data must specify individuals' identifiers, and there must be a column in the phenotype data with precisely the same information (and with the same name). These IDs will be included in the data as a phenotype. If the name id or ID is used, these identifiers will be used in top.errorlod, plotErrorlod, and plotGeno as identifiers for the individual.

The first row in each file contains the column names. For the phenotype file, these are the names of the phenotypes. For the genotype file, the first cell will be the name of the identifier column (id or ID) and the subsequent fields will be the marker names.

In the genotype data file, the second row gives the chromosome IDs. The cell in the second row, first column, must be blank. A third row giving cM positions of markers may be included, in which case the cell in the third row, first column, must be blank.

There need be no blank rows in the phenotype data file.

The csvsr format is just like the csvs
format, but with each file rotated
(or really transposed), so that rows are columns and columns are rows.

For the Mapmaker (mm) format, the first file, specified with the argument file, contains the genotype and phenotype data. Rows beginning with the symbol # are ignored. The first line should be either data type f2 intercross or data type f2 backcross. The second line should begin with three numbers indicating the numbers of individuals, markers and phenotypes in the file. This line may include the word symbols followed by symbol assignments (see the documentation for mapmaker, and cross your fingers). The rest of the lines give genotype data followed by phenotype data, with marker and phenotype names always beginning with the * symbol.

A second file contains the genetic map information, specified with the argument mapfile. The map file may be in one of two formats. The function will determine which format of map file is presented.

The simplest format for the map file is not standard for the Mapmaker software, but is easy to create. The file contains two or three columns separated by white space and with no header row. The first column gives the chromosome assignments. The second column gives the marker names, with markers listed in the order along the chromosomes. An optional third column lists the map positions of the markers.

Another possible format for the map file is the .maps format, which is produced by Mapmaker. The code for reading this format was written by Brian Yandell.

Marker order is taken from the map file, either by the order they are presented or by the cM positions, if specified.

The qtlcart format requires the .cro
and .map files for QTL Cartographer (produced by the QTL Cartographer sub-program, Rmap and Rcross).

Note that the QTL Cartographer cross types are converted as follows: RF1 to riself, RF2 to risib, RF0 (doubled haploids) to bc, B1 or B2 to bc, RF2 or SF2 to f2.

The gary format uses the following files.

genfile (default = "geno.dat") contains the genotype data. The file contains one line per individual, with genotypes for the set of markers separated by white space. Missing values are coded as 9, and genotypes are coded as 0/1/2 for AA/AB/BB.

mapfile (default = "markerpos.txt") contains two columns with no header row: the marker names in the first column and their cM position in the second column. If marker positions are not available, use mapfile=NULL, and a dummy map will be inserted.

phefile (default = "pheno.dat") contains the phenotype data, with one row for each mouse and one column for each phenotype. There should be no header row, and missing values are coded as "-".

chridfile (default = "chrid.dat") contains the chromosome identifier for each marker.

mnamesfile (default = "mnames.txt") contains the marker names.

pnamesfile (default = "pnames.txt") contains the names of the phenotypes. If the phenotype names file is not available, use pnamesfile=NULL; arbitrary phenotype names will then be assigned.

The karl format uses the following files.

genfile (default = "gen.txt") contains the genotype data. The file contains one line per individual, with genotypes separated by white space. Missing values are coded 0; genotypes are coded as 1/2/3/4/5 for AA/AB/BB/not BB/not AA.

mapfile (default = "map.txt") contains the map information, in the following complicated format:
n.chr
n.mar(1) rf(1,1) rf(1,2) ... rf(1,n.mar(1)-1)
mar.name(1,1)
mar.name(1,2)
...
mar.name(1,n.mar(1))
n.mar(2)
...
etc.
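
To make the layout above more concrete, here is a small base-R sketch that writes a file following it for two chromosomes. The marker names, recombination fractions and the temporary file location are all invented for illustration, and the sketch only follows the layout as printed above; it has not been checked against read.cross.

```r
# Hypothetical map: two chromosomes, with recombination fractions
# between adjacent markers (all names and values are made up).
map <- list(
  list(markers = c("m1", "m2", "m3"), rf = c(0.10, 0.15)),
  list(markers = c("m4", "m5"),       rf = 0.20)
)

# First line: number of chromosomes. Then, for each chromosome, one line
# with n.mar followed by the n.mar - 1 recombination fractions, and
# one line per marker name.
lines <- as.character(length(map))
for (chr in map) {
  lines <- c(lines,
             paste(c(length(chr$markers), chr$rf), collapse = " "),
             chr$markers)
}

mapfile <- file.path(tempdir(), "map.txt")
writeLines(lines, mapfile)
cat(readLines(mapfile), sep = "\n")
```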
phefile (default = "phe.txt") contains a matrix of phenotypes, with one individual per line. The first line in the file should give the phenotype names.

For the mapqtl format, genfile corresponds to the loc file containing the genotype data. Each marker and its genotypes should be on a single line.

mapfile corresponds to the map file containing the linkage group assignment, marker names and their map positions.

phefile corresponds to the qua file containing the phenotypes.

For the moment, only 4-way crosses are supported (CP population type in MapQTL).

The available formats are comma-delimited (csv), rotated comma-delimited (csvr), comma-delimited with separate files for genotype and phenotype data (csvs), rotated comma-delimited with separate files for genotype and phenotype data (csvsr), Mapmaker (mm), Map Manager QTX (qtx), Gary Churchill's format (gary), Karl Broman's format (karl) and MapQTL/JoinMap (mapqtl). The required files and their specification are described separately for each format. The comma-delimited formats are recommended. Note that most of these formats work only for backcross and intercross data.

The sampledata directory in the package distribution contains sample data files in multiple formats. Also see http://www.rqtl.org/sampledata.
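
One way to look for those sample files locally is sketched below. It assumes that the installed qtl package exposes them in a top-level sampledata directory that system.file can find; if the package is not installed, or the directory is laid out differently, the result is simply an empty vector.

```r
# Locate the sample data shipped with the package, if it is installed.
# system.file() returns "" when the directory cannot be found.
sample_dir <- if (requireNamespace("qtl", quietly = TRUE)) {
  system.file("sampledata", package = "qtl")
} else {
  ""
}

sample_files <- if (nzchar(sample_dir)) list.files(sample_dir) else character(0)
print(sample_files)
```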
The ... argument enables additional arguments to be passed to the function read.table in the case of csv and csvr formats. In particular, one may use the argument sep to specify the field separator (the default is a comma), dec to specify the character used for the decimal point (the default is a period), and comment.char to specify a character to indicate comment lines.
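
The sketch below shows how sep and dec might be used for a file written with European conventions. The file name, marker names and all values are invented. The base-R step only demonstrates what sep and dec mean to read.table; the read.cross call is attempted only if the qtl package is installed.

```r
# Write a tiny intercross in the csv layout, using ";" as the field
# separator and "," as the decimal point (all values are made up).
lines <- c(
  "pheno;id;m1;m2;m3;m4",   # phenotype names, then marker names
  ";;1;1;2;2",              # chromosome identifiers
  ";;0,0;12,5;0,0;8,3",     # marker positions in cM
  "10,5;1;A;H;B;H",
  "11,2;2;H;H;A;A",
  "9,8;3;B;H;H;B",
  "12,1;4;A;A;H;H"
)
csvfile <- file.path(tempdir(), "european.csv")
writeLines(lines, csvfile)

# sep and dec have the same meaning as in read.table, so the data rows
# can be inspected with base R first (skipping the three header lines).
rows <- read.table(csvfile, sep = ";", dec = ",", skip = 3, header = FALSE)
print(rows$V1)

# Pass the same arguments through read.cross when qtl is available.
if (requireNamespace("qtl", quietly = TRUE)) {
  cross <- qtl::read.cross("csv", dir = tempdir(), file = "european.csv",
                           sep = ";", dec = ",")
  print(summary(cross))
}
```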
See also: subset.cross, summary.cross, plot.cross, c.cross, clean.cross, write.cross, sim.cross, read.table.
The sampledata
directory in the package distribution contains
sample data files in multiple formats. Also see
http://www.rqtl.org/sampledata.
## Not run:
# # CSV format
# dat1 <- read.cross("csv", dir="Mydata", file="mydata.csv")
#
# # CSVS format
# dat2 <- read.cross("csvs", dir="Mydata", genfile="mydata_gen.csv",
# phefile="mydata_phe.csv")
#
# # you can read files directly from the internet
# datweb <- read.cross("csv", "http://www.rqtl.org/sampledata",
# "listeria.csv")
#
# # Mapmaker format
# dat3 <- read.cross("mm", dir="Mydata", file="mydata.raw",
# mapfile="mydata.map")
#
# # Map Manager QTX format
# dat4 <- read.cross("qtx", dir="Mydata", file="mydata.qtx")
#
# # QTL Cartographer format
# dat5 <- read.cross("qtlcart", dir="Mydata", file="qtlcart.cro",
# mapfile="qtlcart.map")
#
# # Gary format
# dat6 <- read.cross("gary", dir="Mydata", genfile="geno.dat",
# mapfile="markerpos.txt", phefile="pheno.dat",
# chridfile="chrid.dat", mnamesfile="mnames.txt",
# pnamesfile="pnames.txt")
#
# # Karl format
# dat7 <- read.cross("karl", dir="Mydata", genfile="gen.txt",
# phefile="phe.txt", mapfile="map.txt")
## End(Not run)
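
The examples above rely on data files that are not provided. As a self-contained variant of the csvs example, the sketch below writes two small files to a temporary directory, following the csvs layout (shared id column, chromosome IDs in the second row of the genotype file, cM positions in the third). File names, markers and values are invented, and the read.cross call runs only if the qtl package is installed.

```r
# Genotype file: id column first, then markers; row 2 gives chromosome
# IDs and row 3 gives cM positions, each with a blank first cell.
gen_lines <- c(
  "id,m1,m2,m3,m4",
  ",1,1,2,2",
  ",0,12.5,0,8.3",
  "1,A,H,B,H",
  "2,H,H,A,A",
  "3,B,H,H,B",
  "4,A,A,H,H"
)
# Phenotype file: header row, with an id column of the same name.
phe_lines <- c(
  "id,pheno",
  "1,10.5",
  "2,11.2",
  "3,9.8",
  "4,12.1"
)
writeLines(gen_lines, file.path(tempdir(), "mini_gen.csv"))
writeLines(phe_lines, file.path(tempdir(), "mini_phe.csv"))

# Sanity check with base R: the identifiers in the two files must agree.
gen_ids <- vapply(strsplit(gen_lines[-(1:3)], ","), `[`, character(1), 1)
phe <- read.csv(file.path(tempdir(), "mini_phe.csv"))
ids_match <- identical(gen_ids, as.character(phe$id))
print(ids_match)

if (requireNamespace("qtl", quietly = TRUE)) {
  mini <- qtl::read.cross("csvs", dir = tempdir(),
                          genfile = "mini_gen.csv", phefile = "mini_phe.csv")
  print(summary(mini))
}
```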