Creates a BGData object from a .raw file (generated with --recodeA in PLINK).
Other text-based file formats are supported as well by tweaking some of the
parameters, as long as the records of individuals are in rows, and
phenotypes, covariates and markers are in columns.
readRAW(fileIn, header = TRUE, dataType = integer(), n = NULL,
  p = NULL, sep = "", na.strings = "NA", nColSkip = 6L,
  idCol = c(1L, 2L), nNodes = NULL, linked.by = "rows",
  folderOut = paste0("BGData_", sub("\\.[[:alnum:]]+$", "",
  basename(fileIn))), outputType = "byte",
  dimorder = if (linked.by == "rows") 2L:1L else 1L:2L,
  verbose = FALSE)

readRAW_matrix(fileIn, header = TRUE, dataType = integer(), n = NULL,
  p = NULL, sep = "", na.strings = "NA", nColSkip = 6L,
  idCol = c(1L, 2L), verbose = FALSE)

readRAW_big.matrix(fileIn, header = TRUE, dataType = integer(),
  n = NULL, p = NULL, sep = "", na.strings = "NA", nColSkip = 6L,
  idCol = c(1L, 2L), folderOut = paste0("BGData_",
  sub("\\.[[:alnum:]]+$", "", basename(fileIn))), outputType = "char",
  verbose = FALSE)

fileIn: The path to the plaintext file.
header: Whether fileIn contains a header. Defaults to TRUE.
dataType: The coding type of genotypes in fileIn. Use integer() or double()
for numeric coding. Alpha-numeric coding is currently not supported for
readRAW and readRAW_big.matrix: use the --recodeA option of PLINK to convert
the .ped file into a .raw file. Defaults to integer().
n: The number of individuals. Auto-detected if NULL. Defaults to NULL.
p: The number of markers. Auto-detected if NULL. Defaults to NULL.
sep: The field separator character. Values on each line of the file are
separated by this character. If sep = "" (the default for readRAW), the
separator is "white space", that is, one or more spaces, tabs, newlines or
carriage returns.
na.strings: The character string used in the plaintext file to denote
missing values. Defaults to "NA".
nColSkip: The number of columns to be skipped to reach the genotype
information in the file. Defaults to 6.
idCol: The index of the ID column. If more than one index is given, the
corresponding columns are concatenated with "_". Defaults to c(1, 2), i.e.
a concatenation of the first two columns.
nNodes: The number of nodes to create. Auto-detected if NULL. Defaults to
NULL.
linked.by: If columns, a column-linked matrix (ColumnLinkedMatrix) is
created; if rows, a row-linked matrix (RowLinkedMatrix). Defaults to rows.
folderOut: The path to the folder where to save the binary files. Defaults
to the name of the input file (fileIn) without extension, prefixed with
"BGData_".
outputType: The vmode for ff and type for big.matrix objects. Defaults to
byte for ff and char for big.matrix objects.
dimorder: The physical layout of the underlying ff object of each node.
verbose: Whether progress updates will be posted. Defaults to FALSE.
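
The defaults of folderOut and idCol can be reproduced with base R alone. The sketch below evaluates the folderOut expression from the usage above for a hypothetical path, and mimics the idCol concatenation on a made-up two-row table. The path, the table, and the variable names are assumptions for illustration; this is not the package's internal code.

```r
# Hypothetical input path, used only for illustration
fileIn <- "/data/genotypes/chr1.raw"

# Default folderOut expression, copied from the usage:
# strip the final extension from the file name and prefix "BGData_"
folderOut <- paste0("BGData_", sub("\\.[[:alnum:]]+$", "", basename(fileIn)))
print(folderOut)

# Made-up sample information with family ID and individual ID in columns 1 and 2
pheno <- data.frame(FID = c("fam1", "fam2"), IID = c("ind1", "ind2"),
                    stringsAsFactors = FALSE)

# idCol = c(1L, 2L): join the selected columns with "_" to form one ID per row
idCol <- c(1L, 2L)
ids <- do.call(paste, c(as.list(pheno[idCol]), sep = "_"))
print(ids)
```

With a single index in idCol, the same call returns that column unchanged.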
readRAW: Genotypes are stored in a LinkedMatrix object where each node is an
ff instance. Multiple ff files are used because the array size in ff is
limited to the largest integer that can be represented on the system
(.Machine$integer.max), and for genetic data this limit is often exceeded.
The LinkedMatrix package makes it possible to link several ff files together
by columns or by rows and treat them similarly to a single matrix. By
default a RowLinkedMatrix is used for the genotypes (linked.by = "rows"),
but the user can modify this using the linked.by argument. The number of
nodes to generate is either specified by the user using the nNodes argument
or determined internally so that each ff object has a number of cells that
is smaller than .Machine$integer.max / 1.2. A folder (see folderOut) that
contains the binary flat files (named geno_*.bin) and an external
representation of the BGData object in BGData.RData is created.
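
The bound on cells per node can be turned into a rough node count. The sketch below is one way to satisfy the stated limit, not necessarily the rule readRAW applies internally; the dimensions n and p and all variable names are hypothetical.

```r
# Hypothetical dimensions: 50,000 individuals and 1,000,000 markers
n <- 50000
p <- 1e6

# Upper bound on the number of cells per node, as stated in the text
maxCells <- .Machine$integer.max / 1.2

# Fewest nodes for which the average node size stays within the bound
# (an assumed rule; the package may round or split differently)
nNodes <- ceiling((n * p) / maxCells)

# With linked.by = "rows", each node holds a block of rows and all p columns
rowsPerNode <- ceiling(n / nNodes)
cellsPerNode <- rowsPerNode * p

print(nNodes)
print(cellsPerNode < maxCells)
```

Passing nNodes explicitly overrides any such automatic choice.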
readRAW_matrix: Genotypes are stored in a regular matrix object. Therefore,
this function will only work if the .raw file is small enough to fit into
memory.
readRAW_big.matrix: Genotypes are stored in a filebacked big.matrix object.
A folder (see folderOut) that contains the binary flat file (named
BGData.bin), a descriptor file (named BGData.desc), and an external
representation of the BGData object in BGData.RData is created.
To reload a BGData object, it is recommended to use the load.BGData function
instead of the load function, as load does not initialize ff objects or
attach big.matrix objects.
The data included in the first few columns (up to nColSkip) is used to
populate the sample information of a BGData object, and the remaining
columns are used to fill the genotypes. If the first row contains a header
(header = TRUE), data in this row is used to determine the column names for
sample information and genotypes.
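
To make the column layout concrete, the sketch below writes a tiny made-up file in the shape of a PLINK .raw file and splits it at nColSkip. It uses base R's read.table as a stand-in and does not reflect how readRAW parses files internally; the file contents and variable names are invented for illustration.

```r
# Write a tiny made-up file in the layout of a PLINK .raw file:
# six sample-information columns followed by one column per marker
rawFile <- tempfile(fileext = ".raw")
writeLines(c(
  "FID IID PAT MAT SEX PHENOTYPE snp1_A snp2_G",
  "fam1 ind1 0 0 1 -9 0 1",
  "fam2 ind2 0 0 2 -9 2 NA"
), rawFile)

# Read it with base R (a stand-in, not what readRAW does internally)
raw <- read.table(rawFile, header = TRUE, sep = "", na.strings = "NA",
                  stringsAsFactors = FALSE)

nColSkip <- 6L
# The first nColSkip columns hold the sample information ...
pheno <- raw[, seq_len(nColSkip)]
# ... and the remaining columns hold the genotypes
geno <- as.matrix(raw[, -seq_len(nColSkip), drop = FALSE])

print(colnames(pheno))
print(geno)

# Remove the temporary file
unlink(rawFile)
```

The header row supplies the column names of both parts, and the NA in the last row becomes a missing genotype.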
The genotypes can take several forms, depending on the function that is
called (readRAW, readRAW_matrix, or readRAW_big.matrix). Each function is
described in its own section.
load.BGData() to load a previously saved BGData object, as.BGData() to
create BGData objects from non-text files (e.g. .bed files).
BGData-class, ColumnLinkedMatrix-class, RowLinkedMatrix-class,
big.matrix-class, and ff for more information on the above-mentioned
classes.
# Path to example data
path <- system.file("extdata", package = "BGData")
# Convert the RAW file of chromosome 1 to a BGData object
bg <- readRAW(fileIn = file.path(path, "chr1.raw"))
# Remove the output folder created by readRAW
unlink("BGData_chr1", recursive = TRUE)