These functions try to read site, tree, and core IDs from a rwl data.frame.
read.ids(rwl, stc = c(3, 2, 3), ignore.site.case = FALSE,
         ignore.case = FALSE, fix.typos = FALSE, typo.ratio = 5,
         use.cor = TRUE)

autoread.ids(rwl, ignore.site.case = TRUE, ignore.case = "auto",
             fix.typos = TRUE, typo.ratio = 5, use.cor = TRUE)
stc: a vector of three integral values or the character string "auto". The numbers give the number of characters allotted to the site code (stc[1]), the tree IDs (stc[2]), and the core IDs (stc[3]). Defaults to c(3, 2, 3). If "auto", the function tries to determine the split locations automatically. See Details for further information.
use.cor: a logical flag. If TRUE and stc is "auto", correlation clustering may be used for determining the length of the tree and core parts. See Details.
ignore.site.case: a logical flag. If TRUE, the function does not distinguish between upper case and lower case letters in the site part of the series names.
ignore.case: a logical flag or "auto". If TRUE, the function does not distinguish between upper case and lower case letters in the tree / core part of the series names. The default in read.ids is FALSE, i.e. the difference matters. The default in autoread.ids is "auto", which means that the function tries to be smart with respect to case sensitivity. In "auto" mode, the function generally ignores case differences, unless doing so would result in additional duplicate combinations of tree and core IDs. Also, when in "auto" mode and stc is "auto", case sensitivity is used in highly heuristic ways when deciding the boundary between the site part and the tree part in uncertain cases.
fix.typos: a logical flag. If TRUE, the function will try to detect and fix typing errors.
typo.ratio: a numeric value larger than 1, affecting the eagerness of the function to fix typing errors. The default is 5. See Details.
A data.frame with column one named "tree" giving an ID for each tree and column two named "core" giving an ID for each core. The original series IDs are copied from rwl as rownames. The order of the rows in the output matches the order of the series in rwl. If more than one site is detected, an additional third column named "site" will contain a site ID. All columns have integral valued numeric values.
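When the IDs have to be assigned by hand, a data.frame in the format described above can be put together in base R. The following is a minimal sketch; the series names and ID values are made up for illustration and do not come from any real data set.

```r
# Hypothetical series names; in practice these would be names(rwl).
series <- c("ABC011", "ABC012", "ABC021")

# IDs assigned by hand: two cores from tree 1, one core from tree 2.
ids <- data.frame(tree = c(1, 1, 2),
                  core = c(1, 2, 1),
                  row.names = series)
ids
```

The row names carry the original series IDs, and both columns are numeric with integral values, as in the output of read.ids.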
Because dendrochronologists often take more than one core per tree, it is occasionally useful to calculate within vs. between tree variance. The International Tree Ring Data Bank (ITRDB) allows the first eight characters in an rwl file for series IDs but these are often shorter. Typically the creators of rwl files use a logical labeling method that can allow the user to determine the tree and core ID from the label.
Argument stc tells how each series name separates into site, tree, and core IDs. For instance a series code might be "ABC011", indicating site "ABC", tree 1, core 1. If this format is consistent then the stc mask would be c(3, 2, 3), allowing up to three characters for the core ID (i.e., pad to the right). If it is not possible to define the scheme (and often it is not possible to machine read IDs), then the output data.frame can be built manually. See Value for format.
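The effect of a fixed mask can be illustrated in base R. The sketch below is not the code used inside read.ids; split_by_mask is a hypothetical helper that only cuts each name at the positions implied by stc.

```r
# Hypothetical helper (not part of dplR): split names by a fixed mask.
split_by_mask <- function(ids, stc = c(3, 2, 3)) {
  ends <- cumsum(stc)            # last character of each part: 3, 5, 8
  starts <- c(1, ends[-3] + 1)   # first character of each part: 1, 4, 6
  data.frame(site = substr(ids, starts[1], ends[1]),
             tree = substr(ids, starts[2], ends[2]),
             core = substr(ids, starts[3], ends[3]),
             row.names = ids, stringsAsFactors = FALSE)
}

parts <- split_by_mask(c("ABC011", "ABC012", "ABC021"))
parts
```

Because substr stops at the end of the string, a name shorter than the full mask simply yields a shorter core part, which corresponds to the "pad to the right" remark above.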
The function autoread.ids is a wrapper to read.ids with stc="auto", i.e. automatic detection of the site / tree / core scheme, and different default values of some parameters. In automatic mode, the names in the same rwl can even follow different site / tree / core schemes. As there are numerous possible encoding schemes for naming measurement series, the function cannot always produce the correct result.
With stc="auto", the site part can be one of the following.
In names mostly consisting of numbers, the longest common prefix is the site part.
Alphanumeric site part ending with a letter, when followed by numbers and letters.
Alphabetic site part (quite complicated actual definition). Setting ignore.case to "auto" allows the function to try to guess when a case change in the middle of a sequence of letters signifies a boundary between the site part and the tree part.
The characters before the first sequence of space / punctuation characters in a name that contains at least two such sequences.
These descriptions are somewhat general, and the details can be found in regular expressions inside the function. If a name does not match any of the descriptions, it is matched against a previously found site part, starting from the longest.
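As an illustration of the first case (names mostly consisting of numbers), the longest common prefix of a set of names can be found as follows. This is a simplified sketch with made-up names; longest_common_prefix is a hypothetical helper, and the function itself relies on regular expressions and further rules not shown here.

```r
# Hypothetical helper: longest common prefix of a character vector.
longest_common_prefix <- function(x) {
  k <- 0
  n <- min(nchar(x))
  # Extend the prefix while all names still agree on it.
  while (k < n && length(unique(substr(x, 1, k + 1))) == 1) {
    k <- k + 1
  }
  substr(x[1], 1, k)
}

site <- longest_common_prefix(c("70101", "70102", "70211"))
site
```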
The following ID schemes are detected and supported in the tree / core part. The detection is done per site.
Numbers in tree part, core part starts with something else.
Letters in tree part, core part starts with something else.
Letters, either tree part all lower case and core part all upper case or vice versa. For this to work, ignore.case must be set to "auto" or FALSE.
All digits. In this case, the number of characters belonging to the tree and core parts is detected with one of the following methods.
If numeric tree parts were found before, it is assumed that the core part is missing (one core per tree).
If the series are numbered continuously, one core per tree is assumed.
Otherwise, try to find a core part as the suffix so that the cores are numbered continuously.
If none of the above fits, the tree / core split of the all-digit names will be decided with the methods described further down the list, or finally with the fallback mechanism.
The combined tree / core part is empty or one character. In this case, the core part is assumed to be missing.
Tree and core parts separated by a punctuation or white space character
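Two of the simpler schemes in the list above can be sketched with base R regular expressions. The strings are hypothetical tree / core parts with the site part already removed, and the patterns are illustrative only; the expressions used inside the function may differ.

```r
# Scheme: numbers in tree part, core part starts with something else.
parts <- c("01A", "01B", "12A", "7")
tree <- sub("^([0-9]+).*$", "\\1", parts)   # leading digits
core <- sub("^[0-9]+", "", parts)           # whatever follows them

# Scheme: tree and core parts separated by punctuation or white space.
sep_parts <- c("3-1", "3-2", "10 1")
pieces <- strsplit(sep_parts, "[[:punct:][:space:]]+")
tree2 <- vapply(pieces, `[`, FUN.VALUE = character(1), 1)
core2 <- vapply(pieces, `[`, FUN.VALUE = character(1), 2)

data.frame(tree, core, row.names = parts)
data.frame(tree = tree2, core = core2, row.names = sep_parts)
```

In the first example the name "7" has an empty core part, which matches the case where the core part is assumed to be missing.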
If the split of a tree / core part cannot be found with any of the methods described above, the prefix of the string is matched against a previously found tree part, starting from the longest. The fallback mechanism for the still undecided tree / core parts is one of the following. The first one is used if use.cor is TRUE, number two if it is FALSE.
Pairwise correlation coefficients are computed between all remaining series. Pairs of series with above median correlation are flagged as similar, and the other pairs are flagged as dissimilar. Each possible number of characters (minimum 1) is considered for the share of the tree ID. The corresponding unique would-be tree IDs determine a set of clusterings where one cluster is formed by all the measurement series of a single tree. For each clustering (allocation of characters), an agreement score is computed. The agreement score is defined as the sum of the number of similar pairs with matching cluster number and the number of dissimilar pairs with non-matching cluster number. The number of characters with the maximum agreement is chosen.
If the majority of the names in the site use k characters for the tree part, that number is chosen. Otherwise, one core per tree is assumed. Parameter typo.ratio has a double meaning as it also defines what is meant by majority here: at least typo.ratio / (typo.ratio + 1) * n.tot, where n.tot is the number of names in the site.
In both fallback mechanisms, the number of characters allocated for the tree part will be increased until all trees have a non-zero ID or there are no more characters.
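The agreement score of the correlation-based fallback can be sketched in base R as follows. This is an illustration of the description above and not the code used inside the function; agreement_by_length is a hypothetical helper, and the correlation matrix is a hand-made toy example rather than one computed from ring-width data.

```r
# Hypothetical helper: agreement score for each candidate tree-part length.
# 'parts' are the tree / core strings, 'r' is their correlation matrix.
agreement_by_length <- function(parts, r) {
  pairs <- upper.tri(r)                   # count each pair once
  similar <- r > median(r[pairs])         # above-median correlation
  vapply(seq_len(max(nchar(parts))), function(k) {
    tree <- substr(parts, 1, k)           # would-be tree IDs
    same_tree <- outer(tree, tree, "==")  # same cluster?
    # similar & same cluster, or dissimilar & different cluster
    as.numeric(sum((similar == same_tree)[pairs]))
  }, numeric(1))
}

# Toy data: series 1-2 and series 3-4 correlate strongly with each other.
parts <- c("011", "012", "021", "022")
r <- matrix(0.1, 4, 4)
r[1, 2] <- r[2, 1] <- 0.9
r[3, 4] <- r[4, 3] <- 0.8
diag(r) <- 1

scores <- agreement_by_length(parts, r)
scores
which.max(scores)   # number of characters chosen for the tree part
```

In this toy example, two characters for the tree part group the strongly correlated series together, so that allocation reaches the highest score.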
Suspected typing errors will be fixed by the function if fix.typos is TRUE. The parameter typo.ratio affects the eagerness to fix typos, i.e. the number of counterexamples required to declare a typo. The following main typo fixing mechanisms are implemented:
If a rare site string resembles an at least typo.ratio times more frequent alternative, and if fixing it would not create any name collisions, make the fix. The alternative string must be unique, or if there is more than one alternative, it is enough if only one of them is a look-alike string. Any kind of substitution in one character place is allowed if the alternative string has the same length as the original string. The alternative string can be one character longer or one character shorter than the original string, but only if it involves interpreting one digit as the look-alike letter or vice versa. There are requirements on how long a site string must be in order to be eligible for replacement / typo fixing: it cannot be shortened to zero length, and the only character of a site string cannot be changed. The parameters ignore.case and ignore.site.case have some effect on this typo fixing mechanism.
If all tree / core parts of a site have the same length, each character position is inspected individually. If the characters in the i-th position are predominantly digits (letters), any letters (digits) are changed to the corresponding look-alike digit (letter) if there is one. The look-alike groups are {0, O, o}, {1, I, i}, {5, S, s} and {6, G}. The parameter typo.ratio determines the decision threshold of interpreting the type of each character position as letter (digit): the ratio of letters (digits) to the total number of characters must be at least typo.ratio / (typo.ratio + 1). If a name differs from the majority type in more than one character position, it is not fixed. Also, no fixes are performed if any of them would cause a possible monotonic order of numeric prefixes to break.
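The position-wise look-alike mechanism can be sketched in base R. The code below is a simplified illustration under stated assumptions: fix_lookalike_digits is a hypothetical helper, it handles only the letter-to-digit direction, it ignores letter-dominated positions, and it skips the monotonic order check. The example names are made up.

```r
# Hypothetical helper: replace look-alike letters in digit-dominated
# character positions (letter-to-digit direction only).
fix_lookalike_digits <- function(parts, typo.ratio = 5) {
  stopifnot(length(unique(nchar(parts))) == 1)   # equal lengths required
  threshold <- typo.ratio / (typo.ratio + 1)
  chars <- do.call(rbind, strsplit(parts, ""))   # one row per name
  is_digit <- matrix(grepl("[0-9]", chars), nrow = nrow(chars))
  digit_col <- colMeans(is_digit) >= threshold   # digit-dominated positions
  # Only names deviating in exactly one such position are fixed.
  fixable <- rowSums(!is_digit[, digit_col, drop = FALSE]) == 1
  for (j in which(digit_col)) {
    rows <- fixable & !is_digit[, j]
    # Look-alike groups: {0, O, o}, {1, I, i}, {5, S, s}, {6, G}
    chars[rows, j] <- chartr("OoIiSsG", "0011556", chars[rows, j])
  }
  apply(chars, 1, paste, collapse = "")
}

parts <- c("011", "012", "021", "022", "031", "032", "O41")
fixed <- fix_lookalike_digits(parts)
fixed
```

With typo.ratio = 5 the threshold is 5 / 6, so in a site with seven names at least six of the characters in a position must be digits before the remaining letter is treated as a typo.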
The function attempts to convert the tree and core substrings to integral values. When this succeeds, the converted values are copied to the output without modification. When non-integral substrings are observed, each unique tree is assigned a unique integral value. The same applies to cores within a tree, but there are some subtleties with respect to the handling of duplicates. Substrings are sorted before assigning the numeric IDs.
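The conversion rule can be sketched in base R. This is a simplified illustration: to_numeric_ids is a hypothetical helper, and the handling of duplicate cores within a tree mentioned above is not reproduced.

```r
# Hypothetical helper: map ID substrings to integral numeric values.
to_numeric_ids <- function(x) {
  if (all(grepl("^[0-9]+$", x))) {
    as.numeric(x)                          # integral already: keep values
  } else {
    as.numeric(match(x, sort(unique(x))))  # rank among sorted unique strings
  }
}

num_ids <- to_numeric_ids(c("01", "02", "10"))
chr_ids <- to_numeric_ids(c("b", "a", "b", "c"))
num_ids
chr_ids
```

Sorting the unique substrings before numbering makes the assigned IDs independent of the order in which the series appear.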
The order of columns in rwl, in most cases, does not affect the tree and core IDs assigned to each series.
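library(dplR)   # provides read.ids, autoread.ids and the ca533 data set
library(utils)
data(ca533)
# Fixed mask: 3 site characters, 2 tree characters, up to 3 core characters
read.ids(ca533, stc = c(3, 2, 3))
# Automatic detection of the site / tree / core scheme
autoread.ids(ca533)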