- seqdata
State sequence object of class stslist.
The sequence data to use.
Use seqdef to create such an object.
- method
String.
The dissimilarity measure to use.
It can be "OM", "OMloc", "OMslen", "OMspell", "OMstran", "HAM", "DHD", "CHI2", "EUCLID", "LCS", "LCP", "RLCP", "NMS", "NMSMST", "SVRspell", or "TWED". See the Details section.
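As a minimal sketch of how seqdata and method fit together, the following builds a small state sequence object with seqdef and computes two dissimilarity matrices. The toy data frame and the object names (toy, toy.seq, d.ham, d.lcs) are illustrative assumptions and are not part of the package.

```r
library(TraMineR)

# Hypothetical toy data: three sequences of length 4 over the states A, B, C
toy <- data.frame(
  t1 = c("A", "A", "B"),
  t2 = c("A", "B", "B"),
  t3 = c("B", "B", "C"),
  t4 = c("C", "B", "C"),
  stringsAsFactors = FALSE
)

# seqdef() turns the data frame into a state sequence object of class stslist
toy.seq <- seqdef(toy)

# Hamming distances; sm is left NULL and is therefore autogenerated
d.ham <- seqdist(toy.seq, method = "HAM")

# Distances based on the longest common subsequence
d.lcs <- seqdist(toy.seq, method = "LCS")
```

Both calls return a 3 x 3 matrix here because refseq is NULL and full.matrix keeps its default.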
- refseq
NULL, Integer, State Sequence Object, or List.
Default: NULL.
The baseline sequence to compute the distances from.
When an integer, the index of a sequence in seqdata, or 0 for the most frequent sequence.
When a state sequence object, it must contain a single sequence and have the same alphabet as seqdata.
When a list, it must be a list of two sets of indexes of seqdata rows.
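The integer forms of refseq can be sketched as follows. The toy data are hypothetical; rows 2 and 4 are deliberately identical so that the most frequent sequence is unambiguous.

```r
library(TraMineR)

# Hypothetical toy data: rows 2 and 4 are identical (A-B-B-B)
toy <- data.frame(
  t1 = c("A", "A", "B", "A"),
  t2 = c("A", "B", "B", "B"),
  t3 = c("B", "B", "C", "B"),
  t4 = c("C", "B", "C", "B"),
  stringsAsFactors = FALSE
)
toy.seq <- seqdef(toy)

# Distances from every sequence to the first sequence of seqdata
d.first <- seqdist(toy.seq, method = "HAM", refseq = 1)

# Distances from every sequence to the most frequent sequence
d.mfreq <- seqdist(toy.seq, method = "HAM", refseq = 0)
```

With an integer refseq, the result is a vector with one distance per sequence rather than a matrix.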
- norm
String.
Default: "none".
The normalization to use when method is one of "OM", "OMloc", "OMslen", "OMspell", "OMstran", "TWED", "HAM", "DHD", "LCS", "LCP", "RLCP", "CHI2", or "EUCLID".
It can be "none", "auto", or, except for "CHI2" and "EUCLID", "maxlength", "gmean", "maxdist", or "YujianBo".
"auto" is equivalent to "maxlength" when method is one of "OM", "HAM", or "DHD"; to "gmean" when method is one of "LCS", "LCP", or "RLCP"; and to "YujianBo" when method is one of "OMloc", "OMslen", "OMspell", "OMstran", or "TWED". See the Details section.
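The effect of norm = "auto" can be checked with a short sketch. For method = "OM" it should give the same result as norm = "maxlength", as stated above. The toy data and object names are illustrative assumptions.

```r
library(TraMineR)

# Hypothetical toy data: three sequences of length 4
toy <- data.frame(
  t1 = c("A", "A", "B"),
  t2 = c("A", "B", "B"),
  t3 = c("B", "B", "C"),
  t4 = c("C", "B", "C"),
  stringsAsFactors = FALSE
)
toy.seq <- seqdef(toy)

# Raw OM distances with substitution costs built by the "CONSTANT" cost method
d.raw <- seqdist(toy.seq, method = "OM", sm = "CONSTANT")

# "auto" resolves to "maxlength" for method = "OM"
d.auto <- seqdist(toy.seq, method = "OM", sm = "CONSTANT", norm = "auto")
d.max  <- seqdist(toy.seq, method = "OM", sm = "CONSTANT", norm = "maxlength")

# TRUE if both normalizations agree
isTRUE(all.equal(d.auto, d.max))
```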
- indel
Double, Vector of Doubles, or String.
Default: "auto".
Insertion/deletion cost(s). Applies when method is one of "OM", "OMslen", "OMspell", or "OMstran".
When a double, the single state-independent insertion/deletion cost.
When a vector of doubles, the state-dependent insertion/deletion costs. The vector should contain one indel cost per state, in the order of the alphabet.
When "auto", the indel is set to max(sm)/2 when sm is a matrix, and is computed by means of seqcost when sm is a string specifying a cost method.
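The three forms of indel can be compared in a small sketch. The toy data, the cost values, and the object names are assumptions chosen for illustration; the indel vector follows the alphabet order A, B, C.

```r
library(TraMineR)

# Hypothetical toy data: three sequences over the alphabet A, B, C
toy <- data.frame(
  t1 = c("A", "A", "B"),
  t2 = c("A", "B", "B"),
  t3 = c("B", "B", "C"),
  t4 = c("C", "B", "C"),
  stringsAsFactors = FALSE
)
toy.seq <- seqdef(toy)

# Constant substitution-cost matrix (all off-diagonal costs equal to 2)
costs <- seqcost(toy.seq, method = "CONSTANT", cval = 2)

# indel = "auto" (the default) with a matrix sm: indel is max(sm)/2
d.auto <- seqdist(toy.seq, method = "OM", sm = costs$sm)

# The same indel given explicitly as a single double
d.half <- seqdist(toy.seq, method = "OM", sm = costs$sm,
                  indel = max(costs$sm) / 2)

# State-dependent indels, one per state in alphabet order (A, B, C)
d.vec <- seqdist(toy.seq, method = "OM", sm = costs$sm,
                 indel = c(1, 1.5, 2))
```

Given the "auto" rule above, d.auto and d.half are expected to be identical.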
- sm
NULL, Matrix, Array, or String. Substitution costs.
Default: NULL.
When a matrix, the substitution-cost matrix; applies when method is one of "OM", "OMloc", "OMslen", "OMspell", "OMstran", "HAM", or "TWED".
When an array, the series of substitution-cost matrices for method = "DHD". They are grouped in a 3-dimensional array with the third index referring to the position in the sequence.
When a string, one of "CONSTANT", "INDELS", "INDELSLOG", or "TRATE". It designates a seqcost method to build sm. "CONSTANT" is not relevant for "DHD".
sm is mandatory when method is one of "OM", "OMloc", "OMslen", "OMspell", "OMstran", or "TWED".
sm is autogenerated when method is one of "HAM" or "DHD" and sm = NULL. See the Details section.
Note: With method = "NMS" or method = "SVRspell", use prox instead.
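A short sketch of two ways to supply sm: as a matrix built beforehand with seqcost, or as a string naming the cost method. The toy data and object names are hypothetical, and indel is fixed at 1 only to keep the two calls comparable.

```r
library(TraMineR)

# Hypothetical toy data: three sequences over the alphabet A, B, C
toy <- data.frame(
  t1 = c("A", "A", "B"),
  t2 = c("A", "B", "B"),
  t3 = c("B", "B", "C"),
  t4 = c("C", "B", "C"),
  stringsAsFactors = FALSE
)
toy.seq <- seqdef(toy)

# Option 1: build the substitution-cost matrix from transition rates,
# then pass the matrix itself
costs <- seqcost(toy.seq, method = "TRATE")
d.mat <- seqdist(toy.seq, method = "OM", sm = costs$sm, indel = 1)

# Option 2: pass the name of the cost method and let seqdist build sm
d.str <- seqdist(toy.seq, method = "OM", sm = "TRATE", indel = 1)
```

Building the matrix first makes it possible to inspect or adjust the costs before computing distances.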
- with.missing
Logical.
Default: FALSE.
Should the non-deleted missing value be added to the alphabet as an additional state? If FALSE and seqdata or refseq contains such gaps, an error is raised.
- full.matrix
Logical.
Default: TRUE.
Applies when refseq = NULL. If TRUE, the full distance matrix is returned. If FALSE, an object of class dist is returned, that is, a vector containing only values from the lower triangle of the distance matrix. Objects of class dist are smaller and can be passed directly as arguments to most clustering functions.
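As an illustration of the last point, a dist object returned with full.matrix = FALSE can be handed directly to hclust from base R. The toy data and the choice of average linkage are assumptions made for this sketch.

```r
library(TraMineR)

# Hypothetical toy data: three sequences of length 4
toy <- data.frame(
  t1 = c("A", "A", "B"),
  t2 = c("A", "B", "B"),
  t3 = c("B", "B", "C"),
  t4 = c("C", "B", "C"),
  stringsAsFactors = FALSE
)
toy.seq <- seqdef(toy)

# Return only the lower triangle as an object of class dist
d <- seqdist(toy.seq, method = "LCS", full.matrix = FALSE)

# A dist object is accepted as is by hierarchical clustering
hc <- hclust(d, method = "average")
```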
- kweights
Double or vector of doubles.
Default: vector of 1s.
The weights applied to subsequences when method is one of "NMS", "NMSMST", or "SVRspell". It contains at position \(k\) the weight applied to the subsequences of length \(k\). It must be positive. Its length should be equal to the number of columns of seqdata. If shorter, longer subsequences are ignored. If a scalar, it is transformed into rep(kweights, ncol(seqdata)).
- tpow
Double.
Default: 1.0.
The exponential weight of spell length when method is one of "OMspell", "NMSMST", or "SVRspell".
- expcost
Double.
Default: 0.5.
The cost of spell length transformation when method = "OMloc" or method = "OMspell". It must be positive. The exact interpretation is distance-dependent.
- context
Double.
Default: 1-2*expcost.
The cost of local insertion when method = "OMloc". It must be positive.
- link
String.
Default: "mean".
The function used to compute substitution costs when method = "OMslen".
One of "mean" (arithmetic average) or "gmean" (geometric mean, as in the original proposition of Halpin 2010).
- h
Double.
Default: 0.5.
It must be greater than or equal to 0.
The exponential weight of spell length when method = "OMslen".
The gap penalty when method = "TWED". It corresponds to the lambda in Halpin (2014), p 88. It is usually chosen in the range [0,1].
- nu
Double.
Stiffness when method = "TWED". It must be strictly greater than 0 and is usually less than 1.
See Halpin (2014), p 88.
- transindel
String.
Default: "constant".
Method for computing transition indel costs when method = "OMstran".
One of "constant" (single indel of 1.0), "subcost" (based on substitution costs), or "prob" (based on transition probabilities).
- otto
Double.
The origin-transition trade-off weight when method = "OMstran". It must be in [0, 1].
- previous
Logical.
Default: FALSE.
When method = "OMstran", should we also account for the transition from the previous state?
- add.column
Logical.
Default: TRUE.
When method = "OMstran", should the last column (and also the first column when previous = TRUE) be duplicated? When sequences have different lengths, should the last (first) valid state be duplicated?
- breaks
List of ordered pairs of integers.
Default: NULL.
The list of the possibly overlapping intervals when method = "CHI2" or method = "EUCLID". Each interval is defined by the pair c(t1,t2) of the start t1 and end t2 positions of the interval.
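A minimal sketch of a breaks specification, assuming hypothetical toy sequences of length 4 split into two non-overlapping intervals (positions 1 to 2 and 3 to 4). The data and the choice of "EUCLID" are illustrative only.

```r
library(TraMineR)

# Hypothetical toy data: three sequences of length 4
toy <- data.frame(
  t1 = c("A", "A", "B"),
  t2 = c("A", "B", "B"),
  t3 = c("B", "B", "C"),
  t4 = c("C", "B", "C"),
  stringsAsFactors = FALSE
)
toy.seq <- seqdef(toy)

# Each interval is a pair c(t1, t2) of start and end positions
my.breaks <- list(c(1, 2), c(3, 4))

# Distances based on the state distributions within each interval
d.euc <- seqdist(toy.seq, method = "EUCLID", breaks = my.breaks)
```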
- step
Integer.
Default: 1.
The length of the intervals when method = "CHI2" or method = "EUCLID" and breaks = NULL. It must be positive. It must also be even when overlap = TRUE.
- overlap
Logical.
Default: FALSE.
When method = "CHI2" or method = "EUCLID" and breaks = NULL, should the intervals overlap?
- weighted
Logical.
Default: TRUE.
When method is "CHI2" or when sm is a string (method), should the distributions of the states account for the sequence weights in seqdata? See seqdef.
- global.pdotj
Numerical vector, "obs", or NULL.
Default: NULL.
Only for method = "CHI2".
The vector of state proportions to be used as marginal distribution. When NULL, the state distribution on the corresponding interval is used. When "obs", the overall state distribution in seqdata is used for all intervals. When a vector of proportions, it is used as marginal distribution for all intervals.
- prox
NULL or Matrix.
Default: NULL.
The matrix of state proximities when method = "NMS" or method = "SVRspell".
- check.max.size
Logical. Should seqdist stop when the maximum allowed number of unique sequences is exceeded? Caution: setting it to FALSE may produce unexpected results or even crash R.
- opt.args
List. List of additional non-documented arguments for development usage.