Returns either an as small as possible set of non redundant representatives covering (having in their neighborhood) a desired percentage of all sequences, or a given number of patterns with highest coverage. Special cases are single representatives such as the medoid or the sequence pattern with densest neighborhood. See plot.stslist.rep
for the plot method and seqplot
for other plot options.
seqrep(seqdata, criterion = "density", score = NULL, decreasing = TRUE,
coverage = 0.25, nrep = NULL, pradius = 0.10, dmax = NULL, diss = NULL,
weighted = TRUE, trep, tsim, dist.matrix, ...)
An object of class stslist.rep
. This is actually a
state sequence object (containing a list of state sequences) with the
following additional attributes:
a vector with the representative score of each sequence in the original set given the chosen criterion.
a matrix with the distance of each sequence to its nearest representative.
vector with, for each sequence, the representative that represents it.
list with indexes of occurrences of each representative in original data.
a data frame with quality measures for each representative sequence: number \(na\) of sequences attributed to the representative, number \(nb\) of sequences in the representative's neighborhood, mean distance \(MD\) to the representative and a few other indexes.
overall quality measure.
Print, plot and summary methods are available. More elaborated plots
are produced by the seqplot
function using the type="r"
argument, or the seqrplot
alias.
a state sequence object as defined by the seqdef
function.
the representativeness criterion for sorting the candidate list. One of "freq"
(sequence
frequency), "density"
(neighborhood density), "mscore"
(mean state frequency), "dist"
(centrality) and "prob"
(sequence likelihood). See details.
an optional vector of representativeness scores for sorting the sequences in the candidate list. The length of the vector must be equal to the number of sequences in the sequence object.
if a score vector is provided, indicates whether the objects in the candidate list must be sorted in
ascending or descending order of this score. Default is TRUE
, i.e. descending. The first object in the candidate list
is then supposed to be the most representative.
coverage threshold, i.e., minimum proportion of sequences that should have a representative in their
neighborhood (neighborhood radius is defined by pradius
).
number of representative sequences. If NULL
(default), the size of the representative set is
controlled by coverage
.
neighborhood
radius as a percentage of the maximum (theoretical)
distance dmax
. Defaults to 0.1 (10%). Sequence \(y\) is
redundant to sequence \(x\) when it is in the neighborhood of \(x\), i.e., within a distance
pradius*dmax
from \(x\).
maximum theoretical distance. The dmax
value is used to derive the neighborhood radius as pradius*dmax
. If NULL
, the value of dmax
is derived from the dissimilarity matrix.
matrix of pairwise dissimilarities between sequences in seqdata
. If NULL
, the
matrix is computed by calling the seqdist
function. In that case, optional arguments to be passed to
the seqdist
function (see ...
hereafter) should also be provided.
logical: Should weights assigned to the state sequence object be accounted for? (See seqdef
.)
Set as FALSE
to ignore the weights.
Deprecated. Use coverage
instead.
Deprecated. Use pradius
instead.
Deprecated. Use diss
instead.
optional arguments to be passed to the seqdist
function, mainly dist.method
specifying the
metric for computing the distance matrix, norm
for normalizing the distances, indel
and sm
for
indel and substitution costs when Optimal Matching metric is chosen. See seqdist
manual page for
details.
Alexis Gabadinho and Gilbert Ritschard
The representative set is obtained by an heuristic. Representatives are selected by successively extracting from the sequences sorted by their representativeness score those which are not redundant with already retained representatives. The selection stops when either the desired coverage or the wanted number of representatives is reached. Sequences are sorted either by the values provided as score
argument or by specifying one of the following as criterion
argument: "freq"
(sequence frequency), "density"
(neighborhood density), "mscore"
(mean state frequency), "dist"
(centrality), and "prob"
(sequence likelihood).
With the sequence frequency criterion, the more frequent a sequence the more representative it is supposed to be. Therefore, sequences are sorted in decreasing frequency order.
The neighborhood density is the
number---density---of sequences in the neighborhood of the
sequence. This requires to set the neighborhood radius
pradius
. Sequences are
sorted in decreasing density order.
The mean state frequency criterion is the mean value of the transversal frequencies of the successive states. Let \(s=s_{1}s_{2}\cdots s_{\ell}\) be a sequence of length \(\ell\) and \((f_{s_1}, f_{s_2}, \ldots, f_{s_\ell})\) the frequencies of the states at (time-)position \((t_1, t_2,\ldots t_{\ell})\). The mean state frequency is the sum of the state frequencies divided by the sequence length $$ MSF(s)=\frac{1}{\ell} \sum_{i=1}^{\ell} f_{s_{i}} $$ The lower and upper boundaries of \(MSF\) are \(0\) and \(1\). \(MSF\) is equal to \(1\) when all the sequences in the set are identical, i.e. when there is a single sequence pattern. The most representative sequence is the one with the highest score.
The centrality criterion is the sum of distances to all other sequences. The smallest the sum, the most representative is the sequence.
The sequence likelihood \(P(s)\) is defined as the product of the probability with which each of its observed
successive state is supposed to occur at its position.
Let \(s=s_{1}s_{2} \cdots s_{\ell}\) be a sequence of length \(\ell\). Then
$$
P(s)=P(s_{1},1) \cdot P(s_{2},2) \cdots P(s_{\ell},\ell)
$$
with \(P(s_{t},t)\) the probability to observe state \(s_t\) at position \(t\).
The question is how to determinate the state probabilities \(P(s_{t},t)\). One commonly used method for
computing them is to postulate a Markov Chain model, which can be of various order. The implemented criterion considers the
probabilities derived from the first order Markov model, that is each \(P(s_{t},t)\), \(t>1\) is set to the
transition rate \(p(s_t|s_{t-1})\) estimated across sequences from the observations at positions \(t\)
and \(t-1\). For \(t=1\), we set \(P(s_1,1)\) to the observed frequency of the state \(s_1\) at position 1.
The likelihood \(P(s)\) being generally very small, we use \(-\log P(s)\) as sorting criterion. The latter quantity reaches its minimum for \(P(s)\) equal to 1, which leads to sort the sequences in ascending order of their score.
Use criterion="dist"
(centrality) and nrep=1
to get the medoid, and criterion="density"
and nrep=1
to get the densest sequence pattern.
For more details, see Gabadinho & Ritschard, 2013.
Gabadinho A, Ritschard G (2013). "Searching for typical life trajectories applied to child birth histories", In R Lévy, E. Widmer (eds.), Gendered Life Courses, pp. 287-312. Vienna: LIT.
Gabadinho A, Ritschard G, Studer M, Müller NS (2011). "Extracting and Rendering Representative Sequences", In A Fred, JLG Dietz, K Liu, J Filipe (eds.), Knowledge Discovery, Knowledge Engineering and Knowledge Management, volume 128 of Communications in Computer and Information Science (CCIS), pp. 94-106. Springer-Verlag.
seqplot
, plot.stslist.rep
, dissrep
, disscenter
## Defining a sequence object with the data in columns 10 to 25
## (family status from age 15 to 30) in the biofam data set
data(biofam)
biofam.lab <- c("Parent", "Left", "Married", "Left+Marr",
"Child", "Left+Child", "Left+Marr+Child", "Divorced")
biofam.seq <- seqdef(biofam[,10:25], labels=biofam.lab)
## Computing the distance matrix
costs <- seqsubm(biofam.seq, method="TRATE")
biofam.om <- seqdist(biofam.seq, method="OM", sm=costs)
## Representative set using the neighborhood density criterion
biofam.rep <- seqrep(biofam.seq, diss=biofam.om, criterion="density")
biofam.rep
summary(biofam.rep)
plot(biofam.rep)
## plot by groups represented by the representatives
seqdplot(biofam.seq, group=attr(biofam.rep,"Rep.group"), border=NA)
## indexes of sequences represented by 1st representative
r1.grp <- which(attr(biofam.rep,"Rep.group")==1)
## indexes of occurrences of the first representative sequence
attr(biofam.rep,"idx.rep")[[1]]
Run the code above in your browser using DataLab