seqlogp: Logarithm of the probabilities of state sequences

Description

Logarithm of the probabilities of state sequences. The probability of a sequence is defined as the product of the probabilities of the successive states in the sequence. State probabilities can either be provided or be computed with one of a few basic models.

Usage

seqlogp(seqdata, prob="trate", time.varying=TRUE,
        begin="freq", weighted=TRUE, with.missing=FALSE)

Value

Vector of the negative logarithm $-\log P(s)$ of the sequence probabilities.

Arguments

seqdata: A state sequence object as produced by seqdef.
prob: String or numeric array. If a string, either "trate" or "freq" to select a probability model to compute the state probabilities. If a numeric array, a matrix or 3-dimensional array of transition probabilities. See details.
time.varying: Logical. If TRUE, the probabilities (transitions or frequencies) are computed separately for each time point $t$.
begin: String or numeric vector. Distribution used to determine the probability of the first state. If a vector, the probabilities to use. If a string, either "freq" or "global.freq". With "freq", the observed distribution at the first position is used. With "global.freq", the overall distribution is used. Default is "freq".
weighted: Logical. Should we account for the weights when present in seqdata? Default is TRUE.
with.missing: Logical. Should non-void missing states be treated as regular values? Default is FALSE.

Author

Matthias Studer, Alexis Gabadinho, and Gilbert Ritschard

Details

The sequence likelihood $P(s)$ is defined as the product of the probabilities with which each of its successive observed states is supposed to occur at its position. Let $s=s_{1}s_{2} \cdots s_{\ell}$ be a sequence of length $\ell$. Then $$ P(s)=P(s_{1},1) \cdot P(s_{2},2) \cdots P(s_{\ell},\ell) $$ with $P(s_{t},t)$ the probability of observing state $s_t$ at position $t$.

There are different ways to determine the state probabilities $P(s_t,t)$. The method is chosen by means of the prob argument.

With prob = "freq", the probability $P(s_{t},t)$ is set to the observed relative frequency of state $s_t$ at position $t$. In that case, the probability does not depend on the transition probabilities. By default (time.varying=TRUE), the relative frequencies are computed separately for each position $t$. With time.varying=FALSE, the relative frequencies are computed over the entire covered period, i.e. the same frequencies are used at each $t$.
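The frequency-based model can be illustrated by hand with base R. The following is only a sketch on made-up toy data (the matrix `seqs` and the helper names `freq_t` and `neg_logp` are hypothetical), assuming unweighted sequences without missing states and position-specific frequencies; it is not the package's internal code.

```r
# Toy data (made up): three sequences of length 3 over the alphabet {A, B}
seqs <- matrix(c("A", "A", "B",
                 "A", "B", "B",
                 "B", "B", "B"),
               nrow = 3, byrow = TRUE)
alphabet <- c("A", "B")

# Relative frequency of each state at each position t
# (rows = states, columns = positions)
freq_t <- sapply(seq_len(ncol(seqs)), function(t) {
  as.numeric(table(factor(seqs[, t], levels = alphabet))) / nrow(seqs)
})
rownames(freq_t) <- alphabet

# -log P(s): look up P(s_t, t) for each position, then sum the negative logs
neg_logp <- apply(seqs, 1, function(s) {
  p <- freq_t[cbind(match(s, alphabet), seq_along(s))]
  -sum(log(p))
})
neg_logp
```

Under these assumptions, the first sequence has $P(s) = \frac{2}{3} \cdot \frac{1}{3} \cdot 1 = \frac{2}{9}$, so its value is $-\log(2/9)$.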

With prob = "trate", each $P(s_t,t)$, $t>1$, is set to the transition probability $p(s_t|s_{t-1})$. The state distribution used to determine the probability of the first state $s_1$ is set by means of the begin argument (see below). With the default (time.varying=TRUE), the transition probabilities are estimated separately at each position, yielding an array of transition matrices. With time.varying=FALSE, the transition probabilities are assumed to be constant over the successive positions and are estimated over the entire sequence duration, i.e. from all observed transitions.
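The transition-based model can likewise be sketched in base R. This hand computation mirrors the time-constant case (time.varying=FALSE) with the first state drawn from the observed first-position distribution (begin="freq"); the toy data and the names `counts`, `trate`, `begin_dist` and `neg_logp` are hypothetical, and weights and missing states are ignored.

```r
# Toy data (made up): three sequences of length 3 over the alphabet {A, B}
seqs <- matrix(c("A", "A", "B",
                 "A", "B", "B",
                 "B", "B", "B"),
               nrow = 3, byrow = TRUE)
alphabet <- c("A", "B")

# Pool all observed transitions (position t -> t + 1) and count them
from <- factor(as.vector(seqs[, -ncol(seqs)]), levels = alphabet)
to   <- factor(as.vector(seqs[, -1]), levels = alphabet)
counts <- unclass(table(from, to))

# Row-normalise the counts to get time-constant transition probabilities
trate <- counts / rowSums(counts)

# Distribution of the first state: observed frequencies at position 1
begin_dist <- as.numeric(table(factor(seqs[, 1], levels = alphabet))) / nrow(seqs)
names(begin_dist) <- alphabet

# -log P(s) = -[log P(s_1) + sum over t > 1 of log p(s_t | s_{t-1})]
neg_logp <- apply(seqs, 1, function(s) {
  n <- length(s)
  p_first <- begin_dist[[s[1]]]
  p_trans <- trate[cbind(match(s[-n], alphabet), match(s[-1], alphabet))]
  -(log(p_first) + sum(log(p_trans)))
})
neg_logp
```

Here the first sequence (A, A, B) gets $P(s) = \frac{2}{3} \cdot \frac{1}{3} \cdot \frac{2}{3} = \frac{4}{27}$. A transition never observed in the data has an estimated probability of 0, which makes $-\log P(s)$ infinite for any sequence containing it.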

Custom transition probabilities can be provided by passing a matrix or a 3-dimensional array as prob argument.

The distribution used at the first position is set by means of the begin argument. You can either pass the distribution (probabilities of the states in the alphabet, including the missing value when with.missing=TRUE), or specify "freq" for the observed distribution at the first position, or "global.freq" for the overall state distribution.

The likelihood $P(s)$ being generally very small, seqlogp returns $-\log P(s)$. The latter quantity is minimal when $P(s)$ is equal to $1$.
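Taking the logarithm of the product defined above turns it into a sum, which also makes the lower bound explicit: $$ -\log P(s) = -\sum_{t=1}^{\ell} \log P(s_{t},t) \geq 0 $$ Each term is non-negative because $0 \leq P(s_{t},t) \leq 1$, so the value $0$ is reached only when every $P(s_{t},t)$ equals $1$.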

Examples

## Creating the sequence object using weights
data(biofam)
biofam.seq <-  seqdef(biofam, 10:25, weights=biofam$wp00tbgs)

## Computing sequence probabilities
biofam.prob <- seqlogp(biofam.seq)
## Comparing the probability of each cohort
cohort <- biofam$birthyr>1940
boxplot(biofam.prob~cohort)
