seqformat: Conversion between sequence formats

Description

Convert a sequence data set from one format to another.

Usage

seqformat(data, var = NULL, from, to, compress = FALSE, nrep = NULL, tevent,
  stsep = NULL, covar = NULL, SPS.in = list(xfix = "()", sdsep = ","),
  SPS.out = list(xfix = "()", sdsep = ","), id = 1, begin = 2, end = 3,
  status = 4, process = TRUE, pdata = NULL, pvar = NULL, limit = 100,
  overwrite = TRUE, fillblanks = NULL, tmin = NULL, tmax = NULL, missing = "*",
  with.missing = TRUE, right = "DEL", compressed, nr)

Value

A data frame for SRS, TSE, and SPELL, a matrix otherwise.

When from = "SPELL", the output has an attribute issues that gives the indexes of the problematic sequences (truncated sequences, missing start time, spells before birth year, ...).

Arguments

data

Data frame, matrix, stslist state sequence object, or character string vector. The data to use. (Tibble will be converted with as.data.frame).

A data frame or a matrix with sequence data in one or more columns when from = "STS" or from = "SPS". If sequence data are in a single column or in a string vector, they are assumed to be in the compressed form (see stsep).

A data frame with sequence data in one or more columns when from = "SPELL". If the data do not consist of four columns ordered as individual ID, spell start time, spell end time, and spell state status, use var or id / begin / end / status to identify the relevant columns.

A state sequence object when from = "STS" or from is not specified.

var

NULL, List of Integers or Strings. Default: NULL. The indexes or the names of the columns with the sequence data in data. If NULL, all columns are considered.

from

String. The format of the input sequence data. It can be "STS", "SPS", or "SPELL". It is not needed if data is a state sequence object.

to

String. The format of the output data. It can be "STS", "DSS", "SPS", "SRS", "SPELL", or "TSE".

compress

Logical. Default: FALSE. When to = "STS", to = "DSS", or to = "SPS", should the sequences (row vector of states) be concatenated into strings? See seqconc.

nrep

Integer. The number of shifted replications when to = "SRS".

tevent

Matrix. The transition-definition matrix when to = "TSE". It should be of size \(d \times d\) where \(d\) is the number of distinct states appearing in the sequences. The cell \((i,j)\) lists the events associated with a transition from state \(i\) to state \(j\). It can be created with seqetm.

stsep

NULL, Character. Default: NULL. The separator between states in the compressed form (strings) when from = "STS" or from = "SPS". If NULL, seqfcheck is called to detect the separator automatically; only "-" and ":" can be detected. Other separators must be specified explicitly. See seqdecomp.
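As a minimal sketch of the explicit case, the snippet below converts a string vector whose states are separated by "/", a separator that is not detected automatically. The toy sequences and the object names sts.str and sps are hypothetical and only serve the illustration.

```r
library(TraMineR)

## Hypothetical toy sequences in compressed STS form, with "/" as separator
sts.str <- c("A/A/B/B", "A/B/B/C")

## "/" is neither "-" nor ":", so it has to be passed through stsep
sps <- seqformat(sts.str, from = "STS", to = "SPS", stsep = "/")
print(sps)
```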

covar

List of Integers or Strings. The indexes or the names of additional columns in data to include as covariates in the output when to = "SRS". The covariates are replicated across the shifted replicated rows.

SPS.in

List. Default: list(xfix = "()", sdsep = ","). The specifications for the state-duration couples in the input data when from = "SPS". The first specification, xfix, specifies the prefix/suffix character. Use a two-character string if the prefix and the suffix differ. Use xfix = "" when no prefix/suffix are present. The second specification, sdsep, specifies the state/duration separator.

SPS.out

List. Default: list(xfix = "()", sdsep = ","). The specifications for the state-duration couples in the output data when to = "SPS". See SPS.in above.

id

NULL, Integer, String, List of Integers or Strings. Default: 1.

When from = "SPELL", the index or the name of the column containing the individual IDs in data (after var filtering).

When to = "TSE", the index or the name of the column containing the individual IDs in data (after var filtering) or the unique individual IDs. If id is not manually specified, id is set as NULL for backward compatibility with TraMineR 1.8-13 behaviour. If id is manually or automatically set as NULL, the original individual IDs are ignored and replaced by the indexes of the sequences in the input data.

When from = "SPELL" and to = "TSE", the index or the name of the column containing the individual IDs in data (after var filtering). The TSE output will use the original individual IDs.

begin

Integer or String. Default: 2. The index or the name of the column containing the spell start times in data (after var filtering) when from = "SPELL". Start times should be positive integers.

end

Integer or String. Default: 3. The index or the name of the column containing the spell end times in data (after var filtering) when from = "SPELL". End times should be positive integers.

status

Integer or String. Default: 4. The index or the name of the column containing the spell statuses in data (after var filtering) when from = "SPELL".

process

Logical. Default: TRUE. When from = "SPELL", if TRUE, create sequences on a process time axis, if FALSE, create sequences on a calendar time axis.

The process argument and the associated pdata and pvar arguments are intended for spell data with calendar begin and end times. When those times are ages, use process = FALSE with pdata = NULL to use those ages as process times. The option process = TRUE currently does not work for age times.

pdata

NULL, "auto", or data frame. Default: NULL. (tibble will be converted with as.data.frame).

If NULL and from = "SPELL", the start and end times of each spell are assumed to be ages if process = TRUE and years if process = FALSE.

If "auto", ages are computed using the start time of the first spell of each individual as her/his birthdate when from = "SPELL" and process = TRUE. For from = "SPELL" and process = FALSE, "auto" is equivalent to NULL.

A data frame containing the ID and the birth time of the individuals when from = "SPELL" or to = "SPELL". Use pvar to specify the column names. The ID is used to match the birth time of each individual with the sequence data. The birth time should be integer. It is the start time from which the positions on the time axis are computed. It also serves to compute tmin and to guess tmax when the latter are NULL, from = "SPELL", and process = FALSE.

pvar

List of Integers or Strings. The indexes or names of the columns of the data frame pdata that contain the ID and the birth time of the individuals in that order.

limit

Integer. Default: 100. The maximum age of age sequences when from = "SPELL" and process = TRUE. Age sequences will be considered to start at 1 and to end at limit.

overwrite

Logical. Default: TRUE. When from = "SPELL", if TRUE, the most recent episode overwrites the older one when they overlap each other, if FALSE, in case of overlap, the most recent episode starts after the end of the previous one.

fillblanks

Character. The value to fill gaps between episodes when from = "SPELL".

tmin

NULL or Integer. Default: NULL. The start time of the axis when from = "SPELL" and process = FALSE. If NULL, the value is the minimum of the spell start times (see begin) or the minimum of the birth time of the individuals (see pdata when it is a data frame and process = FALSE).

tmax

NULL or Integer. Default: NULL. The end time of the axis when from = "SPELL" and process = FALSE. If NULL, the value is the maximum of the spell end times (see end) or the sum of the maximum of the spell end times and of the maximum of the birth time of the individuals (see pdata when it is a data frame and process = FALSE).

missing

String. Default: "*". The code for missing states in data. It will be replaced by NA in the output data. Ignored when data is a state sequence object (see seqdef), in which case the attribute nr is used as missing value code.

with.missing

Logical. Default: TRUE. When to = "SPELL", should the spells of missing states be included?

right

One of "DEL" or NA. Default: "DEL". When to = "SPELL" and with.missing=TRUE, set right=NA to include the end spells of missing states.

compressed

Deprecated. Use compress instead.

nr

Deprecated. Use missing instead.

Author

Alexis Gabadinho, Pierre-Alexandre Fonta, Nicolas S. Müller, Matthias Studer, and Gilbert Ritschard.

Details

The seqformat function is used to convert data from one format to another. The input data is first converted into the STS format and then converted to the output format. Depending on input and output formats, some information can be lost in the conversion process. The output is a matrix or a data frame, NOT a sequence stslist object. To process, print or plot the sequences with TraMineR functions, you will have to first transform the data frame into a stslist state sequence object with seqdef. See Gabadinho et al. (2009) and Ritschard et al. (2009) for more details on longitudinal data formats and converting between them.
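As a minimal sketch of that last step, the snippet below converts the first ten actcal sequences to DSS and then builds a state sequence object from the result with seqdef. The subset size and the object names are illustrative choices.

```r
library(TraMineR)

data(actcal)

## seqformat returns a plain matrix here, not an stslist object
actcal.dss <- seqformat(actcal[1:10, ], 13:24, from = "STS", to = "DSS")
class(actcal.dss)

## Build a state sequence object before using other TraMineR functions
actcal.dss.seq <- seqdef(actcal.dss)
print(actcal.dss.seq)
```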

When data are in "SPELL" format (from = "SPELL"), the begin and end times are expected to be positions in the sequences. Therefore they should be strictly positive integers. With process=TRUE, the outcome sequences will be aligned on ages (process duration since birth), while with process=FALSE they will be aligned on dates (position on the calendar time). If process=TRUE, values in the begin and end columns of data are assumed to be ages when pdata is NULL and integer dates otherwise. If process=FALSE, begin and end values are assumed to be integer dates when pdata is NULL and ages otherwise.

To convert from person-period data, use from = "SPELL" and set both begin and end to the column index or name of the time variable. Alternatively, use the reshape function of the stats package, which is more efficient.
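Both routes can be sketched on a small example. The person-period data frame pp and its column names (id, time, state) are hypothetical. The call uses process = FALSE on the assumption that the time values should be used directly as positions.

```r
library(TraMineR)

## Hypothetical person-period data: one row per individual and time unit
pp <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  time  = c(1, 2, 3, 1, 2, 3),
  state = c("A", "A", "B", "B", "B", "C")
)

## Route 1: each row is a one-unit spell, so begin and end both point to 'time'
sts <- seqformat(pp, from = "SPELL", to = "STS",
  id = "id", begin = "time", end = "time", status = "state",
  process = FALSE)
print(sts)

## Route 2: reshape from the stats package, one column per time unit
wide <- reshape(pp, idvar = "id", timevar = "time", v.names = "state",
  direction = "wide")
print(wide)
```

In both cases the result is a plain table with one row per individual, which can then be passed to seqdef.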

References

Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2009). Mining Sequence Data in R with the TraMineR package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.

Ritschard, G., A. Gabadinho, M. Studer and N. S. Müller (2009). Converting between various sequence representations. In Ras, Z. & Dardzinska, A. (eds.), Advances in Data Management, Springer, 223, 155-175.

Examples


## ========================================
## Examples with raw STS sequences as input
## ========================================

## Loading the TraMineR package, which provides seqformat and the example data
library(TraMineR)

## Loading a data frame with sequence data in the columns 13 to 24
data(actcal)

## Converting to SPS format
actcal.SPS.A <- seqformat(actcal, 13:24, from = "STS", to = "SPS")
head(actcal.SPS.A)

## Converting to compressed SPS format with no
## prefix/suffix and with "/" as state/duration separator
actcal.SPS.B <- seqformat(actcal, 13:24, from = "STS", to = "SPS",
  compress = TRUE, SPS.out = list(xfix = "", sdsep = "/"))
head(actcal.SPS.B)

## Converting to compressed DSS format
actcal.DSS <- seqformat(actcal, 13:24, from = "STS", to = "DSS",
  compress = TRUE)
head(actcal.DSS)


## ==============================================
## Examples with a state sequence object as input
## ==============================================

## Loading a data frame with sequence data in the columns 10 to 25
data(biofam)

## Limiting the number of considered cases to the first 20
biofam <- biofam[1:20, ]

## Creating a state sequence object
biofam.labs <- c("Parent", "Left", "Married", "Left/Married",
  "Child", "Left/Child", "Left/Married/Child", "Divorced")
biofam.short.labs <- c("P", "L", "M", "LM", "C", "LC", "LMC", "D")
biofam.seq <- seqdef(biofam, 10:25, alphabet = 0:7,
  states = biofam.short.labs, labels = biofam.labs)

## Converting to SPELL format
bf.spell <- seqformat(biofam.seq, from = "STS", to = "SPELL",
  pdata = biofam, pvar = c("idhous", "birthyr"))
head(bf.spell)


## ======================================
## Examples with SPELL sequences as input
## ======================================

## Loading two data frames: bfspell20 and bfpdata20
## bfspell20 contains the first 20 biofam sequences in SPELL format
## bfpdata20 contains the IDs and the years at which the
## considered individuals were aged 15
data(bfspell)

## Converting to STS format with alignment on calendar years
bf.sts.y <- seqformat(bfspell20, from = "SPELL", to = "STS",
  id = "id", begin = "begin", end = "end", status = "states",
  process = FALSE)
head(bf.sts.y)

## Converting to STS format with alignment on ages
bf.sts.a <- seqformat(bfspell20, from = "SPELL", to = "STS",
  id = "id", begin = "begin", end = "end", status = "states",
  process = TRUE, pdata = bfpdata20, pvar = c("id", "when15"),
  limit = 16)
names(bf.sts.a) <- paste0("a", 15:30)
head(bf.sts.a)


## ==================================
## Examples for TSE and SPELL output
## in presence of missing values
## ==================================

data(ex1) ## STS data with missing values
## creating the state sequence object with by default
## the end missings coded as void ('%')
sqex1 <- seqdef(ex1[,1:13])
as.matrix(sqex1)

## Creating state-event transition matrices
ttrans <- seqetm(sqex1, method='transition')
tstate <- seqetm(sqex1, method='state')

## Converting into time stamped events
seqformat(sqex1, from = "STS", to = "TSE", tevent = ttrans)
seqformat(sqex1, from = "STS", to = "TSE", tevent = tstate)

## Converting into vertical spell data
seqformat(sqex1, from = "STS", to = "SPELL", with.missing=TRUE)
seqformat(sqex1, from = "STS", to = "SPELL", with.missing=TRUE, right=NA)
seqformat(sqex1, from = "STS", to = "SPELL", with.missing=FALSE)
