rankclust: Model-based clustering for multivariate partial ranking

Description

This function estimates a clustering of ranking data, which may be multivariate, partial, and contain ties, based on a mixture of multivariate ISR models [2]. When only one cluster is specified, the function fits a single multivariate ISR model to the ranking data. Estimation is performed with a SEM-Gibbs algorithm.

Usage

rankclust(
  data,
  m = ncol(data),
  K = 1,
  criterion = "bic",
  Qsem = 100,
  Bsem = 20,
  RjSE = m * (m - 1)/2,
  RjM = m * (m - 1)/2,
  Ql = 500,
  Bl = 100,
  maxTry = 3,
  run = 1,
  detail = FALSE
)

Value

An object of class Rankclust (see Output-class and Rankclust-class). If the output object is named res, you can access the results with res[number of groups]@slotName, where slotName is an element of the class Output.

Arguments

data: a matrix in which each row is a ranking, partial or not. For partial rankings, missing elements must be 0 or NA. Ties are replaced by the lowest position they share. For multivariate rankings, the rankings of each dimension are placed end to end in each row. The data must be in ranking notation (see Details or the convertRank function).
m: a vector composed of the sizes of the rankings of each dimension (default value is the number of columns of the matrix data).
K: an integer or a vector of integers giving the number of clusters.
criterion: the criterion, "bic" or "icl", to minimize when selecting the number of clusters.
Qsem: the total number of iterations for the SEM algorithm (default value=100).
Bsem: burn-in period for the SEM algorithm (default value=20).
RjSE: a vector containing, for each dimension, the number of iterations of the Gibbs sampler used both in the SE step for partial rankings and for the generation of presentation orders (default value=mj(mj-1)/2).
RjM: a vector containing, for each dimension, the number of iterations of the Gibbs sampler used in the M step (default value=mj(mj-1)/2).
Ql: number of iterations of the Gibbs sampler for estimation of the log-likelihood (default value=500).
Bl: burn-in period for estimation of the log-likelihood (default value=100).
maxTry: maximum number of restarts of the SEM-Gibbs algorithm in case of non-convergence (default value=3).
run: number of runs of the algorithm for each value of K (default value=1).
detail: boolean; if TRUE, timing and other information will be printed during the process (default value=FALSE).
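
As an illustration of the data layout and of the default Gibbs iteration counts described above, here is a minimal sketch in base R. The toy rankings and the names dim1, dim2, toy_data and default_Rj are hypothetical and only serve the example.

```r
# Hypothetical toy data: 4 judges, two dimensions with 3 and 4 objects.
dim1 <- rbind(c(1, 2, 3),
              c(2, 1, 3),
              c(3, 1, 2),
              c(1, 3, 2))
dim2 <- rbind(c(1, 2, 3, 4),
              c(4, 3, 2, 1),
              c(2, 1, 4, 3),
              c(1, 3, 2, 4))

# The rankings of each dimension are placed end to end in each row.
toy_data <- cbind(dim1, dim2)

# m holds the size of the rankings of each dimension.
m <- c(ncol(dim1), ncol(dim2))

# Default values of RjSE and RjM, one entry per dimension: m_j * (m_j - 1) / 2.
default_Rj <- m * (m - 1) / 2
```

A matrix built this way could then be passed as rankclust(toy_data, m = m, ...). With so few rows this toy matrix only illustrates the format and is not meant to yield a meaningful fit.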

Author

Quentin Grimonprez

Details

The ranks have to be given to the package in the ranking notation (see convertRank function), with the following convention:

- missing positions are replaced by 0

- ties are replaced by the lowest position they share

See the vignette dataFormat for more details (RShowDoc("dataFormat", package = "Rankcluster")).
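
The two conventions above can be sketched in base R as follows. This is only an illustration under stated assumptions: the scores vector is hypothetical, and "lowest position" is read as the numerically smallest rank shared by the tied objects. The dataFormat vignette remains the reference.

```r
# Hypothetical scores given by one judge to four objects (higher is better).
# Objects B and C are tied; object D was not ranked.
scores <- c(A = 9, B = 7, C = 7, D = NA)

# Ties receive the smallest rank they share; unranked objects stay NA.
ranking <- rank(-scores, ties.method = "min", na.last = "keep")

# Missing positions are replaced by 0.
ranking[is.na(ranking)] <- 0
```

Here A is in position 1, B and C share positions 2 and 3 and are both coded 2, and D is coded 0.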

The ranking representation r=(r_1,...,r_m) contains the ranks assigned to the objects: the ith object is in position r_i.

The ordering representation o=(o_1,...,o_m) means that object o_i is in the ith position.

Let us consider the following example to illustrate both notations: a judge, who has to rank three holiday destinations according to their preferences, O1 = Countryside, O2 = Mountain and O3 = Sea, ranks Sea first, Countryside second, and Mountain last. The ordering result of the judge is o = (3, 1, 2) whereas the ranking result is r = (2, 3, 1).
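
For a full ranking without ties, the two notations are inverse permutations of each other, so the example above can be checked with base R's order(). This is a minimal sketch; it does not handle partial or tied rankings, for which the package's convertRank function is the documented route.

```r
# Ranking notation from the example: Countryside is 2nd, Mountain 3rd, Sea 1st.
ranking <- c(2, 3, 1)

# Ordering notation: the object placed 1st, then 2nd, then 3rd.
ordering <- order(ranking)

# Applying order() again recovers the ranking notation.
back <- order(ordering)
```

Here ordering is (3, 1, 2), matching o in the example, and back equals the original r = (2, 3, 1).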

References

[1] C. Biernacki and J. Jacques (2013), A generative model for rank data based on sorting algorithm, Computational Statistics and Data Analysis, 58, 162-176.

[2] J. Jacques and C. Biernacki (2012), Model-based clustering for multivariate partial ranking data, Inria Research Report no. 8113.

Examples

data(big4)
result <- rankclust(big4$data, K = 2, m = big4$m, Ql = 200, Bl = 100, maxTry = 2)

if(result@convergence) {
  summary(result)

  partition <- result[2]@partition
  tik <- result[2]@tik
}