Cluster the rows of a logical matrix using the Proximus algorithm. The compression rate of the algorithm can be influenced by the choice of the maximum cluster radius and the minimum cluster size.
proximus(x, max.radius = 2, min.size = 1, min.retry = 10,
max.iter = 16, debug = FALSE)
An object of class proximus
with the following components:
the number of rows of the data matrix.
the number of columns of the data matrix.
a list containing the approximations (patterns).
a vector of row (presence set) indexes.
a vector of column (dominant attribute set) indexes.
the number of ones in the approximated submatrix.
the absolute error reduction by the approximation.
see arguments.
see arguments.
rownames of the data matrix.
colnames of the data matrix.
a logical matrix.
the maximum number of bits a member in a row set may deviate from its dominant pattern.
the minimum split size of a row set.
number of retries to split a pure rank-one approximation (translates into a resampling rate).
the maximum number of iterations for finding a local rank-one approximation.
optional debugging output.
Christian Buchta
Deep recursions may exhaust your computer.
The intended area of application is the compression of high-dimensional binary data into representative patterns. For instance, purchase incidence (market basket data) or term-document matrices may be preprocessed by Proximus for later association rule mining.
The algorithm is of a recursive partitioning type. Specifically, at each step a binary split is attempted using a local rank-one approximation of the current submatrix (row set). That is a specialization of principal components to binary data which represents a matrix as the outer product of two binary vectors. The node expansion stops if a submatrix is pure, i.e., the column (presence set) vector indicates all the rows and the Hamming distances from the row (dominant attribute set) pattern vector, or the size of the row set, are less than or equal the specified threshold. In the case the rank-one approximation does not result in a split but the radius constraint is violated, the matrix is split using a random row and the radius constraint.
The debug option can be used to gain some insight into how the algorithm
proceeds: a right angle bracket indicates a split and the return to
a recursion level is indicated by a left one. Leafs in the recursion tree
are indicated by an asterisk and retries by a plus sign. The number of
retries is bounded by the size of the current set divided by
min.retry
.
Double angle brackets indicate a random split (see above). The numbers
between square brackets indicate the current set size, the size of the
presence (sub)set, and its radius. The adjoining numbers indicate the
depth of the recursion and the count of retries. Finally, a count of
the leaf nodes found so far is shown to the right of an asterisk.
M. Koyutürk, A. Graham, and N. Ramakrishnan. Compression, Clustering, and Pattern Discovery in Very High-Dimensional Discrete-Attribute Data Sets. IEEE Transactions On Knowledge and Data Engineering, Vol. 17, No. 4, (April) 2005.
summary.proximus
for summaries,
fitted
for obtaining the approximated matrix and the
pattern labels of the samples, and
lmplot
for plotting logical matrices.
x <- matrix(sample(c(FALSE, TRUE), 200, rep=TRUE), ncol=10)
pr <- proximus(x, max.radius=8)
summary(pr)
### example from paper
x <- rlbmat()
pr <- proximus(x, max.radius=8, debug=TRUE)
op <- par(mfrow=c(1,2), pty="s")
lmplot(x, main="Data")
box()
lmplot(fitted(pr)$x, main="Approximation")
box()
par(op)
Run the code above in your browser using DataLab