The main idea of the P3C algorithm is to use statistical tests to find
clusters. First, each dimension is split into 1+log_2(nrow(data)) bins and the
chi-square test is used to compute the probability that the points are
uniformly distributed over these bins. If this probability is bigger than
1-ChiSquareAlpha, the dimension is left as it is. Otherwise the largest bins
are removed, one at a time, until this is the case. The bins that were removed
in this way are then used to find clusters. Removed bins that are adjacent are
merged first. Clusters are then formed by taking a bin from one dimension and
computing, for each bin from another dimension, the probability of sharing as
many points with it as it does. The bin is intersected with the bin from the
other dimension for which this probability is the lowest, provided that it is
lower than 1.00*10^-PoissonThreshold. This is repeated until no such bin is
found.
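The binning, testing and merging steps for a single dimension can be sketched
in base R as follows. This is only an illustration under stated assumptions,
not the package's implementation: marked_intervals and p_cutoff are
hypothetical names, the number of bins is rounded down to an integer, and the
p-value of chisq.test is taken as the probability of uniformity. The cutoff is
passed in directly rather than derived from ChiSquareAlpha.

```r
# Hypothetical helper, not part of any package: returns the removed bins of one
# dimension as a list of runs of adjacent bin indices.
marked_intervals <- function(x, p_cutoff) {
  n_bins <- floor(1 + log2(length(x)))   # assumption: round down to an integer
  # Equal-width binning; tabulate() keeps empty bins as zero counts.
  counts <- tabulate(as.integer(cut(x, breaks = n_bins)), nbins = n_bins)
  remaining <- seq_len(n_bins)
  marked <- integer(0)
  while (length(remaining) > 1 && sum(counts[remaining]) > 0) {
    # Goodness-of-fit test against equal bin sizes (warnings about small
    # expected counts are silenced in this sketch).
    p <- suppressWarnings(chisq.test(counts[remaining])$p.value)
    if (p > p_cutoff) break              # remaining bins look uniform
    largest <- remaining[which.max(counts[remaining])]
    marked <- c(marked, largest)
    remaining <- setdiff(remaining, largest)
  }
  if (length(marked) == 0) return(list())
  marked <- sort(marked)
  # Merge adjacent bins: a new run starts wherever the index jumps by more than 1.
  unname(split(marked, cumsum(c(TRUE, diff(marked) != 1))))
}

# 64 evenly spread points plus 64 points packed into [0.40, 0.60].
x <- c(seq(0, 1, length.out = 64), seq(0.40, 0.60, length.out = 64))
marked_intervals(x, p_cutoff = 0.05)
```

With this toy input the dense region falls into bins 4 and 5 of the 8 bins.
Both are removed and then merged into a single interval. The cutoff of 0.05 is
illustrative only.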
ChiSquareAlpha
maximum probability of not being uniformly distributed that
the points in a dimension are allowed to have before it is assumed that there
is a cluster visible from this dimension.
PoissonThreshold
maximum probability that is allowed for a bin in another
dimension to deviate from the observed bin as much as it does. The value used
for this will be 1.00*10^-PoissonThreshold.
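The role of PoissonThreshold can be illustrated with a small base R sketch.
It is one plausible reading of the description and not the package's code:
poisson_tail and pick_bin are hypothetical helpers, the expected number of
shared points is assumed to be the size of the current bin times the relative
width of the candidate bin, and the probability is taken to be the upper
Poisson tail.

```r
# Hypothetical helper: probability of seeing at least `observed` shared points
# when `expected` shared points would occur by chance (upper Poisson tail).
poisson_tail <- function(observed, expected) {
  ppois(observed - 1, lambda = expected, lower.tail = FALSE)
}

# Hypothetical helper: index of the candidate bin to intersect with, or NA.
# current:    logical vector, TRUE for points in the current bin
# candidates: list of logical vectors, one per bin from another dimension
# widths:     assumed relative width (between 0 and 1) of each candidate bin
pick_bin <- function(current, candidates, widths, poisson_threshold) {
  n_current <- sum(current)
  probs <- mapply(function(cand, w) {
    poisson_tail(sum(current & cand), n_current * w)
  }, candidates, widths)
  best <- which.min(probs)
  # Accept only if the lowest probability is below 10^-PoissonThreshold.
  if (probs[best] < 10^(-poisson_threshold)) best else NA_integer_
}

# Toy data: 100 points, the current bin holds the first 40 of them.
current <- rep(c(TRUE, FALSE), c(40, 60))
cand_a  <- seq_len(100) <= 38        # shares 38 points, 8 expected
cand_b  <- seq_len(100) %in% 31:55   # shares 10 points, 10 expected
pick_bin(current, list(cand_a, cand_b), widths = c(0.2, 0.25),
         poisson_threshold = 10)
```

A larger PoissonThreshold makes the check stricter: with the same toy data and
poisson_threshold = 20 no candidate passes and the sketch returns NA.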
References
Gabriela Moise, Jörg Sander and Martin Ester. P3C: A Robust Projected
Clustering Algorithm. In Proc. 6th IEEE International Conference on Data
Mining, 2006.