Learn R Programming

shipunov (version 1.17.1)

NC.dist: Normalized Compression Distance

Description

Calculates the normalized compression distance

Usage

NC.dist(data, method="gzip", character=TRUE)

Value

Distance object with distances among rows of 'data'

Arguments

data

Matrix (or data frame) with variables that should be used in the computation of the distance between rows.

method

Taken from memCompress(): either "gzip", or "bzip2", or "xz"; the last is very slow

character

Convert to character mode (default), or use as raw?

Author

Alexey Shipunov

Details

NC.dist() computes the distance based on the sizes of the compressed vectors. It is calculated as

dissimilarity(x, y) = B(x, y) - max(B(x), B(y)) / min(B(x), B(y))

where B(x) and B(y) are the bytesizes of the compressed 'x' and 'y', and B(x, y) is the comressed bytesize of concatenated 'x' and 'y'. The algorithm uses basic memCompress() function.

If argument is the data frame, NC.dist() internally converts it into the matrix. All columns by default will be converted into character mode (and if 'character=FALSE', into raw). This default behavior allows NC.dist() to be the universal distance which also does not mind NAs and zeroes.

References

Cilibrasi, R., & Vitanyi, P. M. (2005). Clustering by compression. Information Theory, IEEE Transactions on, 51(4), 1523-1545.

See Also

Examples

Run this code

## converts variables into character, universal method
iris.nc <- NC.dist(iris[, -5])
iris.hnc <- hclust(iris.nc, method="ward.D2")
## amazingly, it works even for vectors with length=4 (iris data rows)
plot(prcomp(iris[, -5])$x, col=cutree(iris.hnc, 3))

## using variables as raw, it is good when they are uniform
iris.nc2 <- NC.dist(iris[, -5], character=FALSE)
iris.hnc2 <- hclust(iris.nc2, method="ward.D2")
plot(prcomp(iris[, -5])$x, col=cutree(iris.hnc2, 3))

## bzip2 uses Burrows-Wheeler transform
NC.dist(matrix(runif(100), ncol=10), method="bzip2")

Run the code above in your browser using DataLab