IDmining (version 1.0.7)

MBRM_parallel: Morisita-Based Filter for Redundancy Minimization (Parallel)

Description

Executes the Morisita-Based filter for Redundancy Minimization (MBRM) algorithm for unsupervised feature selection, distributing the computation across several CPU cores.

Usage

MBRM_parallel(X, scaleQ, m=2, C=NULL, ID_tot=NULL, ncores=4)

Arguments

X

A \(N \times E\) matrix, data.frame or data.table where \(N\) is the number of data points and \(E\) is the number of variables (or features). Each variable is rescaled to the \([0,1]\) interval by the function.

scaleQ

A vector containing the values of \(\ell^{-1}\) chosen by the user (see Details).

m

The order m of the multipoint Morisita index (by default: m = 2).

C

The number of steps of the Sequential Forward Selection (SFS) procedure (by default: C = E, the number of variables).

ID_tot

The intrinsic dimension (ID) of the full data set, if it is known a priori (by default: ID_tot is estimated within the function using the Morisita estimator of ID).

ncores

Number of workers (by default: ncores = 4).
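
The page does not say how to choose ncores. As a rough sketch, one common convention (an assumption here, not a package recommendation) is to query the machine with detectCores() from the base parallel package and leave one core free:

```r
library(parallel)  # ships with base R

# Number of logical cores reported by the OS; can be NA on some platforms.
n_available <- detectCores()

# Assumed convention: leave one core free, never request fewer than one worker.
ncores <- if (is.na(n_available)) 1L else max(1L, n_available - 1L)
ncores
```

The resulting value could then be passed as the ncores argument of MBRM_parallel. Requesting more workers than physical cores rarely helps and may slow the run down.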

Value

A list of four elements:

  1. a vector containing the identifier numbers of the original features in the order they are selected through the Sequential Forward Selection (SFS) search procedure.

  2. the names of the corresponding features.

  3. the corresponding ID estimates.

  4. the ID estimate of the full data set.

Details

  1. \(\ell\) is the edge length of the grid cells (or quadrats). Since the variables (and consequently the grid) are rescaled to the \([0,1]\) interval, \(\ell\) is equal to \(1\) for a grid consisting of only one cell.

  2. \(\ell^{-1}\) is the number of grid cells (or quadrats) along each axis of the Euclidean space in which the data points are embedded.

  3. \(\ell^{-1}\) is equal to \(Q^{(1/E)}\) where \(Q\) is the number of grid cells and \(E\) is the number of variables (or features).

  4. \(\ell^{-1}\) is directly related to \(\delta\) (see References).

  5. \(\delta\) is the diagonal length of the grid cells.

  6. The values of \(\ell^{-1}\) in scaleQ must be chosen according to the linear part of the \(\log\)-\(\log\) plot relating the \(\log\) values of the multipoint Morisita index to the \(\log\) values of \(\delta\) (or, equivalently, to the \(\log\) values of \(\ell^{-1}\)) (see logMINDEX).
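
The relations in the list above can be checked numerically. The sketch below uses base R only and does not call IDmining. The helper rescale01, the toy data and the choice E = 2 are assumptions made for illustration, and the min-max rescaling is only a stand-in for whatever the function does internally. The diagonal is computed as \(\delta = \ell\sqrt{E}\), which holds for cells that are hypercubes of edge \(\ell\).

```r
# Min-max rescaling of one column to [0, 1] (assumed stand-in for the
# internal rescaling; it would fail on a constant column).
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

set.seed(1)
X  <- cbind(a = runif(100, 10, 20), b = rnorm(100))  # toy data with E = 2
Xs <- apply(X, 2, rescale01)                         # every column now spans [0, 1]
E  <- ncol(Xs)

scaleQ <- 5:25            # candidate values of l^{-1} (cells per axis)
ell    <- 1 / scaleQ      # edge length of a grid cell
Q      <- scaleQ^E        # total number of grid cells, so l^{-1} = Q^(1/E)
delta  <- ell * sqrt(E)   # diagonal length of a hypercubic cell of edge l

head(data.frame(l_inv = scaleQ, l = ell, Q = Q, delta = delta), 3)
```

Because \(\delta\) and \(\ell^{-1}\) differ only by the constant factor \(\sqrt{E}\) and an inversion, a straight line in the \(\log\)-\(\log\) plot against one is a straight line against the other.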

References

J. Golay and M. Kanevski (2017). Unsupervised feature selection based on the Morisita estimator of intrinsic dimension, Knowledge-Based Systems 135:125-134.

Examples

# NOT RUN {
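# Generate a Butterfly data set with 10000 points
bf <- Butterfly(10000)

# Run MBRM on all columns except the 9th, with l^{-1} = 5, ..., 25 and 2 workers
bf_select <- MBRM_parallel(bf[,-9], 5:25, ncores=2)
var_order <- bf_select[[2]]  # feature names, in the order they were selected
var_perf  <- bf_select[[3]]  # corresponding ID estimates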

# }
# NOT RUN {
dev.new(width=5, height=4)
plot(var_perf,type="b",pch=16,lwd=2,xaxt="n",xlab="", ylab="",
     col="red",ylim=c(0,max(var_perf)),panel.first={grid(lwd=1.5)})
axis(1,seq_along(var_order),labels=var_order)
mtext(1,text="Added Features (from left to right)",line=2.5,cex=1)
mtext(2,text="Estimated ID",line=2.5,cex=1)

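# Compare the run times of the sequential and parallel versions on a larger data set
bf_large <- Butterfly(10^5)
system.time(MBRM(bf_large[,-9], 5:25))
system.time(MBRM_parallel(bf_large[,-9], 5:25))  # uses the default ncores = 4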
# }
