FOCI is a forward stepwise algorithm that uses the conditional dependence
coefficient (codec) at each step, instead of the multiple correlation
coefficient as in ordinary forward stepwise. If stop == TRUE, the process is
stopped at the first instance of nonpositive codec, thereby selecting a subset
of variables. Otherwise, a set of covariates of size num_features, ordered
according to predictive power (as measured by codec), is produced.
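The control flow described above can be sketched in a few lines of base R. This is only an illustration, not the package's implementation: forward_select and r2_gain are hypothetical names, and the criterion used is a stand-in (gain in R-squared minus a small threshold), not codec.

```r
# Generic forward stepwise skeleton. `gain_fn` is a hypothetical stand-in for
# the per-step criterion; FOCI uses codec here, which this sketch does not compute.
forward_select <- function(Y, X, gain_fn, num_features = ncol(X), stop = TRUE) {
  selected <- integer(0)
  candidates <- seq_len(ncol(X))
  while (length(selected) < num_features && length(candidates) > 0) {
    gains <- vapply(candidates,
                    function(j) gain_fn(Y, X, selected, j),
                    numeric(1))
    best <- which.max(gains)
    # With stop = TRUE, halt at the first nonpositive value of the criterion.
    if (stop && gains[best] <= 0) break
    selected <- c(selected, candidates[best])
    candidates <- candidates[-best]
  }
  selected
}

# Stand-in criterion (NOT codec): increase in R^2, minus an assumed threshold.
r2_gain <- function(Y, X, selected, j, threshold = 0.01) {
  r2 <- function(cols) {
    if (length(cols) == 0) return(0)
    summary(lm(Y ~ X[, cols, drop = FALSE]))$r.squared
  }
  r2(c(selected, j)) - r2(selected) - threshold
}

set.seed(1)
X <- matrix(rnorm(500 * 4), ncol = 4)
Y <- 2 * X[, 1] + X[, 3] + rnorm(500, sd = 0.1)

sel_stop <- forward_select(Y, X, r2_gain, stop = TRUE)                   # subset
sel_all  <- forward_select(Y, X, r2_gain, num_features = 4, stop = FALSE) # ordering
```

With stop = TRUE the loop returns only the variables chosen before the criterion turns nonpositive. With stop = FALSE it returns num_features variables in the order they were selected.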
Parallel computation:
The computation can be lengthy, so the package offers two kinds of
parallel computation.
The first, controlled by the argument numCores, specifies the number of
cores to be used on the host
machine. If at a given step there are k candidate variables
under consideration for inclusion, these k tasks are assigned
to the various cores.
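As a rough illustration of how k candidate evaluations can be spread over cores, the sketch below uses parallel::mclapply from base R. The function score_candidate is a hypothetical placeholder for the per-candidate work (here just an absolute correlation, not codec), and the core count of 2 is an arbitrary assumption.

```r
library(parallel)

# Hypothetical per-candidate task; a placeholder, not the codec computation.
score_candidate <- function(j, x_mat, y_vec) abs(cor(x_mat[, j], y_vec))

set.seed(1)
X <- matrix(rnorm(200 * 6), ncol = 6)
Y <- X[, 2] + rnorm(200, sd = 0.5)

candidates <- seq_len(ncol(X))   # the k candidate variables at this step

# mclapply relies on forking, so on Windows mc.cores must be 1.
num_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One task per candidate, distributed over the available cores.
scores <- unlist(mclapply(candidates, score_candidate,
                          x_mat = X, y_vec = Y, mc.cores = num_cores))
best <- candidates[which.max(scores)]
```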
The second approach, controlled by the argument parPlat
("parallel platform"), involves the user first setting up a cluster via
the parallel package. The data are divided into chunks by rows,
with each cluster node applying FOCI to its data chunk. The
union of the results is then formed and fed through FOCI one more
time to reconcile discrepancies among the chunks. The idea is that this
final pass will not be too lengthy, as the number of candidate variables
has already been reduced. A cluster size of r may actually
produce a speedup factor of more than r (Matloff 2016).
Potentially the best speedup is achieved by using the two approaches
together.
The first approach cannot be used on Windows platforms, as
parallel::mclapply relies on forking and does not run in parallel there.
Windows users should thus use the second approach only.
In addition to speed, the second approach is useful for diagnostics, as
the results from the different chunks give the user an
idea of the degree of sampling variability in the
FOCI results.
In the second approach, a random permutation is applied to the
rows of the dataset, as many datasets are sorted by one or more
columns.
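The permute, chunk, union and final-pass pattern of the second approach can be sketched with a small cluster from the parallel package. This is a minimal illustration under stated assumptions, not the package's code: select_on_chunk is a hypothetical stand-in for running the selection on one chunk (a simple correlation cutoff), and the two-node cluster is arbitrary.

```r
library(parallel)

# Hypothetical stand-in for running the selection on one chunk of rows.
select_on_chunk <- function(rows, x_mat, y_vec, cutoff = 0.3) {
  which(abs(cor(x_mat[rows, , drop = FALSE], y_vec[rows])) > cutoff)
}

set.seed(1)
n <- 600
X <- matrix(rnorm(n * 5), ncol = 5)
Y <- X[, 1] + X[, 4] + rnorm(n, sd = 0.5)

# Randomly permute the rows first, since the data may be sorted by some column.
perm <- sample(n)

cl <- makeCluster(2)
# Split the permuted row indices into one chunk per cluster node.
chunks <- lapply(splitIndices(n, length(cl)), function(idx) perm[idx])
per_chunk <- parLapply(cl, chunks, select_on_chunk, x_mat = X, y_vec = Y)
stopCluster(cl)

# Form the union of the per-chunk selections, then make one more pass over
# the full data restricted to that reduced candidate set.
union_vars <- sort(unique(unlist(per_chunk)))
keep <- select_on_chunk(seq_len(n), X[, union_vars, drop = FALSE], Y)
final <- union_vars[keep]
```

Comparing the elements of per_chunk with one another gives a rough view of how much the selection varies from chunk to chunk.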
Note that if a certain value of a feature is rare in the
full dataset, it may be absent entirely in some chunk.