DistributionFitR (version 0.1)

globalfit: Detect continuity and fit multiple distributions to given data

Description

Given a numeric data vector, this function fits multiple distributions by maximum likelihood and returns an object containing the best-fitting parameters and information criteria. See the “Examples” section or the documentation of the result class globalfit for how to sort and print the results, e.g. with summary.

Usage

globalfit(data, continuity = NULL, method = "MLE",
  verbose = TRUE, packages = "stats",
  append_packages = FALSE, cores = NULL,
  max_dim_discrete = Inf, sanity = 1,
  timeout = 5
  )

Arguments

data

numeric vector of data points.

continuity

logical; if TRUE, the data is fitted with continuous distributions. If no input is given, the data will be tested for continuity.

method

character; method for parameter estimation. Only maximum likelihood estimation is implemented so far, so this argument must be "MLE".

verbose

logical; if TRUE, show progress and the packages from which distributions are fitted.

packages

either a character vector of package names, a list such as the one returned by getFamilies, or NULL, in which case all families known to this package are used (recommended).

Default is "stats".

append_packages

logical; if TRUE, the packages given in the argument packages are appended to the standard search list; if FALSE (the default according to the Usage section), globalfit uses only those packages and ignores the standard search list.

max_dim_discrete

non-negative integer; distributions with more non-continuous parameters than max_dim_discrete will not be considered. Setting this manually is recommended if computation time has to be reduced.

cores

integer; number of CPU cores to be used in the calculations of best fitted parameters and information criteria.

sanity

either a positive numeric or logical; if it is a positive numeric, it controls a sanity check that filters out obviously bad fits. The smaller the number, the stricter the check and the more candidate distributions are rejected.

If sanity = FALSE a sanity check is not carried out. (DistributionFitR generally depends on other packages to supply reasonable distribution functions.)

Default is 1.

timeout

logical or numeric; if it is a positive numeric, it gives the number of seconds until timeout for the underlying optimiser optim.

If timeout = FALSE no timeout is performed.

Value

globalfit returns an object of class globalfit.

Details

If no continuity argument is given, the function first tests, using multiple criteria, whether the data are continuously or discretely distributed. Based on that result, the corresponding distributions from getFamilies() are fitted to the data by maximum likelihood, and information criteria are calculated. For discrete data that are not integer-valued, an appropriate linear transformation is applied to ensure stable optimization.
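To make the fitting step more concrete, the following is a minimal base-R sketch of maximum likelihood fitting followed by a comparison via information criteria. It is an illustration only and not DistributionFitR's implementation: the two candidate families, the helper info_criteria and the log-scale parametrisation are assumptions chosen for this example.

```r
# Illustrative sketch, not the package's code: fit two candidate families
# by maximum likelihood with stats::optim and rank them by AIC.
set.seed(1)
x <- rnorm(100, mean = 70, sd = 4)

# Negative log-likelihoods. Scale and rate are optimised on the log scale
# so that optim can never propose a non-positive value.
nll_norm <- function(par) -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
nll_exp  <- function(par) -sum(dexp(x, rate = exp(par[1]), log = TRUE))

fit_norm <- optim(c(mean(x), log(sd(x))), nll_norm)            # Nelder-Mead, 2 parameters
fit_exp  <- optim(log(1 / mean(x)), nll_exp, method = "BFGS")  # 1 parameter

# Hypothetical helper: AIC = 2k - 2 logL and BIC = k log(n) - 2 logL,
# where fit$value is the minimised negative log-likelihood.
info_criteria <- function(fit, k, n) {
  c(AIC = 2 * k + 2 * fit$value, BIC = k * log(n) + 2 * fit$value)
}

n <- length(x)
ranking <- rbind(
  normal      = info_criteria(fit_norm, k = 2, n = n),
  exponential = info_criteria(fit_exp,  k = 1, n = n)
)
ranking <- ranking[order(ranking[, "AIC"]), ]  # smaller AIC = better fit
print(ranking)
```

Lower values of either criterion indicate a better trade-off between fit and number of parameters, which is the kind of ordering a summary of fitted distributions can be sorted by.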

Since DistributionFitR can in principle compare distributions from any R package, computation time is likely to be an issue. The following may help:

  • using argument packages with append_packages = FALSE to restrict the search to certain packages

  • discarding distributions with too many discrete parameters using argument max_dim_discrete

  • specifying timeout, which limits the time spent on each distribution (not overall!). The value of timeout does not translate directly into the actual maximum time, because the number of times optim is run differs between algorithms.
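The three options above can be combined in a single call. The sketch below uses only argument names from the Usage section; the concrete values and the list name fast_args are illustrative assumptions, not recommendations from the package authors. The call itself is guarded so that the snippet also runs where DistributionFitR is not installed.

```r
set.seed(1)
data <- rnorm(n = 100, mean = 70, sd = 4)

# Argument names are taken from the Usage section; values are illustrative.
fast_args <- list(
  data = data,
  packages = "stats",        # search only this package ...
  append_packages = FALSE,   # ... and ignore the standard search list
  max_dim_discrete = 1,      # skip families with more than one discrete parameter
  timeout = 2,               # seconds per distribution, not for the whole run
  verbose = FALSE,
  cores = if (interactive()) NULL else 2
)

# Only attempt the fit if the package is available.
if (requireNamespace("DistributionFitR", quietly = TRUE)) {
  r <- do.call(DistributionFitR::globalfit, fast_args)
  summary(r)
}
```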

Examples

# NOT RUN {
  # Example 1
  data <- rnorm(n = 100, mean = 70, sd = 4)
  r <- globalfit(data, cores = if(interactive()) NULL else 2)
  summary(r)

# }
# NOT RUN {
  # Example 2
  # Alternatively, it is possible to input whether the data is
  # continuous or discrete
  globalfit(data, continuity = TRUE)

  # Example 3
  # fit over all distributions in the standard search list
  globalfit(data, packages = NULL)
  
# }
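As a further sketch, the same interface can presumably be used for count data by declaring the input as discrete. This assumes that continuity = FALSE selects discrete distributions, as the argument description suggests; the simulated Poisson data and the object names are illustrative, and the call is guarded in case the package is not installed.

```r
set.seed(1)
counts <- rpois(n = 100, lambda = 3)  # simulated integer-valued data

# Simple check that the data really are integer-valued before
# declaring them discrete.
is_integer_valued <- all(counts == round(counts))

if (is_integer_valued && requireNamespace("DistributionFitR", quietly = TRUE)) {
  r_disc <- DistributionFitR::globalfit(
    counts,
    continuity = FALSE,  # assumption: FALSE means "fit discrete distributions"
    verbose = FALSE,
    cores = if (interactive()) NULL else 2
  )
  summary(r_disc)
}
```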
