Learn R Programming

propagate (version 1.0-6)

fitDistr: Fitting distributions to observations/Monte Carlo simulations

Description

This function fits 32 different continuous distributions by (weighted) NLS to the histogram of Monte Carlo simulation results as obtained by propagate or any other vector containing large-scale observations. Finally, the fits are sorted by ascending BIC.

Usage

fitDistr(object, nbin = 100, weights = FALSE, verbose = TRUE, 
         brute = c("fast", "slow"), plot = c("hist", "qq"),  distsel = NULL, ...)

Arguments

object

an object of class 'propagate' or a vector containing observations.

nbin

the number of bins in the histogram.

weights

numeric or logical. Either a numeric vector of weights, or if TRUE, the distributions are fitted with weights = 1/(counts per bin).

verbose

logical. If TRUE, steps of the analysis are printed to the console.

brute

complexity of the brute force approach. See 'Details'.

plot

if "hist", a plot with the "best" distribution (in terms of lowest BIC) on top of the histogram is displayed. If "qq", a QQ-Plot will display the difference between the observed and fitted quantiles.

distsel

a vector of distribution numbers to select from the complete cohort as listed below, e.g. c(1:10, 15).

...

other parameters to be passed to the plots.

Value

A list with the following items:

stat: the by BIC value ascendingly sorted distribution names, including RSS and MSE. fit: a list of the results from nls.lm for each distribution model, also sorted ascendingly by BIC values. par: a list of the estimated parameters of the models in fit. se: a list of the parameters' standard errors, calculated from the square root of the covariance matrices diagonals. dens: a list with all density function used for fitting, sorted as in fit. bestfit: the best model in terms of lowest BIC. bestpar: the parameters of bestfit. bestse: the parameters' standard errors of bestfit. fitted: the fitted values of bestfit. residuals: the residuals of bestfit.

Details

Fits the following 32 distributions using (weighted) residual sum-of-squares as the minimization criterion for nls.lm: 1) Normal distribution (dnorm) => https://en.wikipedia.org/wiki/Normal_distribution 2) Skewed-normal distribution (propagate:::dsn) => https://en.wikipedia.org/wiki/Skew_normal_distribution 3) Generalized normal distribution (propagate:::dgnorm) => https://en.wikipedia.org/wiki/Generalized_normal_distribution 4) Log-normal distribution (dlnorm) => https://en.wikipedia.org/wiki/Log-normal_distribution 5) Scaled and shifted t-distribution (propagate:::dst) => GUM 2008, Chapter 6.4.9.2. 6) Logistic distribution (dlogis) => https://en.wikipedia.org/wiki/Logistic_distribution 7) Uniform distribution (dunif) => https://en.wikipedia.org/wiki/Uniform_distribution_(continuous) 8) Triangular distribution (propagate:::dtriang) => https://en.wikipedia.org/wiki/Triangular_distribution 9) Trapezoidal distribution (propagate:::dtrap) => https://en.wikipedia.org/wiki/Trapezoidal_distribution 10) Curvilinear Trapezoidal distribution (propagate:::dctrap) => GUM 2008, Chapter 6.4.3.1 11) Gamma distribution (dgamma) => https://en.wikipedia.org/wiki/Gamma_distribution 12) Inverse Gamma distribution (propagate:::dinvgamma) => https://en.wikipedia.org/wiki/Inverse-gamma_distribution 13) Cauchy distribution (dcauchy) => https://en.wikipedia.org/wiki/Cauchy_distribution 14) Laplace distribution (propagate:::dlaplace) => https://en.wikipedia.org/wiki/Laplace_distribution 15) Gumbel distribution (propagate:::dgumbel) => https://en.wikipedia.org/wiki/Gumbel_distribution 16) Johnson SU distribution (propagate:::dJSU) => https://en.wikipedia.org/wiki/Johnson_SU_distribution 17) Johnson SB distribution (propagate:::dJSB) => https://www.mathwave.com/articles/johnson_sb_distribution.html 18) Three-parameter Weibull distribution (propagate:::dweibull2) => https://en.wikipedia.org/wiki/Weibull_distribution 19) Two-parameter beta distribution (dbeta2) => https://en.wikipedia.org/wiki/Beta_distribution#Two_parameters_2 20) Four-parameter beta distribution (propagate:::dbeta2) => https://en.wikipedia.org/wiki/Beta_distribution#Four_parameters_2 21) Arcsine distribution (propagate:::darcsin) => https://en.wikipedia.org/wiki/Arcsine_distribution 22) Von Mises distribution (propagate:::dmises) => https://en.wikipedia.org/wiki/Von_Mises_distribution 23) Inverse Gaussian distribution (propagate:::dinvgauss) => https://en.wikipedia.org/wiki/Inverse_Gaussian_distribution 24) Generalized Extreme Value distribution (propagate:::dgevd) => https://en.wikipedia.org/wiki/Generalized_extreme_value_distribution 25) Rayleigh distribution (propagate:::drayleigh) => https://en.wikipedia.org/wiki/Rayleigh_distribution 26) Chi-square distribution (dchisq) => https://en.wikipedia.org/wiki/Chi-squared_distribution 27) Exponential distribution (dexp) => https://en.wikipedia.org/wiki/Exponential_distribution 28) F-distribution (df) => https://en.wikipedia.org/wiki/F-distribution 29) Burr distribution (dburr) => https://en.wikipedia.org/wiki/Burr_distribution 30) Chi distribution (dchi) => https://en.wikipedia.org/wiki/Chi_distribution 31) Inverse Chi-square distribution (dinvchisq) => https://en.wikipedia.org/wiki/Inverse-chi-squared_distribution 32) Cosine distribution (dcosine) => https://en.wikipedia.org/wiki/Raised_cosine_distribution

All distributions are fitted with a brute force approach, in which the parameter space is extended over three orders of magnitude \((0.1, 1, 10)\times \beta_i\) when brute = "fast", or five orders \((0.01, 0.1, 1, 10, 100)\times \beta_i\) when brute = "slow". Approx. 20-90s are needed to fit for the fast version, depending mainly on the number of bins.

The goodness-of-fit (GOF) is calculated with BIC from the (weighted) log-likelihood of the fit: $$\rm{ln}(L) = 0.5 \cdot \left(-N \cdot \left(\rm{ln}(2\pi) + 1 + \rm{ln}(N) - \sum_{i=1}^n log(w_i) + \rm{ln}\left(\sum_{i=1}^n w_i \cdot x_i^2\right) \right) \right)$$ $$\rm{BIC} = - 2\rm{ln}(L) + (N - k)ln(N)$$ with \(x_i\) = the residuals from the NLS fit, \(N\) = the length of the residual vector, \(k\) = the number of parameters of the fitted model and \(w_i\) = the weights.

In contrast to some other distribution fitting softwares (i.e. Easyfit, Mathwave) that use residual sum-of-squares/Anderson-Darling/Kolmogorov-Smirnov statistics as GOF measures, the application of BIC accounts for increasing number of parameters in the distribution fit and therefore compensates for overfitting. Hence, this approach is more similar to ModelRisk (Vose Software) and as employed in fitdistr of the 'MASS' package. Another application is to identify a possible distribution for the raw data prior to using Monte Carlo simulations from this distribution. However, a decent number of observations should be at hand in order to obtain a realistic estimate of the proper distribution. See 'Examples'. The code for the density functions can be found in file "distr-densities.R".

IMPORTANT: It can be feasible to set weights = TRUE in order to give more weight to bins with low counts. See 'Examples'. ALSO: Distribution fitting is highly sensitive to the number of defined histogram bins, so it is advisable to change this parameter and inspect if the order of fitted distributions remains stable.

References

Continuous univariate distributions, Volume 1. Johnson NL, Kotz S and Balakrishnan N. Wiley Series in Probability and Statistics, 2.ed (2004).

A guide on probability distributions. R-forge distributions core team. http://dutangc.free.fr/pub/prob/probdistr-main.pdf.

Univariate distribution relationships. Leemis LM and McQueston JT. The American Statistician (2008), 62: 45-53.

Examples

Run this code
# NOT RUN {
## Linear example, small error
## => Normal distribution.
EXPR1 <- expression(x + 2 * y)
x <- c(5, 0.01)
y <- c(1, 0.01)
DF1 <- cbind(x, y)
RES1 <- propagate(expr = EXPR1, data = DF1, type = "stat", 
                  do.sim = TRUE, verbose = TRUE)
fitDistr(RES1)

## Ratio example, larger error
## => Gamma distribution.
EXPR2 <- expression(x/2 * y)
x <- c(5, 0.1)
y <- c(1, 0.02)
DF2 <- cbind(x, y)
RES2 <- propagate(expr = EXPR2, data = DF2, type = "stat", 
                  do.sim = TRUE, verbose = TRUE)
fitDistr(RES2)

## Exponential example, large error
## => Log-Normal distribution.
EXPR3 <- expression(x^(2 * y))
x <- c(5, 0.1)
y <- c(1, 0.1)
DF3 <- cbind(x, y)
RES3 <- propagate(expr = EXPR3, data = DF3, type = "stat", 
                  do.sim = TRUE, verbose = TRUE)
fitDistr(RES3)

## Rectangular input distributions result
## in Curvilinear Trapezoidal output distribution.
A <- runif(100000, 20, 25)
B <- runif(100000, 3, 3.5)
DF4 <- cbind(A, B)
EXPR4 <- expression(A + B)
RES4 <- propagate(EXPR4, data = DF4, type = "sim", 
                 use.cov = FALSE, do.sim = TRUE)
fitDistr(RES4)        

## Fitting with 1/counts as weights.
EXPR5 <- expression(x + 2 * y)
x <- c(5, 0.05)
y <- c(1, 0.05)
DF5 <- cbind(x, y)
RES5 <- propagate(expr = EXPR5, data = DF5, type = "stat", 
                  do.sim = TRUE, verbose = TRUE, weights = TRUE)
fitDistr(RES5)

## Using only selected distributions.
EXPR6 <- expression(x + sin(y))
x <- c(5, 0.1)
y <- c(1, 0.2)
DF6 <- cbind(x, y)
RES6 <- propagate(expr = EXPR6, data = DF6, type = "stat", 
                  do.sim = TRUE)
fitDistr(RES6, distsel = c(1:10, 15, 28))
# }

Run the code above in your browser using DataLab