fitdist: Fit of univariate distributions to non-censored data

Description

Fit of univariate distributions to non-censored data by maximum likelihood, moment matching, quantile matching or maximum goodness-of-fit estimation.

Usage

fitdist(data, distr, method = c("mle", "mme", "qme", "mge"), 
    start=NULL, fix.arg=NULL,  ...) 
## S3 method for class 'fitdist':
print(x,...)
## S3 method for class 'fitdist':
plot(x,breaks="default",...)
## S3 method for class 'fitdist':
summary(object,...)

Arguments

data

A numeric vector.

distr

A character string "name" naming a distribution for which the corresponding density function dname, the corresponding distribution function pname and the corresponding quantile function qname

method

A character string coding for the fitting method: "mle" for 'maximum likelihood estimation', "mme" for 'moment matching estimation', "qme" for 'quantile matching estimation' and "mge" for 'max

start

A named list giving the initial values of parameters of the named distribution. This argument may be omitted for some distributions for which reasonable starting values are computed (see Details), and will not be taken into account if a

fix.arg

An optional named list giving the values of parameters of the named distribution that must be kept fixed rather than estimated. This argument cannot be used if method="mme" and a closed formula is used.

x

an object of class 'fitdist'.

object

an object of class 'fitdist'.

breaks

If "default" the histogram is plotted with the function hist with its default breaks definition. Else breaks is passed to the function hist. This argument is not taken into account with discre

...

further arguments to be passed to generic functions, or to one of the functions "mledist", "mmedist", "qmedist" or "mgedist" depending on the chosen method (see the help pages of these fu

Value

fitdist returns an object of class 'fitdist', a list with the following components:

estimate

the parameter estimates

method

the character string coding for the fitting method: "mle" for 'maximum likelihood estimation', "mme" for 'moment matching estimation', "qme" for 'quantile matching estimation' and "mge" for 'maximum goodness-of-fit estimation'

sd

the estimated standard errors, or NULL if not available

cor

the estimated correlation matrix, or NULL if not available

loglik

the log-likelihood

aic

the Akaike information criterion

bic

the so-called BIC or SBC (Schwarz Bayesian criterion)

n

the length of the data set

data

the data set

distname

the name of the distribution

fix.arg

the named list giving the values of parameters of the named distribution that must be kept fixed rather than estimated, or NULL if there are no such parameters

dots

the list of further arguments passed in ... to be used by bootdist in iterative calls to mledist, mmedist, qmedist or mgedist, or NULL if there are no such arguments
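
The aic and bic components can be related to loglik by hand. The following is a minimal base R sketch, assuming fitdist follows the standard definitions AIC = -2 logLik + 2k and BIC = -2 logLik + k log(n), where k is the number of estimated parameters. The helper info_criteria is hypothetical and not part of the package; the exponential model is used only because its maximum likelihood estimate has a closed form.

```r
# Simulated data; the true rate is an arbitrary choice for illustration.
set.seed(1)
x <- rexp(200, rate = 5)

# Closed-form maximum likelihood estimate of the exponential rate.
rate_hat <- 1 / mean(x)

# Log-likelihood evaluated at the estimate.
ll <- sum(dexp(x, rate = rate_hat, log = TRUE))

# Hypothetical helper (not part of the package): information criteria
# from a log-likelihood, the number of estimated parameters k and the
# sample size n, using the standard definitions.
info_criteria <- function(loglik, k, n) {
  c(aic = -2 * loglik + 2 * k,
    bic = -2 * loglik + k * log(n))
}

ic <- info_criteria(ll, k = 1, n = length(x))
print(ic)
```

Lower values indicate a better trade-off between fit and complexity, so the same helper can be applied to several candidate distributions fitted to the same data.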

Details

When method="mle", maximum likelihood estimates of the distribution parameters are computed using the function mledist.

When method="mme", the distribution parameters are estimated by a closed formula for the following distributions: "norm", "lnorm", "pois", "exp", "gamma", "nbinom", "geom", "beta", "unif" and "logis". For distributions characterized by one parameter ("geom", "pois" and "exp"), this parameter is estimated by matching the theoretical and observed means; for distributions characterized by two parameters, the parameters are estimated by matching the theoretical and observed means and variances (Vose, 2000). For other distributions, the theoretical and empirical moments are matched numerically, by minimizing the sum of squared differences between observed and theoretical moments. In this last case, further arguments are needed in the call to fitdist: order and memp (see mmedist for details).

When method="qme", the function carries out the quantile matching numerically, by minimizing the sum of squared differences between observed and theoretical quantiles. This method requires an additional argument probs, a numeric vector of the probabilities for which the quantile matching is done, of length equal to the number of parameters to estimate (see qmedist for details).

When method="mge", the distribution parameters are estimated by maximizing goodness-of-fit (that is, minimizing a goodness-of-fit distance). This method requires an additional argument gof coding for the chosen goodness-of-fit distance. One may use the classical Cramer-von Mises distance ("CvM"), the classical Kolmogorov-Smirnov distance ("KS"), the classical Anderson-Darling distance ("AD"), which gives more weight to the tails of the distribution, or one of the variants of this last distance proposed by Luceno (2006) (see mgedist for more details). This method is not suitable for discrete distributions.

By default, direct optimization of the log-likelihood (or of another criterion, depending on the chosen method) is performed using optim, with the "Nelder-Mead" method for distributions characterized by more than one parameter and the "BFGS" method for distributions characterized by only one parameter. The method used by optim, or a different optimization function altogether, may be selected through the ... argument (see mledist for details).

For the following named distributions, reasonable starting values will be computed if start is omitted: "norm", "lnorm", "exp", "pois", "cauchy", "gamma", "logis", "nbinom" (parametrized by mu and size), "geom", "beta" and "weibull". Note that these starting values may not be good enough if the fit is poor. The function is not able to fit a uniform distribution.

With the parameter estimates, the function returns the log-likelihood, whatever the estimation method, and, for maximum likelihood estimation, the standard errors of the estimates calculated from the Hessian at the solution found by optim or by the user-supplied function passed to mledist.

NB: if your data values are particularly small or large, rescaling may be needed before the optimization process. See example (14).

The plot of an object of class "fitdist" returned by fitdist uses the function plotdist. An object of class "fitdist", or a list of objects of class "fitdist" corresponding to various fits of the same data set, may also be plotted as cumulative distributions using the function cdfcomp.
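
The closed-form moment matching described above for two-parameter distributions can be illustrated with the gamma distribution, whose mean is shape/rate and whose variance is shape/rate^2. The following is a minimal base R sketch, not the package's implementation: the helper mme_gamma is hypothetical, and the use of the variance with denominator n (rather than n - 1) is an assumption.

```r
# Hypothetical helper (not part of the package): closed-form moment
# matching for a gamma distribution parametrized by shape and rate.
# Solving  mean = shape / rate  and  variance = shape / rate^2  gives
#   rate  = mean / variance
#   shape = mean^2 / variance
mme_gamma <- function(x) {
  m <- mean(x)
  v <- mean((x - m)^2)  # assumption: empirical variance with denominator n
  c(shape = m^2 / v, rate = m / v)
}

# Small illustrative data set: mean 5 and variance 5 (denominator n).
x <- c(2, 4, 6, 8)
est <- mme_gamma(x)
print(est)

# The fitted distribution reproduces the observed mean and variance.
fitted_mean <- unname(est["shape"] / est["rate"])
fitted_var  <- unname(est["shape"] / est["rate"]^2)
```

For distributions without such a closed formula, the same matching has to be done numerically, which is why order and memp must then be supplied.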

References

Cullen AC and Frey HC (1999) Probabilistic techniques in exposure assessment. Plenum Press, USA, pp. 81-155.

Venables WN and Ripley BD (2002) Modern applied statistics with S. Springer, New York, pp. 435-446.

Vose D (2000) Risk analysis, a quantitative guide. John Wiley & Sons Ltd, Chichester, England, pp. 99-143.

Examples

# (1) basic fit of a gamma distribution by maximum likelihood estimation
#
data(groundbeef)
serving <- groundbeef$serving
fitg <- fitdist(serving,"gamma")
summary(fitg)
plot(fitg)
cdfcomp(fitg,addlegend=FALSE)


# (2) use the moment matching estimation (using a closed formula)
#

fitgmme <- fitdist(serving,"gamma",method="mme")
summary(fitgmme)

# (3) fit and comparison of various fits
#
fitW <- fitdist(serving,"weibull")
fitg <- fitdist(serving,"gamma")
fitln <- fitdist(serving,"lnorm")
summary(fitW)
summary(fitg)
summary(fitln)
cdfcomp(list(fitW,fitg,fitln),legendtext=c("Weibull","gamma","lognormal"))

# (4) defining your own distribution functions, here for the Gumbel distribution
# for other distributions, see the CRAN task view 
# dedicated to probability distributions
#
dgumbel <- function(x,a,b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b))
pgumbel <- function(q,a,b) exp(-exp((a-q)/b))
qgumbel <- function(p,a,b) a-b*log(-log(p))

fitgumbel <- fitdist(serving,"gumbel",start=list(a=10,b=10))
summary(fitgumbel)
plot(fitgumbel)

# (5) fit a discrete distribution (Poisson)
#

x2<-c(rep(4,1),rep(2,3),rep(1,7),rep(0,12))
f2<-fitdist(x2,"pois")
plot(f2)
summary(f2)
gofstat(f2)

# (6) how to change the optimisation method?
#

fitdist(serving,"gamma",optim.method="Nelder-Mead")
fitdist(serving,"gamma",optim.method="BFGS") 
fitdist(serving,"gamma",optim.method="SANN")

# (7) custom optimization function
#

#create the sample
mysample <- rexp(100, 5)
mystart <- 8

res1 <- fitdist(mysample, dexp, start= mystart, optim.method="Nelder-Mead")

#show the result
summary(res1)

#the warning tells us to use optimize, because Nelder-Mead is not adequate
#for one-dimensional problems.

#to meet the standard 'fn' argument and specific name arguments, we wrap optimize,
myoptimize <- function(fn, par, ...) 
{
    res <- optimize(f=fn, ..., maximum=FALSE)  
    #assume the optimization function minimizes
    
    standardres <- c(res, convergence=0, value=res$objective, 
        par=res$minimum, hessian=NA)
    
    return(standardres)
}

#call fitdist with a 'custom' optimization function
res2 <- fitdist(mysample, dexp, start=mystart, custom.optim=myoptimize, 
    interval=c(0, 100))

#show the result
summary(res2)


# (8) custom optimization function - another example with the genetic algorithm
#
#set a sample
    x1 <- c(6.4, 13.3, 4.1, 1.3, 14.1, 10.6, 9.9, 9.6, 15.3, 22.1,
         13.4, 13.2, 8.4, 6.3, 8.9, 5.2, 10.9, 14.4) 
    fit1 <- fitdist(x1, "gamma")
    summary(fit1)

    #wrap the genoud function from the rgenoud package
    mygenoud <- function(fn, par, ...) 
    {
        require(rgenoud)
        res <- genoud(fn, starting.values=par, ...)        
        standardres <- c(res, convergence=0)
            
        return(standardres)
    }

    #call fitdist with a 'custom' optimization function
    fit2 <- fitdist(x1, "gamma", custom.optim=mygenoud, nvars=2,    
        Domains=cbind(c(0,0), c(10, 10)), boundary.enforcement=1, 
        print.level=1, hessian=TRUE)

    summary(fit2)

# (9) estimation of the standard deviation of a normal distribution 
# by maximum likelihood with the mean fixed at 10 using the argument fix.arg
#
x1 <- c(6.4, 13.3, 4.1, 1.3, 14.1, 10.6, 9.9, 9.6, 15.3, 22.1,
         13.4, 13.2, 8.4, 6.3, 8.9, 5.2, 10.9, 14.4) 
fitdist(x1,"norm",start=list(sd=5),fix.arg=list(mean=10))

# (10) fit of a Weibull distribution to serving size data 
# by maximum likelihood estimation
# or by quantile matching estimation (in this example 
# matching first and third quartiles)
#
data(groundbeef)
serving <- groundbeef$serving

fWmle <- fitdist(serving,"weibull")
summary(fWmle)
plot(fWmle)
gofstat(fWmle)

fWqme <- fitdist(serving,"weibull",method="qme",probs=c(0.25,0.75))
summary(fWqme)
plot(fWqme)
gofstat(fWqme)


# (11) Fit of a Pareto distribution by numerical moment matching estimation
#
require(actuar)
    #simulate a sample
    x4 <- rpareto(1000, 6, 2)

    #empirical raw moment
    memp <- function(x, order)
        ifelse(order == 1, mean(x), sum(x^order)/length(x))


    #fit
    fP <- fitdist(x4, "pareto", method="mme",order=c(1, 2), memp="memp", 
    start=c(10, 10), lower=1, upper=Inf)
    summary(fP)

# (12) Fit of a Weibull distribution to serving size data by maximum 
# goodness-of-fit estimation using all the distances available
# 

data(groundbeef)
serving <- groundbeef$serving
(f1 <- fitdist(serving,"weibull",method="mge",gof="CvM"))
(f2 <- fitdist(serving,"weibull",method="mge",gof="KS"))
(f3 <- fitdist(serving,"weibull",method="mge",gof="AD"))
(f4 <- fitdist(serving,"weibull",method="mge",gof="ADR"))
(f5 <- fitdist(serving,"weibull",method="mge",gof="ADL"))
(f6 <- fitdist(serving,"weibull",method="mge",gof="AD2R"))
(f7 <- fitdist(serving,"weibull",method="mge",gof="AD2L"))
(f8 <- fitdist(serving,"weibull",method="mge",gof="AD2"))
cdfcomp(list(f1,f2,f3,f4,f5,f6,f7,f8))
cdfcomp(list(f1,f2,f3,f4,f5,f6,f7,f8),xlogscale=TRUE,xlim=c(8,250),verticals=TRUE)

# (13) Fit of a uniform distribution using Cramer-von Mises or
# Kolmogorov-Smirnov distance
# 

u <- runif(50,min=5,max=10)

fuCvM <- fitdist(u,"unif",method="mge",gof="CvM")
summary(fuCvM)
plot(fuCvM)
gofstat(fuCvM)

fuKS <- fitdist(u,"unif",method="mge",gof="KS")
summary(fuKS)
plot(fuKS)
gofstat(fuKS)

# (14) scaling problem
#

x <- c(-0.00707717, -0.000947418, -0.00189753, 
-0.000474947, -0.00190205, -0.000476077, 0.00237812, 0.000949668, 
0.000474496, 0.00284226, -0.000473149, -0.000473373, 0, 0, 0.00283688, 
-0.0037843, -0.0047506, -0.00238379, -0.00286807, 0.000478583, 
0.000478354, -0.00143575, 0.00143575, 0.00238835, 0.0042847, 
0.00237248, -0.00142281, -0.00142484, 0, 0.00142484, 0.000948767, 
0.00378609, -0.000472478, 0.000472478, -0.0014181, 0, -0.000946522, 
-0.00284495, 0, 0.00331832, 0.00283554, 0.00141476, -0.00141476, 
-0.00188947, 0.00141743, -0.00236351, 0.00236351, 0.00235794, 
0.00235239, -0.000940292, -0.0014121, -0.00283019, 0.000472255, 
0.000472032, 0.000471809, -0.0014161, 0.0014161, -0.000943842, 
0.000472032, -0.000944287, -0.00094518, -0.00189304, -0.000473821, 
-0.000474046, 0.00331361, -0.000472701, -0.000946074, 0.00141878, 
-0.000945627, -0.00189394, -0.00189753, -0.0057143, -0.00143369, 
-0.00383326, 0.00143919, 0.000479272, -0.00191847, -0.000480192, 
0.000960154, 0.000479731, 0, 0.000479501, 0.000958313, -0.00383878, 
-0.00240674, 0.000963391, 0.000962464, -0.00192586, 0.000481812, 
-0.00241138, -0.00144963)

for(i in 6:0)
    cat(i, try(fitdist(x*10^i, "cauchy", method="mle")$estimate, silent=TRUE), "")
