fitdistcens: Fitting of univariate distributions to censored data

Description

Fits a univariate distribution to censored data by maximum likelihood.

Usage

fitdistcens(censdata, distr, start=NULL, fix.arg=NULL, 
  keepdata = TRUE, keepdata.nb=100, …)
# S3 method for fitdistcens
print(x, …)
# S3 method for fitdistcens
plot(x, …)
# S3 method for fitdistcens
summary(object, …)
# S3 method for fitdistcens
logLik(object, …)
# S3 method for fitdistcens
vcov(object, …)
# S3 method for fitdistcens
coef(object, …)

Arguments

censdata

A data frame with two columns, named left and right, describing each observed value as an interval. The left column contains either NA for left-censored observations, the left bound of the interval for interval-censored observations, or the observed value for non-censored observations. The right column contains either NA for right-censored observations, the right bound of the interval for interval-censored observations, or the observed value for non-censored observations.

distr

A character string "name" naming a distribution, for which the corresponding density function dname and the corresponding distribution function pname must be defined, or directly the density function.

start

A named list giving the initial values of parameters of the named distribution. This argument may be omitted for some distributions for which reasonable starting values are computed (see the 'details' section of mledist).

fix.arg

An optional named list giving the values of parameters of the named distribution that must be kept fixed rather than estimated by maximum likelihood.

x

an object of class "fitdistcens".

object

an object of class "fitdistcens".

keepdata

a logical. If TRUE, the dataset is returned; otherwise only a sample subset is returned.

keepdata.nb

When keepdata=FALSE, the length of the subset returned.

…

further arguments to be passed to generic functions, to the function plotdistcens in order to control the type of ecdf-plot used for censored data, or to the function mledist in order to control the optimization method.
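
The encoding expected for censdata can be sketched in base R as follows. This is only an illustration: the values and the names censdata_example and kind are hypothetical, and the classification step is not part of the package.

```r
# Hypothetical values illustrating the four kinds of observations
censdata_example <- data.frame(
  left  = c(NA,  2.0, 1.5, 3.2),
  right = c(1.0, NA,  2.5, 3.2)
)
# Row 1: left censored     (true value <= 1.0)
# Row 2: right censored    (true value >= 2.0)
# Row 3: interval censored (true value between 1.5 and 2.5)
# Row 4: non-censored      (value observed as 3.2)

# Classify each row from the NA pattern and the bounds (illustrative helper)
kind <- with(censdata_example, ifelse(
  is.na(left), "left",
  ifelse(is.na(right), "right",
         ifelse(left == right, "exact", "interval"))
))
print(cbind(censdata_example, kind))
```

A data frame of this shape is what would be passed as the first argument of fitdistcens.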

Value

fitdistcens returns an object of class "fitdistcens", a list with the following components:

estimate

the parameter estimates.

method

the character string coding for the fitting method: only "mle" for 'maximum likelihood estimation'.

sd

the estimated standard errors.

cor

the estimated correlation matrix, NA if numerically not computable or NULL if not available.

vcov

the estimated variance-covariance matrix, NULL if not available.

loglik

the log-likelihood.

aic

the Akaike information criterion.

bic

the so-called BIC or SBC (Schwarz Bayesian criterion).

censdata

the censored data set.

distname

the name of the distribution.

fix.arg

the named list giving the values of parameters of the named distribution that must be kept fixed rather than estimated by maximum likelihood or NULL if there are no such parameters.

fix.arg.fun

the function used to set the value of fix.arg or NULL.

dots

the list of further arguments passed in … to be used in bootdistcens to control the optimization method used in iterative calls to mledist or NULL if no such arguments.

convergence

an integer code for the convergence of optim/constrOptim defined as below or defined by the user in the user-supplied optimization function. 0 indicates successful convergence. 1 indicates that the iteration limit of optim has been reached. 10 indicates degeneracy of the Nelder-Mead simplex. 100 indicates that optim encountered an internal error.

discrete

always FALSE.

weights

the vector of weights used in the estimation process or NULL.
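
The aic and bic components are derived from loglik. As a reminder of the standard definitions, here is a minimal sketch with hypothetical numbers, where k is the number of estimated parameters and n the number of observations. The variable names are illustrative, and the snippet assumes the usual textbook formulas rather than quoting the package source.

```r
# Hypothetical values, chosen only for illustration
loglik <- -120.5   # maximized log-likelihood
k      <- 2        # number of estimated parameters
n      <- 100      # number of observations

aic <- -2 * loglik + 2 * k        # Akaike information criterion
bic <- -2 * loglik + k * log(n)   # Schwarz Bayesian criterion

print(c(aic = aic, bic = bic))
```

Lower values indicate a better trade-off between fit and number of parameters when comparing distributions fitted to the same data.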

Generic functions:

print

The print of a "fitdistcens" object shows a few traces of the fitting method and the fitted distribution.

summary

The summary provides the parameter estimates of the fitted distribution, the log-likelihood, AIC and BIC statistics, the standard errors of the parameter estimates and the correlation matrix between parameter estimates.

plot

The plot of an object of class "fitdistcens" returned by fitdistcens uses the function plotdistcens.

logLik

Extracts the estimated log-likelihood from the "fitdistcens" object.

vcov

Extracts the estimated variance-covariance matrix from the "fitdistcens" object (only available when method = "mle").

coef

Extracts the fitted coefficients from the "fitdistcens" object.

Details

Maximum likelihood estimates of the distribution parameters are computed using the function mledist. By default, direct optimization of the log-likelihood is performed using optim, with the "Nelder-Mead" method for distributions characterized by more than one parameter and the "BFGS" method for distributions characterized by only one parameter. The algorithm used in optim can be chosen, or another optimization function can be specified, using the … argument (see mledist for details). start may be omitted (i.e. NULL) for some classic distributions (see the 'details' section of mledist). Note that when errors are raised by optim, it is a good idea to start by tracing the optimization process, adding control=list(trace=1, REPORT=1) in the … argument.
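
To make the idea of directly optimizing a censored-data log-likelihood concrete, here is a minimal base R sketch for a normal distribution. It illustrates the general technique only and is not the code used by mledist: the toy data, the function name negloglik_cens, and the choice to optimize log(sd) are all assumptions made for this example.

```r
# Toy censored observations (hypothetical values), encoded as in censdata
left  <- c(NA,  1.2, 0.8, 1.5, 2.1, 1.0, 1.9, 2.5)
right <- c(0.5, 1.2, 1.4, 1.5, NA,  1.6, 1.9, NA)

# Negative log-likelihood of a normal distribution under censoring.
# par = c(mean, log(sd)); the log scale keeps sd positive during the search.
negloglik_cens <- function(par, left, right) {
  mu    <- par[1]
  sigma <- exp(par[2])
  lcens    <- is.na(left)              # left censored: P(X <= right)
  rcens    <- is.na(right)             # right censored: P(X > left)
  both     <- !lcens & !rcens
  exact    <- both & left == right     # non-censored: density
  interval <- both & left < right      # interval censored: P(left < X <= right)
  ll <- sum(dnorm(left[exact], mu, sigma, log = TRUE)) +
    sum(log(pnorm(right[interval], mu, sigma) -
              pnorm(left[interval], mu, sigma))) +
    sum(pnorm(right[lcens], mu, sigma, log.p = TRUE)) +
    sum(pnorm(left[rcens], mu, sigma, lower.tail = FALSE, log.p = TRUE))
  -ll
}

# Two parameters, so Nelder-Mead is used, as in the default described above
fit <- optim(c(mean = 1.5, log_sd = 0), negloglik_cens,
             left = left, right = right, method = "Nelder-Mead")

estimates <- c(mean = unname(fit$par[1]), sd = unname(exp(fit$par[2])))
print(estimates)
print(fit$convergence)   # 0 indicates successful convergence
```

Each kind of observation contributes a different term: the density for exact values, a difference of distribution functions for intervals, and a tail probability for left- or right-censored values.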

The function is not able to fit a uniform distribution. Along with the parameter estimates, the function returns the log-likelihood and the standard errors of the estimates, calculated from the Hessian at the solution found by optim or by the user-supplied function passed to mledist.
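
How standard errors follow from the Hessian can be sketched on a one-parameter case. The base R example below fits an exponential rate to hypothetical right-censored data with optim and the "BFGS" method. It is an illustration under stated assumptions, not the package's implementation: the data, the name negloglik, and the log-rate parametrization (with a delta-method step back to the rate scale) are choices made for this example.

```r
# Hypothetical data: four exact times and two observations right censored at 3
left  <- c(0.5, 1.2, 2.0, 0.3, 3.0, 3.0)
right <- c(0.5, 1.2, 2.0, 0.3, NA,  NA)

# Negative log-likelihood in terms of log(rate), which keeps the rate positive
negloglik <- function(log_rate) {
  rate  <- exp(log_rate)
  exact <- !is.na(right)
  -(sum(dexp(left[exact], rate, log = TRUE)) +
      sum(pexp(left[!exact], rate, lower.tail = FALSE, log.p = TRUE)))
}

# One parameter, so BFGS is used; hessian = TRUE returns the Hessian at the optimum
fit <- optim(par = 0, fn = negloglik, method = "BFGS", hessian = TRUE)

rate_hat    <- exp(fit$par)
se_log_rate <- sqrt(1 / fit$hessian[1, 1])   # inverse Hessian of the negative log-likelihood
se_rate     <- rate_hat * se_log_rate        # delta method back to the rate scale

print(c(rate = rate_hat, se = se_rate))
```

For this model the maximum likelihood estimate has a closed form, the number of exact observations divided by the total observed time (4 / 10 here), which gives a way to check the numerical result.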

By default (keepdata = TRUE), the object returned by fitdistcens contains the dataset given in input. When dealing with large datasets, the original dataset can be removed from the output by setting keepdata = FALSE. In that case, at most keepdata.nb points are kept, obtained by randomly subsampling keepdata.nb-4 points from the dataset and adding the component-wise minimum and maximum. If combined with bootdistcens, be aware that the bootstrap is performed on the subset randomly selected in fitdistcens. Currently, the graphical comparison of multiple fits is not available in this framework.

A weighted version of the estimation process is available for method = "mle" by using weights=…. See the corresponding man page for details. It is not yet possible to take weights into account in the functions plotdistcens, plot.fitdistcens and cdfcompcens (developments planned in the future).

References

Venables WN and Ripley BD (2002), Modern applied statistics with S. Springer, New York, pp. 435-446.

Delignette-Muller ML and Dutang C (2015), fitdistrplus: An R Package for Fitting Distributions. Journal of Statistical Software, 64(4), 1-34.

Examples

# NOT RUN {
# (1) Fit of a lognormal distribution to bacterial contamination data
#
data(smokedfish)
fitsf  <-  fitdistcens(smokedfish,"lnorm")
summary(fitsf)
# default plot using the Wang technique (see ?plotdistcens for details)
plot(fitsf)
# plot using the Turnbull algorithm (see ?plotdistcens for details)
# with confidence intervals for the empirical distribution
plot(fitsf, NPMLE = TRUE, NPMLE.method = "Turnbull", Turnbull.confint = TRUE)
# basic plot using intervals and points (see ?plotdistcens for details)
plot(fitsf, NPMLE = FALSE)
# plot of the same fit using the Turnbull algorithm in logscale
cdfcompcens(fitsf,main="bacterial contamination fits",
    xlab="bacterial concentration (CFU/g)",ylab="F",
    addlegend = FALSE,lines01 = TRUE, xlogscale = TRUE, xlim = c(1e-2,1e2))
# zoom on large values of F
cdfcompcens(fitsf,main="bacterial contamination fits",
    xlab="bacterial concentration (CFU/g)",ylab="F",
    addlegend = FALSE,lines01 = TRUE, xlogscale = TRUE, 
    xlim = c(1e-2,1e2),ylim=c(0.4,1))

# (2) Fit of a normal distribution on acute toxicity values 
# of fluazinam (in decimal logarithm) for
# macroinvertebrates and zooplankton, using maximum likelihood estimation
# to estimate what is called a species sensitivity distribution 
# (SSD) in ecotoxicology
#

data(fluazinam)
log10EC50 <- log10(fluazinam)
fln <- fitdistcens(log10EC50,"norm")
fln
summary(fln)
plot(fln)

# (3) defining your own distribution functions, here for the Gumbel distribution
# for other distributions, see the CRAN task view dedicated to 
# probability distributions
#

dgumbel  <-  function(x,a,b) 1/b*exp((a-x)/b)*exp(-exp((a-x)/b))
pgumbel  <-  function(q,a,b) exp(-exp((a-q)/b))
qgumbel  <-  function(p,a,b) a-b*log(-log(p))
fg <- fitdistcens(log10EC50,"gumbel",start=list(a=1,b=1))
summary(fg)
plot(fg)

# (4) comparison of fits of various distributions
# 

fll <- fitdistcens(log10EC50,"logis")
summary(fll)

cdfcompcens(list(fln,fll,fg),legendtext=c("normal","logistic","gumbel"),
xlab = "log10(EC50)")

# (5) how to change the optimisation method?
#

fitdistcens(log10EC50,"logis",optim.method="Nelder-Mead")
fitdistcens(log10EC50,"logis",optim.method="BFGS") 
fitdistcens(log10EC50,"logis",optim.method="SANN") 

# (6) custom optimisation function - example with the genetic algorithm
#
# }
# NOT RUN {
    # wrap the genoud function from the rgenoud package
    mygenoud  <-  function(fn, par, ...) 
    {
        require(rgenoud)
        res  <-  genoud(fn, starting.values=par, ...)        
        standardres  <-  c(res, convergence=0)
            
        return(standardres)
    }

    # call fitdistcens with a 'custom' optimization function
    fit.with.genoud <- fitdistcens(log10EC50,"logis", custom.optim=mygenoud, nvars=2,    
        Domains=cbind(c(0,0), c(5, 5)), boundary.enforcement=1, 
        print.level=1, hessian=TRUE)

    summary(fit.with.genoud)
# }
# NOT RUN {
# (7) estimation of the mean of a normal distribution 
# by maximum likelihood with the standard deviation fixed at 1 using the argument fix.arg
#
flnb <- fitdistcens(log10EC50, "norm", start = list(mean = 1),fix.arg = list(sd = 1))

# (8) Fit of a lognormal distribution on acute toxicity values of fluazinam for
# macroinvertebrates and zooplankton (fitted here as a normal distribution on the
# log10-transformed values), using maximum likelihood estimation
# to estimate what is called a species sensitivity distribution 
# (SSD) in ecotoxicology, followed by estimation of the 5 percent quantile value of 
# the fitted distribution (which is called the 5 percent hazardous concentration, HC5,
# in ecotoxicology) and estimation of other quantiles.

data(fluazinam)
log10EC50 <- log10(fluazinam)
fln <- fitdistcens(log10EC50,"norm")

quantile(fln, probs = 0.05)
quantile(fln, probs = c(0.05, 0.1, 0.2))

# (9) Fit of a lognormal distribution on 72-hour acute salinity tolerance (LC50 values)
# of riverine macro-invertebrates (fitted here as a normal distribution on the
# log10-transformed values) using maximum likelihood estimation

data(salinity)
log10LC50 <- log10(salinity)
fln <- fitdistcens(log10LC50,"norm")
plot(fln)

# }
