logcon: Compute Log-Concave MLE Based on Censored or Exact Data

Description

Based on independent intervals $X_i = [L_i,R_i]$, where $-\infty < L_i \leq R_i \leq \infty$, compute the maximum likelihood estimator of a (sub)probability density $\phi$ and the remaining mass $p_0$ at infinity (also known as cure parameter) under the assumption that the former is log-concave. Computation is based on an EM algorithm. For further information see Duembgen, Rufibach, and Schuhmacher (2013, preprint).

Usage

logcon(x, adapt.p0=FALSE, p0=0, knot.prec=IQR(x[xlogConCens(x, adapt.p0=FALSE, p0=0, knot.prec=IQR(x[x
logconcure(x, p0=0, knot.prec=IQR(x[x

Value

An object of class lcdensity for which reasonable plot, print, and summary methods are available.

If the argument x is a two-column matrix (censored data case), such an object has the following components.

basedOn: the string "censored" for the type of data the solution is based on.
status: currently only 0 if the algorithm converged; and 1 otherwise. Note that in most cases even with status 1 the returned solution is very close to the truth. The 1 is often due to the fact that the termination criterion is not so well balanced yet.
x: the data entered.
tau: the ordered vector of different interval endpoints.
domind1, domind2: the indices of the tau-element at which the domain of the MLE $\phi$ starts/ends.
tplus: the grid vector. tau[domind1:domind2] augmented by points of subdivision.
isKnot: 0-1 value. For the finite elements of tplus a 1 if $\phi$ has a knot at this position, 0 otherwise.
phi: the vector of $\phi$-values at the finite elements of tplus.
phislr: if $\sup({\rm dom}(\phi)) = \infty$, the slope of $\phi$ after the last knot. Otherwise $-\infty$.
phislr.range: a vector of length 2 specifying a range of possible values for phislr. This is for the (rather rare) situations that mass may be shifted between the interval from the rightmost tau-point to infinity and the cure parameter without changing the likelihood. Otherwise phislr.range is NA.
cure: the cure parameter. Either the original argument p0 if adapt.p0 was FALSE, otheriwse the estimated cure parameter obtained by the alternating maximization procedure.
cure.range: a vector of length 2 specifying a range of possible values for cure or NA. See phislr.range.
Fhat: the vector of values of the distribution function $F$ of $\exp \circ \, \phi$ at the finite elements of tplus.
Fhatfin: the computed value of $\lim_{t \to \infty} F(t)$.

Arguments

x: a two-column matrix of $n \geq 2$ rows containing the data intervals, or a vector of length $n \geq 2$ containing the exact data points.
adapt.p0: logical. Should the algorithm be allowed to adapt $p_0$? In this case an alternating maximization procedure is used that is believed to always yield a joint maximizer $(\hat{\phi},\hat{p_0})$. For the much slower (but maybe safer) profile likelihood maximization method, see the function cure.profile.
p0: a number from 0 to 1 specifying the mass at infinity. If the algorithm is allowed to adapt $p_0$, this argument only specifies the starting value. Otherwise it is assumed that the true cure parameter $p_0$ is equal to this number. In particular, for the default setting of 0, a proper probability density $\phi$ is estimated.
knot.prec: the maximal distance between two consecutive grid points, where knots (points at which the resulting log-subdensity $\phi$ may change slope) can be positioned. See details.
reduce: logical. Should the domain of the (sub)density be reduced whenever the mass at the left or the right boundary becomes too small?
control: a list of control parameters for the more technical aspects of the algorithm; usually the result of a call to lc.control.

Author

Dominic Schuhmacher dominic.schuhmacher@mathematik.uni-goettingen.de
Kaspar Rufibach kaspar.rufibach@gmail.com
Lutz Duembgen duembgen@stat.unibe.ch

Details

Based on the data intervals $X_i = [L_i,R_i]$ described above, function logcon computes a concave, piecewise linear function $\phi$ and a probability $p_0$ which satisfy $\int \exp \phi(x) \, dx = 1-p_0$ and jointly maximize the (normalized) log-likelihood.

$$\ell(\phi, p_0) = \frac{1}{n} \sum_{i=1}^n \biggl[ 1\{L_i = R_i\} \phi(X_i) + 1\{L_i < R_i\} \log \biggl( \int_{L_i}^{R_i} \exp \phi(x) \; dx + 1\{R_i = \infty\} p_0 \biggr) \ \biggr],$$

If x is a two-column matrix, it is assumed to contain the left and right interval endpoints in the correct order. Intervals may have length zero (both endpoints equal) or be unbounded to the right (right endpoint is Inf). Computation is based on an EM algorithm, where the M-step uses an active set algorithm for computing the log-concave MLE for exact data with weights. The active set algorithm was described in Duembgen, Huesler, and Rufibach (2007) and Duembgen and Rufibach (2011) and is available in the R package logcondens. It has been re-implemented in C for the current package because of speed requirements. The whole algorithm for censored data has been indicated in Duembgen, Huesler, and Rufibach (2007) and was elaborated in Duembgen, Schuhmacher, and Rufibach (2013, preprint).

If x is a vector argument, it is assumed to contain the exact data points. In this case the active set algorithm is accessed directly.

In order to obtain a finite dimensional optimization problem the (supposed) domain of $\phi$ is subdivided by a grid. Stretches between interval endpoints where for theoretical reasons no knots (points where the slope of $\phi$ changes) can lie are left out. The argument kink.prec gives the maximal distance we allow between consecutive grid points in stretches where knots can lie. Say plotint(x) to see the grid.

The EM algorithm works only for fixed dimensionality of the problem, but the domain of the function $\phi$ is not a priori known. Therefore there is an outer loop starting with the largest possible domain, given by the minimal and maximal endpoints of all the intervals, and decreasing the domain as soon as the EM steps let $\phi$ become very small towards the boundary. “Very small” means that the integral of $\exp \circ \, \phi$ over the first or last stretch between interval endpoints within the current domain falls below a certain threshold red.thresh, which can be controlled via lc.control.

Domain reduction tends to be rather conservative. If the computed solution has a suspiciously steep slope at any of the domain boundaries, the recommended strategy is to enforce a smaller domain by increasing the parameters domind1l and/or domind2r via lc.control. The function loglike may be used to compare the (normalized) log-likelihoods of the results.

logConCens is an alias for logcon. It is introduced to provide unified naming with the main functions in the packages logcondens and logcondiscr.

logconcure is the same as logcon with adapt.p0 = TRUE fixed.

References

Duembgen, L., Huesler, A., and Rufibach, K. (2007). Active set and EM algorithms for log-concave densities based on complete and censored data. Technical Report 61. IMSV, University of Bern. https://arxiv.org/abs/0707.4643

Duembgen, L. and Rufibach, K., (2011). logcondens: Computations Related to Univariate Log-Concave Density Estimation. Journal of Statistical Software, 39(6), 1-28. tools:::Rd_expr_doi("10.18637/jss.v039.i06")

Duembgen, L., Rufibach, K., and Schuhmacher, D. (2014). Maximum-likelihood estimation of a log-concave density based on censored data. Electronic Journal of Statistics, 8(1), 1405-1437. tools:::Rd_expr_doi("10.1214/14-EJS930")