GeoHiSSE: Hidden Geographic State Speciation and Extinction

Description

Sets up and executes a GeoHiSSE model (Hidden Geographic State Speciation and Extinction) on a very large phylogeny and character distribution.

Usage

GeoHiSSE(phy, data, f=c(1,1,1), turnover=c(1,2,3), eps=c(1,2), 
hidden.states=FALSE, trans.rate=NULL, assume.cladogenetic=TRUE, 
condition.on.survival=TRUE, root.type="madfitz", root.p=NULL, sann=TRUE,
sann.its=1000, bounded.search=TRUE,  max.tol=.Machine$double.eps^.50,
mag.san.start=0.5, starting.vals=NULL, turnover.upper=1000, 
eps.upper=3, trans.upper=100, restart.obj=NULL, ode.eps=0, dt.threads=1)

Value

GeoHiSSE returns an object of class geohisse.fit. This is a list with elements:

$loglik: the maximum negative log-likelihood.
$AIC: Akaike information criterion.
$AICc: Akaike information criterion corrected for sample-size.
$solution: a matrix containing the maximum likelihood estimates of the model parameters.
$index.par: an index matrix of the parameters being estimated.
$f: user-supplied sampling frequencies.
$hidden.states: a logical indicating whether hidden states were included in the model.
$assume.cladogenetic: a logical indicating whether cladogenetic events were allowed at nodes.
$condition.on.survival: a logical indicating whether the likelihood was conditioned on the survival of two lineages and the speciation event subtending them.
$root.type: indicates the user-specified root prior assumption.
$root.p: indicates whether the user specified fixed root probabilities.
$phy: user-supplied tree
$data: user-supplied dataset
$trans.matrix: the user-supplied transition matrix
$max.tol: relative optimization tolerance.
$starting.vals: The starting values for the optimization.
$upper.bounds: the vector of upper limits to the optimization search.
$lower.bounds: the vector of lower limits to the optimization search.
$ode.eps: The ode.eps value used for the estimation.
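
The $AIC and $AICc elements are what one would typically compare across competing fits. As a minimal sketch, Akaike weights can be computed from a vector of such scores. Note that akaike_weights is a hypothetical helper, not part of hisse, and the AIC values are made-up placeholders rather than results.

```r
# Hypothetical helper (not part of hisse): Akaike weights from AIC scores.
akaike_weights <- function(aic) {
  delta <- aic - min(aic)        # difference from the best-scoring model
  rel.lik <- exp(-0.5 * delta)   # relative likelihood of each model
  rel.lik / sum(rel.lik)         # normalise so the weights sum to 1
}

# Placeholder AIC values for illustration only. With real fits, something like
# sapply(list.of.fits, function(x) x$AIC) would supply this vector.
aic <- c(geosse = 210.4, geohisse = 204.1, null = 215.0)
w <- akaike_weights(aic)
print(round(w, 3))
```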

Arguments

phy: a phylogenetic tree, in ape “phylo” format and with internal nodes labeled denoting the ancestral selective regimes.
data: a matrix (or dataframe) with two columns containing species information. First column has the species names and second column has area codes. Values for the areas need to be 0, 1, or 2, where 0 is the widespread area '01', 1 is endemic area '00' and 2 is endemic area '11'. See 'Details'.
f: vector of length 3 with the estimated proportion of extant species in state 1 (area '00'), state 2 (area '11'), and state 0 (widespread area '01') that are included in the phylogeny. A value of c(0.25, 0.25, 0.5) means that 25 percent of species in areas '00' and '11' and 50 percent of species in area '01' are included in the phylogeny. By default all species are assumed to be sampled.
turnover: a numeric vector of length equal to 3+(number of hidden.states * 3). A GeoSSE model has 3 turnover parameters: tau00, tau11 and tau01. A GeoHiSSE model with one hidden area has 6 turnover parameters: tau00A, tau11A, tau01A, tau00B, tau11B, and tau01B, and so on. The length of the numeric vector needs to match the number of turnover parameters in the model.
eps: a numeric vector of length equal to 2+(number of hidden.states * 2). A GeoSSE model has 2 extinction fraction parameters: ef00 and ef11. A GeoHiSSE model with one hidden area has 4 extinction fraction parameters: ef00A, ef11A, ef00B, and ef11B, and so on. The length of the numeric vector needs to match the number of extinction fraction parameters in the model.
hidden.states: a logical indicating whether the model includes hidden.states. The default is FALSE.
trans.rate: provides the transition rate model. See function TransMatMakerGeoHiSSE.
assume.cladogenetic: assumes that cladogenetic events occur at nodes. The default is TRUE.
condition.on.survival: a logical indicating whether the likelihood should be conditioned on the survival of two lineages and the speciation event subtending them (Nee et al. 1994). The default is TRUE.
root.type: indicates whether root summarization follows the procedure described by FitzJohn et al. 2009, “madfitz”, or Herrera-Alsina et al. 2018, “herr_als”.
root.p: a vector indicating fixed root state probabilities. The default is NULL. The order of the areas in the vector needs to follow: root.p[1] = 1 (endemic area '00'); root.p[2] = 2 (endemic area '11'); root.p[3] = 0 (widespread area '01').
sann: a logical indicating whether a two-step optimization procedure is to be used. The first step is a simulated annealing approach, and the second is a refinement using subplex. The default is TRUE.
sann.its: a numeric indicating the number of times the simulated annealing algorithm should call the objective function.
bounded.search: a logical indicating whether or not bounds should be enforced during optimization. The default is TRUE.
max.tol: supplies the relative optimization tolerance to subplex.
mag.san.start: Sets the extinction fraction to estimate the starting values for the diversification parameters. The equation used is based on Magallon and Sanderson (2001), and follows the procedure used in the original GeoSSE implementation.
starting.vals: a numeric vector of length 3 with starting values for the model for all areas and hidden states. Position [1] sets turnover, [2] sets extinction fraction, and [3] dispersal rates.
turnover.upper: sets the upper bound for the turnover parameters.
eps.upper: sets the upper bound for the extinction fraction parameters.
trans.upper: sets the upper bound for the transition rate parameters.
restart.obj: an object that contains everything to restart an optimization.
ode.eps: sets the tolerance for the integration at the end of a branch. Essentially if the sum of compD is less than this tolerance, then it assumes the results are unstable and discards them. The default is set to zero, but in testing a value of 1e-8 can sometimes produce stable solutions for both easy and very difficult optimization problems.
dt.threads: sets the number of threads available to data.table. In practice this need not change from the default of 1 thread, as we have not seen any speedup from allowing more threads.
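
As a minimal sketch of the inputs described for data and f, the two-column layout and the sampling-fraction vector can be built and sanity-checked in base R before fitting. The species names are hypothetical, and the checks are only a suggestion, not something the function is documented to require in this form.

```r
# Hypothetical species names; area codes follow the 'data' argument:
# 0 = widespread area '01', 1 = endemic area '00', 2 = endemic area '11'.
dat <- data.frame(
  species = c("sp_a", "sp_b", "sp_c", "sp_d"),
  area    = c(1, 2, 0, 1),
  stringsAsFactors = FALSE
)

# Basic checks: exactly two columns, and only the three valid area codes.
stopifnot(ncol(dat) == 2, all(dat$area %in% c(0, 1, 2)))

# Sampling fractions in the order given for 'f':
# area '00', area '11', widespread area '01'.
f <- c(0.25, 0.25, 0.5)
stopifnot(length(f) == 3, all(f > 0 & f <= 1))
```

In practice the names in the first column would presumably also need to correspond to the tip labels of phy.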

Author

Jeremy M. Beaulieu

Details

This function sets up and executes a more complex and faster version of the GeoHiSSE model (for the original function see GeoHisse.old). One of the main differences here is that the model allows up to 10 hidden categories, and implements a more efficient means of carrying out the branch calculation. Specifically, we break up the tree so that all descendant branch calculations are carried out simultaneously, the probabilities are combined based on their shared ancestry, and the process is then repeated for the next set of descendants. In testing, we've found that as the number of taxa increases, the calculation becomes much more efficient. In future versions, we will likely allow for multicore processing of these calculations to further improve speed. Also, note that this function has replaced the version of GeoSSE that is currently available (see GeoHisse.old).

The other main difference is that, like HiSSE, we employ a modified optimization procedure. In other words, rather than optimizing birth and death separately, GeoHiSSE optimizes orthogonal transformations of these variables: we let tau = birth+death define “net turnover”, and we let eps = death/birth define the “extinction fraction”. This reparameterization alleviates problems associated with overfitting when birth and death are highly correlated, but both matter in explaining the diversity pattern.
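
The two definitions above can be inverted algebraically: birth = tau/(1+eps) and death = tau*eps/(1+eps). A minimal sketch for a single pair of birth and death rates follows. Both helper functions are hypothetical and not part of hisse, and the rate values are arbitrary.

```r
# Hypothetical helpers (not hisse functions) for the reparameterization.
to_turnover_eps <- function(birth, death) {
  c(turnover = birth + death, eps = death / birth)
}

# Inverse transformation, obtained by solving the two definitions above.
to_birth_death <- function(turnover, eps) {
  birth <- turnover / (1 + eps)
  death <- turnover * eps / (1 + eps)
  c(birth = birth, death = death)
}

# Round trip with arbitrary illustrative rates.
p  <- to_turnover_eps(birth = 0.3, death = 0.1)
bd <- to_birth_death(p[["turnover"]], p[["eps"]])
print(p)
print(bd)
```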

To set up a model, users input vectors containing values to indicate how many free parameters are to be estimated for each of the variables in the model. This is done using the turnover and eps arguments. One needs to specify a value for each parameter of the model; when two parameters have the same value, they are linked during estimation. For example, a GeoHiSSE model with 1 hidden area and all free parameters has turnover = 1:6. The same model with turnover rates constrained to be the same across hidden states has turnover = c(1,2,3,1,2,3). The same format applies to eps.
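
The index-vector convention can be sketched in base R as follows. The variable names and the n_free helper are hypothetical; only the vector lengths and the linking-by-equal-value rule come from the text above.

```r
# Index vectors for a model with one hidden state (categories A and B).
# Equal integers link parameters; distinct integers are estimated separately.
hidden <- 1
turnover.free   <- 1:6                  # tau00A, tau11A, tau01A, tau00B, tau11B, tau01B
turnover.linked <- c(1, 2, 3, 1, 2, 3)  # A and B share each turnover rate
eps.free        <- 1:4                  # ef00A, ef11A, ef00B, ef11B
eps.linked      <- c(1, 2, 1, 2)        # A and B share each extinction fraction

# Lengths must match 3 + hidden * 3 and 2 + hidden * 2, respectively.
stopifnot(length(turnover.linked) == 3 + hidden * 3,
          length(eps.linked) == 2 + hidden * 2)

# Hypothetical helper: number of distinct parameters implied by an index vector.
n_free <- function(index) length(unique(index))
print(c(turnover = n_free(turnover.linked), eps = n_free(eps.linked)))
```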

Once the model is specified, the parameters can be estimated using the subplex routine alone (sann=FALSE), or using a two-step process (sann=TRUE, the default) that first employs a stochastic simulated annealing procedure, whose result is then refined using the subplex routine.

The “trans.rate” input is the transition model and has an entirely different setup from the turnover and eps vectors. See the TransMatMakerGeoHiSSE function for more details.
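
Putting the pieces together, a call might look like the sketch below. It is wrapped in a hypothetical function (fit_geohisse_sketch) so that nothing runs unless the user supplies a tree and data. The hidden.traits argument of TransMatMakerGeoHiSSE is an assumption not documented on this page, so check that function's own help before relying on it.

```r
# Sketch only: needs the hisse package; 'phy' and 'dat' are the user's own
# tree and two-column data set. The wrapper name is hypothetical.
fit_geohisse_sketch <- function(phy, dat) {
  if (!requireNamespace("hisse", quietly = TRUE)) {
    stop("this sketch needs the hisse package to be installed")
  }
  # Transition model for one hidden state. The 'hidden.traits' argument is
  # an assumption here; see the help page of TransMatMakerGeoHiSSE.
  trans <- hisse::TransMatMakerGeoHiSSE(hidden.traits = 1)
  hisse::GeoHiSSE(
    phy = phy, data = dat, f = c(1, 1, 1),
    turnover = 1:6,          # all six turnover parameters free
    eps = c(1, 1, 1, 1),     # a single shared extinction fraction
    hidden.states = TRUE, trans.rate = trans,
    sann = TRUE, sann.its = 1000
  )
}
```

Linking all eps values while freeing turnover is just one possible configuration, chosen here to keep the parameter count modest.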

For user-specified “root.p”, you should specify the probability for each area. If you are doing a hidden model, there will be six areas: 0A, 1A, 2A, 0B, 1B, 2B. So if you wanted to say the root had to be in area 0 (widespread distribution), you would specify “root.p = c(0.5, 0, 0, 0.5, 0, 0)”. In other words, the root has a 50% chance to be in one of the areas 0A or 0B.
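
The root.p vector in this example can be built programmatically. The sketch below assumes the area order exactly as listed in the paragraph above (0A, 1A, 2A, 0B, 1B, 2B); fix_root_area is a hypothetical helper, not a hisse function.

```r
# Area order as listed above for a model with one hidden state.
areas <- c("0A", "1A", "2A", "0B", "1B", "2B")

# Hypothetical helper: place all root probability on one observed area,
# split evenly over its hidden categories.
fix_root_area <- function(area, labels = areas) {
  hit <- substr(labels, 1, 1) == as.character(area)
  p <- ifelse(hit, 1 / sum(hit), 0)
  names(p) <- labels
  p
}

# Root constrained to area 0 (widespread distribution).
root.p <- fix_root_area(0)
print(root.p)
```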

For the “root.type” option, we are currently maintaining the previous default of “madfitz”. However, it was recently pointed out by Herrera-Alsina et al. (2018) that at the root, the individual likelihoods for each possible state should be conditioned prior to averaging the individual likelihoods across states. This can be set with “root.type=herr_als”. It is unclear to us which is exactly correct, but it does seem that both “madfitz” and “herr_als” behave exactly as they should in the case of character-independent diversification (i.e., the likelihood reduces to likelihood of tree + likelihood of trait model). We've also tested the behavior: the likelihood differences are very subtle, and the parameter estimates in simulation are nearly indistinguishable from those under the “madfitz” conditioning scheme. We provide both options and encourage users to try both and let us know of conditions in which the results vary dramatically under the two root implementations. We suspect they do not.

Also note that “root.type=user” and “root.type=equal” are no longer explicit “root.type” options. Instead, either “madfitz” or “herr_als” is specified, and “root.p” can be set to allow for custom root options.

References

Caetano, D.S., B.C. O'Meara, and J.M. Beaulieu. 2018. Hidden state models improve state-dependent diversification approaches, including biogeographic models. Evolution, 72:2308-2324.

Beaulieu, J.M., and B.C. O'Meara. 2016. Detecting hidden diversification shifts in models of trait-dependent speciation and extinction. Syst. Biol. 65:583-601.

FitzJohn R.G., W.P. Maddison, and S.P. Otto. 2009. Estimating trait-dependent speciation and extinction rates from incompletely resolved phylogenies. Syst. Biol. 58:595-611.

Maddison W.P., P.E. Midford, and S.P. Otto. 2007. Estimating a binary character's effect on speciation and extinction. Syst. Biol. 56:701-710.

Nee S., R.M. May, and P.H. Harvey. 1994. The reconstructed evolutionary process. Philos. Trans. R. Soc. Lond. B Biol. Sci. 344:305-311.