make.mark.model: Create a MARK model for analysis

Description

Creates a MARK model object that contains a MARK input file with PIMS and design matrix specific to the data and model structure and formulae specified for the model parameters.

Usage

make.mark.model(
  data,
  ddl,
  parameters = list(),
  title = "",
  model.name = NULL,
  initial = NULL,
  call = NULL,
  default.fixed = TRUE,
  options = NULL,
  profile.int = FALSE,
  chat = NULL,
  simplify = TRUE,
  input.links = NULL,
  parm.specific = FALSE,
  mlogit0 = TRUE,
  hessian = FALSE,
  accumulate = TRUE,
  icvalues = NULL,
  wrap = TRUE,
  nodes = 101,
  useddl = FALSE,
  check.model = FALSE
)

Value

model: a MARK object except for the elements output and results. See mark for a detailed description of the list contents.

Arguments

data: Data list resulting from function process.data
ddl: Design data list from function make.design.data
parameters: List of parameter formula specifications
title: Title for the analysis (optional)
model.name: Model name to override default name (optional)
initial: Vector of named or unnamed initial values for beta parameters or previously run model (optional)
call: Pass function call when this function is called from another function (e.g.mark) (internal use)
default.fixed: if TRUE, real parameters for which the design data have been deleted are fixed to default values
options: character string of options for Proc Estimate statement in MARK .inp file
profile.int: if TRUE, requests MARK to compute profile intervals
chat: pre-specified value for chat used by MARK in calculations of model output
simplify: if FALSE, does not simplify PIM structure
input.links: specifies set of link functions for parameters with non-simplified structure
parm.specific: if TRUE, forces a link to be specified for each parameter
mlogit0: if TRUE, any real parameter that is fixed to 0 and has an mlogit link will have its link changed to logit so it can be simplified
hessian: if TRUE specifies to MARK to use hessian rather than second partial matrix
accumulate: if TRUE accumulate like data values into frequencies
icvalues: numeric vector of individual covariate values for computation of real values
wrap: if TRUE, data lines are wrapped to be length 80; if length of a row is not a problem set to FALSE and it will run faster
nodes: number of integration nodes for individual random effects (min 15, max 505, default 101)
useddl: If TRUE and there are no missing rows or parameters (deleted) then it will use ddl in place of full.ddl that is created internally.
check.model: if TRUE, code does an internal consistency check between PIMs and design data when making model.

Author

Jeff Laake

Details

This function is called by mark to create the model but it can be called directly to create but not run the model. All of the arguments have default values except for the first 2 that specify the processed data list (data) and the design data list (ddl). If only these 2 arguments are specified default models are used for the parameters. For example, following with the example from process.data and make.design.data, the default model can be created with:

mymodel=make.mark.model(proc.example.data,ddl)

The call to make.mark.model creates a model object but does not do the analysis. The function returns a list that includes several fields including a design matrix and the MARK input file that will be used with MARK.EXE to run the analysis from function run.mark.model. The following shows the names of the list elements in mymodel:

 names(mymodel) [1] "data" "model" "title" "model.name"
"links" [6] "mixtures" "call" "parameters" "input" "number.of.groups" [11]
"group.labels" "nocc" "begin.time" "covariates" "fixed" [16] "design.matrix"
"pims" "design.data" "strata.labels" "mlogit.list" [21] "simplify"

The list is defined to be a mark object which is done by assigning a class vector to the list. The classes for an R object can be viewed with the class function as shown below:

 class(mymodel) [1] "mark" "CJS"

Each MARK model has 2 class values. The first identifies it as a mark object and the second identifies the type of mark analysis, which is the default "CJS" (recaptures only) in this case. The use of the class feature has advantages in using generic functions and identifying types of objects. An object of class mark is defined in more detail in function mark.

To fit non-trivial models it is necessary to understand the remaining calling arguments of make.mark.model and R formula notation. The model formulae are specified with the calling argument parameters. It uses a similar list structure as the parameters argument in make.design.data. It expects to get a list with elements named to match the parameters in the particular analysis (e.g., Phi and p in CJS) and each list element is a list, so it is a list of lists). For each parameter, the possible list elements are formula, link, fixed, component, component.name, remove.intercept. In addition, for closed capture models and robust design model, the element share is included in the list for p (capture probabilities) and GammaDoublePrime (respectively) to indicate whether the model is shared (share=TRUE) or not-shared (the default) (share=FALSE) with c (recapture probabilities) and GammaPrime respectively.

formula specifies the model for the parameter using R formula notation. An R formula is denoted with a ~ followed by variables in an equation format possibly using the + , *, and : operators. For example, ~sex+age is an additive model with the main effects of sex and age. Whereas, ~sex*age includes the main effects and the interaction and it is equivalent to the formula specified as ~sex+age+sex:age where sex:age is the interaction term. The model ~age+sex:age is slightly different in that the main effect for sex is dropped which means that intercept of the age effect is common among the sexes but the age pattern could vary between the sexes. The model ~sex*Age which is equivalent to ~sex + Age + sex:Age has only 4 parameters and specifies a linear trend with age and both the intercept and slope vary by sex. One additional operator that can be useful is I() which allows computation of variables on the fly. For example, the addition of the Agesq variable in the design data (as described above) can be avoided by using the notation ~Age + I(Age^2) which specifies use of a linear and quadratic effect for age. Note that specifying the model ~age + I(age^2) would be incorrect and would create an error because age is a factor variable whereas Age is not.

As an example, consider developing a model in which Phi varies by age and p follows a linear trend over time. This model could be specified and run as follows:

 p.Time=list(formula=~Time) Phi.age=list(formula=~age)
Model.p.Time.Phi.age=make.mark.model(proc.example.data,ddl,
parameters=list(Phi=Phi.age,p=p.Time))
Model.p.Time.Phi.age=run.mark.model(Model.p.Time.Phi.age)

The first 2 commands define the p and Phi models that are used in the parameter list in the call to make.mark.model. This is a good approach for defining models because it clearly documents the models, the definitions can then be used in many calls to make.mark.model and it will allow a variety of models to be developed quickly by creating different combinations of the parameter models. Using the notation above with the period separating the parameter name and the description (eg., p.Time) gives the further advantage that all possible models can be developed quickly with the functions create.model.list and mark.wrapper.

Model formula can use any field in the design data and any individual covariates defined in data. The restrictions on individual covariates that was present in versions before 1.3 have now been removed. You can now use interactions of individual covariates with all design data covariates and products of individual covariates. You can specify interactions of individual covariates and factor variables in the design data with the formula notation. For example, ~region*x1 describes a model in which the slope of x1 varies by region. Also, ~time*x1 describes a model in which the slope for x1 varied by time; however, there would be only one value of the covariate per animal so this is not a time varying covariate model. Models with time varying covariates are now more easily described with the improvements in version 1.3 but there are restrictions on how the time varying individual covariates are named. An example would be a trap dependence model in which capture probability on occasion i+1 depends on whether they were captured on occasion i. If there are n occasions in a CJS model, the 0/1 (not caught/caught) for occasions 1 to n-1 would be n-1 individual covariates to describe recapture probability for occasions 2 to n. For times 2 to n, a design data field could be defined such that the variable timex is 1 if time==x and 0 otherwise. The time varying covariates must be named with a time suffix on the base name of the covariate. In this example they would be named as x2,. . .,xn and the model could be specified as ~time + x for time variation and a constant trap effect or as ~time + time:x for a trap effect that varied by time. If in the process.data call, the argument begin.time was set to the year 1990, then the variables would have to be named x1991,x1992,... because the first recapture occasion would be 1991. Note that the times are different for different parameters. For example, survival is labeled based on the beginning of the time interval which would be 1990 so the variables should be named appropriately for the parameter model in which they will be used.

In previous versions to handle time varying covariates with a constant effect, it was necessary to use the component feature of the parameter specification to be able to append one or more arbitrary columns to the design matrix. That is no longer required for individual covariates and the component feature was removed in v2.0.8.

There are three other elements of the parameter list that can be useful on occasion. The first is link which specifies the link function for transforming between the beta and real parameters. The possible values are "logit", "log", "identity" and "mlogit(*)" where * is a numeric identifier. The "sin" link is not allowed because all models are specified using a design matrix. The typical default values are assigned to each parameter (eg "logit" for probabilities, "log" for N, and "mlogit" for pent in POPAN), so in most cases it will not be necessary to specify a link function.

The second is fixed which allows real parameters to be set at fixed values. While this is still allowed, if parameters are to be fixed for all models, it is best to assign the field fix in the design dataframe and use NA for the values to be estimated and the real fixed value for those values to be fixed. But if the values are only to be fixed for a few models then you can still use the fixed approach described here. The values for fixed can be either a single value or a list with 5 alternate forms for ease in specifying the fixed parameters. Specifying fixed=value will set all parameters of that type to the specified value. For example, Phi=list(fixed=1) will set all Phi to 1. This can be useful for situations like specifying F in the Burnham/Barker models to all have the same value of 1. Fixed values can also be specified as a list in which values are specified for certain indices, times, ages, cohorts, and groups. The first 3 will be the most useful. The first list format is the most general and flexible but it requires an understanding of the PIM structure and index numbers for the real parameters. For example,

Phi=list(formula=~time, fixed=list(index=c(1,4,7),value=1))

specifies Phi varying by time, but the real parameters 1,4,7 are set to 1. The value field is either a single constant or its length must match the number of indices. For example,

Phi=list(formula=~time, fixed=list(index=c(1,4,7),value=c(1,0,1)))

sets real parameters 1 and 7 to 1 and real parameter 4 to 0. Technically, the index/value format for fixed is not wedded to the parameter type (i.e., values for p can be assigned within Phi list), but for the sake of clarity they should be restricted to fixing real parameters associated with the particular parameter type. The time and age formats for fixed will probably be the most useful. The format fixed=list(time=x, value=y) will set all real parameters (of that type) for time x to value y. For example,

p=list(formula=~time,fixed=list(time=1986,value=1))

sets up time varying capture probability but all values of p for 1986 are set to 1. This can be quite useful to set all values to 0 in years with no sampling (e.g.,

fixed=list(time=c(1982,1984,1986), value=0)

). The age, cohort and group formats work in a similar fashion. It is important to recognize that the value you specify for time, age, cohort and group must match the values in the design data list. This is another reason to add binned fields for age, time etc with add.design.data after creating the default design data with make.design.data. Also note that the values for time and cohort are affected by the begin.time argument specified in process.data. Had I not specified begin.time=1980, to set p in the last occasion (1986), the specification would be

p=list(formula=~time,fixed=list(time=7,value=1))

because begin.time defaults to 1. The advantage of the time-, age-, and cohort- formats over the index-format is that it will work regardless of the group definition which can easily be changed by changing the groups argument in process.data. The index-format will be dependent on the group structure because the indexing of the PIMS will most likely change with changes in the group structure.

Starting with version 3.0.0 deleting design data as deacribed in this paragraph is no longer allowed. Instead, add a fix field to the design data and set the fixed values as described above.

#### No longer supported### Parameters can also be fixed at default values by deleting the specific rows of the design data. See make.design.data and material below. The default value for fixing parameters for deleted design data can be changed with the default=value in the parameter list.

Another useful element of the parameter list is the remove.intercept argument. It is set to TRUE to forcefully remove the intercept. In R notation this can be done by specifying the formula notation ~-1+... but in formula with nested interactions of factor variables and additive factor variables the -1 notation will not remove the intercept. It will simply adjust the column definitions but will keep the same number of columns and the model will be over-parameterized. The problem occurs with nested factor variables like tostratum within stratum for multistrata designs (see mstrata). As shown in that example, you can build a formula -1+stratum:tostratum to have transitions that are stratum-specific. If however you also want to add a sex effect and you specify -1+sex+stratum:tostratum it will add 2 columns for sex labelled M and F when in fact you only want to add one column because the intercept is already contained within the stratum:tostratum term. The argument remove.intercept will forcefully remove the intercept but it needs to be able to find a column with all 1's. For example, Psi=list(formula=~sex+stratum:tostratum,remove.intercept=TRUE) will work but Psi=list(formula=~-1+sex+stratum:tostratum,remove.intercept=TRUE) will not work. Also, the -1 notation should be used when there is not an added factor variable because

Psi=list(formula=~stratum:tostratum,remove.intercept=TRUE)

will not work because while the stratum:tostratum effectively includes an intercept it is equivalent to using an identity matrix and is not specified as treatment contrast with one of the columns as all 1's.

The final argument of the parameter list is contrast which can be used to change the contrast used for model.matrix. It uses the default if none specified. The form is shown in ?model.matrix.

The argument simplify determines whether the pims are simplified such that only indices for unique and fixed real parameters are used. For example, with an all different PIM structure with CJS with K occasions there are K*(K-1) real parameters for Phi and p. However, if you use simplify=TRUE with the default model of Phi(.)p(.), the pims are re-indexed to be 1 for all the Phi's and 2 for all the p's because there are only 2 unique real parameters for that model. Using simplify can speed analysis markedly and probably should always be used. This was left as an argument only to test that the simplification was working and produced the same likelihood and real parameter estimates with and without simplification. It only adjust the rows of the design matrix and not the columns. There are some restrictions for simplification. Real parameters that are given a fixed value are maintained in the design matrix although it does simplify among the fixed parameters. For example, if there are 50 real parameters all fixed to a value of 1 and 30 all fixed to a value of 0, they are reduced to 2 real parameters fixed to 1 and 0. Also, real parameters such as Psi in Multistrata and pent in POPAN that use multinomial logits are not simplified because they must maintain the structure created by the multinomial logit link. All other parameters in those models are simplified. The only downside of simplification is that the labels for real parameters in the MARK output are unreliable because there is no longer a single label for the real parameter because it may represent many different real parameters in the all-different PIM structure. This is not a problem with the labels in R because the real parameter estimates are translated back to the all-different PIM structure with the proper labels.

Be careful specifying interactions and make sure that all combinations are present when interacting 2 or more factor variables. For example, with 2 factors A and B, each with 2 levels, the data are fully crossed if there are data with values A1&B1, A1&B2, A2&B1 and A2&B2. If data exist for each of the 4 combinations then you can described the interaction model as ~A*B and it will estimate 4 values. However, if data are missing for one of more of the 4 cells then the "interaction" formula should be specified as ~-1+A:B or ~-1+B:A or ~-1+A the combinations with data. An example of this could be a marking program with multiple sites which resighted at all occasions but which only marked at sites on alternating occasions. In that case time is nested within site and time-site interaction models would be specified as ~-1+time:site. Another example is age*time when you only mark young of the year. In that case, the first occasion only contains young, the second occasion newly marked young and age 1 if occasions are a year apart. In that case, using something like above ~-1+time:age is useful.

The argument title supplies a character string that is used to label the output. The argument model.name is a descriptor for the model used in the analysis. The code constructs a model name from the formula specified in the call (e.g., Phi(~1)p(~time)) but on occasion the name may be too long or verbose, so it can be over-ridden with the model.name argument.

The final argument initial can be used to provide initial estimates for the beta parameters. It is either 1) a single starting value for each parameter, 2) an unnamed vector of values (one for each parameter), 3) named vector of values, or 4) the name of mark object that has already been run. For cases 3 and 4, the code only uses appropriate initial beta estimates in which the column names of the design matrix (for model) or vector names match the column names in the design matrix of the model to be run. Any remaining beta parameters without an initial value specified are assigned a 0 initial value. If case 4 is used the models must have the same number of rows in the design matrix and thus presumably the same structure. As long as the vector elements are named (#3), the length of the initial vector no longer needs to match the number of parameters in the new model as long as the elements are named. The names can be retrieved either from the column names of the design matrix or from rownames(x$results$beta) where x is the name of the mark object.

options is a character string that is tacked onto the Proc Estimate statement for the MARK .inp file. It can be used to request options such as NoStandDM (to not standardize the design matrix) or SIMANNEAL (to request use of the simulated annealing optimization method) or any existing or new options that can be set on the estimate proc.

Examples

Run this code


# \donttest{
# This example is excluded from testing to reduce package check time
data(dipper)
#
# Process data
#
dipper.processed=process.data(dipper,groups=("sex"))
#
# Create default design data
#
dipper.ddl=make.design.data(dipper.processed)
#
# Add Flood covariates for Phi and p that have different values
#
dipper.ddl$Phi$Flood=0
dipper.ddl$Phi$Flood[dipper.ddl$Phi$time==2 | dipper.ddl$Phi$time==3]=1
dipper.ddl$p$Flood=0
dipper.ddl$p$Flood[dipper.ddl$p$time==3]=1
#
#  Define range of models for Phi
#
Phidot=list(formula=~1)
Phitime=list(formula=~time)
PhiFlood=list(formula=~Flood)
#
#  Define range of models for p
#
pdot=list(formula=~1)
ptime=list(formula=~time)
#
# Make assortment of models
#
dipper.phidot.pdot=make.mark.model(dipper.processed,dipper.ddl,
  parameters=list(Phi=Phidot,p=pdot))
dipper.phidot.ptime=make.mark.model(dipper.processed,dipper.ddl,
  parameters=list(Phi=Phidot,p=ptime))
dipper.phiFlood.pdot=make.mark.model(dipper.processed,dipper.ddl,
  parameters=list(Phi=PhiFlood, p=pdot))
# }

Run the code above in your browser using DataLab