For each type of parameter in the analysis model (e.g, p, Phi, r), this
function creates design data based on the parameter index matrix (PIM) in
MARK terminology. Design data relate parameters to the sampling and data
structures; whereas data
relate to the object(animal) being sampled.
Design data relate parameters to time, age, cohort and group structure. Any
variables in the design data can be used in formulas to specify the model in
make.mark.model
.
make.design.data(
data,
parameters = list(),
remove.unused = FALSE,
right = TRUE,
common.zero = FALSE
)
The function value is a list of data frames. The list contains a data frame for each type of parameter in the model (e.g., Phi and p for CJS). The names of the list elements are the parameter names (e.g., Phi). The structure of the dataframe depends on the calling arguments and the model & data structure as described in the details above.
Processed data list; resulting value from process.data
Optional list containing a list for each type of parameter (list of lists); each parameter list is named with the parameter name (eg Phi); each parameter list can contain vectors named age.bins,time.bins and cohort.bins
subtract.stratum | a vector of strata letters (one for each strata) |
that specifies the tostratum that is computed by subtraction | |
for mlogit parameters like Psi | |
age.bins | bins for binning ages |
time.bins | bins for binning times |
cohort.bins | bins for binning cohorts |
pim.type | either "all" for all different, "time" for column time structure, or |
"constant" for all values the same within the PIM |
If TRUE, unused design data are deleted; see details below (as of v3.0.0 this argument is no longer used)
If TRUE, bin intervals are closed on the right
if TRUE, uses a common begin.time to set origin (0) for Time variable defaults to FALSE for legacy reasons but should be set to TRUE for models that share formula like p and c with the Time model
Jeff Laake
After processing the data, the next step is to create the design data for building the models which is done with this function. The design data are different than the capture history data that relates to animals. The types of parameters and design data are specific to the type of analysis. For example, consider a CJS analysis that has parameters Phi and p. If there are 4 occasions, there are 3 cohorts and potentially 6 different Phi and 6 different p parameters for a single group. The format for each parameter information matrix (PIM) in MARK is triangular. RMark uses the all different formulation for PIMS by default, so the PIMs would be
Phi p 1 2 3 7 8 9 4 5 10 11 6 12
If you chose pim.type="time" for each parameter in "CJS", then the PIMS are structured as
Phi p 1 2 3 4 5 6 2 3 5 6 3 6
That structure is only useful if there is only a single release cohort represented by the PIM. If you choose this option and there is more than one cohort represented by the PIM then it will restrict the possible set of models that can be represented.
Each of these parameters relates to different times, different cohorts (time of initial release) and different ages (at least in terms of time since first capture). Thus we can think of a data frame for each parameter that might look as follows for Phi for the all different structure:
Index time cohort age 1 1 1 0 2 2 1 1 3 3 1 2 4 2 2 0 5 3 2 1
6 3 3 0
With this design data, one can envision models that describe Phi in terms of the variables time, cohort and age. For example a time model would have a design matrix like:
Int T2 T3 1 1 0 0 2 1 1 0 3
1 0 1 4 1 1 0 5 1 0 1 6 1 0 1
Or a time + cohort model might look like
Int T2 T3 C2 C3 1 1 0 0 0 0 2 1 1 0 0 0 3 1 0 1 0 0 4 1 1 0 1
0 5 1 0 1 1 0 6 1 0 1 0 1
While you could certainly develop these designs
manually within MARK, the power of the R code rests with the internal R
function model.matrix
which can take the design data and
create the design matrix from a formula specification such as ~time
or ~time+cohort
alleviating the need to create the design matrix
manually. While many analyses may only need age, time or cohort, it is
quite possible to extend the kind of design data, to include different
functions of these variables or add additional variables such as effort.
One could consider design data for p as follows:
Index time
cohort age effort juvenile 7 1 1 1 10 1 8 2 1 2 5 0 9 3 1 3 15 0 10 2 2 1 5
1 11 3 2 2 15 0 12 3 3 1 15 1
The added columns represent a time dependent
covariate (effort) and an age variable of juvenile/adult. With these design
data, it is easy to specify different models such as ~time
,
~effort
, ~effort+age
or ~effort+juvenile
.
With the simplest call:
ddl=make.design.data(proc.example.data)
the function creates default design data for each type of parameter in the
model as defined by proc.example.data$model
. If
proc.example.data
was created with the call in the first example of
process.data
, the model is "CJS" (the default model) so the
return value is a list with 2 data frames: one for Phi and one for p. They
can be accessed as ddl$Phi
(or ddl[["Phi"]]
) and ddl$p
(or ddl[["p"]]
) or as ddl[[1]]
and ddl[[2]]
respectively. Using the former notation is easier and safer because it is
independent of the ordering of the parameters in the list. For this example,
there are 16 groups and each group has 21 Phi parameters and 21 p
parameters. Thus, there are 336 rows (parameters) in the design data frame
for both Phi and p and thus a total of 772 parameters.
The default fields in each dataframe are typically group
, cohort
,
age
, time
, Cohort
, Age
, and Time
. The
first 4 fields are factor variables, whereas Cohort
, Age
and
Time
are numeric covariate versions of cohort
, age
, and
time
shifted so the first value is always zero. However, there are
additional fields that are added depending on the capture-recapture model and
the parameter in the model. For example, in multistrata models the default data
include stratum in survival(S) and stratum and tostratum in Psi, the transition
probabilities. Also, for closed capture heterogeneity models a factor variable
mixture
is included. It is always best to examine the design data after
creating them because those fields are your "data" for building models in
addition to individual covariates in the capture history data.
If groups
were created in the call to
process.data
, then the factor variables used to create the
groups
are also included in the design data for each type of
parameter. If one of the grouping variables is an age variable it is named
initial.age.class
to recognize explicitly that it represents a static
initial age and to avoid naming conflicts with age
and Age
variables which represent dynamic age variables of the age of the animal
through time. Non-age related grouping variables are added to the design
data using their names in data
. For example if
proc.example.data
is defined using the first example in
process.data
, then the fields sex
, initial.age.class
(in place of age
in this case), and region
are added to the
design data in addition to the group
variable that has 16 levels. The
levels of the group
variable are created by pasting (concatenating)
the values of the grouping factor in order. For example, M11 is sex=M, age
class=1 and region=1.
By default, the factor variables age
, time
, and cohort
are created such that there is a factor level for each unique value. By
specfying values for the argument parameters
, the values of
age
, time
, and cohort
can be binned (put into
intervals) to reduce the number of levels of each factor variable. The
argument parameters
is specified as a list of lists. The first level
of the list specifies the parameter type and the second level specifies the
variables (age
, time
, or cohort
) that will be binned
and the cutpoints (endpoints) for the intervals. For example, if you
expected that survival may change substantially to age 3 (i.e. first 3 years
of life) but remain relatively constant beyond then, you could bin the ages
for survival as 0,1,2,3-8. Likewise, as well, you could decide to bin time
into 2 intervals for capture probability in which effort and expected
capture probability might be constant within each interval. This could be
done with the following call using the argument parameters
:
ddl=make.design.data(proc.example.data,
parameters=list(Phi=list(age.bins=c(0,0.5,1,2,8)),
p=list(time.bins=c(1980,1983,1986))))
In the above example, age
is binned for Phi but not for p; likewise
time
is binned for p but not for Phi. The bins for age were defined
as 0,0.5,1,2,8 because the intervals are closed ("]" - inclusive) on the
right by default and open ("(" non-inclusive) on the left, except for the
first interval which is closed on the left. Had we used 0,1,2,8, 0 and 1
would have been in a single interval. Any value less than 1 and greater
than 0 could be used in place of 0.5. Alternatively, the same bins could be
specified as:
ddl=make.design.data(proc.example.data,
parameters=list(Phi=list(age.bins=c(0,1,2,3,8)),
p=list(time.bins=c(1980,1984,1986))),right=FALSE)
To create the design data and maintain flexibility, I recommend creating the
default design data and then adding other binned variables with the function
add.design.data
. The 2 binned variables defined above can be
added to the default design data as follows:
ddl=make.design.data(proc.example.data)
ddl=add.design.data(proc.example.data,ddl,parameter="Phi",type="age",
bins=c(0,.5,1,2,8),name="agebin")
ddl=add.design.data(proc.example.data,ddl,parameter="p",type="time",
bins=c(1980,1983,1986),name="timebin")
Adding the other binned variables to the default design data, allows models
based on either time, timebin, or Time for p and age, agebin or Age for Phi.
Any number of additional binned variables can be defined as long as they are
given unique names. Note that R is case-specific so ~Time
specifies a
model which is a linear trend over time ((e.g. Phi(T) or p(T) in MARK)
whereas ~time
creates a model with a different factor level for each
time
in the data (e.g. Phi(t) or p(t) in MARK) and ~timebin
creates a model with 2 factor levels 1980-1983 and 1984-1986.
Some circumstances may require direct manipulation of the design data to
create the needed variable when simple binning is not sufficient or when the
design data is a variable other than one related to time
, age
,
cohort
or group
(e.g., effort index). This can be done with
any of the vast array of R commands. For example, consider a situation in
which 1983 and 1985 were drought years and you wanted to develop a model in
which survival was different in drought and non-drought years. This could
be done with the following commands:
ddl$Phi$drought=0
ddl$Phi$drought[ddl$phi$time==1983 | ddl$Phi$time==1985]= 1
The first command creates a variable named drought in the Phi design data
and initializes it with 0. The second command changes the drought variable
to 1 for the years 1983 and 1985. The single brackets [] index a data frame,
matrix or vector. In this case ddl$Phi$drought
is a vector and
ddl$Phi$time==1983 | ddl$Phi$time==1985
selects the values in which
time is equal (==) to 1983 or ("|") 1985. A simpler example might occur if
we want to create a function of one of the continuous variables. If we
wanted to define a model for p that was a function of age and age squared,
we could add the age squared variable as:
ddl$p$Agesq=ddl$p$Age^2
Any of the fields in the design data can be used in formulae for the
parameters. However, it is important to recognize that additional variables
you define and add to the design data are specific to a particular type of
parameter (e.g., Phi, p, etc). Thus, in the above example, you could not use
Agesq in a model for Phi without also adding it to the Phi design data. As
described in make.mark.model
, there is actually a simpler way
to add simple functions of variables to a formula without defining them in
the design data.
The above manipulations are sufficient if there is only one or two variables
that need to be added to the design data. If there are many covariates that
are time(occasion)-specific then it may be easier to setup a dataframe with
the covariate data and use merge_design.covariates
.
The fields that are automatically created in the design data depends on the
model. For example, with models such as "POPAN" or any of the "Pradel"
models, the PIM structure is called square which really means that it is a
single row and all the rows are the same length for each group. Thus,
rectangular or row may have been a better descriptor. Regardless, in this
case there is no concept of a cohort within the PIM which is equivalent to a
row within a triangular PIM for "CJS" type models. Thus, for parameters with
"Square" PIMS the cohort (and Cohort) field is not generated. The cohort
field is also not created if pim.type="time"
for "Triangular" PIMS,
because that structure has the same structure for each row (cohort) and
adding cohort effects would be inappropriate.
For models with "Square" PIMS or pim.type="time"
for "Triangular"
PIMS, it is possible to create a cohort variable by defining the cohort
variable as a covariate in the capture history data and using it as a
variable for creating groups. As with all grouping variables, it is added
to the design data. Now the one caution with "Square" PIMS is that they are
all the same length. Imagine representing a triangular PIM with a set of
square PIMS with each row being a cohort. The resulting set of PIMS is now
rectangular but the lower portion of the matrix is superfluous because the
parameters represent times prior to the entry of the cohort, assuming that
the use of cohort is to represent a birth cohort. This is only problematic
for these kinds of models when the structure accomodates age and the concept
of a birth cohort. The solution to the problem is to delete the design data
for the superfluous parameters after is is created (see warning below).
For example, let us presume that you used cohort with 3 levels
as a grouping variable for a model with "Square" PIMS which has 3 occasions.
Then, the PIM structure would look as follows for Phi:
Phi 1 2 3 4 5 6 7 8 9
. If each row represented a cohort that entered at occasions 1,2,3 then parameters 4,7,8 are superfluous or could be thought of as representing cells that are structural zeros in the model because no observations can be made of those cohorts at those times.
After creating the design data, the unneeded rows can be deleted with R
commands or you can use the argument remove.unused=TRUE
. As an
example, a code segment might look as follows if chdata
was defined
properly:
mydata=process.data(chdata,model="POPAN",groups="cohort")
ddl=make.design.data(mydata) ddl$Phi=ddl$Phi[-c(4,7,8),]
If cohort and time were suitably defined an easier solution especially for a larger problem would be
ddl$Phi=ddl$Phi[as.numeric(ddl$Phi$time)>=as.numeric(ddl$Phi$cohort),]
Which would only keep parameters in which the time is the same or greater
than the cohort. Note that time and cohort would be factor variables and <
and > do not make sense which is the reason for the as.numeric
which
translates the factor to a numeric ordering of factors (1,2,...) but not the
numeric value of the factor level (e.g., 1987,1998). Thus, the above
assumes that both time and cohort have the same factor levels. The design
data is specific to each parameter, so the unneeded parameters need to be
deleted from design data of each parameter.
However, all of this is done automatically by setting the argument
remove.unused=TRUE
. It functions differently depending on the type
of PIM structure. For models with "Triangular" PIMS, unused design data are
determined based on the lack of a release cohort. For example, if there
were no capture history data that started with 0 and had a 1 in the second
position ("01.....") that would mean that there were no releases on occasion
2 and row 2 in the PIM would not be needed so it would be removed from the
design data. If remove.unused=TRUE
the design data are removed for
any missing cohorts within each group. For models with "Square" PIMS, cohort
structure is defined by a grouping variable. If there is a field named
"cohort" within the design data, then unused design data are defined to
occur when time < cohort. This is particularly useful for age structured
models which define birth cohorts. In that case there will be sampling
times prior to the birth of the cohort which are not relevant and should be
treated as "structural zeros" and not as a zero based on stochastic events.
If the design data are removed, when the model is constructed with
make.mark.model
, the argument default.fixed
controls
what happens with the real parameters defined by the missing design data.
If default.fixed=TRUE
, then the real parameters are fixed at values
that explain with certainty the observed data (e.g., p=0). That is
necessary for models with "Square" PIMS (eg, POPAN and Pradel models) that
include each capture-history position in the probability calculation. For
"Triangular" PIMS with "CJS" type models, the capture(encounter) history
probability is only computed for occasions past the first "1", the release.
Thus, when a cohort is missing there are no entries and the missing design
data are truly superfluous and default.fixed=FALSE
will assign the
missing design data to a row in the design matrix which has all 0s. That
row will show as a real parameter of (0.5 for a logit link) but it is not
included in any parameter count and does not affect any calculation. The
advantage in using this approach is that the R code recognizes these and
displays blanks for these missing parameters, so it makes for more readable
output when say every other cohort is missing. See
make.mark.model
for more information about deleted design data
and what this means to development of model formula.
For design data of "Multistrata" models, additional fields are added to
represent strata. A separate PIM is created for each stratum for each
parameter and this is denoted in the design data with the addition of the
factor variable stratum
which has a level for each stratum. In
addition, for each stratum a dummy variable is created with the name of the
stratum (strata.label
)and it has value 1 when the parameter is for
that stratum and 0 otherwise. Using these variables with the interaction
operator ":" in formula allows more flexibility in creating model structure
for some strata and not others. All "Multistrata" models contain "Psi"
parameters which represent the transitions from a stratum to all other
strata. Thus if there are 3 strata, there are 6 PIMS for the "Psi"
parameters to represent transition from A to B, A to C, B to A, B to C, C to
A and C to B. The "Psi" parameters are represented by multimonial logit
links and the probability of remaining in the stratum is determined by
substraction. To represent these differences, a factor variable
tostratum
is created in addition to stratum
. Likewise, dummy
variables are created for each stratum with names created by pasting "to"
and the strata label (e.g., toA, toB etc). Some examples of using these
variables to create models for "Psi" are given in
make.mark.model
.
######WARNING########
Deleting design data for mlogit parameters like Psi in the multistate
model can fail if you do things like delete certain transitions. Deleting
design data is no longer allowed. It is better
to add the field fix. It should be assigned the value NA for parameters that
are estimated and a fixed real value for those that are fixed. Here is an example
with the mstrata data example:
data(mstrata)
# deleting design data approach to fix Psi A to B to 0 (DON'T use this approach)
dp=process.data(mstrata,model="Multistrata")
ddl=make.design.data(dp)
ddl$Psi=ddl$Psi[!(ddl$Psi$stratum=="A" & ddl$Psi$tostratum=="B"),]
ddl$Psi
summary(mark(dp,ddl,output=FALSE,delete=TRUE),show.fixed=TRUE)
#new approach using fix to set Phi=1 for time 2 (USE this approach)
ddl=make.design.data(dp)
ddl$Psi$fix=NA
ddl$Psi$fix[ddl$Psi$stratum=="A" & ddl$Psi$tostratum=="B"]=0
ddl$Psi
summary(mark(dp,ddl,output=FALSE,delete=TRUE),show.fixed=TRUE)
process.data
,merge_design.covariates
,
add.design.data
, make.mark.model
,
run.mark.model
data(example.data)
proc.example.data=process.data(example.data)
ddl=make.design.data(proc.example.data)
ddl=add.design.data(proc.example.data,ddl,parameter="Phi",type="age",
bins=c(0,.5,1,2,8),name="agebin")
ddl=add.design.data(proc.example.data,ddl,parameter="p",type="time",
bins=c(1980,1983,1986),name="timebin")
Run the code above in your browser using DataLab