dfuncEstim: Estimate a detection function from distance-sampling data

Description

Fit a specific detection function to off-transect or off-point (radial) distances using maximum likelihood. Distance functions are fitted to individual distance observations, not histogram bin heights, despite plot methods that draw histogram bars.

Usage

dfuncEstim(
  formula,
  detectionData,
  siteData,
  likelihood = "halfnorm",
  pointSurvey = FALSE,
  w.lo = units::set_units(0, "m"),
  w.hi = NULL,
  expansions = 0,
  series = "cosine",
  x.scl = units::set_units(0, "m"),
  g.x.scl = 1,
  observer = "both",
  warn = TRUE,
  transectID = NULL,
  pointID = "point",
  outputUnits = NULL,
  control = RdistanceControls()
)

Value

An object of class 'dfunc'. Objects of class 'dfunc' are lists containing the following components:

parameters: The vector of estimated parameter values. Length of this vector for built-in likelihoods is one (for the function's parameter) plus the number of expansion terms plus one if the likelihood is either 'hazrate' or 'uniform' (hazrate and uniform have two parameters).
varcovar: The variance-covariance matrix for coefficients of the distance function, estimated by the inverse of the Hessian of the fit evaluated at the estimates. This matrix is not guaranteed to be positive-definite and should be viewed with caution. Error estimates derived from bootstrapping are generally more reliable.
loglik: The maximized value of the log likelihood (more specifically, the minimized value of the negative log likelihood).
convergence: The convergence code. This code is returned by optim. Values other than 0 indicate suspect convergence.
like.form: The name of the likelihood. This is the value of the argument likelihood.
w.lo: Left-truncation value used during the fit.
w.hi: Right-truncation value used during the fit.
detections: A data frame of detections within the strip or circle used in the fit. Column 'dist' contains the observed distances. Column 'groupSize' contains group sizes associated with the values of 'dist'. Group sizes are only used in abundEstim. This data frame contains only distances between w.lo and w.hi. Another component of the returned object, model.frame, contains all observations in the input data, including those outside the strip.
covars: Either NULL if no covariates are included in the detection function, or a model.matrix containing the covariates used in the fit. Factors in the model.matrix version have been expanded into 0-1 indicator variables based on the R contrasts in effect at the time of the call. Only covariates associated with distances inside the strip or circle are included.
model.frame: A model.frame object containing observed distances (the 'response'), covariates specified in the formula, and group sizes if they were specified. If specified, the name of the group size column is "offset(-variable-)", not "groupsize(-variable-)", because internally it is easier to treat group sizes as an offset in the model. This component is a proper model.frame and contains both 'terms' and 'contrasts' attributes.
siteID.cols: A vector containing the transect ID column names in detectionData and siteData. Transect IDs can be a composite of two or more columns and hence this component can have length greater than 1.
expansions: The number of expansion terms used during estimation.
series: The type of expansion used during estimation.
call: The original call of this function.
call.x.scl: The input or user requested distance at which the distance function is scaled.
call.g.x.scl: The input value specifying the height of the distance function at a distance of call.x.scl.
call.observer: The value of input parameter observer. The input observer parameter is only applicable when g.x.scl is a data frame.
fit: The fitted object returned by optim. See documentation for optim.
factor.names: The names of any factors in formula.
pointSurvey: The input value of pointSurvey. This is TRUE if distances are radial from a point. FALSE if distances are perpendicular off-transect.
formula: The formula specified for the detection function.
control: A list containing values of the 'control' parameters set by RdistanceControls.
outputUnits: The measurement units used for output. All distance measurements are converted to these units internally.
x.scl: The actual distance at which the distance function is scaled to some value, i.e., the actual x at which g(x) = g.x.scl. Note that call.x.scl = x.scl unless call.x.scl == "max", in which case x.scl is the distance at which g() is maximized.
g.x.scl: The actual height of the distance function at a distance of x.scl. Note that g.x.scl = call.g.x.scl unless call.g.x.scl is a multiple observer data frame, in which case g.x.scl is the actual height of the distance function at x.scl computed from the multiple observer data frame.

Arguments

formula

A standard formula object (e.g., dist ~ 1, dist ~ covar1 + covar2). The left-hand side (before ~) is the name of the vector containing distances (off-transect or radial). The right-hand side (after ~) contains the names of covariate vectors to fit in the detection function. Covariates can be either detection level and appear in detectionData or transect level and appear in siteData. Regular R scoping rules apply.

Group Sizes: Non-unity group sizes are specified using groupsize() in the formula. That is, when group sizes are not all 1, they must be entered as a column in detectionData and specified using groupsize() as part of formula. For example, d ~ habitat + groupsize(number) specifies that distances appear in variable d, one covariate named habitat is to be fitted, and column number contains the number of individuals associated with each detection. If group sizes are not specified, all group sizes are assumed to be 1.
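As an illustration of the groupsize() syntax, the sketch below fits a detection function to the package's sparrow example data. It assumes that sparrowDetectionData contains columns named dist and groupsize; check names(sparrowDetectionData) before relying on those names.

```r
library(Rdistance)
data(sparrowDetectionData)

# Assumed column names: 'dist' (distances with units) and 'groupsize'
fitGroups <- dfuncEstim(
  formula = dist ~ 1 + groupsize(groupsize),
  detectionData = sparrowDetectionData
)

# Group sizes do not change the fitted distance function;
# they are stored for later use by abundEstim
fitGroups
```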

detectionData

A data frame containing detection distances (either perpendicular for line-transect or radial for point-transect designs), with one row per detected object or group. This data frame must contain at least the following information:

Detection Distances: A single column containing detection distances must be specified on the left-hand side of formula. As of Rdistance version 3.0.0, the detection distances must have measurement units attached. Attach measurement units to distances using units()<- from the units package. For example, library(units) followed by units(df$dist) <- "m" or units(df$dist) <- "ft" will work. Alternatively, df$dist <- units::set_units(df$dist, "m") also works.
Site IDs: The ID of the transect or point (i.e., the 'site') where each object or group was detected. The site ID column(s) (see arguments transectID and pointID) must specify the site (transect or point) so that this data frame can be merged with siteData.
In a later release, Rdistance will allow detection-level covariates. When that happens, detection-level covariates will appear in this data frame.

See example data set sparrowDetectionData. See also Input data frames below for information on when detectionData and siteData are required inputs.

siteData

A data.frame containing site (transect or point) IDs and any site level covariates to include in the detection function. Every unique surveyed site (transect or point) is represented on one row of this data set, whether or not targets were sighted at the site. See arguments transectID and pointID for an explanation of the way in which distance and site data frames are merged. See section Relationship between data frames (transect and point ID's) for additional details.

See Data frame requirements for situations in which detectionData only, detectionData and siteData, or neither are required.

likelihood

String specifying the likelihood to fit. Built-in likelihoods at present are "uniform", "halfnorm", "hazrate", "negexp", and "Gamma". See the vignette for a way to use user-defined likelihoods.

pointSurvey

A logical scalar specifying whether input data come from point-transect surveys (TRUE), or line-transect surveys (FALSE).

w.lo

Lower or left-truncation limit of the distances in distance data. This is the minimum possible off-transect distance. Default is 0. If w.lo is greater than 0, it must be assigned measurement units using units(w.lo) <- "<units>" or w.lo <- units::set_units(w.lo, "<units>"). See examples in the help for set_units.

w.hi

Upper or right-truncation limit of the distances in dist. This is the maximum off-transect distance that could be observed. If unspecified (i.e., NULL), right-truncation is set to the maximum of the observed distances. If w.hi is specified, it must have associated measurement units. Assign measurement units using units(w.hi) <- "<units>" or w.hi <- units::set_units(w.hi, "<units>"). See examples in the help for set_units.

expansions

A scalar specifying the number of terms in series to compute. Depending on the series, this could be 0 through 5. The default of 0 equates to no expansion terms of any type. No expansion terms are allowed (i.e., expansions is forced to 0) if covariates are present in the detection function (i.e., right-hand side of formula includes something other than 1).

series

If expansions > 0, this string specifies the type of expansion to use. Valid values at present are 'simple', 'hermite', and 'cosine'.

x.scl

The x coordinate (a distance) at which to scale the sightability function to g.x.scl, or the string "max". When x.scl is specified (i.e., not 0 or "max"), it must have measurement units assigned using either library(units); units(x.scl) <- "<units>" or x.scl <- units::set_units(x.scl, "<units>"). See units::valid_udunits() for valid symbolic units. See Details for more on scaling the sightability function.

g.x.scl

Height of the distance function at coordinate x.scl. The distance function will be scaled so that g(x.scl) = g.x.scl. If g.x.scl is not a data frame, it must be a numeric value (vector of length 1) between 0 and 1. See Details.

observer

A numeric scalar or text string specifying whether observer 1 or observer 2 or both were full-time observers. This parameter dictates which set of observations form the denominator of a double observer system. If, for example, observer 2 was a data recorder and part-time observer, or if observer 2 was the pilot, set observer = 1. If observer = 1, observations by observer 1 not seen by observer 2 are ignored. The estimate of detection in this case is the ratio of number of targets seen by both observers to the number seen by both plus the number seen by just observer 2. If observer = "both", the computation goes both directions.
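The ratio described for observer = 1 can be worked through with made-up tallies. The counts and variable names below are hypothetical, and the code is plain arithmetic rather than Rdistance's internal computation.

```r
# Hypothetical double-observer tallies (illustrative numbers only)
n_both  <- 40  # targets seen by both observers
n_only1 <- 15  # seen by observer 1 only; ignored when observer = 1
n_only2 <- 10  # seen by observer 2 only

# observer = 1: targets seen by observer 2 act as the trials,
# and those also seen by observer 1 are the successes
p_obs1 <- n_both / (n_both + n_only2)
p_obs1  # 40 / 50 = 0.8
```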

warn

A logical scalar specifying whether to issue an R warning if the estimation did not converge or if one or more parameter estimates are at their boundaries. For estimation, warn should generally be left at its default value of TRUE. When computing bootstrap confidence intervals, setting warn = FALSE turns off annoying warnings when an iteration does not converge. Regardless of warn, after completion all messages about convergence and boundary conditions are printed by print.dfunc, print.abund, and plot.dfunc.

transectID

A character vector naming the transect ID column(s) in detectionData and siteData. If transects are not identified in columns named 'siteID' (the default for both data frames), you need to specify which column(s) uniquely identify transects. transectID can have length greater than 1, in which case unique transects are identified by the composite columns.

pointID

When point-transects are used, this is the ID of points on a transect. When pointSurvey=TRUE, the combination of transectID and pointID specifies unique sampling sites. See Input data frames.

If single points are surveyed, meaning surveyed points were not grouped into transects, each 'transect' consists of one point. In this case, set transectID equal to the point's ID and set pointID equal to 1 for all points.

outputUnits

A string giving the symbolic measurement units that results should be reported in. Any distance measurement unit in units::valid_udunits() will work. The strings for common distance symbolic units are: "m" for meters, "ft" for feet, "cm" for centimeters, "mm" for millimeters, "mi" for miles, "nmile" for nautical miles ("nm" is nanometers), "in" for inches, "yd" for yards, "km" for kilometers, "fathom" for fathoms, "chains" for chains, and "furlong" for furlongs. If outputUnits is unspecified (NULL), output units are the same as the distance measurement units in the data.
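To see what a change of output units amounts to, the following sketch converts a few made-up distances between unit strings from the list above using the units package directly. The values are hypothetical, and this shows only the conversion, not a call to dfuncEstim.

```r
library(units)

# Hypothetical distances recorded in meters
d_m <- set_units(c(10, 250), "m")

# The same distances expressed in other symbolic units
d_ft <- set_units(d_m, "ft")
d_km <- set_units(d_m, "km")

d_ft
d_km
```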

control

A list containing optimization control parameters such as the maximum number of iterations, tolerance, the optimizer to use, etc. See the RdistanceControls function for explanation of each value, the defaults, and the requirements for this list. See examples below for how to change controls.
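A minimal sketch of passing controls follows. It inspects the default list to see which values can be changed, then sets requireUnits, which is the only control named in this help page. Other argument names should be taken from the str() listing or from the RdistanceControls help rather than guessed.

```r
library(Rdistance)
data(sparrowDetectionData)

# Inspect the default controls to see which values are available
ctrl <- RdistanceControls()
str(ctrl)

# Set a control explicitly (requireUnits is named in this help page)
ctrl <- RdistanceControls(requireUnits = TRUE)

fitCtrl <- dfuncEstim(
  formula = dist ~ 1,
  detectionData = sparrowDetectionData,
  control = ctrl
)
```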

Transect types

Rdistance accommodates two kinds of transects: continuous and point. On continuous transects detections can occur at any point along the route, and these are line-transects. On point transects detections can only occur at a series of stops (points), and these are point-transects. Transects are the basic sampling unit in both cases. Columns named in transectID are sufficient to specify unique line-transects. The combination of transectID and pointID specifies unique sampling locations along point-transects. See Input data frames below for more detail.

Input data frames

To save space and to easily specify sites without detections, all site ID's, regardless of whether a detection occurred there, and site level covariates are stored in the siteData data frame. Detection distances and group sizes are measured at the detection level and are stored in the detectionData data frame.

Data frame requirements

The following explains conditions under which various combinations of the input data frames are required.

Detection data and site data both required:
Both detectionData and siteData are required if site level covariates are specified on the right-hand side of formula. Detection level covariates are not currently allowed. Both detectionData and siteData data frames are required to estimate abundance later in abundEstim.
Detection data only required:
detectionData only is required when covariates are not included in the distance function (i.e., the right-hand side of formula is "~1" or "~groupsize(groupSize)"). Note that dfuncEstim does not need to know transect IDs (or group sizes) in order to estimate a distance function; but, group sizes and transect IDs are stored and used to estimate abundance in function abundEstim. Both the detectionData and siteData data frames are required in abundEstim.
Neither detection data nor site data required:
Neither detectionData nor siteData are required if all variables specified in formula are within the scope of dfuncEstim (e.g., in the global working environment) and abundance estimates are not required. Regular R scoping rules apply when the call to dfuncEstim is embedded in a function. This case will produce distance functions only. Abundance cannot later be estimated because transects and transect lengths cannot be specified outside of a data frame. If abundance will be estimated, use either case 1 or 2.
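The first case (both data frames, with a site-level covariate) might look like the sketch below. It assumes the package's example data frames sparrowDetectionData and sparrowSiteData share a siteID column and that sparrowSiteData has a covariate column named observer; verify with names() on both data frames.

```r
library(Rdistance)
data(sparrowDetectionData)
data(sparrowSiteData)

# Site-level covariate 'observer' (assumed column in sparrowSiteData)
# requires both data frames so the covariate can be merged onto detections
fitObs <- dfuncEstim(
  formula = dist ~ observer,
  detectionData = sparrowDetectionData,
  siteData = sparrowSiteData,
  likelihood = "halfnorm"
)
fitObs
```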

Relationship between data frames (transect and point ID's)

The input data frames, detectionData and siteData, must be merge-able on unique sites. For line-transects, site ID's specify transects or routes and are unique values of the transectID column in siteData. In this case, the following merge must work: merge(detectionData,siteData,by=transectID).

For point-transects, site ID's specify individual points and are unique values of the combination paste(transectID,pointID). In this case, the following merge must work: merge(detectionData,siteData,by=c(transectID, pointID)).
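Both merges can be checked on toy data before calling dfuncEstim. The data frames and column names below are hypothetical and use base R only.

```r
# Line-transects: one ID column identifies a site
siteData <- data.frame(siteID = c("A", "B", "C"),
                       length = c(500, 500, 750))
detectionData <- data.frame(siteID = c("A", "A", "C"),
                            dist = c(12, 40, 7))
transectID <- "siteID"
lineMerged <- merge(detectionData, siteData, by = transectID)

# Point-transects: transect and point together identify a site
pointSites <- data.frame(transect = c("T1", "T1", "T2"),
                         point = c(1, 2, 1),
                         shrub = c(0.2, 0.5, 0.1))
pointDets <- data.frame(transect = c("T1", "T2", "T2"),
                        point = c(2, 1, 1),
                        dist = c(30, 18, 55))
pointMerged <- merge(pointDets, pointSites, by = c("transect", "point"))

# Every detection should find exactly one matching site row
nrow(lineMerged) == nrow(detectionData)
nrow(pointMerged) == nrow(pointDets)
```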

By default, transects are unique combinations of the common variables in the detectionData and siteData data frames if both data frames are specified (i.e., unique values of intersect(names(detectionData), names(siteData))). If siteData is not specified and transectID is not given, transects are assumed to be identified in a column named siteID in detectionData.

Either way (i.e., either transectID = "siteID" or specified as something else), the column(s) containing transect ID's must be correct here if abundance is to be estimated later. Routine abundEstim requires transect ID's for bootstrapping because it resamples unique values of the composite transect ID column(s). abundEstim uses the value of transectID specified here. Hence, users cannot change transect ID's between calls to dfuncEstim and abundEstim, and all transectID columns must be present in both data frames even though they may not be used until later.

An error occurs if both detectionData and siteData are specified but no common columns exist. Duplicate transectID values are not allowed in siteData but are allowed in detectionData because multiple detections can occur on a single transect or at a single site. If the same site is surveyed in multiple years, specify another level of transect ID; for example, transectID = c("year","transectID").
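The uniqueness and common-column conditions can be checked ahead of time with base R. The sketch below uses hypothetical data in which two transects are surveyed in two years, so the transect column alone is duplicated but the composite ID is not.

```r
# Hypothetical site data: transects A and B surveyed in 2021 and 2022
siteData <- data.frame(year = c(2021, 2021, 2022, 2022),
                       transectID = c("A", "B", "A", "B"),
                       length = c(500, 750, 500, 750))
detectionData <- data.frame(year = c(2021, 2022, 2022),
                            transectID = c("A", "A", "B"),
                            dist = c(12, 33, 8))

# Nonzero: the transect column alone does not identify a site row
dupSingle <- anyDuplicated(siteData["transectID"])

# Zero: year plus transect is unique, so use both as the transect ID
idCols <- c("year", "transectID")
dupComposite <- anyDuplicated(siteData[idCols])

# Columns shared by both data frames (the default merge key)
commonCols <- intersect(names(detectionData), names(siteData))
```

Here idCols is what would be passed as transectID = c("year", "transectID").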

Measurement Units

As of Rdistance version 3.0.0, measurement units are required on all distances. This includes off-transect distances, radial distances, truncation distances (w.lo and w.hi), transect lengths, and study area size. In dfuncEstim, units are required on the following: detectionData$dist; w.lo (unless it is zero); w.hi (unless it is NULL); and x.scl. In abundEstim, units are required on siteData$length and area. All units are 1-dimensional except those on area, which are 2-dimensional.

Requiring units ensures that internal calculations and results (e.g., ESW and abundance) are correct and that output units are clear. Input distances can have variable units. For example, input distances can be specified in "m", w.hi in "in", and w.lo in "km". Internally, all distances are converted to the units specified by outputUnits (or the units of input distances if outputUnits is NULL), and all output is reported in units of outputUnits.

Measurement units can be assigned using units()<- after attaching the units package or with x <- units::set_units(x, "<units>"). See units::valid_udunits() for a list of valid symbolic units.
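The two assignment styles are sketched below on made-up values, along with a 2-dimensional unit for area. Object names and numbers are hypothetical.

```r
library(units)

# Style 1: replacement function after attaching the units package
dist1 <- c(12.5, 40, 3.2)
units(dist1) <- "m"

# Style 2: set_units without relying on the attached package
dist2 <- units::set_units(c(12.5, 40, 3.2), "m")

# A truncation distance given in different units than the data
w.hi <- units::set_units(150, "ft")

# Area carries 2-dimensional units
area <- units::set_units(4105, "km^2")

dist1
dist2
```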

If measurements are truly unit-less, or measurement units are unknown, set RdistanceControls(requireUnits = FALSE). This suppresses all unit checks and conversions. Users are on their own to make sure inputs are scaled correctly and that output units are known.

References

Buckland, S.T., D.R. Anderson, K.P. Burnham, J.L. Laake, D.L. Borchers, and L. Thomas. (2001) Introduction to distance sampling: estimating abundance of biological populations. Oxford University Press, Oxford, UK.

Examples


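# Load Rdistance and the example sparrow data (line-transect survey)
library(Rdistance)
data(sparrowDetectionData)

# Fit the default half-normal detection function with no covariates
dfunc <- dfuncEstim(formula = dist ~ 1,
                    detectionData = sparrowDetectionData)

# Print the fitted object, then plot it
dfunc
plot(dfunc)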
