This function organizes input and output for estimation of trend across time for a series of samples (for categorical and continuous variables). Trend is estimated using the analytical procedure identified by the model arguments. For categorical variables, the choices for the model_cat argument are: (1) simple linear regression, (2) weighted linear regression, and (3) generalized linear mixed-effects model. For continuous variables, the choices for the model_cont argument are: (1) simple linear regression, (2) weighted linear regression, and (3) linear mixed-effects model. The analysis data, dframe, can be either a data frame or a simple features (sf) object. If an sf object is used, coordinates are extracted from the geometry column in the object, arguments xcoord and ycoord are assigned values "xcoord" and "ycoord", respectively, and the geometry column is dropped from the object.
trend_analysis(
  dframe,
  vars_cat = NULL,
  vars_cont = NULL,
  subpops = NULL,
  model_cat = "SLR",
  cat_rhs = NULL,
  model_cont = "LMM",
  cont_rhs = NULL,
  siteID = "siteID",
  yearID = "year",
  weight = "weight",
  xcoord = NULL,
  ycoord = NULL,
  stratumID = NULL,
  clusterID = NULL,
  weight1 = NULL,
  xcoord1 = NULL,
  ycoord1 = NULL,
  sizeweight = FALSE,
  sweight = NULL,
  sweight1 = NULL,
  fpc = NULL,
  popsize = NULL,
  invprboot = TRUE,
  nboot = 1000,
  vartype = "Local",
  jointprob = "overton",
  conf = 95,
  All_Sites = FALSE
)
The analysis results. A list composed of two data frames containing trend estimates for all combinations of population Types, subpopulations within Types, and response variables. For categorical variables, trend estimates are calculated for each category of the variable. The two data frames in the output list are:
catsum
data frame containing trend estimates for categorical variables
contsum
data frame containing trend estimates for continuous variables
For the SLR and WLR model options, the data frame contains the following variables:
subpopulation (domain) name
subpopulation name within a domain
response variable
trend estimate
trend standard error
trend xx% (default 95%) lower confidence bound
trend xx% (default 95%) upper confidence bound
trend p-value
intercept estimate
intercept standard error
intercept xx% (default 95%) lower confidence bound
intercept xx% (default 95%) upper confidence bound
intercept p-value
R-squared value
adjusted R-squared value
For the GLMM and LMM model options, contents of the data frames will vary depending on the model specified by arguments cat_rhs and cont_rhs. For the default PO model, the data frame contains the following variables:
subpopulation (domain) name
subpopulation name within a domain
response variable
trend estimate
trend standard error
trend xx% (default 95%) lower confidence bound
trend xx% (default 95%) upper confidence bound
trend p-value
intercept estimate
intercept standard error
intercept xx% (default 95%) lower confidence bound
intercept xx% (default 95%) upper confidence bound
intercept p-value
variance of the site intercepts
variance of the site trends
correlation of site intercepts and site trends
year variance
residual variance
generalized Akaike Information Criterion
dframe
Data to be analyzed (analysis data). A data frame or sf object containing survey design variables, response variables, and subpopulation (domain) variables.
vars_cat
Vector composed of character values that identify the names of categorical response variables in dframe. If argument model_cat equals "GLMM", the categorical variables in the dframe data frame must be factors, each of which has two levels, where the second level will be assumed to specify "success". The default value is NULL.
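As a minimal sketch of the two-level factor requirement for the "GLMM" option, the snippet below builds a factor whose second level is the category to be treated as "success". The variable name Condition and its category labels are hypothetical and used only for illustration.

```r
# Hypothetical raw categorical response (names and values are illustrative)
condition_raw <- c("Poor", "Good", "Good", "Poor", "Good")

# Set the level order explicitly so that the second level ("Good")
# is the one treated as "success"
Condition <- factor(condition_raw, levels = c("Poor", "Good"))

levels(Condition)   # first level, then the "success" level
nlevels(Condition)  # number of levels in the factor
```

Setting levels explicitly avoids relying on the default alphabetical ordering, which would otherwise decide which category counts as "success".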
vars_cont
Vector composed of character values that identify the names of continuous response variables in dframe. The default value is NULL.
subpops
Vector composed of character values that identify the names of subpopulation (domain) variables in dframe. If a value is not provided, the value "All_Sites" is assigned to the subpops argument and a factor variable named "All_Sites" that takes the value "All Sites" is added to dframe. The default value is NULL.
model_cat
Character value identifying the analytical procedure used for trend estimation for categorical variables. The choices are: "SLR" (simple linear regression), "WLR" (weighted linear regression), and "GLMM" (generalized linear mixed-effects model). The default value is "SLR".
cat_rhs
Character value specifying the right hand side of the formula for a generalized linear mixed-effects model. If a value is not provided, the argument is assigned a value that specifies the Piepho and Ogutu (2002) model. The default value is NULL.
model_cont
Character value identifying the analytical procedure used for trend estimation for continuous variables. The choices are: "SLR" (simple linear regression), "WLR" (weighted linear regression), and "LMM" (linear mixed-effects model). The default value is "LMM".
cont_rhs
Character value specifying the right hand side of the formula for a linear mixed-effects model. If a value is not provided, the argument is assigned a value that specifies the Piepho and Ogutu (2002) model. The default value is NULL.
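To illustrate what a right-hand-side string might look like, the sketch below writes one in lme4 formula syntax with a fixed trend, correlated random site intercepts and slopes, and a random year effect. The exact default string used internally is not shown in this documentation, so the string here is an assumption, as are the column names siteID, yearID, and ContVar. Only Wyear is documented as a variable that the function creates internally.

```r
# Assumed right-hand side in lme4 syntax (illustrative, not the verified default):
#   Wyear                 fixed trend (slope)
#   (1 + Wyear | siteID)  correlated random intercept and slope for each site
#   (1 | yearID)          random year effect
my_cont_rhs <- "Wyear + (1 + Wyear | siteID) + (1 | yearID)"

# Combine with a hypothetical response name to check that the string parses
model_formula <- as.formula(paste("ContVar ~", my_cont_rhs))
all.vars(model_formula)
```

A string of this form would be supplied as cont_rhs (or cat_rhs for the "GLMM" option), with the grouping variable names changed to match the columns in dframe.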
siteID
Character value providing name of the site ID variable in dframe. If repeat visit sites are present, the site ID value for each revisit site will be the same for each survey. For a two-stage sample, the site ID variable identifies stage two site IDs. The default value is "siteID".
yearID
Character value providing name of the time period variable in dframe, which must be numeric and will be forced to numeric if it is not. The default assumption is that the time period variable is years. The default value is "year".
weight
Character value providing name of the design weight variable in dframe. For a two-stage sample, the weight variable identifies stage two weights. The default value is "weight".
xcoord
Character value providing name of the x-coordinate variable in dframe. For a two-stage sample, the x-coordinate variable identifies stage two x-coordinates. Note that x-coordinates are required for calculation of the local mean variance estimator. If dframe is an sf object, this argument is not required (as the geometry column in dframe is used to find the x-coordinate). The default value is NULL.
ycoord
Character value providing name of the y-coordinate variable in dframe. For a two-stage sample, the y-coordinate variable identifies stage two y-coordinates. Note that y-coordinates are required for calculation of the local mean variance estimator. If dframe is an sf object, this argument is not required (as the geometry column in dframe is used to find the y-coordinate). The default value is NULL.
stratumID
Character value providing name of the stratum ID variable in dframe. The default value is NULL.
clusterID
Character value providing name of the cluster (stage one) ID variable in dframe. Note that cluster IDs are required for a two-stage sample. The default value is NULL.
weight1
Character value providing name of the stage one weight variable in dframe. The default value is NULL.
xcoord1
Character value providing name of the stage one x-coordinate variable in dframe. Note that x-coordinates are required for calculation of the local mean variance estimator. The default value is NULL.
ycoord1
Character value providing name of the stage one y-coordinate variable in dframe. Note that y-coordinates are required for calculation of the local mean variance estimator. The default value is NULL.
sizeweight
Logical value that indicates whether size weights should be used during estimation, where TRUE = use size weights and FALSE = do not use size weights. To employ size weights for a single-stage sample, a value must be supplied for argument weight. To employ size weights for a two-stage sample, values must be supplied for arguments weight and weight1. The default value is FALSE.
sweight
Character value providing name of the size weight variable in dframe. For a two-stage sample, the size weight variable identifies stage two size weights. The default value is NULL.
sweight1
Character value providing name of the stage one size weight variable in dframe. The default value is NULL.
fpc
Object that specifies values required for calculation of the finite population correction factor used during variance estimation. The object must match the survey design in terms of stratification and whether the design is single-stage or two-stage. For an unstratified design, the object is a vector. The vector is composed of a single numeric value for a single-stage design. For a two-stage unstratified design, the object is a named vector containing one more than the number of clusters in the sample, where the first item in the vector specifies the number of clusters in the population and each subsequent item specifies the number of stage two units for the cluster. The name for the first item in the vector is arbitrary. Subsequent names in the vector identify clusters and must match the cluster IDs. For a stratified design, the object is a named list of vectors, where names must match the strata IDs. For each stratum, the format of the vector is identical to the format described for unstratified single-stage and two-stage designs. Note that the finite population correction factor is not used with the local mean variance estimator.
Example fpc for a single-stage unstratified survey design:
fpc <- 15000
Example fpc for a single-stage stratified survey design:
fpc <- list(
  Stratum_1 = 9000,
  Stratum_2 = 6000)
Example fpc for a two-stage unstratified survey design:
fpc <- c(
  Ncluster = 150,
  Cluster_1 = 150,
  Cluster_2 = 75,
  Cluster_3 = 75,
  Cluster_4 = 125,
  Cluster_5 = 75)
Example fpc for a two-stage stratified survey design:
fpc <- list(
  Stratum_1 = c(
    Ncluster_1 = 100,
    Cluster_1 = 125,
    Cluster_2 = 100,
    Cluster_3 = 100,
    Cluster_4 = 125,
    Cluster_5 = 50),
  Stratum_2 = c(
    Ncluster_2 = 50,
    Cluster_1 = 75,
    Cluster_2 = 150,
    Cluster_3 = 75,
    Cluster_4 = 75,
    Cluster_5 = 125))
popsize
Object that provides values for the population argument of the calibrate or postStratify functions in the survey package. If a value is provided for popsize, then either the calibrate or postStratify function is used to modify the survey design object that is required by functions in the survey package. Whether to use the calibrate or postStratify function is dictated by the format of popsize, which is discussed below. Post-stratification adjusts the sampling and replicate weights so that the joint distribution of a set of post-stratifying variables matches the known population joint distribution. Calibration, generalized raking, or GREG estimators generalize post-stratification and raking by calibrating a sample to the marginal totals of variables in a linear regression model. For the calibrate function, the object is a named list, where the names identify factor variables in dframe. Each element of the list is a named vector containing the population total for each level of the associated factor variable. For the postStratify function, the object is either a data frame, table, or xtabs object that provides the population total for all combinations of selected factor variables in the dframe data frame. If a data frame is used for popsize, the variable containing population totals must be the last variable in the data frame. If a table is used for popsize, the table must have named dimnames where the names identify factor variables in the dframe data frame. If the popsize argument is equal to NULL, then neither calibration nor post-stratification is performed. The default value is NULL.
Example popsize for calibration:
popsize <- list(
  Ecoregion = c(
    East = 750,
    Central = 500,
    West = 250),
  Type = c(
    Streams = 1150,
    Rivers = 350))
Example popsize for post-stratification using a data frame:
popsize <- data.frame(
  Ecoregion = rep(c("East", "Central", "West"), rep(2, 3)),
  Type = rep(c("Streams", "Rivers"), 3),
  Total = c(575, 175, 400, 100, 175, 75))
Example popsize for post-stratification using a table:
popsize <- with(MySurveyFrame,
  table(Ecoregion, Type))
Example popsize for post-stratification using an xtabs object:
popsize <- xtabs(~Ecoregion + Type,
  data = MySurveyFrame)
invprboot
Logical value that indicates whether the inverse probability bootstrap procedure is used to calculate trend parameter estimates. This bootstrap procedure is only available for the "LMM" option for continuous variables. Inverse probability references the design weights, which are the inverse of the sample inclusion probabilities. The default value is TRUE.
nboot
Numeric value for the number of bootstrap iterations. The default is 1000.
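The weighted resampling idea behind the inverse probability bootstrap can be sketched with the boot function from the boot package, passing design weights through its weights argument. This is not the internal bootfcn: a simple mean stands in for the mixed-model fit, and the data frame sites, its columns, and the helper mean_stat are hypothetical.

```r
library(boot)

set.seed(1)

# Hypothetical site-level data: one response and one design weight per site
sites <- data.frame(
  y = rnorm(30, mean = 10),
  wgt = runif(30, 10, 100)
)

# Statistic in the form boot() expects: the data plus resampled row indices.
# A mean is used here as a stand-in for refitting the trend model.
mean_stat <- function(data, indices) mean(data$y[indices])

# Rows are resampled with probability proportional to the supplied weights
bt <- boot(sites, statistic = mean_stat, R = 200, weights = sites$wgt)

dim(bt$t)  # one row per bootstrap replicate, one column per statistic
```

In this sketch the number of replicates R plays the role that nboot plays in trend_analysis.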
vartype
Character value providing choice of the variance estimator, where "Local" = the local mean estimator, "SRS" = the simple random sampling estimator, "HT" = the Horvitz-Thompson estimator, and "YG" = the Yates-Grundy estimator. The default value is "Local".
jointprob
Character value providing choice of joint inclusion probability approximation for use with Horvitz-Thompson and Yates-Grundy variance estimators, where "overton" indicates the Overton approximation, "hr" indicates the Hartley-Rao approximation, and "brewer" indicates the Brewer approximation. The default value is "overton".
conf
Numeric value for the Gaussian-based confidence level. The default is 95.
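As a rough illustration of how a Gaussian-based confidence level turns into bounds, the sketch below converts conf into a standard normal multiplier and applies it to an estimate and its standard error. The estimate and standard error are made-up numbers, and the function's internal computation is assumed rather than confirmed to follow this form.

```r
conf <- 95

# Hypothetical trend estimate and standard error (illustrative values only)
trend_est <- 0.08
trend_se <- 0.02

# Two-sided standard normal multiplier for the requested confidence level
mult <- qnorm(0.5 + (conf / 100) / 2)

lower <- trend_est - mult * trend_se
upper <- trend_est + mult * trend_se
c(lower = lower, upper = upper)
```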
All_Sites
A logical variable used when subpops is not NULL. If All_Sites is TRUE, then alongside the subpopulation output, output for all sites (ignoring subpopulations) is returned for each variable in vars_cat and vars_cont. If All_Sites is FALSE, output for all sites is not returned. The default is FALSE.
Tom Kincaid Kincaid.Tom@epa.gov
For the simple linear regression (SLR) model, a design-based estimate of the category proportion (categorical variables) or the mean (continuous variables) is calculated for each time period (year). Four choices of variance estimator are available for calculating variance of the design-based estimates: (1) the local mean estimator, (2) the simple random sampling estimator, (3) the Horvitz-Thompson estimator, and (4) the Yates-Grundy estimator. For the Horvitz-Thompson and Yates-Grundy estimators, there are three choices for calculating joint inclusion probabilities: (1) the Overton approximation, (2) the Hartley-Rao approximation, and (3) the Brewer approximation. The lm function in the stats package is used to fit a linear model using a formula argument that specifies the proportion or mean estimates as the response variable and years as the regressor variable. For fitting the SLR model, the yearID variable from the dframe argument is modified by subtracting the minimum value of years from all values of the variable. Parameter estimates are extracted from the object returned by the lm function.
For the weighted linear regression (WLR) model, the process is the same as the SLR model except that the inverse of the variances of the proportion or mean estimates is used as the weights argument in the call to the lm function.
For the LMM option, the lmer function in the lme4 package is used to fit a linear mixed-effects model for trend across years. For both the GLMM and LMM options, the default Piepho and Ogutu (PO) model includes fixed effects for intercept and trend (slope) and random effects for intercept and trend for individual sites, where the siteID variable from the dframe argument identifies sites. Correlation between the random effects for site intercepts and site trends is included in the model. Finally, the PO model contains random effects for year variance and residual variance. For the GLMM and LMM options, arguments cat_rhs and cont_rhs, respectively, can be used to specify the right hand side of the model formula. Internally, a variable named Wyear is created that is useful for specifying the cat_rhs and cont_rhs arguments. The Wyear variable is created by subtracting the minimum value of the yearID variable from all values of the variable.
If argument invprboot is FALSE, parameter estimates are extracted from the object returned by the lmer function. If argument invprboot is TRUE, the boot function in the boot package is used to generate bootstrap replicates using a function named bootfcn as the statistic argument passed to the boot function. For each bootstrap replicate, bootfcn calls the glmer or lmer function, as appropriate, using the specified model. Design weights identified by the weight argument for the trend_analysis function are passed as the weights argument for the boot function, which specifies importance weights. Using the design weights as the weights argument ensures that bootstrap replicates are representative of the survey population. Parameter estimates are calculated using the object returned by the boot function.
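The SLR and WLR fitting steps described above can be sketched with lm alone, assuming the per-year design-based estimates and their variances are already in hand. The data frame est and its values are hypothetical. In the function itself these estimates come from the survey design, which this sketch does not reproduce.

```r
# Hypothetical per-year design-based estimates and their variances
est <- data.frame(
  year = c(2000, 2005, 2010, 2015, 2020),
  estimate = c(10.2, 10.6, 10.9, 11.5, 11.8),
  variance = c(0.04, 0.05, 0.03, 0.06, 0.04)
)

# Shift years so that the first time period is zero
est$Wyear <- est$year - min(est$year)

# SLR: estimates regressed on shifted years
slr <- lm(estimate ~ Wyear, data = est)

# WLR: same model, weighted by the inverse of the estimate variances
wlr <- lm(estimate ~ Wyear, data = est, weights = 1 / variance)

# Trend (slope) and intercept estimates, standard errors, and p-values
coef(summary(slr))
coef(summary(wlr))
```

Because years are shifted to start at zero, the intercept is the fitted value for the first time period rather than for year zero.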
change_analysis for change analysis
# Example using a categorical variable with three resource classes and a
# continuous variable
mydframe <- data.frame(
  siteID = rep(paste0("Site", 1:40), rep(5, 40)),
  yearID = rep(seq(2000, 2020, by = 5), 40),
  wgt = rep(runif(40, 10, 100), rep(5, 40)),
  xcoord = rep(runif(40), rep(5, 40)),
  ycoord = rep(runif(40), rep(5, 40)),
  All_Sites = rep("All Sites", 200),
  Region = sample(c("North", "South"), 200, replace = TRUE),
  Resource_Class = sample(c("Good", "Fair", "Poor"), 200, replace = TRUE),
  ContVar = rnorm(200, 10, 1)
)
myvars_cat <- c("Resource_Class")
myvars_cont <- c("ContVar")
mysubpops <- c("All_Sites", "Region")
trend_analysis(
  dframe = mydframe,
  vars_cat = myvars_cat,
  vars_cont = myvars_cont,
  subpops = mysubpops,
  model_cat = "WLR",
  model_cont = "SLR",
  siteID = "siteID",
  yearID = "yearID",
  weight = "wgt",
  xcoord = "xcoord",
  ycoord = "ycoord"
)