Select a spatially balanced sample from a point (finite), linear / linestring (infinite), or areal / polygon (infinite) sampling frame using the Generalized Random Tessellation Stratified (GRTS) algorithm. The GRTS algorithm accommodates unstratified and stratified sampling designs and allows for equal inclusion probabilities, unequal inclusion probabilities according to a categorical variable, and inclusion probabilities proportional to a positive auxiliary variable. Several additional sampling options are included, such as including legacy (historical) sites, requiring a minimum distance between sites, and selecting replacement sites. For technical details, see Stevens and Olsen (2004).
grts(
sframe,
n_base,
stratum_var = NULL,
seltype = NULL,
caty_var = NULL,
caty_n = NULL,
aux_var = NULL,
legacy_var = NULL,
legacy_sites = NULL,
legacy_stratum_var = NULL,
legacy_caty_var = NULL,
legacy_aux_var = NULL,
mindis = NULL,
maxtry = 10,
n_over = NULL,
n_near = NULL,
wgt_units = NULL,
pt_density = NULL,
DesignID = "Site",
SiteBegin = 1,
sep = "-",
projcrs_check = TRUE
)
The sampling design sites and additional information about the sampling design. More specifically, a list with five elements:
sites_legacy
An sf object containing legacy sites. This is
NULL
if legacy sites were not included in the sample.
sites_base
An sf object containing the base sites. This is NULL
if n_base
equals the number of legacy sites.
sites_over
An sf object containing the reverse hierarchically
ordered replacement sites. This is NULL
if no reverse hierarchically
ordered replacement sites were included in the sample.
sites_near
An sf object containing the nearest neighbor
replacement sites. This is NULL
if no nearest neighbor replacement
sites were included in the sample.
design
A list documenting the specifications of this sampling design.
This can be checked to verify your sampling design ran as intended.
call
The original function call.
stratum_var
The name of the stratification variable in sframe
.
This equals NULL
if no stratification is used.
stratum
The unique strata. This equals "None"
if
the sampling design is unstratified.
n_base
The base sample size per stratum.
seltype
The selection type per stratum.
caty_var
The name of the unequal probability variable in sframe
.
This equals NULL
if no unequal probability variable is used.
caty_n
The expected sample sizes for each level of the
unequal probability grouping variable per stratum. This equals
NULL
when seltype
is not "unequal"
.
aux_var
The name of the proportional probability (auxiliary) variable in sframe
.
This equals NULL
if no proportional probability variable is used.
legacy
A logical variable indicating whether legacy sites
were included in the sample.
legacy_stratum_var
The name of the stratification variable in legacy_sites
.
Omitted if legacy sites are not used. This equals NULL
if legacy sites were used but
no stratification variable is used.
legacy_caty_var
The name of the unequal probability variable in legacy_sites
.
Omitted if legacy sites are not used. This equals NULL
if legacy sites were used but
no unequal probability variable is used.
legacy_aux_var
The name of the proportional probability (auxiliary)
variable in legacy_sites
.
Omitted if legacy sites are not used. This equals NULL
if legacy sites
were used but no proportional probability variable is used.
mindis
The minimum distance requirement desired. This
is NULL
when no minimum distance requirement was applied.
n_over
The reverse hierarchically ordered replacement
site sample sizes per stratum. If seltype
is unequal
,
this represents the expected sample sizes. This is NULL
when no reverse hierarchically ordered replacement sites were selected.
n_near
The number of nearest neighbor replacement sites
desired. This is NULL
when no nearest neighbor replacement
sites were selected.
When non-NULL
, the sites_legacy
, sites_base
,
sites_over
, and sites_near
objects contain the original columns
in sframe
and include a few additional columns. These additional columns
are
siteID
A site identifier (as named using the DesignID
and SiteBegin
arguments to grts()
).
siteuse
Whether the site is a legacy site (Legacy
), base
site (Base
), reverse hierarchically ordered replacement site
(Over
), or nearest neighbor replacement site (Near
).
replsite
The replacement site ordering. replsite
is
None
if the site is not a replacement site, Next
if it is
the next reverse hierarchically ordered replacement site to use, or
Near_
, where the word following _
indicates the ordering of sites closest to
the originally sampled site.
lon_WGS84
Longitude coordinates using the WGS84 coordinate
system (EPSG:4326). Only given if coordinates are projected.
lat_WGS84
Latitude coordinates using the WGS84 coordinate
system (EPSG:4326). Only given if coordinates are projected.
X
Longitude coordinates using the provided coordinate
system. Only given if coordinates are not projected (i.e., they are geographic or NA).
Y
Latitude coordinates using the provided coordinate
system. Only given if coordinates are not projected (i.e., they are geographic or NA).
stratum
A stratum indicator. stratum
is None
if the sampling design was unstratified. If the sampling design was stratified
,
stratum
indicates the stratum.
wgt
The design weight.
ip
The site's original inclusion probability (the reciprocal of wgt
).
caty
An unequal probability grouping indicator. caty
is None
if the sampling design did not use unequal inclusion probabilities.
If the sampling design did use unequal inclusion probabilities, caty
indicates the unequal probability level.
aux
The auxiliary proportional probability variable. This
column is only returned if seltype
was proportional
in the
original sampling design.
If any columns in sframe
contain these names, those columns
from sframe
will be automatically prefixed with sframe_
in the sites
object. When output is printed, a summary of site counts by
the levels in stratum_var
and caty_var
is shown.
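To make the relationship between the wgt and ip columns concrete, here is a minimal base R sketch. The sites data frame and its values are hypothetical stand-ins for the attribute columns of a sites_base object; they are not output from grts().

```r
# Hypothetical stand-in for the attribute columns of a sites_base object.
sites <- data.frame(
  siteID = c("Site-1", "Site-2", "Site-3"),
  siteuse = "Base",
  wgt = c(4, 4, 8)
)

# ip is the reciprocal of the design weight wgt.
sites$ip <- 1 / sites$wgt
print(sites)
```

With a real design, the same columns would be read from the returned list, for example from its sites_base element.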
sframe
A sampling frame as an sf
object. The coordinate
system for sframe
must be projected (not geographic). If m or z values
are in sframe
's geometry, they are silently dropped (i.e., only x-coordinates
and y-coordinates are preserved).
n_base
The base sample size required. If the sampling design is unstratified,
this is a single numeric value. If the sampling design is stratified, this is a named
vector or list whose names represent each stratum and whose values represent each
stratum's sample size. These names must match the values of the stratification
variable represented by stratum_var
. Legacy sites are considered part
of the base sample, so the value for n_base
should be equal to the number
of legacy sites plus the number of desired non-legacy sites.
stratum_var
A character string containing the name of the column from
sframe
that identifies stratum membership for each element in sframe
.
If stratum_var equals NULL
, the sampling design is unstratified and all elements in sframe
are eligible to be selected in the sample. The default is NULL
.
seltype
A character string or vector indicating the inclusion probability type,
which must be one of the following: "equal"
for equal inclusion probabilities;
"unequal"
for unequal inclusion probabilities according to a categorical
variable specified by caty_var
; and "proportional"
for inclusion
probabilities proportional to a positive auxiliary variable specified by
aux_var
. If the sampling design is unstratified, seltype
is a single
character string. If the sampling design is stratified, seltype
is a named vector
whose names represent each stratum and whose values represent each stratum's
inclusion probability type. seltype
's default value tries to match the
intended inclusion probability type: If caty_var
and aux_var
are
not specified, seltype
is "equal"
; if caty_var
is specified,
seltype
is "unequal"
; and if aux_var
is specified, seltype
is "proportional"
.
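The default rule for seltype described above can be sketched in a few lines of base R. The helper default_seltype() is hypothetical and is not part of spsurvey; in particular, which branch wins when both caty_var and aux_var are supplied is an assumption of this sketch, since the text does not say.

```r
# Hypothetical helper mirroring the documented default for seltype.
# Assumption: caty_var is checked before aux_var.
default_seltype <- function(caty_var = NULL, aux_var = NULL) {
  if (!is.null(caty_var)) {
    "unequal"
  } else if (!is.null(aux_var)) {
    "proportional"
  } else {
    "equal"
  }
}

default_seltype()                       # "equal"
default_seltype(caty_var = "AREA_CAT")  # "unequal"
default_seltype(aux_var = "AREA")       # "proportional"
```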
caty_var
A character string containing the name of the column from
sframe
that represents the unequal probability variable.
caty_n
A named numeric vector (or a list of named numeric vectors) indicating the expected sample size for each
level of caty_var
, the unequal probability variable. If the sampling design
is unstratified, caty_n
is a named vector whose names represent each
level of caty_var
and whose values represent each level's expected
sample size. The sum of caty_n
must equal n_base
. If the sampling design
is stratified and the expected sample sizes are the same among strata, caty_n
is
a named vector whose names represent each
level of caty_var
and whose values represent each level's expected
sample size -- these expected sample sizes are applied to all strata. The sum of
caty_n
must equal each stratum's value in n_base
.
If the sampling design is stratified and the expected sample sizes differ among strata,
caty_n
is a list where each element is named as a stratum in n_base
.
Each stratum's list element is a named vector whose
names represent each level of caty_var
and whose values represent each
level's expected sample size (within the stratum). The sum of the values in each stratum's
list element must equal that stratum's value in n_base
.
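The list form of caty_n and its sum constraint can be checked before calling grts(). The sketch below uses made-up sample sizes. The guarded call at the end assumes that the spsurvey package is installed and that its bundled NE_Lakes data has an ELEV_CAT column with levels low and high and an AREA_CAT column with levels small and large.

```r
# Made-up per-stratum base sample sizes.
n_base <- c(low = 10, high = 10)

# Expected sample sizes for each caty_var level, by stratum.
caty_n <- list(
  low = c(small = 7, large = 3),
  high = c(small = 6, large = 4)
)

# Each stratum's allocation must sum to that stratum's n_base.
caty_sums <- vapply(caty_n, sum, numeric(1))
stopifnot(setequal(names(caty_sums), names(n_base)))
stopifnot(all(caty_sums == n_base[names(caty_sums)]))

# Guarded call: only runs when spsurvey is available.
if (requireNamespace("spsurvey", quietly = TRUE)) {
  library(spsurvey)
  samp_uneq <- grts(
    NE_Lakes,
    n_base = n_base,
    stratum_var = "ELEV_CAT",
    caty_var = "AREA_CAT",
    caty_n = caty_n
  )
  print(samp_uneq)
}
```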
aux_var
A character string containing the name of the column from
sframe
that represents the proportional (to size) inclusion probability
variable (auxiliary variable). This auxiliary variable must be positive, and the resulting
inclusion probabilities are proportional to the values of the auxiliary variable.
Larger values of the auxiliary variable result in higher inclusion probabilities.
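As a worked illustration of "proportional to the auxiliary variable", the sketch below applies the textbook probability-proportional-to-size scaling to made-up values. It is not claimed to be the exact computation performed inside grts(); the variable names are hypothetical.

```r
# Made-up positive auxiliary values for four frame elements.
aux <- c(1, 2, 3, 4)
n <- 2  # desired sample size

# Scale the values so the inclusion probabilities sum to n.
ip <- n * aux / sum(aux)
ip        # 0.2 0.4 0.6 0.8
sum(ip)   # 2
```

Doubling an element's auxiliary value doubles its inclusion probability relative to the others, which is the sense in which larger values lead to higher inclusion probabilities.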
legacy_var
This argument can be used instead of legacy_sites
when sframe
is a POINT
or MULTIPOINT
geometry (i.e., a finite sampling frame).
When legacy_var
is used, it is a character string containing the name of the column
from sframe
that represents whether each site is a legacy site. For
legacy sites, the values of the legacy_var
must contain character strings that
act as a legacy site identifier. For non-legacy sites, the values of the
legacy_var
column must be NA
. Using this approach,
legacy_stratum_var
, legacy_caty_var
, and legacy_aux_var
are not required and should not be used (because legacy_var
represents a column
in sframe
). spsurvey
assumes that the legacy sites were selected from
a previous sampling design that incorporated randomness into site selection
and that the legacy sites are elements of the current sampling frame.
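Because legacy sites count toward the base sample, it can help to compute n_base from the legacy column itself. The sketch below uses a made-up identifier column. The guarded call assumes that spsurvey is installed and that its NE_Lakes data carries a LEGACY column, which the code checks before use.

```r
# Hypothetical legacy identifier column: character IDs for legacy sites,
# NA for every other element of the frame.
legacy_id <- c("LEG-01", NA, NA, "LEG-02", NA, NA)

n_legacy <- sum(!is.na(legacy_id))  # number of legacy sites
n_new <- 3                          # desired non-legacy sites (made up)
n_base <- n_legacy + n_new          # value to supply as n_base

# Guarded call: only runs when spsurvey and the LEGACY column are available.
if (requireNamespace("spsurvey", quietly = TRUE)) {
  library(spsurvey)
  if ("LEGACY" %in% names(NE_Lakes)) {
    samp_legacy <- grts(NE_Lakes, n_base = 50, legacy_var = "LEGACY")
    print(samp_legacy)
  }
}
```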
legacy_sites
An sf object with a POINT
or MULTIPOINT
geometry representing the legacy sites. spsurvey assumes that
the legacy sites were selected from a previous sampling design that
incorporated randomness into site selection and that the legacy sites
are elements of the current sampling frame. If sframe
has a
POINT
or MULTIPOINT
geometry, the observations in legacy_sites
should not also be in sframe
(i.e., duplicates are not removed). Thus, sframe
and legacy_sites
together compose the current sampling frame. If m or z values
are in legacy_sites
' geometry, they are silently dropped (i.e., only x-coordinates
and y-coordinates are preserved).
legacy_stratum_var
A character string containing the name of the column from
legacy_sites
that identifies stratum membership for each element of legacy_sites
.
This argument is required when the sampling design is stratified and its levels
must be contained in the levels of the stratum_var
variable. The default value of legacy_stratum_var
is stratum_var
, so legacy_stratum_var
need only be specified explicitly when
the name of the stratification variable in legacy_sites
differs from stratum_var
.
legacy_caty_var
A character string containing the name of the column from
legacy_sites
that identifies the unequal probability variable for each element of legacy_sites
.
This argument is required when the sampling design uses unequal selection probabilities and its categories
must be contained in the levels of the caty_var
variable. The default value of legacy_caty_var
is caty_var
, so legacy_caty_var
need only be specified explicitly when
the name of the unequal probability variable in legacy_sites
differs from caty_var
.
legacy_aux_var
A character string containing the name of the column from
legacy_sites
that identifies the proportional probability variable for each element of legacy_sites
.
This argument is required when the sampling design uses proportional selection probabilities and the values of the
legacy_aux_var
variable must be positive. The
default value of legacy_aux_var
is aux_var
, so legacy_aux_var
need only be specified explicitly
when the name of the proportional probability variable in legacy_sites
differs from aux_var
.
mindis
A numeric value indicating the desired minimum distance between sampled
sites. If the sampling design is stratified and mindis
is a numeric value, the minimum
distance is applied to all strata. If the sampling design is stratified and different minimum distances
are desired among strata, then mindis
is a list whose names match the names of n_base
and whose values
are the minimum distance for the corresponding stratum. If a minimum distance is not desired
for a particular stratum, then the corresponding value in mindis
should be 0
or
NULL
(which is equivalent to 0
).
The units of mindis
must represent the units in sframe
. A warning is returned if the
minimum distance could not be reached after maxtry
attempts. If legacy sites are used, the minimum distance
requirement (and subsequent warning if maxtry
attempts are reached) is enforced for all base sites
that are not legacy sites (i.e., the minimum distance is enforced for these sites
by comparing distances against all base sites (legacy and non-legacy)).
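One way to confirm that a minimum distance requirement holds is to compute the smallest pairwise distance among the selected sites. The base R sketch below does this for made-up projected coordinates; the names coords, min_d, and mindis_met are hypothetical, and the distance is assumed to be in the same units as the coordinates.

```r
# Made-up projected coordinates (for example, meters) of three sites.
coords <- cbind(x = c(0, 1000, 2500), y = c(0, 0, 0))
mindis <- 900

# Smallest pairwise Euclidean distance among the sites.
min_d <- min(dist(coords))

# TRUE when every pair of sites is at least mindis apart.
mindis_met <- min_d >= mindis
```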
maxtry
The maximum number of attempts to apply the minimum distance algorithm to obtain
the desired minimum distance between sites. Each iteration takes roughly as long as the
standard GRTS algorithm. Successive iterations will always contain at least as many
sites satisfying the minimum distance requirement as the previous iteration. The algorithm stops
when the minimum distance requirement is met or there are maxtry
iterations.
The default number of maximum iterations is 10
.
n_over
The number of reverse hierarchically ordered (rho) replacement sites.
If the sampling design is unstratified, then
n_over
is an integer specifying the number of rho replacement sites desired.
If the sampling design is stratified,
then n_over
is a vector (or list) whose names match the names of n_base
and
whose values indicate the number of rho replacement sites for each stratum.
If replacement sites are not desired for a particular stratum, then the corresponding
value in n_over
should be 0
or NULL
(which is equivalent to 0
).
If the sampling design is stratified but the number of n_over
sites is the same in each
stratum, n_over
can be a vector which is used for each stratum.
If n_over
is an unnamed, length-one vector, its value is recycled
and used for each stratum. Note that if the
sampling design has unequal selection probabilities (seltype = "unequal"
), then n_over
sites
are given the same proportion of caty_n
values as n_base
.
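The note on unequal selection probabilities can be read as a simple proportional allocation. The sketch below works through it with made-up numbers; caty_n_over is a hypothetical name for the expected number of replacement sites in each caty_var level.

```r
n_base <- 50
n_over <- 5
caty_n <- c(small = 10, large = 40)
stopifnot(sum(caty_n) == n_base)

# Replacement sites keep the same proportions as caty_n has within n_base.
caty_n_over <- n_over * caty_n / n_base
caty_n_over   # small = 1, large = 4
```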
n_near
The number of nearest neighbor (nn) replacement sites.
If the sampling design is unstratified, n_near
is an integer from 1
to 10
specifying the number of
nn replacement sites to be selected for each base site. If the sampling design
is stratified but the same number of nn replacement sites is desired
for each stratum, n_near
is an integer from 1
to 10
specifying the number of
nn replacement sites to be selected for each base site. If the sampling design is
stratified and a different number of nn replacement sites is
desired for each stratum, n_near
is a vector (or list) whose names represent strata and whose
values are integers from 1
to 10
specifying the number of
nn replacement sites to be selected for each base site in the stratum. If replacement sites
are not desired for a particular stratum, then the corresponding value in n_near
should be 0
or NULL
(which is equivalent to 0
). For
infinite sampling frames, the distance between a site and its nn
depends on pt_density
. The larger pt_density
, the closer the nn neighbors.
wgt_units
The units used to compute the design weights. These
units must be standard units as defined by the set_units()
function in
the units package. The default units match the units of the sf object.
pt_density
A positive integer controlling the density of the GRTS approximation
for infinite sampling frames. The GRTS approximation for infinite sampling
frames vastly improves computational efficiency by generating many finite points and
selecting a sample from the points. pt_density
represents the density
of finite points per unit to use in the approximation. More specifically,
for each stratum, the number of points used in the approximation equals
pt_density * (n_base + n_over)
. A larger value of pt_density
means a closer approximation to the infinite sampling frame but less
computational efficiency. The default value of pt_density
is 10
. Note that
when used with caty_n
, the unequal inclusion probabilities generated from
this approach are also approximations.
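The per-stratum point count stated above is easy to compute directly. The sketch below uses made-up sample sizes for two strata; n_points is a hypothetical name for the number of finite points generated in each stratum.

```r
pt_density <- 10
n_base <- c(low = 25, high = 30)
n_over <- c(low = 5, high = 0)

# Number of finite points used in the approximation, by stratum.
n_points <- pt_density * (n_base + n_over)
n_points   # low = 300, high = 300
```

Raising pt_density scales these counts linearly, which is the source of the trade-off between a closer approximation and longer run time.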
DesignID
A character string indicating the naming structure for each
site's identifier selected in the sample, which is matched with SiteBegin
and
included as a variable in the
sf object in the function's output. Default is "Site".
SiteBegin
A number indicating the first number to use to match
with DesignID
while creating each site's identifier selected in the sample.
Successive sites are given successive integers. The default starting number
is 1
and the number of digits is equal to the number of digits in
n_base + n_over
.
For example, if n_base
is 50 and n_over
is 0, then the default
site identifiers are Site-01
to Site-50
.
sep
A character string that acts as a separator between
DesignID
and SiteBegin
. The default is "-"
.
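The way DesignID, SiteBegin, and sep combine into site identifiers can be reproduced in base R. This is a sketch of the documented naming scheme using the default values, not the code grts() runs internally; the intermediate names n_total, width, and site_number are hypothetical.

```r
DesignID <- "Site"
SiteBegin <- 1
sep <- "-"
n_base <- 50
n_over <- 0

# The number of digits follows the number of digits in n_base + n_over.
n_total <- n_base + n_over
width <- nchar(as.character(n_total))

# Successive integers starting at SiteBegin, zero-padded to a common width.
site_number <- as.integer(SiteBegin + seq_len(n_total) - 1)
siteID <- paste(
  DesignID,
  formatC(site_number, width = width, format = "d", flag = "0"),
  sep = sep
)
siteID[c(1, n_total)]   # "Site-01" "Site-50"
```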
projcrs_check
A check for whether the coordinates are projected. If TRUE
,
an error is returned if coordinates are not projected (i.e., they are geographic or NA). If FALSE
, the
check is not performed, which means that the crs in sframe
(and legacy_sites
if provided) can be projected, geographic, or NA.
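Before relying on projcrs_check, the coordinate reference system can be inspected and, if needed, reprojected with the sf package. The sketch below assumes sf is installed. The two points are made up, and EPSG:5070 is only one possible projected CRS, chosen here for illustration.

```r
if (requireNamespace("sf", quietly = TRUE)) {
  # Made-up points in geographic coordinates (WGS84, EPSG:4326).
  pts <- sf::st_as_sf(
    data.frame(lon = c(-72.5, -71.9), lat = c(42.1, 43.0)),
    coords = c("lon", "lat"),
    crs = 4326
  )

  # TRUE means the coordinates are geographic (longitude/latitude).
  is_geographic <- sf::st_is_longlat(pts)

  # Reproject to a projected CRS before passing the object as sframe.
  pts_proj <- sf::st_transform(pts, crs = 5070)
  is_projected <- !sf::st_is_longlat(pts_proj)
}
```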
Tony Olsen olsen.tony@epa.gov
n_base
is the number of sites used to calculate
the design weights, which is typically the number of sites used in an analysis. When a panel sampling design is implemented, n_base
is typically the
number of sites in all panels that will be sampled in the same temporal period --
n_base
is not the total number of sites in all panels. The sum of n_base
and
n_over
is equal to the total number of sites to be visited for all panels plus
any replacement sites that may be required.
Stevens Jr., Don L. and Olsen, Anthony R. (2004). Spatially balanced sampling of natural resources. Journal of the American Statistical Association, 99(465), 262-278.
irs
to select a sample that is not spatially balanced
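if (FALSE) {
# Load spsurvey, which provides grts() and the NE_Lakes example data
library(spsurvey)
# Unstratified, equal probability sample of 100 sites
samp <- grts(NE_Lakes, n_base = 100)
print(samp)
# Stratified sample: the names of strata_n must match the values of ELEV_CAT
strata_n <- c(low = 25, high = 30)
samp_strat <- grts(NE_Lakes, n_base = strata_n, stratum_var = "ELEV_CAT")
print(samp_strat)
# Sample of 30 base sites plus 5 reverse hierarchically ordered replacement sites
samp_over <- grts(NE_Lakes, n_base = 30, n_over = 5)
print(samp_over)
}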