posteriorsize
computes the posterior distribution of the
population size based on data collected by Respondent Driven Sampling. The
approach approximates the RDS via the Sequential Sampling model of Gile
(2008). As such, it is referred to as the Sequential Sampling - Population Size Estimate (SS-PSE).
It uses the order of selection of the sample to provide information
on the distribution of network sizes over the population members.
posteriorsize(
s,
s2 = NULL,
previous = NULL,
median.prior.size = NULL,
interval = 10,
burnin = 5000,
maxN = NULL,
K = FALSE,
samplesize = 1000,
quartiles.prior.size = NULL,
mean.prior.size = NULL,
mode.prior.size = NULL,
priorsizedistribution = c("beta", "flat", "nbinom", "pln", "supplied"),
effective.prior.df = 1,
sd.prior.size = NULL,
mode.prior.sample.proportion = NULL,
alpha = NULL,
visibilitydistribution = c("cmp", "nbinom", "pln"),
mean.prior.visibility = NULL,
sd.prior.visibility = NULL,
max.sd.prior.visibility = 4,
df.mean.prior.visibility = 1,
df.sd.prior.visibility = 3,
beta_0.mean.prior = -3,
beta_t.mean.prior = 0,
beta_u.mean.prior = 0,
beta_0.sd.prior = 10,
beta_t.sd.prior = 10,
beta_u.sd.prior = 10,
mem.optimism.prior = NULL,
df.mem.optimism.prior = 5,
mem.scale.prior = 2,
df.mem.scale.prior = 10,
mem.overdispersion = 15,
visibility = TRUE,
type.impute = c("median", "distribution", "mode", "mean"),
Np = 0,
n = NULL,
n2 = NULL,
mu_proposal = 0.1,
nu_proposal = 0.15,
beta_0_proposal = 0.2,
beta_t_proposal = 0.001,
beta_u_proposal = 0.001,
memmu_proposal = 0.1,
memscale_proposal = 0.15,
burnintheta = 500,
burninbeta = 50,
parallel = 1,
parallel.type = "PSOCK",
seed = NULL,
maxbeta = 90,
supplied = list(maxN = maxN),
max.coupons = NULL,
recruit.time = NULL,
recruit.time2 = NULL,
include.tree = TRUE,
unit.scale = FALSE,
optimism = TRUE,
reflect.time = FALSE,
equalize = TRUE,
verbose = FALSE
)
posteriorsize
returns a list consisting of the
following elements:
vector; The final posterior draw for the degrees of the population. The first \(n\) are the sample in sequence and the reminder are non-sequenced.
count; the maximum visibility for an individual. This is usually calculated as twice the maximum observed degree.
count; the sample size.
count; the number of Monte-Carlo samples to draw to compute the posterior. This is the number returned by the Metropolis-Hastings algorithm.The default is 1000.
count; the number of proposals before any MCMC sampling is done. It typically is set to a fairly large number.
count; the number of proposals between sampled statistics.
scalar; The
hyper parameter mean.prior.visibility
being the mean visibility for the prior
distribution for a randomly chosen person. The prior has this mean.
scalar; The hyper parameter sigma
being the
standard deviation of the visibility for a randomly chosen person. The prior has
this standard deviation.
scalar; A hyper parameter being the degrees-of-freedom of the prior for the mean. This gives the equivalent sample size that would contain the same amount of information inherent in the prior.
scalar; A hyper parameter being the degrees-of-freedom of the prior for the standard deviation. This gives the equivalent sample size that would contain the same amount of information inherent in the prior for the standard deviation.
integer; The
overall visibility distribution is a mixture of the 1:Np
rates and a
parametric visibility distribution model truncated below Np. Thus the model fits
the proportions of the population with visibility 1:Np
each with a
separate parameter. This should adjust for an lack-of-fit of the parametric
visibility distribution model at lower visibilities, although it also changes the
model away from the parametric visibility distribution model.
scalar; The standard deviation of the proposal distribution for the mean visibility.
scalar; The standard deviation of the proposal distribution for the CMP scale parameter of the visibility distribution.
vector of length 5; summary statistics for the posterior population size.
maximum aposteriori value of N
mean aposteriori value of N
median aposteriori value of N
the 2.5th percentile of the (posterior) distribution for the N. That is, the lower point on a 95% probability interval.
the 97.5th percentile of the (posterior) distribution for the N. That is, the upper point on a 95% probability interval.
integer; maximum possible population size. By default this is determined from an upper quantile of the prior distribution.
matrix of dimension
samplesize
\(\times\) 10
matrix of summary statistics from
the posterior. This is also an object of class mcmc
so it can be
plotted and summarized via the mcmc.diagnostics
function in the
ergm
package (and also the coda
package). The statistics are:
population size.
scalar; The mean visibility for the prior distribution for a randomly chosen person. The prior has this mean.
scalar; The standard deviation of the visibility for a randomly chosen person. The prior has this standard deviation.
scalar; the number of nodes of visibility 1 in the population (it is assumed all nodes have visibility 1 or more).
scalar; This is
only present for the cmp
model. It is the \(\lambda\) parameter in
the standard parameterization of the Conway-Maxwell-Poisson model for the
visibility distribution.
scalar; This is only present for the
cmp
model. It is the \(\nu\) parameter in the standard
parameterization of the Conway-Maxwell-Poisson model for the visibility
distribution.
matrix of dimension samplesize
\(\times\) n
matrix of
posterior draws from the unit size distribution for those in the survey.
The sample for the i
th person is the i
th column.
vector; the vector of (log) prior
probabilities on each value of \(m=N-n\) - that is, the number of
unobserved members of the population. The values are
n:(length(lpriorm)-1+n)
.
count; the number of proposals in the Metropolis-Hastings sub-step for the visibility distribution parameters (\(\theta\)) before any MCMC sampling is done. It typically is set to a modestly large number.
logical; if this is
TRUE
, the program printed out additional information, including
goodness of fit statistics.
vector; a vector
of length the maximum visibility (K
) (by default
K=2*max(sample
visibility)
). The k
th entry is the posterior predictive number persons
with visibility k
. That is, it is the posterior predictive distribution
of the number of people with each visibility in the population.
vector; a vector of length the maximum visibility
(K
) (by default
K=2*max(sample visibility)
). The k
th entry
is the posterior predictive proportion of persons with visibility k
.
That is, it is the posterior predictive distribution of the proportion of
people with each visibility in the population.
vector of length 6
of MAP estimates corresponding to the output sample
. These are:
population size.
scalar; The mean visibility for the prior distribution for a randomly chosen person. The prior has this mean.
scalar; The standard deviation of the visibility for a randomly chosen person. The prior has this standard deviation.
scalar; the number of nodes of visibility 1 in the population (it is assumed all nodes have visibility 1 or more).
scalar; This is
only present for the cmp
model. It is the \(\lambda\) parameter in
the standard parameterization of the Conway-Maxwell-Poisson model for the
visibility distribution.
scalar; This is only present for the
cmp
model. It is the \(\nu\) parameter in the standard
parameterization of the Conway-Maxwell-Poisson model for the visibility
distribution.
scalar; A hyperparameter being the mode of the prior distribution on the sample proportion \(n/N\).
scalar; A hyperparameter being the mode of the prior distribution on the population size.
scalar; A hyperparameter being the mode of the prior distribution on the population size.
scalar; A hyperparameter being the mean of the prior distribution on the population size.
vector of length 2; A pair of hyperparameters being the lower and upper quartiles of the prior distribution on the population size.
count; the
parametric distribution to use for the individual network sizes (i.e.,
visibilities). The options are cmp
, nbinom
, and pln
. These
correspond to the Conway-Maxwell-Poisson, Negative-Binomial, and
Poisson-log-normal. The default is cmp
.
character; the type of parametric distribution
to use for the prior on population size. The options are beta
(for a
Beta prior on the sample proportion (i.e. \(n/N\)), nbinom
(Negative-Binomial), pln
(Poisson-log-normal), flat
(uniform),
and continuous
(the continuous version of the Beta prior on the
sample proportion. The default is beta
.
either a vector of integers or an rds.data.frame
providing network
size information.
If a rds.data.frame
is passed and visibility=TRUE
, the default, then
the measurement error model is to used, whereby latent visibilities are used in place
of the reported network sizes as the size variable. If a vector of integers is passed these
are the network sizes in sequential order of recording (and the measurement model is not used).
either a vector of integers or an rds.data.frame
providing network
size information for a second RDS sample subsequent to the first RDS recorded in \(s\).
If a rds.data.frame
is passed and visibility=TRUE
, the default, then
the measurement error model is to used, whereby latent visibilities are used in place
of the reported network sizes as the size variable. If a vector of integers is passed these
are the network sizes in sequential order of recording (and the measurement model is not used).
character; optionally, the name of the variable in \(s2\) indicating if the corresponding unit was sampled in the first RDS.
scalar; A hyperparameter being the mode of the prior distribution on the population size.
count; the number of proposals between sampled statistics.
count; the number of proposals before any MCMC sampling is done. It typically is set to a fairly large number.
integer; maximum possible population size. By default this is determined from an upper quantile of the prior distribution.
count; the maximum visibility for an individual. This is usually
calculated as round(stats::quantile(s,0.80))
. It applies to network sizes and (latent) visibilities.
If logical and FALSE then the K is unbounded but set to compute the visibilities.
count; the number of Monte-Carlo samples to draw to compute the posterior. This is the number returned by the Metropolis-Hastings algorithm.The default is 1000.
vector of length 2; A pair of hyperparameters
being the lower and upper quartiles of the prior distribution on the
population size. For example,
quartiles.prior.size=c(1000,4000)
corresponds to a prior where the lower quartile (25%) is 1000 and the upper
(75%) is 4000.
scalar; A hyperparameter being the mean of the prior distribution on the population size.
scalar; A hyperparameter being the mode of the prior distribution on the population size.
character; the type of parametric distribution
to use for the prior on population size. The options are beta
(for a
Beta prior on the sample proportion (i.e. \(n/N\))), flat
(uniform), nbinom
(Negative-Binomial), and pln
(Poisson-log-normal). The default is beta
.
scalar; A hyperparameter being the effective number of samples worth of information represented in the prior distribution on the population size. By default this is 1, but it can be greater (or less!) to allow for different levels of uncertainty.
scalar; A hyperparameter being the standard deviation of the prior distribution on the population size.
scalar; A hyperparameter being the mode of the prior distribution on the sample proportion \(n/N\).
scalar; A hyperparameter being the first parameter of the beta prior model for the sample proportion. By default this is NULL, meaning that 1 is chosen. it can be any value at least 1 to allow for different levels of uncertainty.
count; the parametric distribution to use for the
individual network sizes (i.e., degrees). The options are cmp
,
nbinom
, and pln
. These correspond to the
Conway-Maxwell-Poisson, Negative-Binomial, and Poisson-log-normal. The
default is cmp
.
scalar; A hyper parameter being the mean visibility for the prior distribution for a randomly chosen person. The prior has this mean.
scalar; A hyper parameter being the standard deviation of the visibility for a randomly chosen person. The prior has this standard deviation.
scalar; The maximum allowed value of sd.prior.visibility
.
If the passed or computed value is higher, it is reduced to this value.
This is done for numerical stability reasons.
scalar; A hyper parameter being the degrees-of-freedom of the prior for the mean. This gives the equivalent sample size that would contain the same amount of information inherent in the prior.
scalar; A hyper parameter being the degrees-of-freedom of the prior for the standard deviation. This gives the equivalent sample size that would contain the same amount of information inherent in the prior for the standard deviation.
scalar; A hyper parameter being the mean of the beta_0 parameter distribution in the model for the number of recruits.
scalar; A hyper parameter being the mean of the beta_t parameter distribution in the model for the number of recruits. This corresponds to the time-to-recruit variable.
scalar; A hyper parameter being the mean of the beta_u parameter distribution in the model for the number of recruits. This corresponds to the visibility variable.
scalar; A hyper parameter being the standard deviation of the beta_0 parameter distribution in the model for the number of recruits.
scalar; A hyper parameter being the standard deviation of the beta_t parameter distribution in the model for the number of recruits. This corresponds to the time-to-recruit variable.
scalar; A hyper parameter being the standard deviation of the beta_u parameter distribution in the model for the number of recruits. This corresponds to the visibility variable.
scalar; A hyper parameter being the mean of the distribution of the optimism parameter.
scalar; A hyper parameter being the degrees-of-freedom of the prior for the optimism parameter. This gives the equivalent sample size that would contain the same amount of information inherent in the prior.
scalar; A hyper parameter being the scale of the concentration of baseline negative binomial measurement error model.
scalar; A hyper parameter being the degrees-of-freedom of the prior for the standard deviation of the dispersion parameter in the visibility model. This gives the equivalent sample size that would contain the same amount of information inherent in the prior for the standard deviation.
scalar; A parameter being the overdispersion of the negative binomial distribution that is the baseline for the measurement error model.
logical; Indicate if the measurement error model
is to be used, whereby latent visibilities are used in place of the reported
network sizes as the unit size variable. If TRUE
then a rds.data.frame
need to be passed to provide the RDS information needed for the measurement error model.
The type of imputation to use for the summary visibilities
(returned in the component visibilities
. The imputes are based on the posterior
draws of the visibilities.
It can be of type distribution
, mode
,median
, or mean
with median
the default, being the posterior median of the visibility for that person.
integer; The overall visibility distribution is a mixture of the
Np
rates for 1:Np
and a parametric visibility distribution model
truncated below Np
. Thus the model fits the proportions of the
population with visibility 1:Np
each with a separate parameter. This
should adjust for an lack-of-fit of the parametric visibility distribution model
at lower visibilities, although it also changes the model away from the
parametric visibility distribution model.
integer; the number of people in the sample. This is usually computed from \(s\) automatically and not usually specified by the user.
integer; If \(s2\) is specified, this is the number of people in the second sample. This is usually computed from \(s\) automatically and not usually specified by the user.
scalar; The standard deviation of the proposal distribution for the mean visibility.
scalar; The standard deviation of the proposal distribution for the CMP scale parameter that determines the standard deviation of the visibility.
scalar; The standard deviation of the proposal distribution for the beta_0 parameter of the recruit model.
scalar; The standard deviation of the proposal distribution for the beta_t parameter of the recruit model. This corresponds to the visibility variable.
scalar; The standard deviation of the proposal distribution for the beta_u parameter of the recruit model. This corresponds to the time-to-recruit variable.
scalar; The standard deviation of the proposal distribution for the log of the optimism parameter (that is, gamma).
scalar; The standard deviation of the proposal distribution for the log of the s.d. in the optimism model.
count; the number of proposals in the Metropolis-Hastings sub-step for the visibility distribution parameters (\(\theta\)) before any MCMC sampling is done. It typically is set to a modestly large number.
count; the number of proposals in the Metropolis-Hastings sub-step for the visibility distribution parameters (\(\beta\)) before any MCMC sampling is done. It typically is set to a modestly large number.
count; the number of parallel processes to run for the Monte-Carlo sample. This uses MPI or PSOCK. The default is 1, that is not to use parallel processing.
The type of parallel processing to use. The options are "PSOCK" or "MPI". This requires the corresponding type to be installed. The default is "PSOCK".
integer; random number integer seed. Defaults to NULL
to
use whatever the state of the random number generator is at the time of the
call.
scalar; The maximum allowed value of the beta
parameter.
If the implied or computed value is higher, it is reduced to this value.
This is done for numerical stability reasons.
list; If supplied, is a list with components maxN
and
sample
. In this case supplied
is a matrix with a column named
N
being a sample from a prior distribution for the population size.
The value maxN
specifies the maximum value of the population size, a
priori.
The number of recruitment coupons distributed to each enrolled subject (i.e. the maximum number of recruitees for any subject). By default it is taken by the attribute or data, else the maximum recorded number of coupons.
vector; An optional value for the data/time that the person was interviewed. It needs to resolve as a numeric vector with number of elements the number of rows of the data with non-missing values of the network variable. If it is a character name of a variable in the data then that variable is used. If it is NULL then the sequence number of the recruit in the data is used. If it is NA then the recruitment is not used in the model. Otherwise, the recruitment time is used in the model to better predict the visibility of the person.
vector; An optional value for the data/time that the person in the second RDS survey was interviewed. It needs to resolve as a numeric vector with number of elements the number of rows of the data with non-missing values of the network variable. If it is a character name of a variable in the data then that variable is used. If it is NULL, the default, then the sequence number of the recruit in the data is used. If it is NA then the recruitment is not used in the model. Otherwise, the recruitment time is used in the model to better predict the visibility of the person.
logical; If TRUE
,
augment the reported network size by the number of recruits and one for the recruiter (if any).
This reflects a more accurate value for the visibility, but is not the self-reported degree.
In particular, it typically produces a positive visibility (compared to a
possibility zero self-reported degree).
numeric; If not NULL
it sets the numeric value of the scale parameter
of the distribution of the unit sizes.
For the negative binomial, it is the multiplier on the variance of the negative binomial
compared to a Poisson (via the Poisson-Gamma mixture representation). Sometimes the scale is
unnaturally large (e.g. 40) so this give the option of fixing it (rather than using
the MLE of it). The model is fit with the parameter fixed at this passed value.
logical; If TRUE
then add a term to the model allowing
the (proportional) inflation of the self-reported degrees relative to the unit sizes.
logical; If TRUE
then the recruit.time
is the time before the
end of the study (instead of the time since the survey started or chronological time).
logical; If TRUE
and the capture-recapture model is used, adjusts for gross differences in the
reported network sizes between the two samples.
logical; if this is TRUE
, the program will print out
additional information, including goodness of fit statistics.
The best way to specify the prior is via the
hyperparameter mode.prior.size
which specifies the mode of the prior
distribution on the population size. You can alternatively specify the
hyperparameter median.prior.size
which specifies the median of the
prior distribution on the population size, or mean.prior.sample
proportion
which specifies the mean of the prior distribution on the
proportion of the population size in the sample or mode.prior.sample
proportion
which specifies the mode of the prior distribution on the
proportion of the population size in the sample. Finally, you can specify
quartiles.prior.size
as a vector of length 2 being the pair of lower
and upper quartiles of the prior distribution on the population size.
Gile, Krista J. (2008) Inference from Partially-Observed Network Data, Ph.D. Thesis, Department of Statistics, University of Washington.
Gile, Krista J. and Handcock, Mark S. (2010) Respondent-Driven Sampling: An Assessment of Current Methodology, Sociological Methodology 40, 285-327.
Gile, Krista J. and Handcock, Mark S. (2014) sspse: Estimating Hidden Population Size using Respondent Driven Sampling Data R package, Los Angeles, CA. Version 0.5, https://hpmrg.org/sspse/.
Handcock MS (2003). degreenet: Models for Skewed Count Distributions Relevant to Networks. Statnet Project, Seattle, WA. Version 1.2, https://statnet.org/.
Handcock, Mark S., Gile, Krista J. and Mar, Corinne M. (2014) Estimating Hidden Population Size using Respondent-Driven Sampling Data, Electronic Journal of Statistics, 8, 1, 1491-1521
Handcock, Mark S., Gile, Krista J. and Mar, Corinne M. (2015) Estimating the Size of Populations at High Risk for HIV using Respondent-Driven Sampling Data, Biometrics.
network, statnet, degreenet
data(fauxmadrona)
# Here interval=1 so that it will run faster. It should be higher in a
# real application.
fit <- posteriorsize(fauxmadrona, median.prior.size=1000,
burnin=20, interval=1, samplesize=100)
summary(fit)
Run the code above in your browser using DataLab