
PracTools (version 1.6)

SampStop: Stopping rule for surveys

Description

Compute the probability that continuing data collection will lead to a change in the value of an estimated mean.

Usage

SampStop(lm.obj, formula, n1.data, yvar, n2.data, p = NULL, delta = NULL, seed = NULL)

Value

Matrix with length(p)*length(delta) rows and the following columns:

Pr(response)

Probability of response by each of the remaining \(n_2\) cases

Exp no. resps

Expected number of respondents among the remaining \(n_2\) cases, i.e. \(n_2*p\)

y1 mean

Mean of the \(n_1\) respondents

diff in means

Value of the input parameter delta

se of diff

Standard error of the difference \(e_1 - e_2\)

z-score

Z-score for computing \(Pr(|e_1 - e_2| < \delta)\)

Pr(smaller diff)

\(Pr(|e_1 - e_2| < \delta)\) for the inputs of p and delta
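The row layout described above can be sketched with a small grid. This is only an illustration: the values of p, delta, and n2 are made up, and the row ordering produced by expand.grid is an assumption that may differ from the ordering SampStop uses.

```r
# Hypothetical inputs, chosen only for illustration
p     <- c(0.2, 0.4, 0.6)   # anticipated response probabilities
delta <- c(5, 10)           # candidate differences in means
n2    <- 150                # number of incomplete cases (assumed)

# One row per (p, delta) combination; the row order is an assumption
grid <- expand.grid(delta = delta, p = p)

# Expected number of respondents among the n2 incomplete cases
grid$exp.resps <- n2 * grid$p

nrow(grid)   # equals length(p) * length(delta)
grid
```

Each value of p is paired with every value of delta, so the expected number of respondents repeats across the rows that share the same p.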

Arguments

lm.obj

object of class lm from a regression predicting \(y\) based on n1.data

formula

right-hand side of the formula in lm.obj; it excludes the dependent variable \(y\); no quotes are used.

n1.data

data frame containing units in the part of the sample that has been completed; includes \(y\) and the covariates in formula.

yvar

name or number of column in n1.data containing \(y\).

n2.data

data frame containing units in the part of the sample that is yet to be completed; includes only covariates in formula.

p

Vector of anticipated response probabilities for the n2 sample; 0 < p < 1.

delta

vector of potential differences in the estimated means for the n1 and n2 samples.

seed

random number seed for selecting sample from incomplete cases.

Author

George Zipf, Richard Valliant

Details

SampStop evaluates whether data collection can be stopped before all units are completed without substantially affecting the value of an estimated mean. Suppose that a sample of size \(n\) is divided between the \(n_1\) units whose collection has been completed and the remaining \(n_2 = n - n_1\) units that are yet to be completed. The function computes \(Pr(|e_1 - e_2| < \delta)\), where \(e_1 - e_2\) is the potential difference between the estimated mean based on the completed sample and the estimated mean for the full sample if all units were to be completed, and \(\delta\) (delta) is a bound on that difference. For \(e_1\) the mean is estimated after imputing the \(y\)'s for the \(n_2\) incomplete units. The estimated mean \(e_2\) is computed assuming that an additional \(n_2 * p\) units are completed, and the \(y\)'s for the remaining \(n_2 - n_2*p\) incomplete units are imputed. Estimating the variance of \(e_1 - e_2\) involves selecting a sample from n2.data using the random number seed in seed.
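As a rough sketch of how a probability of this form can be obtained from a z-score, the snippet below assumes that \(e_1 - e_2\) is approximately normal with mean zero and a known standard error. The helper name pr_smaller_diff and the numeric inputs are hypothetical, and the page does not state that SampStop uses exactly this expression.

```r
# Sketch under an assumed mean-zero normal approximation for e1 - e2.
# pr_smaller_diff is a hypothetical helper, not part of PracTools.
pr_smaller_diff <- function(delta, se) {
  z <- delta / se          # z-score for the bound delta
  2 * pnorm(z) - 1         # Pr(|e1 - e2| < delta) under the approximation
}

# Illustrative values only: larger delta gives a larger probability
pr_smaller_diff(delta = c(5, 10, 20), se = 10)
```

Under this approximation the probability increases with delta and decreases as the standard error of the difference grows.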

The parameter p is the response rate anticipated for the \(n_2\) uncompleted units. There is usually some uncertainty about p, which can be accounted for by supplying a vector of p's. The parameter delta (\(\delta\)) is a difference in estimates that, if not exceeded, would lead to stopping data collection. For an acceptably small value of delta, if \(Pr(|e_1 - e_2| < \delta)\) is large enough, the decision can be made to stop data collection. The variable \(y\) in yvar is assumed to follow the linear model in lm.obj. A model with independent errors (or a simple random sample) is assumed for calculations.
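One way to turn this into a decision rule is sketched below. The numbers are made up and do not come from SampStop, the thresholds max.delta and min.pr are hypothetical, and requiring the rule to hold for every anticipated p is a conservative choice made for this illustration only.

```r
# Made-up values standing in for SampStop() output; not real results
res <- data.frame(
  p     = c(0.2, 0.2, 0.4, 0.4),
  delta = c(5, 10, 5, 10),
  pr    = c(0.62, 0.91, 0.55, 0.86)
)

max.delta <- 10    # largest difference considered acceptable (assumed)
min.pr    <- 0.85  # probability required before stopping (assumed)

# Flag rows where delta is acceptable and the probability is high enough
ok <- res$delta <= max.delta & res$pr >= min.pr

# For each delta, require the rule to hold for every anticipated p
stop.by.delta <- tapply(ok, res$delta, all)
stop.by.delta
```

With these made-up values the rule is met at delta = 10 but not at delta = 5, so stopping would be supported only if a difference of 10 is considered small enough.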

References

Wagner, J. and Raghunathan, T. (2010). A new stopping rule for surveys. Statistics in Medicine, 29(9), 1014-1024.

Examples

library(PracTools)
# Model with quantitative covariates
data(hospital)
HOSP <- hospital
HOSP$sqrt.x <- sqrt(HOSP$x)
sam <- sample(nrow(HOSP), 50)
N1  <- HOSP[sam, ]
N2  <- HOSP[-sam, ]
# Create lm object using "known" data; no intercept model
lm.obj <- lm(y ~ 0 + sqrt.x + x, data = N1)
del <- mean(HOSP$y) - mean(HOSP$y) * seq(0.6, 1, by = 0.05)
SampStop(lm.obj  = lm.obj,
         formula = ~ 0 + sqrt.x + x,
         n1.data = N1,
         yvar    = "y",
         n2.data = N2,
         p       = seq(0.2, 0.6, by = 0.05),
         delta   = del,
         seed    = .Random.seed[413])
# Model with factors
data(labor)
sam <- sample(nrow(labor), 50)
n1.vars <- c("WklyWage", "HoursPerWk", "agecat", "sex")
n2.vars <- c("HoursPerWk", "agecat", "sex")
N1 <- labor[sam, n1.vars]
N2 <- labor[-sam, n2.vars]
# Fit the model on the completed cases only, as described for lm.obj
lm.obj <- lm(WklyWage ~ HoursPerWk + as.factor(agecat) + as.factor(sex), data = N1)
del <- mean(N1$WklyWage) - mean(N1$WklyWage) * seq(0.75, 0.95, by = 0.05)
result <- SampStop(lm.obj  = lm.obj,
                   formula = ~ HoursPerWk + as.factor(agecat) + as.factor(sex),
                   n1.data = N1,
                   yvar    = "WklyWage",
                   n2.data = N2,
                   p       = seq(0.2, 0.4, by = 0.05),
                   delta   = del,
                   seed    = .Random.seed[78])

# ggplot() expects a data frame, so build one from the columns of the result
plot.df <- data.frame(delta   = result[, 4],
                      pr      = result[, 7],
                      p.nresp = factor(paste(result[, 1], result[, 2], sep = ", ")))
library(ggplot2)
ggplot(plot.df, aes(delta, pr, colour = p.nresp)) +
  geom_point() +
  geom_line(linewidth = 1.1) +
  labs(x = "delta", y = "Pr(|e1 - e2| < delta)", colour = "Pr(resp), n.resp")
