Compares the average levels of a variable between two groups that potentially share members.
gap(variable, data, groupA = "default", groupB = "default",
percentiles = NULL, achievementLevel = NULL,
achievementDiscrete = FALSE, targetLevel = NULL, weightVar = NULL,
jrrIMax = 1, varMethod = c("jackknife", "Taylor"),
omittedLevels = TRUE, defaultConditions = TRUE, recode = NULL,
referenceDataIndex = 1, returnVarEstInputs = FALSE,
returnSimpleDoF = FALSE, returnSimpleN = FALSE,
returnNumberOfPSU = FALSE)
a character indicating the variable to be compared, potentially with a subject scale or subscale
an edsurvey.data.frame, a light.edsurvey.data.frame, or an edsurvey.data.frame.list
an expression or character expression that defines a condition for subset. This subset will be compared to groupB. If not specified, it will define a whole sample as in data.
an expression or character expression that defines a condition for subset. This subset will be compared to groupA. If not specified, it will define a whole sample as in data. If set to NULL, estimates for the second group will be dropped.
a numeric vector. The gap function calculates the mean when this argument is omitted or set to NULL. Otherwise, the gap at the percentile given is calculated.
the achievement level(s) at which percentages should be calculated
a logical indicating if the achievement level specified in the achievementLevel argument should be interpreted as discrete so that just the percentage in that particular achievement level will be included. Defaults to FALSE so that the percentage at or above that achievement level will be included in the percentage.
a character string. When specified, calculates the gap in the percentage of students at targetLevel in variable. This is useful for comparing the gap in the percentage of students at a survey response level.
a character indicating the weight variable to use. See Details.
a numeric value; when using the jackknife variance estimation method, the \(V_{jrr}\) term (see Details) can be estimated with any positive number of plausible values and is estimated on the lower of the number of available plausible values and jrrIMax. When jrrIMax is set to Inf, all plausible values will be used. Higher values of jrrIMax lead to longer computing times and more accurate variance estimates.
a character set to jackknife or Taylor that indicates the variance estimation method to be used
a logical value. When set to the default value of TRUE, drops the omitted levels of all factor variables. Use print on an edsurvey.data.frame to see the omitted levels.
a logical value. When set to the default value of TRUE, uses the default conditions stored in edsurvey.data.frame to subset the data. Use print on an edsurvey.data.frame to see the default conditions.
a list of lists to recode variables. Defaults to NULL. Can be set as recode = list(var1 = list(from = c("a", "b", "c"), to = "d")). See Examples.
a numeric used only when data is an edsurvey.data.frame.list, indicating which dataset is the reference dataset that other datasets are compared with. Defaults to one.
a logical value; set to TRUE to return the inputs to the jackknife and imputation variance estimates. This is intended to allow for the computation of covariances between estimates.
a logical value set to TRUE to return the degrees of freedom for some statistics (see Value section) that do not have a t-test; useful primarily for further computation
a logical value set to TRUE to add the count (n-size) of observations included in groups A and B in the percentage object
a logical value set to TRUE to return the number of primary sampling units (PSU) used in calculation.
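Two of the argument conventions above can be sketched in base R. This is an illustration only, not EdSurvey code: the helper nPlausibleValuesUsed is a hypothetical name introduced here, and the variable name and levels in the recode list are borrowed from the Examples.

```r
# recode is a list of lists: one inner list per variable, each with a
# `from` vector of existing levels and a `to` replacement level.
recode <- list(
  b017451 = list(
    from = c("Never or hardly ever", "Once every few weeks", "About once a week"),
    to   = "Infrequently"
  )
)
str(recode)

# Hypothetical helper (not part of EdSurvey) restating the jrrIMax rule:
# the V_jrr term is estimated on the lower of the number of available
# plausible values and jrrIMax.
nPlausibleValuesUsed <- function(nAvailable, jrrIMax = 1) {
  min(nAvailable, jrrIMax)
}
nPlausibleValuesUsed(5)        # default jrrIMax = 1
nPlausibleValuesUsed(5, Inf)   # Inf means all plausible values are used
```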
The return type depends on whether the class of the data argument is an edsurvey.data.frame or an edsurvey.data.frame.list. Both include the call (called call), a list called labels, an object named percentage that shows the percentage in groupA and groupB, and an object that shows the gap called results.
The labels list includes the following elements:
the definitions of the groups
the n-size for the full dataset (before applying the definition)
the n-size for the data after the group is subsetted and other restrictions (such as omitted values) are applied
the number of PSUs used in calculation; returned only when returnNumberOfPSU = TRUE
The percentages are computed according to the vignette titled Statistics in the section “Estimation of Weighted Percentages When Plausible Values Are Not Present.” The standard errors are calculated according to “Estimation of the Standard Error of Weighted Percentages When Plausible Values Are Not Present, Using the Jackknife Method.”
Standard errors of differences are calculated as the square root of the typical variance formula
$$Var(A-B) = Var(A) + Var(B) - 2 Cov(A,B)$$
where the covariance term is calculated as described in the vignette titled Statistics in the section “Estimation of Covariances.” These degrees of freedom are available only with the jackknife variance estimation. The degrees of freedom used for hypothesis testing are always set to the number of jackknife replicates in the data.
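The variance formula can be worked through with a few lines of base R. This is a minimal sketch with made-up numbers, not output from gap. The two-sided t-test and the count of 62 jackknife replicates are assumptions made for illustration.

```r
# Hypothetical inputs (not from any real survey): two group estimates,
# their standard errors, and the covariance between the estimates.
estimateA <- 283.1
estimateB <- 281.4
seA <- 1.2
seB <- 1.1
covAB <- 0.30

# Var(A - B) = Var(A) + Var(B) - 2 Cov(A, B)
diffAB <- estimateA - estimateB
diffABse <- sqrt(seA^2 + seB^2 - 2 * covAB)

# Assumption: a two-sided t-test with degrees of freedom equal to a
# hypothetical count of 62 jackknife replicates.
dofAB <- 62
diffABpValue <- 2 * pt(-abs(diffAB / diffABse), df = dofAB)

print(c(diffAB = diffAB, diffABse = diffABse, diffABpValue = diffABpValue))
```

A positive covariance shrinks the standard error of the difference relative to treating the two groups as independent, which is why the covariance term matters when the groups share members.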
When the data argument is an edsurvey.data.frame, gap returns an S3 object of class gap.
The percentage object is a numeric vector with the following elements:
the percentage of respondents in groupA compared with the whole sample in data
the standard error on the percentage of respondents in groupA
degrees of freedom appropriate for a t-test involving pctA. This value is returned only if returnSimpleDoF = TRUE.
the percentage of respondents in groupB.
the standard error on the percentage of respondents in groupB
degrees of freedom appropriate for a t-test involving pctB. This value is returned only if returnSimpleDoF = TRUE.
the value of pctA minus pctB
the covariance of pctA and pctB; used in calculating diffABse.
the standard error of pctA minus pctB
the p-value associated with the t-test used for the hypothesis test that diffAB is zero.
degrees of freedom used in calculating diffABpValue
The results object is a numeric data frame with the following elements:
the mean estimate of groupA (or the percentage estimate if achievementLevel or targetLevel is specified)
the standard error of estimateA
degrees of freedom appropriate for a t-test involving meanA. This value is returned only if returnSimpleDoF = TRUE.
the mean estimate of groupB (or the percentage estimate if achievementLevel or targetLevel is specified)
the standard error of estimateB
degrees of freedom appropriate for a t-test involving meanB. This value is returned only if returnSimpleDoF = TRUE.
the value of estimateA minus estimateB
the covariance of estimateA and estimateB. Used in calculating diffABse.
the standard error of diffAB
the p-value associated with the t-test used for the hypothesis test that diffAB is zero.
degrees of freedom used for the t-test on diffAB
When the gap is in percentiles or achievement levels, an additional column labeled percentiles or achievementLevel is included in the results object. When results has a single row and when returnVarEstInputs is TRUE, the additional elements varEstInputs and pctVarEstInputs also are returned. These can be used for calculating covariances with varEstToCov.
When the data argument is an edsurvey.data.frame.list, gap returns an S3 object of class gapList.
The results object in the edsurveyResultList is a data.frame. Each row corresponds to a particular dataset from the edsurvey.data.frame.list, and a reference dataset is dictated by the referenceDataIndex argument.
The percentage object is a data.frame with the following elements:
a data frame with a column for each column in the covs. See previous section for more details.
all elements in the percentage object in the previous section
the difference in pctA between the reference data and this dataset. Set to NA for the reference dataset.
the covariance of pctA in the reference data and pctA on this row. Used in calculating diffAAse.
the standard error for diffAA.
the p-value associated with the t-test used for the hypothesis test that diffAA is zero
the difference in pctB between the reference data and this dataset. Set to NA for the reference dataset.
the covariance of pctB in the reference data and pctB on this row. Used in calculating diffBBse.
the standard error for diffBB
the p-value associated with the t-test used for the hypothesis test that diffBB is zero
the value of diffAB in the reference dataset minus the value of diffAB in this dataset. Set to NA for the reference dataset.
the covariance of diffAB in the reference data and diffAB on this row. Used in calculating diffABABse.
the standard error for diffABAB
the p-value associated with the t-test used for the hypothesis test that diffABAB is zero
The results object is a data.frame with the following elements:
all elements in the results object in the previous section
the value of groupA in the reference dataset minus the value in this dataset. Set to NA for the reference dataset.
the covariance of meanA in the reference data and meanA on this row. Used in calculating diffAAse.
the standard error for diffAA.
the p-value associated with the t-test used for the hypothesis test that diffAA is zero
the value of groupB in the reference dataset minus the value in this dataset. Set to NA for the reference dataset.
the covariance of meanB in the reference data and meanB on this row. Used in calculating diffBBse.
the standard error for diffBB
the p-value associated with the t-test used for the hypothesis test that diffBB is zero
the value of diffAB in the reference dataset minus the value of diffAB in this dataset. Set to NA for the reference dataset.
the covariance of diffAB in the reference data and diffAB on this row. Used in calculating diffABABse.
the standard error for diffABAB
the p-value associated with the t-test used for the hypothesis test that diffABAB is zero
a logical value indicating if this line uses the same survey as the reference line. Set to NA for the reference line.
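As a rough illustration of the cross-dataset comparison, the sketch below computes diffABAB for one non-reference row from hypothetical numbers. It assumes the standard error takes the same variance-of-a-difference form used for diffABse, which the covariance description suggests but this page does not state outright. The variable names are illustrative and are not columns guaranteed by gap.

```r
# Hypothetical gap estimates (diffAB) and their standard errors for the
# reference dataset and for one other dataset in the list.
diffABref <- 3.0
diffABrefSE <- 1.5
diffABrow <- 1.2
diffABrowSE <- 1.4

# Hypothetical covariance between the two gap estimates; zero is what
# one would expect for independent samples.
covABAB <- 0

# Reference gap minus this dataset's gap, and (by assumption) its
# standard error from Var(X - Y) = Var(X) + Var(Y) - 2 Cov(X, Y).
diffABAB <- diffABref - diffABrow
diffABABse <- sqrt(diffABrefSE^2 + diffABrowSE^2 - 2 * covABAB)

print(c(diffABAB = diffABAB, diffABABse = diffABABse))
```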
This function calculates the gap between groupA and groupB (which may be omitted to indicate the full sample). The gap is calculated for one of four statistics:
The mean score gap (in the score variable) identified in the variable argument. This is the default. The means and their standard errors are calculated using the methods described in the lm.sdf function documentation.
The gap between respondents at the percentiles specified in the percentiles argument. This is returned when the percentiles argument is defined. The mean and standard error are computed as described in the percentile function documentation.
The gap in the percentage of students at (when achievementDiscrete is TRUE) or at or above (when achievementDiscrete is FALSE) a particular achievement level. This is used when the achievementLevel argument is defined. The mean and standard error are calculated as described in the achievementLevels function documentation.
The gap in the percentage of respondents responding at targetLevel to variable. This is used when targetLevel is defined. The mean and standard error are calculated as described in the edsurveyTable function documentation.
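The difference between the discrete and the at-or-above interpretation of an achievement level can be seen on toy data. The sketch below is plain base R with invented scores, weights, and cut points. It does not reproduce how EdSurvey handles plausible values, and the inclusive lower bound on each level is an assumption.

```r
# Hypothetical scores, weights, and cut points; none come from real data.
scores  <- c(210, 250, 265, 300, 320, 340)
weights <- c(1, 1, 1, 1, 1, 1)
cuts <- c(Basic = 262, Proficient = 299, Advanced = 333)

# At or above Proficient (the achievementDiscrete = FALSE reading).
atOrAbove <- scores >= cuts[["Proficient"]]
pctAtOrAbove <- 100 * sum(weights[atOrAbove]) / sum(weights)

# In Proficient only, below Advanced (the achievementDiscrete = TRUE reading).
inLevel <- scores >= cuts[["Proficient"]] & scores < cuts[["Advanced"]]
pctDiscrete <- 100 * sum(weights[inLevel]) / sum(weights)

print(c(atOrAbove = pctAtOrAbove, discrete = pctDiscrete))
```

The at-or-above percentage always includes the higher levels, so it can never be smaller than the discrete percentage for the same level.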
# NOT RUN {
# load EdSurvey, which provides readNAEP and gap
library(EdSurvey)
# read in the example data (generated, not real student data)
sdf <- readNAEP(system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))
# find the mean score gap in the primer data between males and females
gap("composite", sdf, dsex=="Male", dsex=="Female")
# find the score gap at the median, then at the quartiles, in the primer data
# between males and females
gap("composite", sdf, dsex=="Male", dsex=="Female", percentiles=50)
gap("composite", sdf, dsex=="Male", dsex=="Female", percentiles=c(25, 50, 75))
# find the gap in the percent at or above each achievement level in the
# primer data between males and females
gap("composite", sdf, dsex=="Male", dsex=="Female",
    achievementLevel=c("Basic", "Proficient", "Advanced"))
# find the discrete achievement level gap--this is harder to interpret
gap("composite", sdf, dsex=="Male", dsex=="Female",
    achievementLevel="Proficient", achievementDiscrete=TRUE)
# find the percent talk about studies at home (b017451) never or hardly
# ever gap in the primer data between males and females
gap("b017451", sdf, dsex=="Male", dsex=="Female",
    targetLevel="Never or hardly ever")
# example showing how to compare multiple levels
gap("b017451", sdf, dsex=="Male", dsex=="Female", targetLevel="Infrequently",
    recode=list(b017451=list(from=c("Never or hardly ever",
                                    "Once every few weeks",
                                    "About once a week"),
                             to=c("Infrequently"))))
# make subsets of sdf by scrpsu, "Scrambled PSU and school code"
sdfA <- subset(sdf, scrpsu %in% c(5,45,56))
sdfB <- subset(sdf, scrpsu %in% c(75,76,78))
sdfC <- subset(sdf, scrpsu %in% 100:200)
sdfD <- subset(sdf, scrpsu %in% 201:300)
sdfl <- edsurvey.data.frame.list(list(sdfA, sdfB, sdfC, sdfD),
                                 labels=c("A locations", "B locations",
                                          "C locations", "D locations"))
# compare the median score gap across the four subsets; the first is the reference
gap("composite", sdfl, dsex=="Male", dsex=="Female", percentiles=50)
# }