binspwc implements hypothesis testing procedures for pairwise group comparison of binscatter estimators and plots confidence bands for the difference in binscatter parameters between each pair of groups, following the results in Cattaneo, Crump, Farrell and Feng (2024a) and Cattaneo, Crump, Farrell and Feng (2024b).
If the binning scheme is not set by the user, the companion function binsregselect is used to implement binscatter in a data-driven way. Binned scatter plots based on different methods can be constructed using the companion functions binsreg, binsqreg or binsglm.
Hypothesis testing for parametric functional forms of, and shape restrictions on, the regression function of interest can be conducted via the companion function binstest.
binspwc(y, x, w = NULL, data = NULL, estmethod = "reg",
family = gaussian(), quantile = NULL, deriv = 0, at = NULL,
nolink = F, by = NULL, pwc = NULL, testtype = "two-sided",
lp = Inf, bins = NULL, bynbins = NULL, binspos = "qs",
pselect = NULL, sselect = NULL, binsmethod = "dpi", nbinsrot = NULL,
samebinsby = FALSE, randcut = NULL, nsims = 500, simsgrid = 20,
simsseed = NULL, vce = NULL, cluster = NULL, asyvar = F,
dfcheck = c(20, 30), masspoints = "on", weights = NULL,
subset = NULL, numdist = NULL, numclust = NULL, estmethodopt = NULL,
plot = FALSE, dotsngrid = 0, plotxrange = NULL, plotyrange = NULL,
colors = NULL, symbols = NULL, level = 95, ...)
stat
A matrix. Each row corresponds to the comparison between two groups. The first column is the test statistic. The second and third columns give the corresponding group numbers. The null hypothesis is mu_i(x)<=mu_j(x), mu_i(x)=mu_j(x), or mu_i(x)>=mu_j(x) for group i (given in the second column) and group j (given in the third column). The group number corresponds to the list of group names given by opt$byvals.
pval
A vector of p-values for all pairwise group comparisons.
bins_plot
A ggplot object for the confidence bands plot.
data.plot
A list containing data for plotting. Each item is a sublist of data frames for comparison between each pair of groups. Each sublist may contain the following data frames:
data.dots
Data for dots. It contains: pair, the name for the pair of groups; x, evaluation points; diff.fit, point estimates of the group difference.
data.cb
Data for confidence bands. It contains: pair, the name for the pair of groups; x, evaluation points; cb.fit, point estimates of the group difference; cb.se, standard errors; cb.l and cb.r, left and right boundaries of the confidence band.
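As a rough illustration of how the columns of data.cb relate to one another, the sketch below builds band limits from point estimates, standard errors and a critical value. The symmetric form (estimate minus/plus critical value times standard error), the pair label and all numbers are assumptions made for illustration; they are not output of binspwc.

```r
# Hypothetical difference estimates and standard errors on a grid of x
x      <- seq(0.1, 0.9, by = 0.2)
cb.fit <- c(0.05, -0.10, 0.20, 0.00, -0.05)
cb.se  <- c(0.08, 0.07, 0.09, 0.07, 0.08)
cval   <- 2.8   # assumed critical value (compare cval.cb)

# Assumed symmetric band: estimate -/+ critical value * standard error
data.cb <- data.frame(pair = "1 vs 0", x = x,
                      cb.fit = cb.fit, cb.se = cb.se,
                      cb.l = cb.fit - cval * cb.se,
                      cb.r = cb.fit + cval * cb.se)

# TRUE if the band excludes zero at some evaluation point
any(data.cb$cb.l > 0 | data.cb$cb.r < 0)
```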
cval.cb
A vector of critical values for all pairwise group comparisons.
imse.var.rot
Variance constant in IMSE expansion, ROT selection.
imse.bsq.rot
Bias constant in IMSE expansion, ROT selection.
imse.var.dpi
Variance constant in IMSE expansion, DPI selection.
imse.bsq.dpi
Bias constant in IMSE expansion, DPI selection.
opt
A list containing options passed to the function, as well as N.by (total sample size for each group), Ndist.by (number of distinct values in x for each group), Nclust.by (number of clusters for each group), nbins.by (number of bins for each group), and byvals (distinct values in by).
y
outcome variable. A vector.
x
independent variable of interest. A vector.
w
control variables. A matrix, a vector or a formula.
data
an optional data frame containing variables used in the model.
estmethod
estimation method. The default is estmethod="reg" for tests based on binscatter least squares regression. Other options are "qreg" for quantile regression and "glm" for generalized linear regression. If estmethod="glm", the option family must be specified.
family
a description of the error distribution and link function to be used in the generalized linear model when estmethod="glm". (See family for details of family functions.)
quantile
the quantile to be estimated. A number strictly between 0 and 1.
deriv
derivative order of the regression function for estimation, testing and plotting. The default is deriv=0, which corresponds to the function itself.
at
value of w at which the estimated function is evaluated. The default is at="mean", which corresponds to the mean of w. Other options are: at="median" for the median of w, at="zero" for a vector of zeros. at can also be a vector of the same length as the number of columns of w (if w is a matrix) or a data frame containing the same variables as specified in w (when data is specified). Note that when at="mean" or at="median", all factor variables (if specified) are excluded from the evaluation (set as zero).
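To make the evaluation-point options concrete, the following base R sketch computes analogues of at="mean", at="median" and at="zero" for a numeric control matrix, alongside a user-supplied vector. The matrix w and its column names are hypothetical, and the sketch only mirrors the description above rather than the internal code.

```r
set.seed(1)
# Hypothetical control matrix with two numeric covariates
w <- cbind(w1 = rnorm(200), w2 = runif(200))

at_mean   <- colMeans(w)                             # analogue of at = "mean"
at_median <- apply(w, 2, median)                     # analogue of at = "median"
at_zero   <- setNames(rep(0, ncol(w)), colnames(w))  # analogue of at = "zero"
at_custom <- c(w1 = 0.5, w2 = 0.25)                  # one entry per column of w

rbind(at_mean, at_median, at_zero, at_custom)
```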
nolink
if true, the function within the inverse link function is reported instead of the conditional mean function for the outcome.
by
a vector containing the group indicator for subgroup analysis; both numeric and string variables are supported. When by is specified, estimation and inference are implemented for each subgroup separately, and the comparison is made between each pair of subgroups. By default, the binning structure is selected for each subgroup separately, but see the option samebinsby below for imposing a common binning structure across subgroups.
pwc
a vector or a logical value. If pwc=c(p,s), a piecewise polynomial of degree p with s smoothness constraints is used for testing the difference between groups. If pwc=T or pwc=NULL (default) is specified, pwc=c(1,1) is used unless the degree p or smoothness s selection is requested via the option pselect or sselect (see more details in the explanation of pselect and sselect).
testtype
type of pairwise comparison test. The default is testtype="two-sided", which corresponds to a two-sided test of the form H0: mu_1(x)=mu_2(x). Other options are: testtype="left" for the one-sided test of the form H0: mu_1(x)<=mu_2(x) and testtype="right" for the one-sided test of the form H0: mu_1(x)>=mu_2(x).
lp
an Lp metric used for pairwise comparison tests. The default is lp=Inf, which corresponds to the sup-norm of the t-statistic. Other options are lp=q for a positive number q>=1. Note that lp=Inf ("sup-norm") has to be used for one-sided tests (testtype="left" or testtype="right").
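The interplay between the test type and the Lp metric can be sketched in base R as follows. The function pw_stat, the example t-statistics, and the grid-averaged form of the finite-q norm are assumptions for illustration; the sketch shows one plausible way of aggregating a t-statistic process over evaluation points, not the package's internal computation.

```r
# Hypothetical t-statistics of mu_1(x) - mu_2(x) on a grid of x
tstat <- c(-0.4, 1.1, 2.6, 0.3, -1.7)

# Hypothetical helper: aggregate the t-statistics into a single test statistic
pw_stat <- function(tstat, testtype = "two-sided", lp = Inf) {
  if (testtype != "two-sided" && is.finite(lp))
    stop("one-sided tests require lp = Inf")
  if (testtype == "left")  return(max(tstat))  # H0: mu_1(x) <= mu_2(x)
  if (testtype == "right") return(min(tstat))  # H0: mu_1(x) >= mu_2(x)
  if (is.infinite(lp)) max(abs(tstat))         # sup-norm
  else mean(abs(tstat)^lp)^(1 / lp)            # assumed grid-averaged Lq norm
}

pw_stat(tstat)                      # two-sided, sup-norm
pw_stat(tstat, lp = 2)              # two-sided, L2-type statistic
pw_stat(tstat, testtype = "right")  # one-sided, infimum
```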
bins
a vector. If bins=c(p,s), it sets the piecewise polynomial of degree p with s smoothness constraints for data-driven (IMSE-optimal) selection of the partitioning/binning scheme. The default is bins=c(0,0), which corresponds to the piecewise constant.
bynbins
a vector of the number of bins for partitioning/binning of x, which is applied to the binscatter estimation for each group. If a single number is specified, it is applied to the estimation for all groups. If bynbins=T or bynbins=NULL (default), the number of bins is selected via the companion function binsregselect in a data-driven way whenever possible. Note: If a vector with more than one number is supplied, it is understood as the number of bins applied to binscatter estimation for each subgroup rather than the range for selecting the number of bins.
binspos
position of binning knots. The default is binspos="qs", which corresponds to quantile-spaced binning (canonical binscatter). The other options are "es" for evenly-spaced binning, or a vector for manual specification of the positions of inner knots (which must be within the range of x).
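The difference between quantile-spaced and evenly-spaced binning can be seen in a few lines of base R. The data, the number of bins J, and the object names are hypothetical; the sketch only illustrates the two knot placements, not how the package computes them.

```r
set.seed(42)
x <- rexp(1000)   # skewed covariate, for illustration only
J <- 5            # assumed number of bins

# Quantile-spaced knots (binspos = "qs"): roughly equal counts per bin
knots_qs <- quantile(x, probs = seq(0, 1, length.out = J + 1), names = FALSE)

# Evenly-spaced knots (binspos = "es"): equal bin widths
knots_es <- seq(min(x), max(x), length.out = J + 1)

table(cut(x, knots_qs, include.lowest = TRUE))
table(cut(x, knots_es, include.lowest = TRUE))

# Inner knots that could be passed as a manual specification
knots_qs[2:J]
```

With skewed data the evenly-spaced bins in the right tail hold very few observations, which is why the quantile-spaced scheme is the canonical choice.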
pselect
vector of numbers within which the degree of polynomial p for point estimation is selected. If the selected optimal degree is p, then piecewise polynomials of degree p+1 are used to conduct pairwise group comparison. Note: To implement the degree or smoothness selection, in addition to pselect or sselect, bynbins=# must be specified.
sselect
vector of numbers within which the number of smoothness constraints s for point estimation is selected. If the selected optimal smoothness is s, then piecewise polynomials with s+1 smoothness constraints are used to conduct pairwise group comparison. If not specified, for each value p supplied in the option pselect, only the piecewise polynomial with the maximum smoothness is considered, i.e., s=p.
binsmethod
method for data-driven selection of the number of bins. The default is binsmethod="dpi", which corresponds to the IMSE-optimal direct plug-in rule. The other option is "rot" for the rule-of-thumb implementation.
nbinsrot
initial number of bins used to construct the DPI number of bins selector. If not specified, the data-driven ROT selector is used instead.
samebinsby
if true, a common partitioning/binning structure across all subgroups specified by the option by is forced. The knot positions are selected according to the option binspos and using the full sample. If bynbins is not specified, then the number of bins is selected via the companion function binsregselect using the full sample.
randcut
upper bound on a uniformly distributed variable used to draw a subsample for bins/degree/smoothness selection. Observations for which runif()<=# are used. # must be between 0 and 1. By default, max(5000, 0.01n) observations are used if the sample size n>5000.
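The subsampling rule can be mimicked in base R as below. The sample size and the mapping from the stated default to a cutoff (dividing max(5000, 0.01n) by n) are assumptions made for illustration.

```r
set.seed(123)
n <- 20000

# Cutoff implied by the stated default when n > 5000 (an assumed mapping)
randcut <- max(5000, 0.01 * n) / n   # 0.25 for n = 20000

# Keep observations whose uniform draw falls at or below the cutoff
keep <- runif(n) <= randcut
sum(keep)   # about 5000 on average; the exact count varies with the draw
```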
nsims
number of random draws for hypothesis testing. The default is nsims=500, which corresponds to 500 draws from a standard Gaussian random vector of size [(p+1)*J - (J-1)*s]. Setting at least nsims=2000 is recommended to obtain the final results.
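The size of the Gaussian vector follows from counting free coefficients: J bins with p+1 polynomial coefficients each, minus s constraints at each of the J-1 inner knots. A small sketch, with a hypothetical helper n_coef:

```r
# Free coefficients of a piecewise polynomial of degree p with
# s smoothness constraints at each of the J - 1 inner knots
n_coef <- function(p, s, J) (p + 1) * J - (J - 1) * s

n_coef(p = 1, s = 1, J = 10)  # continuous piecewise linear: 11
n_coef(p = 0, s = 0, J = 10)  # piecewise constant: 10

# nsims standard Gaussian draws of that size, one draw per row
nsims <- 500
draws <- matrix(rnorm(nsims * n_coef(1, 1, 10)), nrow = nsims)
dim(draws)
```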
simsgrid
number of evaluation points of an evenly-spaced grid within each bin used for evaluation of the supremum (infimum or Lp metric) operation needed to construct hypothesis testing procedures. The default is simsgrid=20, which corresponds to 20 evenly-spaced evaluation points within each bin for approximating the supremum (infimum or Lp metric) operator. Setting at least simsgrid=50 is recommended to obtain the final results.
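One way to picture such a grid is sketched below in base R. The knots are hypothetical, and placing the points strictly inside each bin (dropping the bin endpoints) is an assumption of this sketch rather than a statement about the package's grid.

```r
# Hypothetical knots defining J = 4 bins on [0, 1]
knots    <- c(0, 0.25, 0.5, 0.75, 1)
simsgrid <- 20

# simsgrid evenly-spaced interior points in each bin (endpoints dropped)
grid <- unlist(lapply(seq_len(length(knots) - 1), function(j) {
  seq(knots[j], knots[j + 1], length.out = simsgrid + 2)[-c(1, simsgrid + 2)]
}))

length(grid)   # J * simsgrid evaluation points
```

A finer grid approximates the supremum more closely at the cost of more evaluations per simulation draw.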
simsseed
seed for simulation.
vce
procedure to compute the variance-covariance matrix estimator. For least squares regression and generalized linear regression, the allowed options are the same as those for binsreg or binsglm. For quantile regression, the allowed options are the same as those for binsqreg.
cluster
cluster ID. Used to compute cluster-robust standard errors.
asyvar
if true, the standard error of the nonparametric component is computed and the uncertainty related to control variables is omitted. Default is asyvar=FALSE, that is, the uncertainty related to control variables is taken into account.
dfcheck
adjustments for minimum effective sample size checks, which take into account the number of unique values of x (i.e., number of mass points), the number of clusters, and the degrees of freedom of the different statistical models considered. The default is dfcheck=c(20, 30). See Cattaneo, Crump, Farrell and Feng (2024c) for more details.
masspoints
how mass points in x are handled. Available options:
"on"
all mass point and degrees of freedom checks are implemented. Default.
"noadjust"
mass point checks and the corresponding effective sample size adjustments are omitted.
"nolocalcheck"
within-bin mass point and degrees of freedom checks are omitted.
"off"
"noadjust" and "nolocalcheck" are set simultaneously.
"veryfew"
forces the function to proceed as if x has only a few mass points (i.e., distinct values). In other words, it forces the function to proceed as if the mass point and degrees of freedom checks had failed.
weights
an optional vector of weights to be used in the fitting process. Should be NULL or a numeric vector. For more details, see lm.
subset
optional rule specifying a subset of observations to be used.
numdist
number of distinct values for selection. Used to speed up computation.
numclust
number of clusters for selection. Used to speed up computation.
estmethodopt
a list of optional arguments used by rq (for quantile regression) or glm (for fitting generalized linear models).
plot
if true, the confidence bands for all pairwise group comparisons (the difference between each pair of groups) are plotted. The degree and smoothness of polynomials used to construct the bands are the same as those specified for testing. The default is plot=F, i.e., no plot is generated.
dotsngrid
number of dots to be added to the plot for confidence bands. Given the choice, these dots are point estimates of the difference between groups evaluated over an evenly-spaced grid within the common support of all groups. The default is dotsngrid=0, i.e., no point estimates are added. Whenever possible, the degree and smoothness of the polynomial for these point estimates are the same as those for selecting the number of bins; otherwise, the degree and smoothness specified for testing are used.
plotxrange
a vector. plotxrange=c(min, max) specifies a range of the x-axis for plotting. Observations outside the range are dropped in the plot.
plotyrange
a vector. plotyrange=c(min, max) specifies a range of the y-axis for plotting. Observations outside the range are dropped in the plot.
colors
an ordered list of colors for plotting the difference between each pair of groups.
symbols
an ordered list of symbols for plotting the difference between each pair of groups.
level
nominal confidence level for confidence band estimation. Default is level=95.
...
optional arguments to control bootstrapping if estmethod="qreg" and vce="boot". See boot.rq.
Matias D. Cattaneo, Princeton University, Princeton, NJ. cattaneo@princeton.edu.
Richard K. Crump, Federal Reserve Bank of New York, New York, NY. richard.crump@ny.frb.org.
Max H. Farrell, UC Santa Barbara, Santa Barbara, CA. mhfarrell@gmail.com.
Yingjie Feng (maintainer), Tsinghua University, Beijing, China. fengyingjiepku@gmail.com.
Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024a: On Binscatter. American Economic Review 114(5): 1488-1514.
Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024b: Nonlinear Binscatter Methods. Working Paper.
Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024c: Binscatter Regressions. Working Paper.
binsreg, binsqreg, binsglm, binsregselect, binstest.
x <- runif(500)
y <- sin(x) + rnorm(500)
t <- 1 * (runif(500) > 0.5)
## Pairwise comparison of the two groups defined by t
binspwc(y, x, by = t)
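A slightly fuller call, using only arguments and return components described on this page, might look like the sketch below. The seed values, the object names g and est, and the guard that skips the example when the binsreg package is not installed are choices made for illustration.

```r
if (requireNamespace("binsreg", quietly = TRUE)) {
  library(binsreg)
  set.seed(2024)
  x <- runif(500)
  y <- sin(x) + rnorm(500)
  g <- 1 * (runif(500) > 0.5)   # hypothetical two-group indicator

  # Two-sided sup-norm test with the recommended larger nsims and simsgrid
  est <- binspwc(y, x, by = g, pwc = c(1, 1), testtype = "two-sided",
                 nsims = 2000, simsgrid = 50, simsseed = 1)

  est$stat   # test statistic and group numbers for each comparison
  est$pval   # p-value for each pairwise comparison
}
```

Fixing simsseed makes the simulated p-values reproducible across runs.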