binsregselect implements data-driven procedures for selecting the number of bins for binscatter
estimation. The selected number of bins is optimal in the sense that it minimizes the integrated mean squared error (IMSE).
binsregselect(y, x, w = NULL, data = NULL, deriv = 0, bins = NULL,
pselect = NULL, sselect = NULL, binspos = "qs", nbins = NULL,
binsmethod = "dpi", nbinsrot = NULL, simsgrid = 20, savegrid = F,
vce = "HC1", useeffn = NULL, randcut = NULL, cluster = NULL,
dfcheck = c(20, 30), masspoints = "on", weights = NULL,
subset = NULL, norotnorm = F, numdist = NULL, numclust = NULL)
nbinsrot.poly
ROT number of bins, unregularized.
nbinsrot.regul
ROT number of bins, regularized.
nbinsrot.uknot
ROT number of bins, unique knots.
nbinsdpi
DPI number of bins.
nbinsdpi.uknot
DPI number of bins, unique knots.
prot.poly
ROT degree of polynomials, unregularized.
prot.regul
ROT degree of polynomials, regularized.
prot.uknot
ROT degree of polynomials, unique knots.
pdpi
DPI degree of polynomials.
pdpi.uknot
DPI degree of polynomials, unique knots.
srot.poly
ROT number of smoothness constraints, unregularized.
srot.regul
ROT number of smoothness constraints, regularized.
srot.uknot
ROT number of smoothness constraints, unique knots.
sdpi
DPI number of smoothness constraints.
sdpi.uknot
DPI number of smoothness constraints, unique knots.
imse.var.rot
Variance constant in IMSE expansion, ROT selection.
imse.bsq.rot
Bias constant in IMSE expansion, ROT selection.
imse.var.dpi
Variance constant in IMSE expansion, DPI selection.
imse.bsq.dpi
Bias constant in IMSE expansion, DPI selection.
int.result
Intermediate results, including a matrix of degree and smoothness (deg_mat),
the selected numbers of bins (vec.nbinsrot.poly, vec.nbinsrot.regul,
vec.nbinsrot.uknot, vec.nbinsdpi, vec.nbinsdpi.uknot),
and the bias and variance constants in IMSE (vec.imse.b.rot,
vec.imse.v.rot, vec.imse.b.dpi, vec.imse.v.dpi)
under each rule (ROT or DPI), corresponding to each pair of degree and smoothness
(each row in deg_mat).
opt
A list containing options passed to the function, as well as the total sample size n,
the number of distinct values Ndist in x, and the number of clusters Nclust.
data.grid
A data frame containing the grid, produced when savegrid is true.
y
outcome variable. A vector.
x
independent variable of interest. A vector.
w
control variables. A matrix, a vector or a formula.
data
an optional data frame containing variables used in the model.
deriv
derivative order of the regression function for estimation, testing and plotting.
The default is deriv=0, which corresponds to the function itself.
bins
a vector. bins=c(p,s) sets a piecewise polynomial of degree p with s smoothness constraints
for data-driven (IMSE-optimal) selection of the partitioning/binning scheme. By default, the function sets
bins=c(0,0), which corresponds to a piecewise constant fit (canonical binscatter).
pselect
vector of numbers within which the degree of polynomial p for point estimation is selected.
Note: To implement the degree or smoothness selection, in addition to pselect or sselect,
nbins=# must be specified.
sselect
vector of numbers within which the number of smoothness constraints s for point estimation is selected.
If not specified, for each value p supplied in the option pselect, only the
piecewise polynomial with the maximum smoothness is considered, i.e., s=p.
binspos
position of binning knots. The default is binspos="qs", which corresponds to quantile-spaced
binning (canonical binscatter). The other option is binspos="es" for evenly-spaced binning.
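The difference between the two knot placements can be illustrated with a minimal base R sketch. It is not the package's internal code: the data, the number of bins J, and the variable names are assumptions chosen for illustration, and the quantile definition used here is simply the default of quantile().

```r
# Illustrative only: base R, not the internal implementation of binsregselect.
set.seed(1)
x <- rexp(500)   # skewed covariate, so the two schemes differ visibly
J <- 5           # hypothetical number of bins

# Quantile-spaced knots: each bin holds roughly the same number of observations.
knots_qs <- quantile(x, probs = seq(0, 1, length.out = J + 1), names = FALSE)

# Evenly-spaced knots: each bin has the same width on the support of x.
knots_es <- seq(min(x), max(x), length.out = J + 1)

# Observations per bin under each scheme.
counts_qs <- table(cut(x, breaks = knots_qs, include.lowest = TRUE))
counts_es <- table(cut(x, breaks = knots_es, include.lowest = TRUE))
print(counts_qs)
print(counts_es)
```

With skewed data, the evenly-spaced bins in the right tail contain few observations, while the quantile-spaced bins stay balanced.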
nbins
number of bins for degree/smoothness selection. If nbins=T or nbins=NULL (default) is specified,
the function selects the number of bins instead, given the specified degree and smoothness.
If a vector with more than one number is specified, the command selects the number of bins within this vector.
binsmethod
method for data-driven selection of the number of bins. The default is binsmethod="dpi",
which corresponds to the IMSE-optimal direct plug-in rule. The other option is "rot"
for the rule-of-thumb implementation.
nbinsrot
initial number of bins used to construct the DPI number of bins selector. If not specified, the data-driven ROT selector is used instead.
simsgrid
number of evaluation points of an evenly-spaced grid within each bin used for evaluation of
the supremum (infimum or Lp metric) operation needed to construct confidence bands and hypothesis testing
procedures. The default is simsgrid=20, which corresponds to 20 evenly-spaced
evaluation points within each bin for approximating the supremum (infimum or Lp metric) operator.
savegrid
if true, a data frame containing the grid is produced.
vce
procedure to compute the variance-covariance matrix estimator. Options are:
"const"
homoskedastic variance estimator.
"HC0"
heteroskedasticity-robust plug-in residuals variance estimator
without weights.
"HC1"
heteroskedasticity-robust plug-in residuals variance estimator
with hc1 weights. Default.
"HC2"
heteroskedasticity-robust plug-in residuals variance estimator
with hc2 weights.
"HC3"
heteroskedasticity-robust plug-in residuals variance estimator
with hc3 weights.
useeffn
effective sample size to be used when computing the (IMSE-optimal) number of bins. This option is useful for extrapolating the optimal number of bins to larger (or smaller) datasets than the one used to compute it.
randcut
upper bound on a uniformly distributed variable used to draw a subsample for bins/degree/smoothness selection.
Observations for which runif()<=# are used. # must be between 0 and 1.
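The subsampling rule described for randcut can be sketched in base R as follows. This only mimics the stated rule (keep observations whose uniform draw is at most the cutoff). The cutoff value and all variable names are assumptions for illustration, and the package's own random draws will differ.

```r
# Illustrative only: mimics the rule "observations for which runif() <= # are used".
set.seed(42)
n <- 1000
x <- runif(n)
y <- sin(x) + rnorm(n)

cutoff <- 0.3            # hypothetical value one might pass as randcut
u <- runif(n)            # one uniform draw per observation
keep <- u <= cutoff      # logical index of the retained subsample

x_sub <- x[keep]
y_sub <- y[keep]

# The retained share is close to the cutoff in large samples.
print(mean(keep))
```

Selection would then be based on the pairs (y_sub, x_sub), which reduces computation on large datasets at the cost of extra sampling noise in the selected tuning parameters.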
cluster
cluster ID. Used to compute cluster-robust standard errors.
dfcheck
adjustments for minimum effective sample size checks, which take into account the number of unique
values of x (i.e., number of mass points), the number of clusters, and the degrees of freedom of
the different statistical models considered. The default is dfcheck=c(20, 30).
See Cattaneo, Crump, Farrell and Feng (2024c) for more details.
masspoints
how mass points in x are handled. Available options:
"on"
all mass point and degrees of freedom checks are implemented. Default.
"noadjust"
mass point checks and the corresponding effective sample size adjustments are omitted.
"nolocalcheck"
within-bin mass point and degrees of freedom checks are omitted.
"off"
"noadjust" and "nolocalcheck" are set simultaneously.
"veryfew"
forces the function to proceed as if x has only a few mass points (i.e., distinct values).
In other words, forces the function to proceed as if the mass point and degrees of freedom checks had failed.
weights
an optional vector of weights to be used in the fitting process. Should be NULL or
a numeric vector. For more details, see lm.
subset
optional rule specifying a subset of observations to be used.
norotnorm
if true, a uniform density rather than a normal density is used for ROT selection.
numdist
number of distinct values for selection. Used to speed up computation.
numclust
number of clusters for selection. Used to speed up computation.
Matias D. Cattaneo, Princeton University, Princeton, NJ. cattaneo@princeton.edu.
Richard K. Crump, Federal Reserve Bank of New York, New York, NY. richard.crump@ny.frb.org.
Max H. Farrell, UC Santa Barbara, Santa Barbara, CA. mhfarrell@gmail.com.
Yingjie Feng (maintainer), Tsinghua University, Beijing, China. fengyingjiepku@gmail.com.
Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024a: On Binscatter. American Economic Review 114(5): 1488-1514.
Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024b: Nonlinear Binscatter Methods. Working Paper.
Cattaneo, M. D., R. K. Crump, M. H. Farrell, and Y. Feng. 2024c: Binscatter Regressions. Working Paper.
binsreg, binstest.
x <- runif(500); y <- sin(x)+rnorm(500)
est <- binsregselect(y,x)
summary(est)
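A few further calls, sketched under the assumption that the binsreg package is installed. They use only arguments and return components listed on this page. The simulated data and the specific values (bins=c(1,1), pselect=0:2, nbins=20) are arbitrary choices for illustration.

```r
# Illustrative sketch; runs only if the binsreg package is available.
if (requireNamespace("binsreg", quietly = TRUE)) {
  library(binsreg)
  set.seed(123)
  x <- runif(500)
  y <- sin(x) + rnorm(500)

  # Rule-of-thumb selection of the number of bins for a piecewise linear fit
  # with one smoothness constraint.
  est_rot <- binsregselect(y, x, bins = c(1, 1), binsmethod = "rot")
  print(est_rot$nbinsrot.regul)

  # Degree selection: nbins must be fixed when pselect is supplied.
  est_p <- binsregselect(y, x, pselect = 0:2, nbins = 20)
  print(est_p$pdpi)
}
```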