This package provides a variety of nonparametric and semiparametric
	kernel methods that seamlessly handle a mix of continuous, unordered,
	and ordered factor data types (unordered and ordered factors are often
	referred to as ‘nominal’ and ‘ordinal’ categorical
variables, respectively). A vignette intended to serve as a gentle
	introduction to this package, and containing many of the examples
	found in the help files accompanying the np package, can be
	accessed via vignette("np", package="np").
For a listing of all routines in the np package type: ‘library(help="np")’.
Bandwidth selection is a key aspect of sound nonparametric and
  semiparametric kernel estimation. np is designed from the
  ground up to make bandwidth selection the focus of attention. To this
  end, one typically begins by creating a ‘bandwidth object’
  which embodies all aspects of the method, including specific kernel
  functions, data names, data types, and the like. One then passes these
  bandwidth objects to other functions, and those functions can grab the
  specifics from the bandwidth object thereby removing potential
  inconsistencies and unnecessary repetition. Furthermore, many
  functions such as plot (which automatically calls
  npplot) can work with the bandwidth object directly, without
  requiring a prior call to the companion estimation function.
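For example, a minimal sketch of this two-step workflow for a bivariate density (using the built-in faithful data and default settings; object names are illustrative) might look as follows:

    library(np)
    data("faithful")

    ## Step 1: create the bandwidth object (likelihood cross-validation by default)
    bw <- npudensbw(~ eruptions + waiting, data = faithful)
    summary(bw)

    ## Step 2: pass the bandwidth object to the companion estimation routine
    fhat <- npudens(bws = bw)
    summary(fhat)

    ## plot() accepts the bandwidth object directly (it calls npplot)
    plot(bw)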
As of np version 0.20-0, we allow the user to combine these
  steps. When using np versions 0.20-0 and higher, if the first
  step (bandwidth selection) is not performed explicitly, the second
  step will automatically invoke the omitted bandwidth selector
  using defaults unless otherwise specified, and the bandwidth object
  can then be retrieved retroactively if so desired via
  objectname$bws. Furthermore, options for bandwidth selection
  will be passed directly to the bandwidth selector function. Note that
  the combined approach would not be a wise choice for certain
  applications such as when bootstrapping (as it would involve
  unnecessary computation since the bandwidths would properly be those
  for the original sample and not the bootstrap resamples) or when
  conducting quantile regression (as it would involve unnecessary
  computation when different quantiles are computed from the same
  conditional cumulative distribution estimate).
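A minimal sketch of the combined approach (simulated data; the bwmethod option shown is simply passed through to the bandwidth selector, here requesting AICc-based selection) might be:

    library(np)
    set.seed(42)
    n <- 250
    x <- rnorm(n)
    y <- sin(x) + rnorm(n, sd = 0.25)

    ## bandwidth selection is invoked automatically; options such as
    ## bwmethod = "cv.aic" are passed on to the bandwidth selector
    model <- npreg(y ~ x, bwmethod = "cv.aic")

    ## the bandwidth object can be retrieved retroactively
    bw <- model$bws
    summary(bw)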
There are two ways in which you can interact with functions in
  np, either i) using data frames, or ii) using a formula
  interface, where appropriate.
To some, it may be natural to use the data frame interface.  The R
  data.frame function preserves a variable's type once it
  has been cast (unlike cbind, which we avoid for this
  reason). If you find this most natural for your project, you first
  create a data frame, casting each variable according to its type
  (i.e., one of continuous (the default, numeric), factor,
  or ordered). Then you would simply pass this data frame to
  the appropriate np function, for example
  npudensbw(dat=data).
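For instance, a minimal sketch of the data frame interface for mixed data types (simulated data with illustrative variable names) could be:

    library(np)
    set.seed(42)
    n <- 250
    income    <- rnorm(n, mean = 50, sd = 10)                            # continuous (numeric)
    sex       <- factor(sample(c("MALE", "FEMALE"), n, replace = TRUE))  # unordered factor
    education <- ordered(sample(1:4, n, replace = TRUE))                 # ordered factor
    data <- data.frame(income, sex, education)

    ## joint density bandwidths for the mixed-type data frame
    bw <- npudensbw(dat = data)
    summary(bw)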
To others, however, it may be natural to use the formula interface
  that is used for the regression examples, among others. For
  nonparametric regression functions such as npreg, you
  would proceed as you would using lm (e.g., bw <-
  npregbw(y~factor(x1)+x2)), except that you would of course not need to
  specify, e.g., polynomials in variables or interaction terms, nor create
  a number of dummy variables for a factor. Every function in np
  supports both interfaces, where appropriate.
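For instance, a minimal sketch paralleling the lm workflow (simulated data with illustrative variable names; default settings throughout) might be:

    library(np)
    set.seed(42)
    n  <- 250
    x1 <- sample(c("A", "B", "C"), n, replace = TRUE)  # categorical predictor
    x2 <- rnorm(n)
    y  <- (x1 == "A") + x2^2 + rnorm(n, sd = 0.25)

    ## no dummy variables, polynomials, or interaction terms need be specified
    bw    <- npregbw(y ~ factor(x1) + x2)
    model <- npreg(bws = bw)
    summary(model)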
Note that if your factor is in fact a character string such as, say,
	X being either "MALE" or "FEMALE", np will handle
	this directly, i.e., there is no need to map the string values into
	unique integers such as (0,1). Once the user casts a variable as a
	particular data type (i.e., factor,
	ordered, or continuous (default,
	numeric)), all subsequent methods automatically detect
	the type and use the appropriate kernel function and estimation
	method.
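A brief illustration (hypothetical variable names):

    library(np)
    set.seed(42)
    sex  <- sample(c("MALE", "FEMALE"), 250, replace = TRUE)  # character strings
    wage <- ifelse(sex == "MALE", 10, 11) + rnorm(250)

    ## cast the character variable as a factor; no 0/1 recoding is required
    bw <- npregbw(wage ~ factor(sex))
    summary(bw)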
All estimation methods are fully multivariate, i.e., there are no limitations on the number of variables one can model (or the number of observations, for that matter). Execution time for most routines, however, increases rapidly with the number of observations and also grows with the number of variables involved.
Nonparametric methods include unconditional density (distribution), conditional density (distribution), regression, mode, and quantile estimators along with gradients where appropriate, while semiparametric methods include single index, partially linear, and smooth (i.e., varying) coefficient models.
A number of tests are included such as consistent specification tests for parametric regression and quantile regression models along with tests of significance for nonparametric regression.
A variety of bootstrap methods for computing standard errors, nonparametric confidence bounds, and bias-corrected bounds are implemented.
A variety of bandwidth methods are implemented including fixed, nearest-neighbor, and adaptive nearest-neighbor.
A variety of data-driven methods of bandwidth selection are implemented, while the user can specify their own bandwidths should they so choose (either a raw bandwidth or scaling factor).
A flexible plotting utility, npplot (which is
	automatically invoked by plot), facilitates graphing of
	multivariate objects. An example of creating PostScript graphs using
	the npplot utility and pulling them into a LaTeX
	document is provided.
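One common pattern (file name and dimensions are illustrative) writes the graph to an encapsulated PostScript file that can then be included in a LaTeX document via \includegraphics:

    library(np)
    data("faithful")
    bw <- npudensbw(~ eruptions + waiting, data = faithful)

    ## send the plot to an EPS file rather than the screen
    postscript("density.eps", horizontal = FALSE, onefile = FALSE,
               paper = "special", width = 5, height = 5)
    plot(bw)
    dev.off()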
The function npksum allows users to create or implement
	their own kernel estimators or tests should they so desire.
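For example, the local constant (Nadaraya-Watson) regression estimator can be assembled by hand from two calls to npksum; a minimal sketch with simulated data and an ad hoc (not data-driven) bandwidth follows:

    library(np)
    set.seed(42)
    n <- 250
    x <- rnorm(n)
    y <- sin(x) + rnorm(n, sd = 0.25)
    h <- 0.5  # ad hoc bandwidth, for illustration only

    ## Nadaraya-Watson: sum_i y_i k((x_i - x)/h) / sum_i k((x_i - x)/h)
    ghat <- npksum(txdat = x, tydat = y, bws = h)$ksum /
            npksum(txdat = x, bws = h)$ksum

    plot(x, y, cex = 0.5)
    points(x, ghat, col = "blue", cex = 0.5)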
The underlying functions are written in C for computational
	efficiency. Despite this, due to their nature, data-driven bandwidth
	selection methods involving multivariate numerical search can be
	time-consuming, particularly for large datasets. A version of this
	package that uses the Rmpi wrapper is under development; it allows
	one to deploy this software in a clustered computing environment to
	facilitate computation involving large datasets.
To cite the np package, type citation("np") from within
  R.
The kernel methods in np employ the so-called
	`generalized product kernels' found in Hall, Racine, and Li (2004),
	Li, Lin, and Racine (2013), Li, Ouyang, and Racine (2013), Li and
	Racine (2003), Li and Racine (2004), Li and Racine (2007), Li and
	Racine (2010), Ouyang, Li, and Racine (2006), and Racine and Li
	(2004), among others. For details on a particular method, kindly refer
	to the original references listed above.
We briefly describe the particulars of various univariate kernels used
  to generate the generalized product kernels that underlie the kernel
  estimators implemented in the np package. In a nutshell, the
  generalized kernel functions that underlie the kernel estimators in
  np are formed by taking the product of univariate kernels such
  as those listed below. When you cast your data as a particular type
  (continuous, factor, or ordered factor) in a data frame or formula,
  the routines will automatically recognize the type of variable being
  modelled and use the appropriate kernel type for each variable in the
  resulting estimator.
The second order Gaussian kernel is \(k(z) = \exp(-z^2/2)/\sqrt{2\pi}\), where \(z=(x_i-x)/h\) and \(h>0\).
The truncated Gaussian kernel is \(k(z) = (\exp(-z^2/2)-\exp(-b^2/2))/(\textrm{erf}(b/\sqrt{2})\sqrt{2\pi}-2b\exp(-b^2/2))\), where \(z=(x_i-x)/h\), \(b>0\), \(|z|\le b\), and \(h>0\).
See nptgauss for details on modifying \(b\).
The second order Epanechnikov kernel is \(k(z) = 3\left(1 - z^2/5\right)/(4\sqrt{5})\) if \(z^2<5\), \(0\) otherwise, where \(z=(x_i-x)/h\) and \(h>0\).
The uniform kernel is \(k(z) = 1/2\) if \(|z|<1\), \(0\) otherwise, where \(z=(x_i-x)/h\) and \(h>0\).
The Aitchison and Aitken kernel for unordered factors is \(l(x_i,x,\lambda) = 1 - \lambda\) if \(x_i=x\), and \(\lambda/(c-1)\) if \(x_i \neq x\), where \(c\) is the number of (discrete) outcomes assumed by the factor \(x\).
Note that \(\lambda\) must lie between \(0\) and \((c-1)/c\).
The Wang and van Ryzin kernel for ordered factors is \(l(x_i,x,\lambda) = 1 - \lambda\) if \(|x_i-x|=0\), and \(((1-\lambda)/2)\lambda^{|x_i-x|}\) if \(|x_i - x|\ge1\).
Note that \(\lambda\) must lie between \(0\) and \(1\).
The (unordered) Li and Racine kernel is \(l(x_i,x,\lambda) = 1\) if \(x_i=x\), and \(\lambda\) if \(x_i \neq x\).
Note that \(\lambda\) must lie between \(0\) and \(1\).
The normalized (unordered) Li and Racine kernel is \(l(x_i,x,\lambda) = 1/(1+(c-1)\lambda)\) if \(x_i=x\), and \(\lambda/(1+(c-1)\lambda)\) if \(x_i \neq x\).
Note that \(\lambda\) must lie between \(0\) and \(1\).
The (ordered) Li and Racine kernel is \(l(x_i,x,\lambda) = 1\) if \(|x_i-x|=0\), and \(\lambda^{|x_i-x|}\) if \(|x_i - x|\ge1\).
Note that \(\lambda\) must lie between \(0\) and \(1\).
The normalized (ordered) Li and Racine kernel is \(l(x_i,x,\lambda) = (1-\lambda)/(1+\lambda)\) if \(|x_i-x|=0\), and \(((1-\lambda)/(1+\lambda))\lambda^{|x_i-x|}\) if \(|x_i - x|\ge1\).
Note that \(\lambda\) must lie between \(0\) and \(1\).
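To make the product construction described above concrete, here is a minimal sketch in plain R (independent of the np internals; the function names k.gauss, l.aa, and K.prod are illustrative) that combines the second order Gaussian kernel with the Aitchison and Aitken kernel for one continuous and one unordered variable:

    ## second order Gaussian kernel for a continuous variable
    k.gauss <- function(xi, x, h) exp(-((xi - x)/h)^2/2)/sqrt(2*pi)

    ## Aitchison and Aitken kernel for an unordered factor with c outcomes
    l.aa <- function(xi, x, lambda, c) ifelse(xi == x, 1 - lambda, lambda/(c - 1))

    ## generalized product kernel for one continuous and one unordered variable
    K.prod <- function(x1i, x1, h, x2i, x2, lambda, c)
      k.gauss(x1i, x1, h) * l.aa(x2i, x2, lambda, c)

    ## e.g., evaluate for one sample point
    K.prod(x1i = 0.3, x1 = 0, h = 0.5,
           x2i = "FEMALE", x2 = "MALE", lambda = 0.25, c = 2)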
So, if you had two variables, \(x_{i1}\) and
  \(x_{i2}\), where \(x_{i1}\) was continuous and
  \(x_{i2}\) was, say, binary (0/1), and you created a data
  frame of the form X <- data.frame(x1,x2=factor(x2)), then the
  kernel function used by np would be
  \(K(\cdot)=k(\cdot)\times l(\cdot)\), where the
  particular kernel functions \(k(\cdot)\) and
  \(l(\cdot)\) would by default be the second order Gaussian
  (ckertype="gaussian") and Aitchison and Aitken
  (ukertype="aitchisonaitken") kernels, respectively. Note that
  for conditional density and distribution objects you can
  specify kernels for the left-hand side and right-hand side variables
  in this manner using cykertype="gaussian",
  cxkertype="gaussian" and uykertype="aitchisonaitken",
  uxkertype="aitchisonaitken".
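A minimal sketch of this two-variable case, including how one might override the default kernel types (simulated data; object names are illustrative):

    library(np)
    set.seed(42)
    n  <- 250
    x1 <- rnorm(n)
    x2 <- rbinom(n, 1, 0.5)
    X  <- data.frame(x1, x2 = factor(x2))

    ## defaults: second order Gaussian and Aitchison-Aitken kernels
    bw.default <- npudensbw(dat = X)

    ## overriding the continuous and unordered kernel types
    bw.custom  <- npudensbw(dat = X, ckertype = "epanechnikov",
                            ukertype = "liracine")
    summary(bw.custom)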
Note that higher order continuous kernels (i.e., fourth, sixth, and eighth order) are derived from the second order kernels given above (see Li and Racine (2007) for details).
For particulars on any given method, kindly see the references listed for the method in question.
Aitchison, J. and C.G.G. Aitken (1976), “Multivariate binary discrimination by the kernel method,” Biometrika, 63, 413-420.
Hall, P. and J.S. Racine and Q. Li (2004), “Cross-validation and the estimation of conditional probability densities,” Journal of the American Statistical Association, 99, 1015-1026.
Li, Q. and J. Lin and J.S. Racine (2013), “Optimal bandwidth selection for nonparametric conditional distribution and quantile functions”, Journal of Business and Economic Statistics, 31, 57-65.
Li, Q. and D. Ouyang and J.S. Racine (2013), “Categorical Semiparametric Varying-Coefficient Models,” Journal of Applied Econometrics, 28, 551-589.
Li, Q. and J.S. Racine (2003), “Nonparametric estimation of distributions with categorical and continuous data,” Journal of Multivariate Analysis, 86, 266-292.
Li, Q. and J.S. Racine (2004), “Cross-validated local linear nonparametric regression,” Statistica Sinica, 14, 485-512.
Li, Q. and J.S. Racine (2007), Nonparametric Econometrics: Theory and Practice, Princeton University Press.
Li, Q. and J.S. Racine (2010), “Smooth varying-coefficient estimation and inference for qualitative and quantitative data,” Econometric Theory, 26, 1-31.
Ouyang, D. and Q. Li and J.S. Racine (2006), “Cross-validation and the estimation of probability distributions with categorical data,” Journal of Nonparametric Statistics, 18, 69-100.
Racine, J.S. and Q. Li (2004), “Nonparametric estimation of regression functions with both categorical and continuous data,” Journal of Econometrics, 119, 99-130.
Pagan, A. and A. Ullah (1999), Nonparametric Econometrics, Cambridge University Press.
Scott, D.W. (1992), Multivariate Density Estimation: Theory, Practice and Visualization, New York: Wiley.
Silverman, B.W. (1986), Density Estimation, London: Chapman and Hall.
Wang, M.C. and J. van Ryzin (1981), “A class of smooth estimators for discrete distributions,” Biometrika, 68, 301-309.