A tidy reimplementation of the data simulation functions implemented in
mgcv::gamSim(); the simulated data can be used to fit GAMs. A new feature
is that the sampling distribution can be applied to all the example types.
data_sim(
  model = "eg1",
  n = 400,
  scale = NULL,
  theta = 3,
  power = 1.5,
  dist = c("normal", "poisson", "binary", "negbin", "tweedie", "gamma", "ocat",
    "ordered categorical"),
  n_cat = 4,
  cuts = c(-1, 0, 5),
  seed = NULL,
  gfam_families = c("binary", "tweedie", "normal")
)
model
character; either "egX" where X is an integer 1:7, or the name of a model. See Details for possible options.
n
numeric; the number of observations to simulate.
scale
numeric; the level of noise to use.
theta
numeric; the dispersion parameter \(\theta\) to use. The default is entirely arbitrary, chosen only to provide simulated data that exhibits extra dispersion beyond that assumed under a Poisson.
power
numeric; the Tweedie power parameter.
dist
character; a sampling distribution for the response variable. "ordered categorical" is a synonym of "ocat".
n_cat
integer; the number of categories for a categorical response. Currently only used for dist %in% c("ocat", "ordered categorical").
cuts
numeric; vector of cut points on the latent variable, excluding the end points -Inf and Inf. Must be one fewer than the number of categories: length(cuts) == n_cat - 1.
seed
numeric; the seed for the random number generator. Passed to base::set.seed().
gfam_families
character; a vector of distributions to use in generating data with grouped families for use with family = gfam(). The allowed distributions are as per dist.
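As a rough illustration of how cuts relates to n_cat, the base R sketch below bins a latent variable with three cut points into four ordered categories. The latent values and the use of findInterval() are assumptions made purely for illustration; data_sim() may construct its ordered categorical response differently.

```r
# Hypothetical latent values, chosen so that each category occurs once
latent <- c(-2, -0.5, 3, 6)

# Three cut points split the real line into four intervals
cuts <- c(-1, 0, 5)
n_cat <- length(cuts) + 1

# findInterval() counts how many cut points are <= each latent value,
# so adding 1 gives a category label in 1, ..., n_cat
category <- findInterval(latent, cuts) + 1

# The constraint stated for the cuts argument
stopifnot(length(cuts) == n_cat - 1)
category
```

The end points -Inf and Inf are implicit here, which is why only the interior cut points are supplied.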
data_sim() can simulate data from several underlying models of known true
functions. The available options currently are:
"eg1"
: a four term additive true model. This is the classic Gu & Wahba
four univariate term test model. See gw_functions for more details of
the underlying four functions.
"eg2"
: a bivariate smooth true model.
"eg3"
: an example containing a continuous by smooth (varying
coefficient) true model. The model is \(\hat{y}_i = f_2(x_{1i})x_{2i}\) where the function \(f_2()\) is
\(f_2(x) = 0.2 x^{11} (10 (1 - x))^6 + 10 (10 x)^3 (1 - x)^{10}\).
"eg4"
: a factor by smooth true model. The true model contains a factor
with 3 levels, where the response for the nth level follows the nth
Gu & Wahba function (for \(n \in \{1, 2, 3\}\)).
"eg5"
: an additive plus factor true model. The response is a linear
combination of the Gu & Wahba functions 2, 3, and 4 (the last of which is
a null function) plus a factor term with four levels.
"eg6"
: an additive plus random effect term true model.
"eg7"
: a version of the model in "eg1", but where the covariates are
correlated.
"gwf2"
: a model where the response is Gu & Wahba's
\(f_2(x_i)\) plus noise.
"lwf6"
: a model where the response is Luo & Wahba's "example 6"
function \(\sin(2(4x-2)) + 2 \exp(-256(x-0.5)^2)\) plus noise.
"gfam"
: simulates data for use with GAMs with family = gfam(families). See the
example in mgcv::gfam(). If this model is specified then dist is ignored
and gfam_families is used to specify which distributions are included in
the simulated data. gfam_families can be a vector of any of the families
allowed by dist. If "ocat" %in% gfam_families (or "ordered categorical"),
4 classes are assumed, which can't be changed. Link functions used are
"identity" for "normal", "logit" for "binary", "ocat", and
"ordered categorical", and "exp" elsewhere.
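The two true functions written out in the list above can be coded directly in base R. The sketch below does so and adds noise by hand; the function names f2_sketch and lw6_sketch, the uniform covariates on [0, 1], and the Gaussian noise with unit standard deviation are all assumptions for illustration, not a description of how data_sim() generates its data.

```r
# Gu & Wahba f2, as written in the description of "eg3" (hypothetical name)
f2_sketch <- function(x) {
  0.2 * x^11 * (10 * (1 - x))^6 + 10 * (10 * x)^3 * (1 - x)^10
}

# Luo & Wahba "example 6" function, as written for "lwf6" (hypothetical name)
lw6_sketch <- function(x) {
  sin(2 * (4 * x - 2)) + 2 * exp(-256 * (x - 0.5)^2)
}

set.seed(1)
n <- 200
x1 <- runif(n) # assumed: covariates uniform on [0, 1]
x2 <- runif(n)

# Assumed: additive Gaussian noise with standard deviation 1
y_gwf2 <- f2_sketch(x1) + rnorm(n)       # "gwf2"-style response
y_lwf6 <- lw6_sketch(x1) + rnorm(n)      # "lwf6"-style response
y_eg3  <- f2_sketch(x1) * x2 + rnorm(n)  # "eg3"-style varying coefficient
```

Plotting f2_sketch and lw6_sketch over a grid on [0, 1] is a quick way to see the shapes a fitted smooth should recover.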
The random component providing noise or sampling variation can follow one
of the following distributions, specified via argument dist:
"normal"
: Gaussian,
"poisson"
: Poisson,
"binary"
: Bernoulli,
"negbin"
: Negative binomial,
"tweedie"
: Tweedie,
"gamma"
: gamma, and
"ocat" (or "ordered categorical")
: ordered categorical.
Other arguments provide the parameters for the distribution.
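To make the role of theta concrete, the base R sketch below compares Poisson and negative binomial draws with the same mean. It assumes the common parameterisation in which the negative binomial variance is \(\mu + \mu^2/\theta\), and uses stats::rnbinom() with size = theta; whether data_sim() draws its negative binomial values in exactly this way is an assumption.

```r
set.seed(42)
n <- 10000
mu <- 5      # common mean for both distributions (illustrative value)
theta <- 3   # matches the default of the theta argument

y_pois <- rpois(n, lambda = mu)
y_nb <- rnbinom(n, mu = mu, size = theta)

# Sample variances of the two sets of draws
c(poisson = var(y_pois), negbin = var(y_nb))

# Theoretical variances under the assumed parameterisation
c(poisson = mu, negbin = mu + mu^2 / theta)
```

Smaller values of theta give more extra-Poisson variation, while large values bring the negative binomial close to the Poisson.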
Gu, C., Wahba, G. (1993). Smoothing Spline ANOVA with Component-Wise Bayesian "Confidence Intervals." J. Comput. Graph. Stat. 2, 97–117.
Luo, Z., Wahba, G. (1997). Hybrid adaptive splines. J. Am. Stat. Assoc. 92, 107–116.
# \dontshow{
op <- options(pillar.sigfig = 5, cli.unicode = FALSE)
# }
data_sim("eg1", n = 100, seed = 1)
# an ordered categorical response
data_sim("eg1", n = 100, dist = "ocat", n_cat = 4, cuts = c(-1, 0, 5))
# \dontshow{
options(op)
# }
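Since the simulated data are intended for fitting GAMs, a further sketch may help. It assumes the gratia and mgcv packages are installed, and that the "eg1" data contain a response column y and covariates x0 to x3; those column names are an assumption here, so check names(df) before relying on them.

```r
library("mgcv")
library("gratia")

# Simulate from the four-term additive model with Gaussian noise
df <- data_sim("eg1", n = 400, dist = "normal", scale = 2, seed = 42)

# Fit an additive model with one smooth per covariate
# (assumed column names: y, x0, x1, x2, x3)
m <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = df, method = "REML")
summary(m)

# Supplying the same seed should reproduce the same simulated data
df2 <- data_sim("eg1", n = 400, dist = "normal", scale = 2, seed = 42)
isTRUE(all.equal(df, df2))
```

Fixing seed in this way keeps an example reproducible without calling set.seed() separately.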