The get_predicted() function is a robust, flexible and user-friendly
alternative to base R's predict() function. Additional features and
advantages include availability of uncertainty intervals (CI), bootstrapping,
a more intuitive API and support for more models than base R's predict()
function. However, although the interface is simplified, it is still very
important to read the documentation of the arguments. This is because making
"predictions" (a loose term for a variety of things) is a non-trivial process,
with lots of caveats and complications. Read the 'Details' section for more
information.
get_predicted_ci() returns the confidence (or prediction) interval (CI)
associated with predictions made by a model. This function can be called
separately on a vector of predicted values. get_predicted() usually
returns confidence intervals (included as an attribute, and accessible via the
as.data.frame() method) by default.
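As a minimal sketch of calling get_predicted_ci() separately on a vector of predicted values: the model, the dataset (mtcars) and the 95% level below are illustrative choices, not part of the description above.

```r
library(insight)

# Fit a simple linear model on a built-in dataset (illustrative choice)
model <- lm(mpg ~ cyl + hp, data = mtcars)

# Predicted values obtained independently, here from base R's predict()
predictions <- predict(model)

# Confidence interval associated with those predictions
ci_vals <- get_predicted_ci(model, predictions, ci = 0.95, ci_type = "confidence")
head(ci_vals)
```

The result is expected to hold one row per predicted value, with the interval bounds alongside the standard error.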
get_predicted(x, ...)

# S3 method for default
get_predicted(
x,
data = NULL,
predict = "expectation",
ci = NULL,
ci_type = "confidence",
ci_method = NULL,
dispersion_method = "sd",
vcov = NULL,
vcov_args = NULL,
verbose = TRUE,
...
)
# S3 method for lm
get_predicted(
x,
data = NULL,
predict = "expectation",
ci = NULL,
iterations = NULL,
verbose = TRUE,
...
)
# S3 method for stanreg
get_predicted(
x,
data = NULL,
predict = "expectation",
iterations = NULL,
ci = NULL,
ci_method = NULL,
include_random = "default",
include_smooth = TRUE,
verbose = TRUE,
...
)
# S3 method for gam
get_predicted(
x,
data = NULL,
predict = "expectation",
ci = NULL,
include_random = TRUE,
include_smooth = TRUE,
iterations = NULL,
verbose = TRUE,
...
)
# S3 method for lmerMod
get_predicted(
x,
data = NULL,
predict = "expectation",
ci = NULL,
ci_method = NULL,
include_random = "default",
iterations = NULL,
verbose = TRUE,
...
)
# S3 method for principal
get_predicted(x, data = NULL, ...)
The fitted values (i.e. predictions for the response). For Bayesian
or bootstrapped models (when iterations != NULL), iterations (as
columns, with observations as rows) can be accessed via as.data.frame().
A statistical model (can also be a data.frame, in which case the second argument has to be a model).
Other arguments to be passed, for instance to get_predicted_ci().
An optional data frame in which to look for variables with which
to predict. If omitted, the data used to fit the model is used. Visualization
matrices can be generated using get_datagrid().
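A short sketch of predicting on new data follows. The data frame is built by hand here (the values of cyl and hp are arbitrary illustrative choices); a grid produced by get_datagrid() could be passed to data in the same way.

```r
library(insight)

model <- lm(mpg ~ cyl + hp, data = mtcars)

# Hypothetical new observations: three cylinder counts at average horsepower
new_data <- data.frame(cyl = c(4, 6, 8), hp = mean(mtcars$hp))

# Predictions for the new rows only, with a 95% interval
pred <- get_predicted(model, data = new_data, ci = 0.95)
as.data.frame(pred)
```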
string or NULL
"link" returns predictions on the model's link-scale (for logistic models,
that means the log-odds scale) with a confidence interval (CI).
"expectation" (default) also returns confidence intervals, but this time
the output is on the response scale (for logistic models, that means
probabilities).
"prediction" also gives an output on the response scale, but this time
associated with a prediction interval (PI), which is wider than a confidence
interval (though it mostly makes sense for linear models).
"classification" only differs from "prediction" for binomial models,
where it additionally transforms the predictions into the original response's
type (for instance, to a factor).
Other strings are passed directly to the type argument of the predict()
method supplied by the modelling package.
When predict = NULL, alternative arguments such as type will be captured
by the ... ellipsis and passed directly to the predict() method supplied
by the modelling package. Note that this might result in conflicts with
multiple matching type arguments; thus, the recommendation is to use the
predict argument for those values.
Notes: You can see the 4 options for predictions as lying on a gradient from
"close to the model" to "close to the response data": "link", "expectation",
"prediction", "classification". The predict argument modulates two things:
the scale of the output and the type of uncertainty interval. Read more about
this in the Details section below.
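To make the difference in scale concrete, the sketch below fits a logistic model and checks that applying the inverse logit (base R's plogis()) to the "link" predictions recovers the "expectation" predictions. The model and data subset are illustrative assumptions.

```r
library(insight)

# Two-class subset of iris, so that a binomial model can be fitted
dat <- droplevels(iris[1:100, ])
model <- glm(Species ~ Sepal.Length, data = dat, family = "binomial")

# Same model, two scales
pred_link <- get_predicted(model, predict = "link")        # log-odds
pred_resp <- get_predicted(model, predict = "expectation") # probabilities

# The inverse logit of the log-odds should match the probabilities
all.equal(plogis(as.numeric(pred_link)), as.numeric(pred_resp))
```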
The interval level. Default is NULL, to be fast even for larger
models. Set the interval level to an explicit value, e.g. 0.95 for a 95%
CI.
Can be "prediction" or "confidence". Prediction
intervals show the range that likely contains the value of a new
observation (in what range it would fall), whereas confidence intervals
reflect the uncertainty around the estimated parameters (and give the
range of the link; for instance, of the regression line in a linear
regression). Prediction intervals account for both the uncertainty in the
model's parameters and the random variation of the individual values.
Thus, prediction intervals are always wider than confidence intervals.
Moreover, prediction intervals will not necessarily become narrower as the
sample size increases (as they do not reflect only the quality of the fit).
This applies mostly to "simple" linear models (like lm); for
other models (e.g., glm), prediction intervals are somewhat useless
(for instance, for a binomial model for which the dependent variable is a
vector of 1s and 0s, the prediction interval is... [0, 1]).
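The width difference between the two interval types can be checked directly on a linear model. This is only a sketch: the model is an illustrative choice, and the interval type is selected through the predict argument ("expectation" for a confidence interval, "prediction" for a prediction interval).

```r
library(insight)

model <- lm(mpg ~ cyl + hp, data = mtcars)

# Confidence interval around the expected value
ci_df <- as.data.frame(get_predicted(model, predict = "expectation", ci = 0.95))
# Prediction interval for a new observation
pi_df <- as.data.frame(get_predicted(model, predict = "prediction", ci = 0.95))

ci_width <- ci_df$CI_high - ci_df$CI_low
pi_width <- pi_df$CI_high - pi_df$CI_low

# For a linear model, every prediction interval should be the wider one
summary(pi_width - ci_width)
```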
The method for computing p values and confidence intervals. Possible values depend on model type.
NULL uses the default method, which varies based on the model type.
Most frequentist models: "wald" (default), "residual" or "normal".
Bayesian models: "quantile" (default), "hdi", "eti", and "spi".
Mixed effects lme4 models: "wald" (default), "residual", "normal",
"satterthwaite", and "kenward-roger".
See get_df() for details.
Bootstrap dispersion and Bayesian posterior summary: "sd" or "mad".
Variance-covariance matrix used to compute uncertainty estimates (e.g., for robust standard errors). This argument accepts a covariance matrix, a function which returns a covariance matrix, or a string which identifies the function to be used to compute the covariance matrix.
A covariance matrix
A function which returns a covariance matrix (e.g., stats::vcov())
A string which indicates the kind of uncertainty estimates to return.
Heteroskedasticity-consistent: "vcovHC", "HC", "HC0", "HC1", "HC2", "HC3", "HC4", "HC4m", "HC5". See ?sandwich::vcovHC
Cluster-robust: "vcovCR", "CR0", "CR1", "CR1p", "CR1S", "CR2", "CR3". See ?clubSandwich::vcovCR()
Bootstrap: "vcovBS", "xy", "residual", "wild", "mammen", "webb". See ?sandwich::vcovBS
Other sandwich package functions: "vcovHAC", "vcovPC", "vcovCL", "vcovPL".
List of arguments to be passed to the function identified by
the vcov argument. This function is typically supplied by the sandwich
or clubSandwich packages. Please refer to their documentation (e.g.,
?sandwich::vcovHAC) to see the list of available arguments. If no estimation
type (argument type) is given, the default type for "HC" (or "vcovHC")
equals the default from the sandwich package; for type "CR" (or
"vcovCR"), the default is set to "CR3".
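The three accepted forms of the vcov argument can be sketched as follows. The model is an illustrative choice, and the string form is guarded because it assumes the sandwich package is installed.

```r
library(insight)

model <- lm(mpg ~ cyl + hp, data = mtcars)

# 1. A covariance matrix supplied directly
pred_matrix <- as.data.frame(
  get_predicted(model, ci = 0.95, vcov = stats::vcov(model))
)

# 2. A function that returns a covariance matrix
pred_function <- as.data.frame(
  get_predicted(model, ci = 0.95, vcov = stats::vcov)
)

# 3. A string naming a heteroskedasticity-consistent estimator
if (requireNamespace("sandwich", quietly = TRUE)) {
  pred_robust <- as.data.frame(
    get_predicted(model, ci = 0.95, vcov = "HC3")
  )
  # Compare classical and robust standard errors side by side
  head(cbind(classical_se = pred_matrix$SE, hc3_se = pred_robust$SE))
}
```

Forms 1 and 2 both use the model's classical covariance matrix here, so they should give the same standard errors; only the string form switches to a robust estimator.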
Toggle warnings.
For Bayesian models, this corresponds to the number of
posterior draws. If NULL, will return all the draws (one for each
iteration of the model). For frequentist models, if not NULL, will
generate bootstrapped draws, from which bootstrapped CIs will be computed.
Iterations can be accessed by running as.data.frame(..., keep_iterations = TRUE)
on the output.
If "default", include all random effects in the
prediction, unless random effect variables are not in the data. If TRUE,
include all random effects in the prediction (in this case, it will be
checked whether all random effect variables are actually in data). If FALSE,
don't take them into account. Can also be a formula to specify which random
effects to condition on when predicting (passed to the re.form argument).
If include_random = TRUE and data is provided, make sure to include
the random effect variables in data as well.
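The effect of include_random can be sketched on a mixed model. This example assumes the lme4 package is installed (it is not required by the text above) and uses its sleepstudy data purely for illustration.

```r
library(insight)

if (requireNamespace("lme4", quietly = TRUE)) {
  data(sleepstudy, package = "lme4")

  # Random intercept per subject
  model <- lme4::lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

  # Predictions conditional on the subject-level random effects
  pred_cond <- get_predicted(model, include_random = TRUE)
  # Predictions from the fixed effects only
  pred_marg <- get_predicted(model, include_random = FALSE)

  head(data.frame(
    Subject = sleepstudy$Subject,
    Days = sleepstudy$Days,
    with_random = as.numeric(pred_cond),
    without_random = as.numeric(pred_marg)
  ))
}
```

With include_random = FALSE, the predictions depend only on Days, so all subjects share the same predicted value at a given day.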
For Generalized Additive Models (GAMs). If FALSE,
will fix the value of the smooth to its average, so that the predictions
do not depend on it.
In insight::get_predicted(), the predict argument jointly
modulates two separate concepts, the scale and the uncertainty interval.
Linear models - lm(): For linear models, prediction
intervals (predict="prediction") show the range that likely
contains the value of a new observation (in what range it is likely to
fall), whereas confidence intervals (predict="expectation" or
predict="link") reflect the uncertainty around the estimated
parameters (and give the range of uncertainty of the regression line). In
general, prediction intervals (PIs) account for both the uncertainty in the
model's parameters and the random variation of the individual values.
Thus, prediction intervals are always wider than confidence intervals.
Moreover, prediction intervals will not necessarily become narrower as the
sample size increases (as they do not reflect only the quality of the fit,
but also the variability within the data).
Generalized Linear models - glm(): For binomial models,
prediction intervals are somewhat useless (for instance, for a binomial
(Bernoulli) model for which the dependent variable is a vector of 1s and
0s, the prediction interval is... [0, 1]).
When users set the predict argument to "expectation", the predictions
are returned on the response scale, which is arguably the most convenient
way to understand and visualize relationships of interest. When users set
the predict argument to "link", predictions are returned on the link
scale, and no transformation is applied. For instance, for a logistic
regression model, the response scale corresponds to the predicted
probabilities, whereas the link-scale makes predictions of log-odds
(probabilities on the logit scale). Note that when users select
predict="classification" in binomial models, the get_predicted()
function will first calculate predictions as if the user had selected
predict="expectation". Then, it will round the responses in order to
return the most likely outcome.
The arguments vcov and vcov_args can be used to calculate robust
standard errors for confidence intervals of predictions. These arguments,
when provided in get_predicted(), are passed down to get_predicted_ci();
see the related documentation there for more details.
For predictions based on multiple iterations, for instance in the case of Bayesian
models and bootstrapped predictions, the function used to compute the centrality
(point-estimate predictions) can be modified via the centrality_function
argument. For instance, get_predicted(model, centrality_function = stats::median).
The default is mean. Individual draws can be accessed by running
iter <- as.data.frame(get_predicted(model)), and their iterations can be
reshaped into a long format by bayestestR::reshape_iterations(iter).
get_datagrid()
data(mtcars)
x <- lm(mpg ~ cyl + hp, data = mtcars)
predictions <- get_predicted(x, ci = .95)
predictions
# Options and methods ---------------------
get_predicted(x, predict = "prediction")
# Get CI
as.data.frame(predictions)
if (require("boot")) {
  # Bootstrapped
  as.data.frame(get_predicted(x, iterations = 4))
  # Same as as.data.frame(..., keep_iterations = FALSE)
  summary(get_predicted(x, iterations = 4))
}
# Different prediction types ------------------------
data(iris)
data <- droplevels(iris[1:100, ])
# Fit a logistic model
x <- glm(Species ~ Sepal.Length, data = data, family = "binomial")
# Expectation (default): response scale + CI
pred <- get_predicted(x, predict = "expectation", ci = .95)
head(as.data.frame(pred))
# Prediction: response scale + PI
pred <- get_predicted(x, predict = "prediction", ci = .95)
head(as.data.frame(pred))
# Link: link scale + CI
pred <- get_predicted(x, predict = "link", ci = .95)
head(as.data.frame(pred))
# Classification: classification "type" + PI
pred <- get_predicted(x, predict = "classification", ci = .95)
head(as.data.frame(pred))