Randomly generates estimation and validation samples, estimates the model on the first and calculates the likelihood for the second, then repeats.
apollo_outOfSample(
  apollo_beta,
  apollo_fixed,
  apollo_probabilities,
  apollo_inputs,
  estimate_settings = list(estimationRoutine = "bgw", maxIterations = 200,
    writeIter = FALSE, hessianRoutine = "none", printLevel = 3L, silent = TRUE),
  outOfSample_settings = list(nRep = 10, validationSize = 0.1, samples = NA,
    rmse = NULL)
)
A matrix with the average log-likelihood per observation for both the estimation and validation samples, for each repetition. Two additional files with further details are written to the working/output directory.
apollo_beta: Named numeric vector. Names and values for parameters.
apollo_fixed: Character vector. Names (as defined in apollo_beta) of parameters whose value should not change during estimation.
apollo_probabilities: Function. Returns probabilities of the model to be estimated. Must receive three arguments:
  apollo_beta: Named numeric vector. Names and values of model parameters.
  apollo_inputs: List containing options of the model. See apollo_validateInputs.
  functionality: Character. Can be either "components", "conditionals", "estimate" (default), "gradient", "output", "prediction", "preprocess", "raw", "report", "shares_LL", "validate" or "zero_LL".
apollo_inputs: List grouping most common inputs. Created by function apollo_validateInputs.
estimate_settings: List. Options controlling the estimation process. See apollo_estimate.
outOfSample_settings: List. Contains settings for this function. User input is required for all settings except those with a default or marked as optional.
  nRep: Numeric scalar. Number of times a different pair of estimation and validation sets is to be extracted from the full database. Default is 10.
  samples: Numeric matrix or data.frame. Optional argument. Must have as many rows as there are observations in the database, and as many columns as the number of repetitions wanted. Each column represents a re-sample, and each element must be 0 if the observation should be assigned to the estimation sample, or 1 if it should be assigned to the validation sample. If this argument is provided, then nRep and validationSize are ignored. Note that this allows sampling at the observation level rather than the individual level.
  validationSize: Numeric scalar. Size of the validation sample. Can be a proportion of the sample (between 0 and 1) or the number of individuals in the validation sample (greater than 1). Default is 0.1.
  rmse: Character matrix with two columns. Used to calculate the Root Mean Squared Error (RMSE) of prediction. The first column must contain the names of observed outcomes in the database. The second column must contain the names of the predicted outcomes as returned by apollo_prediction. If omitted or NULL, no RMSE is calculated. This only works for models with a single component.
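As an illustration of the settings described above, the following is a minimal sketch of how an outOfSample_settings list with an rmse matrix might be assembled. The column names ("choice_car", "car", "choice_bus", "bus") are hypothetical placeholders; they need to be replaced with the names used in your own database and in the output of apollo_prediction for your model.

```r
# Two-column character matrix: observed outcome (column in the database)
# paired with the predicted outcome (as named by apollo_prediction).
# All four names below are hypothetical placeholders.
rmse <- matrix(c("choice_car", "car",
                 "choice_bus", "bus"),
               ncol = 2, byrow = TRUE)

# Settings list; samples is left out, so nRep and validationSize apply.
outOfSample_settings <- list(
  nRep           = 10,   # number of estimation/validation pairs
  validationSize = 0.1,  # proportion of the sample used for validation
  rmse           = rmse
)

str(outOfSample_settings)
```

The list would then be passed as the outOfSample_settings argument of apollo_outOfSample.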
A common way to test for overfitting of a model is to measure its fit on a sample not used during estimation, that is, to measure its out-of-sample fit. A simple way to do this is to split the complete available dataset into two parts: an estimation sample and a validation sample. The model of interest is estimated using only the estimation sample, and the estimated parameters are then used to measure the fit of the model (e.g. its log-likelihood) on the validation sample. Doing this with only one validation sample, however, may lead to biased results, as a particular validation sample need not be representative of the population. One way to minimise this issue is to randomly draw several pairs of estimation and validation samples from the complete dataset, and apply the procedure to each pair.
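The repeated-split procedure can be sketched in base R as follows. This is not the implementation used by apollo_outOfSample: it assumes simulated data and a simple binary logit fitted with glm, and all object names (db, avgLL, fit) are chosen for illustration only. It shows the general idea of estimating on one part of the data and computing the average log-likelihood per observation on both parts.

```r
set.seed(1)

# Simulated data for a binary logit (illustrative only)
n  <- 500
x  <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(0.5 + 1.2 * x))
db <- data.frame(y = y, x = x)

# Average log-likelihood per observation of a fitted model on a dataset
avgLL <- function(model, data) {
  p <- predict(model, newdata = data, type = "response")
  mean(dbinom(data$y, size = 1, prob = p, log = TRUE))
}

nRep           <- 5    # number of estimation/validation pairs
validationSize <- 0.1  # proportion of rows held out in each repetition

fit <- matrix(NA_real_, nrow = nRep, ncol = 2,
              dimnames = list(NULL, c("estimation", "validation")))

for (r in seq_len(nRep)) {
  # Randomly pick the validation rows for this repetition
  isVal <- seq_len(n) %in% sample(n, size = round(validationSize * n))
  # Estimate on the estimation sample only
  model <- glm(y ~ x, family = binomial(), data = db[!isVal, ])
  # Measure fit on both samples with the same estimated parameters
  fit[r, ] <- c(avgLL(model, db[!isVal, ]), avgLL(model, db[isVal, ]))
}

print(fit)
```

A validation log-likelihood per observation that is consistently much lower than the estimation one would suggest overfitting.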
The splitting of the database into estimation and validation samples is done at the individual level, not at the observation level. If the sampling is to be done at the observation level instead (not recommended on panel data), then the optional outOfSample_settings$samples argument should be provided.
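To make the difference between the two sampling levels concrete, the sketch below builds two 0/1 matrices in the format expected by outOfSample_settings$samples (one row per observation, one column per repetition). The toy database, its ID column and the object names samplesInd and samplesObs are assumptions made for illustration; this is not how the function draws its own samples internally.

```r
set.seed(42)

# Toy panel: 10 individuals with 4 observations each (hypothetical data)
database <- data.frame(ID = rep(1:10, each = 4))
nRep <- 3
ids  <- unique(database$ID)

# Individual-level split: all rows of a sampled individual go to validation
samplesInd <- sapply(seq_len(nRep), function(r) {
  valIDs <- sample(ids, size = 2)
  as.numeric(database$ID %in% valIDs)
})

# Observation-level split: rows are sampled regardless of who they belong to
samplesObs <- sapply(seq_len(nRep), function(r) {
  valRows <- sample(nrow(database), size = 8)
  as.numeric(seq_len(nrow(database)) %in% valRows)
})

# 1 = validation sample, 0 = estimation sample
head(cbind(ID = database$ID, samplesInd), 8)
head(cbind(ID = database$ID, samplesObs), 8)
```

Either matrix could be supplied as outOfSample_settings = list(samples = samplesObs); with samplesObs, observations from the same individual can end up in both samples, which is why this is not recommended on panel data.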
This function writes two different files to the working/output directory:
modelName_outOfSample_params.csv: Records the estimated parameters, final log-likelihood, and number of observations on each repetition.
modelName_outOfSample_samples.csv: Records the sample composition of each repetition.
Both files are updated throughout the run of this function.
When run, this function will look for the two files above in the working/output directory. If they are found, the function will attempt to pick up re-sampling from where those files left off. This is useful in cases where the original run was interrupted, or when additional repetitions are to be performed.
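Before re-running, it can be helpful to check whether output from a previous run is present, since that determines whether the function will try to resume. The following is a minimal sketch in base R; the model name "myModel" and the use of tempdir() as the output directory are hypothetical and should be replaced with your own values. No assumption is made about the column layout of the files.

```r
modelName <- "myModel"  # hypothetical model name
outputDir <- tempdir()  # hypothetical working/output directory

# File names follow the pattern described above
outFiles <- file.path(outputDir,
                      paste0(modelName, c("_outOfSample_params.csv",
                                          "_outOfSample_samples.csv")))

found <- file.exists(outFiles)

if (all(found)) {
  message("Previous out-of-sample results found in ", outputDir)
  # Inspect the repetitions completed so far
  print(head(read.csv(outFiles[1])))
} else {
  message("No complete set of previous results found in ", outputDir)
}
```

Moving or deleting these files before calling the function is one way to avoid resuming from an earlier run.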