
Selects the bandwidth parameter with lowest out of sample prediction error for MGMs and mVAR Models.
bwSelect(data, type, level, bwSeq, bwFolds,
bwFoldsize, modeltype, pbar, ...)
The function returns a list with the following entries:
Contains all provided input arguments. If saveData = TRUE
, it also contains the data.
Contains the models estimated at the time points in the tests set. For details see tvmvar
or tvmgm
.
List with number of entries equal to the length of bwSeq
entries. Each entry contains a list with bwFolds
entries. Each of those entries contains a contains a bwFoldsize
times p matrix of out of sample prediction errors.
The same as fullErrorFolds
but pooled over folds.
List with number of entries equal to the length of bwSeq
entries. Each entry contains the average prediction error over variables and time points in the test set.
List with bwFolds
entries, which contain the rows of the test sample for each fold.
List with bwFolds
entries, which contains the observation weights used to fit the model at the bwFoldsize
time points.
A n x p data matrix.
p vector indicating the type of variable for each column in data
. "g" for Gaussian, "p" for Poisson, "c" for categorical.
p vector indicating the number of categories of each variable. For continuous variables set to 1.
A sequence with candidate bandwidth values (0, s] with s < Inf. Note that the bandwidth is applied relative to the unit time interval [0,1] and hence a banwidth of > 2 corresponds roughly to equal weights for all time points and hence gives similar estimates as the stationary model estimated via mvar()
.
The number of folds (see details below).
The size of each fold (see details below).
If modeltype = "mvar"
model, the optimal bandwidth parameter for a tvmvar()
model is selected. If modeltype = "mgm"
model, the optimal bandwidth parameter for a tvmgm()
model is selected. Additional arguments to tvmvar()
or tvmgm()
can be passed via the ...
argument.
If TRUE a progress bar is shown. Defaults to pbar = "TRUE"
.
Arguments passed to tvmgm
or tvmvar
.
Jonas Haslbeck <jonashaslbeck@gmail.com>
Performs a cross-validation scheme that is specified by bwFolds
and bwFoldsize
. In the first fold, the test set is defined by an equally spaced sequence between [1, n - bwFolds
] of length bwFoldsize
. In the second fold, the test set is defined by an equally spaced sequence between [2, n - bwFolds
+ 1] of length bwFoldsize
, etc. . Note that if bwFoldsize
= n / bwFolds
, this procedure is equal to bwFolds
-fold cross valildation. However, full cross validation is computationally very expensive and a single split in test/training set by setting bwFolds = 1
is sufficient in many situations. The procedure selects the bandwidth with the lowest prediction error, averaged over variables and time points in the test set.
bwSelect
computes the absolute error (continuous) or 0/1-loss (categorical) for each time point in the test set defined by bwFoldsize
as described in the previous paragraph for every fold specified in bwFolds
, separately for each variable. The computed errors are returned in different levels of aggregation in the output list (see below). Note that continuous variables are scaled (centered and divided by their standard deviation), hence the absolute error and 0/1-loss are roughly on the scale scale.
Note that selecting the bandwidth with the EBIC is no alternative. This is because the EBIC always selects the intercept model with the lowest bandwidth. The reason is that the unregularized intercept closely models the noise in the data and hence the penalty sets all other parameters to zero. This problem is solved by using out of sample prediction error in the cross validation scheme.
Barber, R. F., & Drton, M. (2015). High-dimensional Ising model selection with Bayesian information criteria. Electronic Journal of Statistics, 9(1), 567-607.
Foygel, R., & Drton, M. (2010). Extended Bayesian information criteria for Gaussian graphical models. In Advances in neural information processing systems (pp. 604-612).
Haslbeck, J. M. B., & Waldorp, L. J. (2020). mgm: Estimating time-varying Mixed Graphical Models in high-dimensional Data. Journal of Statistical Software, 93(8), pp. 1-46. DOI: 10.18637/jss.v093.i08
if (FALSE) {
## A) bwSelect for tvmgm()
# A.1) Generate noise data set
p <- 5
n <- 100
data_n <- matrix(rnorm(p*n), nrow=100)
head(data_n)
type <- c("c", "c", rep("g", 3))
level <- c(2, 2, 1, 1, 1)
x1 <- data_n[,1]
x2 <- data_n[,2]
data_n[x1>0,1] <- 1
data_n[x1<0,1] <- 0
data_n[x2>0,2] <- 1
data_n[x2<0,2] <- 0
head(data_n)
# A.2) Estimate optimal bandwidth parameter
bwobj_mgm <- bwSelect(data = data_n,
type = type,
level = level,
bwSeq = seq(0.05, 1, length=3),
bwFolds = 1,
bwFoldsize = 3,
modeltype = "mgm",
k = 3,
pbar = TRUE,
overparameterize = TRUE)
print.mgm(bwobj_mgm)
## B) bwSelect for tvmVar()
# B.1) Generate noise data set
p <- 5
n <- 100
data_n <- matrix(rnorm(p*n), nrow=100)
head(data_n)
type <- c("c", "c", rep("g", 3))
level <- c(2, 2, 1, 1, 1)
x1 <- data_n[,1]
x2 <- data_n[,2]
data_n[x1>0,1] <- 1
data_n[x1<0,1] <- 0
data_n[x2>0,2] <- 1
data_n[x2<0,2] <- 0
head(data_n)
# B.2) Estimate optimal bandwidth parameter
bwobj_mvar <- bwSelect(data = data_n,
type = type,
level = level,
bwSeq = seq(0.05, 1, length=3),
bwFolds = 1,
bwFoldsize = 3,
modeltype = "mvar",
lags = 1:3,
pbar = TRUE,
overparameterize = TRUE)
print.mgm(bwobj_mvar)
# For more examples see https://github.com/jmbh/mgmDocumentation
}
Run the code above in your browser using DataLab