Preprocess the original data for sieve-SGD estimation.
sieve.sgd.preprocess(
X,
s = c(2),
r0 = c(2),
J = c(1),
type = c("cosine"),
interaction_order = c(3),
omega = c(0.51),
norm_feature = TRUE,
norm_para = NULL,
lower_q = 0.005,
upper_q = 0.995
)
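All of the tuning arguments above accept numeric arrays, so several candidate hyperparameter values can be preprocessed in a single call. The snippet below is a minimal sketch, assuming the Sieve package is installed; X_train is a hypothetical feature data frame that is not part of this documentation.

library(Sieve)

## hypothetical training features: 500 samples, 3 covariates (not from the original example)
set.seed(1)
X_train <- data.frame(matrix(runif(500 * 3), ncol = 3))

## s, r0 and J are given as arrays so that every unique combination of
## candidate hyperparameters is tracked separately in the returned object
sieve.model <- sieve.sgd.preprocess(
  X = X_train,
  s = c(1, 2),        # candidate smoothness values (> 0.5)
  r0 = c(0.5, 2),     # candidate initial step sizes
  J = c(3, 6),        # candidate initial numbers of basis functions (>= ncol(X_train))
  type = "cosine",
  interaction_order = 2
)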
A list containing the necessary information for the next step of model fitting. Typically, this list is used as the main input of sieve.sgd.solver.
a number. The number of samples processed so far.
a string. The type of basis function.
a list of hyperparameters.
a matrix. Identifies the multivariate basis functions used in fitting.
the index product for each basis function. It is used to calculate basis-function-specific learning rates.
a list storing the fitted results. Its length equals the number of unique combinations of the hyperparameters. Each component of inf.list is itself a list: its hyper.para.index entry specifies the corresponding hyperparameter combination (to be used together with hyper.para.list); its rolling.cv entry is the progressive validation statistic used for hyperparameter tuning; beta.f contains the regression coefficients of the first length(beta.f) basis functions, and the remaining basis functions have zero coefficients. See the sketch after this list for how these entries can be inspected.
a matrix. It records how each dimension of the feature/predictor is rescaled, which is useful when rescaling the testing sample's predictors.
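As a quick illustration of the structure described above, the sketch below inspects a preprocessed object. The entry names are the ones mentioned in the inf.list description (hyper.para.index, rolling.cv, beta.f, hyper.para.list); sieve.model is assumed to be the output of sieve.sgd.preprocess, and rolling.cv/beta.f only carry meaningful fitted values after sieve.sgd.solver has processed data.

## a minimal sketch for inspecting the preprocessed object
length(sieve.model$inf.list)      # one entry per unique hyperparameter combination
one.fit <- sieve.model$inf.list[[1]]
one.fit$hyper.para.index          # identifies this entry's hyperparameters (use with hyper.para.list)
one.fit$rolling.cv                # progressive validation statistic used for tuning
head(one.fit$beta.f)              # regression coefficients of the leading basis functions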
X: a data frame containing the prediction features/independent variables. The (i,j)-th element is the j-th dimension of the i-th sample's feature vector, so the number of rows equals the sample size and the number of columns equals the feature/covariate dimension. If the complete data set is large, this can be a representative subset of it (ideally with more than 1000 samples).
s: a numerical array. Smoothness parameter; a smaller s corresponds to a more flexible model. Default is 2. The elements of this array should take values greater than 0.5. The larger s is, the smoother we assume the truth to be.
r0: a numerical array. Initial learning rate/step size; do not set it too large. The step size at each iteration is r0*(sample size)^(-1/(2s+1)), which decays slowly.
J: a numerical array. Initial number of basis functions; a larger J corresponds to a more flexible estimator. The number of basis functions at each iteration is J*(sample size)^(1/(2s+1)), which increases slowly. We recommend a J that is at least the dimension of the predictor, i.e. the number of columns of the X matrix. (The growth of the step size and of the number of basis functions is illustrated by the arithmetic sketch after this argument list.)
type: a string. It specifies which kind of basis functions are used. The default is the (aperiodic) cosine basis ('cosine'), which is sufficient for generic usage.
interaction_order: a number. It also controls the model complexity. 1 means fitting an additive model, 2 means fitting a model that allows interaction terms between 2 dimensions of the feature, 3 means allowing interactions between 3 dimensions, etc. The default is 3. For large-sample, lower-dimensional problems, try a larger value (but it should be smaller than the dimension of the original features); for smaller-sample, higher-dimensional problems, try setting it to a smaller value (1 or 2).
omega: the rate of the dimension-reduction parameter. Default is 0.51; it usually does not need to be changed.
norm_feature: a logical variable. Default is TRUE, which means sieve.sgd.preprocess will rescale each dimension of the features to between 0 and 1. Only set it to FALSE when the user has already manually rescaled them to between 0 and 1.
norm_para: a matrix. It specifies how the features are normalized. For training data, use the default value NULL.
lower_q: the lower quantile used in normalization. Default is 0.005 (0.5% quantile).
upper_q: the upper quantile used in normalization. Default is 0.995 (99.5% quantile).
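To make the scaling rules quoted for r0 and J concrete, the following is plain illustrative arithmetic (no package functions involved), evaluating the step size r0*n^(-1/(2s+1)) and the basis count J*n^(1/(2s+1)) at a few sample sizes.

## illustrative arithmetic only: the step size decays slowly while the
## number of basis functions grows slowly as the sample size n increases
s  <- 2      # smoothness parameter
r0 <- 2      # initial learning rate
J  <- 3      # initial number of basis functions
n  <- c(100, 1000, 10000)
data.frame(
  n         = n,
  step.size = r0 * n^(-1 / (2 * s + 1)),
  n.basis   = J * n^(1 / (2 * s + 1))   # rounded to an integer in practice
)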
xdim <- 1 #1 dimensional feature
#generate 1000 training samples
TrainData <- GenSamples(s.size = 1000, xdim = xdim)
sieve.model <- sieve.sgd.preprocess(X = TrainData[,2:(xdim+1)])
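A natural follow-up, not part of the original example, is to pass the preprocessed object to sieve.sgd.solver. The call below is only a sketch: the argument names and order are assumptions, and it assumes GenSamples stores the outcome in the first column of TrainData, as the column indexing above suggests.

## sketch of the typical next step (argument names/order are assumptions)
sieve.fit <- sieve.sgd.solver(
  sieve.model = sieve.model,
  X = TrainData[, 2:(xdim + 1)],
  Y = TrainData[, 1]
)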