Sieve (version 2.1)

sieve_preprocess: Preprocess the original data for sieve estimation.

Description

Generate the design matrix for the downstream lasso-type penalized model fitting.

Usage

sieve_preprocess(
  X,
  basisN = NULL,
  maxj = NULL,
  type = "cosine",
  interaction_order = 3,
  index_matrix = NULL,
  norm_feature = TRUE,
  norm_para = NULL
)

Value

A list containing the information needed for the next model-fitting step. Typically, the list is used as the main input of Sieve::sieve_solver.

Phi

a matrix. This is the design matrix used directly by the next model-fitting step. The (i,j)-th element of this matrix is the j-th basis function evaluated at the i-th sample's feature vector. The dimension of this matrix is sample size x basisN.

X

a matrix. This is the rescaled original feature/predictor matrix.

type

a string. The type of basis function.

index_matrix

a matrix. It specifies which product basis functions are used when constructing the design matrix Phi. Its dimension is basisN x dimension of the original features. Each row has at most interaction_order entries that differ from 1.

basisN

a number. Number of sieve basis functions.

norm_para

a matrix. It records how each dimension of the feature/predictor is rescaled, which is useful when rescaling the testing sample's predictors.
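
The relationship between Phi, X and index_matrix described above can be illustrated with a small base-R sketch. It is not the package's implementation: the helper names cosine_basis and build_phi are hypothetical, the toy inputs are made up, and the form of the cosine basis (psi_1(x) = 1, psi_j(x) = sqrt(2) * cos((j - 1) * pi * x)) is an assumption about what "cosine" means here.

```r
# Univariate cosine basis on [0, 1] (assumed form, see the note above):
# psi_1(x) = 1, psi_j(x) = sqrt(2) * cos((j - 1) * pi * x) for j >= 2
cosine_basis <- function(x, j) {
  if (j == 1) rep(1, length(x)) else sqrt(2) * cos((j - 1) * pi * x)
}

# Hypothetical helper: build a design matrix from features already rescaled
# to [0, 1] and an index matrix with one row per product basis function.
build_phi <- function(X, index_matrix) {
  X <- as.matrix(X)
  Phi <- matrix(1, nrow = nrow(X), ncol = nrow(index_matrix))
  for (k in seq_len(nrow(index_matrix))) {
    for (d in seq_len(ncol(X))) {
      # Multiply in the univariate basis function chosen for dimension d
      Phi[, k] <- Phi[, k] * cosine_basis(X[, d], index_matrix[k, d])
    }
  }
  Phi
}

# Toy input: 3 samples, 2 features, 4 basis functions (all values made up)
X_toy <- matrix(c(0, 0.5, 1,
                  0, 0.25, 1), ncol = 2)
index_toy <- rbind(c(1, 1),   # constant basis function
                   c(2, 1),   # varies in feature 1 only
                   c(1, 2),   # varies in feature 2 only
                   c(2, 2))   # interaction between features 1 and 2
Phi_toy <- build_phi(X_toy, index_toy)
dim(Phi_toy)  # sample size x basisN
```

In this sketch an entry of 1 in a row of index_toy selects the constant function for that dimension, which is why the number of non-1 entries in a row equals the number of features that basis function depends on.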

Arguments

X

a data frame containing the original features. The (i,j)-th element is the j-th dimension of the i-th sample's feature vector, so the number of rows equals the sample size and the number of columns equals the feature dimension.

basisN

number of sieve basis functions. It is in general larger than the dimension of the original features. Default is 50 * dimension of the original features. A larger value gives a smaller approximation error but makes the model harder to estimate. Computation time and memory requirements should scale linearly with basisN.

maxj

a number. The maximum index product of the basis functions. A larger value means a larger basisN. If basisN is already specified, there is no need to provide a value for this argument.

type

a string. It specifies what kind of basis functions are used. The default is (aperiodic) cosine basis functions, which are suitable for most purposes.

interaction_order

a number. It also controls the model complexity. 1 means fitting an additive model, 2 means fitting a model that allows interaction terms between 2 dimensions of the feature, 3 means allowing interaction terms between 3 dimensions of the feature, etc. The default is 3. For large-sample, lower-dimensional problems, try a larger value (it needs to be smaller than the dimension of the original features); for smaller-sample, higher-dimensional problems, try setting it to a smaller value (1 or 2).

index_matrix

a matrix. A pre-generated index matrix supplied by the user. The default is NULL, meaning sieve_preprocess will generate one for the user.

norm_feature

a logical variable. Default is TRUE, which means sieve_preprocess will rescale each dimension of the features to lie between 0 and 1. Only set it to FALSE when the user has already manually rescaled the features to lie between 0 and 1.

norm_para

a matrix. It specifies how the features are normalized. For training data, use the default value NULL.
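
The interplay between norm_feature and norm_para can be sketched in base R as follows. This only illustrates the idea of learning the rescaling on training data and reusing it on testing data: rescale_features is a hypothetical helper, and the layout used for norm_para below (one row of minima, one row of maxima) is an assumption, not necessarily the format sieve_preprocess returns.

```r
# Hypothetical helper: rescale each column to [0, 1];
# reuse previously stored parameters when they are supplied.
rescale_features <- function(X, norm_para = NULL) {
  X <- as.matrix(X)
  if (is.null(norm_para)) {
    # Training data: learn the column-wise minima and maxima
    # (this two-row layout is an assumption made for the sketch)
    norm_para <- rbind(min = apply(X, 2, min), max = apply(X, 2, max))
  }
  col_range <- norm_para["max", ] - norm_para["min", ]
  X_scaled <- sweep(sweep(X, 2, norm_para["min", ], "-"), 2, col_range, "/")
  list(X = X_scaled, norm_para = norm_para)
}

# Made-up training and testing features
train <- data.frame(x1 = c(2, 4, 6), x2 = c(10, 20, 30))
test  <- data.frame(x1 = c(3, 5),    x2 = c(15, 25))

train_scaled <- rescale_features(train)  # norm_para = NULL for training data
test_scaled  <- rescale_features(test, norm_para = train_scaled$norm_para)
```

Reusing the training parameters keeps the testing features on the same scale as the training features, so the basis functions are evaluated consistently across both samples.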

Examples

xdim <- 1 #1 dimensional feature
#generate 1000 training samples
TrainData <- GenSamples(s.size = 1000, xdim = xdim)
#use 50 cosine basis functions
type <- 'cosine'
basisN <- 50 
sieve.model <- sieve_preprocess(X = TrainData[,2:(xdim+1)], 
                                basisN = basisN, type = type)
#sieve.model$Phi #Phi is the design matrix

xdim <- 5 #5 dimensional feature
#generate 1000 training samples
#only the first two dimensions are truly associated with the outcome
TrainData <- GenSamples(s.size = 1000, xdim = xdim, 
                              frho = 'additive', frho.para = 2)
                              
#use 1000 basis functions
#each of them is a product of univariate cosine functions.
type <- 'cosine'
basisN <- 1000 
sieve.model <- sieve_preprocess(X = TrainData[,2:(xdim+1)], 
                                basisN = basisN, type = type)
#sieve.model$Phi #Phi is the design matrix

#fit a nonparametric additive model by setting interaction_order = 1
sieve.model <- sieve_preprocess(X = TrainData[,2:(xdim+1)], 
                                basisN = basisN, type = type, 
                                interaction_order = 1)
#sieve.model$index_matrix #for each row, there is at most one entry >= 2. 
#this means there are no basis functions varying in more than 1 dimension 
#that is, we are fitting additive models without interaction between features.
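
The check described in the comments above can be written out explicitly. The following is a small sketch on a made-up index matrix rather than on real sieve_preprocess output; active_dims is a hypothetical helper name.

```r
# Number of features each basis function actually varies in
# (an entry of 1 selects the constant function, so it does not count).
active_dims <- function(index_matrix) rowSums(index_matrix > 1)

# Made-up index matrix: 5 basis functions, 3 features
index_toy <- rbind(c(1, 1, 1),   # constant
                   c(3, 1, 1),   # varies in feature 1 only
                   c(1, 2, 1),   # varies in feature 2 only
                   c(2, 2, 1),   # interaction between features 1 and 2
                   c(2, 3, 2))   # interaction between all three features

active_dims(index_toy)

# Rows compatible with an additive model (interaction order 1)
additive_rows <- index_toy[active_dims(index_toy) <= 1, , drop = FALSE]
```

Applied to a real sieve.model$index_matrix, the same row count should never exceed the interaction_order that was requested, assuming the matrix has one column per feature as described under Value.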
