make_2classification: Data Simulation for 2 stages

Description

Generates a simulated dataset for testing multiple-stage learning algorithms. The outcomes are generated from a pattern mixture model with a latent variable that has 4 categories. Within each category, X follows a multivariate normal distribution, and each category is assigned a vector of optimal treatments V. Specifically, we generate the centroids of the classes from a multivariate normal distribution with mean 0 and standard deviation 5. We then add the centroids to the first pinfo dimensions of the feature vectors X, which are simulated from a multivariate normal distribution with pinfo+pnoise dimensions.
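The feature-generation step described above can be sketched as follows. This is a minimal illustration, not the package's source: the helper name `sketch_features`, the uniform draw of latent classes, and the use of independent components (identity covariance) are all assumptions made for the example.

```r
# Hypothetical helper (not the package's implementation): simulate the features only.
sketch_features <- function(n_cluster, pinfo, pnoise, n_sample, centroids = NULL) {
  # Draw one centroid per latent class from N(0, 5^2) unless centroids are supplied
  if (is.null(centroids)) {
    centroids <- matrix(rnorm(n_cluster * pinfo, mean = 0, sd = 5),
                        nrow = n_cluster, ncol = pinfo)
  }
  # Latent class of each subject (assumed uniform over the classes)
  cluster <- sample(seq_len(n_cluster), n_sample, replace = TRUE)
  # Standard normal features in pinfo + pnoise dimensions (independence assumed)
  X <- matrix(rnorm(n_sample * (pinfo + pnoise)), nrow = n_sample)
  # Shift only the first pinfo columns by the centroid of the subject's class
  X[, seq_len(pinfo)] <- X[, seq_len(pinfo)] + centroids[cluster, , drop = FALSE]
  list(X = X, cluster = cluster, centroids = centroids)
}

set.seed(1)
sim <- sketch_features(n_cluster = 4, pinfo = 3, pnoise = 2, n_sample = 200)
dim(sim$X)          # n_sample rows, pinfo + pnoise columns
dim(sim$centroids)  # n_cluster rows, pinfo columns
```

Only the first pinfo columns carry information about the latent class; the remaining pnoise columns are pure noise.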

Then we assign optimal treatments $y=(A_1^*, A_2^*)$, chosen from (1,1), (1,-1), (-1,-1), (-1,1), to each latent category. The observed treatment assignments $A=(A_1,A_2)$ are completely random, each taking the value 1 or -1 with probability 0.5, and the outcomes are generated as $R_1=0$, $R_2=A'y+N(0,1)$. Therefore the mean optimal outcome $R_1+R_2$ is $2$ when the treatment assignments equal the optimal treatments for the given latent group in both stages.
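The outcome model can be checked numerically with a short sketch. It assumes exactly four latent categories drawn uniformly and maps them to the four treatment pairs in the order listed; the variable names are illustrative and are not taken from the package.

```r
set.seed(2)
n_sample <- 20000

# The four optimal-treatment pairs, one row per latent category
optimal <- rbind(c(1, 1), c(1, -1), c(-1, -1), c(-1, 1))
cluster <- sample(1:4, n_sample, replace = TRUE)
y <- optimal[cluster, ]                      # (A_1^*, A_2^*) for each subject

# Observed treatments: 1 or -1 with probability 0.5, independently at each stage
A <- matrix(sample(c(-1, 1), 2 * n_sample, replace = TRUE), ncol = 2)

R1 <- rep(0, n_sample)                       # intermediate outcome is 0
R2 <- rowSums(A * y) + rnorm(n_sample)       # A'y + N(0, 1)

# Subjects whose observed treatments match the optimal ones at both stages
both_optimal <- A[, 1] == y[, 1] & A[, 2] == y[, 2]
mean((R1 + R2)[both_optimal])                # close to 2
mean(R1 + R2)                                # close to 0 under random assignment
```

Because $A'y = A_1 A_1^* + A_2 A_2^*$, each stage contributes $+1$ when the observed treatment matches the optimal one and $-1$ otherwise, which gives the mean of 2 when both stages match.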

Usage

make_2classification(n_cluster, pinfo, pnoise, n_sample, centroids = 0)

Arguments

n_cluster

number of clusters.

pinfo

number of informative variables, i.e. the dimension of the centroids that relate the features to the latent class of each sample.

pnoise

number of noise variables.

n_sample

sample size

centroids

For a training set, do not assign centroids; they are generated randomly by the function. For a testing set, one should assign the same set of centroids as the training set. It is a matrix of dimension n_cluster by pinfo.
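A training set and a matching testing set might be generated as in the sketch below. It assumes that make_2classification() has already been loaded from the package that provides it, and that the returned list exposes the centroids as the element `centroids`, as listed under Value; the guard only keeps the snippet safe to source when the function is unavailable.

```r
# Assumption: make_2classification() is available in the current session.
if (exists("make_2classification", mode = "function")) {
  # Training set: centroids are left unassigned and generated by the function
  train <- make_2classification(n_cluster = 4, pinfo = 10, pnoise = 10,
                                n_sample = 200)
  # Testing set: reuse the training centroids so both sets share latent classes
  test <- make_2classification(n_cluster = 4, pinfo = 10, pnoise = 10,
                               n_sample = 100, centroids = train$centroids)
  dim(train$centroids)  # n_cluster rows, one per latent class
}
```

Reusing the centroids matters because a rule learned on the training features is only meaningful on test features whose clusters sit in the same locations.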

Value

X

Feature variable matrix, an n_sample by pinfo+pnoise matrix generated from a multivariate normal distribution. The noise variables have mean 0 and std 1; the informative variables are shifted so that they are centered at the randomly generated centroids.

A

List of 2; A[[1]] and A[[2]] are the treatment assignment vectors for stages 1 and 2.

y

List of 2; y[[1]] and y[[2]] are the true optimal treatment vectors for stages 1 and 2.

R

List of 2; R[[1]] is a vector of n_sample zeros (this is the simplified case where the intermediate outcomes are 0), and R[[2]] is the vector of final outcomes.

centroids

centers of the clusters, drawn from a pinfo-dimensional multivariate normal distribution.

Examples

# NOT RUN {
# simulate a two-stage dataset
n_cluster=5
pinfo=10
pnoise=10
n_sample=50
example2=make_2classification(n_cluster,pinfo,pnoise,n_sample)
# pi: list with one vector per stage, passed to Olearning and Plearning
pi=list()
pi[[2]]=pi[[1]]=rep(1,n_sample)
set.seed(3)
# fit the three learning methods on the simulated data
modelO=Olearning(example2$X,example2$A,example2$R,n_sample,2,pi)
modelP=Plearning(example2$X,example2$A,example2$R,n_sample,2,pi)
modelQ=Qlearning(example2$X,example2$A,example2$R,2)
# }
