poLCA.simdata: Create simulated cross-classification data

Description

Uses the latent class model's assumed data-generating process to create a simulated dataset that can be used to test the properties of the poLCA latent class and latent class regression estimators.
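The data-generating process referred to above can be sketched in base R for the case without covariates. This is a minimal illustration, not the package's own code: the helper `draw_item` and the particular probabilities are made up for the example.

```r
# Sketch of the latent class data-generating process (base R only).
set.seed(1)
N <- 1000
classdist <- c(0.3, 0.7)                 # mixing proportions, must sum to 1
probs <- list(                           # one nclass x nresp matrix per manifest variable
  matrix(c(0.8, 0.2,
           0.1, 0.9), ncol = 2, byrow = TRUE),
  matrix(c(0.6, 0.3, 0.1,
           0.2, 0.2, 0.6), ncol = 3, byrow = TRUE)
)

# Step 1: draw each individual's latent class from the mixing proportions.
trueclass <- sample(seq_along(classdist), N, replace = TRUE, prob = classdist)

# Step 2: draw each manifest variable from the row of its matrix
# that corresponds to the individual's class.
# draw_item is a hypothetical helper, not a poLCA function.
draw_item <- function(p, cls) {
  vapply(cls, function(r) sample(ncol(p), 1, prob = p[r, ]), integer(1))
}
items <- lapply(probs, draw_item, cls = trueclass)
names(items) <- paste0("Y", seq_along(probs))
dat <- as.data.frame(items)
```

Conditional on class membership the manifest variables are drawn independently, which is the local independence assumption of the latent class model.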

Usage

poLCA.simdata(N = 5000, probs = NULL, nclass = 2, ndv = 4, 
              nresp = NULL, x = NULL, niv = 0, b = NULL, 
              classdist = NULL, missval = FALSE, pctmiss = NULL)

Arguments

N

number of observations.

probs

a list of matrices of dimension nclass by nresp with each matrix corresponding to one manifest variable, and each row containing the class-conditional outcome probabilities (which must sum to 1). If probs is NU

nclass

number of latent classes. If probs is specified, then nclass is set equal to the number of rows in each matrix in that list. If classdist is specified, then nclass is set equal to the length of that vecto

ndv

number of manifest variables. If probs is specified, then ndv is set equal to the number of matrices in that list. If nresp is specified, then ndv is set equal to the length of that vector. Otherwise, t

nresp

number of possible outcomes for each manifest variable. If probs is specified, then nresp is set equal to the number of columns in each matrix in that list. If both probs and nresp are NULL (d

x

a matrix of concomitant variables with N rows and niv columns. If x=NULL (default), but niv>0, then niv concomitant variables will be generated as mutually independent random draws from a s

niv

number of concomitant variables (covariates). Setting niv=0 (default) creates a data set assuming no covariates. If nclass=1 then niv is automatically set equal to 0. If both x and niv are

b

when using covariates, an niv+1 by nclass-1 matrix of (multinomial) logit coefficients. If b is NULL (default), then coefficients are generated as random integers between -2 and 2.

classdist

a vector of mixing proportions (class population shares) of length nclass. classdist must sum to 1. Disregarded if b is specified or niv>1 because then classdist is, in part, a function of

missval

logical. If TRUE then a fraction pctmiss of the manifest variables are randomly dropped as missing values. Default is FALSE.

pctmiss

percentage of values to be dropped as missing, if missval=TRUE. If pctmiss is NULL (default), then a value between 5 and 40 percent is chosen randomly.
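To make the effect of missval and pctmiss concrete, here is a small base R sketch. It assumes each cell of the manifest data is dropped independently with probability pctmiss, which is an assumption about the mechanism rather than a description of the package internals. pctmiss is written as a fraction, matching the pctmiss=0.2 call in the examples on this page; `is_dropped` is a name made up for the sketch.

```r
# Sketch: blank out roughly a fraction pctmiss of the manifest values.
set.seed(2)
dat <- data.frame(Y1 = sample(1:3, 200, replace = TRUE),
                  Y2 = sample(1:2, 200, replace = TRUE))
pctmiss <- 0.2                       # fraction of cells to drop (assumed scale)

y <- as.matrix(dat)
# Assumption: every cell is dropped independently with probability pctmiss.
is_dropped <- matrix(runif(length(y)) < pctmiss, nrow = nrow(y))
y[is_dropped] <- NA
dat_miss <- as.data.frame(y)

mean(is.na(y))                       # realised share of missing cells
```

Because the cells are dropped at random, the realised share of missing values will be close to, but generally not exactly, pctmiss.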

Value

dat

a data frame containing the simulated variables. Variable names for manifest variables are Y1, Y2, etc. Variable names for concomitant variables are X1, X2, etc.

probs

a list of matrices of dimension nclass by nresp containing the class-conditional response probabilities.

nresp

a vector containing the number of possible outcomes for each manifest variable.

b

coefficients on covariates, if used.

classdist

mixing proportions corresponding to each latent class.

pctmiss

percent of observations missing.

trueclass

N by 1 vector containing the "true" class membership for each individual.
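When covariates are used, b and the class shares are linked through a multinomial logit. The base R sketch below shows one way that link can be computed. It assumes the first class is the reference category (its linear predictor fixed at zero) and that the population shares are the column means of the individual priors. Both are assumptions of this illustration, and the standard normal covariates are chosen only for the example.

```r
# Sketch: from covariates x and coefficients b to class membership priors.
set.seed(3)
N <- 500; niv <- 2; nclass <- 3
x <- matrix(rnorm(N * niv), N, niv)                        # illustrative covariates
b <- matrix(c(-1, 2, -3, 1, -2, 2), niv + 1, nclass - 1)   # (niv+1) by (nclass-1)

# Linear predictors; the leading column of zeros is the reference class (assumed: class 1).
eta <- cbind(0, cbind(1, x) %*% b)
prior <- exp(eta) / rowSums(exp(eta))                      # N by nclass, each row sums to 1

# Each individual's class is drawn from that individual's own prior.
trueclass <- apply(prior, 1, function(p) sample(nclass, 1, prob = p))

colMeans(prior)                                            # implied population class shares
```

Under this reading the class shares depend on both b and the distribution of the covariates, so they cannot be set freely once covariates enter the model.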

Details

Note that entering probs overrides nclass, ndv, and nresp. It also overrides classdist if the length of the classdist vector is not equal to the length of the probs list. Likewise, if probs=NULL, then length(nresp) overrides ndv and length(classdist) overrides nclass. Setting niv>1 causes any user-entered value of classdist to be disregarded.
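The precedence rules for nclass, ndv, and nresp described above can be summarised in a short sketch. `resolve_dims` is a hypothetical helper written for this illustration, not a poLCA function, and it deliberately leaves out the handling of classdist when probs is given.

```r
# Hypothetical helper illustrating which inputs determine nclass, ndv and nresp.
resolve_dims <- function(probs = NULL, nclass = 2, ndv = 4,
                         nresp = NULL, classdist = NULL) {
  if (!is.null(probs)) {
    # probs overrides nclass, ndv and nresp.
    ndv    <- length(probs)
    nclass <- nrow(probs[[1]])
    nresp  <- vapply(probs, ncol, integer(1))
  } else {
    # Without probs, the lengths of nresp and classdist take over.
    if (!is.null(nresp))     ndv    <- length(nresp)
    if (!is.null(classdist)) nclass <- length(classdist)
  }
  list(nclass = nclass, ndv = ndv, nresp = nresp)
}

# probs given: the nclass = 5 argument is ignored in favour of the matrix dimensions.
r1 <- resolve_dims(probs = list(matrix(0.5, 3, 2), matrix(0.25, 3, 4)), nclass = 5)

# probs not given: dimensions come from nresp and classdist.
r2 <- resolve_dims(nresp = c(2, 3, 4), classdist = c(0.2, 0.3, 0.5))
```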

Examples

library(poLCA)

##
## Create a sample data set with 3 classes and no covariates 
## and run poLCA to recover the specified parameters.
##
probs <- list(matrix(c(0.6,0.1,0.3,     0.6,0.3,0.1,     0.3,0.1,0.6    ),ncol=3,byrow=TRUE), # conditional resp prob to Y1
              matrix(c(0.2,0.8,         0.7,0.3,         0.3,0.7        ),ncol=2,byrow=TRUE), # conditional resp prob to Y2
              matrix(c(0.3,0.6,0.1,     0.1,0.3,0.6,     0.3,0.6,0.1    ),ncol=3,byrow=TRUE), # conditional resp prob to Y3
              matrix(c(0.1,0.1,0.5,0.3, 0.5,0.3,0.1,0.1, 0.3,0.1,0.1,0.5),ncol=4,byrow=TRUE), # conditional resp prob to Y4
              matrix(c(0.1,0.1,0.8,     0.1,0.8,0.1,     0.8,0.1,0.1    ),ncol=3,byrow=TRUE)) # conditional resp prob to Y5
simdat <- poLCA.simdata(N=5000,probs,classdist=c(0.2,0.3,0.5))
f1 <- cbind(Y1,Y2,Y3,Y4,Y5)~1
lc1 <- poLCA(f1,simdat$dat,nclass=3)
print(table(lc1$predclass,simdat$trueclass))

##
## Create a sample dataset with 2 classes and three covariates.
## Then compare predicted class memberships when the model is 
## estimated "correctly" with covariates to when it is estimated
## "incorrectly" without covariates.
##
simdat2 <- poLCA.simdata(N=5000,ndv=7,niv=3,nclass=2,b=matrix(c(1,-2,1,-1)))
f2a <- cbind(Y1,Y2,Y3,Y4,Y5,Y6,Y7)~X1+X2+X3
lc2a <- poLCA(f2a,simdat2$dat,nclass=2)
f2b <- cbind(Y1,Y2,Y3,Y4,Y5,Y6,Y7)~1
lc2b <- poLCA(f2b,simdat2$dat,nclass=2)
print(table(lc2a$predclass,lc2b$predclass))

##
## Create a sample dataset with missing values and estimate the
## latent class model including and excluding the missing values.
## Then plot the estimated class-conditional outcome response 
## probabilities against each other for the two methods.
##
simdat3 <- poLCA.simdata(N=5000,niv=2,ndv=5,nclass=3,b=matrix(c(-1,2,-3,1,-2,2),3,2),missval=TRUE,pctmiss=0.2)
f3 <- cbind(Y1,Y2,Y3,Y4,Y5)~X1+X2
lc3.miss <- poLCA(f3,simdat3$dat,nclass=3,verbose=FALSE)
probs.start.new <- poLCA.reorder(lc3.miss$probs.start,order(lc3.miss$P))
lc3.miss <- poLCA(f3,simdat3$dat,nclass=3,probs.start=probs.start.new)

lc3.nomiss <- poLCA(f3,simdat3$dat,nclass=3,verbose=FALSE,na.rm=FALSE)
probs.start.new <- poLCA.reorder(lc3.nomiss$probs.start,order(lc3.nomiss$P))
lc3.nomiss <- poLCA(f3,simdat3$dat,nclass=3,na.rm=FALSE,probs.start=probs.start.new)

plot(lc3.miss$probs[[1]],lc3.nomiss$probs[[1]],xlim=c(0,1),ylim=c(0,1),
    xlab="Conditional response probabilities (missing values dropped)",
    ylab="Conditional response probabilities (missing values included)")
for (i in 2:5) { points(lc3.miss$probs[[i]],lc3.nomiss$probs[[i]]) }
abline(0,1,lty=3)