scoreIrt: Find Item Response Theory (IRT) based scores for dichotomous or polytomous items

Description

irt.fa finds Item Response Theory (IRT) parameters through factor analysis of the tetrachoric or polychoric correlations of dichtomous or polytomous items. scoreIrt uses these parameter estimates of discrimination and location to find IRT based scores for the responses. As many factors as found for the correlation matrix will be scored. scoreIrt.2pl will score lists of scales.

Usage

scoreIrt(stats=NULL, items, keys=NULL,cut = 0.3,bounds=c(-4,4),mod="logistic") 
scoreIrt.1pl(keys.list,items,correct=.5,messages=FALSE,cut=.3,bounds=c(-4,4),
     mod="logistic")  #Rasch like scaling
scoreIrt.2pl(itemLists,items,correct=.5,messages=FALSE,cut=.3,bounds=c(-4,4),
   mod="logistic")  #2 pl scoring
#the next is an alias for scoreIrt both of which are wrappers for 
#     score.irt.2 and score.irt.poly
score.irt(stats=NULL, items, keys=NULL,cut = 0.3,bounds=c(-4,4),mod="logistic") 
 #the higher order call just calls one of the next two
  #for dichotomous items 
score.irt.2(stats, items,keys=NULL,cut = 0.3,bounds=c(-4,4),mod="logistic") 
  #for polytomous items
score.irt.poly(stats, items, keys=NULL, cut = 0.3,bounds=c(-4,4),mod="logistic")
    #to create irt like statistics for plotting
irt.stats.like(items,stats,keys=NULL,cut=.3)
make.irt.stats(difficulty,discrimination) 
   
irt.tau(x)    #find the tau values for the x object
irt.se(stats,scores=0,D=1.702)

Arguments

stats

Output from irt.fa is used for parameter estimates of location and discrimination. Stats may also be the output from a normal factor analysis (fa). If stats is a data.frame of discrimination and thresholds from some other data set, these values will be used. See the last example.

items

The raw data, may be either dichotomous or polytomous.

itemLists

a list of items to be factored and scored for each scale, can be a keys.list as used in scoreItems or scoreIrt.1pl

keys.list

A list of items to be scored with keying direction (see example)

keys

A keys matrix of which items should be scored for each factor

cut

Only items with discrimination values > cut will be used for scoring.

The raw data to be used to find the tau parameter in irt.tau

bounds

The lower and upper estimates for the fitting function

mod

Should a logistic or normal model be used in estimating the scores?

correct

What value should be used for continuity correction when finding the tetrachoric or polychoric correlations when using irt.fa

messages

Should messages be suppressed when running multiple scales?

scores

A single score or a vector of scores to find standard errors

The scaling function for the test information statistic used in irt.se

difficulty

The difficulties for each item in a polytomous scoring

discrimination

The item discrimin

Value

scores

A data frame of theta estimates, total scores based upon raw sums, and estimates of fit.

tau

Returned by irt.tau: A data frame of the tau values for an object of dichotomous or polytomous items. Found without bothering to find the correlations.

Details

Although there are more elegant ways of finding subject scores given a set of item locations (difficulties) and discriminations, simply finding that value of theta \(\theta\) that best fits the equation \(P(x|\theta) = 1/(1+exp(\beta(\delta - \theta) )\) for a score vector X, and location \(\delta\) and discrimination \(\beta\) provides more information than just total scores. With complete data, total scores and irt estimates are almost perfectly correlated. However, the irt estimates provide much more information in the case of missing data.

The bounds parameter sets the lower and upper limits to the estimate. This is relevant for the case of a subject who gives just the lowest score on every item, or just the top score on every item. Formerly (prior to 1.6.12) this was done by estimating these taial scores by finding the probability of missing every item taken, converting this to a quantile score based upon the normal distribution, and then assigning a z value equivalent to 1/2 of that quantile. Similarly, if a person gets all the items they take correct, their score is defined as the quantile of the z equivalent to the probability of getting all of the items correct, and then moving up the distribution half way. If these estimates exceed either the upper or lower bounds, they are adjusted to those boundaries.

As of 1.6.9, the procedure is very different. We now assume that all items are bounded with one passed item that is easier than all items given, and one failed item that is harder than any item given. This produces much cleaner results.

There are several more elegant packages in R that provide Full Information Maximum Likeliood IRT based estimates. In particular, the MIRT package seems especially good. The ltm package give equivalent estimates to MIRT for dichotomous data but produces unstable estimates for polytomous data and should be avoided.

Although the scoreIrt estimates are are not FIML based they seem to correlated with the MIRT estiamtes with values exceeding .99. Indeed, based upon very limited simulations there are some small hints that the solutions match the true score estimates slightly better than do the MIRT estimates. scoreIrt seems to do a good job of recovering the basic structure.

If trying to use item parameters from a different data set (e.g. some standardization sample), specify the stats as a data frame with the first column representing the item discriminations, and the next columns the item difficulties. See the last example.

The two wrapper functions scoreIrt.1pl and scoreIrt.2pl are very fast and are meant for scoring one or many scales at a time with a one factor model (scoreIrt.2pl) or just Rasch like scoring. Just specify the scoring direction for a number of scales (scoreIrt.1pl) or just items to score for a number of scales scoreIrt.2pl. scoreIrt.2pl will then apply irt.fa to the items for each scale separately, and then find the 2pl scores.

The keys.list is a list of items to score for each scale. Preceding the item name with a negative sign will reverse score that item (relevant for scoreIrt.1pl. Alternatively, a keys matrix can be created using make.keys. The keys matrix is a matrix of 1s, 0s, and -1s reflecting whether an item should be scored or not scored for a particular factor. See scoreItems or make.keys for details. The default case is to score all items with absolute discriminations > cut.

If one wants to score scales taking advantage of differences in item location but not do a full IRT analysis, then find the item difficulties from the raw data using irt.tau or combine this information with a scoring keys matrix (see scoreItems and make.keys and create quasi-IRT statistics using irt.stats.like. This is the equivalent of doing a quasi-Rasch model, in that all items are assumed to be equally discriminating. In this case, tau values may be found first (using irt.tau or just found before doing the scoring. This is all done for you inside of scoreIrt.1pl.

Such irt based scores are particularly useful if finding scales based upon massively missing data (e.g., the SAPA data sets). Even without doing the full irt analysis, we can take into account different item difficulties.

David Condon has added a very nice function to do 2PL analysis for a number of scales at one time. scoreIrt.2pl takes the raw data file and a list of items to score for each of multiple scales. These are then factored (currently just one factor for each scale) and the loadings and difficulties are used for scoring.

There are conventionally two different metrics and models that are used. The logistic metric and model and the normal metric and model. These are chosen using the mod parameter.

irt.se finds the standard errors for scores with a particular value. These are based upon the information curves calculated by irt.fa and are not based upon the particular score of a particular subject.

References

Kamata, Akihito and Bauer, Daniel J. (2008) A Note on the Relation Between Factor Analytic and Item Response Theory Models Structural Equation Modeling, 15 (1) 136-153.

McDonald, Roderick P. (1999) Test theory: A unified treatment. L. Erlbaum Associates.

Revelle, William. (in prep) An introduction to psychometric theory with applications in R. Springer. Working draft available at https://personality-project.org/r/book/

Examples

Run this code

# NOT RUN {
  #not run in the interest of time, but worth doing
d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal") #dichotomous items
test <- irt.fa(d9$items)
scores <- scoreIrt(test,d9$items)
scores.df <- data.frame(scores,true=d9$theta) #combine the estimates with the true thetas.
pairs.panels(scores.df,pch=".",
main="Comparing IRT and classical with complete data") 
#now show how to do this with a quasi-Rasch model
tau <- irt.tau(d9$items)
scores.rasch <- scoreIrt(tau,d9$items,key=rep(1,9))
scores.dfr<- data.frame(scores.df,scores.rasch) #almost identical to 2PL model!
pairs.panels(scores.dfr)
#with all the data, why bother ?

#now delete some of the data
d9$items[1:333,1:3] <- NA
d9$items[334:666,4:6] <- NA
d9$items[667:1000,7:9] <- NA
scores <- scoreIrt(test,d9$items)
scores.df <- data.frame(scores,true=d9$theta) #combine the estimates with the true thetas.
pairs.panels(scores.df, pch=".",
main="Comparing IRT and classical with random missing data")
 #with missing data, the theta estimates are noticably better.
#now show how to do this with a quasi-Rasch model
tau <- irt.tau(d9$items)
scores.rasch <- scoreIrt(tau,d9$items,key=rep(1,9))
scores.dfr <- data.frame(scores.df,rasch = scores.rasch)
pairs.panels(scores.dfr)  #rasch is actually better!



v9 <- sim.irt(9,1000,-2.,2.,mod="normal") #dichotomous items
items <- v9$items
test <- irt.fa(items)
total <- rowSums(items)
ord <- order(total)
items <- items[ord,]


#now delete some of the data - note that they are ordered by score
items[1:333,5:9] <- NA
items[334:666,3:7] <- NA
items[667:1000,1:4] <- NA
items[990:995,1:9] <- NA   #the case of terrible data
items[996:998,] <- 0   #all wrong
items[999:1000] <- 1   #all right
scores <- scoreIrt(test,items)
unitweighted <- scoreIrt(items=items,keys=rep(1,9)) #each item has a discrimination of 1
#combine the estimates with the true thetas.
scores.df <- data.frame(v9$theta[ord],scores,unitweighted) 
   
colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")
pairs.panels(scores.df,pch=".",main="Comparing IRT and classical with missing data") 
 #with missing data, the theta estimates are noticably better estimates 
 #of the generating theta than using the empirically derived factor loading weights

#now show the ability to score multiple scales using keys
ab.tau <- irt.tau(psychTools::ability)  #first find the tau values
ab.keys <- make.keys(psychTools::ability,list(g=1:16,reason=1:4,
letter=5:8,matrix=9:12,rotate=13:16))
#ab.scores <- scoreIrt(stats=ab.tau, items = psychTools::ability, keys = ab.keys)

#and now do it for polytomous items using 2pl
bfi.scores <- scoreIrt.2pl(bfi.keys,bfi[1:25])
#compare with classical unit weighting by using scoreItems
#not run in the interests of time
#bfi.unit <- scoreItems(psychTools::bfi.keys,psychTools::bfi[1:25])
#bfi.df <- data.frame(bfi.scores,bfi.unit$scores)
#pairs.panels(bfi.df,pch=".")


bfi.irt <- scoreIrt(items=bfi[16:20])  #find irt based N scores

#Specify item difficulties and discriminations from a different data set.
stats <- structure(list(MR1 = c(1.4, 1.3, 1.3, 0.8, 0.7), difficulty.1 = c(-1.2, 
-2, -1.5, -1.2, -0.9), difficulty.2 = c(-0.1, -0.8, -0.4, -0.3, 
-0.1), difficulty.3 = c(0.6, -0.2, 0.2, 0.2, 0.3), difficulty.4 = c(1.5, 
0.9, 1.1, 1, 1), difficulty.5 = c(2.5, 2.1, 2.2, 1.7, 1.6)), row.names = c("N1", 
"N2", "N3", "N4", "N5"), class = "data.frame")


stats #show them
bfi.new <-scoreIrt(stats,bfi[16:20])
bfi.irt <- scoreIrt(items=bfi[16:20]) 
cor2(bfi.new,bfi.irt)
newstats <- stats
newstats[2:6] <-stats[2:6 ] + 1  #change the difficulties
bfi.harder <- scoreIrt(newstats,bfi[16:20])
pooled <- cbind(bfi.irt,bfi.new,bfi.harder)
describe(pooled)  #note that the mean scores have changed
lowerCor(pooled) #and although the unit weighted scale are identical,
# the irt scales differ by changing the difficulties


# }

Run the code above in your browser using DataLab