Some adaptive test systems (e.g. school assessment tools) use the 1PL IRT model, while others use the 2PL or the 3PL. When developing an adaptive IQ test, is there a rule of thumb for which model to choose when calibrating item difficulty and test takers' ability?
I can't find any research that gives insight into how well IQ test items fit the different kinds of IRT models.
Many thanks in advance!
Best Answer
I think the choice is primarily a philosophical one when it comes to Rasch/1PL models (the emphasis on what measurement means is slightly different in that literature, and hence researchers try their best to obtain items that fit these special requirements), and an empirical/design one when deciding between the 2PL and 3PL models.
Since the slopes are all equal in 1PL models, determining a person's location amounts to finding the point where the respondent has a P = 0.5 chance of answering correctly, which can be done by simply choosing the items with the best-matching intercepts to get an estimate of $\theta$. In 2PL and 3PL models it's slightly more complicated because of the unequal slopes and the lower-bound parameters for guessing. As a consequence, 2PL and 3PL models often require more advanced adaptive item selection procedures, such as Kullback–Leibler or Fisher information criteria, to select the next best item for homing in on $\theta$.
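To make the item selection point concrete, here is a minimal sketch of maximum Fisher information selection. It assumes the common difficulty parameterization $a(\theta - b)$ rather than slope-intercept form, and the function names (`p_correct`, `fisher_information`, `next_item`) are made up for illustration, not taken from any particular package.

```python
import numpy as np

def p_correct(theta, a, b, c=0.0):
    """3PL probability of a correct response; c=0 gives the 2PL,
    and c=0 with a common slope gives the 1PL."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(theta, a, b, c=0.0):
    """Item information at theta under the 3PL.
    With c=0 this reduces to a**2 * p * (1 - p)."""
    p = p_correct(theta, a, b, c)
    return a**2 * ((p - c) / (1.0 - c))**2 * (1.0 - p) / p

def next_item(theta_hat, a, b, c, administered=()):
    """Index of the not-yet-administered item with maximum information
    at the current ability estimate theta_hat."""
    info = np.array(fisher_information(theta_hat, a, b, c), dtype=float)
    for i in administered:
        info[i] = -np.inf  # never pick the same item twice
    return int(np.argmax(info))

# Equal slopes, no guessing (1PL-like): the item whose difficulty is
# closest to theta_hat wins, i.e. the one with P nearest 0.5.
print(next_item(0.1, np.ones(3), np.array([-1.0, 0.0, 1.0]), np.zeros(3)))

# Unequal slopes (2PL): a steeper item can win even though its
# difficulty is further from theta_hat.
print(next_item(0.0, np.array([0.5, 2.0]), np.array([0.0, 0.5]), np.zeros(2)))
```

In the first call the selection is driven by the difficulties alone, which is the "pick the best intercept" shortcut; in the second, the slopes change the ranking, which is why an information criterion is needed.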
Speaking purely from a design perspective, if the adaptive testing items offer a finite number of response options (e.g., multiple choice), so that a correct answer can be guessed, then the 3PL seems like the better option. If the items are more of a fill-in-the-blank style (e.g., 2 + 3 = __), then the 1PL and 2PL models would, at least theoretically, be more reasonable.
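For reference, the three models are nested, which is what makes the guessing argument work. Using the difficulty parameterization (an assumption on my part; slope-intercept form is equivalent), with $a_i$ the slope, $b_i$ the difficulty, and $c_i$ the lower asymptote of item $i$:

$$
P(X_i = 1 \mid \theta) = c_i + (1 - c_i)\,\frac{1}{1 + \exp\{-a_i(\theta - b_i)\}}
$$

Setting $c_i = 0$ gives the 2PL, and additionally constraining all $a_i$ to a common value gives the 1PL. For a multiple-choice item with $k$ options, $1/k$ is a natural first guess for $c_i$, though estimated values often differ from it; for open-response items $c_i \approx 0$ is plausible, which is why the simpler models are reasonable there.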