Item exposure mediation in CAT using temporary enemy item relationships (Becker, Pearson) abstract
Abstract: Enemy items and repeat tests create challenges for testing operations, especially in the continuous administration of linear fixed-length test forms. This study uses simulation to evaluate how these two factors might affect ability estimation and item exposure in computerized adaptive testing, and finds that temporary enemy item relationships can help balance item exposure without impacting precision.
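The abstract does not spell out the selection mechanics, but a minimal sketch of how temporary enemy relationships could constrain CAT item selection might look like the following (the function name and the `enemies` map are hypothetical, not the author's implementation):

```python
def select_item(pool_info, administered, enemies):
    """Pick the most informative eligible item.

    pool_info:    dict item_id -> Fisher information at the current theta
    administered: set of item ids already given this session
    enemies:      dict item_id -> set of (temporarily) incompatible item ids
    """
    # Items barred because an already-administered item names them as an enemy
    blocked = set()
    for item in administered:
        blocked |= enemies.get(item, set())
    eligible = [i for i in pool_info if i not in administered and i not in blocked]
    # Fall back to ignoring enemy constraints if nothing remains
    if not eligible:
        eligible = [i for i in pool_info if i not in administered]
    return max(eligible, key=lambda i: pool_info[i])
```

Because the enemy relationships are temporary, the `enemies` map can be rebuilt between sessions to spread exposure across overused items.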
Self-efficacy and the CAT environment (Mark Brow, UIC) abstract
Abstract: Several authors have found that computer adaptive testing (CAT), with optimized psychometric precision that corresponds to a 50 percent pass rate, may increase test anxiety and debilitate performance, though no studies have explicitly examined the mechanism that impacts performance. This paper will offer a tentative explanation of this mechanism and examine methods that decrease CAT difficulty while maintaining precision. This study uses simulations modeled after two studies, Eggen and Verschoor (2006) and Bergstrom, Lunz, and Gershon (1992), that proposed methods for lessening test difficulty in CAT. The aim of those papers was to examine the loss of psychometric efficiency under alternate item selection criteria, i.e., alternatives to the maximum information criterion, and to offer compensatory procedures. This paper will explore some of these procedures as a way to bolster self-efficacy.
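Neither paper's exact procedure is reproduced here, but the core idea of trading information for easier items can be sketched as selecting the item whose success probability at the current ability estimate is closest to a target above .50 (the .60 target and 2PL parameterization below are assumptions, in the spirit of Bergstrom, Lunz, and Gershon, 1992):

```python
import numpy as np

def prob_2pl(theta, a, b):
    """Success probability under the 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def select_by_target_p(theta, a, b, target=0.60):
    """Return the index of the item whose P(correct) at theta is closest
    to `target`. A target above .50 makes the test feel easier than
    maximum-information selection, which targets roughly .50."""
    p = prob_2pl(theta, a, b)
    return int(np.argmin(np.abs(p - target)))

# Example: five candidate items with illustrative parameters
a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])
b = np.array([-0.5, 0.0, 0.3, 0.8, -1.0])
print(select_by_target_p(theta=0.0, a=a, b=b))
```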
A Predicted Standard Error Reduction Stopping Rule for Multidimensional Computer Adaptive Tests (Neapolitan, Morris, Bass, Lauristen) abstract slides
Abstract: CAT stopping rules often administer items until a specified level of measurement precision is reached, which is effective when all trait levels are well represented in the item bank. When the item bank is misaligned with the trait distribution, however, there are some trait levels where the available items provide little information, and some respondents might be asked a large number of questions without ever achieving the SE cutoff.
Choi, Grady & Dodd (2010) proposed an alternative approach based on predicted SE reduction (PSER). If no item is expected to substantially improve measurement precision, there is no point administering additional items and the exam would stop, regardless of whether the SE cutoff has been reached. This approach has been found to substantially reduce the number of items administered to individuals for whom the item bank provides little information (Morris, Bass & Neapolitan, 2016).
The current research extends the PSER stopping rule to multidimensional CAT (MCAT). MCAT introduces the additional complication that we are attempting to simultaneously minimize the SE of multiple traits. We will present three versions of the multidimensional PSER, which alternately apply the stopping rule to: a) the trait with the largest SE, b) the sum of SEs across all traits, and c) the sum of posterior variances across all traits. The relative efficiency and precision of the three alternatives will be examined using a 3-dimensional CAT for the PROMIS emotional distress banks (depression, anxiety, and anger; Cella et al., 2010; Morris, Bass & Neapolitan, 2017).
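As a sketch of how the three versions differ, the stopping decision can be framed in terms of the current posterior covariance of the trait estimates and the covariance predicted after administering the best candidate item (the function and threshold below are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def pser_stop(cov_now, cov_pred, min_reduction=0.01, rule="max_se"):
    """Stop if the best candidate item is not predicted to improve the
    chosen precision summary by at least `min_reduction`.

    cov_now:  posterior covariance of the trait estimates now
    cov_pred: predicted posterior covariance after the best item
    rule:     'max_se'  - SE of the least precise trait (version a)
              'sum_se'  - sum of SEs across traits (version b)
              'sum_var' - sum of posterior variances, i.e. the trace (version c)
    """
    def summary(cov):
        se = np.sqrt(np.diag(cov))
        if rule == "max_se":
            return se.max()
        if rule == "sum_se":
            return se.sum()
        return np.trace(cov)  # 'sum_var'

    return summary(cov_now) - summary(cov_pred) < min_reduction
```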
Investigating item characteristics and DIF in MIMIC (Reboucas & Cheng, Notre Dame) abstract
Abstract: In DIF assessment (Holland & Thayer, 1988; Thissen, Steinberg, & Wainer, 1998; Shealy & Stout, 1993), researchers frequently identify a set of DIF-free items prior to DIF detection. Such items are known as the anchor, and many methods of anchor selection have been developed (Kopf, Zeileis & Strobl, 2015). Using the multiple indicators, multiple causes (MIMIC; Camilli & Shepard, 2004) model, Shih & Wang (2009) proposed the two-step procedure M-IT/M-PA, which consists of first selecting an anchor (M-IT) and then testing all other items for DIF against the pure anchor (M-PA). M-IT/M-PA yields high power and nominal Type I error rates even with a four-item anchor. Limited research has been done on the association between item characteristics and DIF detection (Magis & De Boeck, 2014), especially with an anchor set. This study aims to (1) investigate the relationship between item characteristics and the accuracy of the anchor and (2) assess the accuracy of DIF detection.
Simulation study results show that DIF assessment with M-IT/M-PA has decreased power when items are poorly discriminating or very easy/very difficult. Low discrimination is also associated with poor accuracy in the selection of DIF-free items for the anchor. In general, a DIF item with low discrimination is more likely to be mistakenly selected into the anchor set, and even when it is tested for DIF, the DIF effect is difficult to detect. In the future, other anchor selection and DIF detection methods, such as the IRT-LRT method, will be studied, and we expect similar trends to be found.
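For readers unfamiliar with the model, the standard MIMIC DIF formulation can be written as follows (a sketch in common notation, not necessarily the authors' exact parameterization):

```latex
\[
  y_i^{*} = \lambda_i \,\theta + \beta_i z + \varepsilon_i ,
  \qquad
  \theta = \gamma z + \zeta ,
\]
% z is the group indicator (0 = reference, 1 = focal),
% \lambda_i is the loading of item i on the latent trait \theta,
% \gamma captures true group differences (impact) on the trait, and
% \beta_i \neq 0 signals (uniform) DIF on item i.
```

Under this formulation, anchor items are those constrained to have beta equal to zero, which is why a contaminated anchor (a DIF item slipping in) distorts the test for every other item.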
Exploring the linguistic characteristics of DIF (Jorion, Pearson) abstract
Abstract: [Coming soon!]
Using Response Time to Detect Speededness Based on CUSUM (Yu & Cheng, Notre Dame) abstract
Abstract: Test speededness occurs when an examinee does not have sufficient time to fully consider every question on a test within a fixed time limit (Bejar, 1985). Test speededness has been a long-standing issue in test theory (Schnipke & Scrams, 1997). Item-level response times (RTs) refer to the times that an examinee spends on each item during a test. They can provide information beyond item responses about examinees' test-taking behavior, as well as about item and test characteristics (Marianti, Fox, Avetisyan, & Veldkamp, 2014).
This paper focuses on using response time data to detect aberrant response behavior, more specifically speededness, based on the CUSUM (cumulative sum) procedure, a widely used Statistical Process Control (SPC) technique. We conducted simulation studies to investigate the performance of a number of CUSUM statistics popular in IRT research in detecting speededness. Two different models, the gradual change model (GCM) and the mixed hierarchical model (MHM), were used to generate response time data under speeded behavior. Normal response times are assumed to follow the log-normal model.
Results suggest that CUSUM statistics are more powerful in detecting gradual change than abrupt change caused by speededness. The Type I error rates for most of the CUSUM statistics are close to the nominal level. All things considered, the statistic T4 performs best overall.
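As a minimal illustration of the general approach (not the specific T1-T4 statistics from the paper), a one-sided CUSUM on standardized log response times that accumulates evidence of unexpectedly fast responding might look like:

```python
import numpy as np

def cusum_speededness(log_rt, mu, sigma, k=0.5, h=4.0):
    """One-sided CUSUM chart flagging unexpectedly fast responses.

    log_rt:    observed log response times, in administration order
    mu, sigma: per-item expected log RT and SD under the log-normal model
    k:         reference value (drift allowance); h: decision threshold
    Returns the first item index where the chart signals, or None.
    """
    z = (np.asarray(log_rt) - np.asarray(mu)) / np.asarray(sigma)
    c = 0.0
    for t, zt in enumerate(z):
        # Speeded responses are faster than expected, so -z accumulates
        c = max(0.0, c + (-zt - k))
        if c > h:
            return t
    return None
```

The chart resets toward zero while responding is on pace and signals only when fast responses accumulate, which is why gradual drift is easier to catch than a single abrupt shift near the end of the test.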
An agenda for psychometric research (Mead, Talent Algorithms Inc., Becker, Pearson, & Morris, IIT) abstract
Abstract: This discussion session will present research ideas for comments by the audience.
A computational model of targeted recruiting (Morris, IIT) abstract
Abstract: Ethnic disparities in employment outcomes are a persistent concern for organizations. Efforts to reduce disparities through the design of selection systems have had only limited success. A complementary approach is to focus on the quality and diversity of the applicant pool. Targeted recruiting aims to increase the number of highly qualified minority candidates in the applicant pool. In order to reduce disparities, recruitment efforts must simultaneously target both minority populations and job qualifications. The current research further explores the impact of qualification-focused recruitment, distinguishing between recruiting efforts that encourage more applications from highly qualified applicants and those that discourage applications from unqualified candidates (the so-called chilling effect). Simulations are used to explore the consequences of these different recruitment effects on minority hires and adverse impact statistics.
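A toy sketch of the kind of simulation described might look like the following (all distributions, pool sizes, and effect encodings are assumptions for illustration, not the author's model):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_pool(n_min=200, n_maj=800, extra_hiq_min=0, chill_cut=None):
    """Simulate applicant qualification scores; optionally add highly
    qualified minority applicants or 'chill' (remove) low scorers."""
    minority = rng.normal(0.0, 1.0, n_min)
    majority = rng.normal(0.0, 1.0, n_maj)
    if extra_hiq_min:
        minority = np.append(minority, rng.normal(1.0, 0.5, extra_hiq_min))
    if chill_cut is not None:
        minority = minority[minority > chill_cut]
        majority = majority[majority > chill_cut]
    return minority, majority

def adverse_impact_ratio(minority, majority, select_rate=0.2):
    """Selection-rate ratio under top-down selection at a common cutoff."""
    cutoff = np.quantile(np.concatenate([minority, majority]), 1 - select_rate)
    return (minority > cutoff).mean() / (majority > cutoff).mean()

base = adverse_impact_ratio(*simulate_pool())
boost = adverse_impact_ratio(*simulate_pool(extra_hiq_min=50))
chill = adverse_impact_ratio(*simulate_pool(chill_cut=-0.5))
print(f"AI ratio: baseline={base:.2f}, added hi-qual={boost:.2f}, chilled={chill:.2f}")
```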
A machine learning "Rosetta Stone" for psychologists and psychometricians (Mead, Talent Algorithms Inc. & Huang, Amazon) abstract slides
Abstract: Do you know how and why to perform "one-hot" encoding for a "co-occurrence matrix?" Or how and why to perform "2-fold" cross-validation? What are "bagging," "boosting," or "features?" You will understand these terms after you hear this talk about a "Rosetta Stone" for psychometricians and psychologists to understand the jargon that data scientists are inventing, in many cases for concepts we have been using for decades. Amuse your friends by talking like a data scientist!
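As a taste of the translation, here is what two of those terms look like in code (a minimal sketch using pandas and scikit-learn; the toy data are made up):

```python
import pandas as pd
from sklearn.model_selection import KFold

# "One-hot" encoding: what psychologists call dummy coding
df = pd.DataFrame({"color": ["red", "green", "red", "blue"]})
print(pd.get_dummies(df["color"]))  # one 0/1 column per category

# "2-fold" cross-validation: split the data in half, fit on one half
# and validate on the other, then swap (cf. double cross-validation)
kf = KFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(df):
    print("train:", train_idx, "test:", test_idx)
```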
Precise estimation of type I error inflation from questionable research practices (Hernandez) abstract
Abstract: Certain strategies researchers use when analyzing their data make finding statistically significant results more likely when no true effect exists. Prior research implicates five specific "questionable research practices" that lead to a greater number of false positives. Because of those findings, there is a greater demand for researchers to be transparent about their methods and to disclose all methodological steps taken. However, no research has addressed how reviewers and the public should incorporate the presence of questionable research practices into their evaluation of a paper. The current paper addresses this issue by presenting precise equations for quantifying the exact effect that the five main questionable research practices have on the false positive rate. Through Monte Carlo simulations and symbolic regression, closed-form solutions for the exact change in the Type I error are described. These equations give the peer review process greater precision and consistency in the assessment of research.
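The paper derives closed-form expressions; a brute-force Monte Carlo check of one such practice (optional stopping, i.e., retesting as observations are added) might look like the following sketch (the sample sizes and step are arbitrary assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def false_positive_rate(n_start=20, n_max=50, step=5, reps=10_000, alpha=0.05):
    """Type I error rate of a two-sample t-test under a true null when the
    researcher retests after every `step` added observations per group."""
    hits = 0
    for _ in range(reps):
        x = rng.normal(size=n_max)
        y = rng.normal(size=n_max)
        for n in range(n_start, n_max + 1, step):
            if stats.ttest_ind(x[:n], y[:n]).pvalue < alpha:
                hits += 1
                break
    return hits / reps

print(false_positive_rate())  # noticeably above the nominal .05
```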
Exploring Adolescent Personality with a Sliding Response Scale (Yankov, Bowling Green State University & Testify Software Solutions) abstract
Abstract: MindMap is an innovative application assessing high school students' personality, vocational interests, and learning styles. By using gamification and an intuitive sliding response scale, it enables the assessment of multiple traits while keeping students engaged. After students finish playing and responding to items, their scale scores go to their teacher's dashboard. This talk will answer three research questions. First, can personality items be reliably administered through mobile devices? Second, how is scale reliability affected by the use of the slider response scale? Third, does age affect the construct validity and interpretation of the personality scales?
Relative Index Score Report based on Estimated Domain Score (Denbleyker, Houghton Mifflin Harcourt) abstract
Abstract: Utilizing true score equating methodology and articulated standards, a relative index score report is derived from an IRT true score distribution conditional on the estimated ability (test characteristic function). In a CAT item pool framework, the total item pool for a particular assessment can be used to construct the test characteristic curve. Using IRT avoids disadvantages inherent in a CTT estimate and improves operational flexibility. The talk will discuss some of the features and advantages of this domain-sampling approach to reporting.
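In outline, the estimated domain score is the test characteristic function of the full pool, evaluated at the examinee's ability estimate and rescaled to the proportion-correct metric. A sketch under a 3PL pool (the parameters below are illustrative; the paper's exact reporting metric may differ):

```python
import numpy as np

def estimated_domain_score(theta_hat, a, b, c):
    """Expected proportion correct over the full item pool: the test
    characteristic function TCC(theta) / n under the 3PL model."""
    p = c + (1 - c) / (1 + np.exp(-a * (theta_hat - b)))
    return p.mean()

# Example pool of four items
a = np.array([1.0, 1.3, 0.8, 1.1])
b = np.array([-1.0, 0.0, 0.5, 1.2])
c = np.array([0.2, 0.2, 0.25, 0.2])
print(estimated_domain_score(0.4, a, b, c))
```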
The psychometrics of Likert surveys: Lessons learned from analyses of the 16pf Questionnaire (Mead, Talent Algorithms Inc.) abstract slides
Abstract: Prior to joining the faculty at IIT, I thought I knew everything an I/O psychologist needed to know about performing psychometric analyses. And I thought that the same procedures that apply to ability tests could be trivially adapted to personality, attitude, and other non-ability scales using Likert response scales or other formats (e.g., balanced forced-choice). I was wrong, and it was a learning experience. This talk uses examples from my work with the 16pf Questionnaire to illustrate lessons learned about some of the most significant differences between the analysis of Likert surveys and more mainstream psychometric analysis in the areas of educational, certification, and ability testing.
Questions about the seminar may be directed to Alan Mead (), Scott Morris (), or Kirk Becker (). We hope you will join us.