Ideas in Testing Research Seminar Schedule, October 20, 2017
Coffee & Networking (9:15 — 9:45)
Welcome and Introduction (9:45 — 10:00)
Computer Adaptive Testing (CAT) (10:00 — 11:00)
Item exposure mediation in CAT using temporary enemy item relationships (Becker, Pearson)
Abstract: Enemy items and repeat tests create challenges
in testing operations, especially in the continuous administration of
linear fixed-length test forms. This study uses simulation analysis to
evaluate how these two factors might affect ability estimation and item
exposure in computerized adaptive testing, and finds that enemy item
relationships can help balance item exposure without impacting precision.
Self-efficacy and the CAT environment (Mark Brow, UIC)
Abstract: Several authors have found that computer adaptive
testing (CAT), with optimized psychometric precision that corresponds
to a 50 percent pass rate, may increase test anxiety and debilitate
performance, though no studies have explicitly examined the mechanism
that impacts performance. This paper will offer a tentative explanation
of this mechanism and examine methods that decrease CAT difficulty
while maintaining precision. This study uses simulations modeled
after two studies, Eggen and Verschoor (2006) and Bergstrom, Lunz, and
Gershon (1992), that proposed methods for lessening test difficulty in
CAT. The aim of those papers was to examine the loss of psychometric
efficiency under alternate item selection criteria, i.e., alternatives
to the maximum information criterion, and to offer compensatory
procedures. This paper will explore some of these procedures as a way
to bolster self-efficacy.
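The alternate item selection criteria mentioned in the abstract can be made concrete with a small sketch. The following Python example is a minimal, hypothetical illustration (not the authors' implementation), assuming a Rasch-model item bank: it selects the unadministered item whose predicted success probability is closest to a target above 50 percent, one simple way to make a CAT feel easier while still choosing reasonably informative items.

    import numpy as np

    def p_correct(theta, b):
        """Rasch probability of a correct response at ability theta."""
        return 1.0 / (1.0 + np.exp(-(theta - b)))

    def select_item(theta, b, administered, target_p=0.5):
        """Pick the unadministered item whose success probability is closest
        to target_p. Under the Rasch model, target_p = 0.5 reproduces
        maximum-information selection; target_p > 0.5 yields an easier test."""
        p = p_correct(theta, b)
        p[list(administered)] = np.nan      # exclude items already given
        return int(np.nanargmin(np.abs(p - target_p)))

    # toy usage: 200-item bank, examinee provisionally estimated at theta = 0.3
    rng = np.random.default_rng(0)
    b = rng.normal(0, 1, 200)
    print(select_item(0.3, b, administered=set(), target_p=0.7))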
A Predicted Standard Error Reduction Stopping Rule for Multidimensional Computer Adaptive Tests (Neapolitan, Morris, Bass, Lauristen)
Abstract: CAT stopping rules often administer items until a
specified level of measurement precision is reached, which is effective
when all trait levels are well represented in the item bank. When the
item bank is misaligned with the trait distribution, however, there are
some trait levels where the available items provide little information
and some respondents might be asked a large number of questions without
ever achieving the SE cutoff.
Choi, Grady & Dodd (2010) proposed an alternative approach based on
predicted SE reduction (PSER). If no item is expected to substantially
improve measurement precision, there is no point administering additional
items and the exam would stop, regardless of whether the SE cutoff
has been reached. This approach has been found to substantially reduce
the number of items administered to individuals for whom the item bank
provides little information (Morris, Bass & Neapolitan, 2016).
The current research extends the PSER stopping rule to multidimensional
CAT. MCAT introduces the additional complication that we are attempting
to simultaneously minimize the SE of multiple traits. We will present
three versions of the multidimensional PSER, which alternately apply the
stopping rule to: a) the trait with the largest SE, b) the sum of SEs
across all traits, and c) the sum of posterior variances across all traits. The
relative efficiency and precision of the three alternatives will be
examined using a 3-dimensional CAT for the PROMIS emotional distress
banks (depression, anxiety, and anger; Cella et al., 2010; Morris, Bass &
Neapolitan, 2017).
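For readers unfamiliar with PSER, the following minimal Python sketch shows how the three multidimensional stopping criteria listed above might be expressed in code. The function, inputs, and threshold are hypothetical illustrations under stated assumptions, not the authors' implementation.

    import numpy as np

    def should_stop(se_current, se_predicted, rule="max_se", min_reduction=0.01):
        """Sketch of three multidimensional PSER variants.
        se_current / se_predicted: per-trait standard errors before and after
        the best remaining candidate item. Stop when no item is predicted to
        reduce the chosen criterion by more than min_reduction."""
        if rule == "max_se":        # a) trait with the largest SE
            reduction = se_current.max() - se_predicted.max()
        elif rule == "sum_se":      # b) sum of SEs across all traits
            reduction = se_current.sum() - se_predicted.sum()
        elif rule == "sum_var":     # c) sum of posterior variances
            reduction = (se_current ** 2).sum() - (se_predicted ** 2).sum()
        else:
            raise ValueError(rule)
        return reduction < min_reduction

    # toy usage for a 3-dimensional CAT (e.g., depression, anxiety, anger)
    se_now  = np.array([0.42, 0.35, 0.55])
    se_next = np.array([0.41, 0.35, 0.54])   # predicted after the best remaining item
    print(should_stop(se_now, se_next, rule="sum_var"))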
Break (11:00 — 11:10)
Item Analysis & DIF (11:10 — 12:10)
Investigating item characteristics and DIF in MIMIC (Reboucas & Cheng, Notre Dame)
Abstract: In DIF assessment (Holland & Thayer, 1988;
Thissen, Steinberg, & Wainer, 1998; Shealy & Stout, 1993),
researchers frequently identify a set of DIF-free items prior to DIF
detection. Such items are known as the anchor, and many methods of anchor
selection have been developed (Kopf, Zeileis & Strobl, 2015). Using
the multiple indicators, multiple causes (MIMIC; Camilli & Shepard,
2004) model, Shih & Wang (2009) proposed the two-step procedure
M-IT/M-PA consisting of first selecting an anchor (M-IT) and then
testing all other items for DIF with the pure anchor (M-PA). M-IT/M-PA
yields high power and nominal type I error rates even with a four-item
anchor. Limited research has been done on the association between item
characteristics and DIF detection (Magis and De Boeck, 2014), especially
with an anchor set. This study aims to (1) investigate the relationship
between item characteristics and accuracy of the anchor and (2) assess
the accuracy of DIF detection.
Simulation study results show that DIF assessment with M-IT/M-PA
has decreased power when items have low discrimination or are either
very easy or very difficult. Low discrimination is also associated with
poor accuracy in the selection of DIF-free items for the anchor. In
general, a DIF item with low discrimination is more likely to be
mistakenly selected into the anchor set, and even when it is tested for
DIF, the DIF effect is difficult to detect. In the future, other anchor
selection and DIF detection methods, such as the IRT-LRT method, will be
studied, and we expect similar trends to be found.
Exploring the linguistic characteristics of DIF (Jorion, Pearson)
Using Response Time to Detect Speededness Based on CUSUM (Yu & Cheng, Notre Dame)
Abstract: Test speededness occurs when an examinee does not have
sufficient time to fully consider every question on a test within a fixed
time limit (Bejar, 1985). Test speededness has been a long-standing issue
in test theory (Schnipke & Scrams, 1997). Item-level response times (RTs),
the times an examinee spends on each item during a test, can provide
information beyond the item responses about examinees' test-taking
behavior, as well as information about item and test characteristics
(Marianti, Fox, Avetisyan, & Veldkamp, 2014).
This paper focuses on using response time data to detect aberrant response
behavior, specifically speededness, based on the CUSUM procedure, a widely
used statistical process control (SPC) technique. We conducted simulation
studies to investigate how well a number of CUSUM statistics popular in
IRT research detect speededness. Two different models, the gradual change
model (GCM) and the mixed hierarchical model (MHM), were used to generate
response time data under speeded behavior; normal response times are
assumed to follow a log-normal model. Results suggest that CUSUM statistics
are more powerful in detecting gradual change than abrupt change caused by
speededness. The Type I error rates for most of the CUSUM statistics are
close to the nominal level. Overall, the statistic T4 performs best, all
things considered.
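As a rough illustration of the general idea (not the specific T1-T4 statistics compared in the paper), the following Python sketch applies a generic one-sided CUSUM to standardized log response times under an assumed log-normal model, flagging the item at which accumulated evidence of unusually fast responding crosses a threshold. The parameter values are arbitrary placeholders.

    import numpy as np

    def cusum_speededness(log_rt, mu, sigma, k=0.5, h=3.0):
        """One-sided CUSUM over standardized log response times.
        mu/sigma: expected log-RT and SD per item under a log-normal model;
        responses much faster than expected drive the statistic upward.
        Returns the first item index at which the statistic exceeds h."""
        z = (mu - np.asarray(log_rt)) / sigma   # positive when faster than expected
        c = 0.0
        for i, zi in enumerate(z):
            c = max(0.0, c + zi - k)            # accumulate evidence of speeding
            if c > h:
                return i
        return None

    # toy usage: examinee speeds up over the last five of 20 items
    mu, sigma = np.full(20, 4.0), np.full(20, 0.5)
    log_rt = np.concatenate([np.random.normal(4.0, 0.5, 15),
                             np.random.normal(2.5, 0.3, 5)])
    print(cusum_speededness(log_rt, mu, sigma))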
Lunch (12:10 — 1:00)
Research Discussion (1:00 — 1:45)
An agenda for psychometric research (Mead, Talent Algorithms Inc.; Becker, Pearson; & Morris, IIT)
Abstract: This discussion session will present research ideas for comments by the audience.
Computational Modeling (1:45 — 2:45)
A computational model of targeted recruiting (Morris, IIT)
Abstract: Ethnic disparities in employment outcomes are a
persistent concern for organizations. Efforts to reduce disparities
through the design of selection systems have had only limited success. A
complementary approach is to focus on the quality and diversity of the
applicant pool. Targeted recruiting aims to increase the number of
highly qualified minority candidates in the applicant pool. In order to
reduce disparities, recruitment efforts must simultaneously target both
minority populations and job qualifications. The current research further
explores the impact of qualification-focused recruitment. A distinction
is made between recruiting efforts that encourage more applications from
highly-qualified applicants and those that discourage applications from
unqualified candidates (the so-called chilling effect). Simulations are
used to explore the consequences of these different recruitment effects
on minority hires and adverse impact statistics.
A machine learning "Rosetta Stone" for psychologists and psychometricians (Mead, Talent Algorithms Inc. & Huang, Amazon)
Abstract: Do you know how and why to perform "one-hot"
encoding for a "co-occurrence matrix?" Or how and why to perform "2-fold"
cross-validation? What are "bagging," "boosting," or "features?" You will
understand these terms after you hear this talk about a "Rosetta Stone"
for psychometricians and psychologists to understand the jargon that
data scientists are inventing, in many cases for concepts we have been
using for decades. Amuse your friends by talking like a data scientist!
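As a taste of the translation exercise, here is a minimal, illustrative Python sketch of two of the terms in the abstract: one-hot encoding, which is essentially dummy coding a categorical predictor, and 2-fold cross-validation, a resampling estimate of out-of-sample performance. The data and model are made up for illustration.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # "One-hot" encoding: each category becomes its own 0/1 column --
    # the same idea as dummy coding in regression.
    df = pd.DataFrame({"major": ["psych", "stats", "psych", "cs"]})
    X = pd.get_dummies(df["major"], dtype=float).to_numpy()

    # "2-fold cross-validation": split the sample in half, fit on one half,
    # score on the other, then swap -- close kin to double cross-validation
    # in classical test development.
    y = np.array([1, 0, 1, 0])
    print(cross_val_score(LogisticRegression(), X, y, cv=2).mean())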
Precise estimation of type I error inflation from questionable research practices (Hernandez)
Abstract: Certain strategies researchers use when analyzing their data make
finding statistically significant results more likely when no true effect
exists. Prior research implicates five specific "questionable research
practices" that lead to a greater number of false positives. Because
of those findings, there is a greater demand for researchers to be
transparent about their methods, and to disclose all methodological steps
taken. However, no research has addressed how reviewers and the public
should incorporate the presence of questionable research practices into
their evaluation of a paper. The current paper addresses this issue
by presenting precise equations for quantifying the exact effect that
the five main questionable research practices have on the false positive
rate. Through Monte Carlo simulations and symbolic regression, closed form
solutions to the exact change in the Type I error are described. These
equations allow the peer review process greater precision and consistency
in its assessment of research.
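To make the flavor of such simulations concrete, here is a minimal Python sketch (not the paper's equations or its specific set of five practices) estimating how one commonly discussed practice, optional stopping, inflates the false-positive rate when no true effect exists. All sample sizes and settings are arbitrary.

    import numpy as np
    from scipy import stats

    def optional_stopping_fpr(n_start=20, n_max=50, step=5,
                              reps=10_000, alpha=0.05, seed=1):
        """Monte Carlo false-positive rate when a researcher re-tests after
        each additional batch of subjects and stops at the first p < alpha,
        even though the null hypothesis is true."""
        rng = np.random.default_rng(seed)
        hits = 0
        for _ in range(reps):
            x = rng.normal(size=n_max)              # null is true: mean = 0
            for n in range(n_start, n_max + 1, step):
                if stats.ttest_1samp(x[:n], 0).pvalue < alpha:
                    hits += 1
                    break
        return hits / reps

    print(optional_stopping_fpr())   # noticeably above the nominal .05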
Break (2:45 — 3:00)
Applications (3:00 — 4:00)
Exploring Adolescent Personality with a Sliding Response Scale (Yankov, Bowling Green State University & Testify Software Solutions)
Abstract: MindMap is an innovative application assessing
high school students' personality, vocational interests, and learning
styles. By using gamification and an intuitive sliding response scale
it enables the assessment of multiple traits while keeping students
engaged. After students finish playing and responding to the items, their
scale scores are sent to their teacher's dashboard. This talk will answer
three research questions. First, can personality items be reliably
administered through mobile devices? Second, how is scale reliability
affected by the use of the slider response scale? Third, does age affect
the construct validity and interpretation of the personality scales?
Relative Index Score Report based on Estimated Domain Score (Denbleyker, Houghton Mifflin Harcourt)
Abstract: Utilizing true score equating methodology and
articulated standards, a relative index score report is derived from
an IRT true score distribution conditional on the estimated ability
(test characteristic function). In a CAT item pool framework, the total
item pool for a particular assessment can be used to construct the test
characteristic curve. Using IRT avoids disadvantages inherent in a CTT
estimate and improves operational flexibility. The talk will discuss
some of the features and advantages of this domain-sampling approach
to reporting.
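As a simple illustration of the test characteristic function underlying this approach, the following Python sketch (a toy example under assumed item parameters, not the presenter's operational procedure) computes an estimated domain score, the expected proportion correct over the full item pool, at a given ability estimate.

    import numpy as np

    def estimated_domain_score(theta, a, b, c=None):
        """Test characteristic function: expected proportion correct over the
        item pool at ability theta, assuming a 2PL (or 3PL if c is given) model."""
        c = np.zeros_like(a) if c is None else c
        p = c + (1 - c) / (1 + np.exp(-a * (theta - b)))
        return p.mean()     # expected proportion of the pool answered correctly

    # toy pool of 500 items
    rng = np.random.default_rng(2)
    a, b = rng.uniform(0.8, 2.0, 500), rng.normal(0, 1, 500)
    print(estimated_domain_score(theta=0.5, a=a, b=b))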
The psychometrics of Likert surveys: Lessons learned from analyses of the 16pf Questionnaire (Mead, Talent Algorithms Inc.)
Abstract: Prior to joining the faculty at IIT, I thought I
knew everything an I/O psychologist needed to know about performing
psychometric analyses. And, I thought that the same procedures that
apply to ability tests could be trivially adapted to personality,
attitude and other non-ability scales using Likert response scales,
or other formats (e.g., balanced forced-choice). I was wrong, and it
was a learning experience. This talk uses examples from my work with
the 16pf Questionnaire to illustrate lessons learned about some of
the most significant differences between analysis of Likert surveys
and more mainstream psychometric analysis in the areas of educational,
certification, and ability testing.
Closing comments (4:00)
Questions about the seminar may be directed to Alan Mead, Scott Morris, or Kirk Becker. We hope you will join us.