Ideas in Testing Research Seminar Schedule, November 4, 2022
Coffee & Networking (9:00 — 9:50)
Welcome and Introduction (9:50 — 10:00)
Technology Showcase (10:00 — 10:45)
Demonstration of VR/Simulations in Assessment — Kevin Leonard (Breakthrough Technologies)
Demonstration of Certiverse Automatic SME Coaching — Alan Mead (Certiverse)
abstract
Abstract: Subject matter experts (SMEs) are fundamentally
critical to exam development. There will never be any way to avoid
a near-complete reliance on SMEs. But although SMEs are experts in
their subject, they are rarely experts in exam development and SME
time is frequently a scarce resource that must be used as efficiently
as possible. Recent draft guidelines for computer-based tests (ITC,
ATP, March 2022) recommend that content management systems "configure
as many item writers' guidelines as possible by default in the
system." At Certiverse, we have implemented automated coaching for JTA
task elicitation and item writing. During task elicitation, the built-in
coaching helps SMEs write well-formed job tasks. The coaching built
into item writing helps SMEs adhere to Haladyna and Downing's (1989)
rules for writing multiple-choice items. This demo will show these
coaching features and discuss some machine learning research used to
trigger the coaching prompts.
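As a purely illustrative sketch of what rule-based item-writing coaching can look like (this is not Certiverse's implementation; the rule names and logic below are assumptions), the following Python checks two of Haladyna and Downing's guidelines: avoiding "all/none of the above" options and avoiding negatively worded stems.

```python
# Hypothetical rule-based item-writing checks in the spirit of
# Haladyna and Downing's (1989) guidelines; not Certiverse's code.
import re

RULES = {
    "avoid_all_or_none_of_the_above":
        lambda stem, options: any(
            re.search(r"\b(all|none) of the above\b", o, re.IGNORECASE)
            for o in options),
    "avoid_negatively_worded_stem":
        lambda stem, options: bool(
            re.search(r"\b(not|except)\b", stem, re.IGNORECASE)),
}

def coach(stem, options):
    """Return the names of the item-writing rules an item appears to violate."""
    return [name for name, check in RULES.items() if check(stem, options)]

print(coach("Which of the following is NOT a valid IPv4 address?",
            ["10.0.0.1", "256.1.1.1", "All of the above"]))
# -> ['avoid_all_or_none_of_the_above', 'avoid_negatively_worded_stem']
```

A production system would pair each flag with a coaching prompt rather than a bare rule name and, as the abstract notes, may trigger prompts from machine-learned models rather than regular expressions.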
Break (10:45 — 11:00)
IRT and Dimensionality (11:00 — 12:00)
IRT as a Method to Improve Matchmaking in Online Competitive Games — Matthew Lauritsen (City of Chicago)
abstract
slides
Abstract: Games that incorporate a skill-based ranking system
require a decision criterion on which to match players. Typically, this
is accomplished by matching players' win rates. Item response theory
(IRT) is proposed to improve the measurement of player skill, thereby
improving matchmaking. IRT allows for the explicit modeling of in-game
statistics that are related to player skill (e.g., points earned) in
addition to dichotomous win rates. Theta would then replace win rates
as a more precise measure of player skill. Implications for matchmaking
procedures are discussed.
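To make the proposal concrete, here is a minimal sketch (not the author's implementation; all parameter values are hypothetical) of estimating a player's latent skill theta under a 2PL model from dichotomous match outcomes. In practice, the polytomous in-game statistics mentioned in the abstract would be modeled alongside the win/loss record.

```python
# Minimal 2PL theta estimation from win/loss outcomes (illustrative only).
import numpy as np
from scipy.optimize import minimize_scalar

def p_win(theta, a, b):
    """2PL probability of a win given skill theta and match parameters a, b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_theta(outcomes, a, b):
    """Maximum-likelihood estimate of theta from a vector of 0/1 outcomes."""
    def neg_log_lik(theta):
        p = p_win(theta, a, b)
        return -np.sum(outcomes * np.log(p) + (1 - outcomes) * np.log(1 - p))
    return minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded").x

# Hypothetical match history against opponents of varying difficulty.
outcomes = np.array([1, 1, 0, 1, 0, 0, 1, 1])
a = np.full(8, 1.2)                                         # discriminations
b = np.array([-1.0, -0.5, 0.0, 0.2, 0.5, 0.8, -0.2, 0.1])  # difficulties
print(round(estimate_theta(outcomes, a, b), 2))  # theta replaces raw win rate
```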
Modeling Misspecification of IRT and Its Implications in the Context of Recognition Task Data — Qi (Helen) Huang, Daniel M. Bolt (University of Wisconsin-Madison)
abstract
slides
Abstract: Traditional item response theory (IRT) models, like
the 2PL, are often indiscriminately applied to tests with dichotomously
scored items. However, the wide variety of forms of response process that
may underlie such items raises concerns about the universal appropriateness
of the traditional logistic/normal functional form. In this paper we
consider an author recognition test (ART). Applications of item response
models to ART data show results highly suggestive of metric distortion:
strong positive correlations between the discrimination and difficulty
parameter estimates of the 2PL model. This correlation is referred
to as "scale shrinkage" (Lord, 1984; Yen, 1985). We suggest that a
theoretically plausible cause of this with the ART is the presence of a
memory component for each author name. We use both a simulation study
and an empirical study to illustrate how a metric distortion ensues,
and examine its consequences. We show that an underlying multicomponent
response process (presumed to reflect an exposure and a memory component
for each item) results in a distortion of the IRT metric that becomes
manifest when performing separate calibrations for groups of different
mean proficiency levels.
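For reference, the 2PL model applied to the ART data takes the standard form below; the "scale shrinkage" described above refers to a strong positive correlation between the estimated a_j and b_j across items.

```latex
% Standard 2PL item response function:
P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp\{-a_j(\theta_i - b_j)\}}
```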
Metric Misspecification Due to Test Multidimensionality and Consequences for the Measurement of Growth — Xiangyi Liao, Daniel M. Bolt, Jee-Seon Kim (University of Wisconsin-Madison)
abstract
slides
Abstract: Educational research outcomes frequently rely
on an assumption that measurement metrics have interval-level
properties. While most investigators know enough to be suspicious of
interval-level claims, and in some cases even question their findings
in light of such suspicions, what is absent is an understanding of
the measurement conditions that create metric distortions. This study
investigates a potential cause of metric distortion, specifically
that some forms of test multidimensionality can create conditions that
render nonlinear distortions of score metrics. The work is motivated
by attempts to understand observations of the "Matthew effect" in
reading achievement, and the inconsistent observations of its presence
and correlation magnitude. Matthew effects refer to the frequently
observed positive correlations between baseline scores and score gains.
This study takes a psychometric perspective on this controversy and seeks
to demonstrate how the observation of such correlations can be driven by
subtle effects associated with test multidimensionality in the measures
used. Using the ECLS-K data, which has been used to demonstrate
a Matthew effect, I will show, in both simulation and real-data examples,
how an artificial positive correlation can be produced under
multidimensional conditions.
Lunch (12:00 — 12:45)
IRT and Exam Content (12:45 — 2:00)
Recovering Domain Scores From Bifactor IRT Models — Tony Lam, Sheng Zhang, and Scott Morris (Illinois Institute of Technology)
abstract
slides
Abstract: The Patient-Reported Outcomes Measurement Information System
(PROMIS) is an application of computerized adaptive testing that provides
practitioners and researchers with comprehensive measures of health
status across physical, mental, and general well-being domains. Given
the multidimensional nature of many health-related constructs, a bifactor
model that better fits the structure of responses could potentially
provide better estimates of trait scores. However, one drawback to the
bifactor model is that the specific factor scores differ in interpretation
from those of the more familiar hierarchical factor model. To address this
challenge, the current paper explores methods to estimate traditional
unidimensional subscale scores from the results of a bifactor model.
We utilize several different bifactor model scoring interpretation
methods across two studies: (1) in Study 1, we explore the effectiveness
of bifactor modeling of the short-form PROMIS depression, anxiety,
and anger scales; and (2) in Study 2, we examine the effectiveness of
bifactor modeling of the short-form PROMIS depression and anxiety scales,
the correlation between the two constructs, and their correlations with
external measures. Across both studies, we find that modest gains in
reliability can be made by applying bifactor modeling.
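For orientation, a common bifactor IRT parameterization is sketched below for a dichotomous item, with each item loading on a general factor and one specific (domain) factor; the notation is assumed here, and the talk's specific scoring methods may differ.

```latex
% Bifactor 2PL sketch: general factor theta_G plus one specific factor per item.
\operatorname{logit} P(X_{ij} = 1 \mid \theta_{Gi}, \theta_{s(j)i})
  = a_{Gj}\,\theta_{Gi} + a_{sj}\,\theta_{s(j)i} + d_j
```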
IRT Person Misfit and Person Reliability in Rating Scale Measures: The Role of Response Styles — Tongtong Zou and Daniel Bolt (University of Wisconsin-Madison)
abstract
slides
Abstract: One of the underused advantages afforded by item
response theory (IRT) is the ability to evaluate validity at the
respondent level. Person fit indices typically assume a fixed person
trait, and evaluate the response pattern in terms of its consistency
with the item parameters and the probability of responses at that fixed
person trait level. In contrast, person reliability views the person
trait as variable across items, and characterizes person reliability
in terms of the quantified within-person trait variability across item
responses. Although research suggests a strong inverse relationship
between these indices in the case of binary items, our empirical study
suggests much less consistency with rating scale items. The present
paper examines response style heterogeneity as a possible source of
this disagreement. It is speculated that highly valid response patterns
(e.g., ones that show consistent selection of item scores corresponding to a
common trait level) will on occasion use score categories less commonly
selected in the population, and thus show high person reliability but
also person misfit.
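For readers less familiar with person-fit indices, the sketch below computes the classical l_z statistic for binary items under a 2PL with known item parameters. It is a generic illustration, not the rating-scale procedure examined in the paper, and the example numbers are made up.

```python
# Classical l_z person-fit statistic for dichotomous items (illustrative).
import numpy as np

def lz_statistic(responses, theta, a, b):
    responses = np.asarray(responses, dtype=float)
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))           # 2PL probabilities
    l0 = np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    expected = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    variance = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
    return (l0 - expected) / np.sqrt(variance)           # large negative = misfit

# Hypothetical examinee who misses the easy items and answers the hard ones.
a = np.ones(6)
b = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
print(round(lz_statistic([0, 0, 0, 1, 1, 1], theta=0.0, a=a, b=b), 2))
```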
A New Item Design Approach to the Measurement of Mental Processes — Ozge Altintas (Purdue University)
abstract
slides
Abstract: Developments and changes in testing and testing
technologies in the 20th century have also led to changes in educational
measurement and classroom assessment practices. In the 21st century,
schools are expected to develop competencies in three different
competence areas (cognitive, intrapersonal, and interpersonal) using
task-based assessment practices in the classroom rather than
knowledge- or content-based items. It is necessary to design items that
measure learning in real-life situations, not items that measure content
in depth. In this study, a new item design approach will be presented
that aims to measure students' skills in the three competence areas
based on a situation. For this purpose, the learning outcome was first
defined, and then the skills needed to achieve this outcome were
determined. Then, items associated with all three competence areas
were written that prompt students to think about a realistic
situation. The underlying idea of this study is to disseminate an
approach that holistically measures students' capacity to use
the fundamental knowledge and skills learned in lessons in real-life
situations.
Using Machine Learning to Predict Bloom's Taxonomy Level for Certification Exam Items — Alan Mead and Chenxuan Zhou (Certiverse)
abstract
slides
paper
Abstract: This study fit a Naive Bayesian classifier to the
words of exam items to predict the Bloom's taxonomy level of the
items. We addressed five research questions, showing that reasonably
good prediction of Bloom's level was possible, although accuracy varied
across levels; the performance of a model distinguishing Level 1
from all other levels was quite good. Applying a model developed on
an IT certification exam domain to a more diverse set of items showed
poor performance, suggesting that models may generalize poorly. Finally,
we showed what features of items the classifier was using. Examples and
implications for practice are discussed.
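As a rough sketch of the general approach (with made-up items and Bloom's-level labels, not the study's data, features, or preprocessing), a Naive Bayes text classifier over item words can be set up as follows.

```python
# Illustrative Naive Bayes classifier over item text (hypothetical data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

items = [
    "Which command lists the files in a directory?",               # recall
    "A server fails during a backup. What should you do first?",   # application
]
bloom_levels = [1, 3]  # hypothetical labels

model = make_pipeline(CountVectorizer(lowercase=True, stop_words="english"),
                      MultinomialNB())
model.fit(items, bloom_levels)
print(model.predict(["Which port does HTTPS use by default?"]))
```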
Break (2:00 — 2:10)
Test Security (2:10 — 2:50)
Using Item Scores and Response Times in Person-Fit Assessment — Kylie Gorney (University of Wisconsin-Madison), Xiang Liu and Sandip Sinharay (Educational Testing Service)
abstract
slides
Abstract: Person-fit assessment is used to identify individuals
displaying unusual behavior with respect to an assumed measurement
model. The purpose of this paper is two-fold. First, we will introduce and
derive the asymptotic null distributions of two new person-fit statistics
(PFSs) for item scores and item response times (RTs). Second, we will
compare the performance of our new PFSs to that of several existing PFSs,
including established statistics for item scores and for item RTs,
and several Bayesian PFSs. Using detailed simulations and a real data
example, we show that the new PFSs are promising tools for detecting
aberrant behavior.
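For context, a generic response-time person-fit check (not one of the new statistics derived in the paper) can be formed under the lognormal RT model by summing squared standardized log-RT residuals; the item parameters alpha (time discrimination) and beta (time intensity) and the person speed tau are assumed known here.

```python
# Generic RT-based person-fit check under the lognormal RT model:
# log T_ij ~ N(beta_j - tau_i, 1 / alpha_j^2).
import numpy as np
from scipy.stats import chi2

def rt_person_fit(log_rt, alpha, beta, tau):
    """Sum of squared standardized log-RT residuals; large values flag
    response-time patterns that are unusually fast or slow for the model."""
    z = alpha * (np.asarray(log_rt) - (beta - tau))
    stat = float(np.sum(z ** 2))
    return stat, chi2.sf(stat, df=len(log_rt))  # approximate p-value
```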
Test Security and Online Proctored Test Administration — Kirk Becker (Pearson)
abstract
slides
Abstract: This paper examines indicators of test security
violations across a wide range of programs in professional, admissions,
and IT fields. High levels of response overlap are used as a potential
indicator of collusion to cheat on the exam, and rates are compared by
modality: in-person testing at test centers (TC) versus remote online
proctored (OP) testing. Following this, indicators of potential test security
violations are examined for a single large testing program over the
course of 14 months, during which the program went from exclusively
in-person TC testing to a mix of OP and TC testing. Test security
indicators include high response overlap, large numbers of fast correct
responses, large numbers of slow correct responses, large test-retest
score gains, unusually fast response times for passing candidates, and
measures of candidate-response misfit. These indicators are examined and
compared prior to and after the introduction of OP testing. In addition,
test-retest modality is examined for candidates who fail and retest
subsequent to the introduction of OP testing, with special attention
paid to test takers who change modality between the initial attempt
and the retest. These data allow us to understand whether indications of
content exposure increase with the introduction of OP testing, and whether
testing modalities affect potential score increases in a similar way.
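As a simple illustration of one indicator mentioned above, the sketch below computes a raw response-overlap rate between two candidates over their common items; operational collusion indices are considerably more elaborate, and the option strings here are hypothetical.

```python
# Raw response-overlap rate between two candidates (illustrative only).
import numpy as np

def response_overlap(options_a, options_b):
    """Proportion of common items on which two candidates chose the same option."""
    options_a, options_b = np.asarray(options_a), np.asarray(options_b)
    return float(np.mean(options_a == options_b))

print(response_overlap(list("ABCDACBD"), list("ABCDACBA")))  # 0.875
```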
Break (2:50 — 3:00)
Adaptive Testing (3:00 — 4:00)
On-the-fly Multistage Testing Design for PISA Assessment Incorporating Response Time — Xiuxiu Tang, Yi Zheng, Tong Wu, Kit-Tai Hau, Hua-Hua Chang (Purdue University)
abstract
slides
Abstract: Multistage adaptive testing (MST) has drawn
widespread interest over the past decades. This study explores the
potential use in PISA of a new adaptive test design named "on-the-fly
multistage adaptive testing" (OMST; Zheng & Chang, 2015), which
combines the merits of computerized adaptive testing (CAT) and MST and
offsets their limitations (e.g., alleviating the over-/under-estimation
problem, easing examinees' test anxiety by allowing them to review
and revise answers within a stage). The main difference between OMST and MST
is that modules in OMST are assembled on the fly to match the given
examinee's level, while modules in MST are all preassembled before
test administration. The OMST design in this study also incorporates
response time information of the items to improve measurement efficiency.
Based on the time an examinee spends on a specific item, we may obtain
additional information about both the characteristics of the item and
the latent trait of the examinee. Via simulations mimicking the PISA 2018
reading test, we compare the performance of our OMST design against the
MST design in terms of (1) measurement accuracy, (2) test time efficiency
and consistency, (3) item exposure rates, (4) constraint violations,
and (5) recalibration sample.
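To illustrate the general flavor of on-the-fly assembly that uses response time (a sketch under assumed 2PL parameters, not the authors' exact OMST algorithm), one could pick the unadministered items with the highest Fisher information per unit of expected response time at the interim theta.

```python
# Illustrative on-the-fly module assembly: information per expected second.
import numpy as np

def fisher_info_2pl(theta, a, b):
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1 - p)

def assemble_module(theta, a, b, expected_rt, size=5):
    """Return indices of `size` items maximizing information per second."""
    efficiency = fisher_info_2pl(theta, a, b) / expected_rt
    return np.argsort(-efficiency)[:size]

rng = np.random.default_rng(0)
a = rng.uniform(0.8, 2.0, 50)            # hypothetical bank: discriminations
b = rng.normal(0.0, 1.0, 50)             # difficulties
expected_rt = rng.uniform(30, 120, 50)   # expected seconds per item
print(assemble_module(theta=0.4, a=a, b=b, expected_rt=expected_rt))
```

In a real OMST design, exposure control, content constraints, and updated response-time estimates would also enter the assembly step.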
Extended Sequential Item Response Models for Multiple-Choice, Multiple-Attempt Test Items — Yikai Lu and Ying Cheng (Notre Dame)
abstract
slides
Abstract: The answer-until-correct (AUC) procedure allows
subjects to respond to a multiple-choice item until the correct answer
option is selected, which may have several advantages including enhancing
learning, increasing reliability, and discouraging guessing. In this
study, we extended prior research on sequential item response models
for multiple-choice, multiple-attempt test items (SIRT-MM) to include
a freely-estimated pseudo-guessing parameter and conducted a simulation
study to evaluate item and person parameter recovery of SIRT-MMe models
under various sample size and test length conditions, compared to the
equivalent 3PL model. Our simulation studies demonstrated that item
parameters can be estimated accurately and that person parameters are
estimated more accurately than with the corresponding 3PL models because
more information is gained from multiple attempts. Implications for research and practice
will be discussed.
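As a rough sketch (not necessarily the exact SIRT-MMe parameterization; the notation here is assumed), a sequential model for answer-until-correct data can specify, for each attempt k on item j, the conditional probability of success given that earlier attempts failed, with a pseudo-guessing parameter:

```latex
% Sequential step with a pseudo-guessing parameter (illustrative form):
P(X_{ijk} = 1 \mid X_{ij1} = \cdots = X_{ij,k-1} = 0,\ \theta_i)
  = c_{jk} + (1 - c_{jk}) \frac{1}{1 + \exp\{-a_j(\theta_i - b_{jk})\}}
```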
Quantifying the CAT Potential of an Item Bank — Mike Bass (Northwestern) and Scott Morris (Illinois Institute of Technology)
abstract
slides
Abstract: It would be useful to have a statistic that
characterizes the potential of an item bank to benefit from CAT
administration. In this study, we explore a variety of metrics based
on item information to assess the potential efficiency of CAT relative
to a fixed-item short form. These statistics can be computed from
IRT item parameters, and therefore allow assessment of a bank prior to
implementation of the CAT. These statistics are discussed in the context
of two PROMIS item banks (depression and physical function).
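As one hypothetical way to operationalize the idea (the talk's actual metrics may differ), the sketch below compares the information an ideal CAT could assemble from the bank at a given theta with that of an arbitrary fixed short form of the same length.

```python
# Illustrative "CAT potential" ratio computed from IRT item parameters.
import numpy as np

def info_2pl(theta, a, b):
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1 - p)

def cat_potential(theta, a, b, form_length=8):
    """Summed information of the best `form_length` items at theta (an ideal
    CAT's pick) divided by that of a fixed short form (here, the bank's
    first `form_length` items)."""
    info = info_2pl(theta, a, b)
    best = np.sort(info)[::-1][:form_length].sum()
    fixed = info[:form_length].sum()
    return best / fixed

rng = np.random.default_rng(1)
a, b = rng.uniform(0.7, 2.2, 60), rng.normal(0, 1, 60)  # hypothetical bank
print(round(cat_potential(0.0, a, b), 2))
```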
Closing comments (4:00)
Questions about the seminar may be directed to Alan Mead, Scott Morris,
or Kirk Becker. We hope you will join us.