Ideas in Testing Research Seminar Schedule, November 15, 2024
Coffee & Networking (9:00 — 9:50)
Welcome and Introduction (9:50 — 10:00)
AI and item generation (10:00 — 10:55)
Assessing ChatGPT's Proficiency in Generating Accurate Questions and Answers for Statistics Education — Meredith Sanders, Nancy Le, & Alison Cheng (University of Notre Dame)
abstract
slides
Abstract: The rise of Artificial Intelligence (AI) technology has proved
to be advantageous in many regards, especially within the field of
to be advantageous in many regards, especially within the field of
education. Students and teachers alike are excited by the potential of
certain systems, such as ChatGPT, to enhance the learning and teaching
experience, from automatically grading essays to acting as a virtual
tutor. For example, one recent study focuses on ChatGPT's ability to
create lesson plans, citing generative AI as a resource to efficiently
reduce this time-consuming aspect of teaching (Powell & Courchesne,
2024). However, results showed that while ChatGPT provided an accurate
view of the lesson plan model, iterations of the generated lesson
plan included "questionable components" and omitted key details,
highlighting the growing need to assess the program's accuracy. One
area in which this accuracy has yet to be explored is ChatGPT's
ability to generate questions and solutions across topic areas.
Artificial Intelligence and Testing — Kirk Becker & Paul Jones (Pearson)
abstract
slides
An agenda and checklist for psychometric research using generative AI — Alan Mead (Certiverse) & Chenxuan Zhou (Talent Algorithms)
abstract
slides
Abstract: Since 2022, generative AI (GenAI) has emerged as a
valuable tool for researchers, demonstrating broad applicability across
various domains. This paper focuses on the potential of GenAI within
psychometric practice and research, specifically in generating exam items
and psychological scales. Through a review of the existing literature,
it outlines the current state of AI-driven psychometric applications
and proposes a structured research agenda and checklist to guide future
inquiry in this field.
Comparison of Zero-Shot, RAG, and Agentic Methods of Generating Items Using AI — Alan Mead (Certiverse) & Chenxuan Zhou (Talent Algorithms)
abstract
slides
Abstract: Recent research has demonstrated that large
language models, like GPT-4 Omni (GPT-4o), can create exam questions for
certification/knowledge exams (Mead & Zhou, 2024) and has explored
how different types of prompts affect the results (Jones & Becker,
2024). So far, this approach works best as an assistant to subject matter
experts (SMEs), because the generated items still require review and
editing. This presentation will compare item generation using zero-shot
prompting (the current default) with two newer methods: Retrieval-Augmented
Generation (RAG) and an agentic approach.
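For readers unfamiliar with the distinction, the sketch below contrasts a zero-shot prompt with a retrieval-augmented one; the `call_llm` stub, the example objective, and the keyword retriever are invented placeholders, not the presenters' implementation. An agentic approach would add a further critic-style pass that reviews and revises the draft item before returning it.

```python
# Hypothetical sketch contrasting zero-shot and retrieval-augmented (RAG) item
# generation. `call_llm` is a stand-in for whatever LLM API is used (e.g., a
# GPT-4o chat completion); it is a placeholder, not a real client.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call."""
    return f"[model output for a prompt of {len(prompt)} characters]"

def zero_shot_item(objective: str) -> str:
    # Zero-shot: the prompt carries only the exam objective and format rules.
    prompt = (
        "Write one four-option multiple-choice item assessing the objective "
        "below. Mark the key and give a one-sentence rationale.\n"
        f"Objective: {objective}"
    )
    return call_llm(prompt)

def rag_item(objective: str, corpus: list[str]) -> str:
    # RAG: retrieve passages related to the objective and ground the prompt
    # in them. Naive keyword overlap stands in for an embedding-based retriever.
    keywords = set(objective.lower().split())
    ranked = sorted(corpus, key=lambda doc: -len(keywords & set(doc.lower().split())))
    context = "\n".join(ranked[:2])
    prompt = (
        "Using ONLY the reference material below, write one four-option "
        "multiple-choice item assessing the objective. Mark the key and cite "
        "the supporting sentence.\n"
        f"Reference material:\n{context}\nObjective: {objective}"
    )
    return call_llm(prompt)

if __name__ == "__main__":
    docs = ["Normalization reduces data redundancy in relational databases.",
            "An index speeds up lookups at the cost of slower writes."]
    print(zero_shot_item("Decide when to denormalize a database schema"))
    print(rag_item("Decide when to denormalize a database schema", docs))
```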
Break (10:55 — 11:05)
CAT and item formats (11:05 — 12:00)
Pros and cons of compositional forced choice measurement — Austin Thielges & Bo Zhang (University of Illinois Urbana-Champaign)
abstract
slides
Abstract: The traditional forced choice format has shown
potential to diminish the response biases and faking to
which Likert scales are susceptible. However, it suffers from reduced
reliability compared to its Likert counterpart. In addition, respondent
reactions are less favorable. This is likely related to individuals
being forced to make judgments between two or more statements, rather
than simply rating agreement with a single statement. Compositional
forced choice format has been proposed as an alternative to traditional
forced choice format. In compositional forced choice, individuals
have a fixed number of 'points' to distribute across compared
statements. Compositional forced choice has shown promise as a way to
maintain the benefits of traditional forced choice while allowing
individuals to express variability in the extent to which they think
one statement is more or less like them than another. This maintains
the relative indication of personality that is key to forced choice
measurement without reverting to Likert scales and reintroducing
response biases and the potential for individuals to fake. In the
present study, a Prolific sample of 377 individuals responded to Likert
and compositional forced choice scales consisting of statements from
the Big Five Inventory-2 (Soto & John, 2016). Participants were
assigned to one of two compositional forced choice conditions. In
one condition, participants were given 20 points to distribute across
three statements. In the other, participants were given 100 points to
distribute across the statements. Participants responded to 60 statements
in Likert format and to 20 blocks of three statements in their respective
conditions. The statements from the Likert format are identical to
those used to create the triplet blocks. The blocks are identical
across the two compositional forced choice conditions. We compare
the psychometric properties of the Likert scale with the 20-point and
100-point compositional forced choice scale conditions, discussing model
fit, reliability, convergent validity, and criterion-related validity. In
addition, we examine the favorability of respondent reactions to the
compositional forced choice formats. Finally, differences in response
times across the different formats are considered.
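To make the response format concrete, here is a toy illustration (invented numbers, not study data) of how triplet-block allocations from the 20-point and 100-point conditions can be rescaled to proportions for comparison:

```python
import numpy as np

# Invented example: one respondent's answers to two triplet blocks. In the
# 20-point condition the three allocations in a block sum to 20; in the
# 100-point condition they sum to 100.
responses_20pt = np.array([
    [12, 5, 3],    # block 1: points given to statements A, B, C
    [7, 7, 6],     # block 2
])
responses_100pt = np.array([
    [60, 25, 15],
    [35, 35, 30],
])

def to_proportions(block_scores):
    """Rescale each block so its allocations sum to 1, giving a compositional
    representation that makes the two point budgets directly comparable."""
    return block_scores / block_scores.sum(axis=1, keepdims=True)

print(to_proportions(responses_20pt))
print(to_proportions(responses_100pt))
# A coarser budget (20 points) allows fewer distinct proportions per block,
# which is one plausible source of psychometric differences between conditions.
```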
Two-phase Content-balancing CD-CAT Online Item Calibration — Jing Huang, Yuxiao Zhang, & Hua-Hua Chang (Purdue University)
abstract
slides
Abstract: Online calibration has proven to be an effective
method for the continuous management of cognitive diagnosis computerized
adaptive testing (CD-CAT), ensuring that new items' parameters are on
the same scale as the previously calibrated items. However, few studies
have developed item selection criteria for CD-CAT online calibration that
take content balancing into account, potentially leading to uneven and
inefficient item pool usage. The aim of this study is to investigate the
effectiveness of incorporating two-phase content-balancing constraints
into the CD-CAT online calibration procedure.
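As a rough illustration of what a content-balancing constraint does during pretest item assignment (this is not the authors' two-phase criterion), one simple rule is to pick the most under-represented content area first and then the best candidate item within it:

```python
# Hypothetical sketch of a content-balanced pretest item assignment step. It
# only illustrates how a content constraint can steer selection: choose the
# content area furthest below its target exposure, then the most suitable
# new item within that area.

target_props = {"A": 0.4, "B": 0.35, "C": 0.25}   # desired content proportions
administered = {"A": 10, "B": 12, "C": 3}          # pretest items given so far

# Candidate new items: (item_id, content_area, suitability index for this
# examinee), where suitability might come from provisional parameter estimates.
candidates = [
    ("n101", "A", 0.62), ("n102", "A", 0.48),
    ("n201", "B", 0.71), ("n301", "C", 0.55), ("n302", "C", 0.40),
]

total = sum(administered.values())
# Deficit = target proportion minus observed proportion; largest deficit wins.
deficits = {k: target_props[k] - administered[k] / total for k in target_props}
area = max(deficits, key=deficits.get)

in_area = [c for c in candidates if c[1] == area]
chosen = max(in_area, key=lambda c: c[2])
print(f"Most under-represented area: {area}; assign pretest item {chosen[0]}")
```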
Polytomous Item Sets in CAT: Novel Representations for Online Calibration — Zhuoran Wang & William Muntean (National Council of State Boards of Nursing)
abstract
slides
Abstract: In Computerized Adaptive Testing (CAT), online
calibration improves pretesting efficiency. However, there has been little
discussion of how to conduct online calibration for item sets. This study
explored multiple representations of the Fisher information matrix for
item sets, which were used to adaptively assign pretesting item sets
to candidates.
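For orientation, a minimal sketch of one simple representation, the sum of generalized partial credit item informations across a set at a provisional theta, using invented item parameters rather than operational data:

```python
import numpy as np

# Sketch (not the presenters' formulation): Fisher information for a set of
# generalized partial credit (GPC) items at a provisional theta, with the set
# summarized by the sum of item informations, one simple scalar representation.

def gpc_probs(theta, a, b):
    """Category probabilities for a GPC item with slope a and step parameters b."""
    steps = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(b)))))
    expo = np.exp(steps - steps.max())          # stabilize the exponentials
    return expo / expo.sum()

def gpc_info(theta, a, b):
    """Item information: a^2 times the variance of the category score."""
    p = gpc_probs(theta, a, b)
    k = np.arange(len(p))
    return a**2 * (np.sum(k**2 * p) - np.sum(k * p)**2)

theta = 0.3
item_set = [  # (slope, step parameters) for three polytomous items in one set
    (1.2, [-0.5, 0.4]),
    (0.9, [-1.0, 0.0, 1.1]),
    (1.5, [0.2, 0.9]),
]
set_info = sum(gpc_info(theta, a, b) for a, b in item_set)
print(f"Total set information at theta={theta}: {set_info:.3f}")
```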
Lunch (12:00 — 1:00)
Group differences, discrimination, and the law (1:00 — 1:55)
Finding Words Associated with DIF: Predicting and Describing Differential Item Functioning using Large Language Models — Hotaka Maeda (Smarter Balanced) & Yikai Lu (University of Notre Dame)
abstract
slides
Abstract: Large language models (LLMs) are commonly recognized
as "black boxes", but the explainability of these models has improved
recently (Kokhlikyan et al., 2020). If LLMs can be used to predict and
describe differential item functioning (DIF) analysis results, they may
be able to help review DIF results, or screen potentially biased and
unfair items during the item-writing process. Therefore, the purpose
of this study was to train an encoder LLM that can predict DIF between
female and male examinees from the item text, and identify the specific
words associated with such DIF.
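As a simplified stand-in for the encoder-LLM-plus-attribution workflow, the sketch below trains a bag-of-words classifier on invented item texts and reads off the word weights; it illustrates the interpretive goal, not the authors' method:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Simplified stand-in: the presenters fine-tune an encoder LLM and use
# attribution methods to find words associated with DIF. Here a bag-of-words
# logistic regression plays the same interpretive role: its coefficients show
# which words push an item toward a "DIF flagged" prediction. The item texts
# and DIF flags below are invented toy data.

item_texts = [
    "A football team scores an average of 24 points per game ...",
    "A recipe calls for 3 cups of flour per dozen cookies ...",
    "A quarterback completes 60 percent of passes ...",
    "A survey of students measured average daily reading time ...",
]
dif_flagged = [1, 0, 1, 0]   # 1 = item showed female-male DIF in calibration

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(item_texts)
clf = LogisticRegression().fit(X, dif_flagged)

# Words with the largest positive coefficients are most associated with DIF flags.
weights = sorted(zip(vec.get_feature_names_out(), clf.coef_[0]),
                 key=lambda t: -t[1])[:5]
print(weights)
```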
An Update on AI Discrimination Laws — Scott Morris (Illinois Institute of Technology)
abstract
slides
Modeling Diversity-Validity Tradeoffs: A Comparison of Pareto- and Multi-Penalty Optimization — Hudson Pfister, Amanda Neuman, Tony Lam, & Scott Morris (Illinois Institute of Technology)
abstract
slides
Abstract: Multiple objectives need to be balanced when
designing an employee selection system. The diversity-validity dilemma
occurs because organizations seek both highly qualified applicants and a
diverse workforce, yet some of the best predictors of job performance also
produce large subgroup differences (Ployhart & Holtz, 2008). Advances
in data analysis techniques and machine learning are creating innovative
approaches to addressing this dilemma. In this study, we explore two
methods for creating predictor composites that balance validity and
diversity concerns: Pareto-optimization and Multi-penalty Optimization.
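To illustrate the kind of tradeoff curve involved (with invented correlations, and a brute-force weight sweep rather than either of the presenters' algorithms):

```python
import numpy as np

# Illustrative sketch: trace the diversity-validity tradeoff for a two-predictor
# composite by sweeping the weight on predictor 1 and keeping the weightings
# that are Pareto-optimal (no other weighting is better on both composite
# validity and subgroup difference). Correlations and d values are invented.

v = np.array([0.50, 0.30])          # predictor-criterion validities
d = np.array([0.80, 0.20])          # standardized subgroup differences (lower is better)
Rx = np.array([[1.0, 0.3],          # predictor intercorrelation matrix
               [0.3, 1.0]])

results = []
for w1 in np.linspace(0, 1, 101):
    w = np.array([w1, 1 - w1])
    sd = np.sqrt(w @ Rx @ w)        # standard deviation of the composite
    results.append((w1, (w @ v) / sd, (w @ d) / sd))  # composite validity and d

# Pareto-optimal: no other weighting has higher validity AND lower subgroup d.
pareto = [r for r in results
          if not any(o[1] > r[1] and o[2] < r[2] for o in results)]
for w1, val, dd in pareto[::20]:
    print(f"w1={w1:.2f}  validity={val:.3f}  subgroup d={dd:.3f}")
```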
Break (1:55 — 2:00)
Test design/test constructs (2:00 — 2:55)
The Work Disability Functional Assessment Battery (WD-FAB) — Michael Bass (Northwestern University)
abstract
slides
Abstract: In the United States, national disability programs are
challenged to adjudicate millions of work disability claims each year in
a timely and accurate manner. The Work Disability Functional Assessment
Battery (WD-FAB) was developed to provide work disability agencies
and other interested parties a comprehensive and efficient approach to
profiling a person's function related to their ability to work. The WD-FAB
is constructed using contemporary item response theory methods to yield
an instrument that can be administered efficiently using computerized
adaptive testing techniques. The WD-FAB could provide relevant information
about work-related functioning for a wide range of clinical and policy
applications. CAT characteristics and implementation of the WD-FAB
assessment will be discussed and demonstrated.
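As background on the CAT machinery, a minimal sketch of maximum-information item selection with an invented dichotomous item bank (the WD-FAB itself uses calibrated polytomous IRT item banks):

```python
import numpy as np

# Generic CAT sketch, not WD-FAB code: administer the item with maximum Fisher
# information at the provisional estimate of the examinee's functional level.
# Item parameters are invented, and a 2PL model is used for simplicity.

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at level theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

bank = {  # item_id: (discrimination a, difficulty b)
    "PF01": (1.6, -0.4), "PF02": (1.1, 0.3), "PF03": (2.0, 1.2), "PF04": (0.9, -1.5),
}
administered = {"PF04"}
theta_hat = 0.1   # provisional estimate after the items answered so far

next_item = max((i for i in bank if i not in administered),
                key=lambda i: info_2pl(theta_hat, *bank[i]))
print("Next item to administer:", next_item)
```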
Faking Detection Using Item-level Machine Learning —
Chen Tang (American University), Bo Zhang (University of Illinois Urbana-Champaign), Zheting Lin (CCCC Highway Consultants Co., Ltd.), Jeromy Anglim (Deakin University), & Jian Li (Beijing Normal University)
abstract
Abstract: Past research has shown that, compared to those based on scale scores,
machine learning models based on personality item scores can improve
the accuracy of faking detection. However, little is known regarding
conditions under which these models are superior in identifying fakers,
and how they can be applied in real-world settings. Moreover,
the ground truth (who fakes and who does not fake) is often unknown in
previous studies as all of them were based on empirical data, rendering
the conclusions potentially confounded. In this study, we first replicated
previous findings in two diverse datasets and showed that machine learning
models based on personality item scores outperformed those based on facet
and domain scores (Study 1) at differentiating respondents in
low-stakes vs. high-stakes conditions. In Study 2, we used Monte Carlo
simulations based on a high-fidelity data generation model to examine
faking detection accuracy of machine learning models at different levels
of (a) faking effect size, (b) sample size, and (c) scale length (number
of items). Results showed that item-level models were effective even
when the sample size was small and the number of items was limited. Based on
these findings, we further demonstrated that item-level faking detection
models are cost-effective as they do not require large training datasets. More
interestingly, item-level models could facilitate early faking detection
and intervention via a special procedure we termed sequential predictive
faking detection. Overall, this paper showed that machine learning based
on personality item scores is a desirable method for faking detection.
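A toy version of the item-level versus scale-level comparison, with a deliberately simplistic data generator rather than the authors' high-fidelity simulation, looks like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy comparison in the spirit of Studies 1 and 2 (NOT the authors' data
# generation model): faking shifts some items more than others, so an
# item-level classifier can exploit a pattern that an equally weighted
# scale score partly averages away.
rng = np.random.default_rng(0)
n, n_items = 400, 20
shift = rng.uniform(-0.3, 0.8, size=n_items)           # item-specific faking effects
honest = rng.normal(3.0, 1.0, size=(n, n_items))
faking = rng.normal(3.0, 1.0, size=(n, n_items)) + shift

X_items = np.vstack([honest, faking]).clip(1, 5)       # item scores on a 1-5 scale
y = np.array([0] * n + [1] * n)                        # 1 = high-stakes condition
X_scale = X_items.mean(axis=1, keepdims=True)          # single domain (scale) score

item_auc = cross_val_score(LogisticRegression(max_iter=1000), X_items, y,
                           cv=5, scoring="roc_auc").mean()
scale_auc = cross_val_score(LogisticRegression(max_iter=1000), X_scale, y,
                            cv=5, scoring="roc_auc").mean()
print(f"Item-level AUC:  {item_auc:.3f}")
print(f"Scale-level AUC: {scale_auc:.3f}")
```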
Assessing Engagement in Academic Contexts: A Multi-Method Validation — Nancy Le & Alison Cheng (University of Notre Dame)
abstract
Abstract: Engagement in academic contexts significantly
influences students' learning outcomes and performance. This study
aimed to validate a self-report measure of engagement by examining its
relationships with behavioral indicators of engagement as well as with
self-report and behavioral indicators of procrastination, using a
multitrait-multimethod validation approach. The sample consisted of
students between the ages of
14 and 18 years (Mean age = 16.68 years, SD age = 0.90) enrolled in high
school Advanced Placement (AP) Statistics courses in the Midwestern
United States (N = 804) during the 2017-2018 and 2018-2019 academic
years. Over the course of two academic years, these students participated
in an online assessment system, which provided them with personalized
performance reports.
Break (2:55 — 3:00)
Working with open ended responses (3:00 — 3:55)
Bridging Constructs Underlying Quantitative and Textual Data: A Joint Factor-Topic Model — Yuxiao Zhang, David Arthur, Yukiko Maeda, & Hua-Hua Chang (Purdue University)
abstract
Abstract: Accurately identifying latent constructs from
observed behavior is a crucial task in psychological and educational
measurement. Traditional methods have primarily focused on quantitative
data. However, psychological constructs often emerge in textual data, such
as attitudes or interests expressed in open-ended responses. Consequently,
there is an increasing need for methodologies that integrate text-derived
and scale-based constructs. This study proposes a factor-topic model that
integrates the two types of constructs into a unified framework. Employing
factor analysis and topic modeling as measurement components, the joint
model is estimated via a Bayesian approach using Stan. This method
broadens the scope of traditional psychometrics and enables deeper
insights into complex datasets.
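The sketch below fits the two measurement components separately on invented toy data, factor analysis for the scale items and topic modeling for the text, to show what the joint Stan model brings together:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# The presenters estimate one joint Bayesian factor-topic model in Stan; this
# sketch only fits the two measurement components separately, on invented toy
# data, to show what each side contributes before they are combined.
rng = np.random.default_rng(1)
likert = rng.integers(1, 6, size=(60, 8)).astype(float)   # 60 respondents x 8 items
texts = [
    "I enjoy statistics and want a data science career",
    "math is stressful but projects with real data are fun",
    "I am interested in teaching and working with students",
] * 20                                                     # one response per respondent

factor_scores = FactorAnalysis(n_components=2, random_state=0).fit_transform(likert)

counts = CountVectorizer(stop_words="english").fit_transform(texts)
topic_props = LatentDirichletAllocation(n_components=2,
                                        random_state=0).fit_transform(counts)

# A joint model lets the scale-based factors and text-derived topics inform each
# other during estimation; here we can only inspect their marginal association.
print(np.corrcoef(factor_scores[:, 0], topic_props[:, 0])[0, 1])
```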
Head-to-Head: Comparing AI versus Human Categorizations of Open-Ended Survey Responses — Nicholas Williams, Lidia Martinez, & Tara McNaughton (American Osteopathic Association)
abstract
slides
Abstract: Coding open-ended survey data for qualitative
analysis can be time intensive for staff. Our case study explores the
potential of utilizing an AI model to parse this type of data into useful
category tags. Capitalizing on an AI model that approximates trained human
raters on a repeatable task would yield considerable time savings. We
will present a case study where we compared the AI model's performance
on this task to three human raters performing the same task. We hope
that the case study results will help inform others about potential
trade-offs of using AI for similar categorization tasks.
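A minimal sketch of the agreement comparison, using invented category tags and Cohen's kappa as the agreement index (the presenters' actual metrics may differ):

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Toy illustration with invented labels: agreement of the AI model with each
# human rater versus agreement among the human raters, on the same responses.
ai     = ["cost", "access", "quality", "cost", "other", "access", "quality", "cost"]
rater1 = ["cost", "access", "quality", "cost", "other", "access", "other",   "cost"]
rater2 = ["cost", "other",  "quality", "cost", "other", "access", "quality", "cost"]
rater3 = ["cost", "access", "quality", "access", "other", "access", "quality", "cost"]
humans = {"rater1": rater1, "rater2": rater2, "rater3": rater3}

for name, tags in humans.items():
    print(f"AI vs {name}: kappa = {cohen_kappa_score(ai, tags):.2f}")
for (n1, t1), (n2, t2) in combinations(humans.items(), 2):
    print(f"{n1} vs {n2}: kappa = {cohen_kappa_score(t1, t2):.2f}")
```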
Assessing Topic Recovery in Open-Ended Responses: The Effects of Sample Size, Document Length, and Similarity Threshold — Xiyu Wang, Yukiko Maeda, & Yuxiao Zhang (Purdue University)
abstract
slides
Abstract: Open-ended (OE) questions usually serve as a critical
component of survey research, allowing respondents to express their
opinions, emotions, and knowledge with fewer restrictions. Despite
the benefits of OE questions, manual coding is time-consuming,
requires qualitative skills, and raises concerns about inconsistency,
emphasizing the need for quantitative tools to improve efficiency and
consistency. Topic modeling, a computerized method, categorizes a corpus
of texts into meaningful coding categories or “topics.” Recent
studies have applied this technique to analyze OE question responses,
yet few have explored how factors like sample size and response length
impact the consistency of topic recovery. This study aims to address
this gap by examining how the sample size and response length of OE
responses influence the consistency of topic recovery, and how varying
the threshold of cosine similarity affects the findings.
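One simple way to operationalize topic recovery with a cosine-similarity threshold (a sketch with invented topic-word vectors, not the authors' exact procedure):

```python
import numpy as np

# Sketch of a topic-recovery check: a reference topic counts as recovered when
# some estimated topic's word-probability vector has cosine similarity with it
# above a chosen threshold. The topic-word vectors here are invented.

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

vocab_size, threshold = 30, 0.8
rng = np.random.default_rng(2)
true_topics = rng.dirichlet(np.ones(vocab_size) * 0.1, size=3)          # reference topics
estimated = true_topics + rng.normal(0, 0.01, size=true_topics.shape)   # noisy recovery
estimated = np.clip(estimated, 1e-6, None)
estimated /= estimated.sum(axis=1, keepdims=True)

recovered = 0
for t in true_topics:
    best = max(cosine(t, e) for e in estimated)
    recovered += best >= threshold
print(f"{recovered} of {len(true_topics)} topics recovered at threshold {threshold}")
# Raising the threshold, or adding noise (e.g., shorter documents, smaller
# samples), lowers the recovery rate.
```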
Closing comments (3:55)
Questions about the seminar may be directed to Alan Mead, Scott Morris, or Kirk Becker. We hope you will join us.