Ideas in Testing Research Seminar Schedule, November 10, 2023
Coffee & Networking (9:00 — 9:50)
Welcome and Introduction (9:50 — 10:00)
Applications of NLP (10:00 — 10:50)
Tracking Text and Item Revisions Across Examination Forms for Assessment Programs — Nicholas Williams, MEd, and Eden Racket (American Osteopathic Association)
Abstract: Tracking item revisions can be a tricky business. Some
item banking software solutions incorporate no versioning, others require
a manual process to track item revisions, and still others are overly
sensitive in automatically tracking new revisions. Our current item
banking system is an example of the latter, as metadata changes give an
item a new item version number. A full revision history is available for
each item, but manually checking individual changes against revision dates
for scoring purposes is a time-consuming process. As our primary interest
is whether an item's text, key, and/or associated assets have changed,
we have developed a custom solution utilizing a new process in which we
back up item banks and published examinations in a processing-friendly
data format. Afterwards, a custom tool developed with Python and the
pandas library is run to create a report of items that have potentially
been revised. Our new approach has led to increased efficiency in
determining the number of items that have undergone significant changes
since the last exam administration. It is our hope that our implementation
may inspire others to consider similar methods of tackling this problem
for their items.
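The kind of comparison the authors describe can be prototyped in a few
lines of pandas. The sketch below is not their tool; the file and column
names (item_id, stem_text, key, asset_hash) are assumptions about how the
bank snapshots might be exported.

    # Sketch: flag items whose text, key, or associated assets changed between
    # two item-bank snapshots exported as CSV files (column names are assumed).
    import pandas as pd

    old = pd.read_csv("bank_2022.csv")   # snapshot at the last administration
    new = pd.read_csv("bank_2023.csv")   # current snapshot

    merged = old.merge(new, on="item_id", suffixes=("_old", "_new"))
    watched = ["stem_text", "key", "asset_hash"]

    # An item is "potentially revised" if any watched field differs between snapshots
    diffs = pd.DataFrame({c: merged[f"{c}_old"] != merged[f"{c}_new"] for c in watched})
    merged[diffs.any(axis=1)].to_csv("potentially_revised_items.csv", index=False)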
Analyzing open-responses in a post-exam survey using natural language processing methods — Yunyi Long and Xia Mao (NBOME)
Abstract: Open-ended questions in the survey after licensure
examinations can capture candidates' insightful feedback and thus
facilitate improvement in future examinations. However, the time
and labor costs associated with open-ended responses, as well as the
inevitable human biases, can impede the wide usage of this type of
question. With the development of Natural Language Process (NLP),
qualitative data can be processed effectively, offering alternatives
for human coding. The present study compares two NLP methods with human
categorization to assess the effectiveness of these computer-assisted tools
in analyzing candidates' open feedback. Preliminary results revealed
that both NLP methods cleaned out noisy data effectively. They could also
classify hundreds of open-responses into five to six major categories.
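As a rough illustration of this kind of computer-assisted workflow (the
study's specific NLP methods are not named above), the sketch below drops
very short or empty responses and groups the remainder into roughly six
categories; the file and column names are assumptions.

    # Sketch: filter noisy responses, then cluster the rest into ~6 categories.
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    survey = pd.read_csv("post_exam_survey.csv")            # assumed file name
    text = survey["open_response"].fillna("").str.strip()   # assumed column name
    text = text[text.str.len() >= 10]                       # crude noise filter

    X = TfidfVectorizer(stop_words="english", max_features=2000).fit_transform(text)
    labels = KMeans(n_clusters=6, random_state=0, n_init=10).fit_predict(X)
    print(pd.Series(labels).value_counts())                 # responses per category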
Analyzing Constructed Responses in Educational Survey with LDA: A Demonstration on Students' Career Aspiration — Yuxiao Zhang, Nielsen Pereira, David Arthur, and Hua Hua Chang (Purdue University)
Abstract: This study explores the potential of Latent Dirichlet Allocation (LDA; Blei et al.,
2003) as a tool for analyzing constructed responses in educational surveys. The traditional
manual coding approach can be labor-intensive and time-consuming, especially when dealing
with large sample sizes and long textual responses. LDA, a statistical algorithm from Natural
Language Processing, provides an efficient solution for discovering major topics within a
collection of documents. In this study, using students' constructed responses regarding their
career aspirations, we demonstrated the utility of LDA in obtaining insights from these
responses and transforming textual data into numerical variables that can be used in
subsequent statistical analyses.
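A minimal sketch of the LDA step, using scikit-learn rather than any
particular package from the study; the input column name and the number of
topics are assumptions.

    # Sketch: fit LDA to constructed responses and keep the document-topic
    # proportions as numerical variables for later statistical analyses.
    import pandas as pd
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    responses = pd.read_csv("career_aspirations.csv")["career_response"].dropna()

    vec = CountVectorizer(stop_words="english", min_df=2)
    dtm = vec.fit_transform(responses)                    # document-term matrix

    lda = LatentDirichletAllocation(n_components=5, random_state=0)
    theta = lda.fit_transform(dtm)                        # document-topic proportions

    vocab = vec.get_feature_names_out()
    for k, row in enumerate(lda.components_):             # print top words per topic
        top = row.argsort()[-10:][::-1]
        print(f"Topic {k}:", ", ".join(vocab[i] for i in top))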
Break (10:50 — 11:00)
Cognitive assessments (11:00 — 11:50)
Prediction of Cognitive Impairment Using Higher Order Item Response Theory and Machine Learning — Lihua Yao (Northwestern)
Abstract: Early detection of cognitive impairment (CI)
is very important for older adults. The MyCog assessment uses two
well-validated iPad-based measures from the NIH Toolbox for the Assessment
of Neurological Behavior and Function Cognitive Battery (NIHTB-CB) that
address two of the first domains to show CI: Picture Sequence Memory
(PSM), which assesses episodic memory, and Dimensional Change Card Sort
(DCCS), which measures cognitive flexibility. The purpose of this study
was to explore machine learning models for better prediction
of CI. Our talk will discuss the methodological approach.
Our results suggest that relying on a single, simple cut point for a
composite score, regardless of how well it is derived, may not yield
optimal outcomes. Instead, employing machine learning models that utilize
scores derived from IRT and encompass features such as age can lead to
more effective prediction models.
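A minimal sketch of the modeling comparison described above, under assumed
data and column names (psm_theta, dccs_theta, age, ci_label); the cut point
and classifier choice are illustrative, not the authors' pipeline.

    # Sketch: single composite cut point vs. a machine learning model that uses
    # IRT-derived scores plus age (data and column names are assumed).
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("mycog_scores.csv")
    y = df["ci_label"]                                    # 1 = cognitive impairment

    # Baseline: one cut point on a simple composite of the two IRT scores
    composite = df[["psm_theta", "dccs_theta"]].mean(axis=1)
    cut_pred = (composite < composite.quantile(0.25)).astype(int)   # illustrative cut
    print("cut-point accuracy:", (cut_pred == y).mean())

    # ML alternative: IRT scores and age as features, evaluated by cross-validation
    X = df[["psm_theta", "dccs_theta", "age"]]
    model = GradientBoostingClassifier(random_state=0)
    print("ML accuracy:", cross_val_score(model, X, y, cv=5).mean())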
Introducing the newly normed NIH Toolbox Cognition Battery (V3) — Emily Ho, Aaron Kaat, Erica LaForte, Amy Giella, Julie Hook, and Richard Gershon (Northwestern)
Abstract: This talk describes the results of a large-scale
norming study of the NIH Toolbox Cognition Battery (V3), including
measures of convergent validity, divergent validity, and other relevant
psychometric indices. We collected a nationally representative sample
of N = 3848 US participants. A subset of 200 participants completed a
retest 7 to 21 days later. Measures included two newly developed tests
(Speeded Matching and Visual Reasoning) and convergent validity measures.
Our talk will describe the sampling and norming procedure. We found that
growth curves for each of the measures follow hypothesized trajectories
across the life span. Confirmatory factor analyses showed that a two-factor
model separating fluid and crystallized intelligence fit well, and
the convergent validity analyses demonstrated good convergence with
established gold standards. We conclude that the NIH Toolbox is a
multidimensional set of assessments meant to be a "common currency"
for a diverse set of study designs and research settings. The updated NIH
Toolbox V3 incorporates new scientific developments in neuropsychology
and psychometrics and includes two validated measures of processing speed
and non-verbal reasoning, respectively. There is good convergence with
established gold standards and a robust factor structure that aligns
with a two-factor model of cognition.
Explaining Performance Gaps with Problem-Solving Process Data via Latent Class Mediation Analysis — Sunbeom Kwon and Susu Zhang (University of Illinois, Urbana-Champaign)
Abstract: Computer-based assessment platforms have allowed for
the collection of problem-solving process data, offering insights
into examinees’ problem-solving strategies. This study explores
performance gaps among groups using process data and introduces a
latent class mediation analysis procedure. Through this analysis, the
study reveals latent classes underlying the distribution of sequence
features, explaining performance gaps between groups. Process data from
the National Assessment of Educational Progress (NAEP) Math Assessment
was analyzed to highlight differences in test-taking processes that
explain performance gaps between learners with learning disabilities
(LD) and their typically developing (TD) peers.
Lunch (11:50 — 12:50)
Fundamentals of measurement (12:50 — 1:40)
Nonparametric Response Time Estimation for Evaluating Model Fit — Quizhou Duan and Ying Cheng (University of Notre Dame)
Abstract: As response time data becomes widely available, it is useful
to analyze response time in addition to response
accuracy. In many instances the response time data might uncover a
different aspect of items in a given test. In the present study,
we propose a nonparametric estimation approach for response time
modeling. Previous developments addressed the fit of parametric models
for response accuracy by comparing parametric models and nonparametric
ones. This approach first gives us a way to graphically assess the
goodness of fit as the deviation of the parametric curves from the
nonparametric curves is visually displayed. In addition, resampling
methods can be added to make the approach inferential. Simulation results
show that the proposed nonparametric approach can adequately pick up
aberrancies. A real data analysis was performed on items from the 2018
PISA science assessment, and four items were flagged using response time. The future
direction of the study includes comparing item fit statistics for response
time and the distance measure in our nonparametric approach.
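The general idea of comparing a parametric response-time curve against a
nonparametric one can be sketched as follows; this assumes a lognormal RT
model with time intensity beta and latent speed tau, and uses simulated
data rather than the authors' estimator.

    # Sketch: overlay a parametric (lognormal) response-time curve and a
    # kernel-smoothed nonparametric curve for one item, using simulated data.
    import numpy as np
    import matplotlib.pyplot as plt

    def kernel_curve(tau_hat, log_rt, grid, bandwidth=0.3):
        """Nadaraya-Watson estimate of E[log RT | latent speed] on a grid."""
        curve = np.empty_like(grid)
        for g, t0 in enumerate(grid):
            w = np.exp(-0.5 * ((tau_hat - t0) / bandwidth) ** 2)  # Gaussian weights
            curve[g] = np.sum(w * log_rt) / np.sum(w)
        return curve

    rng = np.random.default_rng(0)
    tau_hat = rng.normal(size=2000)                   # estimated latent speed
    beta_hat = 1.2                                    # estimated time intensity
    log_rt = beta_hat - tau_hat + rng.normal(scale=0.4, size=2000)

    grid = np.linspace(-2, 2, 50)
    plt.plot(grid, beta_hat - grid, label="parametric curve")
    plt.plot(grid, kernel_curve(tau_hat, log_rt, grid), label="nonparametric curve")
    plt.xlabel("latent speed"); plt.ylabel("expected log response time")
    plt.legend(); plt.show()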
Link-DIF: An Iterative DIF Detection and Equating Procedure Using Logistic Regressions — Nancy Le and Ying Cheng (University of Notre Dame)
Abstract: In typical common-item linking procedures, items from two
forms can be placed on the same scale by a mean-sigma transformation based
on regression of item parameter estimates of the common items. However,
when one or more of the common items have DIF, equating coefficients
may be affected by the direction and magnitude of DIF. In this study,
we propose an iterative procedure that equates and removes DIF items
from the anchor set to achieve a "purified" anchor and stabilized
equating results.
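A minimal sketch of the iterative purification logic, assuming a standard
logistic-regression test for uniform DIF and NumPy arrays of common-item
difficulty estimates; this illustrates the general procedure, not the
authors' implementation.

    # Sketch of the iterative equate-and-purify loop; the DIF check is a
    # standard logistic-regression test for uniform DIF.
    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import chi2

    def mean_sigma(b_x, b_y):
        """Mean-sigma coefficients placing form-X difficulties on the form-Y scale."""
        A = np.std(b_y, ddof=1) / np.std(b_x, ddof=1)
        B = np.mean(b_y) - A * np.mean(b_x)
        return A, B

    def has_uniform_dif(item, total, group, alpha=0.01):
        """LR test: does group membership predict the item beyond the matching score?"""
        base = sm.Logit(item, sm.add_constant(total)).fit(disp=0)
        full = sm.Logit(item, sm.add_constant(np.column_stack([total, group]))).fit(disp=0)
        return chi2.sf(2 * (full.llf - base.llf), df=1) < alpha

    def link_dif(responses, group, b_x, b_y):
        """responses: examinee-by-common-item 0/1 matrix; group: 0/1 indicator;
        b_x, b_y: NumPy arrays of common-item difficulty estimates on each form."""
        anchor = list(range(len(b_x)))
        while True:
            A, B = mean_sigma(b_x[anchor], b_y[anchor])   # equate with current anchor
            total = responses[:, anchor].sum(axis=1)      # matching score from anchor only
            flagged = [i for i in anchor
                       if has_uniform_dif(responses[:, i], total, group)]
            if not flagged:
                return A, B, anchor                       # "purified" anchor set
            anchor = [i for i in anchor if i not in flagged]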
Bayesian Item-Level Model Selection for Cognitive Diagnostic Models via Reversible Jump Markov chain Monte Carlo — David Arthur (Purdue University)
Abstract: Cognitive Diagnostic Models (CDMs) are powerful
tools that provide personalized feedback regarding skill mastery within
a specific domain of learning. Issues of model selection, however, lead
to subpar model fit for the overall assessment, which ultimately results
in an inadequate picture of students’ skill mastery. This in
turn may lead to inadequate instructional and learning practices in
the future. In this research, we propose a fully Bayesian approach
to the item-level model selection problem for CDMs that performs well
in both small and large sample settings. Our approach has several
advantages. First, we obtain more information regarding uncertainty
in the model selection process via the posterior distribution over
candidate models. Second, it provides a natural way to perform Bayesian
model averaging which can be useful when no single candidate model is
the correct model for an item. Third, this approach does not rely on
asymptotic assumptions and relies only on information available from
the posterior distribution. Finally, as a fully Bayesian approach, it
offers a way to incorporate prior information about model parameters,
which can be crucial in assessment settings with small sample sizes.
Via simulation studies, we demonstrate that the proposed approach leads
to higher classification accuracy of candidate models in small samples
when compared to traditional approaches. Our findings suggest that
the proposed approach should be used for item-level model selection,
especially when sample sizes are small.
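A small sketch of how item-level posterior model probabilities and a
Bayesian model average can be summarized once RJMCMC draws of the model
label are available; the draws and the per-model quantities below are
simulated purely for illustration.

    # Sketch: summarizing RJMCMC output for one item. The chain of sampled model
    # labels and the per-model success probabilities are simulated for illustration.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    models = ["DINA", "DINO", "ACDM", "GDINA"]
    draws = rng.choice(models, size=2000, p=[0.55, 0.05, 0.30, 0.10])  # stand-in chain

    posterior_probs = pd.Series(draws).value_counts(normalize=True)
    print(posterior_probs)                       # posterior probability of each model

    # Bayesian model averaging: average a quantity of interest over the sampled models
    p_correct = {"DINA": 0.82, "DINO": 0.74, "ACDM": 0.79, "GDINA": 0.81}  # illustrative
    print("BMA P(correct):", np.mean([p_correct[m] for m in draws]))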
Break (1:40 — 1:50)
Evaluating Programs and Measurement Issues (1:50 — 3:00)
Comparing Performance and Perceptions of CRNAs on a Longitudinal Assessment and the Continued Professional
Certification Assessment: A Randomized Controlled Trial
— Shahid A. Choudhry, Timothy J. Muckle, Christopher J. Gill, Rajat Chadha, Magnus Urosev, Matt Ferris, and John C. Preston
(The National Board of Certification and Recertification for Nurse Anesthetists)
Abstract: The National Board of Certification and
Recertification for Nurse Anesthetists (NBCRNA) conducted a one-year
research study comparing performance on the Continued Professional
Certification Assessment (CPCA) administered at a test center or online
using live remote proctoring, to a longitudinal assessment that required
answering questions quarterly on demand from any location (CPC-LA).
Overall results suggest that the CPC-LA format is a feasible, usable, and
valid method to assess CRNAs' anesthesia knowledge and is well aligned with
the goals of the CPC Program. On balance, CPC-LA participants may exhibit
a different level of intensity in their approach to the assessment for and
of learning (compared to the CPCA), and more study is needed to identify
contributing factors affecting performance. Both groups were satisfied
with their experience. The CPC-LA group's feedback ratings were slightly
higher, and they found the platform easy to use and navigate.
Addressing the Effects of Home Resource Variables on Achievement: Are Latent Variables the Right Approach? — Lionel Meng and Dan Bolt (University of Wisconsin, Madison)
Abstract: When attempting to control for the influence
of multiple contextual variables on achievement outcomes, applied
methodologists are often quick to introduce data reduction in the form
of latent variables, following a belief that the latent variable is the
source of the causal influence and the observed variables are little
more than a reflection of the latent variable with error. However, this
approach makes strong assumptions and can yield highly misleading results
(VanderWeele, 2022). We examine this issue in the context of attempts
to define and control for the influence of home resources through a
home resources for learning index (HRL) on math and reading achievement
scores in the PIRLS and TIMSS assessments. Our results call into question
the common practice of modeling the HRL indicators with a single latent
variable. We propose an explanation for this phenomenon. The analytic
implication is insufficient control of the relevant manifest variable
through full reliance on the latent variable. Similar concerns apply
to studies that integrate measurement invariance analyses using latent
variable models into their study and control of contextual variables.
We show that allowing different loadings across countries may do more
to harm the control of HRL in performing cross-country comparisons
than enhance it. Throughout our analyses we rely on structural equation
modeling techniques using the lavaan package in R. We illustrate some
practical examples involving individual countries.
Assessing risk across credentialing programs — Kirk Becker (Pearson VUE)
Abstract: Different testing programs are subject to different
levels of cheating for different reasons. A self-assessment, for example,
is unlikely to elicit cheating as there are no stakes attached to the
results and the test takers are interested in the valid results of the
test. While this is easy to state in general terms, quantifying the
effects of program characteristics on security needs has not, to our
knowledge, been attempted. This presentation will describe a research study
evaluating program characteristics such as industry, volume, prerequisites,
and credential value. Dependent variables that can be used for this
study include indicators of test misconduct (response overlap, unusual
item times, short test times, etc.). Example comparisons across testing
modalities (online proctoring and various types of test centers), and a
discussion of complications and caveats, will be included.
Bias Audits for Artificial Intelligence Assessments — Scott Morris (Illinois Institute of Technology)
Abstract: The increased use of artificial intelligence (AI)
in hiring (e.g. automated resume screening, automated video interviews)
has triggered concerns about algorithmic bias and the negative impact
AI-based assessments might have on disadvantaged groups. The first law
regulating the use of AI in employment was issued by the City of New York
(Local Law 144), which covers the use of Automated Employment Decision
Tools to "substantially assist or replace" human judgment in hiring
and promotion decisions. Employers are required to conduct annual
third-party bias audits and post the results on the company website. These
bias audits focus on group differences in passing rates, or what is
commonly referred to as adverse impact, and provide no evaluation of
psychometric notions of bias (e.g., differential item functioning
or differential prediction). The law also requires no remedial action
other than the publication of bias reports. The Equal Employment Opportunity
Commission has also issued guidance on the use of AI in employment
decisions, which for the most part treats AI assessments the same as any
other selection tool. The emerging regulatory framework for evaluating
bias in AI is not well suited to detect or mitigate bias. The pitfalls
of failing to distinguish between bias and impact will be discussed.
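For reference, the selection-rate comparison that such bias audits report
(often summarized with the four-fifths guideline) can be computed as
follows; the data are hypothetical.

    # Sketch: impact ratio = each group's passing rate divided by the highest
    # group's passing rate (hypothetical data).
    import pandas as pd

    def impact_ratios(df, group_col="group", passed_col="passed"):
        rates = df.groupby(group_col)[passed_col].mean()
        return rates / rates.max()

    data = pd.DataFrame({
        "group":  ["A"] * 100 + ["B"] * 100,
        "passed": [1] * 60 + [0] * 40 + [1] * 42 + [0] * 58,
    })
    print(impact_ratios(data))   # B: 0.42 / 0.60 = 0.70, below the 0.80 guideline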
Break (3:00 — 3:10)
Technology in assessment design (3:10 — 4:00)
Developing Dual-objective CD-CAT Algorithms for College Gateway STEM Courses — Xiuxiu Tang, Yuxiao Zhang, and Hua-Hua Chang (Purdue University)
Abstract: In college gateway STEM courses, there is
usually a large number of students in the classroom, and it remains
challenging to conduct precise and efficient assessments. Cognitive
diagnostic computerized adaptive testing (CD-CAT) combines the strength
of both computerized adaptive testing (CAT) and cognitive diagnosis
(CD). The purpose of the study is to select an appropriate CD-CAT
algorithm for college gateway STEM courses. The algorithm will be
a dual information-based method, which combines information from both
θ (overall proficiency) and α (skill mastery).
We are also interested in investigating the effect of different test
lengths on the estimation accuracy. To implement the dual-objective
CD-CAT, both cognitive diagnostic models (CDM) and the item response
theory (IRT) models are used in the process of selecting items for
test takers.
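One possible form of a dual information-based selection rule, sketched
under assumed 2PL and DINA parameterizations; the weighting scheme and
parameter names are illustrative, not the authors' algorithm.

    # Sketch: score each candidate item by a weighted sum of 2PL Fisher
    # information at the current theta estimate and DINA-based KL information
    # at the current attribute-profile estimate.
    from itertools import product
    import numpy as np

    def fisher_info(theta, a, b):
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        return a ** 2 * p * (1 - p)

    def p_dina(alpha, q, guess, slip):
        alpha, q = np.asarray(alpha), np.asarray(q)
        mastered = np.all(alpha[q == 1] == 1)      # all required attributes mastered?
        return 1 - slip if mastered else guess

    def kl_info(alpha_hat, q, guess, slip):
        """Summed KL divergence between alpha_hat and every competing profile."""
        p0 = p_dina(alpha_hat, q, guess, slip)
        total = 0.0
        for alpha in product([0, 1], repeat=len(alpha_hat)):
            p1 = p_dina(alpha, q, guess, slip)
            total += p0 * np.log(p0 / p1) + (1 - p0) * np.log((1 - p0) / (1 - p1))
        return total

    def dual_objective_score(theta_hat, alpha_hat, item, w=0.5):
        """item: dict with 2PL (a, b) and DINA (guess, slip) parameters and Q-vector q."""
        return ((1 - w) * fisher_info(theta_hat, item["a"], item["b"])
                + w * kl_info(alpha_hat, item["q"], item["guess"], item["slip"]))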
Development of an Augmented Reality Game-Based Assessment of Cognitive Ability
— Kristina N. Bauer, Ivan Mutis, Gady Agam, Xuanchang Liu, and Gai Hao (Illinois Institute of Technology)
Abstract: Utilizing design thinking (Plattner et al., 2011)
and game mechanics (Salen et al., 2004), we developed a holographic 3D
puzzle cube game to measure cognitive ability that was inspired by the
Block Design test from the Wechsler Adult Intelligence Scale and the
game Tetris. Results of this pilot study showed that perceptions
of difficulty and mental effort were significantly different from
each other in the expected direction at each level. Additionally,
we ran a concatenated neural network to classify puzzle difficulty
with physiological measures as features and reached an impressive 79%
accuracy. Finally, puzzle performance was positively correlated with working memory capacity (WMC),
negatively correlated with age, and uncorrelated with gender. Together
these findings suggest difficulty was successfully manipulated and the
test has initial evidence for construct validity, but additional research
with larger samples is needed.
Technology in Employment Interviews: Common Practice and Applicant Reactions — Andrew Greenagel and Erin Young (Illinois Institute of Technology)
Abstract: More technology has been introduced into selection
processes in recent years, yet research on how applicants react to these
novel technologies is still limited. Therefore, the purpose of this study
was twofold: to investigate current interview practices and examine how
applicants perceive these practices. A sample of N = 404 participants
were asked to think about a recent employment interview and answer
questions based on their experience. Questions assessed applicant
perceptions of organizational justice, transparency, and organizational
attractiveness, as well as intentions to pursue a job offer. Results suggested
that most organizations are using human-based hiring practices and that
justice perceptions were similar across all interview administration
mediums. However, transparency perceptions were significantly
lower for online interviews compared to in-person interviews. While
no significant differences emerged between human and AI evaluation
methods, several participants (n = 62) indicated not knowing whether
a human or AI evaluated their interview, revealing that organizations
may not be explicit in stating the details of their selection system to
applicants. Notably, these results differed from much of the past research
(mainly lab studies) on technology-mediated interviews. The talk will
provide conclusions and directions for future research.
Closing comments (4:00)
Questions about the seminar may be directed to Alan Mead
(), Scott Morris (), or Kirk Becker (). We hope you will join us.