Ideas in Testing Research Seminar Schedule, November 10, 2023
Coffee & Networking (9:00 — 9:50)
Welcome and Introduction (9:50 — 10:00)
Applications of NLP (10:00 — 10:50)
Tracking Text and Item Revisions Across Examination Forms for Assessment Programs — Nicholas Williams, MEd, and Eden Racket (American Osteopathic Association)
Abstract: Tracking item revisions can be a tricky business. Some
item banking software solutions incorporate no versioning, others require
a manual process to track item revisions, and still others are overly
sensitive in automatically tracking new revisions. Our current item
banking system is an example of the latter, as metadata changes give an
item a new item version number. A full revision history is available for
each item, but manually checking individual changes against revision dates
for scoring purposes is a time-consuming process. As our primary interest
is whether an item's text, key, and/or associated assets have changed,
we have developed a custom solution utilizing a new process in which we
back up item banks and published examinations in a processing-friendly
data format. Afterwards, a custom tool developed with Python and the
pandas library is run to create a report of items that have potentially
been revised. Our new approach has led to increased efficiency in
determining the number of items that have undergone significant changes
since the last exam administration. It is our hope that our implementation
may inspire others to consider similar methods of tackling this problem
for their items.
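The kind of comparison the authors describe can be prototyped in a few
lines of pandas. The sketch below is not their tool; the file and column
names (item_id, stem_text, key, asset_hash) are assumptions about how the
bank snapshots might be exported.

    # Sketch: flag items whose text, key, or associated assets changed between
    # two item-bank snapshots exported as CSV files (column names are assumed).
    import pandas as pd

    old = pd.read_csv("bank_2022.csv")   # snapshot at the last administration
    new = pd.read_csv("bank_2023.csv")   # current snapshot

    merged = old.merge(new, on="item_id", suffixes=("_old", "_new"))
    watched = ["stem_text", "key", "asset_hash"]

    # An item is "potentially revised" if any watched field differs between snapshots
    diffs = pd.DataFrame({c: merged[f"{c}_old"] != merged[f"{c}_new"] for c in watched})
    merged[diffs.any(axis=1)].to_csv("potentially_revised_items.csv", index=False)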
Analyzing open-responses in a post-exam survey using natural language processing methods — Yunyi Long and Xia Mao (NBOME)
Abstract: Open-ended questions in the survey after licensure
examinations can capture candidates' insightful feedback and thus
facilitate improvement in future examinations. However, the time
and labor costs associated with open-ended responses, as well as the
inevitable human biases, can impede the wide usage of this type of
question. With the development of Natural Language Process (NLP),
qualitative data can be processed effectively, offering alternatives
for human coding. The present study compares two NLP methods with human
categorization to assess the effectiveness of these computer-assisted tools
in analyzing candidates' open feedback. Preliminary results revealed
that both NLP methods cleaned out noisy data effectively. They could also
classify hundreds of open-responses into five to six major categories.
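As a rough illustration of this kind of computer-assisted workflow (the
study's specific NLP methods are not named above), the sketch below drops
very short or empty responses and groups the remainder into roughly six
categories; the file and column names are assumptions.

    # Sketch: filter noisy responses, then cluster the rest into ~6 categories.
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    survey = pd.read_csv("post_exam_survey.csv")            # assumed file name
    text = survey["open_response"].fillna("").str.strip()   # assumed column name
    text = text[text.str.len() >= 10]                       # crude noise filter

    X = TfidfVectorizer(stop_words="english", max_features=2000).fit_transform(text)
    labels = KMeans(n_clusters=6, random_state=0, n_init=10).fit_predict(X)
    print(pd.Series(labels).value_counts())                 # responses per category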
Analyzing Constructed Responses in Educational Survey with LDA: A Demonstration on Students' Career Aspiration — Yuxiao Zhang, Nielsen Pereira, David Arthur, and Hua Hua Chang (Purdue University)
Abstract: This study explores the potential of Latent Dirichlet Allocation (LDA; Blei et al.,
2003) as a tool for analyzing constructed responses in educational surveys. The traditional
manual coding approach can be labor-intensive and time-consuming, especially when dealing
with large sample sizes and long textual responses. LDA, a statistical algorithm from Natural
Language Processing, provides an efficient solution for discovering major topics within a
collection of documents. In this study, using students' constructed responses regarding their
career aspirations, we demonstrated the utility of LDA in obtaining insights from these
responses and transforming textual data into numerical variables that can be used in
subsequent statistical analyses.
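A minimal sketch of the LDA step, using scikit-learn rather than any
particular package from the study; the input column name and the number of
topics are assumptions.

    # Sketch: fit LDA to constructed responses and keep the document-topic
    # proportions as numerical variables for later statistical analyses.
    import pandas as pd
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    responses = pd.read_csv("career_aspirations.csv")["career_response"].dropna()

    vec = CountVectorizer(stop_words="english", min_df=2)
    dtm = vec.fit_transform(responses)                    # document-term matrix

    lda = LatentDirichletAllocation(n_components=5, random_state=0)
    theta = lda.fit_transform(dtm)                        # document-topic proportions

    vocab = vec.get_feature_names_out()
    for k, row in enumerate(lda.components_):             # print top words per topic
        top = row.argsort()[-10:][::-1]
        print(f"Topic {k}:", ", ".join(vocab[i] for i in top))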
Break (10:50 — 11:00)
Cognitive assessments (11:00 — 11:50)
Prediction of Cognitive Impairment Using Higher Order Item Response Theory and Machine Learning — Lihua Yao (Northwestern)
Abstract: Early detection of cognitive impairment (CI)
is very important for older adults. The MyCog assessment uses two
well-validated iPad-based measures from the NIH Toolbox for the Assessment
of Neurological Behavior and Function Cognitive Battery (NIHTB-CB) that
address two of the first domains to show CI: Picture Sequence Memory
(PSM), which assesses episodic memory, and Dimensional Change Card Sort
(DCCS), which measures cognitive flexibility. The purpose of this study
was to explore machine learning models for better prediction
of CI. Our talk will discuss the methodological approach.
Our results suggest that relying on a single, simple cut point for a
composite score, regardless of how well it is derived, may not yield
optimal outcomes. Instead, employing machine learning models that utilize
scores derived from IRT and encompass features such as age can lead to
more effective prediction models.
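A minimal sketch of the modeling comparison described above, under assumed
data and column names (psm_theta, dccs_theta, age, ci_label); the cut point
and classifier choice are illustrative, not the authors' pipeline.

    # Sketch: single composite cut point vs. a machine learning model that uses
    # IRT-derived scores plus age (data and column names are assumed).
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("mycog_scores.csv")
    y = df["ci_label"]                                    # 1 = cognitive impairment

    # Baseline: one cut point on a simple composite of the two IRT scores
    composite = df[["psm_theta", "dccs_theta"]].mean(axis=1)
    cut_pred = (composite < composite.quantile(0.25)).astype(int)   # illustrative cut
    print("cut-point accuracy:", (cut_pred == y).mean())

    # ML alternative: IRT scores and age as features, evaluated by cross-validation
    X = df[["psm_theta", "dccs_theta", "age"]]
    model = GradientBoostingClassifier(random_state=0)
    print("ML accuracy:", cross_val_score(model, X, y, cv=5).mean())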
Introducing the newly normed NIH Toolbox Cognition Battery (V3) — Emily Ho, Aaron Kaat, Erica LaForte, Amy Giella, Julie Hook, and Richard Gershon (Northwestern)
Abstract: This talk describes the results of a large-scale
norming study of the NIH Toolbox Cognition Battery (V3), including
measures of convergent validity, divergent validity, and other relevant
psychometric indices. We collected a nationally representative sample
of N = 3848 US participants. A subset of 200 participants completed a
retest 7 to 21 days later. Measures included two newly developed tests
(Speeded Matching and Visual Reasoning) and convergent validity measures.
Our talk will describe the sampling and norming procedure. We found that
growth curves for each of the measures follow hypothesized trajectories
across the life span. Confirmatory factor analyses showed that a two-factor
model separating fluid and crystallized intelligence fit well, and
the convergent validity analyses demonstrated good convergence with
established gold standards. We conclude that the NIH Toolbox is a
multidimensional set of assessments meant to be a "common currency"
for a diverse set of study designs and research settings. The updated NIH
Toolbox V3 incorporates new scientific developments in neuropsychology
and psychometrics and includes two validated measures of processing speed
and non-verbal reasoning, respectively. There is good convergence with
established gold standards and a robust factor structure that aligns
with a two-factor model of cognition.
Explaining Performance Gaps with Problem-Solving Process Data via Latent Class Mediation Analysis — Sunbeom Kwon and Susu Zhang (University of Illinois, Urbana-Champaign)
Abstract: Computer-based assessment platforms have allowed for
the collection of problem-solving process data, offering insights
into examinees’ problem-solving strategies. This study explores
performance gaps among groups using process data and introduces a
latent class mediation analysis procedure. Through this analysis, the
study reveals latent classes underlying the distribution of sequence
features, explaining performance gaps between groups. Process data from
the National Assessment of Educational Progress (NAEP) Math Assessment
was analyzed to highlight differences in test-taking processes that
explain performance gaps between learners with learning disabilities
(LD) and their typically developing (TD) peers.
Lunch (11:50 — 12:50)
Fundamentals of measurement (12:50 — 1:40)
Nonparametric Response Time Estimation for Evaluating Model Fit — Quizhou Duan and Ying Cheng (University of Notre Dame)
Abstract: As response time data becomes widely available, it is useful
to analyze response time in addition to response
accuracy. In many instances the response time data might uncover a
different aspect of items in a given test. In the present study,
we propose a nonparametric estimation approach for response time
modeling. Previous developments addressed the fit of parametric models
for response accuracy by comparing parametric models and nonparametric
ones. This approach first gives us a way to graphically assess the
goodness of fit as the deviation of the parametric curves from the
nonparametric curves is visually displayed. In addition, resampling
methods can be added to make the approach inferential. Simulation results
show that the proposed nonparametric approach can adequately pick up
aberrancies. A real data analysis was performed on items from the 2018
PISA science assessment, and four items were flagged using response time. The future
direction of the study includes comparing item fit statistics for response
time and the distance measure in our nonparametric approach.
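The general idea of comparing a parametric response-time curve against a
nonparametric one can be sketched as follows; this assumes a lognormal RT
model with time intensity beta and latent speed tau, and uses simulated
data rather than the authors' estimator.

    # Sketch: overlay a parametric (lognormal) response-time curve and a
    # kernel-smoothed nonparametric curve for one item, using simulated data.
    import numpy as np
    import matplotlib.pyplot as plt

    def kernel_curve(tau_hat, log_rt, grid, bandwidth=0.3):
        """Nadaraya-Watson estimate of E[log RT | latent speed] on a grid."""
        curve = np.empty_like(grid)
        for g, t0 in enumerate(grid):
            w = np.exp(-0.5 * ((tau_hat - t0) / bandwidth) ** 2)  # Gaussian weights
            curve[g] = np.sum(w * log_rt) / np.sum(w)
        return curve

    rng = np.random.default_rng(0)
    tau_hat = rng.normal(size=2000)                   # estimated latent speed
    beta_hat = 1.2                                    # estimated time intensity
    log_rt = beta_hat - tau_hat + rng.normal(scale=0.4, size=2000)

    grid = np.linspace(-2, 2, 50)
    plt.plot(grid, beta_hat - grid, label="parametric curve")
    plt.plot(grid, kernel_curve(tau_hat, log_rt, grid), label="nonparametric curve")
    plt.xlabel("latent speed"); plt.ylabel("expected log response time")
    plt.legend(); plt.show()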
Link-DIF: An Iterative DIF Detection and Equating Procedure Using Logistic Regressions — Nancy Le and Ying Cheng (University of Notre Dame)
Abstract: In typical common-item linking procedures, items from two
forms can be placed on the same scale by a mean-sigma transformation based
on regression of item parameter estimates of the common items. However,
when one or more of the common items have DIF, equating coefficients
may be affected by the direction and magnitude of DIF. In this study,
we propose an iterative procedure that equates and removes DIF items
from the anchor set to achieve a "purified" anchor and stabilized
equating results.
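A minimal sketch of the iterative purification logic, assuming a standard
logistic-regression test for uniform DIF and NumPy arrays of common-item
difficulty estimates; this illustrates the general procedure, not the
authors' implementation.

    # Sketch of the iterative equate-and-purify loop; the DIF check is a
    # standard logistic-regression test for uniform DIF.
    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import chi2

    def mean_sigma(b_x, b_y):
        """Mean-sigma coefficients placing form-X difficulties on the form-Y scale."""
        A = np.std(b_y, ddof=1) / np.std(b_x, ddof=1)
        B = np.mean(b_y) - A * np.mean(b_x)
        return A, B

    def has_uniform_dif(item, total, group, alpha=0.01):
        """LR test: does group membership predict the item beyond the matching score?"""
        base = sm.Logit(item, sm.add_constant(total)).fit(disp=0)
        full = sm.Logit(item, sm.add_constant(np.column_stack([total, group]))).fit(disp=0)
        return chi2.sf(2 * (full.llf - base.llf), df=1) < alpha

    def link_dif(responses, group, b_x, b_y):
        """responses: examinee-by-common-item 0/1 matrix; group: 0/1 indicator;
        b_x, b_y: NumPy arrays of common-item difficulty estimates on each form."""
        anchor = list(range(len(b_x)))
        while True:
            A, B = mean_sigma(b_x[anchor], b_y[anchor])   # equate with current anchor
            total = responses[:, anchor].sum(axis=1)      # matching score from anchor only
            flagged = [i for i in anchor
                       if has_uniform_dif(responses[:, i], total, group)]
            if not flagged:
                return A, B, anchor                       # "purified" anchor set
            anchor = [i for i in anchor if i not in flagged]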
Bayesian Item-Level Model Selection for Cognitive Diagnostic Models via Reversible Jump Markov chain Monte Carlo — David Arthur (Purdue University)
Abstract: Cognitive Diagnostic Models (CDMs) are powerful
tools that provide personalized feedback regarding skill mastery within
a specific domain of learning. Issues of model selection, however, lead
to subpar model fit for the overall assessment, which ultimately results
in an inadequate picture of students’ skill mastery. This in
turn may lead to inadequate instructional and learning practices in
the future. In this research, we propose a fully Bayesian approach
to the item-level model selection problem for CDMs that performs well
in both small and large sample settings. Our approach has several
advantages. First, we obtain more information regarding uncertainty
in the model selection process via the posterior distribution over
candidate models. Second, it provides a natural way to perform Bayesian
model averaging which can be useful when no single candidate model is
the correct model for an item. Third, this approach does not rely on
asymptotic assumptions and relies only on information available from
the posterior distribution. Finally, as a fully Bayesian approach, it
offers a way to incorporate prior information about model parameters,
which can be crucial in assessment settings with small sample sizes.
Via simulation studies, we demonstrate that the proposed approach leads
to higher classification accuracy of candidate models in small samples
when compared to traditional approaches. Our findings suggest that
the proposed approach should be used for item-level model selection,
especially when sample sizes are small.
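A small sketch of how item-level posterior model probabilities and a
Bayesian model average can be summarized once RJMCMC draws of the model
label are available; the draws and the per-model quantities below are
simulated purely for illustration.

    # Sketch: summarizing RJMCMC output for one item. The chain of sampled model
    # labels and the per-model success probabilities are simulated for illustration.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    models = ["DINA", "DINO", "ACDM", "GDINA"]
    draws = rng.choice(models, size=2000, p=[0.55, 0.05, 0.30, 0.10])  # stand-in chain

    posterior_probs = pd.Series(draws).value_counts(normalize=True)
    print(posterior_probs)                       # posterior probability of each model

    # Bayesian model averaging: average a quantity of interest over the sampled models
    p_correct = {"DINA": 0.82, "DINO": 0.74, "ACDM": 0.79, "GDINA": 0.81}  # illustrative
    print("BMA P(correct):", np.mean([p_correct[m] for m in draws]))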
Break (1:40 — 1:50)
Evaluating Programs and Measurement Issues (1:50 — 3:00)
Comparing Performance and Perceptions of CRNAs on a Longitudinal Assessment and the Continued Professional
Certification Assessment: A Randomized Controlled Trial
— Shahid A. Choudhry, Timothy J. Muckle, Christopher J. Gill, Rajat Chadha, Magnus Urosev, Matt Ferris, and John C. Preston
(The National Board of Certification and Recertification for Nurse Anesthetists)
Abstract: The National Board of Certification and
Recertification for Nurse Anesthetists (NBCRNA) conducted a one-year
research study comparing performance on the Continued Professional
Certification Assessment (CPCA) administered at a test center or online
using live remote proctoring, to a longitudinal assessment that required
answering questions quarterly on demand from any location (CPC-LA).
Overall results suggest that the CPC-LA format is a feasible, usable, and
valid method to assess CRNAs' anesthesia knowledge and is well aligned with
the goals of the CPC Program. On balance, CPC-LA participants may exhibit
a different level of intensity in their approach to the assessment for and
of learning (compared to the CPCA), and more study is needed to identify
contributing factors affecting performance. Both groups were satisfied
with their experience. The CPC-LA group's feedback ratings were slightly
higher, and they found the platform easy to use and navigate.
Addressing the Effects of Home Resource Variables on Achievement: Are Latent Variables the Right Approach? — Lionel Meng and Dan Bolt (University of Wisconsin, Madison)
Abstract: When attempting to control for the influence
of multiple contextual variables on achievement outcomes, applied
methodologists are often quick to introduce data reduction in the form
of latent variables, following a belief that the latent variable is the
source of the causal influence and the observed variables are little
more than a reflection of the latent variable with error. However, this
approach makes strong assumptions and can yield highly misleading results
(VanderWeele, 2022). We examine this issue in the context of attempts
to define and control for the influence of home resources through a
home resources for learning index (HRL) on math and reading achievement
scores in the PIRLS and TIMSS assessments. Our results call into question
the common practice of modeling the HRL indicators with a single latent
variable. We propose an explanation for this phenomenon. The analytic
implication is insufficient control of the relevant manifest variable
through full reliance on the latent variable. Similar concerns apply
to studies that integrate measurement invariance analyses using latent
variable models into their study and control of contextual variables.
We show that allowing different loadings across countries may do more
to harm the control of HRL in performing cross-country comparisons
than enhance it. Throughout our analyses we rely on structural equation
modeling techniques using the lavaan package in R. We illustrate some
practical examples involving individual countries.
Assessing risk across credentialing programs — Kirk Becker (Pearson VUE)
Abstract: Different testing programs are subject to different
levels of cheating for different reasons. A self-assessment, for example,
is unlikely to elicit cheating as there are no stakes attached to the
results and the test takers are interested in the valid results of the
test. While this is easy to state in general terms, quantifying the
effects of program characteristics on security needs has not, to our
knowledge, been attempted. This presentation will describe a research study
evaluating program characteristics such as industry, volume, prerequisites,
and credential value. Dependent variables that can be used for this
study include indicators of test misconduct (response overlap, unusual
item times, short test times, etc.). Example comparisons across testing
modalities (online proctoring and various types of test centers), and a
discussion of complications and caveats, will be included.
Bias Audits for Artificial Intelligence Assessments — Scott Morris (Illinois Institute of Technology)
Abstract: The increased use of artificial intelligence (AI)
in hiring (e.g. automated resume screening, automated video interviews)
has triggered concerns about algorithmic bias and the negative impact
AI-based assessments might have on disadvantaged groups. The first law
regulating the use of AI in employment was issued by the City of New York
(Local Law 144), which covers the use of Automated Employment Decision
Tools to "substantially assist or replace" human judgment in hiring
and promotion decisions. Employers are required to conduct annual
third-party bias audits and post the results on the company website. These
bias audits focus on group differences in passing rates, or what is
commonly referred to as adverse impact, and provide no evaluation of
psychometric notions of bias (e.g., differential item functioning
or differential prediction). The law also requires no remedial action
other than the publication of bias reports. The Equal Employment Opportunity
Commission has also issued guidance on the use of AI in employment
decisions, which for the most part treats AI assessments the same as any
other selection tool. The emerging regulatory framework for evaluating
bias in AI is not well suited to detect or mitigate bias. The pitfalls
of failing to distinguish between bias and impact will be discussed.
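For reference, the selection-rate comparison that such bias audits report
(often summarized with the four-fifths guideline) can be computed as
follows; the data are hypothetical.

    # Sketch: impact ratio = each group's passing rate divided by the highest
    # group's passing rate (hypothetical data).
    import pandas as pd

    def impact_ratios(df, group_col="group", passed_col="passed"):
        rates = df.groupby(group_col)[passed_col].mean()
        return rates / rates.max()

    data = pd.DataFrame({
        "group":  ["A"] * 100 + ["B"] * 100,
        "passed": [1] * 60 + [0] * 40 + [1] * 42 + [0] * 58,
    })
    print(impact_ratios(data))   # B: 0.42 / 0.60 = 0.70, below the 0.80 guideline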
Break (3:00 — 3:10)
Technology in assessment design (3:10 — 4:00)
Developing Dual-objective CD-CAT Algorithms for College Gateway STEM Courses — Xiuxiu Tang, Yuxiao Zhang, and Hua-Hua Chang (Purdue University)
Abstract: In college gateway STEM courses, there is
usually a large number of students in the classroom, and it remains
challenging to conduct precise and efficient assessments. Cognitive
diagnostic computerized adaptive testing (CD-CAT) combines the strength
of both computerized adaptive testing (CAT) and cognitive diagnosis
(CD). The purpose of the study is to select an appropriate CD-CAT
algorithm for college gateway STEM courses. The algorithm will be
a dual information-based method, which combines information from both
θ (overall proficiency) and α (skill mastery).
We are also interested in investigating the effect of different test
lengths on the estimation accuracy. To implement the dual-objective
CD-CAT, both cognitive diagnostic models (CDM) and the item response
theory (IRT) models are used in the process of selecting items for
test takers.
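One possible form of a dual information-based selection rule, sketched
under assumed 2PL and DINA parameterizations; the weighting scheme and
parameter names are illustrative, not the authors' algorithm.

    # Sketch: score each candidate item by a weighted sum of 2PL Fisher
    # information at the current theta estimate and DINA-based KL information
    # at the current attribute-profile estimate.
    from itertools import product
    import numpy as np

    def fisher_info(theta, a, b):
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        return a ** 2 * p * (1 - p)

    def p_dina(alpha, q, guess, slip):
        alpha, q = np.asarray(alpha), np.asarray(q)
        mastered = np.all(alpha[q == 1] == 1)      # all required attributes mastered?
        return 1 - slip if mastered else guess

    def kl_info(alpha_hat, q, guess, slip):
        """Summed KL divergence between alpha_hat and every competing profile."""
        p0 = p_dina(alpha_hat, q, guess, slip)
        total = 0.0
        for alpha in product([0, 1], repeat=len(alpha_hat)):
            p1 = p_dina(alpha, q, guess, slip)
            total += p0 * np.log(p0 / p1) + (1 - p0) * np.log((1 - p0) / (1 - p1))
        return total

    def dual_objective_score(theta_hat, alpha_hat, item, w=0.5):
        """item: dict with 2PL (a, b) and DINA (guess, slip) parameters and Q-vector q."""
        return ((1 - w) * fisher_info(theta_hat, item["a"], item["b"])
                + w * kl_info(alpha_hat, item["q"], item["guess"], item["slip"]))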
Development of an Augmented Reality Game-Based Assessment of Cognitive Ability
— Kristina N. Bauer, Ivan Mutis, Gady Agam, Xuanchang Liu, and Gai Hao (Illinois Institute of Technology)
Abstract: Utilizing design thinking (Plattner et al., 2011)
and game mechanics (Salen et al., 2004), we developed a holographic 3D
puzzle cube game to measure cognitive ability that was inspired by the
Block Design test from the Wechsler Adult Intelligence Scale and the
game Tetris. Results of this pilot study showed that perceptions
of difficulty and mental effort were significantly different from
each other in the expected direction at each level. Additionally,
we ran a concatenated neural network to classify puzzle difficulty
with physiological measures as features and reached an impressive 79%
accuracy. Finally, puzzle performance was positively correlated with working memory capacity (WMC),
negatively correlated with age, and uncorrelated with gender. Together
these findings suggest difficulty was successfully manipulated and the
test has initial evidence for construct validity, but additional research
with larger samples is needed.
Technology in Employment Interviews: Common Practice and Applicant Reactions — Andrew Greenagel and Erin Young (Illinois Institute of Technology)
Abstract: More technology has been introduced into selection
processes in recent years, yet research on how applicants react to these
novel technologies is still limited. Therefore, the purpose of this study
was twofold: to investigate current interview practices and examine how
applicants perceive these practices. A sample of N = 404 participants
were asked to think about a recent employment interview and answer
questions based on their experience. Questions assessed applicant
perceptions of organizational justice, transparency, and organizational
attractiveness, as well as intentions to pursue a job offer. Results suggested
that most organizations are using human-based hiring practices and that
justice perceptions were similar across all interview administration
mediums. However, transparency perceptions were significantly
lower for online interviews compared to in-person interviews. While
no significant differences emerged between human and AI evaluation
methods, several participants (n = 62) indicated not knowing whether
a human or AI evaluated their interview, revealing that organizations
may not be explicit in stating the details of their selection system to
applicants. Notably, these results differed from much of the past research
(mainly lab studies) on technology-mediated interviews. The talk will
provide conclusions and directions for future research.
Closing comments (4:00)
Questions about the seminar may be directed to Alan Mead
(), Scott Morris (), or Kirk Becker (). We hope you will join us.