2015 Ideas in Testing Research Seminar, November 13, 2015, IIT, Chicago, IL

Reducing burden when reporting patient-reported outcomes using multidimensional computer adaptive testing (Scott B. Morris, Illinois Institute of Technology; Michael Bass, Northwestern University; Mirinae Lee, Illinois Institute of Technology; & Richard E. Neapolitan, Northwestern University)

Abstract: Utilization of patient-reported outcome (PRO) measures has been limited by the lack of psychometrically sound measures scored in real time. The Patient-Reported Outcomes Measurement Information System (PROMIS) initiative developed a broad array of unidimensional computer adaptive PRO measures. By administering only questions targeted to the respondent's trait level, a computer adaptive test (CAT) provides high measurement precision with substantially fewer items. By taking advantage of the correlations among traits, multidimensional CAT (MCAT) can further reduce the number of items administered and therefore lessen patient burden. The goal of the current study was to determine the advantage in test efficiency (i.e., number of items required) of MCAT relative to unidimensional CAT and fixed-form scales. We estimated a 3-dimensional IRT model of the PROMIS emotional distress banks (anxiety, depression, and anger). Using these item parameters, a Monte Carlo simulation was used to examine the relative performance of fixed-form testing, unidimensional CAT, and MCAT in terms of accuracy of trait estimates and number of items required.
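As a rough illustration of how a unidimensional CAT targets questions to the respondent's trait level, the sketch below selects the unadministered item with the largest Fisher information at the current trait estimate. It assumes a dichotomous two-parameter logistic (2PL) model and a made-up four-item bank; the PROMIS banks use polytomous items, and the study's multidimensional selection rule is not reproduced here. All function names are hypothetical.

```python
import math

def p_endorse(theta, a, b):
    """2PL probability of endorsing an item at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = p_endorse(theta, a, b)
    return a * a * p * (1.0 - p)

def select_next_item(theta, bank, administered):
    """Index of the not-yet-administered item with maximum information."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *bank[i]))

# Hypothetical item bank: (discrimination a, difficulty b) pairs.
bank = [(1.2, -1.5), (1.0, 0.0), (1.8, 0.4), (0.9, 2.0)]

# Item 1 has already been given; the provisional trait estimate is 0.5.
next_item = select_next_item(theta=0.5, bank=bank, administered={1})
print(next_item)
```

In a full CAT loop, the trait estimate would be updated after each response and the selection repeated until a precision or length stopping rule is met.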

Multidimensional adaptive personality assessment: A real-data demonstration using the 16PF questionnaire (Kevin Franke, Illinois Institute of Technology)

Abstract: Multidimensional CAT (MCAT) promises efficient measurement of attitudes and personality by leveraging the collateral information from the correlated nature of these constructs. Previous Monte Carlo research on MCAT administration of the 16PF personality questionnaire using Segall's (1996) method has suggested that test length could be substantially reduced with little loss of reliability. The current study extends these Monte Carlo results to real-data simulations using archival 16PF responses. This study is important for two reasons. First, it is always important to show that simulated results generalize to actual use. Even more importantly, recent research on personality has suggested that traditional IRT models do not fit personality data well and may not be the most appropriate models. If the IRT model is a poor fit to 16PF data, the Monte Carlo results will not hold for real data. On the other hand, if the real-data results replicate the simulation results, then we may assume that traditional IRT models fit 16PF data sufficiently well.

CAT with ideal-point: Practical issues in applying GGUM to employment CAT (Alan D. Mead, Talent Algorithms Inc.)

Abstract: Recently I-O researchers have advocated for the use of "ideal-point" or "unfolding" IRT models for personality traits. While some researchers argue that responding to personality items is an unfolding psychological process (where respondents indicate agreement when the scaled extremity of the personality item matches their standing on the latent trait), others have adopted a purely pragmatic perspective and pointed out that ideal-point models seem to fit data better. Although there have been several studies of the use of CAT with dominance models, relatively little literature has examined the use of ideal-point CAT. This paper will describe practical problems with the scaling metric, information function, and response set that affect ideal-point CAT implementations.

Comparison of Different Ability Estimation Methods for Strand Scores in a Grade 6 Mathematics Computerized Adaptive Test (Johnny Denbleyker, Houghton Mifflin Harcourt)

Abstract: This study compares sub-score estimation methods in a computer adaptive testing (CAT) environment. A unique aspect of this study involves the analysis of student test scores across multiple test opportunities within the accountability testing window for an NCLB mathematics assessment. This allowed assessing aspects of reliability in a practical test-retest manner while accounting for error associated with both sampling of items and an occasion facet. The primary interest of this study is to compare EAP estimation of ability using a variable (strong) prior based on total score to that of maximum likelihood estimation (MLE). Multiple facets of the frequentist approach (MLE) and Bayesian approaches are considered to support the comparisons between the two primary methods.
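The contrast between the two estimators can be sketched on a small example. The code below is a minimal illustration, not the study's procedure: it assumes a Rasch model, a grid search over ability, five hypothetical strand items, and a normal prior whose mean and standard deviation are free parameters (in the study the prior is based on the total score; here the prior mean is simply set to 0).

```python
import math

def log_likelihood(theta, responses, difficulties):
    """Rasch log-likelihood of a 0/1 response pattern at ability theta."""
    ll = 0.0
    for u, b in zip(responses, difficulties):
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        ll += math.log(p) if u == 1 else math.log(1.0 - p)
    return ll

# Ability grid from -4 to 4 in steps of 0.01 (an arbitrary choice).
GRID = [g / 100.0 for g in range(-400, 401)]

def mle(responses, difficulties):
    """Grid-search maximum likelihood estimate of ability."""
    return max(GRID, key=lambda t: log_likelihood(t, responses, difficulties))

def eap(responses, difficulties, prior_mean=0.0, prior_sd=1.0):
    """Posterior mean of ability under a normal prior (EAP)."""
    weights = [
        math.exp(log_likelihood(t, responses, difficulties)
                 - 0.5 * ((t - prior_mean) / prior_sd) ** 2)
        for t in GRID
    ]
    return sum(t * w for t, w in zip(GRID, weights)) / sum(weights)

# Hypothetical strand: five items, four answered correctly.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
responses = [1, 1, 1, 1, 0]

theta_mle = mle(responses, difficulties)
theta_eap_weak = eap(responses, difficulties, prior_sd=2.0)
theta_eap_strong = eap(responses, difficulties, prior_sd=0.5)
print(theta_mle, theta_eap_weak, theta_eap_strong)
```

With so few items, the strong prior pulls the EAP estimate well toward the prior mean, which is the trade-off between stability and shrinkage that a sub-score comparison of this kind has to weigh.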

Frontiers (11:20 — 12:20)

Item Difficulty Modeling on a Logical Reasoning Test (Kuan Xing, University of Illinois at Chicago; Kirk Becker, Pearson)

Abstract: Psychometric researchers and testing practitioners must investigate the quality of large-scale tests. Checking item parameters such as item difficulty is an important part of the test quality control process. It is also essential to investigate item features and to use that information to predict item quality for future item generation (Irvine, 2003). In this pilot study, we modeled item difficulty parameters using logical reasoning items from a university administration test.

Modeling the Evaluative Content of Personality Questionnaires: A Bifactor Application (Samuel T. McAbee, Illinois Institute of Technology; Michael D. Biderman, University of Tennessee at Chattanooga; Zhou "Job" Chen, University of Oregon; & Nhung Hendy, Towson University)

Abstract: Personality questionnaires are typically designed to reduce the evaluative content of personality items, yet recent research has identified a general factor present in personality measures that is related to the desirability of the items themselves. To examine this finding, the present study applied confirmatory bifactor analysis to responses on the NEO-FFI-3 and HEXACO-PI-R. A general factor was found for both inventories, and relationships between this general factor and measures of positive and negative affect were assessed. Across scales, this general factor demonstrated strong negative correlations with measures of negative affect (negative affect and depression) and strong positive correlations with measures of positive affect (positive affect and self-esteem). Factor loadings on this general factor were strongly related to third-party item valence ratings, suggesting that the general factor present in both Big Five and Big Six data assesses the evaluative aspects of personality items and is highly related to general affective state.

Automated scoring of open-ended mechanical aptitude items (Alan D. Mead, Talent Algorithms Inc.; Diana Bairaktarova, Virginia Tech; & Anna Woodcock, California State University San Marcos)

Abstract: Multiple-choice items have dominated the testing industry, but modern computerized assessments offer the possibility of innovative items that may improve upon multiple-choice items. Previous research has shown that fill-in-the-blank (FITB) open-ended questions are more difficult but dramatically more reliable (i.e., higher item-total correlations). The current study examines automated algorithms for scoring open-ended responses to Mechanical Aptitude items accurately and validly.

Lunch (12:30 — 1:30)

Employment Testing (1:30 — 2:30)

A comparison of different methods of detecting inattentive responding on self-report personality measures (Avi Fleischer, Tetrics LLC and Illinois Institute of Technology)

Abstract: This study sought to identify the best indices for detecting inattentive respondents in unmotivated samples completing self-report measures. Data were collected under attentive and inattentive conditions, and five detection methods (item-based: instructional items, nonsensical items, and Fleischer-type items; non-item-based: total time and psychometric consistency) were directly compared in terms of classification accuracy and salience. As hypothesized, Fleischer-type items were viewed as less salient than nonsensical and instructional items. Fleischer-type items and total time proved to be the best identifiers of inattentive respondents.

The use of mobile for pre-employment testing (Erin Wood, PAN & Kelsey Stephens, PAN)

Abstract: With the rise of the smartphone, many job candidates are starting to use mobile devices to complete high-stakes pre-employment testing. Current research suggests that device type can affect test time and performance for certain types of assessments, specifically timed and cognitively loaded assessments. The present research has two goals: to expand upon recent findings by examining the impact of device type on test performance for image-heavy assessments, and to examine why test performance is affected on timed assessments. Results and implications will be discussed.

An application of Pareto-optimality to public safety selection data: Assessing the feasibility of optimal composite weighting (Maxwell G. Porter & Scott B. Morris, Illinois Institute of Technology)

Abstract: This study took an exploratory approach to examining the application of Pareto-optimal weighting schemes to real selection data from the public safety sector. Of particular interest was how well Pareto-optimal weighting estimates reflect observed validity and adverse impact statistics when applied to selection data. Pareto-optimal methodology was applied to entry-level public safety selection data from two U.S. municipalities. Results indicated that, overall, Pareto-optimal estimates remain robust when compared to real selection outcomes. Specifically, nearly identically shaped Pareto fronts were obtained when inputting both raw and corrected validity estimates, suggesting suitable sensitivity to varying input parameters. Practical challenges of implementing Pareto-optimal methods, as well as the implications of the findings, are discussed.
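The idea of a Pareto front over composite weights can be sketched for two predictors. The example below is an illustration under stated assumptions, not the study's analysis: the validities, subgroup differences, and predictor intercorrelation are invented, the standardized subgroup difference of the composite stands in for adverse impact, and the composite formulas assume standardized predictors.

```python
import math

# Hypothetical inputs for two standardized predictors.
VALIDITY = (0.50, 0.30)     # criterion validities
GROUP_DIFF = (1.00, 0.10)   # standardized subgroup differences (d)
R12 = 0.20                  # predictor intercorrelation

def composite_stats(w1):
    """Validity and subgroup difference of the composite w1*x1 + (1 - w1)*x2."""
    w2 = 1.0 - w1
    sd = math.sqrt(w1 * w1 + w2 * w2 + 2.0 * w1 * w2 * R12)
    validity = (w1 * VALIDITY[0] + w2 * VALIDITY[1]) / sd
    diff = (w1 * GROUP_DIFF[0] + w2 * GROUP_DIFF[1]) / sd
    return validity, diff

def dominates(a, b):
    """True if a is at least as good as b on both goals and better on one.

    Goals: maximize validity (index 0), minimize subgroup difference (index 1).
    """
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])

# Candidate weights on predictor 1 from 0.0 to 1.0 in steps of 0.1.
candidates = [(i / 10.0, composite_stats(i / 10.0)) for i in range(11)]

# Keep only the weightings that no other candidate dominates.
front = [
    (w, stats) for w, stats in candidates
    if not any(dominates(other, stats) for _, other in candidates)
]

for w, (validity, diff) in front:
    print(f"w1={w:.1f}  validity={validity:.3f}  d={diff:.3f}")
```

With these made-up inputs, the heaviest weights on the first predictor drop off the front because a slightly lower weight yields both higher validity and a smaller subgroup difference.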

Evaluating data (2:45 — 3:45)

Empowering Decision-Makers: Developing an Analytics Dashboard in R (Nick Redell & Kevin Kalinowski, National Board of Osteopathic Medical Examiners)

Abstract: RStudio has developed Shiny, an open-source analytics dashboard platform based on the R statistical language that makes open-access, real-time data access possible. Our presentation will provide a high-level overview of the open-source Shiny dashboard application, how a dashboard is built, how it functions in practice, and which types of dashboard presentations we have found most useful.

A Bayesian Robust IRT Outlier Detection Model (Nicole K. Ozturk & George Karabatsos, University of Illinois at Chicago)

Abstract: In the practice of Item Response Theory (IRT) modeling, item response outliers can lead to biased estimates of model parameters. We propose a Bayesian IRT model that is characterized by person outlier parameters in addition to person ability and item difficulty parameters, and by Item Characteristic Curves (ICCs) that are each specified by a robust Student's t distribution function. The outlier parameters, along with the Student ICCs, enable the model to provide more robust estimates of person and item parameters in the presence of outliers than a standard IRT model without outlier parameters. Hence, under this IRT model, it is not necessary to remove outlying items or persons from the data analysis, a practice that leads to a loss of information. We illustrate our model through the analysis of exam data involving dichotomous item scores and through the analysis of polytomous item response data.

Classification Accuracy with a Test Battery under Different Decision Rules (Ying Cheng & Cheng Liu, University of Notre Dame)

Abstract: Test scores from licensure or certification exams are naturally used for classification purposes, e.g., pass or fail. In some cases, test takers need to take a series of tests before they get the license or certificate. For example, the Uniform CPA exam has four sections, and a test taker needs to pass all four before being licensed. Tests that are not licensure or certification exams can also be used for classification decisions. For example, candidates taking the ASVAB need to obtain a sufficiently high composite score from ASVAB subareas to qualify for certain military occupational specialties (see https://en.wikipedia.org/wiki/Armed_Services_Vocational_Aptitude_Battery). It is also no secret that schools often use a composite of quantitative and verbal GRE or SAT scores to screen applicants, which is essentially the same as failing the test takers who are screened out. In this paper, we derive analytically the expected classification accuracy when: a) a test taker needs to pass every test in a test battery to pass; b) a test taker needs to obtain a composite score higher than a cut score on the composite score scale to pass; and c) a test taker needs to pass each individual test and obtain a high enough composite score to pass.
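The three decision rules can be written down compactly, and their classification accuracy can be approximated by simulation. The sketch below is only an illustration and swaps the paper's analytic derivation for a Monte Carlo estimate; it assumes four tests with independent standard-normal true scores, equal reliability, and invented cut scores, and all function names are hypothetical.

```python
import random

def passes_conjunctive(scores, cuts):
    """Rule (a): pass only if every test meets its own cut score."""
    return all(s >= c for s, c in zip(scores, cuts))

def passes_compensatory(scores, composite_cut):
    """Rule (b): pass if the summed composite meets the composite cut."""
    return sum(scores) >= composite_cut

def passes_combined(scores, cuts, composite_cut):
    """Rule (c): pass only if rules (a) and (b) are both satisfied."""
    return (passes_conjunctive(scores, cuts)
            and passes_compensatory(scores, composite_cut))

def simulate_accuracy(rule, n_examinees=5000, n_tests=4,
                      reliability=0.8, seed=1):
    """Share of examinees whose observed decision matches the true decision."""
    rng = random.Random(seed)
    # Error SD that yields the requested reliability for unit-variance true scores.
    error_sd = ((1.0 - reliability) / reliability) ** 0.5
    agreements = 0
    for _ in range(n_examinees):
        true_scores = [rng.gauss(0.0, 1.0) for _ in range(n_tests)]
        observed = [t + rng.gauss(0.0, error_sd) for t in true_scores]
        if rule(true_scores) == rule(observed):
            agreements += 1
    return agreements / n_examinees

# Hypothetical cut scores on the true-score scale.
CUTS = [-0.5] * 4
COMPOSITE_CUT = 0.0

accuracy = {
    "conjunctive": simulate_accuracy(lambda s: passes_conjunctive(s, CUTS)),
    "compensatory": simulate_accuracy(lambda s: passes_compensatory(s, COMPOSITE_CUT)),
    "combined": simulate_accuracy(lambda s: passes_combined(s, CUTS, COMPOSITE_CUT)),
}
print(accuracy)
```

Correlated true scores, unequal reliabilities, or different cut placements would change the comparison, which is where an analytic treatment is more informative than a single simulated scenario.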

Practice Differences and Item Parameter Drift in Computer Adaptive Testing (Beyza Aksu Dunya, University of Illinois at Chicago)

Abstract: Implementation of CAT in K-12 assessments is relatively new, and its benefits and impact are still being discovered. Despite benefits such as pinpointing a student's level more closely, one concern raised by experts relates to instructional and practice differences across schools, districts, and states. Research has repeatedly cited curriculum and practice differences as an important source of item parameter drift (IPD). The purpose of this simulation study is to evaluate the impact of IPD that occurs due to teaching and practice differences among test-takers on person parameter estimation and classification accuracy in CAT in a K-12 context. Factors such as the percentage of drifting items and the percentage of examinees receiving differential teaching and practice are manipulated in the study. This study aims to contribute to the IPD literature, particularly in education, where CAT is used for various purposes such as admissions and statewide evaluations. The findings could also provide useful information to testing organizations and test developers about the amount of drift that can result from curriculum and practice changes and the potential consequences of this drift for the calibration of pilot items in the item bank.

Closing comments (3:45 — 3:50)

Questions about the seminar may be directed to Alan Mead, Sam McAbee, or Kirk Becker. We hope you will join us.

Ideas in Testing Research Seminar Schedule, November 13, 2015

Coffee & Networking (9:15 — 9:45)

Welcome and Introduction (9:45 — 10:00)

Computerized Adaptive Testing (10:00 — 11:00)

Frontiers (11:20 — 12:20)

Lunch (12:30 — 1:30)

Employment Testing (1:30 — 2:30)

Evaluating data (2:45 — 3:45)

Closing comments (3:45 — 3:50)