Ideas in Testing Research Seminar Schedule, November 13, 2015
Coffee & Networking (9:15 — 9:45)
Welcome and Introduction (9:45 — 10:00)
Computerized Adaptive Testing (10:00 — 11:00)
Reducing burden when reporting patient-reported outcomes using
multidimensional computer adaptive testing (Scott B. Morris, Illinois
Institute of Technology; Michael Bass, Northwestern University; Mirinae
Lee, Illinois Institute of Technology; & Richard E. Neapolitan,
Northwestern University)
slides
Abstract: Utilization of patient-reported outcome (PRO)
measures has been limited by the lack of psychometrically sound measures
scored in real-time. The Patient Reported Outcomes Measurement Information
System (PROMIS) initiative developed a broad array of unidimensional
computer adaptive PRO measures. By only administering questions targeted
to the respondent's trait level, a computer adaptive test (CAT)
provides high measurement precision with substantially fewer items. The
goal of the current study was to determine the advantage in terms of test
efficiency (i.e., number of items required) for multidimensional CAT
(MCAT) relative to unidimensional CAT and fixed form scales. By taking
advantage of the correlations among traits, MCAT can further reduce the
number of items administered and therefore lessen patient burden. We
estimated a 3-dimensional IRT model of the PROMIS emotional distress
banks (anxiety, depression, and anger). Using these item parameters, we
conducted a Monte Carlo simulation to examine the relative performance of
fixed-item testing, unidimensional CAT, and MCAT in terms of accuracy of
trait estimates and number of items required.
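The source of this efficiency gain can be sketched in generic multidimensional IRT
notation (background only, not the authors' specific parameterization). With a
multivariate normal prior whose covariance matrix \Sigma encodes the correlations
among anxiety, depression, and anger, the posterior for the trait vector \theta
after administering the item set S is

    p(\theta \mid u) \propto \prod_{i \in S} P(u_i \mid \theta)\, \phi(\theta; 0, \Sigma),

so a response to an item targeting one dimension also tightens the posterior for the
correlated dimensions through the off-diagonal elements of \Sigma. This is why an
MCAT can reach a target precision with fewer items than three separate
unidimensional CATs.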
Multidimensional adaptive personality assessment: A real-data
demonstration using the 16PF questionnaire (Kevin Franke, Illinois
Institute of Technology)
Abstract: Multidimensional CAT (MCAT) promises efficient
measurement of attitudes and personality by leveraging the collateral
information from the correlated nature of these constructs. Previous
Monte Carlo research on MCAT administration of the 16PF personality
questionnaire using Segall's (1996) method has suggested that test
length could be substantially reduced with little loss of reliability. The
current study extends these Monte Carlo results to real-data simulations
using archival 16PF responses. This study is important for two reasons.
First, it is always important to show that simulated results generalize
to actual use. Even more importantly, recent research on personality has
suggested that traditional IRT models do not fit personality data well
and may not be the most appropriate models. If the IRT model is a poor
fit to 16PF data, the Monte Carlo results will not hold for real data.
On the other hand, if the real-data results replicate the simulation
results, we can conclude that traditional IRT models fit 16PF data
sufficiently well.
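For readers unfamiliar with real-data (post hoc) CAT simulation, the basic loop can
be illustrated with a deliberately simplified R sketch. It uses a unidimensional 2PL
stand-in with maximum-information selection; the item parameters, archival response
vector, and test length are placeholders, and the study itself uses Segall's
multidimensional method rather than this simplification.

    # Replay archival responses to the full bank, but let only adaptively
    # selected items contribute to the score (illustrative sketch only).
    real_data_cat <- function(responses, a, b, max_items = 8) {
      theta <- 0
      administered <- integer(0)
      for (step in seq_len(max_items)) {
        p    <- plogis(a * (theta - b))
        info <- a^2 * p * (1 - p)          # 2PL item information at current theta
        info[administered] <- -Inf         # never readminister an item
        nxt <- which.max(info)             # maximum-information selection
        administered <- c(administered, nxt)
        nll <- function(t) {               # re-estimate theta from items given so far
          pj <- plogis(a[administered] * (t - b[administered]))
          -sum(dbinom(responses[administered], 1, pj, log = TRUE))
        }
        theta <- optimize(nll, interval = c(-4, 4))$minimum
      }
      list(theta = theta, items = administered)
    }

Comparing theta from such a run against the estimate based on the full archival
response string is the real-data analogue of the Monte Carlo recovery check.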
CAT with ideal-point: Practical issues in applying GGUM to
employment CAT (Alan D. Mead, Talent Algorithms Inc.)
slides
Abstract: Recently I-O researchers have advocated for the
use of "ideal-point" or "unfolding" IRT models for personality
traits. While some researchers argue that responding to personality
items is an unfolding psychological process (where respondents indicate
agreement when the scaled extremity of the personality item matches their
standing on the latent trait), others have adopted a purely pragmatic
perspective and pointed out that ideal-point models seem to fit data
better. Although there have been several studies of the use of CAT with
dominance models, relatively little literature has examined the use
of ideal-point CAT. This paper will describe practical problems with
the scaling metric, information function, and response set that affect
ideal-point CAT implementations.
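For context, the generalized graded unfolding model (GGUM; Roberts, Donoghue, &
Laughlin, 2000) gives, for item i with response categories z = 0, 1, ..., C,

    P(Z_i = z \mid \theta) =
      \frac{\exp\{\alpha_i[z(\theta-\delta_i) - \sum_{k=0}^{z}\tau_{ik}]\} +
            \exp\{\alpha_i[(M-z)(\theta-\delta_i) - \sum_{k=0}^{z}\tau_{ik}]\}}
           {\sum_{w=0}^{C}\big(\exp\{\alpha_i[w(\theta-\delta_i) - \sum_{k=0}^{w}\tau_{ik}]\} +
            \exp\{\alpha_i[(M-w)(\theta-\delta_i) - \sum_{k=0}^{w}\tau_{ik}]\}\big)},

with M = 2C + 1, \tau_{i0} = 0, discrimination \alpha_i, item location \delta_i, and
thresholds \tau_{ik}. Because agreement peaks near \theta = \delta_i and falls off in
both directions, item information is typically bimodal around \delta_i, which is one
source of the item-selection and scaling complications the talk addresses.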
Comparison of Different Ability Estimation Methods for Strand
Scores in a Grade 6 Mathematics Computerized Adaptive Test (Johnny
Denbleyker, Houghton Mifflin Harcourt)
slides
Abstract: This study compares sub-score estimation methods
in a computer adaptive testing (CAT) environment. A unique aspect of
this study involves the analysis of student test scores across multiple
test opportunities within the accountability testing window for an NCLB
mathematics assessment. This allowed us to assess aspects of reliability in
a practical test-retest manner while accounting for error associated with
both the sampling of items and an occasion facet. The primary interest of
this study is to compare EAP estimation of ability, using a variable (strong)
prior based on total score, with maximum likelihood estimation
(MLE). Multiple facets of the frequentist (MLE) and Bayesian approaches
are considered to support the comparison between the two primary methods.
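In generic notation (not the operational scaling used here), the two estimators
being compared differ only in whether prior information enters the strand score:

    \hat{\theta}_{MLE} = \arg\max_{\theta} L(u \mid \theta), \qquad
    \hat{\theta}_{EAP} = \frac{\int \theta\, L(u \mid \theta)\, g(\theta \mid \text{total score})\, d\theta}
                              {\int L(u \mid \theta)\, g(\theta \mid \text{total score})\, d\theta},

where L(u | \theta) is the likelihood of the strand item responses and
g(\theta | total score) is the variable (strong) prior centered on the student's
total-test performance. The strength of that prior drives the shrinkage,
reliability, and bias trade-offs the comparison is designed to expose.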
Frontiers (11:20 — 12:20)
Item Difficulty Modeling on a Logical Reasoning Test (Kuan Xing,
University of Illinois at Chicago; Kirk Becker, Pearson)
slides
Abstract: Psychometric researchers and testing practitioners
must investigate the quality of large-scale tests. Checking item
parameters such as item difficulty is an important part of the test quality
control process. Furthermore, it is essential to investigate item
features and to use that information to predict the properties of items
generated in the future (Irvine, 2003). In this proposal, we describe
a pilot study on modeling item difficulty parameters using
logical reasoning items from a university admissions test.
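A minimal version of this kind of difficulty modeling regresses calibrated
difficulty on coded item features; the feature names below are hypothetical
stand-ins, since the actual features are not listed in the abstract.

    # items: one row per item, with calibrated difficulty b and coded features
    # (feature names here are illustrative placeholders only)
    fit <- lm(b ~ n_premises + negation + abstract_content + word_count, data = items)
    summary(fit)                        # which features predict difficulty?
    # predict(fit, newdata = new_items) would give expected difficulty for
    # newly generated items before any pretesting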
Modeling the Evaluative Content of Personality Questionnaires:
A Bifactor Application (Samuel T. McAbee, Illinois Institute
of Technology; Michael D. Biderman, University of Tennessee at
Chattanooga; Zhou "Job" Chen, University of Oregon; & Nhung Hendy,
Towson University)
slides
Abstract: Personality questionnaires are typically designed to
reduce the evaluative content of personality items, yet recent research
has identified a general factor present in personality measures that is
related to the desirability of the items themselves. To examine this
finding, the present study applied confirmatory bifactor analysis to
responses on the NEO-FFI-3 and HEXACO-PI-R. A general factor was found
for both inventories, and relationships between this general factor
and measures of positive and negative affect were assessed. Across
scales, this general factor demonstrated strong negative correlations
with measures of negative affect (negative affect and depression) and
strong positive correlations with measures of positive affect (positive
affect and self-esteem). Factor loadings on this general factor were
strongly related to third-party item valence ratings, suggesting that
the general factor present in both Big Five and Big Six data assesses
the evaluative aspects of personality items and is highly related to
general affective state.
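In the standard confirmatory bifactor specification used for this kind of analysis,
each item loads on the general factor and on exactly one group factor, with all
factors orthogonal:

    x_j = \lambda_{jG}\, G + \lambda_{jk}\, F_k + \varepsilon_j, \qquad
    \mathrm{cov}(G, F_k) = \mathrm{cov}(F_k, F_{k'}) = 0,

so G absorbs variance common to all items regardless of content domain. That
separation is what allows the loadings \lambda_{jG} to be compared with third-party
item valence ratings, as described above.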
Automated scoring of open-ended mechanical aptitude items (Alan
D. Mead, Talent Algorithms Inc.; Diana Bairaktarova, Virginia Tech;
& Anna Woodcock, California State University San Marcos)
Abstract: Multiple-choice items have dominated the testing
industry, but modern computerized assessments offer the possibility of
innovative items that may improve upon multiple-choice items. Previous
research has shown that fill-in-the-blank (FITB) open-ended questions are
more difficult but dramatically more reliable (i.e., higher item-total
correlations). The current study examines automated algorithms for
accurately and validly scoring open-ended responses to mechanical
aptitude items.
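As a rough illustration of one family of scoring approaches consistent with this
setup (not necessarily the algorithms studied), a key-matching scorer with simple
normalization and approximate string matching could look like the following; the key
list and tolerance are placeholders.

    # Normalize a short answer, then credit it if any keyed answer
    # approximately appears in the response (illustrative only).
    normalize <- function(x) gsub("[^a-z0-9 ]", "", tolower(trimws(x)))

    score_fitb <- function(response, key, max_distance = 0.2) {
      resp <- normalize(response)
      hit  <- vapply(normalize(key),
                     function(k) agrepl(k, resp, max.distance = max_distance),
                     logical(1))
      as.integer(any(hit))
    }

    score_fitb("Gear B turns clockwise", c("clockwise", "to the right"))   # 1

A production scorer would also need to handle negations and near-miss keys (for
example, "clockwise" keyed when "counterclockwise" was written), which is where the
accuracy and validity questions in the study come in.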
Lunch (12:30 — 1:30)
Employment Testing (1:30 — 2:30)
A comparison of different methods of detecting inattentive
responding on self-report personality measures (Avi Fleischer,
Tetrics LLC and Illinois Institute of Technology)
slides
Abstract: This study sought to identify the best indices
for detecting inattentive respondents in unmotivated samples completing
self-report measures. Data were collected under attentive and inattentive
conditions, and five detection methods (item-based: instructional
items, nonsensical items, and Fleischer-type items; non-item-based:
total time and psychometric consistency) were directly compared in
terms of classification accuracy and salience. As hypothesized,
Fleischer-type items were viewed as less salient than nonsensical and
instructional item types. Fleischer-type items and total time proved to
be the best identifiers of inattentive respondents.
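To make the non-item-based indices concrete, here is a minimal sketch of a
total-time flag and a simple even-odd consistency index; the data layout, scale
list, and thresholds are placeholders rather than the study's operationalization.

    # resp: persons x items matrix; scales: list of item columns per scale;
    # seconds: total completion time per person. Thresholds are arbitrary.
    consistency <- function(resp, scales) {
      odd  <- sapply(scales, function(ix) rowMeans(resp[, ix[c(TRUE, FALSE)], drop = FALSE]))
      even <- sapply(scales, function(ix) rowMeans(resp[, ix[c(FALSE, TRUE)], drop = FALSE]))
      # per-person correlation of odd-half and even-half scale scores across scales
      sapply(seq_len(nrow(resp)), function(p) cor(odd[p, ], even[p, ]))
    }

    flag_inattentive <- function(resp, scales, seconds,
                                 min_seconds = 180, min_consistency = 0.30) {
      (seconds < min_seconds) | (consistency(resp, scales) < min_consistency)
    }

The item-based screens (instructional, nonsensical, and Fleischer-type items) are
scored directly from the flagged items themselves, so they need no analogue here.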
The use of mobile for pre-employment testing (Erin Wood,
PAN & Kelsey Stephens, PAN)
slides
Abstract: The rise of the smartphone means that many job candidates
now use mobile devices to complete high-stakes pre-employment tests.
Current research suggests that device type can affect testing time and
performance for certain types of assessments, specifically timed and
cognitively loaded assessments. The present research has two goals: to
expand on recent findings by examining the impact of device type on test
performance for image-heavy assessments, and to examine why performance
on timed assessments is affected by device type. Results and implications
will be discussed.
An application of Pareto-optimality to public safety selection data:
Assessing the feasibility of optimal composite weighting (Maxwell
G. Porter & Scott B. Morris,
Illinois Institute of Technology)
Abstract: This study took an exploratory approach to examining
the application of Pareto-optimal weighting schemes to real selection
data from the public safety sector. Of particular interest was the
evaluation of how well Pareto-optimal weighting estimates reflect
observed validity and adverse impact statistics when applied to selection
data. Pareto-optimal methodology was applied to entry-level public safety
selection data from two U.S. municipalities. Results indicated that, overall,
Pareto-optimal estimates were robust when compared to real selection
outcomes. Specifically, nearly identically shaped Pareto fronts were
obtained whether raw or corrected validity estimates were used as input,
suggesting that the method is suitably insensitive to variation in its
input parameters. Practical
challenges of implementing Pareto-optimal methods as well as the
implications of the findings are discussed.
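To illustrate the kind of trade-off curve being estimated, the composite validity
and composite subgroup difference can be computed over a sweep of weights using
standard composite formulas; the predictor correlations, validities, and subgroup
differences below are placeholders, not the study's data.

    # Two standardized predictors: R = intercorrelation matrix, v = criterion
    # validities, d = standardized subgroup differences (placeholder values).
    # For simplicity R serves as both overall and pooled within-group correlations.
    R <- matrix(c(1, .25, .25, 1), 2, 2)
    v <- c(.50, .30)
    d <- c(1.00, .20)

    tradeoff <- t(sapply(seq(0, 1, by = 0.01), function(w1) {
      w <- c(w1, 1 - w1)
      s <- sqrt(drop(t(w) %*% R %*% w))        # SD of the weighted composite
      c(weight1     = w1,
        validity    = drop(w %*% v) / s,       # composite criterion validity
        d_composite = drop(w %*% d) / s)       # composite subgroup difference
    }))
    head(tradeoff)   # the Pareto front keeps the weightings that are not
                     # dominated on both validity and adverse impact

With more predictors, the sweep is over a weight simplex and a non-dominated filter
extracts the front, but the trade-off logic is the same.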
Evaluating Data (2:45 — 3:45)
Empowering Decision-Makers: Developing an Analytics Dashboard in R
(Nick Redell & Kevin Kalinowski, National Board of Osteopathic
Medical Examiners)
Abstract: RStudio has developed Shiny, an open-source analytics
dashboard platform based on the R statistical language that makes open,
real-time access to data possible. Our presentation will provide a
high-level overview of the open-source Shiny dashboard application, how
a dashboard is built, how it functions in practice, and what types of
dashboard presentations we have found to be the most useful.
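For readers new to Shiny, a dashboard is just a UI definition paired with a server
function and launched with shinyApp(); the minimal, generic example below is
illustrative and is not the NBOME dashboard.

    library(shiny)

    ui <- fluidPage(
      titlePanel("Score distribution"),
      sidebarLayout(
        sidebarPanel(sliderInput("n", "Number of examinees:",
                                 min = 50, max = 2000, value = 500)),
        mainPanel(plotOutput("hist"))
      )
    )

    server <- function(input, output) {
      output$hist <- renderPlot({
        hist(rnorm(input$n, mean = 500, sd = 100),
             main = "Simulated scaled scores", xlab = "Score")
      })
    }

    shinyApp(ui = ui, server = server)

Real dashboards typically swap the simulated draw for a reactive database query so
that the display updates as new examinee records arrive.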
A Bayesian Robust IRT Outlier Detection Model (Nicole K. Ozturk
& George Karabatsos, University of Illinois at Chicago)
Abstract: In the practice of Item Response Theory (IRT)
modeling, item response outliers can lead to biased estimates of model
parameters. We propose a Bayesian IRT model that is characterized by person
outlier parameters in addition to person ability and item difficulty
parameters, and by Item Characteristic Curves (ICCs) that are each
specified by a robust Student's t distribution function. The
outlier parameters, along with the Student's t ICCs, enable the model to
provide more robust estimates of person and item parameters in the
presence of outliers, compared to a standard IRT model without outlier
parameters. Hence, under this IRT model, it is not necessary to remove
outlying items or persons from the data analysis, a practice that leads to
a loss of data information. We illustrate our model through the analysis
of exam data involving dichotomous item scores, and through the analysis
of polytomous item response data.
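One way to read this specification, offered purely as a sketch and not as the
authors' exact parameterization, is a Rasch-type model whose ICC is a Student's t
distribution function and whose person outlier parameter inflates that person's
scale:

    \Pr(Y_{pi} = 1 \mid \theta_p, \delta_i, \sigma_p) = F_{\nu}\!\left(\frac{\theta_p - \delta_i}{\sigma_p}\right),

where F_\nu is the Student's t c.d.f. with \nu degrees of freedom and \sigma_p grows
for persons whose responses are outlying. The heavy tails of F_\nu bound the
influence of surprising responses on \theta_p and \delta_i, which is the robustness
property the abstract describes.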
Classification Accuracy with a Test Battery under Different Decision
Rules (Ying Cheng & Cheng Liu, University of Notre Dame)
Abstract: Test scores from licensure or certification
exams are naturally used for classification purposes, e.g., pass
or fail. In some cases, test takers need to take a series of tests
before they get the license or certificate. For example, the Uniform
CPA Exam has four sections, and a test taker needs to pass all four
before being licensed. Tests that are not licensure or certification
exams can also be used for classification decisions. For example,
candidates taking the ASVAB need to obtain a sufficiently high composite
score from ASVAB subtests to qualify for certain military occupational
specialties (see
https://en.wikipedia.org/wiki/Armed_Services_Vocational_Aptitude_Battery).
It is also no secret that schools often use a composite of quantitative
and verbal GRE or SAT scores to screen applicants, which is essentially
a pass/fail classification of those applicants. In this paper,
we derive analytically the expected classification accuracy when: a) a
test taker needs to pass every test in a test battery to pass; b) a
test taker needs to obtain a composite score higher than a cut score on
the composite score scale to pass; and c) a test taker needs to pass each
individual test and obtain a high enough composite score to pass.
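The three decision rules can be written compactly with observed test scores X_k,
test-level cut scores c_k, composite weights w_k, and a composite cut score c:

    \text{(a) pass} \iff X_k \ge c_k \text{ for every } k, \qquad
    \text{(b) pass} \iff \sum_k w_k X_k \ge c, \qquad
    \text{(c) pass} \iff \text{both (a) and (b) hold.}

Expected classification accuracy is then the probability, over the joint
distribution of true and observed scores, that the decision based on the observed
scores agrees with the decision that would be made from the true scores.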
Practice Differences and Item Parameter Drift in Computer Adaptive
Testing (Beyza Aksu Dunya, University of Illinois at Chicago)
Abstract: Implementation of CAT in K-12 assessments
is relatively new and its benefits and impact are still being
discovered. Despite the benefits such as pinpointing a student's level
more closely, one concern raised by experts is related to instructional
and practical differences across schools, districts, and states. Research
has repeatedly cited curriculum and practice differences as an important
source of IPD. The purpose of this simulation study is to evaluate
impact of item parameter drift (IPD) that occurs due to teaching and
practice differences among test-takers on person parameter estimation
and classification accuracy in CAT in K-12 context. Factors such as
percentage of drifting items and percentage of examinees receiving
differential teaching and practice are manipulated in the study.
This study aims to contribute to the IPD literature particularly in
education where CAT is used for various purposes such as admissions and
statewide evaluations. The findings could also provide useful information
to testing organizations and test developers about the amount of drift
that can result from curriculum and practice changes and the potential
consequences of this drift on the calibration of pilot items in the item
bank.
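The core manipulation can be sketched in a few lines of R; the bank size, drift
percentage, exposure percentage, and drift size below are placeholders, and the
actual CAT algorithm, response generation, and classification rules belong to the
study design.

    n_items     <- 300
    pct_drift   <- 0.10      # proportion of bank items that drift
    pct_exposed <- 0.30      # proportion of examinees receiving differential practice
    drift_size  <- -0.5      # drifting items become easier by half a logit

    b_bank  <- rnorm(n_items)                               # undrifted calibration
    drifted <- sample(n_items, size = round(pct_drift * n_items))

    b_true <- function(exposed) {            # difficulties governing actual responses
      b <- b_bank
      if (exposed) b[drifted] <- b[drifted] + drift_size
      b
    }
    # In the full simulation, responses are generated from b_true(exposed) inside
    # the CAT, abilities are estimated with b_bank, and person estimates and
    # classification decisions are compared with and without drift.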
Closing comments (3:45 — 3:50)
Questions about the seminar may be directed to Alan Mead, Sam McAbee,
or Kirk Becker. We hope you will join us.