2016 Ideas in Testing Research Seminar, October 21, 2016, IIT, Chicago, IL

The Effect of Random Guessing on Reliability and Validity. (Ying "Alison" Cheng & Max Hong, University of Notre Dame)

Abstract: In high-stakes testing, test takers who do not have enough time to complete a test may rush toward the end and engage in random guessing when the test does not penalize guessing. Via mathematical derivations and simulations, Attali (2005) showed that such random guessing responses may lower reliability.

Meanwhile, some believe random guessing responses actually increase estimates of reliability. Supporting this belief, Wise & DeMars (2009) showed that random guessing does in fact inflate reliability estimates under certain conditions. Unmotivated participants who complete low-stakes exams or scales may omit or respond incorrectly to questions throughout the test. A greater proportion of such responses can inflate reliability estimates as a result. This issue is more common for low-stakes tests, such as psychological assessments or surveys.

Our research attempts to bridge the gap between these two positions. In this paper, we provide analytical and empirical evidence that random guessing responses can lead to lower or higher estimates of inter-item correlations, Cronbach's alpha, factor loadings, and item-total correlations, depending on the prevalence of such responses and how they are scored. Furthermore, we extend previous research by reporting how such responses affect various forms of validity.
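The dependence of Cronbach's alpha on the prevalence of guessing can be explored with a small simulation. The following is only a minimal sketch, not the authors' procedure: it assumes a Rasch model for effortful responses, a guessing probability of 0.25 (a hypothetical four-option item), and guessing confined to the last few items. The function names and all parameter values are illustrative assumptions.

```python
import math
import random
import statistics


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of examinee rows (one item score per column)."""
    k = len(scores[0])
    item_vars = [statistics.variance([row[j] for row in scores]) for j in range(k)]
    total_var = statistics.variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def simulate(n_persons=1000, n_items=20, guess_rate=0.0, n_speeded=5, seed=1):
    """Simulate 0/1 responses; a share of examinees guess on the last items.

    Assumptions (not from the abstract): Rasch responses, evenly spaced
    difficulties, and a guessing success probability of 0.25.
    """
    rng = random.Random(seed)
    difficulties = [-1.5 + 3.0 * j / (n_items - 1) for j in range(n_items)]
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0, 1)
        guesser = rng.random() < guess_rate
        row = []
        for j, b in enumerate(difficulties):
            if guesser and j >= n_items - n_speeded:
                p = 0.25  # random guess on an end-of-test item
            else:
                p = 1 / (1 + math.exp(-(theta - b)))  # effortful Rasch response
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data


# Compare alpha across hypothetical prevalences of guessing.
for rate in (0.0, 0.1, 0.3):
    print(rate, round(cronbach_alpha(simulate(guess_rate=rate)), 3))
```

Changing how guessed responses are scored (for example, treating them as omitted) or where they occur in the test would be the natural next variation to try.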

Quantifying the Item Order Effect on Item Difficulty in Large-Scale Testing. (Kuan Xing, University of Illinois at Chicago; & Kirk Becker, Pearson VUE)

Abstract: While many studies have explored the effect of item order, testing professionals remain concerned about changing or randomizing item order. For this study, item difficulty is computed and compared based on the relative position in which the item was administered. Strong evidence shows no significant effect of item order on item difficulty.

Response time (10:40 — 11:25)

Detection of Test Speededness Using Change-Point Analysis with Response Time Data. (Can Shao, National Board of Osteopathic Medical Examiners; & Ying "Alison" Cheng, University of Notre Dame)

Abstract: Imposing a time limit is usually a practical administration necessity for educational and psychological tests. A test is speeded when some examinees do not have sufficient time to fully consider every item on the test within a fixed time limit (Bejar, 1985). Using response time to detect speededness has great potential. In this paper, we propose applying change-point analysis to item-level response time data. Our simulation study demonstrates how change-point analysis can help detect test speededness using response time data. This may help practitioners monitor the point at which examinees begin to give speeded responses and set an appropriate test time limit.
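As a rough illustration of the idea, a change point in one examinee's item-level log response times can be located by scanning every possible split and comparing the mean before and after it. This is a generic sketch, not the statistic used in the paper: the Welch-type contrast, the `min_segment` parameter, and the example data are all assumptions made for illustration.

```python
import math
import statistics


def detect_change_point(log_rts, min_segment=3):
    """Return (k, statistic) for the split that best separates the two segments.

    k is the number of items before the change; the statistic is the absolute
    Welch-type t contrast between mean log RT before and after the split.
    """
    best_k, best_stat = None, 0.0
    n = len(log_rts)
    for k in range(min_segment, n - min_segment + 1):
        before, after = log_rts[:k], log_rts[k:]
        se = math.sqrt(statistics.variance(before) / len(before)
                       + statistics.variance(after) / len(after))
        if se == 0:
            continue  # no variability in either segment; contrast undefined
        stat = abs(statistics.mean(before) - statistics.mean(after)) / se
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat


# Hypothetical examinee: 14 items at a normal pace, then 6 rushed items.
normal_pace = [4.0, 4.1, 3.9, 4.2, 3.8, 4.0, 4.1, 3.9, 4.0, 4.2, 3.8, 4.1, 3.9, 4.0]
rushed = [2.0, 2.1, 1.9, 2.2, 1.8, 2.0]
k, stat = detect_change_point(normal_pace + rushed)
print(k)
```

In practice the maximum statistic would also need a critical value (for example, from simulation) before an examinee is flagged as speeded.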

Researching Technology-Enhanced Items Incorporating Response Time Modeling. (Johnny Denbleyker, Houghton Mifflin Harcourt)

Abstract: Advances in computer-based testing over the past decade, along with increased interest in measuring constructs typically deemed difficult to measure with standard MC items, have permitted the development of a rather large number of technology-enhanced (TE) item formats. However, despite advancements in authoring, administering, and scoring these new formats, more and better research is needed to investigate their realized psychometric properties (Zenisky & Sireci, 2002; Jodoin, 2003; Lorie, 2014; Parshall, Harmes, Davey, & Pashley, 2010). This presentation gives an overview of various operational research applications of technology-enhanced items, with a particular focus on applying response time modeling.

The Application of Response Time Modeling in Reviewing Quality and Option Choice for TEI Items. (Linlin Wu, University of Chicago; & Johnny Denbleyker, Houghton Mifflin Harcourt)

Abstract: Research on response time (RT) modeling is growing (see Lee and Chen, 2011; van der Linden, 2011, for reviews), especially as information that used to be cumbersome to gather is now readily accessible through computer-based testing. In operational psychometric work, however, response time is still used primarily in a descriptive fashion. The lack of practical applications of RT models might be due to the unavailability of commercial software for RT modeling as well as a reluctance to report scores based on any information other than the accuracy of responses. Nonetheless, we argue that testing agencies should borrow the strength of RT modeling to help review field test items, if not for final operational calibration and scoring. In this study, we illustrate that, using currently available software, a hierarchical RT model can facilitate the review of technology-enhanced items in terms of both item quality and the options selected by examinees. The hierarchical RT model can provide more, and more accurate, information about field test items, so that more informed decisions can be made when selecting them as future operational items. Extending the investigation of response time to TEI option choices can help test developers better understand the efficiencies of various TEI formats, which can potentially lead to TEI items with improved psychometric characteristics.

CAT (11:30 — 12:00)

CAT Stopping Rules with Skewed Item Pools. (Scott Morris, Illinois Institute of Technology; Michael Bass, & Richard Neapolitan, Northwestern University)

Abstract: Rather than administering a fixed number of items, computer adaptive testing (CAT) allows test length to be tailored to match the needs of each examinee. A common approach is to administer items until a specified level of measurement precision is reached (e.g., SE < .32). This approach is effective when all trait levels are well represented in the item bank. However, in some settings, the item bank can be misaligned with the trait distribution, such that there are some trait levels where the available items provide little information. This can be found in assessments of patient-reported health outcomes, such as emotional distress (e.g., anxiety, depression) and physical functioning, where items have been primarily developed to differentiate among levels of dysfunction. In such cases, respondents on the positive end of the scale might be asked a large number of questions without ever achieving the SE cutoff.

Choi, Grady & Dodd (2010) proposed an alternative approach based on the predicted SE reduction (PSER). If no item is expected to substantially improve measurement precision, there is no point administering additional items and the exam would stop, regardless of whether the SE cutoff has been reached. This approach could substantially reduce the number of items administered to individuals for whom the item bank provides little information. We develop an approach to optimize the parameters of the PSER approach and describe several studies comparing its performance (i.e., test length and trait estimation accuracy) to other stopping rules (e.g., fixed length, SE cutoff, ad hoc rules).
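The logic of combining an SE cutoff with a predicted-SE-reduction check can be sketched as follows. This is a simplified illustration rather than the PSER procedure of Choi, Grady & Dodd (2010) or the optimized version described above: it assumes 2PL items, approximates the posterior SE from Fisher information at the current trait estimate plus a standard normal prior, and uses a hypothetical minimum reduction of 0.01.

```python
import math


def item_information(theta, a, b):
    """Fisher information of a 2PL item (discrimination a, difficulty b)."""
    p = 1 / (1 + math.exp(-a * (theta - b)))
    return a * a * p * (1 - p)


def posterior_se(theta_hat, items, prior_precision=1.0):
    """Approximate posterior SE: prior precision plus summed item information."""
    info = prior_precision + sum(item_information(theta_hat, a, b) for a, b in items)
    return 1 / math.sqrt(info)


def should_stop(theta_hat, administered, remaining, se_cutoff=0.32, min_reduction=0.01):
    """Return (stop, reason) for the current state of an adaptive test."""
    se_now = posterior_se(theta_hat, administered)
    if se_now < se_cutoff:
        return True, "SE cutoff reached"
    if not remaining:
        return True, "item bank exhausted"
    # Predicted SE if the most helpful remaining item were administered next.
    best_se = min(posterior_se(theta_hat, administered + [item]) for item in remaining)
    if se_now - best_se < min_reduction:
        return True, "predicted SE reduction too small"
    return False, "continue"


# Hypothetical respondent at the healthy end of a bank built to measure dysfunction.
administered = [(1.5, -2.0)] * 5
remaining = [(1.5, -2.0), (1.5, -1.0)]
print(should_stop(3.0, administered, remaining))
```

With a well-targeted bank the SE rule fires first; with a skewed bank the reduction check ends the test early instead of exhausting the items.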

Multidimensional Computerized Adaptive Test for the Big Five Personality Assessment. (Justin Kern, Susu Zhang, Tianjun Sun, Bo Zhang, Rachel Amrhein, Alexis Deceanne, Angela Lee, University of Illinois at Urbana-Champaign)

Abstract: Organizations often include personality assessments in the system for hiring new employees. Current personality inventories are long, often containing several hundred items, which can burden test-takers. Computerized Adaptive Testing (CAT) selects test items to optimize test-takers' trait estimation accuracy and efficiency, reducing test length without sacrificing estimation precision. Therefore, a CAT for personality assessment would be desirable for organizations to obtain accurate personality estimates without long, tedious tests.

One prominent personality model is the Big Five Model (Costa & McCrae, 1992). A test-taker's personality is summarized by five factors: Agreeableness, Conscientiousness, Extraversion, Neuroticism, and Openness. The Comprehensive Personality Scale (CPS; Wang, 2013) is a 440-item measure of the five dimensions. Each personality factor contains several specific facets, such as trustworthiness and warmth for Agreeableness. Although items are designed to assess one facet of a specific factor, in reality, the factors are correlated with one another, making items multidimensional (Reise, Waller, & Comrey, 2000). In the current study, we exploit the relationships among personality items and expedite assessments by developing a multidimensional CAT for the CPS personality scale.

Simulations are conducted to compare the ideal point model-based personality CAT to the more standard multidimensional 2PL-IRT model. Several different conditions were simulated. Results are evaluated by trait estimation accuracy (Bias and MSE), and item exposure rate.

Lunch (12:00 — 1:00)

Organizational Methods A (1:00 — 1:30)

Using O*NET to Identify and Organize Job Characteristics that Inform the Predictive Power of Personality. (Jason Way, ACT; & Jeremy Burrus, Professional Examination Services)

Abstract: It has long been theorized that we can improve prediction of job-related behavior from measures of personality by identifying job characteristics that allow for the expression of individual differences (e.g., Mischel, 1968). Using O*NET data and informed by the theories of situational strength and Trait Activation Theory (Tett & Burnett, 2003), the current paper develops a framework for job characteristics that could improve the extent to which we can predict behavior from personality. More specifically, it investigates relationships between the O*NET variables of Work Styles, Generalized Work Activities (GWA), and Work Context. Because the point of the current analysis was to identify job characteristics that would affect the predictive validity of personality measures for jobs that employ the most people, and to avoid jobs that employ very few people from contributing too much weight to the analysis, the analysis was restricted to the most frequently held jobs that cumulatively employed 70% of the people in the U.S. This was done by merging the May 2014 employment data from the Bureau of Labor Statistics (http://www.bls.gov/oes/) into the database and selecting jobs that employed 70% of the U.S. workforce. This eliminated 771 jobs and resulted in a sample of 117 jobs for the study.

The goal of the analyses was to reduce the O*NET GWA and Work Context variables into a manageable set of job characteristics that could help workforce researchers determine when, and which, personality dimensions will be most predictive of job performance. The first step of the analyses was to attempt to reduce the number of variables in the analysis by creating summed scales. Because exploratory factor analyses did not produce interpretable solutions for Work Styles and Work Contexts, we used a rational approach based on item intercorrelations. For GWAs, exploratory factor analysis did produce an interpretable solution, although expert judgment had to be used in the placement of several items. The second step was to identify the Work Context and GWA scales that were most highly related to the work styles scales by correlating them with the work styles scales. Rational decisions were made about which scales should remain in the analysis and which should be dropped. The third step was to regress the Work Styles scales on the remaining Work Context and GWA scales to examine which scales predict Work Styles while simultaneously controlling for other scales. Once again, rational decision rules were applied when dropping scales.

The remaining variables constituted the final work characteristic framework. The final list of job characteristics includes public speaking, conflict, lack of constraints, not-in-person communication, working with information, and helping others. This set of characteristics accounted for an impressive amount of variance in job incumbents' ratings of the importance of several personality-related dimensions (i.e., Work Styles) to performance on the job, with R² ranging from .60 to .74. Limitations and future directions are discussed.
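The sample restriction described in this abstract (keeping the most frequently held jobs that cumulatively cover 70% of employment) amounts to a sort followed by a running total. The sketch below shows one way this could be done; the function name, the job labels, and the employment counts are hypothetical, and the exact tie-breaking and boundary handling used in the study are not stated in the abstract.

```python
def select_top_jobs(employment, coverage=0.70):
    """Pick the largest jobs until they cumulatively reach the coverage share.

    employment maps a job label to its number of workers.
    """
    total = sum(employment.values())
    selected, running = [], 0
    for job, count in sorted(employment.items(), key=lambda kv: kv[1], reverse=True):
        if running >= coverage * total:
            break  # the jobs selected so far already cover the target share
        selected.append(job)
        running += count
    return selected


# Hypothetical employment counts, for illustration only.
jobs = {"job_a": 50, "job_b": 30, "job_c": 15, "job_d": 5}
print(select_top_jobs(jobs))
```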

Curvilinear Personality-Performance Relationships: Insights from Observer Reports. (Sarah Rusakiewicz & Sam McAbee, Illinois Institute of Technology)

Abstract: Personality researchers have posited that although moderate to high levels of several Big Five personality traits enable successful task and contextual performance in the workplace, having extremely high levels of these traits might actually inhibit performance (Benson & Campbell, 2007; Pierce & Aguinis, 2013). Note, however, that research has not consistently supported these findings, yielding mixed results when attempting to demonstrate curvilinear relations (Day & Silverman, 1989; Cucina & Vasilopoulos, 2005; Robie & Ryan, 1999). Moreover, findings from recent simulation research indicate that misalignment between the selection methods and the true shape of the personality-performance relationship can result in notable decrements in the average performance of selected employees (Converse & Oswald, 2014). Thus, a fuller understanding of curvilinear relationships and how to design selection and assessment methods to reflect such relations between personality and performance is needed.

Research has shown that observer-reports of the Big Five often demonstrate stronger relationships with performance outcomes than do self-reports of these same traits (e.g., Connelly & Ones, 2010; Oh, Wang, & Mount, 2011). For example, while meta-analytic estimates of the relationship between self-reports of Conscientiousness and overall performance ratings typically fall around .20 (e.g., Barrick, Mount & Judge, 2001), correlations between observer reports of Conscientiousness and performance are often noticeably stronger (e.g., ρ = .29; Connelly & Ones, 2010). Indeed, research has consistently shown that observer-reports of the Big Five predict incremental variance in performance outcomes over and above self-reports of these traits (Oh et al., 2011).

Research on curvilinear relations has relied too heavily on self-report measures, which could be related to the inconclusive findings regarding if and when these relationships exist. To date, however, limited research has examined whether observer-reports of the Big Five also demonstrate curvilinear relationships with work outcomes. The proposed study will examine curvilinear relations between personality and performance for self- and observer-reports of the Big Five personality traits using the ideal point model. Unlike the traditional dominance model, which assumes agreement with an item is linearly related to a respondent's standing on a trait, an ideal point model assumes that an individual will endorse an item that reflects their standing on the trait, and will fail to endorse an item that reflects either too high or too low a standing on the trait. This allows for more flexibility and is more likely to find a curvilinear relationship if it does exist (Carter et al., 2014).

If such curvilinear relations exist, it stands to reason that observer evaluations of personality should corroborate self-report studies that have found such a relationship. However, if the present study does not find a pattern of curvilinear results consistent with those observed for self-reports (e.g., Carter et al., 2014), this may suggest that the curvilinear relations observed between personality and performance are an artifact of the self-report process. By examining third-party observations of personality, researchers might gain an increased understanding of the role of curvilinearity in personality-performance relations.
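The contrast between dominance and ideal point response processes described in this abstract can be made concrete with two toy item response functions. These are deliberately simplified stand-ins chosen for illustration (a logistic curve and a Gaussian-shaped unfolding curve); they are not the specific ideal point model used in the study, and all parameter names and values are assumptions.

```python
import math


def dominance_prob(theta, a=1.0, b=0.0):
    """Dominance process: endorsement probability rises monotonically with theta."""
    return 1 / (1 + math.exp(-a * (theta - b)))


def ideal_point_prob(theta, delta=0.0, width=1.0):
    """Ideal point process: endorsement peaks where theta matches the item
    location delta and falls off on both sides."""
    return math.exp(-((theta - delta) ** 2) / (2 * width ** 2))


# Endorsement probabilities across a range of trait levels.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, round(dominance_prob(theta), 3), round(ideal_point_prob(theta), 3))
```

Under the ideal point curve, a respondent far above the item's location is as unlikely to endorse it as one far below, which is what allows such models to accommodate curvilinear patterns.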

Large-Scale Testing B (1:35 — 2:05)

Exploring Rubric-Based Multidimensionality in Polytomously-Scored Test Items. (Daniel J. Adams & Daniel M. Bolt, University of Wisconsin-Madison)

Abstract: Items scored polytomously may display multidimensionality across score categories. Reckase (2009) considered an example involving a writing sample in which the rubric for scoring the writing might at lower item score levels distinguish with respect to simple writing mechanic skills, but at higher item scores in terms of organization and style of writing (Reckase, 2009; pp.110-111). Another example might be a mathematical word problem for which lower score categories may be distinguished more by language ability, while higher score categories may be more related to mathematical problem-solving. In both cases, even though the items may not be intended to display multidimensionality, there is the potential for rubric-related multidimensionality to emerge.

The goal of our paper is to illustrate the use of the multidimensional nominal response model to capture the situation considered by Reckase (2009). This approach is exploratory with respect to the scoring of items across dimensions. We illustrate this approach by showing the possibility of empirically estimating the scoring functions in a multidimensional model, as traditionally occurs when using the nominal response model in unidimensional applications. Importantly, in the multidimensional case the estimated scoring function can assume possibly different values across dimensions, a condition not captured by traditional models (e.g., the multidimensional partial credit model, the multidimensional graded response model) applied to polytomous items.

We provide two sets of illustrations in exploring rubric-related multidimensionality: one by simulation and one by real data analysis. For the simulations, we use two datasets both having 2,000 examinees, 20 items, and 2 dimensions. The items of the first dataset have three score categories where only dimension 1 is measured by the distinction between the lowest two score categories (2 and 1, in this case) and only dimension 2 is measured by the distinction between the highest two (3 and 2). By contrast, the items of the second dataset have four score categories where the main difference is that score categories 2 and 3 now reflect an equally weighted composite of dimensions 1 and 2.

Our real data example uses item response data from TIMSS 2007 administered to 7377 eighth graders in the United States. The students responded to 214 math items (116 MC, 98 CR) and 222 science items (104 MC, 118 CR). As both the math and science assessments contain a mix of multiple-choice and constructed response items, we fit two-dimensional models in which all items were assumed to measure a content dimension (math or science), but only constructed response items measured a format dimension. Our interest is focused on constructed response items scored polytomously (0,1,2). Application of a nominal model permits detection of certain items that show varying sensitivities to the content and format dimensions across score categories. Several examples are presented for illustration. All models are fit using the Latent Gold software.

Developing and Applying Standards in the New Continuous Assessment Framework for Medical Recertification. (Tara McNaughton & Lisa A. Reyes, Measurement Incorporated)

Abstract: Recently, several medical boards have started to adopt a continuous assessment framework for their maintenance of certification (MOC) exams. Previous MOC guidelines have required medical practitioners to pass a long, comprehensive exam once every 10-year period. Newer guidelines call for ongoing testing using shorter modules to regularly assess competency while promoting lifelong learning. While the focus of adopting this new continuous assessment framework has been on how to develop and administer these shorter modules, what has received less attention is how to develop defensible standards and apply them longitudinally. The goal of the present study is to explore methods that may assist boards in developing comparable passing standards under the new MOC continuous assessment framework. As one way to assess competence longitudinally, we propose a two-stage standard setting procedure. The first stage consists of applying a standard derived from modified Angoff ratings to each test module. The second stage consists in determining how many occurrences of substandard performance would be acceptable across the modules. To investigate the effects of applying this standard setting method, we used data from a previously administered linear exam to simulate scoring profiles for nine separate modules given across time. We then compared pass-fail determinations between three scoring conditions: 1) Applying an Angoff-derived standard to the original exam, 2) Applying an Angoff-derived standard to each module, and 3) Applying a two-stage standard by first applying the Angoff-derived standard to each module and then applying a proposed standard to the longitudinal profile of scores.
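The two-stage decision rule can be expressed compactly. The sketch below assumes each module has its own Angoff-derived cut score and that the second stage is a simple cap on the number of substandard modules; the function name, the cut scores, and the cap are hypothetical values for illustration, not those used in the study.

```python
def two_stage_decision(module_scores, module_cuts, max_substandard):
    """Apply a two-stage standard to a longitudinal profile of module scores.

    Stage 1: flag each module whose score falls below its cut score.
    Stage 2: pass if the number of flagged modules does not exceed the cap.
    Returns (passed, number_of_substandard_modules).
    """
    if len(module_scores) != len(module_cuts):
        raise ValueError("each module needs exactly one cut score")
    n_below = sum(score < cut for score, cut in zip(module_scores, module_cuts))
    return n_below <= max_substandard, n_below


# Hypothetical profile: nine modules, a common cut of 70, at most two misses allowed.
scores = [82, 75, 68, 90, 71, 66, 88, 79, 73]
print(two_stage_decision(scores, [70] * 9, max_substandard=2))
```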

Psychometric Methods A (2:10 — 2:55)

Extreme Response Style and Measurement of Intra-Individual Variability in Affect. (Sien Deng & Daniel M. Bolt, University of Wisconsin-Madison)

Abstract: Psychologists have become increasingly interested in intra-individual variability as a meaningful aspect of personality. However, concern has been expressed over the potential biasing effects of extreme response style (ERS) on the measurement of intra-individual variability in psychological constructs (Baird, Lucas & Donnellan, 2016). This paper explores such bias through a multilevel extension of a nominal response IRT model developed for response styles applied to repeated measures rating scale data.

We first demonstrate how the use of a psychometric item response model can help understand and control for biasing effects that ERS may introduce into traditional measures of intra-individual variance. A fully Bayesian estimation algorithm for the model is described. As an empirical illustration, PANAS measures collected from smokers at clinic visits following a quit attempt were used: 1) to verify the presence of an ERS dimension in the data; 2) to use the model-based estimates as a basis for interpreting how ERS leads to bias in the mean and intra-individual variance of Positive Affect (PA) and Negative Affect (NA) sum scores over time; and 3) to perform a correction of respondent-level mean and intra-individual variance estimates for ERS. ERS is found to generate considerably greater bias in intra-individual sum score variances than means, and in fact may function as a primary source of between-person differences in standard measures of intra-individual variability. In addition, the magnitude and direction of bias due to ERS is found to be heavily dependent on the mean affect level, supporting a model-based approach to the study and control of its effects. Application of the proposed model-based correction is found to improve intra-individual variability as a predictor of smoking cessation.

To validate the effectiveness of the model in detecting and controlling for ERS, we simulate response data from the proposed model involving either 8 or 20 time points. Person parameters are manipulated with respect to 1) mean latent levels of the substantive traits, 2) levels of latent intra-individual variability, and 3) level of ERS. In the simulation, we compare estimates from the latent trait model with and without ERS control, as well as estimates based on the sum scores. Recovery is considered for both the latent means and variances for both the PA and NA scales, and evaluated by correlations between the estimates and generating parameters. Recovery is seen to be consistently better for the latent trait model with ERS control than for the other two approaches (i.e., a latent trait model without ERS control, and traditional mean and variance sum score statistics), providing evidence that the model is functioning as intended in providing a correction for ERS bias.

Bayesian and Sequential Methods to Promote Learning and Detect Mastery. (Sangbeak Ye & Jeff Douglas, University of Illinois at Urbana-Champaign)

Abstract: Cognitive diagnosis models (CDMs) provide a novel approach to classifying examinees' ability as a K-dimensional binary vector that indicates mastery or nonmastery of K fine-grained skills. With the growing prevalence of e-learning platforms, there are growing opportunities to implement CDMs to construct computerized assessments that are meant to educate and assess students.

If K binary skills can be identified for an assessment, each item can contribute to teaching non-masters while also assessing mastery for any designated subset of the K skills. To highlight the benefit of computerizing the pedagogical process, we fully utilize the responses, the item parameters of the corresponding items, and the item bank to adaptively educate and assess students online. To that end, we introduce Bayesian item selection methods that promote learning and sequential mastery detection methods that yield an optimal stopping time with a fixed false detection rate of 0.01.

Here, we assume that each item has pedagogical value. That is, the itemwise transition rates yield a nonzero probability for some item j. We use the posterior probability of mastery, given the responses to the administered items and their item parameters (including learning rates), as the measure for inducing mastery and ultimately detecting it.

To evaluate the fit of a subsequent item given past performance, the expected posterior probability for each candidate item can be computed by treating the unobserved response as a random variable. The expected posterior probability gives the cumulative probability that mastery has occurred up to the candidate item, using the existing item parameters (for instance, the guessing and slipping parameters of the DINA model) and the transition probabilities.

In Ye et al. (2016), sequential change detection methods, including the CUSUM, Shiryaev-Roberts, and Shiryaev statistics, were showcased for detecting mastery in an e-learning scenario. The methods fully utilize the item parameters and observed responses to update the proximity to the mastery threshold at each item. The sequential detection methods were shown to outperform ad hoc methods that do not utilize the item parameters. In our study, we adopt the CUSUM statistic and benchmark it against a newly proposed method based on the posterior probability. Both methods were calibrated to yield a false detection rate of 0.01.

The simulation results provide significant evidence that data-driven item administration and detection reduce both discrete learning times and detection delays. Together, the two methods offer more effective online education by administering items adaptively and more efficient online evaluation by hastening mastery and its detection, improving the economy of the item bank.
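A CUSUM detector for mastery of a single skill can be sketched as follows. This is a generic textbook form of the CUSUM recursion, not the calibrated procedure from the study: it assumes DINA-style guessing and slipping parameters for each item, and the threshold of 4.0 is a hypothetical value rather than one calibrated to a 0.01 false detection rate.

```python
import math


def cusum_mastery(responses, guess, slip, threshold):
    """Return the 1-based item index at which mastery is signaled, or None.

    responses: 0/1 scores in administration order.
    guess[i]:  P(correct | non-master) for item i.
    slip[i]:   P(incorrect | master) for item i.
    """
    s = 0.0
    for t, (x, g, sl) in enumerate(zip(responses, guess, slip), start=1):
        p_master = 1 - sl if x == 1 else sl
        p_nonmaster = g if x == 1 else 1 - g
        # Accumulate the log-likelihood ratio, resetting at zero.
        s = max(0.0, s + math.log(p_master / p_nonmaster))
        if s >= threshold:
            return t
    return None


# Hypothetical learner who starts answering correctly from the third item on.
responses = [0, 0, 1, 1, 1, 1]
print(cusum_mastery(responses, [0.2] * 6, [0.1] * 6, threshold=4.0))
```

Because the statistic resets at zero, early incorrect responses do not delay detection once the learner begins responding like a master.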

Detection of Differential Item Functioning under Small Sample Size. (Chansoon "Danielle" Lee, University of Wisconsin-Madison; David Magis, University of Liège, Belgium; & Doyoung Kim, National Council of State Boards of Nursing)

Abstract: This research focuses on differential item functioning (DIF) among dichotomously scored items with small sample sizes (such as 20 or 30 per group). Previous studies show that some recently proposed DIF methods performed better than or as well as the existing test score-based and IRT modeling DIF methods for small sample sizes. However, no study has examined DIF identification with such small samples, although testing organizations often encounter the problem. Using both the existing and recent DIF methods, this research will conduct a large set of simulation studies with various factors. We also apply those methods to real data from a testing organization.

This research is the first study to examine the performance of the non-IRT DIF methods with small samples. Our simulation study highlights which methods are reliable for small samples and which factors have a meaningful impact on detecting DIF items. The findings of our research will offer a solution as well as useful practical guidelines for successful implementation of small sample DIF methods.

Afternoon break (2:55 — 3:15)

Organizational Methods B (3:15 — 4:00)

Group-Level Meta-Analyses: An Empirical Examination of the Effects of Reliability Information and Correction Procedures on the Accuracy of Parameter Estimates. (Maura Burke & Ron Landis, Illinois Institute of Technology)

Abstract: This paper proposes an empirical investigation of group-level meta-analytic conditions and the accuracy of the population parameter estimates obtained under these conditions. More specifically, empirical Monte Carlo simulation procedures will be employed to examine how the number of respondents per group, the proportion of available group-level reliability information, the assumed reliability values used when this information is missing, and the type of group-level reliability estimate affect the accuracy of estimates of the mean and variance of the rho parameter when these population values are known.

Implications of this study should guide researchers on estimating group-level reliabilities when conducting group-level meta-analyses with sample-based data. Furthermore, results from this study will provide a benchmark or guide for researchers in determining when proportions of group-level reliability and the values used are no longer accurately estimating their true population parameters.

General Factors in Employee Engagement Surveys: Predicting Employee Turnover. (Jordan McDonald & Sam McAbee, Illinois Institute of Technology)

Abstract: Employee engagement has gained increasing attention in the organizational sciences over the past decade. Research has shown that engagement is meaningfully related to a variety of work outcomes, including increased employee performance and reduced turnover. However, employee engagement surveys are subject to a variety of potential limitations if not developed with psychometric rigor. One concern is related to the multidimensionality of such scales: Specifically, items are commonly aggregated into scales to measure specific aspects of the employees' work experience and environment, yet ostensibly distinct dimensions on engagement surveys are often highly related due to the presence of general psychological tendencies (e.g., trait affectivity) and/or methodological artifacts (e.g., common method bias). Ignoring the influence of such general factors may result in the overestimation of the importance of specific factors measured within a survey, leading to inaccurate interpretations of the survey data. The present study compares several alternative confirmatory factor analytic (CFA) models that account for general factors in employee engagement surveys: specifically, higher-order and bifactor models. Conceptually, higher-order models account for covariation among specific factors by specifying a latent factor that sits at a higher level of generality. Implicitly, higher-order models assume that the general factor only influences responses at the item level indirectly through the specific factors. In contrast, bifactor models place the general and specific factors at the same level of generality, such that the general and specific factors simultaneously and directly influence responses to individual items. Notably, the bifactor model is advantageous when the goal of the research is to assess predictive relationships between the general and specific factors and external criteria.

A Multi-Rater Perspective on Personality and Performance: Applying the Trait-Reputation-Identity Model in a Military Sample. (Samuel T. McAbee, Illinois Institute of Technology; Brian S. Connelly, University of Toronto; Yongsuhk Jung, Republic of Korea Air Force Academy; & In-Sue Oh, Temple University)

Abstract: The use of personality inventories in personnel selection is among the most well-researched topics in the organizational sciences. Although the majority of this research has relied on the use of self-reports of personality, there is convincing evidence that observer-reports of the Big Five demonstrate incremental prediction for a variety of organizationally relevant outcomes. One reason that observer-reports might predict incremental variance in organizational outcomes is that targets and observers have access to different information when providing personality ratings. Recently, McAbee and Connelly (2016) proposed the Trait-Reputation-Identity (TRI) Model as a multi-rater framework for studying personality. The TRI Model posits that a person's standing on an individual trait continuum can be conceived according to the shared versus unique information available across rating sources. Under the TRI Model, the shared information that is captured across self- and observer-reports of a target's standing on a given trait (i.e., consensus) is reflected in the Trait factor; the unique variance captured within self-reports, alone, is reflected in the Identity factor; and shared perceptions of a target's personality among multiple observers, independent of self-perceptions, are reflected in the Reputation factor. This model allows researchers to simultaneously study consensually valid "traits" alongside intrapersonal processes of self-perception (identity) and interpersonal projections of character (reputation). The present study extends the previous work to the organizational context by applying TRI models of the Big Five traits to the prediction of supervisor performance ratings, peer-rated organizational citizenship behaviors (OCBs), and academic grade point average (GPA) in a sample of 422 Korean Air Force cadets. Technical details for estimating TRI models and assessing model fit, as well as theoretical implications of the TRI Model, will be discussed.

Closing comments (4:00)

Questions about the seminar may be directed to Alan Mead, Sam McAbee, or Kirk Becker. We hope you will join us.

Ideas in Testing Research Seminar Schedule, October 21, 2016

Coffee & Networking (9:15 — 9:45)

Welcome and Introduction (9:45 — 10:00)

Large-Scale Testing A (10:00 — 10:35)