# ETS's Research Legacy in Statistics and Psychometrics

ETS has pursued research in statistics and psychometrics since its founding in 1947. Below are examples of ETS's research contributions in these areas:

## Classical Test Theory

Classical test theory goes back to the early 20th century and has provided the foundation for how to calculate test scores and correct for measurement error in order to get reliable results from educational and psychological tests. The theory forms the basis for many calculations done with test scores, especially those involving reliability. The theory is based on partitioning a test taker's score into two components: a component called the "true score" that corresponds to the target of measurement (say, math skills) and a component called "error of measurement."
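The partition described above can be written out as a short worked equation. This is a sketch in standard classical test theory notation, not notation taken from the works cited here: the symbols $X$ (observed score), $T$ (true score), $E$ (error of measurement), and $\rho_{XX'}$ (reliability) are introduced for illustration, and the variance decomposition assumes that true score and error are uncorrelated.

```latex
\begin{align}
X &= T + E \\
\sigma_X^2 &= \sigma_T^2 + \sigma_E^2
  \qquad \text{(assuming $T$ and $E$ are uncorrelated)} \\
\rho_{XX'} &= \frac{\sigma_T^2}{\sigma_X^2}
  = 1 - \frac{\sigma_E^2}{\sigma_X^2}
\end{align}
```

With hypothetical values $\sigma_T^2 = 80$ and $\sigma_E^2 = 20$, the observed-score variance is $100$ and the reliability is $80 / 100 = 0.8$.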

A number of ETS researchers made lasting contributions to this field, as reflected in key publications such as these:

• Theory of Mental Tests (1950) — This influential textbook by Harold O. Gulliksen, a psychology professor at Princeton University and research advisor to ETS, covers a wide variety of mathematical theory and statistical methods for interpretation of test results. It is still in use.
• Statistical Theories of Mental Test Scores (1968) — This is a seminal work on classical test theory by Frederic Lord, who had joined ETS as director of statistical analysis in March 1949, and statistician Melvin Novick.

## Item Response Theory (IRT)

Item Response Theory (IRT) goes beyond classical test theory and offers new ways to design, analyze and score psychometric assessments of human abilities, attitudes and other variables. In a widely used IRT model, the test taker's probability of providing a correct answer to a test question depends on the test taker's ability as well as three parameters of the question:

• how well the question separates test takers of different ability (discrimination)
• the question's level of difficulty
• the probability that a test taker with no knowledge of the subject would give the right answer (guessing)
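The model described by these three parameters is commonly written as the three-parameter logistic (3PL) model. The following is a minimal sketch of that formula. The function name and the parameter names `theta`, `a`, `b`, and `c` are illustrative assumptions, not identifiers from LOGIST or any other ETS software, and some presentations also include a scaling constant that is omitted here.

```python
import math


def three_pl_probability(theta, a, b, c):
    """Probability of a correct answer under a 3PL model (illustrative sketch).

    theta: test taker's ability
    a: discrimination (how well the question separates ability levels)
    b: difficulty
    c: lower asymptote (chance of success with no knowledge of the subject)
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic


# Hypothetical item: when ability equals difficulty, the probability
# is halfway between the guessing level c and 1.
p = three_pl_probability(theta=0.0, a=1.2, b=0.0, c=0.2)
```

For a test taker whose ability is far below the item's difficulty, the probability approaches the guessing parameter `c` rather than zero.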

At ETS, IRT is used for item analysis, item banking and score equating. A key work in this field was Lord's Applications of Item Response Theory to Practical Testing Problems (1980). ETS also carried out early work on IRT software (LOGIST) (Wood, Wingersky, & Lord, 1976).

In another example of pioneering research in item response theory, ETS incorporated IRT statistics and expert test construction algorithms into an operational computerized adaptive testing paradigm (A Method for Severely Constrained Item Selection in Adaptive Testing, Stocking & Swanson, 1992, ETS Research Report No. RR-92-37), a paradigm that was used in the administration of GRE® and GMAT® assessments.

## Equating Test Scores

Innovative approaches to equating test scores helped establish the standard methods that testing programs around the world use to make their scores comparable over time and across multiple test forms. Key ETS-authored publications include:

These three chapters in the highly respected editions of the reference work Educational Measurement summarized the state of the art of equating and linking methodologies for each of the associated time periods.
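To give a sense of what making scores comparable across forms involves, here is a minimal sketch of linear equating, one of the simplest textbook approaches. It is not presented as the method of any particular chapter or testing program. It assumes the two forms were taken by equivalent groups, and the function name and sample scores are hypothetical.

```python
from statistics import mean, pstdev


def linear_equate(x, form_x_scores, form_y_scores):
    """Map a score x on form X to the form Y scale (illustrative sketch).

    Matches the mean and standard deviation of the two score
    distributions, assuming equivalent groups took each form.
    """
    mu_x, mu_y = mean(form_x_scores), mean(form_y_scores)
    sd_x, sd_y = pstdev(form_x_scores), pstdev(form_y_scores)
    return mu_y + (sd_y / sd_x) * (x - mu_x)


# Hypothetical data: form X was 5 points harder across the board.
form_x = [10, 20, 30]
form_y = [15, 25, 35]
equated = linear_equate(20, form_x, form_y)
```

In this hypothetical case a score of 20 on form X corresponds to 25 on form Y, because the two distributions differ only by a constant shift.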

## Exploratory and Confirmatory Factor Analysis

ETS's research on exploratory and confirmatory factor analysis contributed to modern factor analysis as codified in the highly respected book, "Modern Factor Analysis" (Harman, 1976). In addition, work at ETS provided an approach and software (LISREL) for estimating a linear structural equation system involving multiple indicators of unmeasured variables. The analytic procedures in LISREL are used throughout the social sciences to test theoretical relationships among variables ("LISREL: A General Computer Program for Estimating a Linear Structural Equation System Involving Multiple Indicators of Unmeasured Variables," Joreskog & Van Thillo, ETS Research Bulletin, RB-72-56, 1972).

## Large-Scale Survey Assessment Research

ETS researchers have made significant contributions to the science behind large-scale survey assessments.

In Estimating Norms by Item Sampling (ETS Research Bulletin, RB-61-02, 1961), Lord demonstrated a method for estimating norms more efficiently by using matrix sampling, an approach in which a large item pool is broken up into smaller, distinct item sets that are administered to different samples of students. Such matrix sampling is now widely used in the design of both national and international group-score assessments ("Monitoring Educational Progress with Group-Score Assessments," Mazzeo, Lazer, & Zieky, chapter in Educational Measurement, 4th Ed., 2006).
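The basic idea of matrix sampling can be sketched in a few lines. This is a deliberately simplified illustration, not Lord's procedure or the design of any operational assessment: it splits a hypothetical item pool into non-overlapping blocks and rotates the blocks across students, whereas real designs typically rely on random sampling and more elaborate booklet layouts. All names are assumptions.

```python
def make_item_blocks(item_ids, n_blocks):
    """Split the item pool into n_blocks non-overlapping blocks (round-robin)."""
    return [item_ids[i::n_blocks] for i in range(n_blocks)]


def assign_blocks(student_ids, blocks):
    """Rotate blocks across students so each block goes to a different subsample."""
    return {
        student: blocks[i % len(blocks)]
        for i, student in enumerate(student_ids)
    }


# Hypothetical pool of 12 items split into 3 blocks of 4 items each.
items = list(range(12))
blocks = make_item_blocks(items, 3)
assignment = assign_blocks(["s1", "s2", "s3", "s4"], blocks)
```

Each student answers only four items, yet every item in the pool is answered by some subsample, which is what allows group-level results to be estimated for the full pool.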

In A Least Squares Solution for Paired Comparisons with Incomplete Data (ETS Research Bulletin, RB-55-05, 1955), Gulliksen described an early approach to working with incomplete data.

In the 1970s, Rubin & Thayer ("Relating Tests Given to Different Samples." Psychometrika, v43 n1 p3–10, Mar 1978) proposed new imputation-based methods, in which statistical models are used to make assumptions about the nature of the missing values. Such methods are now common.

Subsequently, Mislevy (1991), working for ETS on the newly acquired contract for the National Assessment of Educational Progress (NAEP), created a novel synthesis of Rubin's work with Lord's work in matrix sampling, Bock's work on marginal maximum likelihood approaches to IRT estimation, and general advances in analysis of data from complex survey sampling designs ("A Framework for Studying Differences Between Multiple-Choice and Free-Response Test Items," ETS Research Report, RR-91-36). The resulting synthesis — and the methods derived from it — form the basis for the analysis methods still in use for NAEP. The basic ideas behind this approach undergird the analysis method used in all the modern international group surveys (e.g., TIMSS, PISA and PIRLS) (Mazzeo, Lazer, & Zieky, 2006).

## Test Fairness

ETS pioneered improvements in test fairness through a standardized approach to differential item functioning (DIF). A good test question should pose the same level of difficulty for test takers of all social and cultural backgrounds after taking into account the overall skill level of each group. This is to avoid the possibility that the question's content is difficult for reasons that may reflect test takers' life experiences rather than their knowledge or skill in the area being tested. DIF describes this variation after controlling for the overall ability of a group. It is possible to perform a DIF analysis for any group of test takers, but DIF analysis typically focuses on female test takers and test takers from specified ethnic groups.

ETS's research in this area led to the operational use of DIF procedures, which have become standard practice when evaluating fairness in test results. Key works include "Differential Item Performance and the Mantel-Haenszel Procedure" (a chapter by Holland & Thayer in Test Validity, edited by Wainer & Braun, 1988) and "DIF Detection and Description: Mantel-Haenszel and Standardization" (a chapter by Dorans & Holland in Differential Item Functioning, edited by Holland & Wainer, 1993). Prior to the invention of DIF procedures, Angoff and Ford developed an early method for looking at item functioning across groups in their study Item-Race Interaction on a Test of Scholastic Aptitude (1973).
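The Mantel-Haenszel approach named above can be illustrated with a minimal sketch. Test takers are matched on overall score, a 2×2 table of group by right/wrong is formed at each score level, and the tables are pooled into a common odds ratio, which is often reported on a delta scale as −2.35 times its natural logarithm. The function names, the tuple layout, and the counts are assumptions made for illustration. Operational procedures add significance tests and classification rules that are not shown here.

```python
import math


def mantel_haenszel_odds_ratio(strata):
    """Pooled odds ratio across matched score levels (illustrative sketch).

    strata: iterable of tuples
        (ref_correct, ref_incorrect, focal_correct, focal_incorrect),
    one tuple per score level used for matching.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        n = ref_right + ref_wrong + focal_right + focal_wrong
        if n == 0:
            continue  # skip empty score levels
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    return numerator / denominator


def mh_d_dif(alpha):
    """Express the pooled odds ratio on the delta scale."""
    return -2.35 * math.log(alpha)


# Hypothetical counts for one item at two matched score levels.
strata = [(40, 10, 30, 20), (30, 20, 20, 30)]
alpha = mantel_haenszel_odds_ratio(strata)
d_dif = mh_d_dif(alpha)
```

An odds ratio of 1 (a delta value of 0) indicates no DIF. In this hypothetical example the odds ratio is above 1, so the delta value is negative, meaning the item is relatively harder for the focal group than for matched members of the reference group.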