When scoring essays, the e-rater® engine will:
- validate that the features are not only predictive of a reader's score but also have some logical relevance to the writing prompt
- automatically flag responses that are off-topic or inconsistent, so that they can be set aside for review
- combine the scoring features in a statistical model to produce a final score estimate
The e-rater engine is continually being developed and improved, with the aim of extending its ability to model important and challenging aspects of writing proficiency. Ongoing research aims to enhance the e-rater engine's capabilities so that it can identify and evaluate the structure of an argument in an essay, as well as assess the creative use of language in student and test-taker writing.
The features used for e-rater scoring are the result of nearly two decades of Natural Language Processing research at ETS, and each feature may itself be composed of many independent sub-features. Work has also been done to establish a vertically linked scale of K–12 writing scores across grades based on the e-rater engine, known as the Developmental Writing Scale.
The features of the e-rater scoring engine currently include:
- content analysis based on vocabulary measures
- lexical complexity/diction
- proportion of grammar, usage and mechanics errors
- proportion of style comments
- organization and development scores
- use of idiomatic phraseology
The adjustment of features to assign a total score to an essay can be tailored to a specific prompt, or it can be done in a "generic" fashion, allowing the same e-rater model to be used to score responses to a variety of prompts.
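The difference between a prompt-specific and a "generic" model can be illustrated with a minimal sketch. It assumes, purely for illustration, that the statistical model is a weighted linear combination of standardized feature values. The feature names, weights, intercept, and 1 to 6 score range below are hypothetical and are not taken from ETS.

```python
from typing import Dict

# Hypothetical weights for a "generic" model shared across prompts.
# Feature names loosely mirror the list above; all numbers are invented.
GENERIC_WEIGHTS: Dict[str, float] = {
    "organization": 0.30,
    "development": 0.30,
    "grammar_usage_mechanics": 0.20,
    "lexical_complexity": 0.10,
    "style": 0.10,
}

# A hypothetical prompt-specific model: same features, different weights.
PROMPT_SPECIFIC_WEIGHTS: Dict[str, float] = {
    "organization": 0.25,
    "development": 0.35,
    "grammar_usage_mechanics": 0.20,
    "lexical_complexity": 0.15,
    "style": 0.05,
}


def score_essay(
    features: Dict[str, float],
    weights: Dict[str, float],
    intercept: float = 3.5,
    lowest: float = 1.0,
    highest: float = 6.0,
) -> float:
    """Combine standardized feature values into one score estimate.

    Assumes a linear model (an illustrative choice, not the documented
    e-rater method) and clamps the result to the score range.
    """
    raw = intercept + sum(weights[name] * features[name] for name in weights)
    return max(lowest, min(highest, raw))


# Example: an essay whose standardized feature values were computed elsewhere.
essay_features = {
    "organization": 1.0,
    "development": 0.0,
    "grammar_usage_mechanics": 0.0,
    "lexical_complexity": 0.0,
    "style": 0.0,
}
generic_score = score_essay(essay_features, GENERIC_WEIGHTS)
specific_score = score_essay(essay_features, PROMPT_SPECIFIC_WEIGHTS)
```

In this sketch, the generic model reuses one weight table for every prompt, while the prompt-specific model swaps in a table fitted to responses for a single prompt; the scoring function itself is unchanged.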
Computer's Score Agreement
For tasks that are appropriate for the e-rater engine (essay-length writing tasks that are scored for writing quality rather than correctness of claims made in the response), agreement with human raters can be very strong. As Attali, Bridgeman & Trapani found in 2010, the e-rater engine's agreement with a human rater on the TOEFL® Independent and GRE® Issue tasks was higher than the agreement between two independent human raters.
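The text above does not say which agreement statistic was used. As a hedged illustration, the sketch below computes two measures that are commonly reported for essay scoring: the exact agreement rate and quadratic weighted kappa. The function names and the sample scores are hypothetical.

```python
from collections import Counter
from typing import Sequence


def exact_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of essays on which two raters gave the same score."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def quadratic_weighted_kappa(
    a: Sequence[int], b: Sequence[int], min_score: int, max_score: int
) -> float:
    """Chance-corrected agreement that penalizes large disagreements more."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(a)
    span = (max_score - min_score) ** 2
    # Observed weighted disagreement: mean squared difference, normalized.
    observed = sum((x - y) ** 2 for x, y in zip(a, b)) / (n * span)
    # Expected weighted disagreement if the two raters scored independently
    # with the same marginal score distributions.
    hist_a, hist_b = Counter(a), Counter(b)
    expected = sum(
        hist_a[i] * hist_b[j] * (i - j) ** 2 for i in hist_a for j in hist_b
    ) / (n * n * span)
    if expected == 0:
        # Both raters gave one identical score to every essay; kappa is
        # undefined, and this sketch treats it as perfect agreement.
        return 1.0
    return 1.0 - observed / expected


# Hypothetical scores on a 1 to 4 scale for four essays.
human = [1, 2, 3, 4]
machine = [1, 2, 3, 3]
rate = exact_agreement(human, machine)
kappa = quadratic_weighted_kappa(human, machine, min_score=1, max_score=4)
```

Comparing the human-machine value of such a statistic with the value for two independent human raters on the same essays is one way to frame the kind of comparison described above.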