Mapping the seam where humans and software interact and evaluate written language is a challenge for scholars in the humanities and the social sciences as much as it is a challenge for researchers in computer science and computational linguistics.
Automated Essay Scoring (AES) software is driven by algorithms that attempt to codify the complexities of written language. AES packages have been developed and marketed by educational content and assessment firms such as ETS, Pearson, and Vantage Learning; these efforts have drawn public critique, including more than 4,000 signatures on the petition Professionals Against Machine Scoring of Student Essays in High-Stakes Assessment. Algorithms also act as readers in Turnitin's new auto-grading service. Based on human readers' scoring of sample essays, the Turnitin Scoring Engine "identifies patterns to grade new writing like your own instructors would." These patterns rest on algorithms that attempt to map what the human readers have done on the sample essays and to build a statistical model that the software can then apply to "an unlimited number of new essays." Speed and reliability are the promised benefits of this algorithmic reading.
This post sketches some of the history, challenges, and dynamics around having software algorithms score and respond to written language. Understanding how different algorithms and software packages work is essential for entering into debates about not only the reliability of AES but also its validity. Like human readers, each algorithmic reading engine emphasizes different aspects of a piece of writing and has its own quirks, its own biases, if you will. These algorithmic-readerly biases have something to do with history. In this case, it is not a reader's biographical history but the software's developmental history that shapes the reading approach encapsulated in the automated essay scoring and response engines.
The core engine of ETS's e-rater was developed during the 1990s by researchers at ETS. It constructs an ideal model essay for a given task based on up to twelve features. When it reads, it reports its comparative analysis across several areas, such as style; errors in grammar, usage, and mechanics; identification of organizational segments, such as the thesis statement; and vocabulary content. Its writing construct and model of the reading process are informed primarily by Natural Language Processing (NLP) theories (Attali & Burstein, 2006; Leacock & Chodorow, 2003; Burstein, 2003).
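The flavor of this kind of feature-based comparison can be sketched in a few lines of Python. Everything below is invented for illustration: the features are crude proxies, and the weights are made up; e-rater's actual twelve features and their weighting are far more sophisticated and are not public in this form.

```python
# A toy sketch of feature-weighted essay scoring in the spirit of a
# feature-based engine. Features and weights are invented.

def extract_features(essay: str) -> dict:
    """Compute a few crude proxies for fluency, syntax, and vocabulary."""
    words = essay.split()
    sentences = [s for s in essay.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    return {
        "word_count": len(words),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "vocabulary_diversity": len({w.lower() for w in words}) / max(len(words), 1),
    }

# In a real engine these weights would be fit against human scores on
# sample essays; here they are simply made up.
WEIGHTS = {"word_count": 0.01, "avg_sentence_length": 0.05,
           "vocabulary_diversity": 2.0}

def score(essay: str) -> float:
    """A score as a weighted combination of feature values."""
    return sum(WEIGHTS[name] * value
               for name, value in extract_features(essay).items())
```

Even this caricature makes the readerly bias visible: whatever the feature list rewards, the engine rewards.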
Cengage/Vantage Learning's Intellimetric® was also informed by Natural Language Processing (NLP) theories. This algorithmic model draws on up to 500 component proxy features and organizes them into a few selected clusters, including content, word variety, grammar, text complexity, and sentence variety. Intellimetric supplements its NLP-based approach with semantic wordnets that attempt to determine how close a written response's vocabulary is to the vocabulary used in model pieces. It also integrates grammar and text complexity assessments into its formulas. The engine aims to predict the expert human scores and is then optimized to produce a final predicted score (Elliot, 2003).
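One way to picture the wordnet comparison is as vocabulary overlap after synonym expansion. The synonym table, essays, and overlap metric below are all invented; Intellimetric's actual wordnets and scoring formulas are proprietary.

```python
# Illustrative sketch of wordnet-style vocabulary comparison: expand a
# response's words with synonyms, then measure how much of a model
# essay's vocabulary it covers. All data here is invented.

SYNONYMS = {
    "car": {"automobile", "vehicle"},
    "fast": {"quick", "rapid"},
}

def expand(words: set) -> set:
    """Add synonyms so near-matches in vocabulary count as matches."""
    out = set(words)
    for w in words:
        out |= SYNONYMS.get(w, set())
    return out

def vocab_closeness(response: str, model: str) -> float:
    """Fraction of the model essay's vocabulary covered by the response."""
    resp = expand(set(response.lower().split()))
    mod = set(model.lower().split())
    return len(resp & mod) / len(mod) if mod else 0.0
```

With this toy table, "the car was fast" fully covers the vocabulary of "the automobile was quick" even though the two share only two surface words, which is the kind of semantic credit the wordnet approach is after.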
Pearson's Intelligent Essay Assessor (IEA), delivered through the Write to Learn! platform, was originally developed by Thomas Landauer and Peter Foltz. It uses Latent Semantic Analysis (LSA) to measure how well an essay's content matches the model content for a piece of writing. In contrast to e-rater, mechanics, style, and organization are not fixed features; rather, they are constructed as a function of the domains assessed in the rating rubric. The weights for the proxy variables associated with these domains are predicted from sample human readings (rating scores), and the software can then combine them with the score its LSA-based algorithm calculates for the piece's semantic content (Landauer, Foltz, & Laham, 1998; Landauer, Laham, & Foltz, 2003).
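A compact sketch of LSA, using numpy's SVD, can illustrate the semantic-matching idea: build a term-document matrix from model essays, reduce it to a low-dimensional latent space, and compare a new essay to each model by cosine similarity there. The corpus, the essay, and the number of latent dimensions below are toys; IEA's training corpora and models are vastly larger.

```python
import numpy as np

# Toy LSA sketch: term-document matrix -> SVD -> compare a new essay
# to model essays by cosine similarity in the reduced latent space.
docs = [
    "plants use light energy to make sugar",
    "photosynthesis turns light into sugar in plants",
    "the stock market fell after the report",
]
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # number of latent dimensions to keep

def to_latent(text: str) -> np.ndarray:
    """Project a text's term counts into the k-dimensional latent space."""
    counts = np.array([text.split().count(w) for w in vocab], dtype=float)
    return counts @ U[:, :k]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

essay = "light and sugar in plants"
similarities = [cosine(to_latent(essay), to_latent(d)) for d in docs]
```

On this toy corpus the essay lands close to both photosynthesis models and far from the stock-market one, even though it shares only some words with each, which is the "latent" part of LSA's appeal.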
Turnitin's Scoring Engine (sometimes called Revision Assistant) builds on the algorithms developed by Elijah Mayfield at LightSide Labs. This scoring engine attempts to read and respond to student writing based on the elements that human readers would value in the same writing and reading contexts. Samples of graded student writing are fed into the algorithms to help the scoring engine learn; this machine-learning approach differs from the algorithms used in the NLP- and LSA-based systems. This set of algorithms can point out both strengths and weaknesses in a sample of writing, so this algorithmic reader's history and practice may be as much about feedback and revision as about scoring and placement.
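The training step of such a system can be caricatured with a least-squares fit: human-scored sample essays go in, weights come out, and the weights are then applied to new essays. The feature extractor, sample essays, and scores below are all invented; the actual Scoring Engine uses far richer features and models.

```python
import numpy as np

# Toy sketch of learning to score from human-graded samples: fit weights
# that best reproduce the human scores, then apply them to new essays.
# Data and features are invented for illustration.

def features(essay: str) -> list:
    """One crude feature (word count) plus a constant bias term."""
    return [float(len(essay.split())), 1.0]

# (essay, human score) pairs standing in for a graded sample set.
samples = [
    ("short essay", 1.0),
    ("a somewhat longer essay with more varied words", 3.0),
    ("a fully developed essay that sustains an argument across many varied words and sentences", 5.0),
]

X = np.array([features(text) for text, _ in samples])
y = np.array([human_score for _, human_score in samples])
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # learn weights from the human scores

def predict(essay: str) -> float:
    """Score a new essay with the learned weights."""
    return float(np.array(features(essay)) @ w)
```

The point of the caricature is where the authority sits: the model has no notion of good writing beyond whatever patterns distinguish the high-scored samples from the low-scored ones.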
These four sketches of how algorithms have been bundled in AES software only scratch the surface of the ways in which algorithms are being used at the seam where humans and machines interact, where humans and machines read and write each other. These algorithmic readers are only a few of the pieces of software being developed within the AES, or what is beginning to be called Automated Writing Evaluation (AWE), field. (For a more comprehensive review, see Elliot et al., 2013.) This field itself is only a sliver of the ways in which algorithms are being used to read and manage written texts as data.
In some ways, the iterative feedback possible with these emerging scoring engines is the logical evolution of the intimate green or red squiggly lines already in Microsoft Word, Gmail, and Google Docs. These pieces of AWE software, these bundles of algorithms, are shaping how we think about our words, our writing.
They are intervening in our writing processes, becoming our intimates. They may be closer to our thoughts than our lovers are. What does it mean when we text a love note, and a software agent autocorrects our spelling before our lover reads it? What does it mean when students are writing essays and a bundle of algorithms pushes and shapes their language before a peer or a teacher even sees it?
These algorithms are not evil. They are us, or they are working with us, or they are working us.
As a foray into machine reading practices, this post should not end in dystopia. Rather, I want it to end by asking that we go back, that we consider the algorithms and how they are bundled. E-rater is not Intellimetric, and IEA is not Turnitin's Revision Assistant. We need to pick at the seam where algorithms read our writing and where we write into the deep well of natural language processing, semantic webs, and machine learning.
We need to pick at these threads not to undo them, but to come to understand the pluses and minuses they each hold. Knowing how these algorithmic readers work, knowing the threads that bundle these software packages together is a vital task for humanists as well as for computational linguists, for teachers and writers as well as for software developers. I’m curious to see how this discussion plays out.
Attali, Y., & Burstein, J. (2006). Automated Essay Scoring With e-rater V.2. Journal of Technology, Learning, and Assessment, 4(3). Available from http://www.jtla.org
Burstein, J. (2003). The E-rater scoring engine: Automated essay scoring with natural language processing. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 113-122). Mahwah, NJ: Erlbaum.
Elliot, S. (2003). Intellimetric: From here to validity. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 71-86). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Elliot, N., Gere, A. R., Gibson, G., Toth, C., Whithaus, C., & Presswood, A. (2013). Uses and limitations of automated writing evaluation software (WPA-CompPile Research Bibliographies, No. 23). http://comppile.org/wpa/bibliographies/Bib23/AutoWritingEvaluation.pdf
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to latent semantic analysis. Discourse Processes, 25, 259-284.
Landauer, T. K., Laham, D., & Foltz, P. W. (2003). Automated scoring and annotation of essays with the Intelligent Essay Assessor. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 87-112). Mahwah, NJ: Erlbaum.
Leacock, C., & Chodorow, M. (2003). C-rater: Scoring of short-answer questions. Computers and the Humanities, 37(4), 389-405.