My original attempts to fool Automated Essay Scoring (AES) machines were unsystematic. Moreover, proponents of AES systems simply repeated the long-used mantra that expert writers could fool AES machines but students could not.
I decided to test that hypothesis, along with the claim that AES passed the Turing Test, by attempting to fool the computer with something less intelligent than any student: another computer.
The traditional Turing Test is what Turing dubbed “The Imitation Game” in his seminal 1950 essay, “Computing Machinery and Intelligence.” It has a human typing into a screen or teletype, communicating with two entities in other rooms. One entity is a human being; the other entity is a computer. (Figure 1)
If the human typing into the screen cannot differentiate the computer from the human in the discourse, then the machine would be considered intelligent.
There are several forms of the Reverse Turing Test, the most widely known being the CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) protocol that has become a common feature on internet pages. In the basic form of the Reverse Turing Test, the role of the human operator is taken by a machine. In the Reverse Turing Test my co-investigators and I devised, various AES machines acted as the operator, trying to differentiate between actual human essays and gibberish created by the BABEL Generator (Figure 2).
Figure 2. Reverse Turing Test

Our hypothesis was simple. If the AES machine consistently gave high scores to machine-generated gibberish, we could surmise that 1) the construct being measured by the machines is not an essential component of human communication; and 2) students could be taught similar strategies to achieve high scores on computer-scored writing tests by sprinkling their prose with long meaningless sentences composed of pretentious and irrelevant words.
Our greatest surprise was how easy it was to fool all of the machines. We succeeded on our first try, demonstrating that, rather than being elegant and complex manifestations of state-of-the-art artificial intelligence, these engines could best be characterized as crude, stupid machines.
Although the Educational Testing Service (ETS) has in the past allowed me access to its e-rater® scoring engine, it now will not allow me access unless I sign an agreement that it may review all presentations and publications coming from such research and may then force me to remove all references to its product or organization before publication or presentation. When I wrote about this attempt to censor me in The Washington Post, ETS’s reply first used examples that had no relevance to the issue at hand and boiled down to something like “we are not censoring Dr. Perelman; we are just trying to prevent him from presenting or publishing anything we do not like.”
We tested the BABEL Generator on a variety of Automated Essay Scoring platforms, and the gibberish it generated consistently achieved high scores on all of the platforms, including Vantage Technologies Intellimetric and ETS’s e-rater. E-rater is used to produce one of two scores on the two essays that constitute part of the Graduate Record Exam. ETS partners with a website, ScoreItNow, where one can get representative sample questions, write essays, and have them scored by e-rater. We have now used the BABEL Generator over twenty times to generate essays for the site, which, when submitted, receive top scores with comments such as “articulates a clear and insightful position on the issue in accordance with the assigned task and sustains a well-focused, well-organized analysis, connecting ideas logically” for essays that read like the following opening paragraph:
Careers with corroboration has not, and in all likelihood never will be compassionate, gratuitous, and disciplinary. Mankind will always proclaim noesis; many for a trope but a few on executioner. A quantity of vocation lies in the study of reality as well as the area of semantics. Why is imaginativeness so pulverous to happenstance? The reply to this query is that knowledge is vehemently and boisterously contemporary.
Here are two sample PDF files, each containing the GRE Questions, the BABEL Generated essay, and ETS’s response using e-rater:
Each exam consists of a set of two essays. The first essay, which ETS defines as the Issue Essay, asks the test-taker to write an argumentative essay responding to a specific assertion. The second essay, which ETS defines as the Argument Essay, requires a written analysis of a short argument. In reality, e-rater’s scoring algorithms are almost identical for the two essay types, as evidenced by the scores presented below for a total of 38 BABEL-generated essays, 19 each for the Issue and Argument Essays.
There were twenty sets of essays, but one score is missing for each essay type. One of the BABEL responses to an Issue Essay topic was given a 0 with the explanation that the essay was “Off topic (i.e., provides no evidence of an attempt to respond to the assigned topic), is in a foreign language, merely copies the topic, consists of only keystroke characters, or is illegible or nonverbal.” This was followed by an ADVISORY: This essay is longer than essays that can be accurately scored. Your essay must be within the word limit to receive a score. My first submission accidentally omitted the Argument Essay, leaving exactly 19 scores for each essay type.
BABEL Experiment Generating GRE Essays Graded by e-rater
Set | Issue | Score | # words | Argument | Score | # words
A | National Curriculum | 4 | 489 | | |
B | Imagination vs. Knowledge | 5 | 896 | Late Night News | 5 | 910
C | Competition vs Cooperation | 6 | 896 | Super Screen Movies | 6 | 975
D | National Curriculum | ADVISORY | 1071 | Late Night News | 6 | 981
E | Imagination vs. Knowledge | 5 | 788 | Bardville Theatre | 5 | 621
F | Competition vs Cooperation | 5 | 858 | Super Screen Movies | 5 | 934
G | National Curriculum | 6 | 985 | Bardville Theatre | 5 | 943
H | Imagination vs. Knowledge | 6 | 978 | Late Night News | 5 | 841
I | Competition vs Cooperation | 4 | 491 | Super Screen Movies | 4 | 481
J | Imagination vs. Knowledge | 6 | 922 | Late Night News | 6 | 969
K | National Curriculum | 5 | 961 | Bardville Theatre | 6 | 990
L | Competition vs Cooperation | 6 | 990 | Super Screen Movies | 5 | 973
M | Competition vs Cooperation | 5 | 558 | Bardville Theatre | 4 | 536
N | National Curriculum | 5 | 955 | Late Night News | 6 | 996
O | Imagination vs. Knowledge | 6 | 991 | Super Screen Movies | 5 | 673
P | National Curriculum | 5 | 998 | Bardville Theatre | 5 | 979
Q | Competition vs Cooperation | 6 | 998 | Late Night News | 5 | 986
R | National Curriculum | 6 | 971 | Bardville Theatre | 6 | 967
S | Problems with Technology | 5 | 992 | Mason City | 6 | 996
T | National Curriculum | 6 | 998 | Mason City | 5 | 946
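
For readers who want to check the tallies themselves, here is a minimal Python sketch that transcribes the scored rows of the table and summarizes them by essay type. The variable and function names (`ISSUE`, `ARGUMENT`, `summarize`) are my own, chosen for illustration; the sketch leaves out the Issue essay in set D, which received an ADVISORY rather than a score, and the Argument essay missing from set A. It only does arithmetic on the numbers already shown and is not a statistical test of anything.

```python
from collections import Counter
from statistics import mean

# (set label, score, word count), transcribed from the table above.
# Set D's Issue essay (ADVISORY, no score) is left out.
ISSUE = [
    ("A", 4, 489), ("B", 5, 896), ("C", 6, 896), ("E", 5, 788),
    ("F", 5, 858), ("G", 6, 985), ("H", 6, 978), ("I", 4, 491),
    ("J", 6, 922), ("K", 5, 961), ("L", 6, 990), ("M", 5, 558),
    ("N", 5, 955), ("O", 6, 991), ("P", 5, 998), ("Q", 6, 998),
    ("R", 6, 971), ("S", 5, 992), ("T", 6, 998),
]

# Set A has no Argument essay, so it is left out.
ARGUMENT = [
    ("B", 5, 910), ("C", 6, 975), ("D", 6, 981), ("E", 5, 621),
    ("F", 5, 934), ("G", 5, 943), ("H", 5, 841), ("I", 4, 481),
    ("J", 6, 969), ("K", 6, 990), ("L", 5, 973), ("M", 4, 536),
    ("N", 6, 996), ("O", 5, 673), ("P", 5, 979), ("Q", 5, 986),
    ("R", 6, 967), ("S", 6, 996), ("T", 5, 946),
]


def summarize(rows):
    """Return the count, mean score, and score distribution for one essay type."""
    scores = [score for _, score, _ in rows]
    return {
        "n": len(scores),
        "mean": round(mean(scores), 2),
        "counts": dict(sorted(Counter(scores).items())),
    }


issue_summary = summarize(ISSUE)
argument_summary = summarize(ARGUMENT)

print("Issue essays:   ", issue_summary)
print("Argument essays:", argument_summary)
```

Every scored essay in both columns received a 4, 5, or 6 on the six-point scale, and the two means differ by roughly a tenth of a point.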