BABEL Generator

BABEL Team

The Basic Automatic B.S. Essay Language Generator (BABEL Generator) Team: Louis Sobel, Les Perelman, Milo Beckman, and Damien Jiang

My original research in trying to fool Automated Essay Scoring (AES) machines was unsystematic. Moreover, proponents of AES systems just repeated the long-used mantra that expert writers could fool AES machines but students could not.
I decided to test that hypothesis, along with the claim that AES passed the Turing Test, by attempting to fool the computer with something less intelligent than any student: another computer.
The traditional Turing Test is what Turing dubbed “The Imitation Game” in his seminal 1950 essay, “Computing Machinery and Intelligence.” In it, a human types at a screen or teletype to communicate with two entities in other rooms. One entity is a human being; the other is a computer (Figure 1).

Figure 1. Traditional Turing Test

If the human typing into the screen cannot differentiate the computer from the human in the discourse, then the machine would be considered intelligent.

There are several forms of the Reverse Turing Test, the most widely known being the CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) protocol that has become a common feature on internet pages. In the basic form of the Reverse Turing Test, the role of the human operator is taken by a machine. The Reverse Turing Test my co-investigators and I devised had various AES machines as the operator, trying to differentiate between actual human essays and gibberish created by the BABEL Generator (Figure 2).


Figure 2. Reverse Turing Test

Our hypothesis was simple. If the AES machine consistently gave high scores to machine-generated gibberish, we could surmise that 1) the construct being measured by the machines is not an essential component of human communication; and 2) students could be taught similar strategies to achieve high scores on computer-scored writing tests by sprinkling their prose with long, meaningless sentences composed of pretentious and irrelevant words.

Our greatest surprise was how easy it was to fool all of the machines. We succeeded on our first try, demonstrating that rather than being elegant and complex manifestations of state-of-the-art artificial intelligence, these engines could best be characterized as crude, stupid machines.

Although the Educational Testing Service has allowed me access to its e-rater® scoring engine in the past, it will no longer do so unless I sign an agreement allowing it to review all presentations and publications coming from such research and to force me to remove all references to its product or organization before publication or presentation. When I wrote about this attempt to censor me in The Washington Post, their reply first used examples that had no relevance to the issue at hand and boiled down to something like “we are not censoring Dr. Perelman; we are just trying to prevent him from presenting or publishing anything we do not like.”

We tested the BABEL Generator on a variety of Automated Essay Scoring platforms, and the gibberish it generated consistently achieved high scores on all of the platforms, including Vantage Technologies Intellimetric and ETS’s e-rater. E-rater is used to produce one of two scores on the two essays that constitute part of the Graduate Record Exam. ETS partners with a website, ScoreItNow, where one can get representative sample questions, write essays, and have them scored by e-rater. We have now used the BABEL Generator over twenty times to generate essays for the site, which, when submitted, receive top scores with comments such as “articulates a clear and insightful position on the issue in accordance with the assigned task and sustains a well-focused, well-organized analysis, connecting ideas logically” for essays that read like the following opening paragraph:

Careers with corroboration has not, and in all likelihood never will be compassionate, gratuitous, and disciplinary. Mankind will always proclaim noesis; many for a trope but a few on executioner.  A quantity of vocation lies in the study of reality as well as the area of semantics. Why is imaginativeness so pulverous to happenstance? The reply to this query is that knowledge is vehemently and boisterously contemporary.

Here are two sample PDF files, each containing the GRE questions, the BABEL-generated essay, and ETS’s response using e-rater:


Each exam consists of a set of two essays. The first essay, which ETS defines as the Issue Essay, asks the test-taker to write an argumentative essay responding to a specific assertion. The second essay, which ETS defines as the Argument Essay, requires a written analysis of a short argument. In reality, e-rater’s scoring algorithms are almost identical for the two essay types, as evidenced by the scores presented below for a total of 38 BABEL-generated essays, 19 each for the Issue and Argument Essays.

There were twenty sets of essays, but one score is missing for each essay type. One of the BABEL responses to an Issue Essay topic was given a 0 with the explanation that the essay was “Off topic (i.e., provides no evidence of an attempt to respond to the assigned topic), is in a foreign language, merely copies the topic, consists of only keystroke characters, or is illegible or nonverbal,” followed by an “ADVISORY: This essay is longer than essays that can be accurately scored. Your essay must be within the word limit to receive a score.” My first submission accidentally omitted the Argument Essay, leaving exactly 19 scores for each essay type.

BABEL Experiment Generating GRE Essays Graded by e-rater

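| Set | Issue topic                | Score    | # words | Argument topic      | Score | # words |
|-----|----------------------------|----------|---------|---------------------|-------|---------|
| A   | National Curriculum        | 4        | 489     |                     |       |         |
| B   | Imagination vs. Knowledge  | 5        | 896     | Late Night News     | 5     | 910     |
| C   | Competition vs. Cooperation | 6       | 896     | Super Screen Movies | 6     | 975     |
| D   | National Curriculum        | ADVISORY | 1071    | Late Night News     | 6     | 981     |
| E   | Imagination vs. Knowledge  | 5        | 788     | Bardville Theatre   | 5     | 621     |
| F   | Competition vs. Cooperation | 5       | 858     | Super Screen Movies | 5     | 934     |
| G   | National Curriculum        | 6        | 985     | Bardville Theatre   | 5     | 943     |
| H   | Imagination vs. Knowledge  | 6        | 978     | Late Night News     | 5     | 841     |
| I   | Competition vs. Cooperation | 4       | 491     | Super Screen Movies | 4     | 481     |
| J   | Imagination vs. Knowledge  | 6        | 922     | Late Night News     | 6     | 969     |
| K   | National Curriculum        | 5        | 961     | Bardville Theatre   | 6     | 990     |
| L   | Competition vs. Cooperation | 6       | 990     | Super Screen Movies | 5     | 973     |
| M   | Competition vs. Cooperation | 5       | 558     | Bardville Theatre   | 4     | 536     |
| N   | National Curriculum        | 5        | 955     | Late Night News     | 6     | 996     |
| O   | Imagination vs. Knowledge  | 6        | 991     | Super Screen Movies | 5     | 673     |
| P   | National Curriculum        | 5        | 998     | Bardville Theatre   | 5     | 979     |
| Q   | Competition vs. Cooperation | 6       | 998     | Late Night News     | 5     | 986     |
| R   | National Curriculum        | 6        | 971     | Bardville Theatre   | 6     | 967     |
| S   | Problems with Technology   | 5        | 992     | Mason City          | 6     | 996     |
| T   | National Curriculum        | 6        | 998     | Mason City          | 5     | 946     |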
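
For readers who want to check how close the two essay types come out, the scores in the table above can be tallied in a few lines of Python. This is only a sketch of the arithmetic, not part of the original experiment: the two lists are transcribed by hand from the table, and the helper name `summarize` is invented for illustration. Set D's Issue Essay (which received an ADVISORY rather than a score) and set A's omitted Argument Essay are left out, which gives 19 scores per type.

```python
from collections import Counter
from statistics import mean

# Scores transcribed from the table above, in set order A-T.
# Issue: set D is left out (ADVISORY instead of a numeric score).
issue_scores = [4, 5, 6, 5, 5, 6, 6, 4, 6, 5, 6, 5, 5, 6, 5, 6, 6, 5, 6]
# Argument: set A is left out (the Argument Essay was not submitted).
argument_scores = [5, 6, 6, 5, 5, 5, 5, 4, 6, 6, 5, 4, 6, 5, 5, 5, 6, 6, 5]


def summarize(scores):
    """Return the count, mean (2 decimals), and score distribution."""
    return {
        "n": len(scores),
        "mean": round(mean(scores), 2),
        "counts": dict(sorted(Counter(scores).items())),
    }


issue_summary = summarize(issue_scores)
argument_summary = summarize(argument_scores)

print("Issue:   ", issue_summary)
print("Argument:", argument_summary)
```

On these 38 scores the two means differ by roughly a tenth of a point on the six-point scale, and no essay of either type scored below 4.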

Above is my real-time demonstration on NHK, Japanese Public Television, of the BABEL Generator producing an essay that received a perfect score on the AES-graded Graduate Record Examination Practice Test.