Researchers at MIT have created a database of annotated sentences written by non-native English speakers, the university announced in a press release Friday, in an effort to improve the ways in which computers process written or spoken language.
Most natural language processing (NLP) technology is based on machine learning, whereby computers identify patterns in large datasets. The problem, though, is that these systems are based on standard English, and therefore might not be able to pick up on the quirks and subtleties of non-native speakers.
"Most of the people who speak English are non-native speakers."
"Most of the people who speak English in the world or produce English text are non-native speakers," Yevgeni Berzak, a graduate student in electrical engineering and computer science who led the project, said in a statement. "This characteristic is often overlooked when we study English scientifically or when we do natural-language processing for English."
The database was compiled from 5,124 sentences taken from exam essays by English as a second language (ESL) students. The writers are native speakers of ten different languages that, together, are spoken by around 40 percent of the world's population. Each sentence contains at least one grammatical error, which had already been annotated by Cambridge University, but the sentences lacked other grammatical and syntactic data.
MIT's team of annotators added this information themselves, marking parts of speech (nouns, verbs, adjectives) as well as more detailed descriptions, including verb tenses and whether nouns are singular or plural. They then used the Universal Dependencies (UD) standard to map syntactic relationships in both the corrected and uncorrected sentences, identifying, for example, which adjectives modify which nouns and which verbs are auxiliaries of other verbs.
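To make those annotation layers concrete, here is a minimal Python sketch. The sentence, its tags, and the tab-separated layout (a simplified subset of the CoNLL-U format used for UD treebanks) are illustrative assumptions, not data from the MIT database, and `parse_tokens` and `pairs` are hypothetical helper names.

```python
# Illustrative, hand-annotated sentence with an agreement error ("dogs has").
# Columns (simplified): ID, FORM, UPOS, FEATS, HEAD, DEPREL.
SAMPLE = (
    "1\tThe\tDET\tDefinite=Def\t3\tdet\n"
    "2\tsmall\tADJ\tDegree=Pos\t3\tamod\n"
    "3\tdogs\tNOUN\tNumber=Plur\t5\tnsubj\n"
    "4\thas\tAUX\tNumber=Sing|Tense=Pres\t5\taux\n"
    "5\tbarked\tVERB\tTense=Past|VerbForm=Part\t0\troot\n"
)


def parse_tokens(block):
    """Turn the tab-separated block into a list of token dictionaries."""
    tokens = []
    for line in block.strip().splitlines():
        idx, form, upos, feats, head, deprel = line.split("\t")
        tokens.append({
            "id": int(idx),
            "form": form,
            "upos": upos,  # part of speech
            # detailed features such as tense or number
            "feats": dict(f.split("=") for f in feats.split("|")),
            "head": int(head),  # ID of the word this token depends on (0 = root)
            "deprel": deprel,   # type of syntactic relationship
        })
    return tokens


def pairs(tokens, deprel):
    """Return (dependent, head) word pairs linked by the given relation."""
    by_id = {t["id"]: t for t in tokens}
    return [
        (t["form"], by_id[t["head"]]["form"])
        for t in tokens
        if t["deprel"] == deprel
    ]


tokens = parse_tokens(SAMPLE)
print(pairs(tokens, "amod"))  # adjectives and the nouns they modify
print(pairs(tokens, "aux"))   # auxiliaries and the verbs they support
```

In this toy example, the `amod` relation links "small" to "dogs" and the `aux` relation links "has" to "barked", the two kinds of relationship the article mentions.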
The researchers say there was some disagreement on how to annotate the grammatically incorrect sentences, each of which went through three levels of review, but those annotations were no more contentious than annotations of grammatically correct sentences. That, according to MIT, suggests that the English of non-native speakers could be mapped in a similarly uniform way, which could pave the way for smarter grammar correction software. The researchers will present their work at the Association for Computational Linguistics annual conference next month.
Joakim Nivre, a professor of computational linguistics at Uppsala University in Sweden, and one of the developers of the UD standard, tells MIT that the research suggests that computers could be trained to systematically compare ESL to both native English and other languages, which could be used for machine translation tasks.