From Siri to Alexa, voice interfaces are becoming increasingly common, but for all their recent advances, they often struggle with one of the most basic characteristics of human speech: accents. The problem is so prevalent that computer scientists have identified the existence of a "machine voice," a standardized way of speaking that individuals with accents adopt in the hope of being understood. Researchers even warn about the existence of a "speech divide" that ostracizes individuals whose accents differ from those the machines have been trained on.
As is often the case with technology built on large data sets, the problem begins with the input. If you only train your interface using a narrow selection of voices, then it won't know how to respond to accents that fall outside of its frame of reference. According to Marsal Gavalda, the head of machine intelligence at Yik Yak and an expert on speech recognition, academics have been studying the problem since the '80s.
Speech recognition's lack of diversity is rooted in data sets collected decades ago
"Historically, speech recognition systems have been trained from data collected mostly in universities, and mostly from the student population," Gavalda tells The Verge. "The [diversity of voices] reflects the student population 30 years ago."
For example, a project in the early '90s known as Call Home gave students credits to make free, long-distance phone calls. Their calls would be recorded, transcribed, and annotated, then sold as data sets to research teams and computer scientists. "It was the easiest place to collect these samples," says Gavalda. He adds that researchers also collected audio from news broadcasters — but again, this meant selecting only the most neutral accents.
There's an easy way to fix this, though: collect more data. Companies like Google have been doing exactly that as their voice interfaces become more integral to their software and hardware, and the internet has made this collection pretty straightforward. As was first spotted by Quartz, the search giant has been using a third-party company, Appen, to corral a diverse array of accented audio samples from the website's users.
recruiting at /r/beermoney
Contractors for Appen have been posting on a number of subreddits, including those dedicated to part-time work (/r/slavelabour, /r/WorkOnline, /r/beermoney, etc.) as well as individual cities. /r/Edinburgh was where the request for samples was originally spotted, presumably to improve recognition of the underserved Scottish accent.
"I'm currently recruiting to collect speech data for Google," reads one typical request. "It requires you to use an Android to complete the task. The task is recording voice prompts like 'Indy now,' 'Google what's the time.' Each phrase takes around 3-5 seconds." Adults are paid £27 ($35) to record 2,000 phrases, while under-17s can record 500 phrases and earn £20 ($26). Neither Appen nor Google would confirm that they were involved in the project, but a well-placed source told The Verge that the search giant regularly collects voice data to improve its services — it's just usually not so visible.
We talked to a number of Redditors who completed the task, and asked them about their experiences with voice interface tech. There were regional accents from the UK and America, as well as Indian and Chinese-accented English, with most users saying they'd had difficulty with tech like Siri and Alexa in the past. All said they went through the same process of being directed to a mobile webpage where there was a record button to tap, and a number of phrases to read out.
These voice samples were mostly addressed to Google (beginning "OK Google," "Hey, Google," etc.), but some just asked for the names of popular TV shows, toys, and video games (including a number of YouTube channels, like Sky Does Minecraft). Others spanned a range of typical Google searches, including hunting for recipes ("how to make a birthday cake"), understanding idioms ("hey Google, get cold feet"), beating pub quizzes ("presidents in order"), and looking for that perfect karaoke number ("you'll be in my heart just music").
"I need to enunciate a ton and use simple phrases."
One respondent told The Verge by email: "[I'm] originally from China, but I've lived in the US for about a decade, so I speak pretty much understandable English. The closest description of my accent would be US northeast with a hint of Singaporean newscaster. However, I do need to enunciate a ton and use simple phrases for applications like Siri and Google Now to work. I can't really 'converse' with my phone."
After audio samples are collected by Appen, they're annotated by the company's in-house linguists, with longer sentences broken down grammatically, and contextual information added (was the sample recorded on a phone? Inside? Outdoors?). Mark Brayan, the company's CEO, wouldn't comment on the company's work with Google, but told The Verge that the firm collects and annotates (a process it refers to as "decoration") audio samples from around the world, with employees able to translate some 130 languages.
Brayan says demand for the company's services has increased massively in recent years, especially as voice interfaces become more common and users expect more out of them. "To go from understanding 95 percent of words to 99 percent, the recognizer has to digest infrequently used words, of which there are millions," says Brayan.
companies often request audio samples for specific vocabularies
Sometimes the company has to produce samples of specific vocabularies, related to, for example, a sport or a hobby. "One of the big challenges is what we call named entity recognition," says Brayan. "That's brand names, product names, individual names, and so on." Companies can ask for specific accents, or they can just say where they're hoping to launch a product and Appen will produce the relevant voices. "So if you're launching in Canada, for example, you need not only the French language but also French-accented Canadian English."
Incorporating underrepresented English-speaking accents will be a big step forward for voice interfaces, says Gavalda. "You could argue that the majority of English speakers are not even native speakers." He compares the situation to clinical trials for pharmaceutical companies that only recruited white men. It wasn't until an Act of Congress in 1993 that it became illegal to exclude women and minorities from such vital research. "If you think about it, you're developing a medicine," he says, "so it stands to reason you would make it work equally well with all different types of people."
Being able to ask Siri or Alexa questions obviously isn't as important as having access to effective medicine, but it is exclusion all the same. Thankfully, as Google's trawling for accents on Reddit shows, it's relatively easy to remedy. Just collect the audio samples, and let the machine learning systems process them. After all, a computer doesn't really "hear" accents — there are just sounds it recognizes and those it doesn't. It just needs the data.
Correction: A previous version of this story stated that "Appen employees" had posted to Reddit. It was third-party contractors hired by Appen who did so. We regret the error.