Long before the voice assistants built into our smart speakers could understand our requests to play music, turn off lights, and read the weather report, machines had to learn how to hear, recognize, and process human speech. The technology in use today is more than a century in the making — and it’s come a long way since the earliest listening and recording devices. Here’s a look back at the origin of speech recognition.
Before Thomas Edison patented his light bulb, he invented one of the earliest working dictation machines. The first phonograph, which he built in 1877, contained a stylus that etched grooves into a rotating, tinfoil-covered cylinder in response to the pressure produced by sound vibration. The embossed record could then be used in reverse to vibrate the stylus and turn those movements back into audible sound. These early devices could record speech and play it back, but they could not process words or take any action, and the phonograph’s delicate tinfoil produced garbled recordings at best.
Alexander Graham Bell developed an upgrade on Edison’s phonograph, which his Volta Graphophone Company patented in 1886. The graphophone used wax instead of foil, which allowed for longer recordings and higher-quality playback. Edison also developed a wax version of the phonograph, and both devices were used primarily for dictating letters and other documents.
Researchers at Bell Laboratories built Audrey, the first true speech recognition device, in 1952. The machine understood digits 0-9 if speakers paused in between, though it would have to adapt to each user before it could capture their speech with reasonable accuracy. While Audrey could theoretically be used for hands-free dialing, it lacked mass appeal thanks to its size (it was housed in a six-foot-tall relay rack), power requirements, and cost to produce and maintain. Punching actual phone buttons proved to be faster and more reliable than Audrey’s capabilities.
An IBM engineer introduced the Shoebox at the 1962 World’s Fair in Seattle. The Shoebox was like a voice-activated calculator: It understood 10 digits and six control words — plus, minus, total, subtotal, false, and off — and instructed an attached adding machine to calculate and print out answers to basic spoken math problems. Like Audrey, the device attempted to recognize and act on the specific frequency of the vowels in each spoken digit.
In 1971, the Defense Advanced Research Projects Agency (DARPA) provided funding for a five-year speech recognition project at Carnegie Mellon University, which led to the launch of Harpy in 1976. The machine had a 1,011-word vocabulary and could also understand entire phrases, including where different words started and stopped. Harpy processed speech that followed pre-programmed vocabulary, pronunciation, and grammar structures. Like the voice assistants available in 2018, Harpy returned an “I don’t know what you said, please repeat” message when it couldn’t understand the speaker.
The 1986 Tangora was an upgrade of the Shoebox — but instead of operating an adding machine, it connected to a typewriter. Named after the world’s fastest typist at the time, Tangora recognized approximately 20,000 words and processed speech by predicting the most likely result based on what it had interpreted thus far. However, it still couldn’t automatically adapt to individual speakers.
Consumers had broader access to both personal computers and speech recognition tech in the 1990s. Dragon’s 1997 NaturallySpeaking software could recognize and transcribe natural human speech (users didn’t have to pause between words) into a digital document at a rate of 100 words per minute. The program cost $695 to purchase, which made it “affordable” relative to earlier speech recognition devices. A version of NaturallySpeaking is still available for download.
2010s: Watson and Google Assistant
The twin innovations from IBM and Google brought what was previously in the realm of sci-fi into reality. When Watson, a computer capable of answering questions posed in natural language, beat Jeopardy! grand champion Ken Jennings in a trivia grudge match, it was seen as a major step forward for both voice recognition and software intelligence. Watson’s work is geared towards industrial applications like natural language understanding, where it parses massive datasets and returns actionable information. That’s why Google Assistant’s focus on the consumer aspects of voice assistants was so pivotal. This was one of the first times an average user could leverage voice recognition and AI in a practical way, and it has set the stage for a future of smart home products that you can control with your voice.