I used OpenAI’s new tech to transcribe audio right on my laptop

The company behind DALL-E and GPT has built an automatic speech recognition system, called Whisper, and is letting developers and researchers use it.

Illustration of a series of blue microphones on a teal background.
The benefits of AI without the drawbacks of the cloud.
Kristen Radtke / The Verge; Getty Images

OpenAI, the company behind image-generation and meme-spawning program DALL-E and the powerful text autocomplete engine GPT-3, has launched a new, open-source neural network meant to transcribe audio into written text (via TechCrunch). It’s called Whisper, and the company says it “approaches human level robustness and accuracy on English speech recognition” and that it can also automatically recognize, transcribe, and translate other languages like Spanish, Italian, and Japanese.

As someone who’s constantly recording and transcribing interviews, I was immediately hyped about this news — I thought I’d be able to write my own app to securely transcribe audio right from my computer. While cloud-based services like Otter.ai and Trint work for most things and are relatively secure, there are just some interviews where I, or my sources, would feel more comfortable if the audio file stayed off the internet.

Using it turned out to be even easier than I’d imagined; I already have Python and various developer tools set up on my computer, so installing Whisper was as easy as running a single Terminal command. Within 15 minutes, I was able to use Whisper to transcribe a test audio clip that I’d recorded. For someone relatively tech-savvy who didn’t already have Python, FFmpeg, Xcode, and Homebrew set up, it’d probably take closer to an hour or two. Someone is already working on making the process much simpler and more user-friendly, though, which we’ll get to in a moment.
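For readers curious what that looks like from Python rather than the command line, here is a minimal sketch. It assumes OpenAI's `whisper` package and FFmpeg are already installed; the helper names `whisper_available` and `transcribe_file`, and the choice of the "base" model size, are illustrative rather than anything from my own setup.

```python
import importlib.util


def whisper_available() -> bool:
    """Return True if a module named `whisper` can be imported here.

    This only checks that the module exists; it does not verify that it is
    OpenAI's package or that FFmpeg is installed.
    """
    return importlib.util.find_spec("whisper") is not None


def transcribe_file(audio_path: str, model_name: str = "base") -> str:
    """Transcribe one audio file locally and return the plain text.

    `model_name` is an assumption for this sketch; larger models are slower
    but generally more accurate.
    """
    import whisper  # imported lazily so the check above works without it

    model = whisper.load_model(model_name)  # downloads weights on first use
    result = model.transcribe(audio_path)   # runs entirely on this machine
    return result["text"]
```

Calling `transcribe_file("interview.mp3")` (a hypothetical file name) would return the transcript as a single string, and the audio never leaves the computer.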

Command-line apps obviously aren’t for everyone, but for something that’s doing a relatively complex job, Whisper’s very easy to use.

While OpenAI definitely saw this use case as a possibility, it’s pretty clear the company is mainly targeting researchers and developers with this release. In the blog post announcing Whisper, the team said its code could “serve as a foundation for building useful applications and for further research on robust speech processing” and that it hopes “Whisper’s high accuracy and ease of use will allow developers to add voice interfaces to a much wider set of applications.” This approach is still notable, however — the company has limited access to its most popular machine-learning projects like DALL-E or GPT-3, citing a desire to “learn more about real-world use and continue to iterate on our safety systems.”

Image showing a text file with the transcribed lyrics for Yung Gravy’s song “Betty (Get Money).” The transcription contains many inaccuracies.
The text files Whisper produces aren’t exactly the easiest to read if you’re using them to write an article, either.

There’s also the fact that it’s not exactly a user-friendly process to install Whisper for most people. However, journalist Peter Sterne has teamed up with GitHub developer advocate Christina Warren to try and fix that, announcing that they’re creating a “free, secure, and easy-to-use transcription app for journalists” based on Whisper’s machine learning model. I spoke to Sterne, and he said that he decided the program, dubbed Stage Whisper, should exist after he ran some interviews through it and determined that it was “the best transcription I’d ever used, with the exception of human transcribers.”

I compared a transcription generated by Whisper to what Otter.ai and Trint put out for the same file, and I would say the three were roughly comparable in accuracy. There were enough errors in all of them that I would never just copy and paste quotes from them into an article without double-checking the audio (which is, of course, best practice anyway, no matter what service you’re using). But Whisper’s version would absolutely do the job for me; I can search through it to find the sections I need and then just double-check those manually. In theory, Stage Whisper should perform exactly the same since it’ll be using the same model, just with a GUI wrapped around it.
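That search-then-verify workflow is easy to sketch in code. Whisper's Python API returns timestamped segments alongside the full text, so a few lines can list where a keyword comes up and where to scrub to in the recording. The function names and sample segments below are made up for illustration; only the `start`, `end`, and `text` keys mirror what Whisper returns.

```python
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as MM:SS (or H:MM:SS past an hour)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def find_quotes(segments: List[Dict], keyword: str) -> List[str]:
    """Return timestamped lines for every segment that mentions the keyword."""
    needle = keyword.lower()
    return [
        f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}"
        for seg in segments
        if needle in seg["text"].lower()
    ]


# Invented segments in the shape of Whisper's result["segments"] list.
sample_segments = [
    {"start": 0.0, "end": 4.2, "text": " Thanks for taking the time to talk."},
    {"start": 754.5, "end": 760.0, "text": " The budget was cut in March."},
    {"start": 3725.0, "end": 3731.0, "text": " We never saw the final budget."},
]

for line in find_quotes(sample_segments, "budget"):
    print(line)
# [12:34] The budget was cut in March.
# [1:02:05] We never saw the final budget.
```

Each hit still has to be checked against the audio, but the timestamp says where to listen.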

Sterne admitted that tech from Apple and Google could make Stage Whisper obsolete within a few years. The Pixel’s voice recorder app has been able to do offline transcriptions for years, and a version of that feature is starting to roll out to some other Android devices. Apple, meanwhile, has offline dictation built into iOS (though currently there’s not a good way to actually transcribe audio files with it). “But we can’t wait that long,” Sterne said. “Journalists like us need good auto-transcription apps today.” He hopes to have a bare-bones version of the Whisper-based app ready in two weeks.

To be clear, Whisper probably won’t totally obsolete cloud-based services like Otter.ai and Trint, no matter how easy it is to use. For one, OpenAI’s model is missing one of the biggest features of traditional transcription services: being able to label who said what. Sterne said Stage Whisper probably wouldn’t support this feature: “we’re not developing our own machine learning model.”

The cloud is just somebody else’s computer — which probably means it’s quite a bit faster

And while you’re getting the benefits of local processing, you’re also getting the drawbacks. The main one is that your laptop is almost certainly significantly less powerful than the computers a professional transcription service is using. For example, I fed the audio from a 24-minute-long interview into Whisper, running on my M1 MacBook Pro; it took around 52 minutes to transcribe the whole file. (Yes, I did make sure it was using the Apple Silicon version of Python instead of the Intel one.) Otter spat out a transcript in less than eight minutes.
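To put those numbers in perspective, here is the back-of-the-envelope arithmetic as a short script. The "real-time factor" framing (processing time divided by audio length) is my own shorthand here, and since Otter finished in under eight minutes, its figure is only an upper bound.

```python
def real_time_factor(processing_minutes: float, audio_minutes: float) -> float:
    """Minutes of processing needed per minute of audio (lower is faster)."""
    return processing_minutes / audio_minutes


AUDIO_MINUTES = 24          # length of the interview
WHISPER_LOCAL_MINUTES = 52  # Whisper on the M1 MacBook Pro
OTTER_MAX_MINUTES = 8       # Otter took less than this

whisper_rtf = real_time_factor(WHISPER_LOCAL_MINUTES, AUDIO_MINUTES)
otter_rtf_bound = real_time_factor(OTTER_MAX_MINUTES, AUDIO_MINUTES)

print(round(whisper_rtf, 2))                      # 2.17
print(round(otter_rtf_bound, 2))                  # 0.33
print(WHISPER_LOCAL_MINUTES / OTTER_MAX_MINUTES)  # 6.5
```

In other words, the laptop needed a bit more than two minutes per minute of audio, while the cloud service needed at most a third of a minute, a gap of at least 6.5 times.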

OpenAI’s tech does have one big advantage, though: price. The cloud-based subscription services will almost certainly cost you money if you’re using them professionally (Otter has a free tier, but upcoming changes are going to make it less useful for people who are transcribing things frequently), and the transcription features built into platforms like Microsoft Word or the Pixel require you to pay for separate software or hardware. Stage Whisper, like Whisper itself, is free and can run on the computer you already have.

Again, OpenAI has higher hopes for Whisper than it being the basis for a secure transcription app — and I’m very excited about what researchers end up doing with it or what they’ll learn by looking at the machine learning model, which was trained on “680,000 hours of multilingual and multitask supervised data collected from the web.” But the fact that it also happens to have a real, practical use today makes it all the more exciting.
