
Microsoft’s new image-captioning AI will help accessibility in Word, Outlook, and beyond

The algorithm even beats humans in some limited tasks



The image captioning algorithm will be used to improve apps like Seeing AI, here being used by developer Florian Beijers.
Image: Microsoft / Maurice Jager

Microsoft has developed a new image-captioning algorithm that exceeds human accuracy in certain limited tests. The AI system has been used to update the company’s assistant app for the visually impaired, Seeing AI, and will soon be incorporated into other Microsoft products like Word, Outlook, and PowerPoint. There, it will be used for tasks like creating alt-text for images — a function that’s particularly important for increasing accessibility.

“Ideally, everyone would include alt text for all images in documents, on the web, in social media — as this enables people who are blind to access the content and participate in the conversation,” said Saqib Shaikh, a software engineering manager with Microsoft’s AI team, in a press statement. “But, alas, people don’t. So, there are several apps that use image captioning as a way to fill in alt text when it’s missing.”
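To make the "fill in alt text when it's missing" idea concrete, here is a minimal sketch of how an app might do it. The `generate_caption` function is a hypothetical stand-in for a captioning model (a real app would call a vision service here), and the regex assumes plain, non-self-closing `<img>` tags:

```python
import re

def generate_caption(image_src):
    # Hypothetical stand-in for an image-captioning model; a real
    # app would send the image to a captioning service instead.
    return f"Auto-generated description of {image_src}"

def fill_missing_alt(html):
    """Add alt text only to <img> tags that lack an alt attribute."""
    def add_alt(match):
        tag = match.group(0)
        if re.search(r'\balt\s*=', tag):
            return tag  # existing, human-written alt text is left untouched
        src = re.search(r'src="([^"]*)"', tag)
        caption = generate_caption(src.group(1) if src else "image")
        return tag[:-1] + f' alt="{caption}">'
    return re.sub(r'<img\b[^>]*>', add_alt, html)

print(fill_missing_alt('<img src="cat.jpg"><img src="dog.jpg" alt="A dog">'))
```

The important design point, mirrored in how these apps work, is that machine captions are a fallback: images that already carry human-written alt text are left alone.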

The new algorithm is twice as good as its predecessor, says Microsoft

These apps include Microsoft’s own Seeing AI, which the company first released in 2017. Seeing AI uses computer vision to describe the world as seen through a smartphone camera for the visually impaired. It can identify household items, read and scan text, describe scenes, and even identify friends. It can also be used to describe images in other apps, including email clients, social media apps, and messaging apps like WhatsApp.

Microsoft does not disclose user numbers for Seeing AI, but Eric Boyd, corporate vice president of Azure AI, told The Verge the software is “one of the leading apps for people who are blind or have low vision.” Seeing AI has been voted best app or best assistive app three years in a row by AppleVis, a community of blind and low-vision iOS users.

Microsoft’s new image-captioning algorithm will improve the performance of Seeing AI significantly, as it’s able to not only identify objects but also more precisely describe the relationship between them. So, the algorithm can look at a picture and not just say what items and objects it contains (e.g., “a person, a chair, an accordion”) but how they are interacting (e.g., “a person is sitting on a chair and playing an accordion”). Microsoft says the algorithm is twice as good as its previous image-captioning system, in use since 2015.
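The gap between listing objects and describing how they interact can be sketched with a toy caption composer. The labels and (subject, verb, object) triples below are invented for illustration; this is not Microsoft's actual pipeline, just a way to see the two output styles side by side:

```python
def article(word):
    # Naive a/an choice, good enough for this toy example.
    return "an" if word[0] in "aeiou" else "a"

def compose_caption(objects, relations):
    """objects: detected labels; relations: (subject, verb, object) triples."""
    if not relations:
        # Old-style output: just list what the image contains.
        return ", ".join(f"{article(o)} {o}" for o in objects)
    # New-style output: describe how the detected objects interact.
    clauses = [f"{article(s)} {s} is {v} {article(o)} {o}"
               for s, v, o in relations]
    return " and ".join(clauses)

print(compose_caption(["person", "chair", "accordion"], []))
print(compose_caption([], [("person", "sitting on", "chair"),
                           ("person", "playing", "accordion")]))
```

The first call reproduces the flat object list from the article's example; the second produces a relational description, which is the kind of output the new system aims for.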

The algorithm, which was described in a pre-print paper published in September, achieved the highest-ever scores on an image-captioning benchmark known as “nocaps,” an industry-leading leaderboard for image captioning, albeit one with its own constraints.

The nocaps benchmark consists of more than 166,000 human-generated captions describing some 15,100 images taken from the Open Images Dataset. These images span a range of scenarios, from sports to holiday snaps to food photography and more. (You can get an idea of the mixture of images and captions by exploring the nocaps dataset.) Algorithms are tested on their ability to create captions for these pictures that match those from humans.
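"Matching" a model's caption against human references is done on nocaps with automated metrics such as CIDEr and SPICE. Those are involved to compute, but the basic idea of scoring a candidate against a pool of reference captions can be illustrated with a much-simplified token-overlap score (this toy metric is our own, not the one the leaderboard uses):

```python
def overlap_score(candidate, references):
    """Best Jaccard overlap between candidate tokens and any reference.

    Toy stand-in for real captioning metrics like CIDEr/SPICE, which
    weight informative words and compare semantic structure.
    """
    cand = set(candidate.lower().split())
    best = 0.0
    for ref in references:
        r = set(ref.lower().split())
        if cand | r:
            best = max(best, len(cand & r) / len(cand | r))
    return best

print(overlap_score("a person playing an accordion",
                    ["a person is playing an accordion"]))
```

Even this crude version shows why such metrics "only roughly correlate with human preferences": a caption can share most of its words with a reference while still misdescribing the scene.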


It’s important to note, though, that the nocaps benchmark captures only a tiny sliver of the complexity of image captioning as a general task. Although Microsoft claims in a press release that its new algorithm “describes images as well as people do,” this is only true insofar as it applies to the very small subset of images contained within nocaps.

“Surpassing human performance on nocaps is not an indicator that image captioning is a solved problem”

As Harsh Agrawal, one of the creators of the benchmark, told The Verge over email: “Surpassing human performance on nocaps is not an indicator that image captioning is a solved problem.” Agrawal noted that the metrics used to evaluate performance on nocaps “only roughly correlate with human preferences” and that the benchmark itself “only covers a small percentage of all the possible visual concepts.”

“As with most benchmarks, [the] nocaps benchmark is only a rough indicator of the models’ performance on the task,” said Agrawal. “Surpassing human performance on nocaps by no means indicates that AI systems surpass humans on image comprehension.”

This problem — assuming that performance on a specific benchmark can be extrapolated as performance on the underlying task more generally — is a common one when it comes to exaggerating the ability of AI. Indeed, Microsoft has been criticized by researchers in the past for making similar claims about its algorithms’ ability to comprehend the written word.

Nevertheless, image captioning is a task that has seen huge improvements in recent years thanks to artificial intelligence, and Microsoft’s algorithms are certainly state-of-the-art. In addition to being integrated into Word, Outlook, and PowerPoint, the image-captioning AI will also be available as a standalone model via Microsoft’s cloud and AI platform Azure.
