Facebook begins using artificial intelligence to describe photos to blind users


Second sight

Ask a member of Facebook’s growth team what feature played the biggest role in getting the company to a billion daily users, and they’ll likely tell you it was photos. The endless stream of pictures, which users have been able to upload since 2005, a year after Facebook’s launch, makes the social network irresistible to a global audience. It’s difficult to imagine Facebook without photos. Yet for millions of blind and visually impaired people, that’s been the reality for over a decade.

Not anymore. Today Facebook will begin automatically describing the content of photos to blind and visually impaired users. Called "automatic alternative text," the feature was created by Facebook’s 5-year-old accessibility team. Led by Jeff Wieland, a former user researcher in Facebook’s product group, the team previously built closed captioning for videos and implemented an option to increase the default font size on Facebook for iOS, a feature 10 percent of Facebook users take advantage of.

Automatic alt text, which is coming to iOS today and later to Android and the web, recognizes objects in photos using machine learning. Machine learning helps to build artificial intelligences by using algorithms to make predictions. If you show a piece of software enough pictures of a dog, for example, in time it will be able to identify a dog in a photograph. Automatic alt text identifies things in Facebook photos, then uses the iPhone’s VoiceOver feature to read descriptions of the photos out loud to users. While still in its early stages, the technology can reliably identify concepts in categories including transportation ("car," "boat," "airplane"), nature ("snow," "ocean," "sunset"), sports ("basketball court"), and food ("sushi"). The technology can also describe people ("baby," "smiling," "beard"), and identify a selfie.


Last week, I traveled to Facebook’s accessibility lab in Menlo Park to see the technology in action. Wieland was there, along with Matt King, a Facebook engineer who is blind. King, who was born with limited sight and became blind in college, has been advocating for more accessible computers since the 1980s. Today, he represents Facebook at the World Wide Web Consortium, on a group responsible for the technical specifications that make web pages accessible.

The primary way that blind people access the internet is through a screen reader: software that describes the elements displayed on a screen (a link, a button, some text, and so on) and makes it possible to interact with them. The web has evolved over the years to be friendlier to blind people. For example, the downward-facing triangle you see on every Facebook post, which allows you to hide the post or report it as spam, gets described by the screen reader not as a triangle but as "story options, collapsed pop-up button." That way, blind users know they can interact with it.

But much of the web has long been out of reach for blind people. "You used to hear file names, and you didn’t know if they were clickable," King says. "It was a big Easter Egg hunt — and it wasn’t any fun at all. Even when I found the eggs, a lot of the eggs were photos. People talk in pictures, and talking in pictures is inherently out of reach for me." Facebook considered a range of approaches to the problem. "We don’t want to add a lot of friction," King says. "We could probably require people when they upload a photo: ‘please describe this for blind people.’ It would drive people nuts — that would never work at scale." (This is the actual approach Twitter is taking to the problem, though adding descriptions is optional.)

Facebook’s scale is enormous: each day, users upload 2 billion photos across Facebook, Instagram, Messenger, and WhatsApp. And so the accessibility team turned to Facebook’s artificial intelligence division, which is building software that recognizes images automatically. "We need a solution to that problem if people who cannot see photos and understand what’s in them are going to be part of the community and get the same enjoyment and benefit out of the platform as the people who can," King says.

In a demonstration, King pulled up a few stories on Facebook that include photos. He set the screen to black so we couldn’t see anything. If you’d like to re-think everything you ever thought you knew about web design, watch a blind person use the internet for five minutes. King normally has his screen reader speak to him incredibly quickly — the slightest audio cues now orient him on the page, reading Facebook posts out loud, identifying links, and exposing various buttons. His fingers were a blur as he entered commands on a standard MacBook Air. I remained totally lost until King turned the screen back on, save for the handful of words that described what we were seeing on Facebook.

One Facebook post had a photo with the caption "Sunday night splurge," and the description read aloud by the phone was "pizza, food." When King turned the screen back on, there was a photo of a giant pepperoni pizza with olives. Another photo had the caption "celebrations," and the phone described the photo as "three people smiling outdoors." It turned out to be … three people smiling outdoors. "Now I’m really understanding the essence of the story," King says. "Sometimes it’s just really amazing what one word can do."

Facebook is not alone in using machine learning to understand photos; it’s one of a few things artificial intelligence can currently do with any level of sophistication. Similar technology powers keyword searches in Google Photos and Flickr. But the technology is still prone to errors, and millions of objects have yet to be parsed. Last year, Google was forced to apologize after Photos tagged two black people as "gorillas."

By default, Facebook will only suggest a tag for a photo if it is 80 percent confident that it knows what it’s looking at. But in sensitive cases — including ones involving race, the company told me — it will require a much higher level of confidence before offering a suggestion. When it isn’t confident, Facebook simply won’t suggest a description. "In some cases, no data is better than bad data," Wieland says.
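
The thresholding rule described above can be illustrated with a small sketch. This is not Facebook's code: the function names, the set of sensitive tags, and the 0.99 cutoff are hypothetical stand-ins, since the company said only that sensitive cases require "a much higher level of confidence." The 0.80 default is the one figure taken from the article.

```python
from typing import Dict, List, Optional

DEFAULT_THRESHOLD = 0.80    # the 80 percent default cited in the article
SENSITIVE_THRESHOLD = 0.99  # hypothetical: the article gives no number
SENSITIVE_TAGS = {"person"}  # hypothetical example of a sensitive category


def select_tags(predictions: Dict[str, float]) -> List[str]:
    """Keep only the tags whose confidence clears the applicable threshold."""
    kept = []
    # Consider the most confident predictions first.
    for tag, confidence in sorted(predictions.items(), key=lambda kv: -kv[1]):
        threshold = SENSITIVE_THRESHOLD if tag in SENSITIVE_TAGS else DEFAULT_THRESHOLD
        if confidence >= threshold:
            kept.append(tag)
    return kept


def describe(predictions: Dict[str, float]) -> Optional[str]:
    """Return a comma-separated description, or None when nothing is confident enough."""
    tags = select_tags(predictions)
    if not tags:
        return None  # "no data is better than bad data"
    return ", ".join(tags)


if __name__ == "__main__":
    print(describe({"pizza": 0.95, "food": 0.90, "olives": 0.40}))
    print(describe({"olives": 0.40}))
```

With these made-up scores, the first call keeps "pizza" and "food" and drops "olives," while the second returns nothing at all, which mirrors the behavior Wieland describes.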

It’s a cliché for tech companies to describe a project as "just the beginning," but in this case it feels particularly true. Today it only works on one platform, and only in English. There are still millions of objects that Facebook can’t recognize with 80 percent confidence. ("Pizza" it knows. "Pepperoni pizza with olives" is still a ways away.) But the team is already pushing hard on two new tools: recognizing objects in videos, a technology it first demonstrated in November; and something it calls "visual Q&A," which will allow users to ask questions about pictures and receive an answer from Facebook’s AI. You might ask who is in a photo, for example, and it would tell you the names of the Facebook friends who appear in it.

At this stage, automatic alt tags represent a fascinating demonstration of technology. But at scale, they could also represent a growth opportunity — people with disabilities have been less likely to use Facebook on average, for obvious reasons. "Inclusion is really powerful and exclusion is really painful," King says. "The impact of doing something like this is really telling people who are blind, your ability to participate in the social conversation that’s going on around the world is really important to us. It’s saying as a person, you matter, and we care about you. We want to include everybody — and we’ll do what it takes to include everybody."