Facebook is pouring a lot of time and money into augmented reality, including building its own AR glasses with Ray-Ban. Right now, these gadgets can only record and share imagery, but what does the company think such devices will be used for in the future?
A new research project led by Facebook’s AI team suggests the scope of the company’s ambitions. It imagines AI systems that constantly analyze people’s lives using first-person video, recording what they see, do, and hear in order to help them with everyday tasks. Facebook’s researchers have outlined a series of skills they want these systems to develop, including “episodic memory” (answering questions like “where did I leave my keys?”) and “audio-visual diarization” (remembering who said what when).
Right now, the tasks outlined above cannot be achieved reliably by any AI system, and Facebook stresses that this is a research project rather than a commercial development. However, it’s clear that the company sees functionality like this as the future of AR computing. “Definitely, thinking about augmented reality and what we’d like to be able to do with it, there’s possibilities down the road that we’d be leveraging this kind of research,” Facebook AI research scientist Kristen Grauman told The Verge.
Such ambitions have huge privacy implications. Privacy experts are already worried about how Facebook’s AR glasses allow wearers to covertly record members of the public. Such concerns will only be exacerbated if future versions of the hardware not only record footage, but analyze and transcribe it, turning wearers into walking surveillance machines.
The name of Facebook’s research project is Ego4D, which refers to the analysis of first-person, or “egocentric,” video. It consists of two major components: an open dataset of egocentric video and a series of benchmarks that Facebook thinks AI systems should be able to tackle in the future.
The dataset is the biggest of its kind ever created, and Facebook partnered with 13 universities around the world to collect the data. In total, some 3,205 hours of footage were recorded by 855 participants living in nine different countries. The universities, rather than Facebook, were responsible for collecting the data. Participants, some of whom were paid, wore GoPro cameras and AR glasses to record video of unscripted activity. This ranges from construction work to baking to playing with pets and socializing with friends. All footage was de-identified by the universities, which included blurring the faces of bystanders and removing any personally identifiable information.
Grauman says the dataset is the “first of its kind in both scale and diversity.” The nearest comparable project, she says, contains 100 hours of first-person footage shot entirely in kitchens. “We’ve opened up the eyes of these AI systems to more than just kitchens in the UK and Sicily, but [to footage from] Saudi Arabia, Tokyo, Los Angeles, and Colombia.”
The second component of Ego4D is a series of benchmarks, or tasks, that Facebook wants researchers around the world to try and solve using AI systems trained on its dataset. The company describes these as:
Episodic memory: What happened when (e.g., “Where did I leave my keys?”)?
Forecasting: What am I likely to do next (e.g., “Wait, you’ve already added salt to this recipe”)?
Hand and object manipulation: What am I doing (e.g., “Teach me how to play the drums”)?
Audio-visual diarization: Who said what when (e.g., “What was the main topic during class?”)?
Social interaction: Who is interacting with whom (e.g., “Help me better hear the person talking to me at this noisy restaurant”)?
Right now, AI systems would find tackling any of these problems incredibly difficult, but creating datasets and benchmarks is a tried-and-tested method of spurring development in the field of AI.
Indeed, the creation of one particular dataset and an associated annual competition, known as ImageNet, is often credited with kickstarting the recent AI boom. The ImageNet dataset consists of pictures of a huge variety of objects that researchers trained AI systems to identify. In 2012, the winning entry in the competition used a particular method of deep learning to blast past rivals, inaugurating the current era of research.
Facebook is hoping its Ego4D project will have similar effects for the world of augmented reality. The company says systems trained on Ego4D might one day be used not only in wearable cameras but also in home assistant robots, which also rely on first-person cameras to navigate the world around them.
“The project has the chance to really catalyze work in this field in a way that hasn’t really been possible yet,” says Grauman. “To move our field from the ability to analyze piles of photos and videos that were human-taken with a very special purpose, to this fluid, ongoing first-person visual stream that AR systems, robots, need to understand in the context of ongoing activity.”
Although the tasks that Facebook outlines certainly seem practical, the company’s interest in this area will worry many. Facebook’s record on privacy is abysmal, spanning data leaks and a $5 billion fine from the FTC. It’s also been shown repeatedly that the company values growth and engagement above users’ well-being in many domains. With this in mind, it’s worrying that the benchmarks in the Ego4D project do not include prominent privacy safeguards. For example, the “audio-visual diarization” task (transcribing what different people say) never mentions removing data about people who don’t want to be recorded.
When asked about these issues, a spokesperson for Facebook told The Verge that it expected that privacy safeguards would be introduced further down the line. “We expect that to the extent companies use this dataset and benchmark to develop commercial applications, they will develop safeguards for such applications,” said the spokesperson. “For example, before AR glasses can enhance someone’s voice, there could be a protocol in place that they follow to ask someone else’s glasses for permission, or they could limit the range of the device so it can only pick up sounds from the people with whom I am already having a conversation or who are in my immediate vicinity.”
For now, such safeguards are only hypothetical.