Meta open-sources multisensory AI model that combines six types of data

The new ImageBind model combines text, audio, visual, movement, thermal, and depth data. It’s only a research project for now, but it shows how future AI models could generate multisensory content.

Illustration by Alex Castro / The Verge

Meta has announced a new open-source AI model that links together multiple streams of data, including text, audio, visual data, depth, temperature, and movement readings.

The model is only a research project at this point, with no immediate consumer or practical applications. But it points to a future of generative AI systems that can create immersive, multisensory experiences, and it shows that Meta continues to share AI research at a time when rivals like OpenAI and Google have become increasingly secretive.

The core concept of the research is linking together multiple types of data into a single multidimensional index (or “embedding space,” to use AI parlance). This idea may seem a little abstract, but it’s this same concept that underpins the recent boom in generative AI.

Multimodal AI models are the heart of the generative AI boom

For example, AI image generators like DALL-E, Stable Diffusion, and Midjourney all rely on systems that link together text and images during the training stage. They look for patterns in visual data while connecting that information to descriptions of the images. That’s what then enables these systems to generate pictures that follow users’ text inputs. Many AI tools that generate video or audio work in the same way.
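To make the idea concrete, here is a minimal, hypothetical sketch of the kind of contrastive text-image training described above, written in PyTorch. The encoders, feature sizes, and batch are placeholders rather than any real system’s code; the point is only that matching image-caption pairs are pulled together in one shared embedding space while mismatched pairs are pushed apart.

```python
import torch
import torch.nn.functional as F

# Placeholder encoders: any image backbone and text backbone that project
# into the same embedding dimension would do for this sketch.
image_encoder = torch.nn.Linear(2048, 512)  # stands in for a vision model
text_encoder = torch.nn.Linear(768, 512)    # stands in for a language model

def contrastive_step(image_features, text_features, temperature=0.07):
    # Project both modalities into the shared embedding space and
    # L2-normalize so dot products behave like cosine similarities.
    img_emb = F.normalize(image_encoder(image_features), dim=-1)
    txt_emb = F.normalize(text_encoder(text_features), dim=-1)

    # Similarity matrix: entry (i, j) compares image i with caption j.
    logits = img_emb @ txt_emb.T / temperature

    # Matching pairs sit on the diagonal; train both directions
    # (image -> text and text -> image) to agree on them.
    targets = torch.arange(len(logits))
    loss = (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
    return loss

# Toy batch: 8 image-caption pairs with made-up feature sizes.
loss = contrastive_step(torch.randn(8, 2048), torch.randn(8, 768))
loss.backward()
```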

Meta says that its model, ImageBind, is the first to combine six types of data into a single embedding space. The six types of data included in the model are: visual (in the form of both image and video); thermal (infrared images); text; audio; depth information; and — most intriguing of all — movement readings generated by an inertial measurement unit, or IMU. (IMUs are found in phones and smartwatches, where they’re used for a range of tasks, from switching a phone from landscape to portrait to distinguishing between different types of physical activity.)

Meta’s ImageBind model combines six types of data: audio, visual, text, depth, temperature, and movement. A figure from Meta’s blog post shows examples of linked data, e.g., a picture of a train, the audio of a train horn, and depth information about the train’s 3D shape.
Image: Meta
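Because Meta has open-sourced the model, this kind of cross-modal comparison can be tried directly. The sketch below is modeled on the usage example in Meta’s ImageBind repository; the import paths and function names are assumptions based on that release and may differ from the current code, and the image and audio file paths are placeholders.

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the pretrained ImageBind model released by Meta.
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Example inputs in three different modalities (file paths are placeholders).
text_list = ["a dog", "a car", "a bird"]
image_paths = ["dog.jpg", "car.jpg", "bird.jpg"]
audio_paths = ["dog.wav", "car.wav", "bird.wav"]

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

# Because all modalities share one embedding space, a plain dot product
# tells you, for example, which sound best matches which image.
print(torch.softmax(
    embeddings[ModalityType.VISION] @ embeddings[ModalityType.AUDIO].T, dim=-1))
```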

The idea is that future AI systems will be able to cross-reference this data in the same way that current AI systems do for text inputs. Imagine, for example, a futuristic virtual reality device that generates not only audio and visual input but also your environment and movement on a physical stage. You might ask it to emulate a long sea voyage, and it would not only place you on a ship with the noise of the waves in the background but also add the rocking of the deck under your feet and the cool breeze of the ocean air.

In a blog post, Meta notes that other streams of sensory input could be added to future models, including “touch, speech, smell, and brain fMRI signals.” It also claims the research “brings machines one step closer to humans’ ability to learn simultaneously, holistically, and directly from many different forms of information.” (Which, sure, whatever. Depends on how small these steps are.)

This is all very speculative, of course, and the immediate applications of research like this will likely be much more limited. For example, last year, Meta demonstrated an AI model that generates short, blurry videos from text descriptions. Work like ImageBind shows how future versions of the system could incorporate other streams of data, generating audio to match the video output, for example.

For industry watchers, though, the research is also interesting as Meta is open-sourcing the underlying model — an increasingly scrutinized practice in the world of AI.

Those opposed to open-sourcing, like OpenAI, say the practice is harmful to creators because rivals can copy their work, and that it is potentially dangerous, allowing malicious actors to take advantage of state-of-the-art AI models. Advocates respond that open-sourcing allows third parties to scrutinize the systems for faults and ameliorate some of their failings. They note it may even provide a commercial benefit, as it essentially allows companies to recruit third-party developers as unpaid workers to improve their work.

Meta has so far been firmly in the open-source camp, though not without difficulties. (Its latest language model, LLaMA, leaked online earlier this year, for example.) In many ways, its lack of commercial achievement in AI (the company has no chatbot to rival Bing, Bard, or ChatGPT) has enabled this approach. And for the time being, with ImageBind, it’s continuing with this strategy.