Last week, Facebook’s parent company Meta shared a new AI model that turns text prompts into short, soundless videos. But it turns out Google has been working on the same problem, and recently unveiled two new AI text-to-video systems, one of which focuses on image quality while the other prioritizes the creation of longer clips.
Let’s take a look at the high-quality model first: Imagen Video. As the name suggests, this model builds on techniques honed in Google’s earlier text-to-image system Imagen, but bolts a number of new components onto the pipeline to turn static frames into fluid motion.
As with Meta’s Make-A-Video model, the end results are simultaneously incredible, uncanny, and unsettling. The most convincing samples are those videos that replicate animation, like green sprouts forming the words “Imagen” or the wooden figurine surfing in space. That’s because we don’t necessarily expect such footage to follow strict rules of temporal and spatial composition. It can be a bit looser, which suits the model’s weaknesses.
The least convincing clips are those that replicate the motion of real people and animals, like the figure shoveling snow or the cat jumping on a couch. Here, where we have such a clear idea of how bodies and limbs should move, the deformation and deterioration of the footage are more obvious. Regardless, these videos are all extremely impressive, with each clip generated using nothing more than the text prompt in each caption below.
Take a gander for yourself:
Google’s researchers note that the Imagen Video model outputs 16 frames of 3fps footage at 24x48 resolution. This low-res content is then run through various AI super-resolution models, which boost this output to 128 frames of 24fps footage at 1280x768 resolution. That’s a higher resolution than Meta’s Make-A-Video model, whose output is boosted to 768x768.
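To put those numbers in perspective, here is a minimal Python sketch of the arithmetic. It assumes 24x48 means 24 pixels tall by 48 wide and 1280x768 means 1280 wide by 768 tall. That reading is an assumption, though it does not change the pixel counts. The helper name `clip_stats` is purely illustrative and is not taken from Google’s paper.

```python
from fractions import Fraction


def clip_stats(frames, fps, width, height):
    """Return the duration and pixel counts implied by a clip's dimensions."""
    return {
        "duration_s": Fraction(frames, fps),      # exact length in seconds
        "pixels_per_frame": width * height,
        "total_pixels": frames * width * height,
    }


# Figures quoted in the article (the width/height order is an assumption).
base = clip_stats(frames=16, fps=3, width=48, height=24)
final = clip_stats(frames=128, fps=24, width=1280, height=768)

# How much the super-resolution stages add along each axis.
frame_factor = Fraction(128, 16)  # temporal upsampling
pixel_factor = Fraction(final["pixels_per_frame"], base["pixels_per_frame"])
total_factor = Fraction(final["total_pixels"], base["total_pixels"])

# Comparison with the 768x768 output quoted for Meta's Make-A-Video.
vs_meta = Fraction(1280 * 768, 768 * 768)

print(f"base duration:  {float(base['duration_s']):.2f} s")
print(f"final duration: {float(final['duration_s']):.2f} s")
print(f"frames x{frame_factor}, pixels per frame x{float(pixel_factor):.0f}")
print(f"total pixels x{float(total_factor):.0f}, vs. Meta x{float(vs_meta):.2f}")
```

Both clips last the same 16/3 seconds, a little over five. In other words, the super-resolution stages do not make the video any longer. They fill in eight times as many frames and several hundred times as many pixels per frame across the same stretch of time.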
As we discussed with the debut of Meta’s system, the coming advent of text-to-video AI brings with it all sorts of challenges, from the racial and gender bias embedded in these systems (which are trained on material scraped from the internet) to their potential for misuse (e.g., creating non-consensual pornography, propaganda, and misinformation).
Google’s researchers allude to these matters briefly in their research paper. “Video generative models can be used to positively impact society, for example by amplifying and augmenting human creativity,” they write. “However, these generative models may also be misused, for example to generate fake, hateful, explicit or harmful content.” The team notes that it experimented with filters to catch NSFW prompts and output video, but offers no comment on their success and concludes, with what reads like unintentional understatement, that “there are several important safety and ethical challenges remaining.” Well, quite.
This is not surprising. Imagen Video is a research project, and Google is mitigating its potential harms to society by simply not releasing it to the public. (Meta’s Make-A-Video AI is similarly restricted.) But, as with text-to-image systems, these models will soon be replicated and imitated by third-party researchers before being disseminated as open-source models. When that happens, there will be new safety and ethical challenges for the wider web, no doubt about it.
In addition to Imagen Video, a separate team of Google researchers also published details about another text-to-video model, this one named Phenaki. In comparison to Imagen Video, Phenaki’s focus is on creating longer videos that follow the instructions of a detailed prompt.
So, with a prompt like this:
Lots of traffic in futuristic city. An alien spaceship arrives to the futuristic city. The camera gets inside the alien spaceship. The camera moves forward until showing an astronaut in the blue room. The astronaut is typing in the keyboard. The camera moves away from the astronaut. The astronaut leaves the keyboard and walks to the left. The astronaut leaves the keyboard and walks away. The camera moves beyond the astronaut and looks at the screen. The screen behind the astronaut displays fish swimming in the sea. Crash zoom into the blue fish. We follow the blue fish as it swims in the dark ocean. The camera points up to the sky through the water. The ocean and the coastline of a futuristic city. Crash zoom towards a futuristic skyscraper. The camera zooms into one of the many windows. We are in an office room with empty desks. A lion runs on top of the office desks. The camera zooms into the lion’s face, inside the office. Zoom out to the lion wearing a dark suit in an office room. The lion wearing looks at the camera and smiles. The camera zooms out slowly to the skyscraper exterior. Timelapse of sunset in the modern city.
Phenaki generates a video like this:
Obviously the video’s coherence and resolution are lower than those of Imagen Video, but the sustained series of scenes and settings is impressive. (You can watch more examples on the project’s homepage here.)
In a paper describing the model, the researchers say their method can generate videos of an “arbitrary” length, i.e., with no limit. They say that future versions of the model “will be part of an ever-broad toolset for artists and non-artists alike, providing new and exciting ways to express creativity.” But they also note that, “while the quality of the videos generated by Phenaki is not yet indistinguishable from real videos, getting to that bar for a specific set of samples is within the realm of possibility, even today. This can be particularly harmful if Phenaki is to be used to generate videos of someone without their consent and knowledge.”