In the latest example of deepfake technology, researchers have shown off new software that uses machine learning to let users edit the text transcript of a video to add, delete, or change the words coming right out of somebody’s mouth.
The work was done by scientists from Stanford University, the Max Planck Institute for Informatics, Princeton University, and Adobe Research, and shows that editing what people say in videos and creating realistic fakes is becoming easier every day.
You can see a number of examples of the system’s output below, including an edited version of a famous quotation from Apocalypse Now, with the line “I love the smell of napalm in the morning” changed to “I love the smell of french toast in the morning.”
This work is just at the research stage right now and isn’t available as consumer software, but it probably won’t be long until similar services go public. Adobe, for example, has already shared details on prototype software named VoCo, which lets users edit recordings of speech as easily as a picture, and which was used in this research.
To create the video fakes, the scientists combine a number of techniques. First, they scan the target video to isolate phonemes spoken by the subject. (These are the constituent sounds that make up words, like “oo” and “fuh.”) They then match these phonemes with corresponding visemes, which are the facial expressions that accompany each sound. Finally, they create a 3D model of the lower half of the subject’s face using the target video.
When someone edits a text transcript of the video, the software combines all this collected data — the phonemes, visemes, and 3D face model — to construct new footage that matches the text input. This is then pasted onto the source video to create the final result.
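As a rough illustration of the matching step, here is a toy Python sketch. It is not the researchers' code: the phoneme labels, the viseme classes, and the function names are all assumptions made up for this example, and the real system works on video frames and a 3D face model rather than simple lists. The sketch only shows the basic idea that several sounds share one mouth shape, so an edit can borrow footage from any part of the source video where the mouth moved the same way.

```python
# Toy illustration only: the phoneme labels, viseme classes, and helper
# names below are assumptions, not the researchers' actual data or code.

# Hypothetical many-to-one mapping from phonemes (sounds) to visemes (mouth shapes).
PHONEME_TO_VISEME = {
    "p": "lips_closed", "b": "lips_closed", "m": "lips_closed",
    "f": "lip_to_teeth", "v": "lip_to_teeth",
    "uw": "rounded", "ow": "rounded",
    "aa": "open", "ae": "open",
    "t": "tongue_front", "d": "tongue_front", "n": "tongue_front",
}


def to_visemes(phonemes):
    """Convert a list of phonemes into the mouth shapes that accompany them."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]


def find_matching_segments(source_phonemes, target_phonemes):
    """Return start positions in the source whose mouth shapes match the target's."""
    source = to_visemes(source_phonemes)
    target = to_visemes(target_phonemes)
    n = len(target)
    return [i for i in range(len(source) - n + 1) if source[i:i + n] == target]


# Phonemes "heard" in the source footage (made-up data).
source = ["m", "aa", "d", "f", "uw", "p", "ae", "n"]

# The edited transcript needs mouth shapes for the sounds "b", "aa", "t".
matches = find_matching_segments(source, ["b", "aa", "t"])
print(matches)
```

In this made-up example, the subject never says "b aa t," but two stretches of the source footage (starting at positions 0 and 5) show the same sequence of mouth shapes, so either could in principle be reused for the new words.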
In tests in which the fake videos were shown to a group of 138 volunteers, some 60 percent of participants thought the edits were real. That may sound quite low, but only 80 percent of that same group thought the original, unedited footage was legitimate. (The researchers note that this might be because the individuals were told their answers were being used for a study on video editing, meaning they’d been primed to look for fakes.)
As ever, though, it’s important to remember there are limitations to what this tech can do.
The algorithms here only work on talking-head-style videos, for example, and require 40 minutes of input data. The edited speech also doesn’t seem like it can differ too much from the source material, and in their best-quality fakes, the researchers asked the subjects to record new audio to match the changes, using AI to generate the video. (This is because audio fakes are sometimes subpar, though the quality is certainly getting much better.)
The researchers also note that they can’t yet change the mood or tone of the speaker’s voice, as doing so would lead to “uncanny results,” and that any occlusions of the face (for example, if someone waves their hands while speaking) throw off the algorithm completely.
So, the technology is not perfect, but these sorts of limitations always feature in early-stage research and it’s almost guaranteed they’ll be overcome in time. That means that society at large will soon have to grapple with the underlying concept this research demonstrates: the arrival of software that lets anyone edit what people say in videos with no technical training.
The potential harms of this technology are hugely worrying, and researchers in this field are often criticized for failing to consider the potential misuse of their work. The scientists involved in this particular project say they’ve considered these problems.
In a blog post accompanying the paper, they write: “Although methods for image and video manipulation are as old as the media themselves, the risks of abuse are heightened when applied to a mode of communication that is sometimes considered to be authoritative evidence of thoughts and intents. We acknowledge that bad actors might use such technologies to falsify personal statements and slander prominent individuals.”
But the remedy they suggest is hardly comforting. They say that, to prevent confusion, AI-edited video should be clearly presented as such, either through the use of watermarking or through context (e.g. an audience understanding that they’re watching a fictional film).
But watermarks are easily removed and a loss of context is one of the hallmarks of online media. Fakes don’t need to be flawless to have an impact either. Plenty of fake news articles can be easily debunked with a few minutes’ research, but that doesn’t stop their spread, especially in communities that want to believe lies that fit their preconceptions.
The researchers note that technology like this has many beneficial uses, too. It would be of great help to the film and TV industries, allowing them to fix misspoken lines without rerecording footage, and create seamless dubs of actors speaking different languages.
But these benefits seem underwhelming compared to the potential damage. Although there’s a good argument to be made that deepfake propaganda isn’t as much of a threat as many believe, the progress made in research like this is still deeply troubling.