Show a human any photograph and they’ll be able to predict what happens next with pretty decent accuracy. The woman riding her bike will keep on moving. The dog will catch the frisbee. The man is going to have a pratfall. And so on. It’s such a basic skill that we don’t consider the vast amount of information used to make these predictions (concerning gravity, inertia, the nature of pratfalls, and so on), and teaching computers to do the same is proving to be a key challenge in machine vision.
Researchers from MIT attempting to solve this problem have come up with some very impressive results, using specially trained neural networks to turn images into videos and getting the computer to essentially predict what happens next. Their model has plenty of limitations — the videos are seconds long, tiny, and often nightmarish — but it’s still an impressive feat of machine imagination, and another step toward computers that understand the world a little more like humans.
The neural net was trained using more than 2 million videos downloaded from Flickr. These were sorted into four types of scenes: golf courses, beaches, train stations, and hospitals (this last category made up of images of babies), and the footage was stabilized to remove camera shake. Using this data, the team’s neural nets were able not only to generate short videos that resembled these scenes (that's the GIF at the top of the page), but also to look at a still and create footage that might follow (that's the GIF below). This is essentially predicting what will happen next, albeit in a limited manner: the model is only guessing how pixels might change, rather than understanding the scene.
Here’s how that looks:
It’s quite easy to see what’s being achieved here and where it falls short. In the beach videos, for example, you can see the waves crashing, and at the train station, the model knows that the train is likely to keep moving past the camera. However, when asked to predict how a human will walk across a golf course, the end results don't look anything like a human. They're blurred, smeared, and unrealistic. The researchers themselves note that the computer’s predictions don’t usually follow "the correct video," but that at least "the motions are plausible."
Getting beyond these plausible-but-obviously-fake videos is going to be tough, but other machine learning systems have made progress in related areas, predicting actions like handshakes and hugs, and even generating sounds that match videos. Facebook’s head of AI, Yann LeCun, addressed this topic in an interview last year, saying that being able to generate future movement, as in the research above, is a "piece of the puzzle" in creating predictive computers, but that true understanding of a video or image and its possible futures will take much more work.
"If you’re watching a Hitchcock movie and I ask, ‘15 minutes from now, what is it going to look like in the movie?’ You have to figure out who the murderer is," said LeCun. "Solving this problem completely will require knowing everything about the world and human nature. That’s what’s interesting about it."