When playing a video game, what motivates you to carry on?
This question is perhaps too broad to yield a single answer, but if you had to sum up why you accept that next quest, jump into a new level, or cave and play just one more turn, the simplest explanation might be “curiosity” — just to see what happens next. And as it turns out, curiosity is a very effective motivator when teaching AI to play video games, too.
In a game without rewards, teaching AI is difficult
Research published this week by artificial intelligence lab OpenAI explains how an AI agent with a sense of curiosity outperformed its predecessors playing the classic 1984 Atari game Montezuma’s Revenge. Becoming skilled at Montezuma’s Revenge is not a milestone equivalent to beating Go or Dota 2, but it’s still a notable advance. When the Google-owned DeepMind published its seminal 2015 paper explaining how it beat a number of Atari games using deep learning, Montezuma’s Revenge was the only game it scored 0 percent on.
The reason for the game’s difficulty is a mismatch between the way it plays and the way AI agents learn, which also reveals a blind spot in machine learning’s view of the world.
Usually, AI agents rely on a training method called reinforcement learning to master video games. In this paradigm, agents are dumped into a virtual world and rewarded for some outcomes (like increasing their score) and penalized for others (like losing a life). The agent starts out playing the game randomly, but learns to improve its strategy through trial and error. Reinforcement learning is often thought of as a key method for building smarter robots.
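To make the trial-and-error loop concrete, here is a minimal sketch in Python. It uses tabular Q-learning, one simple form of reinforcement learning, on a made-up five-cell corridor where only the last cell pays a reward. The corridor, the function names, and the parameter values are all assumptions chosen for illustration; this is not the algorithm or the environment used in OpenAI’s research.

```python
import random

# Hypothetical toy world: a corridor of 5 cells; only reaching the last cell pays a reward.
N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)  # index 0 = step left, index 1 = step right


def step(state, action_index):
    """Move one cell (walls clamp the position); reward 1.0 only at the goal."""
    next_state = min(max(state + ACTIONS[action_index], 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a table of action values by playing the corridor over and over."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _episode in range(episodes):
        state = 0
        for _move in range(200):  # cap on episode length
            left, right = q[state]
            if rng.random() < epsilon or left == right:
                action = rng.randrange(2)  # explore, or break ties at random
            else:
                action = 0 if left > right else 1  # exploit what was learned
            next_state, reward, done = step(state, action)
            # Q-learning update: move the estimate toward reward + discounted future value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train()
policy = ["left" if left > right else "right" for left, right in q_table[:GOAL]]
print(policy)
```

The agent begins by wandering at random, because every entry in its table is zero. Once it stumbles on the goal, the reward propagates backward through the table and the learned policy heads right from every cell.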
The problem with Montezuma’s Revenge is that it doesn’t provide regular rewards for the AI agent. It’s a puzzle-platformer where players have to explore an underground pyramid, dodging traps and enemies while collecting keys that unlock doors and special items. If you were training an AI agent to beat the game, you could reward it for staying alive and collecting keys, but how do you teach it to save certain keys for certain items, and use those items to overcome traps and complete the level?
The answer: curiosity.
In OpenAI’s research, the agent was rewarded not just for leaping over pits of spikes, but for exploring new parts of the pyramid. This led to better-than-human performance, with the bot earning a mean score of 10,000 over nine runs (compared to an average human score of 4,000). In one run, it even completed the first of the game’s nine levels.
“There’s definitely still a lot of work to do,” OpenAI’s Harrison Edwards tells The Verge. “But what we have at the moment is a system that can explore lots of rooms, get lots of rewards, and occasionally get past the first level.” He adds that the game’s other levels are similar to the first, so playing through the whole thing “is just a matter of time.”
Beating the “Noisy TV problem”
OpenAI is far from the first lab to try this approach, and AI researchers have been leveraging the concept of “curiosity” as motivation for decades. They’ve also applied it to Montezuma’s Revenge before, though never so successfully without teaching AI to learn from human examples.
However, while the general theory here is well-established, building specific solutions is still challenging. Prediction-based curiosity, for instance, is only useful when learning to play certain types of games. It works for titles like Mario, where there are big levels to explore, full of never-before-seen bosses and enemies. But in simpler games like Pong, AI agents prefer to play long rallies rather than actually beat their opponents. (Perhaps because winning the game is more predictable than following the path of the ball.)
AI can become addicted to random rewards, just like humans
Another issue is the “Noisy TV problem,” in which AI agents that have been programmed to seek out new experiences get addicted to random patterns, such as a TV tuned to static noise. This is because these agents’ sense of what is “interesting” and “new” comes from their ability to predict the future. Before they take a certain action, they predict what the game will look like afterwards. If they guess correctly, chances are they’ve seen this part of the game before. If they guess wrong, the gap between prediction and reality, known as the “prediction error,” signals that they’ve found something new.
But because static noise is unpredictable, the result is that any AI agent confronted with such a TV (or a similarly unpredictable stimulus) becomes mesmerized. OpenAI compares the problem to human gamblers who are addicted to slot machines, unable to tear themselves away because they don’t know what’s going to happen next.
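A small Python sketch shows why static is so sticky. The `ForwardModel` class below is a hypothetical stand-in for a learned predictor: it keeps a single running guess of the next observation and pays out a curiosity reward equal to its squared prediction error. The names, the one-number “observations,” and the learning rate are assumptions for illustration, not details from the research.

```python
import random


class ForwardModel:
    """Hypothetical stand-in for a learned predictor: a running guess of the next observation."""

    def __init__(self, learning_rate=0.1):
        self.guess = 0.0
        self.learning_rate = learning_rate

    def curiosity_reward(self, observation):
        """Return the squared prediction error, then nudge the guess toward what was seen."""
        error = (observation - self.guess) ** 2
        self.guess += self.learning_rate * (observation - self.guess)
        return error


def average_late_reward(next_observation, steps=500, tail=100):
    """Average curiosity reward over the last `tail` steps of watching one source."""
    model = ForwardModel()
    rewards = [model.curiosity_reward(next_observation()) for _ in range(steps)]
    return sum(rewards[-tail:]) / tail


rng = random.Random(0)
blank_wall = average_late_reward(lambda: 0.5)          # always the same pixel value
noisy_tv = average_late_reward(lambda: rng.random())   # static: a fresh random value every frame

print(f"blank wall: {blank_wall:.6f}")
print(f"noisy TV:   {noisy_tv:.6f}")
```

The blank wall stops paying out almost immediately, because the model learns to predict it. The noisy TV keeps paying out indefinitely, since no amount of watching makes random static predictable.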
This new research from OpenAI sidesteps this issue by varying how the AI predicts the future. The exact methodology (named Random Network Distillation) is complex, but Edwards and his colleague Yuri Burda compare it to hiding a secret for the AI to find in every screen of the game. That secret is random and meaningless (something like, “what is the color in the top left of the screen?” suggests Edwards), but it motivates the agent to explore without leaving it vulnerable to the Noisy TV trap.
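The “hidden secret” idea can be sketched in a few lines of Python. In this simplified version, a fixed, randomly initialized target network assigns every observation an arbitrary output, and a second predictor network is trained to reproduce that output; the remaining mismatch serves as the novelty bonus. This is a loose illustration under strong assumptions: the real method uses neural networks on game frames, while this sketch uses single linear layers on hand-made “room” vectors, and every name and number in it is hypothetical.

```python
import random

rng = random.Random(0)
N_FEATURES, N_OUTPUTS = 4, 3

# Target network: fixed random weights that are never trained (the "secret" for each screen).
target_w = [[rng.gauss(0, 1) for _ in range(N_FEATURES)] for _ in range(N_OUTPUTS)]
# Predictor network: starts at zero and is trained to reproduce the target's output.
predictor_w = [[0.0] * N_FEATURES for _ in range(N_OUTPUTS)]


def forward(weights, obs):
    """Apply one linear layer to an observation vector."""
    return [sum(w * x for w, x in zip(row, obs)) for row in weights]


def novelty_bonus(obs):
    """Squared mismatch between predictor and target: large for unfamiliar observations."""
    target, guess = forward(target_w, obs), forward(predictor_w, obs)
    return sum((t - g) ** 2 for t, g in zip(target, guess))


def train_predictor(obs, learning_rate=0.1):
    """One gradient step on half the squared error, pulling the predictor toward the target."""
    target, guess = forward(target_w, obs), forward(predictor_w, obs)
    for j in range(N_OUTPUTS):
        for k in range(N_FEATURES):
            predictor_w[j][k] += learning_rate * (target[j] - guess[j]) * obs[k]


# Hypothetical observations: each room is encoded as a one-hot vector.
room_a = [1.0, 0.0, 0.0, 0.0]
room_b = [0.0, 1.0, 0.0, 0.0]

bonus_a_before = novelty_bonus(room_a)
for _visit in range(300):  # the agent keeps revisiting room A
    train_predictor(room_a)
bonus_a_after = novelty_bonus(room_a)
bonus_b = novelty_bonus(room_b)  # room B has never been visited

print(f"room A before: {bonus_a_before:.4f}, after: {bonus_a_after:.4f}")
print(f"room B (unvisited): {bonus_b:.4f}")
```

After repeated visits, the bonus for room A shrinks toward zero while the unvisited room B keeps its full bonus, which is what nudges the agent toward unexplored territory. Because the target is a fixed function of the current screen rather than a guess about the future, the predictor has something it can actually learn.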
Just as importantly, this motivator doesn’t require a lot of calculation. These reinforcement learning methods rely on huge amounts of data to train AI agents (OpenAI’s bot, for example, had to play Montezuma’s Revenge for the real-time equivalent of three years), so every step of the journey needs to be as quick as possible.
Arthur Juliani, a software engineer at Unity and machine learning expert, says this is what makes OpenAI’s work impressive. “The method they use is really quite simple and therefore surprisingly effective,” Juliani tells The Verge. “It is actually much simpler than other methods of exploration which have been applied to the game in the past (and [which have] not led to nearly as impressive results).”
Juliani says that given the similarities between different levels in Montezuma’s Revenge, OpenAI’s work is “essentially equivalent” to solving the game, but he adds that “the fact that they aren’t able to consistently beat the first level means that there is still some of an open challenge left.” He also wonders whether their approach will work in 3D games, where visual features are more subtle and a first-person view occludes much of the world.
“In scenarios where exploration is required, but the differences between parts of the environment are more subtle, the method may not perform as well,” says Juliani.
The point of curiosity
But why do we need curious AI in the first place? What good does it do us, apart from providing humorous parallels to our human tendency to get ensnared by random patterns?
The big reason is that curiosity helps computers learn on their own.
Most machine learning approaches deployed today can be split into two camps: in the first, machines learn by looking at piles of data, working out patterns they can apply to similar problems; and in the second, they’re dropped into an environment and rewarded for achieving certain outcomes using reinforcement learning.
Both of these approaches are effective at specific tasks, but they also require a lot of human labor, either labeling training data or designing reward functions for virtual environments. By giving AI systems an intrinsic incentive to explore for exploration’s sake, some of this work is eliminated and humans spend less time holding their AI agents’ hands. (Metaphorically speaking.)
OpenAI’s Edwards and Burda say that this sort of curiosity-driven learning system is much better for building computer programs that have to operate in the real world. After all, in reality, as in Montezuma’s Revenge, immediate rewards are often scarce, and we need to work, learn, and explore for long periods of time before we get anything in return. Curiosity helps us keep going, and maybe it can help computers, too.