For a less technical explanation with additional context, please see "Mastering Atari Games with Natural Intelligence."
It has been several years since deep reinforcement learning (DRL) algorithms first demonstrated superhuman performance in complex environments. One of the most seminal achievements was mastering Atari games using DRL architectures that learned directly from pixel inputs and reward signals (Mnih et al., 2015). However, these models required millions of data frames—equivalent to months of gameplay—and significant computational resources. This high sample and compute cost limited their applicability to real-world challenges and raised questions about their efficiency compared to human learning.
Humans typically become competent at new Atari games within minutes, leveraging prior knowledge about the world. In contrast, traditional DRL models often lack such priors, leading to inefficient learning processes. This blog post explores an alternative approach that equips models with core human-like priors about object dynamics and leverages Bayesian methods and active inference. Our aim is to achieve human-level performance after only 12 minutes of gameplay—a challenge we call the Atari 10k.
The original Atari challenge sought to create a single DRL architecture capable of learning across 57 diverse games (Mnih et al., 2015). While successful in achieving high performance, these models had several drawbacks. They required extensive training data, often consuming tens of millions of frames to reach optimal performance, making them sample inefficient. The large neural networks and prolonged training times demanded substantial computational resources, rendering them impractical for many applications.
To address sample inefficiency, the Atari 100k challenge was introduced, limiting training data to 100,000 frames—roughly two hours of gameplay (Kaiser et al., 2019). While this encouraged the development of more efficient algorithms, the resulting models remain computationally heavy and do not fully bridge the gap between AI architectures and the efficiency of human learning.
Unlike traditional DRL models, humans do not start from scratch. Humans bring a wealth of prior understanding to new tasks, allowing them to learn efficiently even from limited data. For example, innate aspects of human cognition include figure-ground segregation and object permanence—we perceive the world as composed of distinct objects that persist over time, enabling us to track and predict object behavior. Furthermore, the principle of continuity constitutes a prior belief that objects move along smooth, continuous trajectories unless acted upon by external forces. Additionally, causal reasoning allows us to understand cause-and-effect relationships, helping us predict the consequences of actions. These core priors operate at a level of abstraction above any single task: they are assumptions shared across the large majority of environments.
It is these core priors that enable humans to quickly understand and predict environmental dynamics, facilitating rapid learning in any new environment in which they hold.
Our hypothesis is that embedding similar core priors into AI models can enhance learning efficiency. Adopting a Bayesian framework allows us to formally combine prior knowledge with incoming data in an optimal way. Thus, to increase the efficiency of our Atari models, we endow them with core priors in three ways:
1. Figure-Ground Segregation and Object Permanence: To emulate human perception, we segment Atari frames into a finite set of objects. We do this by employing a Bayesian form of slot attention (see Singh and Buckley, 2023, for the theoretical foundations), but use Coordinate Ascent Variational Inference (CAVI) to achieve computationally efficient inference and continual learning, allowing our systems to optimally segment a stream of pixels into objects (a toy CAVI sketch follows this list).
2. The Principle of Continuity: The principle of continuity assumes that objects move along smooth trajectories, with appearance and disappearance treated as special cases. To model this, we use Recurrent Switching Linear Dynamical Systems (RSLDS) (Linderman et al., 2016), in which objects are assumed to move along linear paths but undergo discrete switches due to collisions or changes of direction. This approach captures essential gameplay mechanics without modeling complex physics, aligning well with the piecewise linear mechanics that underlie the programming of Atari games. Furthermore, an RSLDS is composed of two models from the exponential family, a Gaussian and a categorical distribution, again allowing for fast inference with CAVI (a toy simulation follows this list).
3. Active Inference for Planning and Exploration: Active inference enables goal-directed behavior: agents plan actions to minimize expected free energy, balancing exploration and exploitation. The agent seeks out information that reduces uncertainty, leading to more efficient learning. By treating perception and action as two sides of the same inference process, active inference provides a unified framework for understanding agent behavior. In the context of Atari gameplay, we use the generative model to simulate future states and evaluate potential actions. Agents not only seek to maximize reward but also actively take actions that reduce uncertainty in the model's parameters, leading to epistemic exploration (a minimal planning sketch follows this list).
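To make the segmentation step concrete, here is a minimal sketch of CAVI applied to an isotropic Gaussian mixture over per-pixel features (position plus colour), which stands in for the Bayesian slot-attention model described above. Everything here is illustrative: the function name, the fixed noise variance, and the Dirichlet prior on mixture weights are assumptions made for this sketch rather than details of our actual architecture.

```python
import numpy as np
from scipy.special import digamma

def cavi_segment(features, n_slots=6, n_iters=50,
                 prior_var=10.0, noise_var=0.5, alpha=1.0, seed=0):
    """Soft-assign pixels to object slots with Coordinate Ascent Variational Inference.

    features : (N, D) array of per-pixel features, e.g. (x, y, r, g, b).
    Model    : isotropic Gaussian mixture with a zero-mean Gaussian prior on slot
               means and a symmetric Dirichlet(alpha) prior on mixture weights.
    Returns responsibilities (N, n_slots) and posterior slot means (n_slots, D).
    """
    N, D = features.shape
    rng = np.random.default_rng(seed)
    resp = rng.random((N, n_slots))
    resp /= resp.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # q(mu_k): Gaussian posterior over each slot's mean.
        Nk = resp.sum(axis=0) + 1e-8
        post_var = 1.0 / (1.0 / prior_var + Nk / noise_var)          # (n_slots,)
        post_mean = post_var[:, None] * (resp.T @ features) / noise_var

        # q(pi): Dirichlet posterior over mixture weights.
        conc = alpha + Nk
        E_log_pi = digamma(conc) - digamma(conc.sum())

        # q(z_n): responsibilities, using the expected log-likelihood under q(mu).
        sq_dist = ((features[:, None, :] - post_mean[None, :, :]) ** 2).sum(-1)
        log_rho = E_log_pi - 0.5 * (sq_dist + D * post_var) / noise_var
        log_rho -= log_rho.max(axis=1, keepdims=True)                 # numerical stability
        resp = np.exp(log_rho)
        resp /= resp.sum(axis=1, keepdims=True)

    return resp, post_mean

# Example: build per-pixel features from an (H, W, 3) frame and segment it.
# ys, xs = np.mgrid[0:frame.shape[0], 0:frame.shape[1]]
# feats = np.column_stack([xs.ravel(), ys.ravel(), frame.reshape(-1, 3)]).astype(float)
# resp, slot_means = cavi_segment(feats)
# labels = resp.argmax(axis=1).reshape(frame.shape[:2])   # one object id per pixel
```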
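Below is a toy forward simulation of the kind of recurrent switching linear dynamical system described in point 2: each discrete mode has its own linear dynamics, and the probability of switching modes depends on the current continuous state through a softmax. The parameters (a one-dimensional "ball" with push-right and push-left modes) are invented for illustration, and the dependence of the switch on the previous discrete mode is dropped for simplicity; in our system these parameters are learned from gameplay with CAVI rather than hand-written.

```python
import numpy as np

def softmax(v):
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

def simulate_rslds(A, b, Q, R, r, x0, z0, T, seed=0):
    """Roll out a recurrent switching linear dynamical system.

    A[k], b[k], Q[k] : per-mode dynamics x_t = A[k] x_{t-1} + b[k] + noise, noise ~ N(0, Q[k]).
    R, r             : switching parameters; p(z_t | x_{t-1}) = softmax(R x_{t-1} + r).
    Returns the discrete mode sequence z (length T) and continuous states x (T, D).
    """
    rng = np.random.default_rng(seed)
    D = x0.shape[0]
    x = np.zeros((T, D))
    z = np.zeros(T, dtype=int)
    x[0], z[0] = x0, z0
    for t in range(1, T):
        # The "recurrent" part: switch probabilities depend on the previous continuous state.
        probs = softmax(R @ x[t - 1] + r)
        z[t] = rng.choice(len(probs), p=probs)
        k = z[t]
        x[t] = A[k] @ x[t - 1] + b[k] + rng.multivariate_normal(np.zeros(D), Q[k])
    return z, x

# Toy parameters: a 1-D "ball" with state (position, velocity) and two modes, one nudging it
# rightwards and one leftwards; switches become likely near the walls, so the rollout traces
# out the piecewise linear, bouncing trajectories described above.
A = [np.array([[1.0, 1.0], [0.0, 1.0]])] * 2            # constant-velocity dynamics in both modes
b = [np.array([0.0, 0.05]), np.array([0.0, -0.05])]      # small push right / push left
Q = [1e-4 * np.eye(2)] * 2
R = np.array([[-4.0, 0.0], [4.0, 0.0]])                  # switch logits depend on position only
r = np.array([4.0, -4.0])
z, x = simulate_rslds(A, b, Q, R, r, x0=np.array([0.0, 0.1]), z0=0, T=200)
```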
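Finally, a minimal sketch of expected-free-energy planning under the usual decomposition into a pragmatic term (expected reward, standing in for log-preferences over outcomes) and an epistemic term (expected information gain). Here the information gain is approximated by disagreement across an ensemble of sampled transition models, a common heuristic proxy; the transition functions, reward function, and horizon are placeholders, not our actual implementation.

```python
import numpy as np
from itertools import product

def expected_free_energy(policy, state, ensemble, reward_fn, rng):
    """Score an action sequence: low EFE = high expected reward + high expected information gain.

    ensemble  : list of transition functions f(state, action, rng) -> next_state, e.g.
                simulators drawn from the posterior over model parameters.
    reward_fn : maps a predicted state to a scalar reward encoding the agent's preferences.
    """
    pragmatic, epistemic = 0.0, 0.0
    for action in policy:
        preds = np.stack([f(state, action, rng) for f in ensemble])   # (n_models, state_dim)
        mean_pred = preds.mean(axis=0)
        pragmatic += reward_fn(mean_pred)          # expected reward under the averaged model
        epistemic += preds.var(axis=0).sum()       # parameter uncertainty this policy would expose
        state = mean_pred                          # roll the mean prediction forward
    return -(pragmatic + epistemic)

def plan(state, actions, ensemble, reward_fn, horizon=3, seed=0):
    """Exhaustively score short action sequences and return the first action of the best one."""
    rng = np.random.default_rng(seed)
    best = min(product(actions, repeat=horizon),
               key=lambda pol: expected_free_energy(pol, state, ensemble, reward_fn, rng))
    return best[0]

# Usage with hypothetical learned models:
# next_action = plan(current_state, actions=[0, 1, 2],
#                    ensemble=posterior_model_samples, reward_fn=score_estimate)
```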
Core priors allow us to model dynamics quickly across diverse games. The figure shows examples of our agents' object-centric "imagination" for Pong, Boxing, and Freeway, demonstrating object segmentation, imagined piecewise linear trajectories, and planning. The fading dots show the agent's beliefs about the objects and their future positions.
While we intend to apply our architecture to the Atari 100k challenge, to demonstrate a key feature of this natural approach to intelligence, namely its ability to learn gradually and achieve competence from very small amounts of data, we also introduce the Atari 10k challenge: achieving or exceeding human-level performance after only 10,000 frames (approximately 12 minutes of gameplay). This benchmark aligns with the time it takes humans to become competent at new games. To make the comparison a strong one, we restrict the challenge to games that humans can play well within 12 minutes. Specifically, we avoid games with extremely sparse rewards, since frequent feedback facilitates faster learning, and we exclude games with long narrative arcs, which do not suit short training periods and are not in the spirit of the 10k challenge.
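For readers checking the timing arithmetic, and assuming the usual Atari convention of counting agent interactions with each chosen action repeated over four emulator frames at 60 Hz: 10,000 interactions correspond to roughly 40,000 emulator frames, i.e. 40,000 / 60 ≈ 667 seconds, or about 11–12 minutes of real-time play; the same calculation gives roughly two hours for the 100k benchmark.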
Early quantitative results on Pong: the interquartile mean (IQM, large symbols) of human-normalized score for our architecture versus the competitor architecture IRIS. Thanks to core priors, we achieve almost perfect gameplay on Pong after only 10k interactions, with much smaller parameter counts and orders of magnitude less compute time. Note: for our results (green) and IRIS (red) we also plot the individual random seeds (small dots) alongside the IQM. This is work in progress, and the spread of our architecture's results shows a sensitivity to the random seed, something we hope to eliminate as we refine our active exploration strategies. Note, however, that even our worst runs are competitive with IRIS in the low-data regime.
Embracing human-inspired priors, Bayesian methods, and active inference in AI models offers a promising path toward more efficient learning. By modeling environments in terms of objects and their interactions, and utilizing probabilistic inference techniques, we can narrow the gap between human and AI learning speeds. The Atari 10k challenge serves as a step toward developing AI systems that learn as swiftly and effectively as humans, opening doors to applications in real-world scenarios such as robotics, autonomous driving, and other fields where data and compute resources are limited.
This is only a brief glimpse of the research we are doing at VERSES, and our future progress toward the Atari 10k challenge will be reported in future blog posts. As we continue to refine these models and address the associated challenges, we move closer to AI systems that not only match human performance in specific tasks but also emulate the underlying processes that make human learning so efficient and adaptable.