Mastering Atari Games with Natural Intelligence
Steven Swanson : Jan 20, 2025 1:07:25 AM
Genius™-Powered Agents Outperform World’s Leading AI Algorithms with New Approach in Industry Benchmark
Highlights
- Atari 100k benchmark tests the ability of software agents to efficiently learn to meet or exceed human performance on a variety of classic video games
- VERSES’ agent, powered by Genius, trained for 2 hours on 90% less data than Atari 100k (thus Atari 10k) on several games
- Genius Agent matched or exceeded performance of top ranking models that were trained on 10x the data and many times more compute
- Genius Agent exceeded human-level performance and achieved perfect play (scoring all 20 points) at Pong multiple times
- Genius Agent outperformed a leading model, IRIS, with a 96% smaller model size
- The same Genius Agent framework trained on 2 other Atari Games exhibited superior competency
- Our results are preliminary and unoptimized but mark a milestone demonstration of the first hyper-efficient Bayesian-based agent successfully solving high-dimensional games in a generalized way
Introduction
The quest to create systems that exceed human-level intelligence dates back to ancient mythology, and in the modern era games have become a useful benchmark for machine intelligence. In 1997 IBM’s Deep Blue defeated Garry Kasparov at chess, a game with well-defined rules and a discrete state space. Systems like Deep Blue excel at brute-force computation, evaluating millions of possible moves to find the best one. In 2016 Google’s AlphaGo defeated Lee Sedol at Go, a game with more possible board configurations than there are atoms in the observable universe. AlphaGo demonstrated the power of deep reinforcement learning combined with Monte Carlo tree search and marked a leap from brute-force computation to AI capable of pattern recognition and strategic planning.
Conquering chess and Go were milestone achievements, but neither game emulates the complex dynamics of the real world, where change is constant and one must adapt to changing conditions and even changing rules. Video games have emerged as a new test for intelligence: they are controlled environments whose rules must be learned through interaction, and successful gameplay requires planning. Atari games are the gold standard for assessing an agent’s ability to model and navigate complex dynamic systems.
In 2013, DeepMind released a paper outlining how its reinforcement-learning-based model, DQN, could play Atari games at human or superhuman levels; however, it required hundreds of millions of environment steps to train. In 2020, DeepMind published a paper on how Agent57, also based on deep reinforcement learning, surpassed the human baseline on 57 Atari games; however, it required training on nearly 80 billion frames. The Atari 100k challenge was created as a more sample-efficient test: agents must attain similar gameplay competency using just 100,000 steps, equivalent to roughly 2 hours of gameplay.
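As a rough sanity check on the "100,000 steps is about 2 hours" equivalence, the conversion can be sketched as below. The frame skip of 4 and the 60 frames-per-second rate are assumptions based on the common Atari 100k convention, not values stated in this article, and `gameplay_hours` is a hypothetical helper name.

```python
def gameplay_hours(agent_steps: int, frame_skip: int = 4, fps: int = 60) -> float:
    """Convert agent steps into hours of real-time gameplay.

    frame_skip and fps are assumptions (common Atari 100k convention),
    not values taken from the article.
    """
    emulator_frames = agent_steps * frame_skip  # each agent step spans several frames
    seconds = emulator_frames / fps
    return seconds / 3600


if __name__ == "__main__":
    for steps in (100_000, 10_000):
        hours = gameplay_hours(steps)
        print(f"{steps:,} steps ~ {hours:.2f} h ({hours * 60:.0f} min) of gameplay")
```

Under these assumptions, 100k steps come to a little under 2 hours of gameplay and 10k steps to roughly 11 minutes. This is game time experienced by the agent, which is separate from wall-clock training time.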
In early 2024, we stated our goal of demonstrating how agents, powered by Genius, can match or exceed performance of state-of-the-art methods on the Atari 100k Challenge using 1/10th of the amount of training data (thus Atari 10k) and dramatically less compute while generating a model a fraction of the size of top ranked models powered by deep reinforcement learning (DRL) or transformers.
If, metaphorically, the methods used by DQN and Agent57 are gas-guzzling Hummers and those used to tackle the Atari 100k challenge are like a fuel-efficient Prius, then our approach used on Atari 10k is like a Tesla, a hyper-efficient alternative architecture. Before diving into the details, let’s take a moment to understand why Atari 10k is such a big deal.
Why Atari 10k Matters
The original Atari 100k benchmark was established to test an agent’s ability to excel, with limited training data, in three critical areas: interactivity, generalization, and efficiency.
- Interactivity measures how well an agent learns and adapts in dynamic environments where its actions directly influence outcomes. In Atari games, agents must act in real time, respond to feedback, and adjust their behavior to succeed. This mirrors real-world scenarios, where adaptability is essential.
- Generalization evaluates an agent’s ability to apply learned strategies across diverse games with different rules and challenges. It ensures the agent isn’t overfitting to a single task and can perform well across various domains, demonstrating true adaptability.
- Efficiency focuses on how quickly an agent can learn effective strategies with limited data and compute. The 100k step constraint highlights the importance of learning efficiently—critical for real-world applications where data is often scarce.
Any competent and experienced developer can write a custom program to solve games and logic puzzles. As evidenced by DeepMind’s original DQN approach and even the more efficient Atari 100k methods, conventional machine learning can be tuned and fitted to master games like Atari’s given enough human intervention, data and brute-force compute. In contrast, Genius Agents demonstrate the same capabilities but, crucially, do so by figuring out how to play games on their own and with 90% less data than Atari 100k. This is more directly relevant to real-world problems, where data can be sparse, incomplete, noisy, and changing in real time.
What Were the Results?
To provide an apples-to-apples comparison against state-of-the-art (SOTA) machine learning, we selected the model-based IRIS for these initial tests. IRIS is built on the transformer architecture introduced in 2017, which in turn is the foundation of generative AI and LLMs such as GPT, Claude, Gemini, Llama, Grok and others. It was the fastest for us to deploy, which let us focus energy on advancing our own research rather than reproducing others’ work. The top 2 performers on Atari 100k, EfficientZero and BBF, are based on deep reinforcement learning, the same approach that underpins work such as DeepMind’s AlphaZero, AlphaGo and AlphaFold.
We trained both our Genius Agent and IRIS on 10,000 steps of gameplay with a 2-hour training budget (10k/2h). We also benchmarked Genius Agent 10k/2h against IRIS trained on the same 10k steps but over 2 days (10k/2d). Finally, we compared Genius Agent 10k/2h performance, as measured by Human Normalized Score (HNS), against the publicly available results for BBF and EfficientZero, which were trained on the full 100k steps. An HNS of 1.0 is considered human-level performance; in Pong, for example, it corresponds to a human player scoring an average of 14.6 points against the computer after 2 hours of practice time (roughly the equivalent of 100k samples).
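For readers unfamiliar with HNS, the usual definition rescales a raw game score so that a random policy maps to 0.0 and the human reference maps to 1.0. The sketch below assumes that standard formula. The Pong human reference of 14.6 comes from the text above, while the random-policy baseline of -20.7 is the commonly cited value and is an assumption here, as is treating the game score as the point margin.

```python
def human_normalized_score(agent: float, random: float, human: float) -> float:
    """Rescale a raw score so that random play = 0.0 and human play = 1.0."""
    if human == random:
        raise ValueError("human and random baselines must differ")
    return (agent - random) / (human - random)


PONG_HUMAN = 14.6    # human reference score, from the article
PONG_RANDOM = -20.7  # assumption: commonly cited random-policy baseline

if __name__ == "__main__":
    # Hypothetical raw scores (point margins), for illustration only.
    for label, raw in [("random play", PONG_RANDOM),
                       ("human reference", PONG_HUMAN),
                       ("20-0 win", 20.0)]:
        hns = human_normalized_score(raw, PONG_RANDOM, PONG_HUMAN)
        print(f"{label}: HNS = {hns:.2f}")
```

Because the scale is anchored per game, HNS values can be compared or averaged across games whose raw scores have very different ranges.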
Unlike transformers and DRL, Genius does not require a powerful GPU for training. However, to ensure an equal comparison, all training was performed on a single NVIDIA A100 GPU running in the AWS cloud. The resulting training times and model sizes are shown below. Note that regardless of training time, IRIS’ trained model had 8,000,000 parameters while our Genius Agent’s model has 350,000, about 96% fewer.
Configuration | Genius 2h | IRIS 2h | IRIS 2d | IRIS 100k |
Training Steps | 10k | 10k | 10k | 100k |
Training Time | 2 Hours | 2 Hours | 2 Days | 6 Days |
Model Parameters | 350,000 | 8,000,000 | 8,000,000 | 8,000,000 |
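The "96% smaller" figure follows directly from the parameter counts in the table above. A minimal check, using only those two numbers (the helper name `relative_reduction` is ours):

```python
def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which `candidate` is smaller than `baseline`."""
    return 1.0 - candidate / baseline


IRIS_PARAMS = 8_000_000   # from the table above
GENIUS_PARAMS = 350_000   # from the table above

if __name__ == "__main__":
    reduction = relative_reduction(IRIS_PARAMS, GENIUS_PARAMS)
    ratio = IRIS_PARAMS / GENIUS_PARAMS
    print(f"Genius has {reduction:.0%} fewer parameters (IRIS is about {ratio:.0f}x larger)")
```

The exact value is 95.625%, which rounds to the 96% quoted in the text.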
Qualitative Pong gameplay: IRIS 10k/2h | IRIS 10k/2d | Genius 10k/2h
IRIS 10k/2h appears to twitch in a corner, while IRIS 10k/2d shows some competency, ranging between HNS 0.0 and 0.3. Genius 10k/2h achieved HNS 1.0+ and perfect play (scoring all 20 points) multiple times.
Genius Agent Perfect Play
The chart below shows highest attainable performance as measured by HNS for IRIS and Genius given respective sample data and training time. Genius scores are preliminary and unoptimized.
Since IRIS 10k/2h shows no gameplay competency, here we show qualitative sample games of Pong for IRIS 10k/2d and Genius Agent, each playing against the computer. In this particular instance of gameplay, IRIS scored 6 points to the computer’s 20, while Genius scored 20 points to the computer's 6, winning the game.
See Genius learn Pong in this timelapse where, over the course of 10,000 steps, the games ended 20-0, 20-0, 20-1, 20-10 and 14-15 (computer’s score first). In the 5th game, just before 9,000 steps, the computer had scored 14 points to Genius’ 3, whereupon Genius began scoring all the points before running out of training steps. This demonstrates the kind of progressive online learning that Genius excels at.
We applied the same general agent architecture to several other games, each with very different objects and dynamics. Here, we show Boxing and Freeway. In Boxing, the player (white) scores points by punching the opponent (black).
Qualitative Boxing gameplay: IRIS 10k/2h | IRIS 10k/2d | Genius 10k/2h
In Freeway, the player tries to move the chicken across the road while avoiding cars coming from different directions at different speeds. Both IRIS 10k/2h and IRIS 10k/2d appear to exhibit random behavior, never successfully crossing the road. In contrast, Genius appears to exhibit an understanding of the game’s objects and dynamics, consistently and successfully navigating traffic.
Qualitative Freeway gameplay: IRIS 10k/2h | IRIS 10k/2d | Genius 10k/2h
Qualitative results from a sampling of game runs
Game | Genius 2h | IRIS 2h | IRIS 2d |
Pong | 6 vs 20 | 20 vs 0 | 20 vs 6 |
Boxing | 63 vs 83 | 45 vs 46 | 69 vs 37* |
Freeway** | 17 | 0 | 0 |
Column headings give the training time. For consistency, all scores are written as Computer vs Player.
* We suspect IRIS 2d overfit to the training data, resulting in worse-than-random performance despite roughly 24x the training time.
** Road crossings.
Benchmarks and challenges like Atari 100k / 10k, ARC-AGI and others can generate interesting signals about a model, algorithm or framework, but no single test yet exists to measure general intelligence in its various dimensions: Cognitive, Physical, Social, and Emotional Intelligences. There is also a risk of overfitting (overoptimizing) a model to beat a specific benchmark, which may look impressive at face value but says nothing about the model’s generalizability, efficiency or applicability to real-world use cases. Instead, a variety of tests is necessary to measure the applicability, reliability, adaptability, sustainability, explainability, scalability and other capabilities of a given architecture. Because there is no single test, we will continue running tests from time to time on a mix of other games and challenges that showcase generalizability, which is essential for tackling real-world problems that are hard, fuzzy, and pervasive.
How It Works
The SOTA ML architectures and methods that have dominated the field for well over a decade, such as transformers, neural nets, deep learning, and reinforcement learning, are all derived from a data-centric, power-hungry approach. They are optimized for von Neumann architectures that act as giant input-output data processors, resulting in compute and energy inefficiencies as the data grows. They underpin LLMs, generative AI, autonomous vehicles, humanoid robots, protein folding and more, as well as every contender on the Atari 100k Challenge leaderboard.
VERSES research is not simply an incremental variation of state-of-the-art machine learning. We apply Professor Karl Friston’s Free Energy Principle, the Active Inference framework, and Bayesian Inference architectures. These are grounded in the neuroscience behind biological intelligence, which views intelligent systems not as mere input/output machines that passively process data but as prediction engines that efficiently learn by measuring the difference between their expectation and their sensory data. Their core objective is to continuously reduce uncertainty about their environment by learning to understand the hidden cause-effect dynamics underlying the observations they perceive in order to better predict outcomes and select optimal actions.
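The idea of learning from the gap between expectation and sensory data can be illustrated with a toy Bayesian belief update. This is a minimal sketch, not the Genius implementation: the two hidden states, the likelihood table, and the function name `bayes_update` are all hypothetical. It reports "surprise" (the negative log probability of the observation), the quantity that variational free energy upper-bounds in the Active Inference literature.

```python
import math


def bayes_update(prior, likelihood, observation):
    """One step of Bayes' rule over discrete hidden states.

    prior[s] = P(s); likelihood[s][o] = P(o | s).
    Returns (posterior, surprise), where surprise = -log P(o).
    """
    joint = [p * likelihood[state][observation] for state, p in enumerate(prior)]
    evidence = sum(joint)  # P(o): how expected the observation was
    posterior = [j / evidence for j in joint]
    return posterior, -math.log(evidence)


if __name__ == "__main__":
    # Hypothetical toy problem: is the ball heading "up" (state 0) or "down" (state 1)?
    belief = [0.5, 0.5]
    likelihood = [[0.9, 0.1],   # P(observation | up)
                  [0.2, 0.8]]   # P(observation | down)
    for step in range(3):
        belief, surprise = bayes_update(belief, likelihood, observation=0)
        print(f"step {step}: P(up) = {belief[0]:.3f}, surprise = {surprise:.3f}")
```

As the same observation repeats, the belief concentrates on the state that explains it and the surprise shrinks, which is the sense in which such an agent "reduces uncertainty" about hidden causes.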
Active Inference
This alternative path of applying neuroscience-based methodologies and biologically plausible techniques to Atari began in 2022, when Professor Friston and our colleagues from Cortical Labs demonstrated how lab-grown brain cells, called “DishBrain”, learned to play Pong, demonstrating that neurons apply the Free Energy Principle and operate using Active Inference. A 2023 paper published in Nature Communications confirmed the quantitative predictions of the Free Energy Principle using in vitro networks of rat cortical neurons that perform causal inference. In early 2024 we applied the same underlying Active Inference mechanics demonstrated with DishBrain to playing Pong purely in software.
Active Inference and its application of Bayesian models and algorithms represent a fundamentally different architecture for AI that is, by design, radically more effective and efficient than SOTA ML. SOTA methods require oceans of data and compute in order to pre-train models that are opaque and can sometimes confidently generate incorrect answers [1, 2, 3]. In contrast, Genius implements Bayesian Inference architectures that we have pioneered [see CAVI-CMN, VBGS] that facilitate sample efficiency and continual learning and gracefully incorporate prior knowledge with new data. The result is a new approach to advanced machine intelligence that is inherently reliable, explainable, sustainable, flexible and scalable.
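To make "gracefully incorporate prior knowledge with new data" concrete, here is a textbook conjugate update (Beta-Bernoulli), chosen for brevity. It is not CAVI-CMN or VBGS, and the scenario and numbers are hypothetical. The point it illustrates is that a Bayesian learner can absorb data batch by batch and arrive at the same belief as if it had seen everything at once, with uncertainty shrinking as evidence accumulates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beta:
    """Beta(alpha, beta) belief over an unknown success probability."""
    alpha: float
    beta: float

    def update(self, successes: int, failures: int) -> "Beta":
        # Conjugacy: the posterior is again a Beta with the counts added on.
        return Beta(self.alpha + successes, self.beta + failures)

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        total = self.alpha + self.beta
        return self.alpha * self.beta / (total ** 2 * (total + 1))


if __name__ == "__main__":
    # Hypothetical: how often does the paddle return the ball?
    belief = Beta(2, 2)  # weak prior knowledge centered on 0.5
    for hits, misses in [(7, 3), (8, 2)]:  # two batches of new data
        belief = belief.update(hits, misses)
        print(f"mean = {belief.mean:.3f}, variance = {belief.variance:.5f}")
```

Updating in two batches gives exactly the same posterior as a single update with all 15 hits and 5 misses, which is what makes this style of learning naturally continual.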
In the world of statistics, machine learning, and artificial intelligence, Bayesian inference is viewed as a powerful and elegant framework due to its principled, probabilistic methods for reasoning under uncertainty, but the computational demands have made it challenging to implement beyond toy problems until now. The novel algorithms that underpin the Atari 10k results represent a transcendence of this barrier and offer a universal architecture on which to build many efficient general intelligence agents that learn to develop expertise. We believe this marks the beginning of a Bayesian revolution and a natural approach to machine intelligence.
From Research to Product
Great strides have been made in the last several years in AI (computer vision, motor control models, LLMs, and natural language comprehension), but for autonomous agents to operate reliably and safely in the real world and harness the benefits of various AI systems, something is missing. We believe that Genius is like “Brainpower for Agents” because it provides the executive function (the cognition, reasoning, planning, learning, and decision making) that gives agents agency, curiosity and choice. Agency is the missing link in what many are calling agentic architectures, which are really just powered by LLMs. A recent study by Apple researchers concluded that “current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data.”
OpenAI’s latest “reasoning” model o1 is certainly an impressive optimization in a long line of optimizations that began last century but still relies on a neural net architecture and appears to combine the two approaches that BBF/EfficientZero (reinforcement learning) and IRIS (transformers) implement into chain-of-thought inference-time computation. Our recent Mastermind benchmark against o1-preview showed superior performance. We believe both Mastermind and the Atari 10k benchmark results indicate that we now have a new biomimetic approach to general machine intelligence that is markedly better, faster, and cheaper than both of those approaches, even when combined.
Uncertainty and Explainability
As Genius incorporates the research described here and matures as a toolkit for modeling complex dynamic systems and developing autonomous agents it will be ideally suited for addressing problem spaces where change is constant, data and resources are limited and where reliability, explainability and sustainability are paramount.
In these visualizations, the dotted lines and gradient trails represent the agent’s probabilistic beliefs about the trajectories of identified objects – be it a ball, a speeding car, or a boxer’s jab. These quantifications of uncertainty, combined with degrees of confidence, illustrate how an agent’s predictions and decisions are (for the first time) explainable. This level of transparency and auditability stands in stark contrast to the black-box, unexplainable and unquantifiable internal processes of SOTA ML.
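One simple way to picture a belief about a trajectory is as a predicted position plus an uncertainty band that widens the further ahead the agent looks. The sketch below uses a constant-velocity Gaussian model purely as an illustration. It is an assumption for exposition, not a description of how Genius represents or computes its beliefs, and all names and numbers are hypothetical.

```python
import math


def predict_positions(pos_mean, pos_var, vel_mean, vel_var, process_var, horizon):
    """Predict a 1-D position t = 1..horizon steps ahead.

    Model (assumed for illustration): x_t = x_0 + t * v + accumulated noise,
    with x_0, v and the per-step noise independent and Gaussian.
    Returns a list of (mean, standard deviation) pairs.
    """
    predictions = []
    for t in range(1, horizon + 1):
        mean = pos_mean + t * vel_mean
        # Velocity uncertainty grows quadratically, per-step noise linearly.
        var = pos_var + (t ** 2) * vel_var + t * process_var
        predictions.append((mean, math.sqrt(var)))
    return predictions


if __name__ == "__main__":
    # Hypothetical ball: position 0, velocity about 2 units per step.
    for t, (mean, std) in enumerate(
        predict_positions(0.0, 0.01, 2.0, 0.04, 0.01, horizon=5), start=1
    ):
        print(f"t+{t}: position = {mean:.1f} +/- {2 * std:.2f}")
```

Plotting the mean as a dotted line and the band as a fading trail would give a picture in the spirit of the visualizations described above: the numbers behind each prediction, and how confident it is, can be read off directly.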
Applications and Implications
The reasoning, planning, and real time learning capabilities that allow a Genius Agent to master gameplay are directly relevant to real-world applications such as classification, recommendations, prediction and decision-making across a variety of domains including financial, medical, risk analysis, autonomous driving, robotics and more. The applications for highly efficient generalized machine intelligence are as vast as the usefulness of human intelligence but the implications are even more provocative.
Microsoft announced plans to revive the Three Mile Island nuclear power plant to fuel its AI data center ambitions. Meta stated its intention to scale its portfolio of compute power to the equivalent of nearly 600,000 H100s (each GPU retails for $30,000) by the end of 2024. TechCrunch estimates that OpenAI’s latest model, o3, may cost more than $1,000 per task. Unreliability and unexplainability aside, the financial costs, energy consumption and carbon emissions of training and running these mega-sized, overparameterized models are not just economically and environmentally unsustainable; they fly in the face of how biological intelligence works. Case in point: the human brain operates on just 20 watts, enough to power a single light bulb.
If that weren’t enough, experts are warning that available high quality data is running out. Elon Musk recently suggested real-world data for AI training is ‘exhausted’ and some caution that his solution – training on synthetically generated data – could lead to ‘model collapse’ where models gradually deteriorate.
Image credit: M. Boháček & H. Farid / arXiv
The implications of effective (reliable), trustworthy (explainable) and efficient (sustainable) intelligent agents created from a single generalizable (flexible) architecture are, potentially, civilizational in scale. Rather than a few incomprehensibly costly yet unreliable models controlled by a handful of corporations, imagine trillions of inexpensive, hyper-efficient, specialized, autonomous, self-organizing agents operating at the edge and in the cloud. They would coordinate and collaborate around a common intrinsic and deceptively simple goal at every level, from the individual to the collective: seeking to understand, i.e. to reduce uncertainty. In the age-old quest to create thinking machines, the complex dynamic systems of games, as a proxy for the real world, may help show the way, but we believe that smarter, safer and more sustainable machine intelligence is needed to take civilization to the next level.
Game on.