In a head-to-head test of reasoning ability using the code-breaking game Mastermind, an agent powered by Genius solved 100% of the games, and did so 140 times faster and 5,260 times cheaper than OpenAI’s o1-preview.
The term "reasoning" is used by many AI companies as a catch-all for the processing under the hood. OpenAI promotes its new o1 model as having advanced reasoning: it (like other LLM-based models) uses a new kind of “inference-time compute,” and its improved results over GPT models alone are the basis for the argument that these models perform a form of language-based reasoning. However, research from Apple and others suggests that this form of reasoning may struggle to reliably model cause-and-effect relationships, which leads to accuracy problems. Regardless, many problem spaces require a class of reasoning for which language-based reasoning is not enough and where reliability, accuracy, and auditability are critical.
...current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data.
- Apple Researchers
Speaking at the 2024 K-Science and Technology Global Forum in Seoul, Meta Chief Scientist Yann LeCun said, “LLMs can deal with language because it is simple and discrete, but it cannot deal with the complexity of the real world.” He added, “These systems lack the ability to reason, plan, and understand the physical world in the same way humans do.”
There are, in fact, seven types of reasoning, but in the context of agentic software we define reasoning as:
The ability for an agent to run inference on a generative model in order to deduce the likelihood of a cause or effect based on past data.
Cybersecurity, fraud detection, stock markets, and weather forecasting, among others, are complex dynamic systems with complicated cause-effect relationships, surprises, and hidden factors and variables. In such systems the “right” answer must be probabilistic rather than absolute, because the available information is incomplete, unclear, unprecedented, or otherwise fuzzy. Predicting future effects within these systems requires learning causality and quantifying uncertainty, as opposed to reconstructing predictions from correlations in past data. As the saying goes, correlation does not imply causation.
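As a rough illustration of what "deducing the likelihood of a cause" can mean in practice, here is a minimal sketch of Bayes' rule over a discrete set of candidate causes. It is not how Genius works internally (that is not described here); the `posterior` helper, the fraud scenario, and every probability in it are made-up assumptions chosen only to show the arithmetic.

```python
def posterior(prior, likelihood, observation):
    """Apply Bayes' rule: P(cause | observation) for each candidate cause."""
    unnormalized = {
        cause: prior[cause] * likelihood[cause][observation] for cause in prior
    }
    total = sum(unnormalized.values())
    return {cause: value / total for cause, value in unnormalized.items()}


# Illustrative numbers only (assumptions, not data from any real system).
prior = {"fraud": 0.01, "legitimate": 0.99}
likelihood = {
    "fraud": {"flagged": 0.90, "not_flagged": 0.10},
    "legitimate": {"flagged": 0.05, "not_flagged": 0.95},
}

belief = posterior(prior, likelihood, "flagged")
print({cause: round(p, 3) for cause, p in belief.items()})
# {'fraud': 0.154, 'legitimate': 0.846}
```

Even with a strong signal, the answer stays a probability rather than a verdict: under these assumed numbers a flagged transaction is fraudulent only about 15% of the time, because fraud is rare to begin with.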
With that in mind, we decided to compare how o1-preview* and Genius fare when reasoning about a fuzzy problem. Mastermind is a code-breaking game in which one player creates a secret code using colored pegs, and the other player must guess the code by making logical deductions from feedback about the color and position of the pegs in each guess. Learn more about the game or play it at mastermindgame.org.
We let o1-preview and Genius each play 100 games, with up to 10 guesses per game and a pool of 6 colors for the four-peg secret code.
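To make the setup concrete, the sketch below scores a guess and uses that feedback to eliminate impossible codes. It is a generic illustration, not the code used in the test: the `feedback` and `consistent` helpers are hypothetical names, the color initials are taken from the guesses quoted in this article, and it assumes colors may repeat within a code.

```python
from collections import Counter
from itertools import product

COLORS = "RGBYPO"  # six color initials (assumed from the guesses quoted here)
CODE_LENGTH = 4


def feedback(secret: str, guess: str) -> tuple[int, int]:
    """Return (exact, partial): right color in the right slot, and
    right color in the wrong slot."""
    exact = sum(s == g for s, g in zip(secret, guess))
    # Multiset intersection counts every shared peg, exact matches included.
    shared = sum((Counter(secret) & Counter(guess)).values())
    return exact, shared - exact


# Assumption: repeats are allowed, so there are 6**4 = 1296 possible codes.
ALL_CODES = ["".join(p) for p in product(COLORS, repeat=CODE_LENGTH)]


def consistent(candidates, guess, observed):
    """Keep only the codes that would have produced the observed feedback."""
    return [code for code in candidates if feedback(code, guess) == observed]


observed = feedback("GYPO", "RRBB")  # (0, 0): no red or blue in the secret
remaining = consistent(ALL_CODES, "RRBB", observed)
print(len(ALL_CODES), len(remaining))  # 1296 256
```

Under these assumptions, a single guess that scores nothing still rules out more than 80% of the search space, which is the kind of deduction the game rewards.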
Our findings are as follows:
| Metric | Genius | o1-preview |
| --- | --- | --- |
| Success Rate | 100% | 71% |
| Avg Compute Time per Game | 3.1 seconds | 345 seconds |
| Avg Guesses per Game | 5.6 | 6.1 |
| Total Compute Time (100 Games) | 5m 18s | 12.5 hours |
| Avg Cost per Game | $0.0005 (estimate) | $1.96 |
| Total Cost for 100 Games | $0.05 (estimate) | $263 USD** |
| Hardware Requirements | Mac M1 Pro laptop | OpenAI Servers |
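The headline ratios can be reproduced from the total rows of the table. The short calculation below uses only the figures reported above; the variable names are illustrative. Ratios computed from the per-game averages come out somewhat lower than those computed from the totals.

```python
# Totals for 100 games, as reported in the table.
genius_total_s = 5 * 60 + 18   # 5m 18s
o1_total_s = 12.5 * 3600       # 12.5 hours

time_ratio_total = o1_total_s / genius_total_s
cost_ratio_total = 263 / 0.05

# Per-game averages, as reported in the table.
time_ratio_per_game = 345 / 3.1
cost_ratio_per_game = 1.96 / 0.0005

print(f"{time_ratio_total:.1f}x faster, {cost_ratio_total:.1f}x cheaper (totals)")
print(f"{time_ratio_per_game:.1f}x faster, {cost_ratio_per_game:.1f}x cheaper (per game)")
# 141.5x faster, 5260.0x cheaper (totals)
# 111.3x faster, 3920.0x cheaper (per game)
```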
One observation is that Genius started with the exact same guess (RRBB) in all 100 games, and its subsequent guesses follow a consistent pattern when the same secret code is played multiple times. This demonstrates a principled and systematic approach to reasoning.
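How Genius selects its guesses is not described here. As one illustration of why a principled policy repeats the same opening, the sketch below applies a different, well-known technique: the worst-case (minimax) criterion from Knuth's classic Mastermind strategy, which prefers the guess whose most common feedback leaves the fewest candidate codes. The helper names are hypothetical, and the sketch assumes colors may repeat.

```python
from collections import Counter
from itertools import product

COLORS = "RGBYPO"  # six color initials (assumed from the guesses quoted here)
ALL_CODES = ["".join(p) for p in product(COLORS, repeat=4)]


def feedback(secret, guess):
    """Return (exact, partial) peg counts for a guess against a secret."""
    exact = sum(s == g for s, g in zip(secret, guess))
    shared = sum((Counter(secret) & Counter(guess)).values())
    return exact, shared - exact


def worst_case(guess, candidates):
    """Size of the largest group of candidates that share one feedback."""
    buckets = Counter(feedback(secret, guess) for secret in candidates)
    return max(buckets.values())


# Compare three opening patterns: one color, four colors, two pairs.
openings = ["RRRR", "RBGY", "RRBB"]
for guess in openings:
    print(guess, worst_case(guess, ALL_CODES))
# RRRR 625
# RBGY 312
# RRBB 256

# The rule is deterministic, so it picks the same opening every game.
best = min(openings, key=lambda g: (worst_case(g, ALL_CODES), g))
print(best)  # RRBB
```

Under this particular criterion a two-pairs opening such as RRBB is the strongest of the three, and because nothing in the rule is random, the same opening is chosen every time. Other criteria, such as maximizing expected information, can prefer a different opening.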
In contrast, in 25 games using the same secret code (GYPO), o1-preview started with a "one of each color" strategy (RBGY) 19 times and a "two of each color" strategy (RRGG) 6 times. In subsequent guesses across those 25 games there is no apparent pattern; the guesses appear random and unpredictable. o1-preview did not solve 7 of the 25 games within 10 guesses, and in the 18 games it did solve, the number of guesses varied from 3 to 10 and solve times ranged from 50 to 660 seconds.
Following is a side-by-side time-lapse comparison of the two methods.
For this exercise, Genius ran on a commodity 2021 Apple Mac with an M1 Pro and 64 GB of RAM, whereas the hardware that OpenAI runs o1-preview on is unknown. Research and OpenAI’s pricing suggest it is likely running on powerful A100 or H100 GPUs, possibly even a cluster of them. Given the orders-of-magnitude difference in compute and power between an M1 and an A100, an apples-to-apples comparison could significantly widen the gap in effectiveness and efficiency by which Genius outperformed o1-preview.
The advanced reasoning required to crack the code in Mastermind is representative of the capabilities available in Genius today, which also underpin the Atari 10k challenge that we are eager to share. Outperforming o1-preview by such a wide margin exposes inherent limitations of the correlational, language-based reasoning of LLMs when tackling tasks that require logical cause-effect reasoning and planning.