
Genius Outperforms OpenAI Model in Code-Breaking Challenge, “Mastermind”

Written by Steven Swanson | Dec 17, 2024 5:57:54 PM

In a head-to-head test of reasoning ability on the code-breaking game Mastermind, an agent powered by Genius solved 100% of the games, 140 times faster and 5,260 times cheaper than OpenAI’s o1-preview.

The term “reasoning” is being used by many AI companies as a catch-all for the processing that happens under the hood.  OpenAI promotes its new o1 model as having advanced reasoning: it (and other LLM-based models) uses a new kind of inference-time compute, and the improved results over GPT models alone are the basis for the argument that these reasoning models perform a form of language-based reasoning.  However, research from Apple and others suggests that this form of reasoning may struggle to reliably model cause-and-effect relationships, leading to issues with accuracy.  Regardless, many problem spaces require a class of reasoning where language-based reasoning is not enough and where reliability, accuracy, and auditability are critical.

...current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data.

- Apple Researchers

Speaking at the 2024 K-Science and Technology Global Forum in Seoul, Meta Chief Scientist Yann LeCun said, “LLMs can deal with language because it is simple and discrete, but it cannot deal with the complexity of the real world.  These systems lack the ability to reason, plan, and understand the physical world in the same way humans do.”

There are, in fact, 7 types of reasoning, but in the context of agentic software we define reasoning as:

The ability for an agent to run inference on a generative model in order to deduce the likelihood of a cause or effect based on past data.

Cybersecurity, fraud detection, stock markets, and weather forecasting, among others, are complex dynamic systems with complicated cause-and-effect relationships, surprises, and hidden factors and variables, where the “right” answer must be probabilistic rather than absolute because the available information is incomplete, unclear, unprecedented, or otherwise fuzzy.  Predicting future effects within these systems requires learning causality and quantifying uncertainty, as opposed to reconstructing predictions from correlations in past data.  As the saying goes, correlation does not imply causation.
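
To make that distinction concrete, here is a toy example of probabilistic cause-and-effect reasoning (illustrative only, not Genius’s implementation): given a prior belief over possible causes and a likelihood model of their effects, Bayes’ rule yields a posterior that quantifies how likely each cause is, keeping uncertainty explicit rather than collapsing it into a single absolute answer.

```python
# Minimal illustration of probabilistic cause-effect reasoning (not Genius's
# actual implementation): infer how likely a cause is from an observed effect
# using Bayes' rule, keeping the answer as a distribution.

# Hypothetical scenario: did a server outage (cause) produce the errors we see?
priors = {"outage": 0.1, "no_outage": 0.9}   # belief before the new observation
likelihood = {                               # P(errors observed | cause)
    "outage": 0.8,       # errors are very likely during an outage
    "no_outage": 0.05,   # errors occasionally happen anyway
}

observed_errors = True

# Bayes' rule: posterior is proportional to prior x likelihood
unnormalized = {
    cause: priors[cause] * (likelihood[cause] if observed_errors else 1 - likelihood[cause])
    for cause in priors
}
total = sum(unnormalized.values())
posterior = {cause: p / total for cause, p in unnormalized.items()}

print(posterior)  # {'outage': 0.64, 'no_outage': 0.36} -- uncertainty stays explicit
```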

With that in mind, we decided to compare how o1-preview* and Genius fare when reasoning about a fuzzy problem.  Mastermind is a code-breaking game in which one player creates a secret code from colored pegs, and the other player must guess the code by making logical deductions from feedback about the color and position of each guess.  Learn more about the game or play it at mastermindgame.org.

We let o1-preview and Genius each play 100 games, with up to 10 guesses per game and a pool of 6 colors for the four-peg secret code.
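
For readers unfamiliar with the mechanics, the feedback after each guess boils down to two numbers: how many pegs match the secret in both color and position, and how many additional pegs match in color only.  Here is a minimal sketch of that scoring rule, assuming the same four-peg, six-color setup used in the test (this is our own illustration, not code from either system):

```python
from collections import Counter

def score(secret: str, guess: str) -> tuple[int, int]:
    """Return Mastermind feedback as (exact, color_only).

    Codes are strings of single-character color labels, e.g. "GYPO".
    exact      -- pegs matching the secret in both color and position
    color_only -- additional pegs matching in color but not position
    """
    exact = sum(s == g for s, g in zip(secret, guess))
    # Count color overlaps regardless of position, then subtract exact matches.
    overlap = sum((Counter(secret) & Counter(guess)).values())
    return exact, overlap - exact

# Example using two codes mentioned in this post: secret GYPO, guess RBGY.
print(score("GYPO", "RBGY"))  # (0, 2): no exact matches, two colors present
```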

Our findings are as follows:

| Metric | Genius | o1-preview |
| --- | --- | --- |
| Success Rate | 100% | 71% |
| Avg Compute Time per Game | 3.1 seconds | 345 seconds |
| Avg Guesses per Game | 5.6 | 6.1 |
| Total Compute Time (100 Games) | 5m 18s | 12.5 hours |
| Avg Cost per Game | $0.0005 (estimate) | $1.96 |
| Total Cost for 100 Games | $0.05 (estimate) | $263 USD** |
| Hardware Requirements | Mac M1 Pro laptop | OpenAI servers |

  • o1-preview failed to guess the secret code within 10 guesses on 29% of the games; on the 71 games it did solve, the average compute time was 345 seconds and the average number of guesses was 6.1.
  • Genius solved all 100 games, averaging 3.1 seconds and 5.6 guesses per game.
  • o1-preview’s solve times on successful games ranged from 7.9 to 889 seconds, while Genius’s ranged from 1.1 to 4.5 seconds.
  • The combined time o1-preview spent solving all 100 games was more than 12.5 hours, versus 5m 18s for Genius.
  • Based on the number of tokens o1-preview sent back and forth over the API for each guess and the cost per token, we calculated the average cost of each successful game at $1.96** (a sketch of this arithmetic appears after this list).  Failed games cost more on average because they used up all 10 guesses, so we excluded them; it should be noted, however, that the 29 failures incurred 47% of the total spend.
  • We estimate the total cost for Genius to run all 100 games was nominal, perhaps $0.05, given the hardware (see below) and the minimal energy and compute time, compared to a total of $263 USD for o1-preview.
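
As a rough sketch of that cost arithmetic (the token counts below are hypothetical, and the per-token rates are the approximate o1-preview list prices at the time, roughly $15 per million input tokens and $60 per million output tokens):

```python
# Illustrative cost arithmetic only -- token counts here are hypothetical, and
# the rates are approximate o1-preview list prices at the time of the test.
# The $1.96 figure in this post comes from the actual tokens logged per game.

INPUT_RATE = 15 / 1_000_000    # USD per input token (approximate list price)
OUTPUT_RATE = 60 / 1_000_000   # USD per output token (approximate list price)

def game_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one game given the tokens exchanged over the API."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Hypothetical game: ~6 guesses, a growing prompt history, heavy reasoning output.
print(round(game_cost(input_tokens=12_000, output_tokens=30_000), 2))  # 1.98
```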

One observation is that Genius started with the exact same guess (RRBB) in all 100 games, and when guessing the same secret code multiple times its subsequent guesses followed a consistent pattern, demonstrating a principled, systematic approach to reasoning.
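
Genius’s internal method is not detailed here, but the repeatable behavior described above is what a systematic, constraint-driven strategy produces: keep only the codes consistent with every piece of feedback so far and pick the next guess deterministically.  The following sketch is purely illustrative of that idea, not Genius’s algorithm, and reuses the scoring rule shown earlier:

```python
from collections import Counter
from itertools import product

COLORS = "RGBYPO"  # hypothetical labels for the six peg colors

def score(secret, guess):
    """Mastermind feedback: (exact matches, color-only matches) -- same rule as above."""
    exact = sum(s == g for s, g in zip(secret, guess))
    overlap = sum((Counter(secret) & Counter(guess)).values())
    return exact, overlap - exact

def next_guess(history):
    """Pick the first 4-peg code consistent with all feedback seen so far.

    Because codes are enumerated in a fixed order, identical feedback always
    yields the identical next guess -- a deterministic, repeatable strategy.
    (With this enumeration order the fixed opening guess is "RRRR"; any fixed
    ordering gives the same kind of repeatability described above.)
    """
    for code in ("".join(p) for p in product(COLORS, repeat=4)):
        if all(score(code, g) == fb for g, fb in history):
            return code

# Play one game against a known secret (GYPO, one of the codes from the test).
secret, history = "GYPO", []
while True:
    guess = next_guess(history)
    feedback = score(secret, guess)
    history.append((guess, feedback))
    if feedback == (4, 0):
        break

print([g for g, _ in history])  # same secret -> same sequence of guesses, every time
```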

In contrast, across 25 games using the same secret code (GYPO), o1-preview opened with a "one of each color" strategy (RBGY) 19 times and a "two of each color" strategy (RRGG) 6 times.  In subsequent guesses, however, there is no apparent pattern; the guesses appear random and unpredictable.  o1-preview failed to solve 7 of those 25 games within 10 guesses, and in the 18 games it did solve, the number of guesses varied from 3 to 10 and solve times ranged from 50 to 660 seconds.

Below is a side-by-side time-lapse comparison of the two methods.


For this exercise, Genius ran on a commodity 2021 Apple laptop with an M1 Pro chip and 64 GB of RAM, whereas the hardware OpenAI runs o1-preview on is unknown.  Research and OpenAI’s pricing suggest it is likely running on powerful A100 or H100 GPUs, possibly even a cluster of them.  Given the orders-of-magnitude difference in compute and power between an M1 and an A100, an apples-to-apples comparison could significantly widen the gap in effectiveness and efficiency by which Genius outperformed o1-preview.

The advanced reasoning required to crack the code in Mastermind is representative of the capabilities available in Genius today, and it underpins the Atari 10k challenge that we are eager to share.  Outperforming o1-preview by such a wide margin exposes inherent limitations of the correlational, language-based reasoning of LLMs when tackling tasks that require logical cause-effect reasoning and planning.

Notes

  • *At the time of the test and this writing, although the full version of o1 had been released in the ChatGPT interface for Pro subscribers ($200/month), only o1-preview was available through OpenAI’s API/SDK.
  • **o1-preview costs were calculated by multiplying the number of tokens used by the cost per token.
  • We removed o1-preview’s failed games from our calculations of time, cost, and number of guesses to avoid skewing the numbers.
  • If o1-preview had been allowed to exceed the 10-guess limit, it likely would have solved all 100 games, but the associated time and cost would have been higher.
  • Detailed results are available here.