
Genius Outperforms DeepSeek R1 in Code-Breaking Challenge, Mastermind

Genius Agent achieves 100% accuracy hundreds of times faster and cheaper than latest LLM

 

Highlights

  • Genius solved all 100 games, averaging 3.1 seconds per game at a nominal cost, making it 245x faster and 779x cheaper than DeepSeek R1
  • DeepSeek R1 solved just 45% of the games within 10 guesses, averaging 15m 34s per game
  • R1's total compute time was 26 hours and its total cost was $38.94, an average of $0.39 per game

Overview

DeepSeek, the Chinese startup behind the new Large Language Model R1, made a global splash last week with claims that R1 outperforms leading LLMs at a fraction of the cost to train (though reports are emerging that those figures may be misleading or misinterpreted).  With all the excitement, we decided to test Genius against R1 at the code-breaking game Mastermind, a good test of multi-step reasoning, which is a key factor in developing reliable, explainable, and affordable software agents.  Multi-step reasoning is the process of solving a problem that requires multiple calculations or logical steps to arrive at a conclusion.  It involves breaking a complex problem down into sequential stages, where each step builds on the previous one to ensure a comprehensive and accurate solution.  Each step in the decision-making process must be sound on its own yet also contribute to the overall conclusion.

We let R1 play 100 games of Mastermind, with up to 10 guesses per game and a pool of 6 colors for the four-color secret code.  Whereas R1 is a pre-trained LLM, a Genius Agent uses a generative model (memory) on which it runs inference in real time before each guess.  The agent calculates the probability of each color in each position across all remaining valid code combinations and, based on the feedback of white pegs (correct color, wrong position) and black pegs (correct color and position), selects the optimal action (i.e., code combination).  Each guess reduces the set of valid codes, the probabilities are recalculated, and the process repeats until the secret code is found within 10 guesses.
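To make the process concrete, here is a minimal Python sketch of the candidate-filtering loop described above. It illustrates the logic only: Genius's actual generative model and inference are not public, so the function names and the guess-selection heuristic here are our own simplifying assumptions.

```python
from collections import Counter
from itertools import product

COLORS = "RGBYOP"  # 6 colors, 4 positions => 6**4 = 1,296 possible codes

def score(guess: str, secret: str) -> tuple[int, int]:
    """Return (black, white) pegs: black = correct color and position,
    white = correct color, wrong position."""
    black = sum(g == s for g, s in zip(guess, secret))
    overlap = sum((Counter(guess) & Counter(secret)).values())
    return black, overlap - black

def position_probabilities(candidates: list[str]) -> list[dict[str, float]]:
    """P(color at position i), computed over the remaining valid codes."""
    n = len(candidates)
    return [{color: sum(c[i] == color for c in candidates) / n for color in COLORS}
            for i in range(4)]

def best_guess(candidates: list[str]) -> str:
    """Toy stand-in for the 'optimal action' step: pick the candidate whose
    colors are most probable position by position."""
    probs = position_probabilities(candidates)
    return max(candidates, key=lambda c: sum(probs[i][c[i]] for i in range(4)))

secret = "GYPO"
candidates = ["".join(p) for p in product(COLORS, repeat=4)]
for turn in range(1, 11):
    guess = best_guess(candidates)
    feedback = score(guess, secret)
    print(f"Guess {turn}: {guess}, Feedback: {feedback[0]}B {feedback[1]}W")
    if feedback == (4, 0):
        break
    # Keep only codes consistent with the observed feedback, then recalculate.
    candidates = [c for c in candidates if score(guess, c) == feedback]
```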

 

In this sped-up example gameplay, R1 takes over 22 minutes to solve the code.

 

Findings

The Genius metrics below are reused from our previous runs:

| Metric | Genius | R1 |
| --- | --- | --- |
| Success Rate | 100% | 45% |
| Avg Compute Time per Game | 3.1 s | 934 s |
| Avg Guesses per Game | 5.6 | 8.5 |
| Total Compute Time (100 Games) | 5m 18s | 26 hours |
| Avg Cost per Game | $0.0005 (est) | $0.39 |
| Total Cost for 100 Games | $0.05 (est) | $38.94 |
| Hardware Requirements | Mac M1 Pro | Fireworks.ai* |

$ are USD, (s) are seconds

*Fireworks.ai hardware is unknown; however, the recommended system requirements for running R1 locally are 16 NVIDIA A100 80GB GPUs, each retailing for $17-20k USD, so one can imagine something comparable.

Genius

  • Solved all 100 games, averaging 3.1 seconds per game, at a nominal cost
  • 245x faster and 779x cheaper than DeepSeek R1

DeepSeek R1

  • R1 succeeded in guessing the secret code in 45% of the games within 10 guesses
  • The average number of guesses was 8.5
  • Total compute time was 26 hours, with an average compute time per game of 934 seconds (15m 34s)
  • Total cost was $38.94, with an average cost per game of $0.39

 

Setup

We ran the full version of R1 (671B parameters) on Fireworks.ai, a platform for building and deploying generative AI.

The prompt for our previous test of o1-preview was simple to come up with:

You are playing the mastermind board game. You are the code breaker. We are using the colors of R, B, G, Y, O, and P. Return your answer and explanation in json format. Your guess should be a 4 character sequence. Make your first guess.

The prompt for R1, by contrast, took some iteration to arrive at sufficiently explicit instructions. Without a very explicit prompt, R1 would revert to the existing knowledge in its foundation model and ignore all game-specific instructions; for instance, rather than choosing colors, it would choose numbers.

Let's play the game mastermind. The code consists of 4 colors, and colors can be repeated in the code. We are using the colors R, G, B, Y, O, and P. You are trying to guess the secret code. You will get 10 attempts to guess the code.  Your answer should be a 4 character sequence. Format your answer in json. Feedback will be given as 'Guess: {}, Feedback: {}B {}W'. Where B, black pegs indicate the number of correct colors in the correct positions, and W, white pegs represent the number of correct colors in the wrong positions.


EXAMPLE JSON OUTPUT:
{
    "guess": "RGBY"
}

Do not repeat a guess you have already made.  Please make your first guess.

Without this specific instruction, R1 would get stuck, making the same guess over and over for most of its 10 attempts.
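For reference, the harness around such a prompt is simple. The sketch below, with `ask_model` as a placeholder for whatever chat-completion call you use, parses the model's JSON guess, scores it, and returns feedback in the "Guess: {}, Feedback: {}B {}W" format from the prompt above. In practice a reasoning model's reply may wrap the JSON in its thinking trace, which would need to be stripped first.

```python
import json
from collections import Counter

def pegs(guess: str, secret: str) -> tuple[int, int]:
    # Black pegs: correct color and position; white: correct color, wrong position.
    black = sum(g == s for g, s in zip(guess, secret))
    white = sum((Counter(guess) & Counter(secret)).values()) - black
    return black, white

def play(ask_model, secret: str, max_turns: int = 10) -> bool:
    message = "Please make your first guess."
    for _ in range(max_turns):
        reply = ask_model(message)          # expected to return e.g. {"guess": "RGBY"}
        guess = json.loads(reply)["guess"]
        black, white = pegs(guess, secret)
        if black == 4:
            return True                     # code broken
        message = f"Guess: {guess}, Feedback: {black}B {white}W. Please make your next guess."
    return False                            # not solved within 10 guesses
```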

Each model also has a set of hyperparameters that can be dialed up or down to suit a given task. Temperature, for example, controls the randomness of an LLM's output and affects the quality of what is generated for tasks like summarization, translation, and text generation.  For math-centric tests where you want low variability, or repeatable and consistent results, you'd select a low temperature, whereas for creative writing you'd want high variability (at the cost of more hallucinations).
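With an OpenAI-compatible client (Fireworks.ai exposes one), temperature is a single request parameter. A minimal sketch; the base URL and model identifier below are assumptions to verify against your provider's documentation:

```python
from openai import OpenAI

# Base URL and model ID are illustrative; check your provider's docs.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-r1",
    messages=[{"role": "user", "content": "Please make your first guess."}],
    temperature=0.2,  # low temperature: repeatable, consistent answers
)
print(response.choices[0].message.content)
```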

 

Reasoning

In the following screen recording you can see R1 perform some form of reflection, chain-of-thought reasoning, or inference-time compute.  OpenAI describes this as models "thinking before they answer," and it can be thought of as an internal monologue.

R1 “reasoning”

What isn't self-evident about these internal monologues is that each user query includes all previous queries and responses.  For example, after 5 guesses in Mastermind, the simple feedback query "Guess: RGBY, Feedback: 0B 2W. Please make your next guess." also sends back the entire internal monologue from the 4 previous responses.  These verbose reflections are a clever but expensive trick to compensate for not having a generative model (memory) to update in real time.
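Schematically, the client keeps appending to one message list and resends it all on every call, so the request for guess N carries every earlier reasoning trace. A sketch, with `ask_model` and the message contents as illustrative placeholders:

```python
def ask_model(messages: list[dict]) -> str:
    """Placeholder for the actual chat-completion call."""
    return "..."  # imagine ~6,000 tokens of reasoning plus a JSON guess

messages = [{"role": "user", "content": "Let's play the game mastermind..."}]
for turn in range(10):
    reply = ask_model(messages)                               # full history sent every time
    messages.append({"role": "assistant", "content": reply})  # entire monologue is kept
    messages.append({"role": "user", "content": "Guess: ..., Feedback: ...B ...W. Please make your next guess."})
# The list grows by 2 messages per turn, so the total tokens sent over a game
# grow roughly quadratically with the number of guesses.
```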

One study found that R1's responses are on average 6,000 tokens long.  Price per token varies across models (R1 was $8/million), but the tokens add up quickly, which is highly inefficient from both a financial and an environmental perspective, to say nothing of the time this form of reasoning takes.
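A back-of-envelope check, counting only the generated tokens at a flat price (both simplifying assumptions), lands close to what we measured:

```python
PRICE_PER_TOKEN = 8 / 1_000_000   # $8 per million tokens (R1)
TOKENS_PER_RESPONSE = 6_000       # average response length from the study above
GUESSES_PER_GAME = 8.5            # average guesses in our 100-game test

cost = GUESSES_PER_GAME * TOKENS_PER_RESPONSE * PRICE_PER_TOKEN
print(f"~${cost:.2f} per game")   # ~$0.41, in line with the $0.39 per game we measured
```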

 

Observations

R1's first guess was the same in all 100 games (RGBY), but its second guess varied.  In what one could call an advanced variant of hallucination, R1 made the same guess 5 times in spite of being told the answer was incorrect.  Here is the reasoning it provided:

I think the code is GYPB, so the next guess should be GYPB, but I already did that as guess 7. Wait, no, guess 7 was GYPB with feedback 3B, so the code must be GYPB. But in guess 8, GYBP had 2B, 1W, which doesn't fit. Maybe I made a mistake in the feedback. Alternatively, maybe the code is GYPP, but that would mean P is repeated. I'm confused, but I think the code is GYPB, so I should guess that again or try a different approach. Maybe I should try GYPO next to see if O is in the code.

 

Reliability

The differences in prompt phrasing, verbosity, and hyperparameter tuning between the LLMs highlight the unique and opaque internal wiring of each model.  Determining how much detail is required to aim an LLM at a problem takes trial and error to tailor prompts to each model's idiosyncrasies.  The level of requisite explicitness could scale exponentially with complexity, and the compounding internal monologue could prove impractical if not intractable.

While Mastermind is a multi-step reasoning challenge, it is not especially complex and its rules are fixed, whereas surprises, edge cases, and changing conditions are par for the course in the real world.  For agents to be reliable they must also be adaptable to unforeseen circumstances, yet LLMs, constrained by their architecture, can only account for the data in their pre-training.

 

Explainability

The ability to read an LLM's internal monologue can provide clues for improving the prompt, but it doesn't translate into being able to update the LLM itself, which is pre-trained and static.  The neural networks that underpin LLMs are still a black box, so it is virtually impossible to explain what makes one model better or worse at a task than another.  Similarly, decisions and recommendations are often fuzzy rather than absolute given the available information, so the inability to ascertain a model's certainty or confidence in its answer limits how much we can trust that answer.

Conversely, Genius agents have a generative (world) model on which they continuously run inference and which they then update.  This probabilistic model is represented as a human-readable factor graph that describes the cause-and-effect relationships among the factors and variables that make up a complex dynamic system.

Mastermind Factor Graph

In this sampling of a few of the possible Mastermind combinations, you can see how the likelihood of (or confidence in) a code being correct evolves with feedback (the secret code is GYPO).

2025-02-03 Mastermind Probabilities
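The same effect can be reproduced with the candidate-filtering sketch from the Overview: each round of feedback shrinks the set of valid codes, concentrating probability on the true one. A toy trace, assuming a uniform prior over the remaining codes and an illustrative guess sequence:

```python
from collections import Counter
from itertools import product

COLORS = "RGBYOP"

def score(guess: str, code: str) -> tuple[int, int]:
    black = sum(g == c for g, c in zip(guess, code))
    return black, sum((Counter(guess) & Counter(code)).values()) - black

secret = "GYPO"
candidates = ["".join(p) for p in product(COLORS, repeat=4)]  # 1,296 codes
for guess in ["RGBY", "GYOP", "GYPO"]:                        # illustrative sequence
    feedback = score(guess, secret)
    candidates = [c for c in candidates if score(guess, c) == feedback]
    print(f"After {guess} ({feedback[0]}B {feedback[1]}W): "
          f"{len(candidates)} codes remain, P({secret}) = {1 / len(candidates):.3f}")
```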

 
Affordability

Reliability and explainability aside, LLMs still cost a fortune to train and operate.  The profitability and return on investment of AI are increasingly under scrutiny, with Forbes reporting that "75% Of Businesses Aren't Seeing ROI From AI Yet."

 

Final Thoughts

Success is always measured by the interplay of better, faster, and cheaper.  R1 may or may not turn out to represent a more efficient way of training LLMs, and it, along with all other LLMs, may be useful for certain things like solving math problems and coding tasks.  However, when reliability (in spite of limited data), explainability (intrinsic transparency), and/or affordability (financial and environmental) are serious constraints and reaction time and safety are mission critical, then LLMs alone may fall short.  This is where the domain-specific models that Genius enables succeed.

LLMs are designed to be large stores of general knowledge and therefore tend to fail at specialized tasks (or require explicit instructions in order to be somewhat effective) whereas Genius agents are suited to learning specialized tasks.  As LLMs and other Generative AI evolve and become more cost-efficient, we envision Genius agents playing the role of orchestrator between systems, leveraging tools in the areas where they excel.