MARS Benchmarks¶

Performance results on standard mathematical reasoning and coding benchmarks.

Benchmark Results¶

AIME 2025¶

Configuration	Score	Relative improvement
Baseline (single model)	43.3%	-
MARS (3 agents)	73.3%	+69%
MARS (5 agents)	75.8%	+75%
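
The improvement figures above are consistent with a gain relative to the baseline score, not a difference in percentage points. A minimal sketch of that calculation follows; `relative_improvement` is an illustrative helper, not part of MARS.

```python
def relative_improvement(baseline: float, score: float) -> float:
    """Return the gain over the baseline as a percentage of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (score - baseline) / baseline * 100.0


# AIME 2025 scores from the table above (percent correct).
baseline = 43.3
for label, score in [("MARS (3 agents)", 73.3), ("MARS (5 agents)", 75.8)]:
    gain = relative_improvement(baseline, score)
    print(f"{label}: {gain:+.0f}% relative, {score - baseline:+.1f} points absolute")
```

In absolute terms the same two rows correspond to gains of 30.0 and 32.5 percentage points, so it matters which reading is used when comparing against other reports.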

IMO 2025¶

Configuration	Score	Relative improvement
Baseline	16.7%	-
MARS (3 agents)	33.3%	+100%
MARS (5 agents)	36.7%	+120%

LiveCodeBench¶

Configuration	Score	Relative improvement
Baseline	39.05%	-
MARS (3 agents)	50.48%	+29%
MARS (5 agents)	52.13%	+34%

Configuration Impact¶

Number of Agents¶

Agents | Quality  | Speed  | Tokens
------ | -------- | ------ | ------
1      | Baseline | Fast   | Low
3      | +69%     | Normal | 3x
5      | +75%     | Slow   | 5x
7      | +78%     | Slower | 7x

Temperature Settings¶

Optimal temperatures vary by problem type:

Math Problems: [0.2, 0.5, 0.8]
Creative Tasks: [0.5, 0.8, 1.2]
Coding: [0.1, 0.4, 0.7]
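
One way to use these presets is a lookup that hands each agent its own temperature. The sketch below assumes each list holds one value per agent in a 3-agent run, which the list above does not state. The names `TEMPERATURE_PRESETS` and `temperatures_for` are hypothetical, and the even spacing used for other agent counts is an illustration, not MARS's documented behavior.

```python
# Presets copied from the list above; the dictionary keys are illustrative.
TEMPERATURE_PRESETS = {
    "math": [0.2, 0.5, 0.8],
    "creative": [0.5, 0.8, 1.2],
    "coding": [0.1, 0.4, 0.7],
}


def temperatures_for(task: str, n_agents: int = 3) -> list[float]:
    """Return one temperature per agent for the given task type.

    Assumption: each preset holds one value per agent in a 3-agent run.
    For other agent counts, this sketch spaces values evenly between the
    preset's lowest and highest temperature, which MARS may not do.
    """
    preset = TEMPERATURE_PRESETS[task]
    if n_agents < 1:
        raise ValueError("n_agents must be at least 1")
    if n_agents == len(preset):
        return list(preset)
    if n_agents == 1:
        # A single agent gets the middle value of the preset.
        return [preset[len(preset) // 2]]
    low, high = min(preset), max(preset)
    step = (high - low) / (n_agents - 1)
    return [round(low + i * step, 2) for i in range(n_agents)]


print(temperatures_for("math"))
print(temperatures_for("coding", n_agents=5))
```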

Cost-Benefit Analysis¶

Quality vs. Cost¶

Quality
      |         _______ Diminishing returns
      |        /
      |       /
      |      /  Steep gain
      |_____/____________
             Cost (tokens)

Recommendations:

- 3 agents: Good quality/cost ratio
- 5 agents: High quality, higher cost
- >7 agents: Diminishing returns
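
The diminishing returns can be made concrete by computing the gain contributed by each added agent. This is a small sketch using the quality figures from the Number of Agents table; the helper name is illustrative.

```python
# Relative quality gains over the single-agent baseline, taken from the
# "Number of Agents" table (agents -> gain in percent).
QUALITY_GAIN = {1: 0, 3: 69, 5: 75, 7: 78}


def marginal_gain_per_agent(gains: dict[int, float]) -> list[tuple[int, int, float]]:
    """Return (from_agents, to_agents, gain per added agent) for each step."""
    counts = sorted(gains)
    steps = []
    for prev, curr in zip(counts, counts[1:]):
        per_agent = (gains[curr] - gains[prev]) / (curr - prev)
        steps.append((prev, curr, per_agent))
    return steps


for prev, curr, per_agent in marginal_gain_per_agent(QUALITY_GAIN):
    print(f"{prev} -> {curr} agents: {per_agent:+.1f} points per added agent")
```

With these figures, each agent added between 1 and 3 contributes 34.5 points of relative improvement, against 3.0 between 3 and 5 and 1.5 between 5 and 7.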

Model-Specific Results¶

Results vary by underlying model:

Model	AIME (3 agents)	IMO (3 agents)
Claude 3.5 Sonnet	73.3%	33.3%
GPT-4o	71.5%	32.1%
Gemini 1.5	69.2%	30.8%

Latency Analysis¶

Approximate latency by configuration:

Single agent:     ~3-5s
MARS (3 agents):  ~8-12s  (parallel)
MARS (5 agents):  ~12-18s (parallel)

With thinking tags enabled, add 50-100% to the latencies above.
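
To see what the thinking-tag overhead means in seconds, the ranges above can be widened directly. The sketch assumes the 50% figure applies to the low end of each range and the 100% figure to the high end, which gives the widest plausible range; `with_thinking` is a hypothetical helper.

```python
def with_thinking(latency_range: tuple[float, float],
                  overhead: tuple[float, float] = (0.5, 1.0)) -> tuple[float, float]:
    """Widen a (low, high) latency range in seconds by the thinking-tag overhead."""
    low, high = latency_range
    min_overhead, max_overhead = overhead
    # Assumption: smallest overhead on the low end, largest on the high end.
    return (low * (1 + min_overhead), high * (1 + max_overhead))


# Approximate ranges from the figures above, in seconds.
BASE_LATENCY = {
    "Single agent": (3, 5),
    "MARS (3 agents)": (8, 12),
    "MARS (5 agents)": (12, 18),
}

for config, base in BASE_LATENCY.items():
    low, high = with_thinking(base)
    print(f"{config}: ~{low:g}-{high:g}s with thinking tags")
```

Under that assumption a 3-agent run moves from roughly 8-12s to roughly 12-24s.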

Memory Usage¶

Estimated peak memory by configuration:

Single agent: ~100MB
MARS (3 agents): ~250MB
MARS (5 agents): ~400MB

Token Usage¶

Average tokens consumed:

Query: "Solve this complex math problem"

Single agent: ~2,000 tokens
MARS (3 agents): ~5,000 tokens (2.5x)
MARS (5 agents): ~8,000 tokens (4x)
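
The multipliers above can be reproduced from the raw counts and turned into a rough cost per query. In this sketch, `PRICE_PER_1K_TOKENS` is a made-up placeholder rather than a real provider rate, and `token_report` is an illustrative helper.

```python
# Average token counts from the figures above (agents -> tokens per query).
TOKENS = {1: 2_000, 3: 5_000, 5: 8_000}

# Hypothetical blended price in USD per 1,000 tokens; substitute a real rate.
PRICE_PER_1K_TOKENS = 0.01


def token_report(tokens: dict[int, int], price_per_1k: float) -> list[dict]:
    """Summarize overhead vs. a single agent, tokens per agent, and cost per query."""
    single = tokens[1]
    rows = []
    for agents, total in sorted(tokens.items()):
        rows.append({
            "agents": agents,
            "multiplier": total / single,
            "tokens_per_agent": round(total / agents),
            "cost_usd": total / 1000 * price_per_1k,
        })
    return rows


for row in token_report(TOKENS, PRICE_PER_1K_TOKENS):
    print(row)
```

Dividing by the agent count shows that, on these averages, total tokens grow more slowly than the number of agents: about 1,667 tokens per agent with 3 agents and 1,600 with 5, against 2,000 for a single agent.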

Recommendations¶

For Accuracy¶

Use 5+ agents
Enable thinking tags
Multiple verification rounds

For Speed¶

Use 2-3 agents
Disable thinking tags
Single verification round
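
The accuracy and speed recommendations above could be captured as named presets. This is only a sketch: `MarsProfile` and its field names are hypothetical and do not correspond to documented MARS options, and the open-ended values ("5+ agents", "multiple rounds", "2-3 agents") are pinned to specific numbers purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarsProfile:
    """Hypothetical settings bundle; field names are not MARS's actual options."""
    num_agents: int
    thinking_tags: bool
    verification_rounds: int


PROFILES = {
    # "5+ agents" and "multiple rounds" are taken at their minimum here.
    "accuracy": MarsProfile(num_agents=5, thinking_tags=True, verification_rounds=2),
    # "2-3 agents": 3 is used; drop to 2 if latency matters more.
    "speed": MarsProfile(num_agents=3, thinking_tags=False, verification_rounds=1),
}


def pick_profile(prefer: str) -> MarsProfile:
    """Look up a profile by name, with a clear error for unknown names."""
    try:
        return PROFILES[prefer]
    except KeyError:
        raise ValueError(
            f"unknown profile {prefer!r}; choose from {sorted(PROFILES)}"
        ) from None


print(pick_profile("accuracy"))
```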

For Production¶

3 agents: Good balance
Thinking tags: Yes
Caching: Implement for repeated queries
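
For the caching recommendation, a minimal in-memory sketch is shown here. It assumes a callable that runs MARS for a query and a configuration; `fake_solve` stands in for that call, and `QueryCache` is a hypothetical class, not something MARS provides.

```python
import hashlib
import json
from typing import Callable


class QueryCache:
    """In-memory cache keyed on the normalized query plus its configuration."""

    def __init__(self, solve: Callable[[str, dict], str]):
        self._solve = solve  # hypothetical function that runs MARS
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str, config: dict) -> str:
        # Collapse whitespace so trivially different spellings share one entry;
        # sort config keys so dict ordering does not change the hash.
        payload = json.dumps({"q": " ".join(query.split()), "c": config}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, query: str, config: dict) -> str:
        key = self._key(query, config)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = self._solve(query, config)
        self._store[key] = answer
        return answer


def fake_solve(query: str, config: dict) -> str:
    """Stand-in for an expensive MARS call."""
    return f"answer to: {query}"


cache = QueryCache(fake_solve)
config = {"num_agents": 3, "thinking_tags": True}
cache.get("What is 2 + 2?", config)
cache.get("What is  2 + 2? ", config)  # same query after whitespace normalization
print(cache.hits, cache.misses)
```

Including the configuration in the key keeps a 3-agent answer from being served for a 5-agent request. The store here is unbounded, so a long-running service would likely want a size limit or expiry on top of it.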

See Configuration for tuning options.