MARS Benchmarks

Performance results on standard mathematical reasoning and coding benchmarks.

Benchmark Results

AIME 2025

Configuration           | Score | Improvement
----------------------- | ----- | -----------
Baseline (single model) | 43.3% | -
MARS (3 agents)         | 73.3% | +69%
MARS (5 agents)         | 75.8% | +75%

IMO 2025

Configuration   | Score | Improvement
--------------- | ----- | -----------
Baseline        | 16.7% | -
MARS (3 agents) | 33.3% | +100%
MARS (5 agents) | 36.7% | +120%

LiveCodeBench

Configuration   | Score  | Improvement
--------------- | ------ | -----------
Baseline        | 39.05% | -
MARS (3 agents) | 50.48% | +29%
MARS (5 agents) | 52.13% | +34%
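
The Improvement column appears to report the gain relative to the baseline score, not a percentage-point difference. A minimal sketch of that calculation, using scores from the tables above (the function name `relative_improvement` is illustrative and not part of MARS):

```python
def relative_improvement(baseline: float, score: float) -> float:
    """Return the gain over baseline as a percentage of the baseline score."""
    return (score - baseline) / baseline * 100.0


# Scores (in percent) copied from the AIME 2025 and LiveCodeBench tables.
aime_3 = relative_improvement(43.3, 73.3)
aime_5 = relative_improvement(43.3, 75.8)
lcb_3 = relative_improvement(39.05, 50.48)

print(f"AIME 2025, 3 agents: +{aime_3:.0f}%")
print(f"AIME 2025, 5 agents: +{aime_5:.0f}%")
print(f"LiveCodeBench, 3 agents: +{lcb_3:.0f}%")
```

Other rows may differ from the published figure by a point when recomputed this way, since the scores shown are themselves rounded.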

Configuration Impact

Number of Agents

Agents | Quality | Speed | Tokens
-----  | ------- | ----- | ------
1      | Baseline| Fast  | Low
3      | +69%    | Normal| 3x
5      | +75%    | Slow  | 5x
7      | +78%    | Slower| 7x

Temperature Settings

Optimal temperatures vary by problem type:

  • Math Problems: [0.2, 0.5, 0.8]
  • Creative Tasks: [0.5, 0.8, 1.2]
  • Coding: [0.1, 0.4, 0.7]
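
One way to encode the per-task temperature lists above is a small lookup table, with each agent taking one temperature. This is only a sketch: the names `TEMPERATURES` and `temperatures_for` are hypothetical and do not reflect the actual MARS configuration API, and cycling through the list when there are more than three agents is an assumption.

```python
from itertools import cycle, islice

# Temperature lists copied from the recommendations above; keys are illustrative.
TEMPERATURES = {
    "math": [0.2, 0.5, 0.8],
    "creative": [0.5, 0.8, 1.2],
    "coding": [0.1, 0.4, 0.7],
}


def temperatures_for(task_type: str, num_agents: int) -> list[float]:
    """Assign one temperature per agent, cycling through the list if needed."""
    if num_agents < 1:
        raise ValueError("num_agents must be at least 1")
    try:
        base = TEMPERATURES[task_type]
    except KeyError:
        raise ValueError(f"unknown task type: {task_type!r}") from None
    return list(islice(cycle(base), num_agents))


print(temperatures_for("math", 3))
print(temperatures_for("coding", 5))
```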

Cost-Benefit Analysis

Quality vs. Cost

Cost (tokens)
      |    Steep gain
      |   /
      |  /
      | /_____ Diminishing returns
      |________
             Time

Recommendations:

  • 3 agents: Good quality/cost ratio
  • 5 agents: High quality, higher cost
  • >7 agents: Diminishing returns
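
The diminishing returns can be made concrete by computing the quality gain per additional agent from the Number of Agents table. The sketch below treats the table's Quality column as a gain over baseline in percent; the helper name `marginal_gains` is illustrative.

```python
# (agents, quality gain over baseline in %) from the Number of Agents table.
RESULTS = [(1, 0.0), (3, 69.0), (5, 75.0), (7, 78.0)]


def marginal_gains(results):
    """Return (agents, gain per additional agent) for each step up in agent count."""
    gains = []
    for (prev_n, prev_q), (n, q) in zip(results, results[1:]):
        gains.append((n, (q - prev_q) / (n - prev_n)))
    return gains


for agents, gain in marginal_gains(RESULTS):
    print(f"up to {agents} agents: {gain:.1f} points per added agent")
```

Going from 1 to 3 agents yields 34.5 points per added agent, while the steps to 5 and 7 agents yield 3.0 and 1.5 points respectively.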

Model-Specific Results

Results vary by underlying model:

Model             | AIME (3 agents) | IMO (3 agents)
----------------- | --------------- | --------------
Claude 3.5 Sonnet | 73.3%           | 33.3%
GPT-4o            | 71.5%           | 32.1%
Gemini 1.5        | 69.2%           | 30.8%

Latency Analysis

Approximate latency by configuration:

Single agent:     ~3-5s
MARS (3 agents):  ~8-12s  (parallel)
MARS (5 agents):  ~12-18s (parallel)

With thinking tags: Add 50-100% to latency
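
To see what the thinking-tag overhead means in practice, the following sketch applies the 50-100% range to the approximate latencies above. It assumes the best case is the low end plus 50% and the worst case is the high end plus 100%; the names used are illustrative.

```python
# Approximate latency ranges in seconds, copied from the figures above.
BASE_LATENCY = {
    "single": (3.0, 5.0),
    "mars_3": (8.0, 12.0),
    "mars_5": (12.0, 18.0),
}


def with_thinking_tags(low: float, high: float) -> tuple[float, float]:
    """Apply the 50-100% overhead: best case low * 1.5, worst case high * 2.0."""
    return low * 1.5, high * 2.0


for name, (low, high) in BASE_LATENCY.items():
    t_low, t_high = with_thinking_tags(low, high)
    print(f"{name}: ~{t_low:.1f}-{t_high:.1f}s with thinking tags")
```

Under these assumptions, a 3-agent run would take roughly 12-24s with thinking tags enabled.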

Memory Usage

Estimated peak memory by config:

  • Single agent: ~100MB
  • MARS (3 agents): ~250MB
  • MARS (5 agents): ~400MB

Token Usage

Average tokens consumed:

Query: "Solve this complex math problem"

Single agent: ~2,000 tokens
MARS (3 agents): ~5,000 tokens (2.5x)
MARS (5 agents): ~8,000 tokens (4x)
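
For rough budgeting, the averages above can be turned into a multiplier and a batch estimate. This is a back-of-the-envelope sketch: real usage depends on the query, and the names `TOKENS_PER_QUERY`, `token_multiplier`, and `batch_tokens` are illustrative.

```python
# Average tokens per query, copied from the figures above.
TOKENS_PER_QUERY = {"single": 2_000, "mars_3": 5_000, "mars_5": 8_000}


def token_multiplier(config: str) -> float:
    """Token usage of a configuration relative to a single agent."""
    return TOKENS_PER_QUERY[config] / TOKENS_PER_QUERY["single"]


def batch_tokens(config: str, num_queries: int) -> int:
    """Rough total token budget for a batch of similar queries."""
    return TOKENS_PER_QUERY[config] * num_queries


print(token_multiplier("mars_3"), token_multiplier("mars_5"))
print(batch_tokens("mars_3", 100))
```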

Recommendations

For Accuracy

  • Use 5+ agents
  • Enable thinking tags
  • Multiple verification rounds

For Speed

  • Use 2-3 agents
  • Disable thinking tags
  • Single verification round

For Production

  • 3 agents: Good balance
  • Thinking tags: Yes
  • Caching: Implement for repeated queries
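
The caching recommendation could be implemented along these lines. This is a minimal in-memory sketch, not a MARS feature: `cached_solve` and `cache_key` are hypothetical helpers, and `fake_solve` stands in for whatever function actually runs MARS in your setup.

```python
import hashlib
import json
from typing import Callable

# In-memory cache; a production setup might use an external store instead.
_cache: dict[str, str] = {}


def cache_key(query: str, num_agents: int, thinking_tags: bool) -> str:
    """Build a stable key from the query and the settings that affect the answer."""
    payload = json.dumps(
        {"query": query.strip(), "agents": num_agents, "thinking": thinking_tags},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_solve(
    solve: Callable[[str], str],
    query: str,
    num_agents: int = 3,
    thinking_tags: bool = True,
) -> str:
    """Return a cached answer if present; otherwise call `solve` and store it."""
    key = cache_key(query, num_agents, thinking_tags)
    if key not in _cache:
        _cache[key] = solve(query)
    return _cache[key]


# Stand-in for a real MARS call (hypothetical); records each invocation.
calls = []


def fake_solve(query: str) -> str:
    calls.append(query)
    return f"answer to: {query}"


cached_solve(fake_solve, "What is 2 + 2?")
cached_solve(fake_solve, "What is 2 + 2?")  # served from the cache
print(len(calls))
```

Including the agent count and thinking-tag setting in the key keeps answers from different configurations separate. Note that caching returns the same answer for a repeated query, which trades away the variation that sampling at nonzero temperature would otherwise give.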

See Configuration for tuning options.