MARS Benchmarks¶
Performance results on standard mathematical reasoning benchmarks.
Benchmark Results¶
AIME 2025¶
| Configuration | Score | Improvement |
|---|---|---|
| Baseline (single model) | 43.3% | - |
| MARS (3 agents) | 73.3% | +69% |
| MARS (5 agents) | 75.8% | +75% |
IMO 2025¶
| Configuration | Score | Improvement |
|---|---|---|
| Baseline | 16.7% | - |
| MARS (3 agents) | 33.3% | +100% |
| MARS (5 agents) | 36.7% | +120% |
LiveCodeBench¶
| Configuration | Score | Improvement |
|---|---|---|
| Baseline | 39.05% | - |
| MARS (3 agents) | 50.48% | +29% |
| MARS (5 agents) | 52.13% | +34% |
Configuration Impact¶
Number of Agents¶
Text Only
Agents | Quality | Speed | Tokens
----- | ------- | ----- | ------
1 | Baseline| Fast | Low
3 | +69% | Normal| 3x
5 | +75% | Slow | 5x
7 | +78% | Slower| 7x
Temperature Settings¶
Optimal temperatures vary by problem type:
- Math Problems: [0.2, 0.5, 0.8]
- Creative Tasks: [0.5, 0.8, 1.2]
- Coding: [0.1, 0.4, 0.7]
Cost-Benefit Analysis¶
Quality vs. Cost¶
Text Only
Cost (tokens)
| Steep gain
| /
| /
| /_____ Diminishing returns
|________
Time
Recommendations: - 3 agents: Good quality/cost ratio - 5 agents: High quality, higher cost - >7 agents: Diminishing returns
Model-Specific Results¶
Results vary by underlying model:
| Model | AIME (3 agents) | IMO (3 agents) |
|---|---|---|
| Claude 3.5 Sonnet | 73.3% | 33.3% |
| GPT-4o | 71.5% | 32.1% |
| Gemini 1.5 | 69.2% | 30.8% |
Latency Analysis¶
Approximate latency by configuration:
Text Only
Single agent: ~3-5s
MARS (3 agents): ~8-12s (parallel)
MARS (5 agents): ~12-18s (parallel)
With thinking tags: Add 50-100% to latency
Memory Usage¶
Estimated peak memory by config:
- Single agent: ~100MB
- MARS (3 agents): ~250MB
- MARS (5 agents): ~400MB
Token Usage¶
Average tokens consumed:
Text Only
Query: "Solve this complex math problem"
Single agent: ~2,000 tokens
MARS (3 agents): ~5,000 tokens (2.5x)
MARS (5 agents): ~8,000 tokens (4x)
Recommendations¶
For Accuracy¶
- Use 5+ agents
- Enable thinking tags
- Multiple verification rounds
For Speed¶
- Use 2-3 agents
- Disable thinking tags
- Single verification round
For Production¶
- 3 agents: Good balance
- Thinking tags: Yes
- Caching: Implement for repeated queries
See Configuration for tuning options.