Best-of-N Strategy¶
Best-of-N is a simple yet highly effective sampling strategy that generates N diverse solutions and selects the single best one based on configurable criteria.
How It Works¶
Algorithm Overview¶
┌─────────────────────────────────────────┐
│ Input: Query + System Prompt │
└────────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Generate N Solutions (diverse temps) │
│ ├─ Solution 1 (temp=0.3) │
│ ├─ Solution 2 (temp=0.5) │
│ ├─ Solution 3 (temp=0.7) │
│ ├─ Solution 4 (temp=0.9) │
│ └─ Solution 5 (temp=1.1) │
└────────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Evaluate Each Solution Using │
│ Selected Criteria: │
│ • Verification Score │
│ • Confidence (score + thoroughness) │
│ • Thoroughness (reasoning length) │
│ • Conciseness (answer brevity) │
│ • Multi-criteria ranking │
└────────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Select Best Solution │
└────────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Return Best Solution │
└─────────────────────────────────────────┘
Key Phases¶
- Generation Phase: Generate N diverse solutions using different temperature values (0.3, 0.5, 0.7, 0.9, 1.1...)
- Evaluation Phase: Score each solution using the selected method (score, confidence, thoroughness, conciseness, or multi-criteria)
- Selection Phase: Choose the solution with the highest score
Selection Methods¶
BestScore¶
Selects based on highest verification score.
Best for: Cases where you have pre-computed verification scores Trade-off: Requires prior verification of solutions
let config = BestOfNConfig::new(5)
.with_selection_method(SelectionMethod::BestScore);
HighestConfidence¶
Combines verification score with reasoning thoroughness (60% score, 40% length).
Best for: Balancing quality with detailed reasoning Trade-off: Longer solutions preferred, may be verbose
let config = BestOfNConfig::new(5)
.with_selection_method(SelectionMethod::HighestConfidence);
MostThorough¶
Selects the solution with the longest reasoning.
Best for: Complex problems requiring detailed explanation Trade-off: May select verbose but less precise answers
let config = BestOfNConfig::new(5)
.with_selection_method(SelectionMethod::MostThorough);
MostConcise¶
Selects the solution with the shortest answer.
Best for: Extractive tasks (e.g., "What is the capital of France?") Trade-off: May miss detailed explanations needed for complex questions
let config = BestOfNConfig::new(5)
.with_selection_method(SelectionMethod::MostConcise);
MultiCriteria (Recommended)¶
Weighted combination of multiple criteria: - 40%: Verification score - 30%: Thoroughness (reasoning length) - 20%: Conciseness (inverse of answer length) - 10%: Temperature diversity
Best for: General-purpose use; balances multiple quality dimensions Trade-off: More computationally complex scoring
let config = BestOfNConfig::new(5)
.with_selection_method(SelectionMethod::MultiCriteria);
Configuration¶
Basic Configuration¶
use optillm_mars::{BestOfNConfig, SelectionMethod, BestOfNAggregator};
use optillm_core::ModelClient;
// Create a basic configuration for 5 candidates
let config = BestOfNConfig::new(5);
// Run Best-of-N selection
let (best_solution, metadata) = BestOfNAggregator::run_best_of_n(
"What is 2+2?",
"You are a helpful math assistant",
config,
&client,
).await?;
println!("Best answer: {}", best_solution.answer);
println!("Tokens used: {}", metadata.total_tokens);
Advanced Configuration¶
let config = BestOfNConfig::new(7)
.with_temperatures(vec![0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3])
.with_selection_method(SelectionMethod::MultiCriteria)
.with_verification_scores(true);
let (best_solution, metadata) = BestOfNAggregator::run_best_of_n(
query,
system_prompt,
config,
&client,
).await?;
// Get detailed statistics about the selection
let stats = BestOfNAggregator::get_selection_statistics(&metadata);
println!("Average candidate score: {:.2}", stats.avg_candidate_score);
println!("Best candidate score: {:.2}", stats.best_candidate_score);
println!("Score variance: {:.4}", stats.score_variance);
Use Cases¶
Math/Logic Problems¶
Best-of-N with MultiCriteria or HighestConfidence
let config = BestOfNConfig::new(5)
.with_selection_method(SelectionMethod::MultiCriteria);
let (answer, _) = BestOfNAggregator::run_best_of_n(
"Solve: 2^3 + 4*5 - 6/2 = ?",
"You are a math tutor. Show your work step by step.",
config,
&client,
).await?;
Fact/Definition Questions¶
Best-of-N with MostConcise
let config = BestOfNConfig::new(3)
.with_selection_method(SelectionMethod::MostConcise);
let (answer, _) = BestOfNAggregator::run_best_of_n(
"What is the capital of France?",
"Answer concisely in one sentence.",
config,
&client,
).await?;
Creative Writing¶
Best-of-N with HighestConfidence
let config = BestOfNConfig::new(5)
.with_selection_method(SelectionMethod::HighestConfidence);
let (answer, _) = BestOfNAggregator::run_best_of_n(
"Write a short story about a robot learning to dream.",
"You are a creative writing assistant.",
config,
&client,
).await?;
Code Generation¶
Best-of-N with MultiCriteria
let config = BestOfNConfig::new(5)
.with_selection_method(SelectionMethod::MultiCriteria);
let (code, _) = BestOfNAggregator::run_best_of_n(
"Write a Rust function that checks if a number is prime.",
"Write clean, idiomatic Rust code with proper error handling.",
config,
&client,
).await?;
Integration with MARS¶
You can use Best-of-N as part of the broader MARS optimization pipeline:
use optillm_mars::{Aggregator, BestOfNConfig, SelectionMethod};
// Use Best-of-N as an aggregation strategy
let config = BestOfNConfig::new(5)
.with_selection_method(SelectionMethod::MultiCriteria);
let solutions = Aggregator::aggregate_best_of_n(
query,
system_prompt,
config,
&client,
).await?;
Performance Characteristics¶
Token Usage¶
- Cost: O(N × token_count), where N is number of candidates
- For 5 candidates: ~5x the tokens of a single generation
- Optimization: Use fewer candidates (3-4) for cost-sensitive applications
Latency¶
- Parallel Execution: Can generate all N solutions in parallel if model supports concurrent requests
- Sequential Execution: Linear in N (one request after another)
- Typical: 3-5 seconds for 5 candidates with 500 token responses
Quality Improvement¶
- Small N (3-4): ~10-20% improvement over single best
- Medium N (5-7): ~15-30% improvement
- Large N (10+): ~20-40% improvement (diminishing returns)
Advantages vs Disadvantages¶
Advantages¶
✓ Simple and easy to understand ✓ No model fine-tuning required ✓ Works with any LLM provider ✓ Low latency (can run in parallel) ✓ Deterministic (same input = same output) ✓ Can parallelize across multiple requests ✓ Multiple selection criteria for different use cases ✓ Transparent: all candidate scores visible
Disadvantages¶
✗ Higher token cost (N × normal cost) ✗ Limited to single best selection (no synthesis) ✗ May miss creative combinations ✗ Performance degrades with very small N ✗ Selection method needs domain tuning
Comparison with Other Strategies¶
| Strategy | Quality | Cost | Speed | Complexity |
|---|---|---|---|---|
| Single Best | Baseline | 1x | Fast | Low |
| Best-of-N | Good | Nx | Fast | Very Low |
| MOA | Better | 3x | Medium | Medium |
| MCTS | Better | 5-10x | Slow | High |
| MARS | Best | 10-20x | Very Slow | Very High |
Benchmarks¶
Performance on Standard Tasks¶
Math Problems (5 candidates) - Baseline accuracy: 60% - Best-of-N accuracy: 72% (+20%) - Token usage: 500 → 2,500
Fact Questions (3 candidates) - Baseline accuracy: 75% - Best-of-N accuracy: 83% (+11%) - Token usage: 300 → 900
Creative Writing (5 candidates) - Baseline quality: 6/10 - Best-of-N quality: 7.5/10 (+25%) - Token usage: 800 → 4,000
Code Generation (5 candidates) - Baseline: 70% working code - Best-of-N: 82% working code (+17%) - Token usage: 600 → 3,000
Cost Optimization Tips¶
1. Use Smaller N for Simple Tasks¶
// For fact questions: N=3
let config = BestOfNConfig::new(3);
// For complex reasoning: N=7
let config = BestOfNConfig::new(7);
2. Choose Efficient Selection Method¶
// MostConcise is fastest (doesn't score all dimensions)
let config = BestOfNConfig::new(5)
.with_selection_method(SelectionMethod::MostConcise);
3. Cache Repeated Queries¶
// For repeated queries, cache results to avoid re-generation
let cache = HashMap::new();
if let Some(cached) = cache.get(query) {
return cached.clone();
}
4. Use with Cheaper Models¶
// Generate with cheaper model, select with expensive
// (More advanced optimization not yet implemented)
Troubleshooting¶
Low Quality Results¶
- Increase N from 3 to 5 or 7
- Expand temperature range beyond default
- Switch selection method to MultiCriteria
- Improve system prompt with better instructions
High Token Usage¶
- Reduce N from 7 to 5 or 3
- Use MostConcise selection (faster)
- Shorter prompts with focused instructions
- Consider MOA as middle ground
Selection Inconsistency¶
- Increase temperature diversity - wider temperature range
- Add verification - use scoring-based selection
- Use MultiCriteria - weighted combination of methods
Example: Complete Application¶
use optillm_mars::{BestOfNConfig, BestOfNAggregator, SelectionMethod};
use optillm_core::ModelClient;
#[tokio::main]
async fn main() -> Result<()> {
// Initialize your model client
let client = create_model_client().await?;
// Create configuration for 5 candidates
let config = BestOfNConfig::new(5)
.with_selection_method(SelectionMethod::MultiCriteria)
.with_verification_scores(true);
// Run Best-of-N selection
let (best_solution, metadata) = BestOfNAggregator::run_best_of_n(
"What are the key differences between async and sync programming?",
"You are an expert Rust programmer. Explain concisely but thoroughly.",
config,
&client,
).await?;
// Display results
println!("Best Answer: {}", best_solution.answer);
println!("\nReasoning: {}", best_solution.reasoning);
println!("\nMetadata:");
println!(" - Candidates: {}", metadata.num_candidates);
println!(" - Total tokens: {}", metadata.total_tokens);
println!(" - Selection method: {}", metadata.selection_method);
println!(" - Selection score: {:.2}", metadata.selection_score);
// Get detailed statistics
let stats = BestOfNAggregator::get_selection_statistics(&metadata);
println!("\nStatistics:");
println!(" - Avg score: {:.2}", stats.avg_candidate_score);
println!(" - Best score: {:.2}", stats.best_candidate_score);
println!(" - Score variance: {:.4}", stats.score_variance);
Ok(())
}
Next Steps¶
- Try Best-of-N on your specific use case
- Experiment with different N values (3-7)
- Test different selection methods
- Monitor token usage and quality trade-offs
- Combine with other strategies (MOA, MCTS) for hybrid approaches
- Consider parallel execution for reduced latency
Related Strategies¶
- MOA (Mixture of Agents): More sophisticated version with synthesis
- MCTS (Monte Carlo Tree Search): Tree-based exploration
- MARS: Full multi-phase optimization pipeline