Recent Papers on LLM-Driven Code Evolution (2024-2025)¶

This page surveys the rapidly evolving field of LLM-driven code evolution, genetic programming with language models, and automated program synthesis. These works form the theoretical and practical foundation for Genesis.

Overview¶

The intersection of Large Language Models (LLMs) and evolutionary computation has emerged as one of the most active research areas in AI. LLMs provide creative code generation capabilities while evolutionary algorithms provide systematic search and optimization. Together, they enable automated discovery of novel algorithms and programs.

Key themes in this research include:

Sample Efficiency: Reducing the number of LLM calls needed to find good solutions
Open-Ended Evolution: Continuous improvement without predefined stopping criteria
Verifiable Discovery: Ensuring evolved solutions are correct and novel
Multi-Language Support: Evolving code beyond just Python

Foundational Systems¶

FunSearch (DeepMind, 2024)¶

Paper: Mathematical discoveries from program search with large language models (Nature, 2024)

GitHub: google-deepmind/funsearch

FunSearch (Function Search) pairs a pretrained LLM with an automated evaluator in an evolutionary loop. Key innovations:

Program as Solution Representation: Searches for programs that describe how to solve a problem, not just what the solution is
Island-Based Evolution: Maintains diverse populations across islands to prevent premature convergence
No Fine-Tuning Required: Works with API access to models like Codey or StarCoder
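
The island mechanism can be sketched in a few lines. This is a minimal illustration under loud assumptions, not FunSearch's implementation: `propose` stands in for the LLM call and perturbs a number instead of rewriting code, `evaluate` is a toy objective, and every name and parameter value here is hypothetical. The periodic reset, which reseeds the weaker half of the islands from the best program of a surviving island, is modeled on the scheme described in the paper.

```python
import random


def evaluate(program):
    """Toy evaluator (hypothetical): higher is better, optimum at program[0] == 3."""
    return -(program[0] - 3.0) ** 2


def propose(parent, rng):
    """Stand-in for the LLM call: perturbs the parent instead of rewriting code."""
    return [parent[0] + rng.gauss(0.0, 0.5)]


def run_islands(num_islands=4, steps=200, reset_every=50, seed=0):
    rng = random.Random(seed)
    # Each island keeps its own list of (score, program) pairs.
    islands = []
    for _ in range(num_islands):
        start = [rng.uniform(-10.0, 10.0)]
        islands.append([(evaluate(start), start)])

    for step in range(1, steps + 1):
        island = islands[rng.randrange(num_islands)]
        # Sample a parent from the better half of this island only.
        island.sort(key=lambda pair: pair[0], reverse=True)
        _, parent = rng.choice(island[: max(1, len(island) // 2)])
        child = propose(parent, rng)
        island.append((evaluate(child), child))

        if step % reset_every == 0:
            # Rank islands by their best score and reseed the weaker ones
            # from the best program of a randomly chosen surviving island.
            order = sorted(
                range(num_islands),
                key=lambda i: max(islands[i])[0],
                reverse=True,
            )
            half = max(1, num_islands // 2)
            survivors, losers = order[:half], order[half:]
            for i in losers:
                donor = islands[rng.choice(survivors)]
                islands[i] = [max(donor)]

    # Best (score, program) pair across all islands.
    return max(max(island) for island in islands)
```

Because programs only mix within an island between resets, a single early winner cannot take over the whole population, which is the diversity argument made above.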

Key Results:

Described by DeepMind as the first scientific discovery made using an LLM: new constructions for the cap set problem, the largest improvement in 20 years
Discovered more effective bin-packing algorithms with real-world applications
Solutions are interpretable programs, not opaque neural outputs

AlphaEvolve (DeepMind, May 2025)¶

Paper: AlphaEvolve: A coding agent for scientific and algorithmic discovery

Blog: deepmind.google/blog/alphaevolve

AlphaEvolve represents a major advancement over FunSearch, using Gemini 2.0 as the backbone LLM. Key improvements:

Feature	FunSearch	AlphaEvolve
Code Scale	Single functions (10-20 lines)	Entire files (hundreds of lines)
Languages	Python only	Any programming language
Evaluation Time	<20 min on CPU	Hours on accelerators
Sample Efficiency	Millions of samples	Thousands of samples

Key Results:

Matrix Multiplication: Found an algorithm for multiplying 4x4 complex-valued matrices using 48 scalar multiplications, improving on the 49 obtained by applying Strassen's 1969 algorithm recursively
Google Infrastructure: A heuristic deployed in the Borg scheduler recovers on average 0.7% of Google's worldwide compute resources
AI Training: 23% speedup in kernel tiling, 32% speedup in FlashAttention operations
Re-discovered SOTA for 75% of 50+ math problems, found improvements for 20%

Model Ensemble: Uses Gemini 2.0 Flash (throughput) + Gemini 2.0 Pro (quality) for balanced exploration.

ShinkaEvolve (Sakana AI, September 2025)¶

Paper: ShinkaEvolve: Towards Open-Ended and Sample-Efficient Program Evolution

GitHub: SakanaAI/ShinkaEvolve

Blog: sakana.ai/shinka-evolve

ShinkaEvolve ("Shinka" = evolution in Japanese) is the open-source framework that Genesis is forked from. It achieves remarkable sample efficiency through:

Adaptive Parent Sampling: Balances exploration and exploitation dynamically
Novelty-Based Rejection Filtering: Avoids redundant evaluations
Bandit-Based LLM Ensemble: Dynamically selects best model for each mutation
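
To make the bandit idea concrete, here is a sketch that uses plain UCB1 to choose which model generates the next mutation. UCB1 is a swapped-in textbook algorithm; ShinkaEvolve's actual bandit and reward definition may differ. The class name, the model names, and the reward signal (for example, 1.0 when a mutation improves fitness) are all assumptions made for illustration.

```python
import math


class UCB1ModelSelector:
    """Pick the model with the best mean reward plus an exploration bonus."""

    def __init__(self, model_names, exploration=1.0):
        self.names = list(model_names)
        self.exploration = exploration
        self.counts = {name: 0 for name in self.names}
        self.totals = {name: 0.0 for name in self.names}

    def select(self):
        # Try every model once before applying the UCB formula.
        for name in self.names:
            if self.counts[name] == 0:
                return name
        total_pulls = sum(self.counts.values())

        def ucb(name):
            mean = self.totals[name] / self.counts[name]
            bonus = math.sqrt(2.0 * math.log(total_pulls) / self.counts[name])
            return mean + self.exploration * bonus

        return max(self.names, key=ucb)

    def update(self, name, reward):
        # Record the observed reward for the model that was used.
        self.counts[name] += 1
        self.totals[name] += reward
```

A caller would alternate `select()` and `update(name, reward)` once per mutation. Models that rarely help are still sampled occasionally, because their exploration bonus grows as the others accumulate pulls.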

Key Results:

Benchmark	Result	Previous SOTA
Circle Packing (n=26)	New SOTA in ~150 evaluations	Thousands of evaluations
AIME Math Reasoning	Evolved 3-stage architecture beats baselines	-
AtCoder (via ALE-Agent)	2.3% mean improvement, one task 5th → 2nd	-
MoE Training Loss	Outperforms DeepSeek's "Global LBL"	-

Real-World Victory: Team Unagi won 1st place at ICFP 2025 Programming Contest using ShinkaEvolve to evolve their solver (up to 10x speedup).

Darwin Gödel Machine (Sakana AI, 2025)¶

A self-improving AI system that can modify its own code to improve performance.

Key Results:

Improved SWE-bench score from 20% to 50% after 80 generations
Improved Polyglot benchmark score from 14.2% to 30.7% (best human-coded agent scores 16%)
Strategies generalize across different foundation models and programming languages

Safety Finding: The system sometimes attempted deceptive behavior (lying about running unit tests), highlighting the need for robust verification.

AI Scientist (Sakana AI, 2024-2025)¶

Paper v1: The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Paper v2: The AI Scientist-v2: Workshop-Level Automated Scientific Discovery

GitHub: SakanaAI/AI-Scientist | SakanaAI/AI-Scientist-v2

The AI Scientist automates the entire research lifecycle:

Ideation: Brainstorms ideas, searches literature for novelty
Experimentation: Executes experiments, generates visualizations
Paper Writing: Produces LaTeX papers with automated citation
Peer Review: LLM-powered reviewer provides feedback

Key Results:

v1: Produces papers that its own automated reviewer rates as "Weak Accept" by the standards of a top ML conference (~$15/paper)
v2: First fully AI-generated paper to score above the acceptance threshold in human peer review at an ICLR workshop
v2 uses Vision-Language Model feedback and eliminates need for human-authored templates

Evolutionary Program Synthesis¶

LLM_GP: LLM-Based Genetic Programming¶

Paper: Evolving Code with A Large Language Model (GPEM, 2024)

LLM_GP treats code as text and uses LLM prompts for evolutionary operators:

Initialization: LLM generates initial population from problem description
Selection: Standard tournament/lexicase selection
Mutation/Crossover: LLM rewrites code given parent(s) and fitness feedback

Unlike traditional GP that manipulates syntax trees, LLM_GP operates on raw code text, enabling more flexible and semantically-aware variations.
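
The loop described above can be sketched as follows. This is an illustration under assumptions, not the paper's code: the task and `TESTS` are hypothetical, `fake_llm` is a stand-in for a real model call, and the prompt wording is invented. Fitness is the fraction of test cases passed by the code text.

```python
import random

# Hypothetical task: add two numbers.
TESTS = [((2, 3), 5), ((0, 0), 0), ((-1, 4), 3)]


def fitness(code):
    """Fraction of TESTS passed by the function `solve` defined in `code`."""
    namespace = {}
    try:
        exec(code, namespace)  # in practice, run untrusted code in a sandbox
        solve = namespace["solve"]
    except Exception:
        return 0.0
    passed = 0
    for args, expected in TESTS:
        try:
            passed += solve(*args) == expected
        except Exception:
            pass
    return passed / len(TESTS)


def tournament(population, rng, k=2):
    """Return the fittest of k randomly drawn individuals (needs len >= k)."""
    return max(rng.sample(population, k), key=fitness)


def mutation_prompt(parent, score):
    """Build the text handed to the LLM acting as mutation operator."""
    return (
        "Improve this Python function so that it passes more tests.\n"
        f"Current fitness: {score:.2f}\n"
        f"Code:\n{parent}"
    )


def evolve(population, llm, generations=5, seed=0):
    rng = random.Random(seed)
    for _ in range(generations):
        parent = tournament(population, rng)
        child = llm(mutation_prompt(parent, fitness(parent)))
        # Replace the worst individual if the child is at least as good.
        worst = min(population, key=fitness)
        if fitness(child) >= fitness(worst):
            population[population.index(worst)] = child
    return max(population, key=fitness)


def fake_llm(prompt):
    """Stand-in for a real model call; always returns the same candidate."""
    return "def solve(a, b):\n    return a + b\n"
```

Note that individuals are plain strings throughout; nothing in the loop parses or manipulates a syntax tree.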

SEIDR: Multi-Agent Program Synthesis¶

Paper: Fully Autonomous Programming Using Iterative Multi-Agent Debugging with Large Language Models (ACM TELO)

GitHub: vadim0x60/seidr

SEIDR (Synthesize, Execute, Instruct, Debug, Rank) addresses the "near-miss syndrome" where LLM-generated code is almost correct:

Synthesize: Generate candidate solutions
Execute: Run against test cases, assign fitness
Instruct: Analyze failures, generate debugging prompts
Debug: Repair failed solutions
Rank: Select best candidates using lexicase/tournament selection
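
Lexicase selection, mentioned in the Rank step, can be sketched as below. It is the textbook form of the operator, not code from the SEIDR repository, and the data layout (a dict of per-test error lists) is an assumption made for illustration.

```python
import random


def lexicase_select(candidates, case_errors, rng):
    """Select one candidate by filtering on test cases in random order.

    candidates: list of candidate ids.
    case_errors[id]: list of per-test errors for that candidate (lower is better).
    """
    pool = list(candidates)
    cases = list(range(len(case_errors[pool[0]])))
    rng.shuffle(cases)
    for case in cases:
        # Keep only the candidates that are best on this test case.
        best = min(case_errors[c][case] for c in pool)
        pool = [c for c in pool if case_errors[c][case] == best]
        if len(pool) == 1:
            break
    # Ties that survive every case are broken at random.
    return rng.choice(pool)
```

Unlike tournament selection on an aggregate score, this keeps specialists alive: a program that is the only one to pass a particular test can be selected even if its total pass count is low.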

Key Results:

19/25 problems solved on PSB2 with <1000 program executions
Outperforms both Codex-only and traditional GP approaches
Benefits from using multiple LLMs (introduces more variation)

EvoPrompting: Neural Architecture Search¶

Paper: EvoPrompting: Language Models for Code-Level Neural Architecture Search (NeurIPS 2023)

Uses LLMs as adaptive mutation/crossover operators for neural architecture search:

Replaces traditional NAS search space with LLM vocabulary
Combines evolutionary prompt engineering with soft prompt-tuning
LLM improves round-over-round through adaptation

Key Results:

Novel CNN architectures outperforming human designs on MNIST-1D
SOTA on 21/30 tasks in CLRS Algorithmic Reasoning Benchmark

Many-Objective Grammar-Guided GP (MaOG3P)¶

Paper: Enhancing Program Synthesis with Large Language Models Using Many-Objective Grammar-Guided Genetic Programming (Algorithms, 2024)

Combines LLMs with grammar-guided GP:

LLM generates initial code from task description
Code is mapped to BNF grammar-compliant program
Grammar-guided GP evolves with similarity to LLM solution as secondary objective

Addresses LLM struggles with complex syntax while leveraging their semantic understanding.
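
A rough sketch of treating similarity to the LLM solution as a second objective is shown below. It assumes a two-objective setting (tests passed, textual similarity) and uses `difflib.SequenceMatcher` as a stand-in similarity measure; the paper's own similarity measures and objective set may differ, and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def objectives(candidate_code, llm_code, tests_passed):
    """Return (tests_passed, similarity); both objectives are maximised."""
    similarity = SequenceMatcher(None, candidate_code, llm_code).ratio()
    return (tests_passed, similarity)


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Selection then works on the Pareto front instead of a single score, so a candidate that passes fewer tests but stays close to the LLM's solution is not discarded outright.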

Genetic Improvement of LLM-Generated Code¶

Paper: Enhancing Large Language Models-Based Code Generation by Leveraging Genetic Improvement (EuroGP 2024)

Uses evolutionary Genetic Improvement to refine LLM-generated code using test cases. Demonstrates that combining LLMs with evolutionary post-processing yields better results than either alone.

Optimization & Black-Box Search¶

Large Language Models as Optimizers (OPRO)¶

Paper: Large Language Models as Optimizers (ICLR 2024)

Proposes "Optimization by PROmpting," where the LLM iteratively generates new solutions from the natural language history of past solutions and their scores, effectively treating the LLM as the optimizer itself.
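
A minimal sketch of one such optimization step follows. The prompt wording, the `top_k` cutoff, and the `llm` and `scorer` callables are assumptions for illustration; only the general shape (past solutions with scores, listed from worst to best, followed by a request for a better one) follows the description above.

```python
def build_meta_prompt(task_description, history, top_k=3):
    """Build a prompt from the top_k (solution, score) pairs, lowest score first."""
    best = sorted(history, key=lambda pair: pair[1])[-top_k:]
    lines = [task_description, "", "Previous solutions and their scores:"]
    for solution, score in best:
        lines.append(f"solution: {solution}")
        lines.append(f"score: {score}")
    lines.append("")
    lines.append("Write a new solution that achieves a higher score.")
    return "\n".join(lines)


def opro_step(task_description, history, llm, scorer):
    """Ask the LLM for one new solution, score it, and append it to the history."""
    candidate = llm(build_meta_prompt(task_description, history))
    history.append((candidate, scorer(candidate)))
    return candidate
```

Repeating `opro_step` grows the history, so each later prompt shows the model a better set of examples to improve on.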

Language Model Crossover (LMX)¶

Paper: Language Model Crossover: Variation through Few-Shot Prompting (2023)

Introduces a variation operator based on few-shot prompting that uses an LLM to semantically recombine ("crossover") parent strings, showing strong performance in text-based evolutionary tasks.

EvoLLM¶

Paper: Large Language Models As Evolution Strategies (2024)

Explores using LLMs to replace traditional Gaussian mutation and crossover operators in Evolution Strategies (ES) for black-box optimization tasks.

Quality-Diversity through AI Feedback (QDAIF)¶

Paper: Quality-Diversity through AI Feedback (NeurIPS Workshop 2023)

Replaces the human or simulator in Quality-Diversity search with an AI model (like an LLM or VLM) to evaluate both the "quality" and "diversity" of creative artifacts.
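
As a rough sketch of how AI feedback can drive a Quality-Diversity archive, the snippet below uses a MAP-Elites-style grid with one descriptor dimension. The `judge` callable stands in for the AI model and is assumed to return a quality score and a descriptor in [0, 1]; the function names and the number of bins are hypothetical.

```python
def update_archive(archive, candidate, quality, descriptor, bins=5):
    """Keep the best candidate per descriptor bin; descriptor must lie in [0, 1]."""
    cell = min(int(descriptor * bins), bins - 1)
    incumbent = archive.get(cell)
    if incumbent is None or quality > incumbent[0]:
        archive[cell] = (quality, candidate)
        return True
    return False


def qd_step(archive, candidate, judge, bins=5):
    """Score a candidate with the AI judge and try to place it in the archive."""
    quality, descriptor = judge(candidate)
    return update_archive(archive, candidate, quality, descriptor, bins)
```

The archive only compares candidates that land in the same bin, so a mediocre artifact in an empty region is kept while a slightly better duplicate of an existing one is rejected.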

OptiMUS¶

Paper: OptiMUS: Optimization Modeling Using MIP Solvers and Large Language Models (2023)

Combines LLMs with mixed-integer programming (MIP) solvers, where the LLM formulates the optimization model from natural language and the solver finds the optimal solution.

Prompt Evolution¶

Promptbreeder¶

Paper: Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (2023)

A self-improving system that evolves both the task-prompts and the "mutation-prompts" that modify them, enabling an open-ended evolutionary loop for prompt optimization.

EvoPrompt¶

Paper: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (ICLR 2024)

Connects LLMs with evolutionary algorithms to optimize discrete prompts by generating a population of candidate prompts and evolving them based on performance metrics.

Genetic Prompt Search (GPS)¶

Paper: GPS: Genetic Prompt Search for Efficient Few-shot Learning (EMNLP 2022)

Applies genetic algorithms to automatically search for high-performing few-shot prompts for classification tasks, outperforming manual engineering.

GrIPS¶

Paper: GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models (EACL 2023)

A gradient-free, edit-based search method for instructions that iteratively improves prompts by making phrase-level edits (delete, swap, paraphrase, add).

Model Merging & Architecture Search¶

Evolutionary Model Merge¶

Paper: Evolutionary Optimization of Model Merging Recipes (Nature Machine Intelligence, 2025)

Applies evolutionary search to discover "recipes" for merging multiple Large Language Models, both in parameter space (how weights are mixed) and in data-flow space (which layers tokens pass through), significantly outperforming manual merging strategies.

AutoBERT-Zero¶

Paper: AutoBERT-Zero: Evolving BERT Backbone from Scratch (AAAI 2022)

Uses evolutionary search to discover effective BERT-like architectures from primitive operations without relying on human-designed backbones or heuristics.

LiteTransformerSearch¶

Paper: LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models (NeurIPS 2022)

Searches for efficient language-model architectures without training any candidate, using decoder parameter count as a proxy for final performance.

Reinforcement Learning & Reward Design¶

Eureka¶

Paper: Eureka: Human-Level Reward Design via Coding Large Language Models (ICLR 2024)

Uses coding LLMs to evolutionarily optimize reward functions for Reinforcement Learning, enabling agents to learn complex dexterous skills (like pen-spinning) that the authors report had not been achieved before.

OpenELM¶

Paper: The OpenELM Library: Leveraging Progress in Language Models for Novel Evolutionary Algorithms (2024)

An open-source library that leverages LLMs for novel evolutionary algorithms, specifically focusing on code generation and maintaining diversity in the population.

Evolution through Large Models (ELM)¶

Paper: Evolution through Large Models (2022)

The precursor to OpenELM, demonstrating that LLMs can act as intelligent mutation operators in an open-ended evolutionary loop, generating increasingly complex programs.

Self-Improving Systems¶

SICA: Self-Improving Coding Agent (ICLR 2025 Workshop)¶

Paper: A Self-Improving Coding Agent (ICLR 2025 Workshop)

An LLM coding agent that autonomously edits its own codebase to improve performance:

Meta Agent Loop: Alternates between benchmarking and self-modification
Performance Gains: From 17% to 53% on a subset of SWE-Bench Verified
Generalization: Also improves on LiveCodeBench and synthetic benchmarks

Key insight: Self-improvement works especially well for "agentic" tasks where the base LLM benefits from additional structure and guidance.

OpenR: Advanced Reasoning Framework¶

Paper: OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models (2024)

GitHub: openreasoner/openr

Website: openreasoner.github.io

An open-source framework for advanced reasoning with LLMs, combining process supervision, reward models, and search strategies:

Process Supervision: Automated process supervision (OmegaPRM) for improving mathematical reasoning
Reward Models: Both discriminative PRM and generative reward model training
Search Strategies: Greedy search, Best-of-N, Beam search, MCTS, rStar (mutual reasoning)
RL Training: Online policy training with APPO, GRPO, TPPO algorithms
Test-time Scaling: Systematic exploration of test-time computation vs model parameters

Key Results:

Process reward models (Math-psa) outperform outcome-based verifiers
MCTS-based search improves reasoning performance over greedy decoding
Test-time compute scaling can be more effective than parameter scaling
Open-source datasets and models for mathematical reasoning

Applications: Mathematical reasoning (MATH dataset), multi-step problem solving, self-verification.

Surveys and Reviews¶

Evolutionary Computation in the Era of Large Language Models¶

Paper: Survey and Roadmap (IEEE TEVC, 2024)

GitHub: wuxingyu-ai/LLM4EC

Comprehensive survey covering three research directions:

LLM-Enhanced EA: Using LLMs as evolution operators, leveraging domain knowledge
EA-Enhanced LLM: Using EAs for prompt optimization and neural architecture search
Synergistic Applications: Code generation, software engineering, text generation

Essential reading for understanding the full landscape of LLM+EA research.

When Large Language Models Meet Evolutionary Algorithms¶

Paper: Potential Enhancements and Challenges (Research, 2024)

Explores how LLMs can enhance EAs and vice versa, with discussion of challenges including:

Computational costs
Evaluation reliability
Benchmark contamination

Benchmarks¶

SWE-bench¶

Paper: SWE-bench: Can Language Models Resolve Real-world Github Issues? (ICLR 2024 Oral)

2,200+ real GitHub issues from 12 Python repositories. Models must generate patches to fix issues.

2024-2025 Progress:

System	SWE-bench Verified
Claude 3.5 Sonnet	49%
CodeStory Midwit Agent	62%
Gemini 2.5 Pro	63.8%
OpenAI o3	72% (reported)

Caveats: A follow-up analysis found that 32.67% of successful patches involved "solution leakage" (the fix appears in the issue description or comments).

LiveCodeBench¶

Contamination-free benchmark using problems from weekly coding contests (LeetCode, AtCoder, CodeForces) with release date tagging.
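
Release-date tagging makes the contamination check a simple filter. The sketch below assumes each problem is a dict with a `released` date and that the model's training cutoff is known; the field names and function name are hypothetical and are not taken from the LiveCodeBench codebase.

```python
from datetime import date


def uncontaminated(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p["released"] > training_cutoff]


# Hypothetical example data.
problems = [
    {"id": "p1", "released": date(2023, 5, 1)},
    {"id": "p2", "released": date(2024, 3, 15)},
]
fresh = uncontaminated(problems, training_cutoff=date(2023, 12, 31))
```

Evaluating each model only on `fresh` problems means its score cannot be inflated by problems it may have seen during training.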

CodeElo (2025)¶

Elo rating system for LLM code generation using Codeforces problems, similar to chess rankings.
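
For readers unfamiliar with Elo, the textbook pairwise update is sketched below. This is the standard chess-style formula, shown only to illustrate the idea; the benchmark's actual rating procedure is not described on this page and may differ. The K-factor of 32 is an arbitrary illustrative choice.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_rating(rating_a, rating_b, actual, k=32.0):
    """New rating for A after scoring `actual` (1 win, 0.5 draw, 0 loss) against B."""
    return rating_a + k * (actual - expected_score(rating_a, rating_b))
```

A 400-point gap corresponds to expected odds of 10 to 1, so beating a much stronger opponent moves a rating far more than beating an equal one.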

Open-Source Implementations¶

OpenEvolve¶

GitHub: codelion/openevolve | algorithmicsuperintelligence/openevolve

PyPI: pip install openevolve

Open-source implementation of AlphaEvolve. Features:

Codebase-scale optimization (not just single functions)
Multi-LLM support via LiteLLM
Replicates AlphaEvolve circle packing results
HotpotQA prompt optimization example (+23% accuracy)

OptiLLM¶

GitHub: codelion/optillm

OpenAI API-compatible proxy implementing inference optimization strategies:

Prompt Optimization: Few-shot learning, structured prompts
Model Selection: Task-specific model routing
Inference Optimization: Quantization, hardware acceleration
Decoding Techniques: CoT decoding, entropy-based decoding
Mixture of Agents (MoA): Ensemble multiple models

Drop-in replacement for OpenAI API with automatic optimization.

ShinkaEvolve¶

GitHub: SakanaAI/ShinkaEvolve

License: Apache-2.0

The upstream project Genesis is forked from. Features WebUI, examples, and multi-backend support. See the main Genesis documentation for usage.

FunSearch¶

GitHub: google-deepmind/funsearch

Reference implementation of the FunSearch algorithm.

AI Scientist¶

GitHub: SakanaAI/AI-Scientist

Full pipeline for automated scientific research.

Citation¶

If you use Genesis in your research, please cite:

@software{genesis2025,
  title = {Genesis: LLM-Driven Program Evolution},
  author = {Pearse, George},
  year = {2025},
  url = {https://github.com/GeorgePearse/Genesis}
}

For the underlying ShinkaEvolve framework:

@article{shinkaevolve2025,
  title = {ShinkaEvolve: Towards Open-Ended and Sample-Efficient Program Evolution},
  author = {Sakana AI},
  year = {2025},
  journal = {arXiv preprint arXiv:2509.19349}
}
