Stanford CS329A | Self-Improving AI Agents
Azalia Mirhoseini; Aakanksha Chowdhery; Mert Yuksekgonul; Jon Saad-Falcon.
cs329a.stanford.edu.
Accessed Jul 11, 2025.
- Test-time Compute Scaling (sketch below)
  - Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
  - Archon: An Architecture Search Framework for Inference-Time Techniques
  - Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters
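The repeated-sampling idea behind Large Language Monkeys can be summarized in a few lines: draw many independent samples and accept any that a verifier passes; coverage grows with the number of samples. A minimal sketch, with `llm()` and `passes_tests()` as hypothetical stand-ins for a real model call and a task-specific checker:

```python
import random

# Hypothetical stand-in for a real sampling call to a code model.
def llm(prompt: str) -> str:
    return random.choice(["a + b", "a - b", "a * b"])

def passes_tests(candidate: str) -> bool:
    # Domain-specific verifier, e.g. unit tests for code or a checker for math.
    func = eval("lambda a, b: " + candidate)
    return func(2, 3) == 5 and func(1, 4) == 5

def repeated_sampling(prompt: str, k: int = 100) -> str | None:
    # Draw k independent samples; the chance that at least one is correct
    # (coverage) grows with k, which is the paper's core observation.
    for _ in range(k):
        candidate = llm(prompt)
        if passes_tests(candidate):
            return candidate
    return None

print(repeated_sampling("Implement add(a, b) as a single expression.", k=20))
```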
- Self-Improvement Techniques with Verifiers (sketch below)
  - Training Verifiers to Solve Math Word Problems
  - Let’s Verify Step by Step
  - Math-Shepherd: Verify and Reinforce LLMs Step-by-step Without Human Annotations
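Best-of-N with a learned verifier, as in Training Verifiers to Solve Math Word Problems, reranks samples by a predicted correctness score rather than requiring an exact checker. A sketch with hypothetical `llm()` and `verifier_score()` stubs:

```python
import random

# Hypothetical stubs for a generator model and a learned verifier (reward model).
def llm(prompt: str) -> str:
    return random.choice(["answer: 42", "answer: 41", "answer: 40"])

def verifier_score(prompt: str, solution: str) -> float:
    # A trained verifier would output P(solution is correct); here, a toy heuristic.
    return 1.0 if "42" in solution else random.random() * 0.5

def best_of_n(prompt: str, n: int = 16) -> str:
    # Sample n solutions and return the one the verifier rates most likely correct.
    candidates = [llm(prompt) for _ in range(n)]
    return max(candidates, key=lambda s: verifier_score(prompt, s))

print(best_of_n("What is 6 * 7?"))
```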
- Self-Improvement Techniques with RL (sketch below)
  - Constitutional AI: Harmlessness from AI Feedback
  - STaR: Bootstrapping Reasoning With Reasoning
  - Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
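STaR's bootstrapping loop: sample rationales, keep only those that reach the labeled answer, fine-tune on them, repeat. A toy skeleton with `generate()` and `finetune()` as hypothetical stand-ins (the paper also "rationalizes" failures by hinting the answer, omitted here):

```python
import random

def generate(model: dict, question: str) -> tuple[str, str]:
    # Stand-in for sampling a chain-of-thought rationale and final answer.
    answer = random.choice(["4", "5"])
    return f"reasoning about {question}", answer

def finetune(model: dict, examples: list[tuple[str, str, str]]) -> dict:
    # Stand-in for a supervised fine-tuning step on (question, rationale, answer).
    return {"version": model["version"] + 1,
            "data_seen": model["data_seen"] + len(examples)}

def star(dataset: list[tuple[str, str]], iterations: int = 3) -> dict:
    model = {"version": 0, "data_seen": 0}
    for _ in range(iterations):
        kept = []
        for question, label in dataset:
            rationale, answer = generate(model, question)
            if answer == label:            # filter: correct answers only
                kept.append((question, rationale, answer))
        model = finetune(model, kept)      # train on self-generated rationales
    return model

print(star([("2 + 2?", "4"), ("2 + 3?", "5")]))
```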
- Self-Improvement Techniques with Search (sketch below)
  - Thinking Fast and Slow with Deep Learning and Tree Search
  - Competitive-level Code Generation with AlphaCode
  - AlphaCode 2 Technical Report
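AlphaCode's search recipe: sample many programs, filter on the problem's public examples, then cluster survivors by behavior on generated inputs so one submission per cluster covers the few allowed attempts. A toy sketch, with `llm()` as a hypothetical sampler over single-expression programs:

```python
import random
from collections import defaultdict

def llm(prompt: str) -> str:
    # Stand-in for sampling one candidate program (an expression in a, b here).
    return random.choice(["a + b", "a * b", "b + a", "a - b"])

def run(program: str, inputs: list[tuple[int, int]]) -> tuple:
    f = eval("lambda a, b: " + program)
    return tuple(f(a, b) for a, b in inputs)

def alphacode_style(prompt: str, n: int = 50) -> list[str]:
    public = [((2, 3), 5)]               # example I/O pairs from the statement
    extra = [(1, 1), (4, 7), (0, 9)]     # generated inputs used only for clustering
    survivors = [p for p in {llm(prompt) for _ in range(n)}
                 if all(run(p, [inp])[0] == out for inp, out in public)]
    clusters = defaultdict(list)
    for p in survivors:
        clusters[run(p, extra)].append(p)  # same outputs => same behavioral cluster
    return [members[0] for members in clusters.values()]

print(alphacode_style("Add two integers."))
```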
- Open-ended Agent Learning in the Era of Foundation Models
  - The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
  - Automated Design of Agentic Systems
- Augmenting LLMs with Tool Use/Actions (sketch below)
  - ReAct: Synergizing Reasoning and Acting in Language Models
  - Toolformer: Language Models Can Teach Themselves to Use Tools
  - RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning
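The ReAct loop interleaves model-written Thought/Action steps with tool Observations fed back into the transcript. A minimal harness, with `llm()` as a hypothetical stub and the prompt format simplified from the paper's:

```python
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def llm(transcript: str) -> str:
    # Stand-in: a real model would continue the transcript with the next step.
    if "Observation:" not in transcript:
        return "Thought: I should compute this.\nAction: calculator[37 * 21]"
    return "Thought: I have the result.\nAction: Finish[777]"

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        action = step.rsplit("Action: ", 1)[1]   # e.g. "calculator[37 * 21]"
        name, arg = action.split("[", 1)
        arg = arg.rstrip("]")
        if name == "Finish":
            return arg
        observation = TOOLS[name](arg)           # execute tool, feed result back
        transcript += f"Observation: {observation}\n"
    return "no answer"

print(react("What is 37 * 21?"))
```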
- Planning and Multi-Step Reasoning (sketch below)
  - Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models
  - LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench
  - ADaPT: As-Needed Decomposition and Planning with Language Models
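ADaPT's core control flow is as-needed recursion: attempt a task directly, and only when the executor fails, call a planner to decompose it and recurse on the subtasks. A sketch with toy `execute()` and `plan()` stubs standing in for executor and planner LLM calls:

```python
def execute(task: str) -> bool:
    # Stand-in for an executor LLM attempting the task; returns success/failure.
    return len(task) < 20   # toy rule: "short enough" tasks succeed directly

def plan(task: str) -> list[str]:
    # Stand-in for a planner LLM producing a short list of subtasks.
    mid = len(task) // 2
    return [task[:mid], task[mid:]]

def adapt(task: str, depth: int = 0, max_depth: int = 3) -> bool:
    # Decompose only when direct execution fails, up to a depth budget.
    if execute(task):
        return True
    if depth >= max_depth:
        return False
    return all(adapt(sub, depth + 1, max_depth) for sub in plan(task))

print(adapt("book flights, reserve a hotel, and build an itinerary"))
```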
- Reasoning Across Modalities
  - Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
  - Developing a Computer Use Model
  - The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use
- Benchmarks and Challenges in Evaluating Agents
  - SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
  - KernelBench: Can LLMs Write Efficient GPU Kernels?
  - RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts
  - MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering
  - τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
  - GAIA: A Benchmark for General AI Assistants
- AI Coding Agents
  - SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
  - SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement
  - SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?
- Agent Orchestration Frameworks
  - AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
- Augmenting LLMs with Retrieval/Memory (sketch below)
  - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
  - Contextual Retrieval
  - MemGPT: Towards LLMs as Operating Systems
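Contextual retrieval prepends a short, document-aware preamble to each chunk before indexing so chunks stay interpretable out of context. A sketch using a bag-of-words `embed()` as a stand-in for a real embedding model and a `contextualize()` stub in place of an LLM call:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def contextualize(chunk: str, doc_title: str) -> str:
    # A real system would ask an LLM to situate the chunk within the document.
    return f"From {doc_title}: {chunk}"

def retrieve(query: str, chunks: list[str], doc_title: str, k: int = 1) -> list[str]:
    indexed = [(c, embed(contextualize(c, doc_title))) for c in chunks]
    q = embed(query)
    ranked = sorted(indexed, key=lambda ce: cosine(q, ce[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

chunks = ["Revenue grew 3% over the quarter.", "Headcount stayed flat."]
print(retrieve("ACME quarterly revenue growth", chunks, "ACME Q2 report"))
```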
- Multimodal AI Agents
- Multi-agent Systems and Future Research Areas (sketch below)
  - Multi-agent Fine-tuning: Self Improvement with Diverse Reasoning Chains
  - Mixture-of-Agents Enhances Large Language Model Capabilities
  - CodeMonkeys: Scaling Test-Time Compute for Software Engineering
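Mixture-of-Agents layers several proposer models whose answers are aggregated, and optionally re-proposed, into one response. A sketch with `proposer()` and `aggregate()` as hypothetical stand-ins for real model calls:

```python
def proposer(name: str, prompt: str) -> str:
    # Stand-in for one proposer model answering independently.
    return f"[{name}'s answer to: {prompt}]"

def aggregate(prompt: str, proposals: list[str]) -> str:
    # A real aggregator LLM would see the question plus all proposals and be
    # asked to synthesize a single improved answer.
    joined = "\n".join(f"- {p}" for p in proposals)
    return f"Synthesis of {len(proposals)} proposals:\n{joined}"

def mixture_of_agents(prompt: str, layers: int = 2) -> str:
    models = ["model_a", "model_b", "model_c"]
    proposals = [proposer(m, prompt) for m in models]
    for _ in range(layers - 1):
        # Each later layer re-proposes conditioned on the previous layer's outputs.
        context = aggregate(prompt, proposals)
        proposals = [proposer(m, context) for m in models]
    return aggregate(prompt, proposals)

print(mixture_of_agents("Summarize the benefits of test-time compute scaling."))
```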