CS 329A: Self-Improving AI Agents

Dated Jul 11, 2025; last modified on Fri, 11 Jul 2025

Stanford CS329A | Self-Improving AI Agents. Azalia Mirhoseini; Aakanksha Chowdhery; Mert Yuksekgonul; Jon Saad-Falcon. cs329a.stanford.edu. Accessed Jul 11, 2025.
  • Test-time Compute Scaling
    • Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
    • Archon: An Architecture Search Framework for Inference-Time Techniques
    • Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters
  • Self-Improvement Techniques with Verifiers
    • Training Verifiers to Solve Math Word Problems
    • Let’s Verify Step by Step
    • Math-Shepherd: Verify and Reinforce LLMs Step-by-step Without Human Annotations
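The repeated-sampling and verifier papers above share one core pattern: draw many candidate solutions, then keep one that an automatic verifier accepts, so that coverage improves as the sample budget grows. A minimal sketch of that loop, with a hypothetical noisy generator and an exact-match check standing in for an LLM and a learned verifier:

```python
import random
from typing import Optional

TARGET = 23 * 17  # toy task: "compute 23 * 17"

def sample_solution(rng: random.Random) -> int:
    # Stand-in for one LLM sample: correct only ~20% of the time,
    # otherwise a random wrong answer.
    return TARGET if rng.random() < 0.2 else rng.randint(0, 999)

def verify(candidate: int) -> bool:
    # Stand-in for an automatic verifier (here, exact arithmetic check;
    # in the papers, a unit test, proof checker, or learned reward model).
    return candidate == TARGET

def best_of_n(n: int, seed: int = 0) -> Optional[int]:
    # Repeated sampling: success probability rises with n even though
    # any single sample is usually wrong.
    rng = random.Random(seed)
    for _ in range(n):
        candidate = sample_solution(rng)
        if verify(candidate):
            return candidate
    return None

if __name__ == "__main__":
    print(best_of_n(1))    # a single sample usually fails
    print(best_of_n(200))  # a large budget almost always recovers 391
```

Scaling `n` trades inference compute for accuracy, which is the knob the "compute-optimal test-time scaling" paper tunes against model size; the verifier papers then replace the exact check with a learned outcome or process reward model.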
  • Self-Improvement Techniques with RL
    • Constitutional AI: Harmlessness from AI Feedback
    • STaR: Bootstrapping Reasoning With Reasoning
    • Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
  • Self-Improvement Techniques with Search
    • Thinking Fast and Slow with Deep Learning and Tree Search
    • Competitive-level Code Generation with AlphaCode
    • AlphaCode 2 Technical Report
  • Open-ended Agent Learning in the Era of Foundation Models
    • The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
    • Automated Design of Agentic Systems
  • Augmenting LLMs with Tool Use/Actions
    • ReAct: Synergizing Reasoning and Acting in Language Models
    • Toolformer: Language Models Can Teach Themselves to Use Tools
    • RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning
  • Planning and Multi-Step Reasoning
    • Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models
    • LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench
    • ADaPT: As-Needed Decomposition and Planning with Language Models
  • Reasoning Across Modalities
    • Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
    • Developing a Computer Use Model
    • The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use
  • Benchmarks and Challenges in Evaluating Agents
    • SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
    • KernelBench: Can LLMs Write GPU Kernels?
    • RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Models Against Human Experts
    • MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering
    • \(\tau\)-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
    • GAIA: A Benchmark for General AI Assistants
  • AI Coding Agents
    • SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
    • SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement
    • SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?
  • Agent Orchestration Frameworks
    • AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
  • Augmenting LLMs with Retrieval/Memory
    • Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
    • Contextual Retrieval
    • MemGPT: Towards LLMs as Operating Systems
  • Multimodal AI Agents
  • Multi-agent Systems and Future Research Areas
    • Multi-agent Fine-tuning: Self Improvement with Diverse Reasoning Chains
    • Mixture-of-Agents Enhances Large Language Model Capabilities
    • CodeMonkeys: Scaling Test-Time Compute for Software Engineering