Stanford CS329A | Self-Improving AI Agents
Azalia Mirhoseini; Aakanksha Chowdhery; Mert Yuksekgonul; Jon Saad-Falcon.
cs329a.stanford.edu.
Accessed Jul 11, 2025.
- Test-time Compute Scaling (sketch below)
  - Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
  - Archon: An Architecture Search Framework for Inference-Time Techniques
  - Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters
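The repeated-sampling idea behind Large Language Monkeys can be summarized in a few lines: draw many independent samples and accept any that a verifier passes; coverage grows with the number of samples. A minimal sketch, with `llm()` and `passes_tests()` as hypothetical stand-ins for a real model call and a task-specific checker:

```python
import random

# Hypothetical stand-in for a real sampling call to a code model.
def llm(prompt: str) -> str:
    return random.choice(["a + b", "a - b", "a * b"])

def passes_tests(candidate: str) -> bool:
    # Domain-specific verifier, e.g. unit tests for code or a checker for math.
    func = eval("lambda a, b: " + candidate)
    return func(2, 3) == 5 and func(1, 4) == 5

def repeated_sampling(prompt: str, k: int = 100) -> str | None:
    # Draw k independent samples; the chance that at least one is correct
    # (coverage) grows with k, which is the paper's core observation.
    for _ in range(k):
        candidate = llm(prompt)
        if passes_tests(candidate):
            return candidate
    return None

print(repeated_sampling("Implement add(a, b) as a single expression.", k=20))
```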
- Self-Improvement Techniques with Verifiers (sketch below)
  - Training Verifiers to Solve Math Word Problems
  - Let’s Verify Step by Step
  - Math-Shepherd: Verify and Reinforce LLMs Step-by-step Without Human Annotations
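Best-of-N with a learned verifier, as in Training Verifiers to Solve Math Word Problems, reranks samples by a predicted correctness score rather than requiring an exact checker. A sketch with hypothetical `llm()` and `verifier_score()` stubs:

```python
import random

# Hypothetical stubs for a generator model and a learned verifier (reward model).
def llm(prompt: str) -> str:
    return random.choice(["answer: 42", "answer: 41", "answer: 40"])

def verifier_score(prompt: str, solution: str) -> float:
    # A trained verifier would output P(solution is correct); here, a toy heuristic.
    return 1.0 if "42" in solution else random.random() * 0.5

def best_of_n(prompt: str, n: int = 16) -> str:
    # Sample n solutions and return the one the verifier rates most likely correct.
    candidates = [llm(prompt) for _ in range(n)]
    return max(candidates, key=lambda s: verifier_score(prompt, s))

print(best_of_n("What is 6 * 7?"))
```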
- Self-Improvement Techniques with RL (sketch below)
  - Constitutional AI: Harmlessness from AI Feedback
  - STaR: Bootstrapping Reasoning With Reasoning
  - Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
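STaR's bootstrapping loop: sample rationales, keep only those that reach the labeled answer, fine-tune on them, repeat. A toy skeleton with `generate()` and `finetune()` as hypothetical stand-ins (the paper also "rationalizes" failures by hinting the answer, omitted here):

```python
import random

def generate(model: dict, question: str) -> tuple[str, str]:
    # Stand-in for sampling a chain-of-thought rationale and final answer.
    answer = random.choice(["4", "5"])
    return f"reasoning about {question}", answer

def finetune(model: dict, examples: list[tuple[str, str, str]]) -> dict:
    # Stand-in for a supervised fine-tuning step on (question, rationale, answer).
    return {"version": model["version"] + 1,
            "data_seen": model["data_seen"] + len(examples)}

def star(dataset: list[tuple[str, str]], iterations: int = 3) -> dict:
    model = {"version": 0, "data_seen": 0}
    for _ in range(iterations):
        kept = []
        for question, label in dataset:
            rationale, answer = generate(model, question)
            if answer == label:            # filter: correct answers only
                kept.append((question, rationale, answer))
        model = finetune(model, kept)      # train on self-generated rationales
    return model

print(star([("2 + 2?", "4"), ("2 + 3?", "5")]))
```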
- Self-Improvement Techniques with Search (sketch below)
  - Thinking Fast and Slow with Deep Learning and Tree Search
  - Competitive-level Code Generation with AlphaCode
  - AlphaCode 2 Technical Report
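AlphaCode's search recipe: sample many programs, filter on the problem's public examples, then cluster survivors by behavior on generated inputs so one submission per cluster covers the few allowed attempts. A toy sketch, with `llm()` as a hypothetical sampler over single-expression programs:

```python
import random
from collections import defaultdict

def llm(prompt: str) -> str:
    # Stand-in for sampling one candidate program (an expression in a, b here).
    return random.choice(["a + b", "a * b", "b + a", "a - b"])

def run(program: str, inputs: list[tuple[int, int]]) -> tuple:
    f = eval("lambda a, b: " + program)
    return tuple(f(a, b) for a, b in inputs)

def alphacode_style(prompt: str, n: int = 50) -> list[str]:
    public = [((2, 3), 5)]               # example I/O pairs from the statement
    extra = [(1, 1), (4, 7), (0, 9)]     # generated inputs used only for clustering
    survivors = [p for p in {llm(prompt) for _ in range(n)}
                 if all(run(p, [inp])[0] == out for inp, out in public)]
    clusters = defaultdict(list)
    for p in survivors:
        clusters[run(p, extra)].append(p)  # same outputs => same behavioral cluster
    return [members[0] for members in clusters.values()]

print(alphacode_style("Add two integers."))
```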
- Open-ended Agent Learning in the Era of Foundation Models
  - The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
  - Automated Design of Agentic Systems
- Augmenting LLMs with Tool Use/Actions (sketch below)
  - ReAct: Synergizing Reasoning and Acting in Language Models
  - Toolformer: Language Models Can Teach Themselves to Use Tools
  - RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning
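The ReAct loop interleaves model-written Thought/Action steps with tool Observations fed back into the transcript. A minimal harness, with `llm()` as a hypothetical stub and the prompt format simplified from the paper's:

```python
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def llm(transcript: str) -> str:
    # Stand-in: a real model would continue the transcript with the next step.
    if "Observation:" not in transcript:
        return "Thought: I should compute this.\nAction: calculator[37 * 21]"
    return "Thought: I have the result.\nAction: Finish[777]"

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        action = step.rsplit("Action: ", 1)[1]   # e.g. "calculator[37 * 21]"
        name, arg = action.split("[", 1)
        arg = arg.rstrip("]")
        if name == "Finish":
            return arg
        observation = TOOLS[name](arg)           # execute tool, feed result back
        transcript += f"Observation: {observation}\n"
    return "no answer"

print(react("What is 37 * 21?"))
```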
- Planning and Multi-Step Reasoning (sketch below)
  - Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models
  - LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench
  - ADaPT: As-Needed Decomposition and Planning with Language Models
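ADaPT's core control flow is as-needed recursion: attempt a task directly, and only when the executor fails, call a planner to decompose it and recurse on the subtasks. A sketch with toy `execute()` and `plan()` stubs standing in for executor and planner LLM calls:

```python
def execute(task: str) -> bool:
    # Stand-in for an executor LLM attempting the task; returns success/failure.
    return len(task) < 20   # toy rule: "short enough" tasks succeed directly

def plan(task: str) -> list[str]:
    # Stand-in for a planner LLM producing a short list of subtasks.
    mid = len(task) // 2
    return [task[:mid], task[mid:]]

def adapt(task: str, depth: int = 0, max_depth: int = 3) -> bool:
    # Decompose only when direct execution fails, up to a depth budget.
    if execute(task):
        return True
    if depth >= max_depth:
        return False
    return all(adapt(sub, depth + 1, max_depth) for sub in plan(task))

print(adapt("book flights, reserve a hotel, and build an itinerary"))
```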
- Reasoning Across Modalities
  - Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
  - Developing a Computer Use Model
  - The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use
- Benchmarks and Challenges in Evaluating Agents
  - SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
  - KernelBench: Can LLMs Write Efficient GPU Kernels?
  - RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts
  - MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering
  - τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
  - GAIA: A Benchmark for General AI Assistants
- AI Coding Agents
  - SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
  - SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement
  - SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?
- Agent Orchestration Frameworks
  - AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
- Augmenting LLMs with Retrieval/Memory (sketch below)
  - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
  - Contextual Retrieval
  - MemGPT: Towards LLMs as Operating Systems
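Contextual retrieval prepends a short, document-aware preamble to each chunk before indexing so chunks stay interpretable out of context. A sketch using a bag-of-words `embed()` as a stand-in for a real embedding model and a `contextualize()` stub in place of an LLM call:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def contextualize(chunk: str, doc_title: str) -> str:
    # A real system would ask an LLM to situate the chunk within the document.
    return f"From {doc_title}: {chunk}"

def retrieve(query: str, chunks: list[str], doc_title: str, k: int = 1) -> list[str]:
    indexed = [(c, embed(contextualize(c, doc_title))) for c in chunks]
    q = embed(query)
    ranked = sorted(indexed, key=lambda ce: cosine(q, ce[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

chunks = ["Revenue grew 3% over the quarter.", "Headcount stayed flat."]
print(retrieve("ACME quarterly revenue growth", chunks, "ACME Q2 report"))
```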
- Multimodal AI Agents
- Multi-agent Systems and Future Research Areas (sketch below)
  - Multi-agent Fine-tuning: Self Improvement with Diverse Reasoning Chains
  - Mixture-of-Agents Enhances Large Language Model Capabilities
  - CodeMonkeys: Scaling Test-Time Compute for Software Engineering
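Mixture-of-Agents layers several proposer models whose answers are aggregated, and optionally re-proposed, into one response. A sketch with `proposer()` and `aggregate()` as hypothetical stand-ins for real model calls:

```python
def proposer(name: str, prompt: str) -> str:
    # Stand-in for one proposer model answering independently.
    return f"[{name}'s answer to: {prompt}]"

def aggregate(prompt: str, proposals: list[str]) -> str:
    # A real aggregator LLM would see the question plus all proposals and be
    # asked to synthesize a single improved answer.
    joined = "\n".join(f"- {p}" for p in proposals)
    return f"Synthesis of {len(proposals)} proposals:\n{joined}"

def mixture_of_agents(prompt: str, layers: int = 2) -> str:
    models = ["model_a", "model_b", "model_c"]
    proposals = [proposer(m, prompt) for m in models]
    for _ in range(layers - 1):
        # Each later layer re-proposes conditioned on the previous layer's outputs.
        context = aggregate(prompt, proposals)
        proposals = [proposer(m, context) for m in models]
    return aggregate(prompt, proposals)

print(mixture_of_agents("Summarize the benefits of test-time compute scaling."))
```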