Agentic AI for Scientific Discovery: Benchmarks, Frameworks, and Applications
Detailed Tutorial Description
Overview
The rise of large language models (LLMs) has introduced a paradigm shift in how AI can contribute to science. Beyond serving as static predictors, LLMs can function as agents that actively generate, refine, and evaluate hypotheses. This tutorial provides a structured overview of how agentic AI can accelerate the scientific discovery process, grounded in recent advances in benchmarks, frameworks, and applications.
Motivation
Traditional machine learning excels at prediction but falls short in hypothesis-driven discovery, where novelty, interpretability, and iterative reasoning are essential. The promise of agentic AI lies in closing this gap. By structuring the discovery process into two complementary phases, we highlight how AI can play an active role in advancing science:
- Hypothesis Generation – AI agents propose candidate hypotheses by retrieving inspirations, composing novel associations between them and the research background, and ranking the candidates by plausibility.
- Feedback and Refinement – Hypotheses are iteratively improved using diverse feedback signals, including data fit, reasoning consistency, symbolic decomposition, and benchmark performance.
This cycle mirrors the way human scientists move from initial ideas to refined, testable hypotheses, but accelerates the process through automated reasoning and structured agentic workflows, as the sketch below illustrates.
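To make the cycle concrete, here is a minimal Python sketch of the two-phase loop. It assumes a generic `llm` callable that maps a prompt string to a response string; the prompts, plausibility scoring, and feedback functions are illustrative placeholders, not the interface of any specific framework covered in the tutorial.

```python
# Minimal sketch of the two-phase discovery cycle described above.
# `llm` is any prompt -> string callable; everything else is illustrative.

def generate_hypotheses(llm, background, inspirations, k=5):
    """Phase I: compose background and retrieved inspirations into
    candidate hypotheses, then rank them by LLM-judged plausibility."""
    candidates = [
        llm(f"Background: {background}\nInspiration: {insp}\n"
            "Propose one novel, testable hypothesis.")
        for insp in inspirations[:k]
    ]
    # Rank candidates by a plausibility score elicited from the model.
    return sorted(candidates,
                  key=lambda h: float(llm(f"Rate the plausibility of this "
                                          f"hypothesis from 0 to 1: {h}")),
                  reverse=True)

def refine(llm, hypothesis, feedback_fns, rounds=3):
    """Phase II: iteratively revise a hypothesis against feedback signals
    (e.g., data fit, reasoning-consistency checks, benchmark scores)."""
    for _ in range(rounds):
        feedback = [fn(hypothesis) for fn in feedback_fns]
        hypothesis = llm(f"Hypothesis: {hypothesis}\n"
                         f"Feedback: {feedback}\n"
                         "Revise the hypothesis to address the feedback.")
    return hypothesis
```

In practice, the feedback functions range from cheap reasoning-consistency checks to expensive experiments; the reading list below separates these into efficient and costly experimentation for exactly this reason.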
Tutorial Outline
- Introduction to Agentic AI in Science
  - From prediction to discovery
  - Defining “agentic AI” and distinguishing it from static LLM use
  - Motivating examples
- Phase I: Hypothesis Generation
  - Inspiration retrieval and knowledge recombination
  - From qualitative hypotheses to symbolic formulations
  - Ranking strategies and novelty assessment
- Phase II: Feedback and Refinement
  - Iterative optimization using feedback signals
  - Data-driven evaluation, symbolic decomposition, and reasoning consistency checks
  - Hierarchical refinement from coarse ideas to fine-grained hypotheses
- Benchmarks for Scientific Discovery
  - Limitations of existing datasets (memorization vs. reasoning)
  - Principles for robust benchmark design
  - Recent benchmarks for equations, hypotheses, and surfaces
- Frameworks for Agentic Discovery
  - Decomposition strategies, memory mechanisms, and feedback loops
  - Integration of evolutionary search and reinforcement learning
  - Examples of agentic workflows (see the sketch after this outline)
- Applications Across Sciences
  - Social sciences (open-domain hypothesis generation)
  - Natural sciences (equation discovery, symbolic modeling)
  - Broader applications in AI for science
- Challenges and Future Directions
  - Reliability, interpretability, reproducibility
  - Balancing creativity and validity
  - Toward hybrid AI–science collaborations
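As referenced in the outline, the sketch below illustrates one common agentic workflow pattern: a memory of scored candidates, an LLM acting as the mutation operator, and a data-driven fitness function closing the feedback loop. It is written in the spirit of evolutionary-search frameworks such as LLM-SR from the reading list, but the `llm` and `fitness` callables and all prompts are assumptions for exposition, not a faithful reproduction of any one system.

```python
import random

def evolve(llm, fitness, seeds, generations=10, pool_size=8):
    """Evolve candidate hypotheses or equations against a fitness signal."""
    # Memory: every candidate evaluated so far, stored as (score, text).
    memory = [(fitness(s), s) for s in seeds]
    for _ in range(generations):
        memory.sort(key=lambda pair: pair[0], reverse=True)   # best first
        parents = [cand for _, cand in memory[:pool_size]]
        exemplars = random.sample(parents, min(3, len(parents)))
        prompt = ("These candidates scored well against the data:\n"
                  + "\n".join(f"- {p}" for p in exemplars)
                  + "\nPropose one improved variant.")
        child = llm(prompt)            # the LLM acts as a mutation operator
        memory.append((fitness(child), child))
    # Return the best candidate discovered across all generations.
    return max(memory, key=lambda pair: pair[0])[1]
```

The append-only memory is a deliberate simplification; the frameworks discussed in the tutorial differ in what they retain, how they decompose the problem, and whether search is evolutionary or reinforcement-driven.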
Target Audience
Researchers and practitioners in machine learning, NLP, and AI for science who are interested in symbolic reasoning, agentic frameworks, and automated discovery. The tutorial is accessible to those with general familiarity with LLMs and does not require deep domain expertise.
Learning Outcomes
Participants will gain:
- An understanding of the two-phase cycle of agentic scientific discovery.
- Exposure to recent benchmarks for evaluating reasoning beyond memorization.
- Insight into frameworks that integrate decomposition, evolutionary search, and feedback mechanisms.
- Awareness of applications across disciplines and the challenges they expose.
- A forward-looking perspective on building reliable, interpretable science-focused agents.
Reading List
Introduction
- Interpretable scientific discovery with symbolic regression: a review
- Symbolic Regression is NP-hard
- A Survey on Large Language Models for Scientific Research
Pre-experiment Phase
- Large Language Models for Automated Open-domain Scientific Hypotheses Discovery (ACL’24) [GitHub]
- MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses (ICLR’25) [GitHub]
- MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search (NeurIPS’25) [GitHub]
- ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition
Experiment-guided Phase (Efficient Experimentation)
Symbolic Regression Methods
Search-based Symbolic Regression Methods
- Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl [GitHub]
- Gene-pool Optimal Mixing Evolutionary Algorithm for Genetic Programming (Evolutionary Computation’21) [GitHub]
- Symbolic Regression via Neural-Guided Genetic Programming Population Seeding (NeurIPS’21) [GitHub]
- Symbolic Physics Learner: Discovering governing equations via Monte Carlo tree search (ICLR’23) [GitHub]
- AI Feynman: A physics-inspired method for symbolic regression (Science Advances’20) [GitHub]
- Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients (ICLR’21) [GitHub]
Learning-based Symbolic Regression Methods
- Neural Symbolic Regression that scales (ICML’21) [GitHub]
- End-to-end Symbolic Regression with Transformers (NeurIPS’22) [GitHub]
- SymFormer: End-to-end symbolic regression using transformer-based architecture [GitHub]
- SymbolicGPT: A Generative Transformer Model for Symbolic Regression [GitHub]
- SNIP: Bridging Mathematical Symbolic and Numeric Realms with Unified Pre-training (ICLR’24) [GitHub]
Learning + Search Symbolic Regression Methods
- Transformer-based Planning for Symbolic Regression (NeurIPS’23) [GitHub]
- A Unified Framework for Deep Symbolic Regression (NeurIPS’22) [GitHub]
- Deep Generative Symbolic Regression (ICLR’23) [GitHub]
- Efficient Generator of Mathematical Expressions for Symbolic Regression (Machine Learning’23) [GitHub]
- SNIP: Bridging Mathematical Symbolic and Numeric Realms with Unified Pre-training (ICLR’24) [GitHub]
LLM-guided Symbolic Regression Methods
- LLM-SR: Scientific Equation Discovery via Programming with Large Language Models (ICLR’25) [GitHub]
- In-Context Symbolic Regression: Leveraging Large Language Models for Function Discovery (ACL’24) [GitHub]
- Symbolic Regression with a Learned Concept Library (NeurIPS’24) [GitHub]
Symbolic Regression Benchmarks
- Contemporary Symbolic Regression Methods and their Relative Performance (NeurIPS’21) [GitHub]
- Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery (DMLR’24) [GitHub]
- LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models (ICML’25) [GitHub] [Dataset]
Experiment-guided Phase (Costly Experimentation)
- MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback [GitHub]