Agentic AI for Scientific Discovery: Benchmarks, Frameworks, and Applications
Detailed Tutorial Description
Overview
The rise of large language models (LLMs) has introduced a paradigm shift in how AI can contribute to science. Beyond serving as static predictors, LLMs can function as agents that actively generate, refine, and evaluate hypotheses. This tutorial provides a structured overview of how agentic AI can accelerate the scientific discovery process, grounded in recent advances in benchmarks, frameworks, and applications.
Motivation
Traditional machine learning excels at prediction but falls short in hypothesis-driven discovery, where novelty, interpretability, and iterative reasoning are essential. The promise of agentic AI lies in closing this gap. By structuring the discovery process into two complementary phases, we highlight how AI can play an active role in advancing science:
- Hypothesis Generation – AI agents propose candidate hypotheses by retrieving inspirations, composing novel associations between them and the research background, and ranking the candidates by plausibility.
- Feedback and Refinement – Hypotheses are iteratively improved using diverse feedback signals, including data fit, reasoning consistency, symbolic decomposition, and benchmark performance.
This cycle mirrors the way human scientists move from initial ideas to refined, testable hypotheses, but accelerates the process through automated reasoning and structured agentic workflows, as the sketch below illustrates.
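To make the cycle concrete, here is a minimal Python sketch of the two-phase loop. It assumes a generic `llm` callable that maps a prompt string to a response string; the prompts, plausibility scoring, and feedback functions are illustrative placeholders, not the interface of any specific framework covered in the tutorial.

```python
# Minimal sketch of the two-phase discovery cycle described above.
# `llm` is any prompt -> string callable; everything else is illustrative.

def generate_hypotheses(llm, background, inspirations, k=5):
    """Phase I: compose background and retrieved inspirations into
    candidate hypotheses, then rank them by LLM-judged plausibility."""
    candidates = [
        llm(f"Background: {background}\nInspiration: {insp}\n"
            "Propose one novel, testable hypothesis.")
        for insp in inspirations[:k]
    ]
    # Rank candidates by a plausibility score elicited from the model.
    return sorted(candidates,
                  key=lambda h: float(llm(f"Rate the plausibility of this "
                                          f"hypothesis from 0 to 1: {h}")),
                  reverse=True)

def refine(llm, hypothesis, feedback_fns, rounds=3):
    """Phase II: iteratively revise a hypothesis against feedback signals
    (e.g., data fit, reasoning-consistency checks, benchmark scores)."""
    for _ in range(rounds):
        feedback = [fn(hypothesis) for fn in feedback_fns]
        hypothesis = llm(f"Hypothesis: {hypothesis}\n"
                         f"Feedback: {feedback}\n"
                         "Revise the hypothesis to address the feedback.")
    return hypothesis
```

In practice, the feedback functions range from cheap reasoning-consistency checks to expensive experiments; the reading list below separates these into efficient and costly experimentation for exactly this reason.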
Tutorial Outline
- Introduction to Agentic AI in Science
  - From prediction to discovery
  - Defining “agentic AI” and distinguishing it from static LLM use
  - Motivating examples
- Phase I: Hypothesis Generation
  - Inspiration retrieval and knowledge recombination
  - From qualitative hypotheses to symbolic formulations
  - Ranking strategies and novelty assessment
- Phase II: Feedback and Refinement
  - Iterative optimization using feedback signals
  - Data-driven evaluation, symbolic decomposition, and reasoning consistency checks
  - Hierarchical refinement from coarse ideas to fine-grained hypotheses
- Benchmarks for Scientific Discovery
  - Limitations of existing datasets (memorization vs. reasoning)
  - Principles for robust benchmark design
  - Recent benchmarks for equations, hypotheses, and surfaces
- Frameworks for Agentic Discovery
  - Decomposition strategies, memory mechanisms, and feedback loops
  - Integration of evolutionary search and reinforcement learning
  - Examples of agentic workflows (see the sketch after this outline)
- Applications Across Sciences
  - Social sciences (open-domain hypothesis generation)
  - Natural sciences (equation discovery, symbolic modeling)
  - Broader applications in AI for science
- Challenges and Future Directions
  - Reliability, interpretability, reproducibility
  - Balancing creativity and validity
  - Toward hybrid AI–science collaborations
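As referenced in the outline, the sketch below illustrates one common agentic workflow pattern: a memory of scored candidates, an LLM acting as the mutation operator, and a data-driven fitness function closing the feedback loop. It is written in the spirit of evolutionary-search frameworks such as LLM-SR from the reading list, but the `llm` and `fitness` callables and all prompts are assumptions for exposition, not a faithful reproduction of any one system.

```python
import random

def evolve(llm, fitness, seeds, generations=10, pool_size=8):
    """Evolve candidate hypotheses or equations against a fitness signal."""
    # Memory: every candidate evaluated so far, stored as (score, text).
    memory = [(fitness(s), s) for s in seeds]
    for _ in range(generations):
        memory.sort(key=lambda pair: pair[0], reverse=True)   # best first
        parents = [cand for _, cand in memory[:pool_size]]
        exemplars = random.sample(parents, min(3, len(parents)))
        prompt = ("These candidates scored well against the data:\n"
                  + "\n".join(f"- {p}" for p in exemplars)
                  + "\nPropose one improved variant.")
        child = llm(prompt)            # the LLM acts as a mutation operator
        memory.append((fitness(child), child))
    # Return the best candidate discovered across all generations.
    return max(memory, key=lambda pair: pair[0])[1]
```

The append-only memory is a deliberate simplification; the frameworks discussed in the tutorial differ in what they retain, how they decompose the problem, and whether search is evolutionary or reinforcement-driven.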
Target Audience
Researchers and practitioners in machine learning, NLP, and AI for science who are interested in symbolic reasoning, agentic frameworks, and automated discovery. The tutorial is accessible to those with general familiarity with LLMs and does not require deep domain expertise.
Learning Outcomes
Participants will gain:
- An understanding of the two-phase cycle of agentic scientific discovery.
- Exposure to recent benchmarks for evaluating reasoning beyond memorization.
- Insight into frameworks that integrate decomposition, evolutionary search, and feedback mechanisms.
- Awareness of applications across disciplines and the challenges they expose.
- A forward-looking perspective on building reliable, interpretable science-focused agents.
Reading List
Introduction
- Interpretable scientific discovery with symbolic regression: a review
- Symbolic Regression is NP-hard
- A Survey on Large Language Models for Scientific Research
Pre-experiment Phase
- Large Language Models for Automated Open-domain Scientific Hypotheses Discovery (ACL’24) [GitHub]
- MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses (ICLR’25) [GitHub]
- MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search (NeurIPS’25) [GitHub]
- ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition
Experiment-guided Phase (Efficient Experimentation)
Symbolic Regression Methods
Search-based Symbolic Regression Methods
- Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl [GitHub]
- Gene-pool Optimal Mixing Evolutionary Algorithm for Genetic Programming (Evolutionary Computation’21) [GitHub]
- Symbolic Regression via Neural-Guided Genetic Programming Population Seeding (NeurIPS’21) [GitHub]
- Symbolic Physics Learner: Discovering governing equations via Monte Carlo tree search (ICLR’23) [GitHub]
- AI Feynman: A physics-inspired method for symbolic regression (Science Advances’20) [GitHub]
- Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients (ICLR’21) [GitHub]
Learning-based Symbolic Regression Methods
- Neural Symbolic Regression that scales (ICML’21) [GitHub]
- End-to-end Symbolic Regression with Transformers (NeurIPS’22) [GitHub]
- SymFormer: End-to-end symbolic regression using transformer-based architecture [GitHub]
- SymbolicGPT: A Generative Transformer Model for Symbolic Regression [GitHub]
- SNIP: Bridging Mathematical Symbolic and Numeric Realms with Unified Pre-training (ICLR’24) [GitHub]
Learning + Search Symbolic Regression Methods
- Transformer-based Planning for Symbolic Regression (NeurIPS’23) [GitHub]
- A Unified Framework for Deep Symbolic Regression (NeurIPS’22) [GitHub]
- Deep Generative Symbolic Regression (ICLR’23) [GitHub]
- Efficient Generator of Mathematical Expressions for Symbolic Regression (Machine Learning’23) [GitHub]
- SNIP: Bridging Mathematical Symbolic and Numeric Realms with Unified Pre-training (ICLR’24) [GitHub]
LLM-guided Symbolic Regression Methods
- LLM-SR: Scientific Equation Discovery via Programming with Large Language Models (ICLR’25) [GitHub]
- In-Context Symbolic Regression: Leveraging Large Language Models for Function Discovery (ACL’24) [GitHub]
- Symbolic Regression with a Learned Concept Library (NeurIPS’24) [GitHub]
Symbolic Regression Benchmarks
- Contemporary Symbolic Regression Methods and their Relative Performance (NeurIPS’21) [GitHub]
- Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery (DMLR’24) [GitHub]
- LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models (ICML’25) [GitHub] [Dataset]
Experiment-guided Phase (Costly Experimentation)
- MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback [GitHub]