In a typical software engineering task, correctness is often executable: the agent can use tests, linters, compilation, and explicit issue descriptions as short feedback loops.
AutoMat asks a stricter question than "can an agent write code?": can it recover a scientific workflow from a paper, navigate specialized computational tools, and produce evidence that actually supports a published claim?
Coding agents have become good at software tasks with visible feedback loops: tests fail, a patch is written, tests pass. Scientific reproduction is a different regime. The requirement is often buried in prose, correctness depends on tacit domain choices, and a numerically plausible output can still be scientifically wrong.
AutoMat evaluates that regime directly. Each benchmark instance asks an agent to investigate a specific computational materials science claim from a real paper. The agent must decide what needs to be run, assemble or adapt the workflow, execute it in a controlled environment, and write a report explaining whether the evidence supports the claim.
The key result is blunt: current agents can make partial progress, but they are not yet reliable autonomous scientific reproducers. Claude Code with Opus 4.6 is the strongest setting we evaluate, with a mean score of 3.52 and a success rate around 54%. Codex with GPT-5.4 is lowest overall, with a mean score of 2.44 and a success rate around 24%.
Reproducing the right number on the wrong subset does not verify the scientific conclusion. In AutoMat, faithful reproduction means recovering the procedure, executing it competently, and interpreting the result under the constraints the original work imposed.
Unlike a software issue with its executable feedback loop, the agent must recover an underspecified method, handle domain tools, and judge whether the produced evidence actually supports the claim.
AutoMat is a benchmark for evaluating LLM-based agents' ability to reproduce claims from computational materials science. It poses three interrelated challenges: recovering underspecified computational procedures, navigating specialized toolchains, and determining whether the resulting evidence supports a claim.
Large language models are increasingly deployed as autonomous coding agents and have achieved remarkably strong performance on software engineering benchmarks. However, it is unclear whether such success transfers to computational scientific workflows, where tasks require not only strong coding ability, but also the ability to navigate complex, domain-specific procedures and to interpret results in the context of scientific claims.
To address this question, we present AutoMat, a benchmark for evaluating LLM-based agents' ability to reproduce claims from computational materials science. AutoMat poses three interrelated challenges: recovering underspecified computational procedures, navigating specialized toolchains, and determining whether the resulting evidence supports a claim. By working closely with subject matter experts (SMEs), we curate a set of claims from real materials science papers to test whether coding agents can recover and execute the end-to-end workflow needed to support or undermine such claims.
We then evaluate multiple representative coding agent settings across several foundation models. Our results show that current LLM-based agents obtain low overall success rates on AutoMat, with the best-performing setting achieving a success rate of only about 54%. Error analysis further reveals that agents perform worst when workflows must be reconstructed from paper text alone and that they fail primarily due to incomplete procedures, methodological deviations, and execution fragility.
Each AutoMat instance starts with a claim from a materials science paper. SMEs select claims that are quantitatively checkable and important to the paper’s conclusions, then annotate the intended reproduction procedure, expected outcome, and relevant context. The agent sees the claim, paper, metadata, and any released artifacts. The SME procedure stays hidden until evaluation.
The agent reads, plans, executes commands, inspects outputs, and writes a final reproduction report.
The benchmark preserves the full trajectory, not just the final answer.
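As a rough mental model of what a task bundles together, the sketch below separates what the agent sees, what stays hidden until evaluation, and what the run leaves behind. The field names are illustrative assumptions on our part, not the benchmark's released schema.

```python
from dataclasses import dataclass, field

@dataclass
class AutoMatInstance:
    """Hypothetical sketch of one benchmark instance; field names are illustrative."""
    claim: str                      # the quantitative statement the agent must investigate
    paper_path: str                 # full paper text the agent may read
    metadata: dict                  # e.g. material system, claimed value, units
    released_artifacts: list[str]   # code/data published with the paper, possibly empty
    # Hidden from the agent until evaluation:
    sme_procedure: str              # SME-annotated intended reproduction procedure
    expected_outcome: str           # what a faithful reproduction should produce

@dataclass
class RunRecord:
    """What the benchmark preserves: the whole trajectory, not just the final answer."""
    commands: list[str] = field(default_factory=list)             # every command executed
    generated_artifacts: list[str] = field(default_factory=list)  # files the run produced
    final_report: str = ""                                        # the agent's written report
```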
AutoMat deliberately varies how much scaffolding the agent receives. This matters because the task changes qualitatively as artifacts become available. With a runnable codebase, the problem is often adaptation and execution. With only the paper, the agent has to infer the workflow itself.
AutoMat runs in four phases. SMEs curate claims, the benchmark packages them into runnable tasks, an agent attempts the reproduction in a resource-controlled environment, and a separate artifact-grounded evaluator scores the run.
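A compressed sketch of how those four phases could chain together; every callable here is an injected placeholder rather than the released harness, so treat it as a mental model only.

```python
def run_automat_pipeline(paper, curate, package, agent, evaluator):
    """Illustrative four-phase loop; all callables are hypothetical stand-ins."""
    claim, reference = curate(paper)       # Phase 1: SME curation (human-in-the-loop)
    task = package(claim, paper)           # Phase 2: package into a runnable, resource-controlled task
    trajectory = agent.attempt(task)       # Phase 3: agent plans, executes commands, writes a report
    return evaluator.score(trajectory, reference)  # Phase 4: artifact-grounded scoring vs. the SME reference
```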
Computational materials science is a sharp stress test because a single claim can require a chain of hidden decisions: which structure to use, which pseudopotential version is valid, which simulation path must be relaxed, which subset defines the metric, and which output actually corresponds to the paper’s statement.
Papers rarely spell out every executable step. The agent must infer enough of the workflow to avoid a superficially plausible but methodologically different reproduction.
Density functional theory (DFT), molecular dynamics (MD), and machine learning (ML) workflows expose fragile dependencies, high-performance computing (HPC) constraints, simulation timeouts, and domain-specific file formats; a defensive-execution sketch follows below.
The final question is not "did code run?" but "does this output support the target claim under the intended scientific protocol?"
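The toolchain fragility above has a blunt practical consequence: an agent cannot assume a simulation call will finish, so every invocation needs a defensive wrapper. The sketch below is generic; the command, time limit, and convergence string are placeholders, and real failure messages vary across simulation codes.

```python
import subprocess

def run_simulation(cmd, timeout_s=3600):
    """Run a simulation command with a wall-clock limit and surface common failure modes."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout", None
    if proc.returncode != 0:
        # MPI aborts, out-of-memory kills, and missing input files all land here.
        return "failed", proc.stderr[-2000:]
    if "convergence NOT achieved" in proc.stdout:  # phrasing varies by code; illustrative check only
        return "unconverged", proc.stdout[-2000:]
    return "ok", proc.stdout

# e.g. status, log = run_simulation(["pw.x", "-in", "relax.in"], timeout_s=7200)
```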
Across all systems, many runs land in the middle of the scoring scale: enough progress to show the agent understood part of the task, but not enough to establish a faithful reproduction. The strongest setting succeeds on only about half of the benchmark.
| System | Mean ↑ | Overall SR ↑ | From-paper SR ↑ | From-artifact SR ↑ | Interp. SR ↑ |
|---|---|---|---|---|---|
| Codex (GPT-5.4) | 2.44 | 23.6% | 0.0% | 39.4% | 40.0% |
| Claude Code (Kimi K2.5) | 2.75 | 28.3% | 0.0% | 46.2% | 33.3% |
| Orchestrated (Sonnet 4.6) | 2.92 | 36.9% | 10.0% | 66.7% | 37.5% |
| Claude Code (Sonnet 4.6) | 3.12 | 38.8% | 3.8% | 67.7% | 40.0% |
| Claude Code (Opus 4.6) | 3.52 | 54.2% | 8.3% | 76.9% | 50.0% |

Scores are on a 1–5 rubric; SR = success rate.
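For readers who want to recompute such aggregates from per-run rubric scores, a minimal sketch; the success threshold used here (score ≥ 4 on the 1–5 rubric) is an assumption for illustration, not a definition taken from the paper.

```python
def summarize(scores, success_threshold=4):
    """Aggregate per-run rubric scores into a mean score and a success rate."""
    mean = sum(scores) / len(scores)
    # Assumed definition of success: rubric score at or above the threshold.
    success_rate = sum(s >= success_threshold for s in scores) / len(scores)
    return mean, success_rate

# e.g. summarize([5, 4, 2, 3, 4, 1]) -> (3.1666..., 0.5)
```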
The paper also compares a task-specific orchestrated agent against Claude Code using the same Sonnet 4.6 backbone. The orchestrated design separates planning, setup, deterministic execution, failure diagnosis, and result extraction. It improves scientific rigor, but does not produce a clear overall success advantage. The tradeoff is instructive: more structure can make the run more auditable, while a fluid coding-agent loop may be better for opportunistic repair.
| Dimension | Orch. | CC | p-value |
|---|---|---|---|
| Method fidelity | 3.4 | 3.3 | 0.147 |
| Execution competence | 3.5 | 3.6 | 0.470 |
| Result accuracy | 2.6 | 2.9 | 0.848 |
| Completeness | 3.2 | 3.3 | 0.772 |
| Scientific rigor | 3.4 | 3.3 | 0.038 * |

Mean per-dimension scores on a 1–5 scale for the orchestrated (Orch.) and Claude Code (CC) settings; * marks p < 0.05.
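The excerpt does not state which statistical test produced these p-values, so the sketch below uses a paired permutation test on per-instance score differences purely as one reasonable stand-in.

```python
import random

def paired_permutation_pvalue(orch_scores, cc_scores, n_perm=10_000, seed=0):
    """Two-sided paired permutation test; an illustrative choice, not necessarily the paper's test."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(orch_scores, cc_scores)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null, each paired difference is equally likely to carry either sign.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```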
AutoMat failures are not dominated by agents printing the wrong final number after otherwise faithful execution. More often, the agent does not carry out the intended scientific procedure in the first place. It skips required steps, follows a deviated methodology, gets stuck in the toolchain, or tunes its reasoning around the target answer.
| System | Procedural incompleteness | Methodological deviation | Resource & exec. failure | Overconfident assessment | Circular reasoning | Under-train / simplification |
|---|---|---|---|---|---|---|
| Codex (GPT-5.4) | 56 (65.9%) | 17 (20.0%) | 14 (16.5%) | 6 (7.1%) | 5 (5.9%) | 4 (4.7%) |
| Claude Code (Kimi K2.5) | 35 (41.2%) | 27 (31.8%) | 22 (25.9%) | 18 (21.2%) | 19 (22.4%) | 4 (4.7%) |
| Orchestrated (Sonnet 4.6) | 50 (58.8%) | 25 (29.4%) | 34 (40.0%) | 9 (10.6%) | 4 (4.7%) | 8 (9.4%) |
| Claude Code (Sonnet 4.6) | 44 (51.8%) | 26 (30.6%) | 25 (29.4%) | 17 (20.0%) | 8 (9.4%) | 10 (11.8%) |
| Claude Code (Opus 4.6) | 40 (47.1%) | 16 (18.8%) | 12 (14.1%) | 8 (9.4%) | 6 (7.1%) | 2 (2.4%) |

Cells show the number of affected runs, with the share of evaluated runs in parentheses.
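Rows do not sum to 100% because a single run can exhibit several failure modes at once. A minimal tally sketch over hypothetical per-run labels shows how counts and shares of this kind relate:

```python
from collections import Counter

def tally_failures(run_labels):
    """Count failure modes across runs; each run may carry zero, one, or several labels."""
    counts = Counter(label for labels in run_labels for label in labels)
    n_runs = len(run_labels)
    return {mode: (c, round(100.0 * c / n_runs, 1)) for mode, c in counts.items()}

# Hypothetical input: one list of failure labels per evaluated run.
example = [["procedural_incompleteness"], ["procedural_incompleteness", "resource_exec_failure"], []]
# tally_failures(example) -> {'procedural_incompleteness': (2, 66.7), 'resource_exec_failure': (1, 33.3)}
```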
Critical steps such as relaxation, NEB optimization, data-cycle checks, or post-processing are skipped or abandoned.
The agent uses the wrong features, pseudopotentials, units, data split, subset, or approximation.
The run fails because simulation tools, MPI, memory, convergence, or time limits break the workflow.
The agent may recover a plausible number while changing the procedure that gives the number scientific meaning.
These examples make the failure taxonomy concrete; the case studies below explain what happened and why it matters.
The agent correctly recognized that the provided DFT files described vacancy migration, not the required Br interstitial migration path. It verified tool availability and diagnosed the missing NEB inputs, but stopped before constructing a feasible partial reproduction.
The failure was not ignorance of the scientific goal; it was the inability to turn a correct diagnosis into an executable fallback plan.
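For concreteness, one generic shape such a fallback could take is an NEB calculation assembled with ASE; everything below (file names, image count, calculator factory, convergence threshold) is a placeholder, not the paper's actual setup.

```python
from ase.io import read
from ase.mep import NEB          # older ASE releases expose this as ase.neb.NEB
from ase.optimize import BFGS

def migration_barrier(initial_file, final_file, calc_factory, n_images=5, fmax=0.05):
    """Generic NEB skeleton: two endpoint structures in, a migration barrier (eV) out.

    `calc_factory` must return a fresh ASE calculator; in a real reproduction this
    would be the DFT engine and settings the paper used.
    """
    initial, final = read(initial_file), read(final_file)
    images = [initial] + [initial.copy() for _ in range(n_images)] + [final]
    for image in images:
        image.calc = calc_factory()
    neb = NEB(images, climb=True)
    neb.interpolate(mic=True)      # linear interpolation respecting periodic boundaries
    BFGS(neb, logfile="neb.log").run(fmax=fmax)
    energies = [image.get_potential_energy() for image in images]
    return max(energies) - energies[0]
```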
The claim required neutral defect formation energies in CsPbBr3. The correct pseudopotentials mixed Br v1.4 with Cs/Pb v1, but the agent assumed all elements should use v1.4. It then searched for nonexistent files, received 404s, and declared the task blocked.
A small dependency-identification error propagated into a false infeasibility judgment.
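The mundane fix is to verify which pseudopotential files actually exist, per element and per version, before concluding the task is blocked. A hedged sketch; the directory layout and file-name pattern are assumptions, not the benchmark's.

```python
from pathlib import Path

def resolve_pseudopotentials(element_versions, pseudo_dir="./pseudos"):
    """Map each element to an existing pseudopotential file, or report what is genuinely missing.

    Allowing per-element versions (e.g. Br -> 'v1.4', Cs and Pb -> 'v1') is exactly
    the distinction the agent missed; the glob pattern below is illustrative.
    """
    pseudo_dir = Path(pseudo_dir)
    found, missing = {}, []
    for element, version in element_versions.items():
        candidates = sorted(pseudo_dir.glob(f"{element}*{version}*"))
        if candidates:
            found[element] = candidates[0]
        else:
            missing.append((element, version))
    return found, missing

# e.g. resolve_pseudopotentials({"Br": "v1.4", "Cs": "v1", "Pb": "v1"})
```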
The agent reproduced an MLP accuracy of 0.8894 for a claim reporting approximately 0.89, but selected a temperature-like feature and normalized range heuristically. The reference workflow used a specific denormalization rule and physical temperature window.
A scalar match can be misleading when the subset definition gives the metric its scientific meaning.
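The point generalizes: identical predictions can yield the claimed number under one subset definition and a different number under another. A small sketch with numpy; the accuracy definition and temperature window here are illustrative, not the reference workflow's.

```python
import numpy as np

def windowed_accuracy(y_true, y_pred, temperature, t_min, t_max, tol=0.05):
    """Fraction of predictions within a relative tolerance, restricted to a temperature window.

    Both the tolerance and the window are placeholder choices; the claim's metric
    only has its intended meaning under the reference workflow's definitions.
    """
    mask = (temperature >= t_min) & (temperature <= t_max)
    rel_err = np.abs(y_pred[mask] - y_true[mask]) / np.abs(y_true[mask])
    return float(np.mean(rel_err <= tol))

# The same model can score ~0.89 on one window and something quite different on another:
# windowed_accuracy(y, y_hat, T, 300, 600) vs. windowed_accuracy(y, y_hat, T, 0, 1000)
```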
AutoMat suggests that the next step for coding agents in science is not simply better syntax, faster tool use, or longer contexts. The core challenge is procedural fidelity: knowing which tacit choices matter, when a partial result is not enough, and how to distinguish a convincing reproduction from a plausible-looking artifact.
The benchmark is intended as both an evaluation suite and a diagnostic tool. It makes failures inspectable by preserving the full run trace, generated artifacts, and evaluator rationale, giving researchers a way to study not just whether an agent succeeded, but why it did or did not reproduce the science.
If AutoMat is useful in your research, please cite:
```bibtex
@article{huang2026automat,
  title   = {Can Coding Agents Reproduce Findings in Computational Materials Science?},
  author  = {Huang, Ziyang and Cao, Yi and Shargh, Ali K. and Luo, Jing and Mei, Ruidong and Zaki, Mohd and Liu, Zhan and Bunstine, Wyatt and Jurayj, William and Goswami, Somdatta and McQueen, Tyrel and Shields, Michael and El-Awady, Jaafar and Clancy, Paulette and Van Durme, Benjamin and Walden, William and Andrews, Nicholas and Khashabi, Daniel},
  journal = {arXiv preprint arXiv:2605.00803},
  year    = {2026}
}
```
The benchmark, evaluator rubric, and SME-curated claim annotations are available at github.com/JHU-CLSP/AutoMat, with the dataset hosted at hf.co/datasets/jhu-clsp/AutoMat.