Preprint

Can Coding Agents Reproduce Findings in Computational Materials Science?

AutoMat asks a stricter question than "can an agent write code?": can it recover a scientific workflow from a paper, navigate specialized computational tools, and produce evidence that actually supports a published claim?

Ziyang Huang · Yi Cao · Ali K. Shargh · Jing Luo · Ruidong Mei · Mohd Zaki · Zhan Liu · Wyatt Bunstine · William Jurayj · Somdatta Goswami · Tyrel McQueen · Michael Shields · Jaafar El-Awady · Paulette Clancy · Benjamin Van Durme · Nicholas Andrews · William Walden · Daniel Khashabi
1 Computer Science
2 Chemical and Biomolecular Engineering
3 Mechanical Engineering
4 Civil and Systems Engineering
5 Physics and Astronomy
6 Hopkins Extreme Materials Institute (HEMI)
7 Center for Language and Speech Processing (CLSP)
8 Human Language Technology Center of Excellence (HLTCOE)
Johns Hopkins University
*Equal contribution · Equal advising · Corresponding author
[Figure 1 data: success-rate ranges across the five agent settings, by regime. From paper (workflow recovered from prose): 0%–10.0%. From artifact (code / data / models provided): 39.4%–76.9%. Interpretation (outputs provided, analysis required): 33.3%–50.0%. Best overall SR: 54.2% (Opus 4.6). SR = success rate, the share of claims scored ≥ 4.]
Figure 1. Success rate ranges across the five evaluated agent settings. From-paper reproduction is the stress test: when the workflow has to be recovered from the paper itself, even strong coding agents rarely reach a convincing reproduction.
§ 01 — Intro

The benchmark starts where coding benchmarks stop

Coding agents have become good at software tasks with visible feedback loops: tests fail, a patch is written, tests pass. Scientific reproduction is a different regime. The requirement is often buried in prose, correctness depends on tacit domain choices, and a numerically plausible output can still be scientifically wrong.

AutoMat evaluates that regime directly. Each benchmark instance asks an agent to investigate a specific computational materials science claim from a real paper. The agent must decide what needs to be run, assemble or adapt the workflow, execute it in a controlled environment, and write a report explaining whether the evidence supports the claim.

AutoMat in one pass
  • 85 claims curated and annotated by subject matter experts from computational materials science papers.
  • Three reproduction regimes: from paper text, from released artifacts, and from final outputs that still require interpretation.
  • Five agent settings evaluated end-to-end, including general-purpose coding agents and a benchmark-specific orchestrated agent.
  • Artifact-grounded scoring by an evaluator agent calibrated against blinded SME judgments.

The key result is blunt: current agents can make partial progress, but they are not yet reliable autonomous scientific reproducers. Claude Code with Opus 4.6 is the strongest setting we evaluate, with a mean score of 3.52 and a success rate around 54%. Codex with GPT-5.4 is lowest overall, with a mean score of 2.44 and a success rate around 24%.

Reproducing the right number on the wrong subset does not verify the scientific conclusion. In AutoMat, faithful reproduction means recovering the procedure, executing it competently, and interpreting the result under the constraints the original work imposed.
Software benchmark

Correctness is often executable

The agent can use tests, linters, compilation, and explicit issue descriptions as short feedback loops.

Scientific reproduction

Correctness is evidential

The agent must recover an underspecified method, handle domain tools, and judge whether the produced evidence actually supports the claim.

§ 02 — Abstract

What the paper contributes

AutoMat is a benchmark for evaluating LLM-based agents' ability to reproduce claims from computational materials science. It poses three interrelated challenges: recovering underspecified computational procedures, navigating specialized toolchains, and determining whether the resulting evidence supports a claim.

Full abstract

Large language models are increasingly deployed as autonomous coding agents and have achieved remarkably strong performance on software engineering benchmarks. However, it is unclear whether such success transfers to computational scientific workflows, where tasks require not only strong coding ability, but also the ability to navigate complex, domain-specific procedures and to interpret results in the context of scientific claims.

To address this question, we present AutoMat, a benchmark for evaluating LLM-based agents' ability to reproduce claims from computational materials science. AutoMat poses three interrelated challenges: recovering underspecified computational procedures, navigating specialized toolchains, and determining whether the resulting evidence supports a claim. By working closely with subject matter experts, we curate a set of claims from real materials science papers to test whether coding agents can recover and execute the end-to-end workflow needed to support or undermine such claims.

We then evaluate multiple representative coding agent settings across several foundation models. Our results show that current LLM-based agents obtain low overall success rates on AutoMat, with the best-performing setting achieving a success rate of only about 54%. Error analysis further reveals that agents perform worst when workflows must be reconstructed from paper text alone and that they fail primarily due to incomplete procedures, methodological deviations, and execution fragility.

§ 03 — Benchmark

A scientific claim becomes a runnable task

Each AutoMat instance starts with a claim from a materials science paper. SMEs select claims that are quantitatively checkable and important to the paper’s conclusions, then annotate the intended reproduction procedure, expected outcome, and relevant context. The agent sees the claim, paper, metadata, and any released artifacts. The SME procedure stays hidden until evaluation.

Agent-visible payload: claim + paper + metadata + optional artifacts

The agent reads, plans, executes commands, inspects outputs, and writes a final reproduction report.

Autonomous run: trace + logs + generated artifacts

The benchmark preserves the full trajectory, not just the final answer.

Task package.   The agent-visible task is separated from evaluation-only SME annotations. This prevents the agent from seeing the answer while still allowing evaluation to be grounded in SME expectations.
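
A minimal sketch of how such a task package might be laid out, with the agent-visible payload kept separate from the evaluation-only annotations; every field name below is an illustrative assumption, not the benchmark's actual schema.

from dataclasses import dataclass, field

@dataclass
class AgentVisiblePayload:
    """What the agent sees: the claim and its context, never the answer."""
    claim: str                                   # the quantitative claim under investigation
    paper_text: str                              # full paper text (or a path to it)
    metadata: dict                               # e.g., reproduction regime, tool hints
    artifacts: list[str] = field(default_factory=list)   # released code/data/model paths, possibly empty

@dataclass
class SMEAnnotation:
    """Evaluation-only ground truth, hidden from the agent until scoring."""
    intended_procedure: str                      # the reproduction procedure the SME expects
    expected_outcome: str                        # the value, ordering, or qualitative result to recover
    context_notes: str                           # constraints that give the result its scientific meaning

@dataclass
class AutoMatTask:
    task_id: str                                 # e.g., "AUTOMAT-0037"
    visible: AgentVisiblePayload                 # shipped to the agent
    hidden: SMEAnnotation                        # used only by the evaluator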

The three reproduction regimes

AutoMat deliberately varies how much scaffolding the agent receives. This matters because the task changes qualitatively as artifacts become available. With a runnable codebase, the problem is often adaptation and execution. With only the paper, the agent has to infer the workflow itself.

[Claim distribution: by reproduction regime, from-paper reproduction (33), from-artifact reproduction (41), from-artifact interpretation (11); by verification category, scalar metric match (25), comparative / ordering (32), structural / qualitative (15), range / bound verification (6), figure reproduction (7).]
Distribution of the 85 AutoMat claims by reproduction regime (left) and verification category (right). Each ribbon's thickness is proportional to the number of claims in that combination; absent ribbons mark combinations the benchmark does not contain (e.g., from-paper claims never require figure reproduction).
§ 04 — Method

From claims to scored reproductions

AutoMat runs in four phases. SMEs curate claims, the benchmark packages them into runnable tasks, an agent attempts the reproduction in a resource-controlled environment, and a separate artifact-grounded evaluator scores the run.

01 / Curate
SME claims
SMEs distill quantitatively checkable claims from real computational materials science papers.
02 / Package
Runnable task
Each claim is paired with paper text, metadata, optional artifacts, and hidden SME ground truth.
03 / Execute
Agent run
Five agent settings run end-to-end on an HPC-style environment and produce full traces.
04 / Assess
Grounded eval
An evaluator scores the run on five dimensions, calibrated against blinded SME judgments.
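
As an illustration of how the per-dimension assessment might roll up into the single 1–5 reproducibility score behind the success rates, here is a hedged sketch; the dimension names come from the head-to-head table later on this page, but the aggregation rule (a plain mean) is an assumption, not AutoMat's documented scoring code.

from statistics import mean

# The five evaluator dimensions reported in the orchestration comparison.
DIMENSIONS = (
    "method_fidelity",
    "execution_competence",
    "result_accuracy",
    "completeness",
    "scientific_rigor",
)

def overall_score(dimension_scores):
    """Collapse per-dimension scores (each 1-5) into one number; a plain mean
    is used here purely for illustration."""
    return mean(dimension_scores[d] for d in DIMENSIONS)

def is_success(dimension_scores, threshold=4.0):
    """A claim counts as a success when its score reaches the benchmark's >= 4 threshold."""
    return overall_score(dimension_scores) >= threshold

# Example: a run that executed well but drifted methodologically.
run = {
    "method_fidelity": 2.0,
    "execution_competence": 4.0,
    "result_accuracy": 3.0,
    "completeness": 3.0,
    "scientific_rigor": 3.0,
}
print(overall_score(run), is_success(run))   # 3.0 False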

What makes AutoMat hard?

Computational materials science is a sharp stress test because a single claim can require a chain of hidden decisions: which structure to use, which pseudopotential version is valid, which simulation path must be relaxed, which subset defines the metric, and which output actually corresponds to the paper’s statement.

01

Recover the procedure

Papers rarely spell out every executable step. The agent must infer enough of the workflow to avoid a superficially plausible but methodologically different reproduction.

02

Navigate the toolchain

DFT, MD, and ML workflows expose fragile dependencies, HPC constraints, simulation timeouts, and domain-specific file formats.

03

Interpret the evidence

The final question is not "did code run?" but "does this output support the target claim under the intended scientific protocol?"

§ 05 — Results

Agents make progress, but not reliable reproductions

Across all systems, many runs land in the middle of the scoring scale: enough progress to show the agent understood part of the task, but not enough to establish a faithful reproduction. The strongest setting succeeds on only about half of the benchmark.

System                       Mean ↑   Overall SR ↑   From-paper SR ↑   From-artifact SR ↑   Interp. SR ↑
Codex (GPT-5.4)              2.44     23.6%          0.0%              39.4%                40.0%
Claude Code (Kimi K2.5)      2.75     28.3%          0.0%              46.2%                33.3%
Orchestrated (Sonnet 4.6)    2.92     36.9%          10.0%             66.7%                37.5%
Claude Code (Sonnet 4.6)     3.12     38.8%          3.8%              67.7%                40.0%
Claude Code (Opus 4.6)       3.52     54.2%          8.3%              76.9%                50.0%
Mean reproducibility score (on the 1 = failure → 5 = full reproduction scale) and success rate (SR; share of claims scored ≥ 4), overall and by reproduction regime. From-paper reproduction is the hardest setting; artifacts lift performance substantially, but do not make the benchmark solved.
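
These summary columns follow directly from the per-claim scores. A minimal sketch of recomputing them, assuming a simple (regime, score) record format rather than the benchmark's actual output schema:

from collections import defaultdict

def summarize(records):
    """records: iterable of (regime, score) pairs, with scores on the 1-5 scale.

    Returns the mean score, the overall success rate (share of scores >= 4),
    and the success rate within each reproduction regime.
    """
    scores = [s for _, s in records]
    mean_score = sum(scores) / len(scores)
    overall_sr = sum(s >= 4 for s in scores) / len(scores)

    by_regime = defaultdict(list)
    for regime, score in records:
        by_regime[regime].append(score)
    regime_sr = {r: sum(s >= 4 for s in v) / len(v) for r, v in by_regime.items()}
    return mean_score, overall_sr, regime_sr

# Toy example with three claims (not real AutoMat data):
demo = [("from_paper", 2), ("from_artifact", 5), ("interpretation", 4)]
print(summarize(demo))   # roughly (3.67, 0.67, {'from_paper': 0.0, 'from_artifact': 1.0, 'interpretation': 1.0})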

Orchestration improves rigor, not overall success

The paper also compares a task-specific orchestrated agent against Claude Code using the same Sonnet 4.6 backbone. The orchestrated design separates planning, setup, deterministic execution, failure diagnosis, and result extraction. It improves scientific rigor, but does not produce a clear overall success advantage. The tradeoff is instructive: more structure can make the run more auditable, while a fluid coding-agent loop may be better for opportunistic repair.
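
A sketch of what that phase separation could look like as code. The phase names follow the description above; every function body below is a placeholder written so the control flow runs end-to-end, not the orchestrated agent's actual implementation.

# Each phase takes and returns a plain state dict, which is part of what makes the
# run auditable: every hand-off between phases is an inspectable object.
def plan(state):
    state["plan"] = ["prepare inputs", "run simulation", "post-process"]   # placeholder plan
    return state

def setup(state):
    state["environment_ready"] = True                                      # placeholder environment setup
    return state

def execute(state):
    state["outputs"] = {"metric": 1.23}                                    # placeholder deterministic execution
    state["failed"] = False
    return state

def diagnose(state):
    state["repair"] = "re-run with a smaller timestep"                     # placeholder failure diagnosis
    return state

def extract(state):
    state["report"] = f"claim-relevant metric: {state['outputs']['metric']}"
    return state

def run_orchestrated(task, max_repairs=2):
    """Drive the phases in order, looping back through diagnosis a bounded number of times."""
    state = {"task": task, "failed": True}
    for _ in range(max_repairs + 1):
        for phase in (plan, setup, execute):
            state = phase(state)
        if not state["failed"]:
            break
        state = diagnose(state)
    return extract(state)

print(run_orchestrated({"id": "AUTOMAT-0000"})["report"])   # claim-relevant metric: 1.23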

Orchestrated vs. Claude Code (same Sonnet 4.6 backbone)

Head-to-head outcome: 23.3% Orch. wins · 45.0% Tie · 31.7% CC wins

Dimension               Orch.   CC    p-value
Method fidelity         3.4     3.3   0.147
Execution competence    3.5     3.6   0.470
Result accuracy         2.6     2.9   0.848
Completeness            3.2     3.3   0.772
Scientific rigor        3.4     3.3   0.038 ★
Same-backbone head-to-head between the orchestrated agent and Claude Code, both running Claude Sonnet 4.6. Top: head-to-head outcome split — most claims tie (45.0%), and the orchestrated design loses slightly more often than it wins (23.3% vs 31.7%). Bottom: mean evaluator score per dimension; the only statistically significant difference (★) is on scientific rigor, where orchestration edges out Claude Code (3.4 vs 3.3, p = 0.038).
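
The page does not say which paired test produces these p-values, so the sketch below only illustrates one plausible choice, a Wilcoxon signed-rank test over per-claim score pairs; both the test and the numbers are assumptions made for the example.

from scipy.stats import wilcoxon

# Hypothetical per-claim scientific-rigor scores for the same claims under the two
# systems (both on the 1-5 scale); these values are made up for illustration only.
orchestrated = [4, 3, 4, 5, 3, 4, 2, 4, 3, 4]
claude_code  = [3, 3, 4, 4, 2, 3, 3, 4, 3, 3]

# Paired, two-sided test. Zero-difference pairs are dropped under the default
# zero_method, and with small integer scores SciPy will typically fall back to
# a normal approximation because of ties.
stat, p_value = wilcoxon(orchestrated, claude_code)
print(f"W = {stat}, p = {p_value:.3f}")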
§ 06 — Failures

The failures are usually scientific-procedure failures

AutoMat failures are not dominated by agents printing the wrong final number after otherwise faithful execution. More often, the agent does not carry out the intended scientific procedure in the first place. It skips required steps, follows a deviated methodology, gets stuck in the toolchain, or tunes its reasoning around the target answer.

System                       Procedural incompl.   Methodological dev.   Resource & exec. failure   Overconfident assess.   Circular reasoning   Under-train / simpl.
Codex (GPT-5.4)              56 (65.9%)            17 (20.0%)            14 (16.5%)                 6 (7.1%)                5 (5.9%)             4 (4.7%)
Claude Code (Kimi K2.5)      35 (41.2%)            27 (31.8%)            22 (25.9%)                 18 (21.2%)              19 (22.4%)           4 (4.7%)
Orchestrated (Sonnet 4.6)    50 (58.8%)            25 (29.4%)            34 (40.0%)                 9 (10.6%)               4 (4.7%)             8 (9.4%)
Claude Code (Sonnet 4.6)     44 (51.8%)            26 (30.6%)            25 (29.4%)                 17 (20.0%)              8 (9.4%)             10 (11.8%)
Claude Code (Opus 4.6)       40 (47.1%)            16 (18.8%)            12 (14.1%)                 8 (9.4%)                6 (7.1%)             2 (2.4%)
Per-system breakdown of failure-mode prevalence. Each cell shows the number of occurrences of that failure mode across the system's runs and the corresponding share (% of runs). A single run can exhibit multiple failure modes, so row totals exceed 100%. Procedural incompleteness dominates everywhere; resource & execution fragility is most pronounced for the orchestrated agent.
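
A small sketch of the tally behind those cells, assuming each run carries a set of failure-mode labels; the record format is illustrative, not the evaluator's actual output.

from collections import Counter

def failure_prevalence(runs_failure_modes, n_runs=85):
    """runs_failure_modes: one set of failure-mode labels per run.

    Returns {mode: (count, share_of_runs)}. Shares are taken over all runs, so
    they can sum past 100% because a single run may carry several labels.
    """
    counts = Counter(mode for modes in runs_failure_modes for mode in modes)
    return {mode: (c, c / n_runs) for mode, c in counts.items()}

# Toy example with three runs (not real AutoMat data):
demo = [
    {"procedural_incompleteness"},
    {"procedural_incompleteness", "resource_execution_failure"},
    {"methodological_deviation"},
]
print(failure_prevalence(demo, n_runs=3))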
Procedural incompleteness

Critical steps such as relaxation, NEB optimization, data-cycle checks, or post-processing are skipped or abandoned.

Methodological deviation

The agent uses the wrong features, pseudopotentials, units, data split, subset, or approximation.

Resource and execution failure

The run fails because simulation tools, MPI, memory, convergence, or time limits break the workflow.

Protocol drift

The agent may recover a plausible number while changing the procedure that gives the number scientific meaning.

§ 07 — Case Studies

Some ways an agent can be wrong

These examples make the failure taxonomy concrete; the cards below explain what happened and why it matters.

AUTOMAT-0037

Diagnosis without remediation

The agent correctly recognized that the provided DFT files described vacancy migration, not the required Br interstitial migration path. It verified tool availability and diagnosed the missing NEB inputs, but stopped before constructing a feasible partial reproduction.

The failure was not ignorance of the scientific goal; it was the inability to turn a correct diagnosis into an executable fallback plan.

AUTOMAT-0028

A false dependency assumption

The claim required neutral defect formation energies in CsPbBr3. The correct pseudopotentials mixed Br v1.4 with Cs/Pb v1, but the agent assumed all elements should use v1.4. It then searched for nonexistent files, received 404s, and declared the task blocked.

A small dependency-identification error propagated into a false infeasibility judgment.

AUTOMAT-0007

The right number, the wrong protocol

The agent reproduced an MLP accuracy of 0.8894 for a claim reporting approximately 0.89, but selected a temperature-like feature and normalized range heuristically. The reference workflow used a specific denormalization rule and physical temperature window.

A scalar match can be misleading when the subset definition gives the metric its scientific meaning.

§ 08 — Takeaway

Scientific agents need more than code fluency

AutoMat suggests that the next step for coding agents in science is not simply better syntax, faster tool use, or longer contexts. The core challenge is procedural fidelity: knowing which tacit choices matter, when a partial result is not enough, and how to distinguish a convincing reproduction from a plausible-looking artifact.

The benchmark is intended as both an evaluation suite and a diagnostic tool. It makes failures inspectable by preserving the full run trace, generated artifacts, and evaluator rationale, giving researchers a way to study not just whether an agent succeeded, but why it did or did not reproduce the science.

§ 09 — Citation

Cite this work

If AutoMat is useful in your research, please cite:

@article{huang2026automat,
  title     = {Can Coding Agents Reproduce Findings in
                Computational Materials Science?},
  author    = {Huang, Ziyang and Cao, Yi and
                Shargh, Ali K. and Luo, Jing and
                Mei, Ruidong and Zaki, Mohd and
                Liu, Zhan and Bunstine, Wyatt and
                Jurayj, William and Goswami, Somdatta and
                McQueen, Tyrel and Shields, Michael and
                El-Awady, Jaafar and Clancy, Paulette and
                Van Durme, Benjamin and Walden, William and
                Andrews, Nicholas and Khashabi, Daniel},
  journal   = {arXiv preprint arXiv:2605.00803},
  year      = {2026},
}

Artifacts

The benchmark, evaluator rubric, and SME-curated claim annotations are available at github.com/JHU-CLSP/AutoMat, with the dataset hosted at hf.co/datasets/jhu-clsp/AutoMat.
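
If the Hugging Face release follows the standard datasets layout, loading it could look like the sketch below; the split name and field names here are guesses for illustration, not documented parts of the dataset.

from datasets import load_dataset

# Hypothetical: pull the claim set and keep only the from-paper regime.
automat = load_dataset("jhu-clsp/AutoMat", split="test")                    # split name is an assumption
from_paper = automat.filter(lambda ex: ex.get("regime") == "from_paper")    # field name is an assumption

for example in from_paper.select(range(min(3, len(from_paper)))):
    print(example["task_id"], example["claim"][:80])                        # field names are assumptions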