In a typical software engineering task, correctness is often executable: the agent can use tests, linters, compilation, and explicit issue descriptions as short feedback loops.
AutoMat asks a stricter question than "can an agent write code?": can it recover a scientific workflow from a paper, navigate specialized computational tools, and produce evidence that actually supports a published claim?
Coding agents have become good at software tasks with visible feedback loops: tests fail, a patch is written, tests pass. Scientific reproduction is a different regime. The requirement is often buried in prose, correctness depends on tacit domain choices, and a numerically plausible output can still be scientifically wrong.
AutoMat evaluates that regime directly. Each benchmark instance asks an agent to investigate a specific computational materials science claim from a real paper. The agent must decide what needs to be run, assemble or adapt the workflow, execute it in a controlled environment, and write a report explaining whether the evidence supports the claim.
The key result is blunt: current agents can make partial progress, but they are not yet reliable autonomous scientific reproducers. Claude Code with Opus 4.6 is the strongest setting we evaluate, with a mean score of 3.52 and a success rate around 54%. Codex with GPT-5.4 is lowest overall, with a mean score of 2.44 and a success rate around 24%.
Reproducing the right number on the wrong subset does not verify the scientific conclusion. In AutoMat, faithful reproduction means recovering the procedure, executing it competently, and interpreting the result under the constraints the original work imposed.
Unlike a software issue with its executable feedback loop, the agent must recover an underspecified method, handle domain tools, and judge whether the produced evidence actually supports the claim.
AutoMat is a benchmark for evaluating LLM-based agents' ability to reproduce claims from computational materials science. It poses three interrelated challenges: recovering underspecified computational procedures, navigating specialized toolchains, and determining whether the resulting evidence supports a claim.
Large language models are increasingly deployed as autonomous coding agents and have achieved remarkably strong performance on software engineering benchmarks. However, it is unclear whether such success transfers to computational scientific workflows, where tasks require not only strong coding ability, but also the ability to navigate complex, domain-specific procedures and to interpret results in the context of scientific claims.
To address this question, we present AutoMat, a benchmark for evaluating LLM-based agents' ability to reproduce claims from computational materials science. AutoMat poses three interrelated challenges: recovering underspecified computational procedures, navigating specialized toolchains, and determining whether the resulting evidence supports a claim. By working closely with subject matter experts (SMEs), we curate a set of claims from real materials science papers to test whether coding agents can recover and execute the end-to-end workflow needed to support or undermine such claims.
We then evaluate multiple representative coding agent settings across several foundation models. Our results show that current LLM-based agents obtain low overall success rates on AutoMat, with the best-performing setting achieving a success rate of only about 54%. Error analysis further reveals that agents perform worst when workflows must be reconstructed from paper text alone and that they fail primarily due to incomplete procedures, methodological deviations, and execution fragility.
Each AutoMat instance starts with a claim from a materials science paper. SMEs select claims that are quantitatively checkable and important to the paper’s conclusions, then annotate the intended reproduction procedure, expected outcome, and relevant context. The agent sees the claim, paper, metadata, and any released artifacts. The SME procedure stays hidden until evaluation.
The agent reads, plans, executes commands, inspects outputs, and writes a final reproduction report.
The benchmark preserves the full trajectory, not just the final answer.
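As a rough mental model of what a task bundles together, the sketch below separates what the agent sees, what stays hidden until evaluation, and what the run leaves behind. The field names are illustrative assumptions on our part, not the benchmark's released schema.

```python
from dataclasses import dataclass, field

@dataclass
class AutoMatInstance:
    """Hypothetical sketch of one benchmark instance; field names are illustrative."""
    claim: str                      # the quantitative statement the agent must investigate
    paper_path: str                 # full paper text the agent may read
    metadata: dict                  # e.g. material system, claimed value, units
    released_artifacts: list[str]   # code/data published with the paper, possibly empty
    # Hidden from the agent until evaluation:
    sme_procedure: str              # SME-annotated intended reproduction procedure
    expected_outcome: str           # what a faithful reproduction should produce

@dataclass
class RunRecord:
    """What the benchmark preserves: the whole trajectory, not just the final answer."""
    commands: list[str] = field(default_factory=list)             # every command executed
    generated_artifacts: list[str] = field(default_factory=list)  # files the run produced
    final_report: str = ""                                        # the agent's written report
```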
AutoMat deliberately varies how much scaffolding the agent receives. This matters because the task changes qualitatively as artifacts become available. With a runnable codebase, the problem is often adaptation and execution. With only the paper, the agent has to infer the workflow itself.
AutoMat runs in four phases. SMEs curate claims, the benchmark packages them into runnable tasks, an agent attempts the reproduction in a resource-controlled environment, and a separate artifact-grounded evaluator scores the run.
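A compressed sketch of how those four phases could chain together; every callable here is an injected placeholder rather than the released harness, so treat it as a mental model only.

```python
def run_automat_pipeline(paper, curate, package, agent, evaluator):
    """Illustrative four-phase loop; all callables are hypothetical stand-ins."""
    claim, reference = curate(paper)       # Phase 1: SME curation (human-in-the-loop)
    task = package(claim, paper)           # Phase 2: package into a runnable, resource-controlled task
    trajectory = agent.attempt(task)       # Phase 3: agent plans, executes commands, writes a report
    return evaluator.score(trajectory, reference)  # Phase 4: artifact-grounded scoring vs. the SME reference
```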
Computational materials science is a sharp stress test because a single claim can require a chain of hidden decisions: which structure to use, which pseudopotential version is valid, which simulation path must be relaxed, which subset defines the metric, and which output actually corresponds to the paper’s statement.
Papers rarely spell out every executable step. The agent must infer enough of the workflow to avoid a superficially plausible but methodologically different reproduction.
Density functional theory (DFT), molecular dynamics (MD), and machine learning (ML) workflows expose fragile dependencies, high-performance computing (HPC) constraints, simulation timeouts, and domain-specific file formats; a defensive-execution sketch follows below.
The final question is not "did code run?" but "does this output support the target claim under the intended scientific protocol?"
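The toolchain fragility above has a blunt practical consequence: an agent cannot assume a simulation call will finish, so every invocation needs a defensive wrapper. The sketch below is generic; the command, time limit, and convergence string are placeholders, and real failure messages vary across simulation codes.

```python
import subprocess

def run_simulation(cmd, timeout_s=3600):
    """Run a simulation command with a wall-clock limit and surface common failure modes."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout", None
    if proc.returncode != 0:
        # MPI aborts, out-of-memory kills, and missing input files all land here.
        return "failed", proc.stderr[-2000:]
    if "convergence NOT achieved" in proc.stdout:  # phrasing varies by code; illustrative check only
        return "unconverged", proc.stdout[-2000:]
    return "ok", proc.stdout

# e.g. status, log = run_simulation(["pw.x", "-in", "relax.in"], timeout_s=7200)
```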
Across all systems, many runs land in the middle of the scoring scale: enough progress to show the agent understood part of the task, but not enough to establish a faithful reproduction. The strongest setting succeeds on only about half of the benchmark.
| System | Mean ↑ | Overall SR ↑ | From-paper SR ↑ | From-artifact SR ↑ | Interp. SR ↑ |
|---|---|---|---|---|---|
| Codex (GPT-5.4) | 2.44 | 23.6% | 0.0% | 39.4% | 40.0% |
| Claude Code (Kimi K2.5) | 2.75 | 28.3% | 0.0% | 46.2% | 33.3% |
| Orchestrated (Sonnet 4.6) | 2.92 | 36.9% | 10.0% | 66.7% | 37.5% |
| Claude Code (Sonnet 4.6) | 3.12 | 38.8% | 3.8% | 67.7% | 40.0% |
| Claude Code (Opus 4.6) | 3.52 | 54.2% | 8.3% | 76.9% | 50.0% |

Scores are on a 1–5 rubric; SR = success rate.
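For readers who want to recompute such aggregates from per-run rubric scores, a minimal sketch; the success threshold used here (score ≥ 4 on the 1–5 rubric) is an assumption for illustration, not a definition taken from the paper.

```python
def summarize(scores, success_threshold=4):
    """Aggregate per-run rubric scores into a mean score and a success rate."""
    mean = sum(scores) / len(scores)
    # Assumed definition of success: rubric score at or above the threshold.
    success_rate = sum(s >= success_threshold for s in scores) / len(scores)
    return mean, success_rate

# e.g. summarize([5, 4, 2, 3, 4, 1]) -> (3.1666..., 0.5)
```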
The paper also compares a task-specific orchestrated agent against Claude Code using the same Sonnet 4.6 backbone. The orchestrated design separates planning, setup, deterministic execution, failure diagnosis, and result extraction. It improves scientific rigor, but does not produce a clear overall success advantage. The tradeoff is instructive: more structure can make the run more auditable, while a fluid coding-agent loop may be better for opportunistic repair.
| Dimension | Orch. | CC | p-value |
|---|---|---|---|
| Method fidelity | 3.4 | 3.3 | 0.147 |
| Execution competence | 3.5 | 3.6 | 0.470 |
| Result accuracy | 2.6 | 2.9 | 0.848 |
| Completeness | 3.2 | 3.3 | 0.772 |
| Scientific rigor | 3.4 | 3.3 | 0.038 * |

Mean per-dimension scores on a 1–5 scale for the orchestrated (Orch.) and Claude Code (CC) settings; * marks p < 0.05.
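The excerpt does not state which statistical test produced these p-values, so the sketch below uses a paired permutation test on per-instance score differences purely as one reasonable stand-in.

```python
import random

def paired_permutation_pvalue(orch_scores, cc_scores, n_perm=10_000, seed=0):
    """Two-sided paired permutation test; an illustrative choice, not necessarily the paper's test."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(orch_scores, cc_scores)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null, each paired difference is equally likely to carry either sign.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```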
AutoMat failures are not dominated by agents printing the wrong final number after otherwise faithful execution. More often, the agent does not carry out the intended scientific procedure in the first place. It skips required steps, follows a deviated methodology, gets stuck in the toolchain, or tunes its reasoning around the target answer.
| System | Procedural incompleteness | Methodological deviation | Resource & exec. failure | Overconfident assessment | Circular reasoning | Under-train / simplification |
|---|---|---|---|---|---|---|
| Codex (GPT-5.4) | 56 (65.9%) | 17 (20.0%) | 14 (16.5%) | 6 (7.1%) | 5 (5.9%) | 4 (4.7%) |
| Claude Code (Kimi K2.5) | 35 (41.2%) | 27 (31.8%) | 22 (25.9%) | 18 (21.2%) | 19 (22.4%) | 4 (4.7%) |
| Orchestrated (Sonnet 4.6) | 50 (58.8%) | 25 (29.4%) | 34 (40.0%) | 9 (10.6%) | 4 (4.7%) | 8 (9.4%) |
| Claude Code (Sonnet 4.6) | 44 (51.8%) | 26 (30.6%) | 25 (29.4%) | 17 (20.0%) | 8 (9.4%) | 10 (11.8%) |
| Claude Code (Opus 4.6) | 40 (47.1%) | 16 (18.8%) | 12 (14.1%) | 8 (9.4%) | 6 (7.1%) | 2 (2.4%) |

Cells show the number of affected runs, with the share of evaluated runs in parentheses.
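Rows do not sum to 100% because a single run can exhibit several failure modes at once. A minimal tally sketch over hypothetical per-run labels shows how counts and shares of this kind relate:

```python
from collections import Counter

def tally_failures(run_labels):
    """Count failure modes across runs; each run may carry zero, one, or several labels."""
    counts = Counter(label for labels in run_labels for label in labels)
    n_runs = len(run_labels)
    return {mode: (c, round(100.0 * c / n_runs, 1)) for mode, c in counts.items()}

# Hypothetical input: one list of failure labels per evaluated run.
example = [["procedural_incompleteness"], ["procedural_incompleteness", "resource_exec_failure"], []]
# tally_failures(example) -> {'procedural_incompleteness': (2, 66.7), 'resource_exec_failure': (1, 33.3)}
```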
Critical steps such as relaxation, NEB optimization, data-cycle checks, or post-processing are skipped or abandoned.
The agent uses the wrong features, pseudopotentials, units, data split, subset, or approximation.
The run fails because simulation tools, MPI, memory, convergence, or time limits break the workflow.
The agent may recover a plausible number while changing the procedure that gives the number scientific meaning.
These examples make the failure taxonomy concrete; the case studies below explain what happened and why it matters.
The agent correctly recognized that the provided DFT files described vacancy migration, not the required Br interstitial migration path. It verified tool availability and diagnosed the missing NEB inputs, but stopped before constructing a feasible partial reproduction.
The failure was not ignorance of the scientific goal; it was the inability to turn a correct diagnosis into an executable fallback plan.
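For concreteness, one generic shape such a fallback could take is an NEB calculation assembled with ASE; everything below (file names, image count, calculator factory, convergence threshold) is a placeholder, not the paper's actual setup.

```python
from ase.io import read
from ase.mep import NEB          # older ASE releases expose this as ase.neb.NEB
from ase.optimize import BFGS

def migration_barrier(initial_file, final_file, calc_factory, n_images=5, fmax=0.05):
    """Generic NEB skeleton: two endpoint structures in, a migration barrier (eV) out.

    `calc_factory` must return a fresh ASE calculator; in a real reproduction this
    would be the DFT engine and settings the paper used.
    """
    initial, final = read(initial_file), read(final_file)
    images = [initial] + [initial.copy() for _ in range(n_images)] + [final]
    for image in images:
        image.calc = calc_factory()
    neb = NEB(images, climb=True)
    neb.interpolate(mic=True)      # linear interpolation respecting periodic boundaries
    BFGS(neb, logfile="neb.log").run(fmax=fmax)
    energies = [image.get_potential_energy() for image in images]
    return max(energies) - energies[0]
```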
The claim required neutral defect formation energies in CsPbBr3. The correct pseudopotentials mixed Br v1.4 with Cs/Pb v1, but the agent assumed all elements should use v1.4. It then searched for nonexistent files, received 404s, and declared the task blocked.
A small dependency-identification error propagated into a false infeasibility judgment.
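The mundane fix is to verify which pseudopotential files actually exist, per element and per version, before concluding the task is blocked. A hedged sketch; the directory layout and file-name pattern are assumptions, not the benchmark's.

```python
from pathlib import Path

def resolve_pseudopotentials(element_versions, pseudo_dir="./pseudos"):
    """Map each element to an existing pseudopotential file, or report what is genuinely missing.

    Allowing per-element versions (e.g. Br -> 'v1.4', Cs and Pb -> 'v1') is exactly
    the distinction the agent missed; the glob pattern below is illustrative.
    """
    pseudo_dir = Path(pseudo_dir)
    found, missing = {}, []
    for element, version in element_versions.items():
        candidates = sorted(pseudo_dir.glob(f"{element}*{version}*"))
        if candidates:
            found[element] = candidates[0]
        else:
            missing.append((element, version))
    return found, missing

# e.g. resolve_pseudopotentials({"Br": "v1.4", "Cs": "v1", "Pb": "v1"})
```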
The agent reproduced an MLP accuracy of 0.8894 for a claim reporting approximately 0.89, but selected a temperature-like feature and normalized range heuristically. The reference workflow used a specific denormalization rule and physical temperature window.
A scalar match can be misleading when the subset definition gives the metric its scientific meaning.
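The point generalizes: identical predictions can yield the claimed number under one subset definition and a different number under another. A small sketch with numpy; the accuracy definition and temperature window here are illustrative, not the reference workflow's.

```python
import numpy as np

def windowed_accuracy(y_true, y_pred, temperature, t_min, t_max, tol=0.05):
    """Fraction of predictions within a relative tolerance, restricted to a temperature window.

    Both the tolerance and the window are placeholder choices; the claim's metric
    only has its intended meaning under the reference workflow's definitions.
    """
    mask = (temperature >= t_min) & (temperature <= t_max)
    rel_err = np.abs(y_pred[mask] - y_true[mask]) / np.abs(y_true[mask])
    return float(np.mean(rel_err <= tol))

# The same model can score ~0.89 on one window and something quite different on another:
# windowed_accuracy(y, y_hat, T, 300, 600) vs. windowed_accuracy(y, y_hat, T, 0, 1000)
```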
AutoMat suggests that the next step for coding agents in science is not simply better syntax, faster tool use, or longer contexts. The core challenge is procedural fidelity: knowing which tacit choices matter, when a partial result is not enough, and how to distinguish a convincing reproduction from a plausible-looking artifact.
The benchmark is intended as both an evaluation suite and a diagnostic tool. It makes failures inspectable by preserving the full run trace, generated artifacts, and evaluator rationale, giving researchers a way to study not just whether an agent succeeded, but why it did or did not reproduce the science.
If AutoMat is useful in your research, please cite:
```bibtex
@article{huang2026automat,
  title   = {Can Coding Agents Reproduce Findings in Computational Materials Science?},
  author  = {Huang, Ziyang and Cao, Yi and Shargh, Ali K. and Luo, Jing and Mei, Ruidong and Zaki, Mohd and Liu, Zhan and Bunstine, Wyatt and Jurayj, William and Goswami, Somdatta and McQueen, Tyrel and Shields, Michael and El-Awady, Jaafar and Clancy, Paulette and Van Durme, Benjamin and Walden, William and Andrews, Nicholas and Khashabi, Daniel},
  journal = {arXiv preprint arXiv:2605.00803},
  year    = {2026}
}
```
The benchmark, evaluator rubric, and SME-curated claim annotations are available at github.com/JHU-CLSP/AutoMat, with the dataset hosted at hf.co/datasets/jhu-clsp/AutoMat.