← back to main page
2026 · Benchmark · v1.0

FactoryBench

A benchmark for evaluating LLMs and time-series models on machine understanding over industrial robotic telemetry. Q&A items are organized along Pearl's ladder of causation, extended with a decision-making layer that reflects how operators actually run a factory floor.

Yanis Merzouki, Coral Izquierdo, Matei Ignuta-Ciuncanu, Marcos Gómez-Bracamonte, Riccardo Maggioni, Alessandro Lombardi, Camilla Mazzoleni, Federico Martelli, Balázs Günther, Jonas Petersen, Philipp Petersen.  ·  Forgis Labs
70k+
Q&A items
15k+
Real episodes
4
Causal levels
6
LLMs evaluated
End-to-end FactoryBench pipeline
End-to-end FactoryBench pipeline diagram

Abstract

Modern robotic manufacturing systems emit dense multivariate telemetry: joint states, torques, forces, contact events, fault indicators. Traditional time-series models excel at narrow tasks but generalize poorly outside their training distribution. Large language models bring strong general reasoning, but underperform on dense numerical signals. Whether language-based systems can truly reason about machine behavior, rather than pattern-match on textual cues, has not been systematically measured.

FactoryBench targets this gap. It pairs roughly 15k normalized episodes from FactoryWave (a new dataset we collect on a UR3 cobot and a KUKA KR10 industrial arm), AURSAD, and voraus-AD with 21 typed-variable templates that generate ~70k Q&A items spanning four causal levels.

TL;DR. Zero-shot evaluation of six frontier LLMs shows no model exceeds 50% on structured reasoning levels or 18% on operational decision-making, a wide gap between current models and factory-floor readiness.

Four causal levels

We extend Pearl's ladder of causation with a fourth decision-making layer that mirrors the diagnose-then-act loop standard in fault-tolerant control.

L1 · State

What is the machine doing?

Interpret the current operational state from raw multivariate signals.

L2 · Intervention

What happens if?

Predict the immediate effect of an event or control intervention.

L3 · Counterfactual

What would have happened?

Reason about alternative histories conditioned on observed facts.

L4 · Decision

What should the operator do?

Recommend remedial actions grounded in physical and engineering constraints.

Headline results

Chance-corrected accuracy (%) on the four reasoning levels, zero-shot. Best per column highlighted. The rank reverses at L4: Claude leads structured reasoning but collapses on decision-making, while GPT-5.1 dominates L4 despite weaker signal comprehension.

Model L1 State L2 Intervention L3 Counterfactual L4 Decision
Claude Sonnet 4.6 46.8 47.1 45.9 4.3
Qwen3-235B 36.0 35.6 43.6 4.4
Mistral Large 3 34.6 31.7 36.3 5.4
GPT-5.1 30.9 30.0 31.7 17.7
DeepSeek V3.2 25.0 29.1 28.5 7.6
Qwen3-4B 21.8 27.5 28.8 2.9
Simple baseline 28.4 24.1 20.4 -

FactoryWave

FactoryWave is the dataset we collect alongside the benchmark to bridge the cobot-to-industrial-machine gap that public datasets miss. It contributes 8,983 episodes recorded on a UR3 cobot at 125 Hz and a KUKA KR10 industrial arm at 83 Hz, covering three tasks (pick-and-place, screwing, peg-in-hole) and a catalogue of 27 systematically injected fault types spanning gripper failures, misconfigurations, collisions, and peg-in-hole-specific anomalies.

Getting started

Coming soon. Public dataset release and loader are in preparation. Check back here for the install command, the Hugging Face dataset, and a runnable evaluation script.

Citation

@article{factorybench2026,
  title   = {FactoryBench: Evaluating Industrial Machine Understanding},
  author  = {Merzouki, Yanis and Izquierdo, Coral and Ignuta-Ciuncanu, Matei and G{\'o}mez-Bracamonte, Marcos and Maggioni, Riccardo and Lombardi, Alessandro and Mazzoleni, Camilla and Martelli, Federico and G{\"u}nther, Bal{\'a}zs and Petersen, Jonas and Petersen, Philipp},
  journal = {arXiv preprint arXiv:2605.07675},
  year    = {2026}
}