A benchmark for evaluating LLMs and time-series models on machine
understanding over industrial robotic telemetry. Q&A items are organized
along Pearl's ladder of causation, extended with a decision-making layer
that reflects how operators actually run a factory floor.
Yanis Merzouki, Coral Izquierdo,
Matei Ignuta-Ciuncanu, Marcos Gómez-Bracamonte,
Riccardo Maggioni, Alessandro Lombardi,
Camilla Mazzoleni, Federico Martelli,
Balázs Günther, Jonas Petersen, Philipp Petersen.
· Forgis Labs
Modern robotic manufacturing systems emit dense multivariate telemetry:
joint states, torques, forces, contact events, fault indicators. Traditional
time-series models excel at narrow tasks but generalize poorly outside their
training distribution. Large language models bring strong general reasoning,
but underperform on dense numerical signals. Whether language-based systems
can truly reason about machine behavior, rather than pattern-match on
textual cues, has not been systematically measured.
FactoryBench targets this gap. It pairs roughly 15k normalized episodes from
FactoryWave (a new dataset we collect on a UR3 cobot and a
KUKA KR10 industrial arm), AURSAD, and
voraus-AD with 21 typed-variable templates that generate
~70k Q&A items spanning four causal levels.
TL;DR. Zero-shot evaluation of six frontier LLMs shows no
model exceeds 50% on structured reasoning levels or 18% on operational
decision-making, a wide gap between current models and factory-floor
readiness.
Four causal levels
We extend Pearl's ladder of causation with a fourth decision-making layer
that mirrors the diagnose-then-act loop standard in fault-tolerant control.
L1 · State
What is the machine doing?
Interpret the current operational state from raw multivariate
signals.
L2 · Intervention
What happens if?
Predict the immediate effect of an event or control intervention.
L3 · Counterfactual
What would have happened?
Reason about alternative histories conditioned on observed facts.
L4 · Decision
What should the operator do?
Recommend remedial actions grounded in physical and engineering
constraints.
Headline results
Chance-corrected accuracy (%) on the four reasoning levels, zero-shot. Best
per column highlighted. The rank reverses at L4: Claude leads structured
reasoning but collapses on decision-making, while GPT-5.1 dominates L4
despite weaker signal comprehension.
Model
L1 State
L2 Intervention
L3 Counterfactual
L4 Decision
Claude Sonnet 4.6
46.8
47.1
45.9
4.3
Qwen3-235B
36.0
35.6
43.6
4.4
Mistral Large 3
34.6
31.7
36.3
5.4
GPT-5.1
30.9
30.0
31.7
17.7
DeepSeek V3.2
25.0
29.1
28.5
7.6
Qwen3-4B
21.8
27.5
28.8
2.9
Simple baseline
28.4
24.1
20.4
-
FactoryWave
FactoryWave is the dataset we collect alongside the benchmark to bridge the
cobot-to-industrial-machine gap that public datasets miss. It contributes
8,983 episodes recorded on a UR3 cobot at 125 Hz and a
KUKA KR10 industrial arm at 83 Hz, covering three tasks
(pick-and-place, screwing, peg-in-hole) and a catalogue of
27 systematically injected fault types
spanning gripper failures, misconfigurations, collisions, and
peg-in-hole-specific anomalies.
Coming soon. Public dataset release and loader are in
preparation. Check back here for the install command, the Hugging Face
dataset, and a runnable evaluation script.
Citation
@article{factorybench2026,
title = {FactoryBench: Evaluating Industrial Machine Understanding},
author = {Merzouki, Yanis and Izquierdo, Coral and Ignuta-Ciuncanu, Matei and G{\'o}mez-Bracamonte, Marcos and Maggioni, Riccardo and Lombardi, Alessandro and Mazzoleni, Camilla and Martelli, Federico and G{\"u}nther, Bal{\'a}zs and Petersen, Jonas and Petersen, Philipp},
journal = {arXiv preprint arXiv:2605.07675},
year = {2026}
}