← back to main page

2026 · Benchmark · v1.0

FactoryBench

A benchmark for evaluating LLMs and time-series models on machine understanding over industrial robotic telemetry. Q&A items are organized along Pearl's ladder of causation, extended with a decision-making layer that reflects how operators actually run a factory floor.

Yanis Merzouki, Coral Izquierdo, Matei Ignuta-Ciuncanu, Marcos Gómez-Bracamonte, Riccardo Maggioni, Alessandro Lombardi, Camilla Mazzoleni, Federico Martelli, Balázs Günther, Jonas Petersen, Philipp Petersen. · Forgis Labs

Paper Code Hugging Face

70k+

Q&A items

15k+

Real episodes

Causal levels

LLMs evaluated

End-to-end FactoryBench pipeline diagram — End-to-end FactoryBench pipeline

Abstract

Modern robotic manufacturing systems emit dense multivariate telemetry: joint states, torques, forces, contact events, fault indicators. Traditional time-series models excel at narrow tasks but generalize poorly outside their training distribution. Large language models bring strong general reasoning, but underperform on dense numerical signals. Whether language-based systems can truly reason about machine behavior, rather than pattern-match on textual cues, has not been systematically measured.

FactoryBench targets this gap. It pairs roughly 15k normalized episodes from FactoryWave (a new dataset we collect on a UR3 cobot and a KUKA KR10 industrial arm), AURSAD, and voraus-AD with 21 typed-variable templates that generate ~70k Q&A items spanning four causal levels.

TL;DR. Zero-shot evaluation of six frontier LLMs shows no model exceeds 50% on structured reasoning levels or 18% on operational decision-making, a wide gap between current models and factory-floor readiness.

Four causal levels

We extend Pearl's ladder of causation with a fourth decision-making layer that mirrors the diagnose-then-act loop standard in fault-tolerant control.

L1 · State

What is the machine doing?

Interpret the current operational state from raw multivariate signals.

L2 · Intervention

What happens if?

Predict the immediate effect of an event or control intervention.

L3 · Counterfactual

What would have happened?

Reason about alternative histories conditioned on observed facts.

L4 · Decision

What should the operator do?

Recommend remedial actions grounded in physical and engineering constraints.

Headline results

Chance-corrected accuracy (%) on the four reasoning levels, zero-shot. Best per column highlighted. The rank reverses at L4: Claude leads structured reasoning but collapses on decision-making, while GPT-5.1 dominates L4 despite weaker signal comprehension.

Model	L1 State	L2 Intervention	L3 Counterfactual	L4 Decision
Claude Sonnet 4.6	46.8	47.1	45.9	4.3
Qwen3-235B	36.0	35.6	43.6	4.4
Mistral Large 3	34.6	31.7	36.3	5.4
GPT-5.1	30.9	30.0	31.7	17.7
DeepSeek V3.2	25.0	29.1	28.5	7.6
Qwen3-4B	21.8	27.5	28.8	2.9
Simple baseline	28.4	24.1	20.4	-

FactoryWave

FactoryWave is the dataset we collect alongside the benchmark to bridge the cobot-to-industrial-machine gap that public datasets miss. It contributes 8,983 episodes recorded on a UR3 cobot at 125 Hz and a KUKA KR10 industrial arm at 83 Hz, covering three tasks (pick-and-place, screwing, peg-in-hole) and a catalogue of 27 systematically injected fault types spanning gripper failures, misconfigurations, collisions, and peg-in-hole-specific anomalies.

UR3 cobot · KUKA KR10 · 8,983 episodes 3 tasks · 27 fault conditions

KUKA KR10 industrial arm experimental setup

UR3 hole obstruction fault for peg-in-hole

UR3 external arm disturbance during screwing

UR3 additional axis weight fault variation

UR3 external arm disturbance during peg-in-hole

UR3 collision with rigid object variation

UR3 external arm disturbance during screwing variation

UR3 unstable mounting platform variation

Getting started

Coming soon. Public dataset release and loader are in preparation. Check back here for the install command, the Hugging Face dataset, and a runnable evaluation script.

Citation

@article{factorybench2026,
  title   = {FactoryBench: Evaluating Industrial Machine Understanding},
  author  = {Merzouki, Yanis and Izquierdo, Coral and Ignuta-Ciuncanu, Matei and G{\'o}mez-Bracamonte, Marcos and Maggioni, Riccardo and Lombardi, Alessandro and Mazzoleni, Camilla and Martelli, Federico and G{\"u}nther, Bal{\'a}zs and Petersen, Jonas and Petersen, Philipp},
  journal = {arXiv preprint arXiv:2605.07675},
  year    = {2026}
}