2026 · Method

HEPA

FMSD @ ICML Spotlight

A self-supervised, horizon-conditioned event predictive architecture for time series. A causal Transformer is pretrained via JEPA to forecast future representations rather than future values; the encoder is then frozen and only the predictor is finetuned, producing a monotonic survival CDF across prediction horizons.

Jonas Petersen, Gian-Alessandro Lombardi, Riccardo Maggioni, Camilla Mazzoleni, Federico Martelli, Philipp Petersen · ETH Zürich · Forgis · University of Vienna
14 benchmarks · 11 domains · 11× smaller finetune · 92% performance at 2% labels
Figure 1. One label-efficient architecture, domain- and event-agnostic. (a) h-AUROC across 14 benchmarks in 11 domains; HEPA wins 9 of 14 at full labels and retains ≥92% of full-label performance at 10% labels (open circles) on lifecycle datasets. (b) Predicted probability surfaces p(t, Δt) for turbofan degradation (top) and cardiac arrhythmia (bottom).

Overview

Critical events in multivariate time series, from turbine failures to cardiac arrhythmias, demand accurate prediction. Labels are scarce because the events themselves are rare and costly to annotate. Current methods couple the event definition to the model architecture: remaining-useful-life models never see anomaly benchmarks; anomaly detectors never forecast time-to-failure. Yet all these tasks share the same structure: given observations up to time t, estimate the probability of an event within each future horizon Δt.

HEPA exploits this structural uniformity with a separation of concerns. A causal Transformer encoder is pretrained via a Joint-Embedding Predictive Architecture (JEPA), forecasting future representations from unlabeled data alone. The encoder is then frozen and only the predictor is finetuned alongside a tiny event head that emits a discrete-time survival CDF, monotonically non-decreasing in the prediction horizon by construction. The encoder learns what changes; the predictor learns what matters.

TL;DR. One 2.16M-parameter architecture with frozen hyperparameters generalises across 14 benchmarks in 11 domains, including turbine degradation, spacecraft anomalies, cardiac arrhythmia, and water contamination. HEPA wins 9 of 14 against PatchTST, iTransformer, MAE, and Chronos-2 while finetuning 11× fewer parameters, and on the extended-precursor lifecycle dataset C-MAPSS-1 it retains 92% of full-label performance with only 2% of labels.

Four design choices

HEPA is defined by four choices that together yield a label-efficient, domain-agnostic event predictor.

JEPA · Predict latents

Future representations, not values

The encoder is trained to predict future representations rather than reconstruct future observations. The latent space retains what is predictable and discards noise and event-irrelevant variation.

Pred-FT · Finetune the predictor

Freeze encoder, tune 198K params

Standard JEPA discards the predictor and probes the encoder. HEPA instead keeps the predictor and finetunes it toward the downstream event. This trains 11× fewer parameters than end-to-end training, yet is more expressive than a linear probe.

Survival CDF · Monotonic by design

Discrete hazards compose to CDF

Per-horizon hazards λ_j compose into a survival product: p(t, Δt) = 1 − ∏_{j=1}^{Δt} (1 − λ_j). Because every factor (1 − λ_j) lies in [0, 1], the event probability is monotonically non-decreasing in the horizon by construction; no calibration trick is required.
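A minimal sketch of the survival product, to make the monotonicity concrete. The function name `hazards_to_cdf` and the example hazards are illustrative assumptions, not part of the HEPA codebase; in the model the hazards would come from the event head.

```python
from itertools import accumulate
from operator import mul


def hazards_to_cdf(hazards):
    """Map per-horizon hazards in [0, 1] to cumulative event probabilities.

    cdf[k] = 1 - prod_{j <= k} (1 - hazards[j])
    (hypothetical helper; not the authors' implementation)
    """
    if any(not 0.0 <= h <= 1.0 for h in hazards):
        raise ValueError("each hazard must lie in [0, 1]")
    # Running product of the per-horizon survival factors (1 - hazard).
    survival = accumulate((1.0 - h for h in hazards), mul)
    return [1.0 - s for s in survival]


# Made-up hazards for four consecutive horizons.
cdf = hazards_to_cdf([0.10, 0.00, 0.50, 0.20])
print([round(p, 3) for p in cdf])  # [0.1, 0.1, 0.55, 0.64]
```

A zero hazard leaves the cumulative probability flat, and any positive hazard raises it, so the output can never decrease from one horizon to the next.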

h-AUROC · Unified evaluation

One surface, every metric

The probability surface p(t, Δt) is the complete prediction. RMSE, F1, and PA-F1 are lossy projections of it. h-AUROC, the mean per-horizon AUROC, is threshold-free and robust to the extreme prevalence shifts within a single surface.
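The metric can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' evaluation code: the surface is a list of rows, one per time step and one column per horizon; AUROC is computed with the Mann-Whitney statistic; and horizons where only one class is present are skipped, which is an assumption about how undefined cases are handled.

```python
def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # undefined when one class is absent
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def h_auroc(surface, labels):
    """Mean per-horizon AUROC.

    surface[i][k]: predicted p(t_i, horizon k)
    labels[i][k]:  1 if the event occurs within horizon k of t_i, else 0
    Horizons with a single class are skipped (assumption).
    """
    per_horizon = []
    for k in range(len(surface[0])):
        a = auroc([row[k] for row in surface], [row[k] for row in labels])
        if a is not None:
            per_horizon.append(a)
    if not per_horizon:
        raise ValueError("no horizon contains both classes")
    return sum(per_horizon) / len(per_horizon)


# Toy surface: 4 time steps, 2 horizons (made-up numbers).
surface = [[0.15, 0.3], [0.2, 0.6], [0.7, 0.9], [0.1, 0.2]]
labels = [[0, 0], [0, 1], [1, 1], [0, 1]]
print(round(h_auroc(surface, labels), 3))  # 0.833
```

In the toy example the first horizon is ranked perfectly (AUROC 1.0) and the second ranks two of three positives above the single negative (AUROC 2/3), so the mean is 5/6. No decision threshold appears anywhere in the computation.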

Headline results

HEPA wins 9 of 14 benchmarks at 100% labels under a matched protocol: all encoders frozen, all methods sharing the same horizon-conditioned MLP head, positive-weighted BCE, and evaluation splits. Baselines include PatchTST and iTransformer (supervised end-to-end), MAE (same architecture as HEPA but reconstruction-pretrained), and Chronos-2 (119M-parameter foundation model pretrained on a large external corpus).

Benchmark    Domain                     HEPA   PatchTST   MAE    Chronos-2
TEP          Chemical process           1.00   0.99       0.96
Weather      Heat spike                 0.89   0.64       0.88   0.72
GECCO        Water contamination        0.88   0.65       0.80   0.74
C-MAPSS-3    Turbofan multi-fault       0.84   0.79       0.79   0.73
C-MAPSS-1    Turbine failure            0.81   0.80       0.69   0.66
ETTm1        Transformer overheating    0.81   0.53       0.87   0.74
Beijing-AQ   PM2.5 spike                0.81   0.81       0.81   0.53
MBA          Cardiac arrhythmia         0.75   0.68       0.71   0.53
SMD          Server anomaly             0.64   0.64       0.64   0.65
C-MAPSS-2    Multi-condition            0.57   0.44       0.57   0.45
VIX          Volatility regime          0.57   0.48       0.56   0.40
BATADAL      SCADA cyberattack          0.57   0.66       0.60   0.56

JEPA pretraining captures temporal structure that supervised value forecasting (PatchTST) and reconstruction-based SSL (MAE) miss, particularly on datasets with extended precursor dynamics (C-MAPSS, GECCO). MAE is the strongest competitor on gradual-drift failure modes (ETTm1, SMAP), and large-corpus pretraining (Chronos-2) helps when event signatures resemble patterns it has seen at scale. We report all losses honestly: iTransformer wins on MBA, where per-variate attention isolates arrhythmia-relevant leads, and PatchTST wins on BATADAL, where events are localised to specific sensor subsets that channel fusion dilutes.

Label efficiency

On lifecycle datasets, where JEPA pretraining achieves low loss, HEPA degrades gracefully under severe label scarcity. C-MAPSS-1 retains 92% of full-label performance with only 2 labelled engines out of 85.

Dataset                   100%    10%     5%      2%      1%
C-MAPSS-1 (85 engines)    0.786   0.772   0.730   0.724   0.670
C-MAPSS-1 (retention)     100%    98%     93%     92%     85%
C-MAPSS-3 (100 engines)   0.853   0.830   0.709   0.635   0.513
C-MAPSS-3 (retention)     100%    97%     83%     74%     60%
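The retention rows can be reproduced from the score rows, assuming retention means the score at a given label fraction divided by the full-label score, rounded to the nearest percent. The helper name `retention` is hypothetical; the numbers are copied from the table.

```python
def retention(scores):
    """Percent of the full-label score kept at each label fraction."""
    full = scores["100%"]
    return {frac: round(100 * s / full) for frac, s in scores.items()}


# Scores copied from the label-efficiency table.
cmapss1 = {"100%": 0.786, "10%": 0.772, "5%": 0.730, "2%": 0.724, "1%": 0.670}
cmapss3 = {"100%": 0.853, "10%": 0.830, "5%": 0.709, "2%": 0.635, "1%": 0.513}

print(retention(cmapss1))
# {'100%': 100, '10%': 98, '5%': 93, '2%': 92, '1%': 85}
print(retention(cmapss3))
# {'100%': 100, '10%': 97, '5%': 83, '2%': 74, '1%': 60}
```

Under that definition the 92% figure at 2% labels holds for C-MAPSS-1, while C-MAPSS-3 retains 74% at the same fraction.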

Why it works

Proposition 1 formalises the recipe: under mild assumptions, the event information retained by the frozen encoder is lower-bounded by the event information in the target representation, minus a penalty that scales linearly with the pretraining loss ε. The bound makes a falsifiable prediction: lower pretraining loss should imply stronger downstream performance. Across 17 datasets spanning industrial, environmental, medical, weather, energy, and finance domains, the empirical trend matches (slope = −2.3).
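One way to write down the shape of the bound described above. The notation is assumed for illustration and is not taken from the paper: E is the event label, Z the frozen encoder's representation, Z* the target representation, I(·;·) a mutual-information-style measure of event information, ε the pretraining loss, and C > 0 an unspecified constant.

```latex
% Sketch only: symbols and the constant C are assumptions, not the paper's statement.
I(Z; E) \;\ge\; I(Z^{\star}; E) \;-\; C\,\varepsilon
```

Read this way, driving ε toward zero tightens the guarantee toward the event information in the target representation, which is the direction of the falsifiable prediction: lower pretraining loss, stronger downstream performance.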

The visual signature is striking. After self-supervised pretraining on C-MAPSS-1, the encoder organises its latent space into a smooth degradation manifold without ever seeing a label — the first principal component captures 61% of variance and correlates with time-to-failure. The 10 longest-lived engines all converge toward a shared failure region. The downstream predictor only has to map these states to event probabilities, which is why so few labels suffice.

Getting started

Coming soon. Codebase, pretrained checkpoints, per-seed results, and the paper PDF will be released here once ready.

Citation

@misc{petersen2026hepa,
  title  = {HEPA: A Self-Supervised Horizon-Conditioned
            Event Predictive Architecture for Time Series},
  author = {Petersen, Jonas and Lombardi, Gian-Alessandro and
            Maggioni, Riccardo and Mazzoleni, Camilla and
            Martelli, Federico and Petersen, Philipp},
  year   = {2026}
}