2026 · Study · arXiv preprint

RASA

GFM @ ICML
Tabula RASA · What structural inductive bias helps transformers reason over knowledge graphs?

A controlled ablation of four candidate structural signals for graph-aware attention. Across MetaQA, WebQSP, and CWQ, sparse adjacency masking alone delivers the dominant share of the gain over a standard transformer (+72.5pp / +45.5pp / +53.9pp), while learned relation parameters add modest refinement, and sometimes hurt without it. The useful inductive bias for multi-hop knowledge-graph reasoning is predominantly topological, not relational.

Jonas Petersen, Camilla Mazzoleni, Riccardo Maggioni · Forgis · University of Cambridge · ETH Zürich
92.6% · MetaQA 3-hop Hits@1
+72.5pp · mask vs vanilla
4.0× · less zero-shot decay
91% · of gain from mask alone

Overview

Transformers are remarkable function approximators in language, vision, and code, but they fail systematically on tasks that require following chains of relationships through structured data. A sizeable literature has responded by injecting graph structure into attention: spatial encodings, spectral positional features, hybrid local–global attention, structure-aware preprocessing. Each method couples several design choices, making it difficult to tell which ingredient is carrying the improvement.

The question we investigate is not "how do we design a better graph transformer?" but rather: among the structural signals a transformer can be given about a knowledge graph, which ones actually matter, and by how much?

To answer it we build a minimal vehicle, RASA (Relation-Aware Sparse Attention), in which four candidate structural signals are exposed as independently removable components: a binary adjacency mask, a learnable edge-type bias, a relation-specific query scale, and a relation-specific value gate. RASA itself is not the contribution; it is the experimental apparatus.

TL;DR. Across MetaQA, WebQSP, and CWQ, the dominant factor is almost always the binary adjacency mask. On MetaQA 3-hop, masking alone recovers 91% of the full model's improvement over an unmasked transformer (about 85% on WebQSP and 94% on CWQ, computed from the ablation table); in absolute terms the mask contributes +72.5pp / +45.5pp / +53.9pp. On CWQ, learned biases without masking perform below the vanilla transformer (2.1% vs 2.7%). A zero-shot experiment with held-out relations corroborates this architecturally: masking-based attention degrades 4.0× less than R-GCN's relation-specific weights. The useful inductive bias for multi-hop KGQA is predominantly topological, not relational.

Why depth, and hence masking, matters

Each attention layer aggregates value vectors but cannot compose paths within a single step: it can route B's features to A, but only after layer 1 updates B with C's features can layer 2 propagate the composed A → B → C representation back to A. Each layer therefore extends a node's effective receptive field by at most one hop, and k-hop reachability for k growing with the input requires Ω(k) layers. Among RASA's four signals, only the adjacency mask directly enforces the one-hop-per-layer propagation this argument demands; bias, scale, and gate all reshape scores after the attention pattern is already set.
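The one-hop-per-layer argument can be illustrated with a small reachability sketch. This is not the paper's code: the chain graph, the function name `receptive_field`, and the convention that `adj[x]` lists the nodes x may attend to are all assumptions made for illustration.

```python
def receptive_field(adj, source, num_layers):
    """Nodes whose features can influence `source` after `num_layers` masked layers.

    adj[x] is the set of nodes that x may attend to (its graph neighbours).
    Self-attention is always allowed, so the field never shrinks.
    """
    reached = {source}
    for _ in range(num_layers):
        # One layer: every node already in the field pulls in its direct neighbours.
        reached = reached | {nbr for node in reached for nbr in adj[node]}
    return reached


# Hypothetical chain graph A -> B -> C -> D ("x -> y" means x attends to y).
adj = {"A": {"B"}, "B": {"C"}, "C": {"D"}, "D": set()}

fields = {k: receptive_field(adj, "A", k) for k in range(4)}
for k, field in fields.items():
    print(k, sorted(field))
```

Under these assumptions, node D only enters A's receptive field at the third layer, which is the sense in which a k-hop question needs at least k masked layers.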

Four candidate structural signals

RASA exposes four signals as independently removable knobs. Only the first restricts which positions can attend to which; the other three re-weight attention scores or attention weights without changing which positions are reachable.

1 · Mask · Topological

Sparse adjacency masking

A binary mask A restricts attention to graph-adjacent positions plus self-attention. Non-edges receive −∞ scores. This is the only component that enforces one-hop-per-layer propagation, and the only one our ablation finds to dominate.

2 · Bias · Relational

Edge-type embeddings

A learnable scalar b_r per relation type and head, added to attention scores between connected nodes. Lets the model express relation-specific preferences for born_in, acted_in, directed_by.

3 · Scale · Relational

Relation-specific query scaling

A learnable scalar s_r that modulates attention score intensity per relation type. Sharpens or softens the softmax over edges of each type independently.

4 · Gate · Relational

Relation-specific value gating

A learned gate g_r ∈ [0, 1] per relation type that controls how much information flows through each edge type after the attention weights are computed.

The change, in code

The full architectural delta is four extra lines over standard scaled dot-product attention. Each can be switched on or off independently — the controlled apparatus the ablation needs.

# Standard attention
S = Q @ K.T / sqrt(d)
output = softmax(S) @ V

# RASA attention (four independently removable signals)
S = Q @ K.T / sqrt(d)
S.masked_fill_(~adj, -inf)       # 1) mask  (topological)
S += bias[edge_types]            # 2) bias  (relational)
S *= scale[edge_types]           # 3) scale (relational)
W = softmax(S)
W *= sigmoid(gate[edge_types])   # 4) gate  (relational)
output = W @ V
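For readers who want to run the snippet above, here is a dependency-free sketch of the same four signals for a single head, written with plain Python lists rather than tensors. Several details are assumptions rather than the authors' implementation: the function name `rasa_attention`, the `use_*` switches, relation id 0 standing for "no edge or self-loop", and the toy three-node chain. The sketch skips masked pairs before applying bias and scale, which matches the order above only when the scale is positive (multiplying −∞ by a negative scale would flip its sign).

```python
import math


def softmax(row):
    """Numerically stable softmax; -inf entries receive exactly zero weight."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rasa_attention(Q, K, V, adj, edge_types, bias, scale, gate,
                   use_mask=True, use_bias=True, use_scale=True, use_gate=True):
    """Single-head attention with four independently switchable signals.

    Q, K, V: n x d lists of floats. adj: n x n booleans (True = may attend,
    diagonal True). edge_types: n x n relation ids indexing bias/scale/gate.
    """
    n, d = len(Q), len(Q[0])
    output = []
    for i in range(n):
        scores = []
        for j in range(n):
            if use_mask and not adj[i][j]:
                scores.append(-math.inf)          # 1) mask: non-edges get -inf
                continue
            s = sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
            r = edge_types[i][j]
            if use_bias:
                s += bias[r]                      # 2) bias
            if use_scale:
                s *= scale[r]                     # 3) scale
            scores.append(s)
        weights = softmax(scores)
        if use_gate:                              # 4) gate, applied after softmax
            weights = [w * sigmoid(gate[edge_types[i][j]])
                       for j, w in enumerate(weights)]
        output.append([sum(weights[j] * V[j][c] for j in range(n))
                       for c in range(len(V[0]))])
    return output


# Hypothetical 3-node chain 0 -> 1 -> 2 with one relation type (id 1).
adj = [[True, True, False], [False, True, True], [False, False, True]]
edge_types = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
Q = K = [[0.0, 0.0] for _ in range(3)]            # zero scores isolate the mask
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
params = dict(bias=[0.0, 0.0], scale=[1.0, 1.0], gate=[0.0, 0.0])

off = dict(use_bias=False, use_scale=False, use_gate=False)
mask_only = rasa_attention(Q, K, V, adj, edge_types, **params, **off)
vanilla = rasa_attention(Q, K, V, adj, edge_types, **params, use_mask=False, **off)
```

With one-hot values, each output row is just the attention distribution: the mask-only row for node 0 splits its weight between itself and node 1, while the vanilla row spreads it evenly over all three nodes.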

Main results across three benchmarks

We evaluate on MetaQA, WebQSP, and CWQ under a matched protocol (DistilBERT encoder, 3-layer attention, identical answer scorer across baselines). The four-component RASA is competitive with graph-native baselines under matched compute. The point of this table is not a leaderboard win — LLM-augmented systems like SubgraphRAG reach 90.1% on WebQSP and operate in a different regime; what this table establishes is that the ablation findings below are not an artifact of a weak baseline.

| Model | MetaQA 3-hop | WebQSP | CWQ |
| --- | --- | --- | --- |
| Vanilla Transformer | 12.9 | 18.7 | 2.7 |
| R-GCN | 91.9 | 65.7 | 58.2 |
| Graphormer | 93.3 | 74.0 | 64.7 |
| RASA (ours) | 92.6 | 72.5 | 59.9 |

The staircase: mask alone explains it

Holding the same encoder fixed, we ablate the four components in two informative configurations: Mask only (drops bias, scale, gate) and Bias only (drops the mask — full dense attention with learned per-relation biases). The pattern replicates across all three datasets: masking is the staircase step that carries most of the gain.

| Variant | MetaQA 3-hop | WebQSP | CWQ |
| --- | --- | --- | --- |
| Vanilla Transformer | 12.9 | 18.7 | 2.7 |
| Bias only (no mask) | 12.9 | 49.1 | 2.1 |
| Mask only (no relation params) | 85.4 | 64.2 | 56.6 |
| Full RASA (mask + bias + scale + gate) | 92.6 | 72.5 | 59.9 |

Masking alone contributes +72.5pp / +45.5pp / +53.9pp over the vanilla transformer; the three learned relation parameters together add only +7.2pp / +8.3pp / +3.3pp on top. The CWQ datapoint is the sharpest: Bias only (2.1%) actually drops below Vanilla (2.7%) — without structural guidance, learned biases on dense attention act as noise the model cannot productively use.
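The arithmetic behind these figures can be reproduced directly from the ablation table above. The short script below is only a bookkeeping sketch; the dictionary layout and variable names are illustrative, and the numbers are copied from the table.

```python
# Accuracy in %, copied from the ablation table above.
results = {
    "MetaQA 3-hop": {"vanilla": 12.9, "mask_only": 85.4, "full": 92.6},
    "WebQSP":       {"vanilla": 18.7, "mask_only": 64.2, "full": 72.5},
    "CWQ":          {"vanilla": 2.7,  "mask_only": 56.6, "full": 59.9},
}

summary = {}
for dataset, r in results.items():
    mask_gain = r["mask_only"] - r["vanilla"]        # gain from masking alone
    relation_gain = r["full"] - r["mask_only"]       # extra gain from bias + scale + gate
    share = mask_gain / (r["full"] - r["vanilla"])   # fraction of total gain due to the mask
    summary[dataset] = (round(mask_gain, 1), round(relation_gain, 1), round(100 * share))
    print(f"{dataset}: mask +{mask_gain:.1f}pp, "
          f"relations +{relation_gain:.1f}pp, mask share {share:.0%}")
```

The mask's share of the total gain comes out at roughly 91% on MetaQA 3-hop, 85% on WebQSP, and 94% on CWQ, so the headline 91% figure corresponds to MetaQA 3-hop.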

Zero-shot: topological masks transfer, relational weights do not

If the useful bias is topological rather than relational, an architecture whose primary bias is topological should generalise better to unseen relation types than one whose primary bias is relational. We train with selected relation types removed from the knowledge graph and evaluate on the full graph at test time. The asymmetry is striking and architecturally independent of the ablation evidence.

| Model | Full KG | −starred (Δ) | −dir, −starred (Δ) |
| --- | --- | --- | --- |
| R-GCN | 78.3 | 57.6 (−20.7) | 49.1 (−29.2) |
| RASA (ours) | 59.2 | 53.7 (−5.5) | 52.0 (−7.2) |

R-GCN's relation-specific weight matrices cannot generalise to unseen r and collapse to essentially random projections. Sparse adjacency masking, by contrast, operates on graph structure rather than relation identity: new edge types still provide connectivity information that guides attention even when their learned bias terms are zero. Same conclusion as the ablation, reached by varying the data distribution instead of the architecture.
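The size of the asymmetry can be checked from the table's Δ columns. The sketch below simply divides R-GCN's drop by RASA's drop in each held-out setting; the dictionary keys are illustrative labels, and reading the headline 4.0× as the two-relation setting is an inference from the numbers, not something the table states.

```python
# Accuracy drops in percentage points, copied from the zero-shot table above.
drops = {
    "-starred":       {"R-GCN": 20.7, "RASA": 5.5},
    "-dir, -starred": {"R-GCN": 29.2, "RASA": 7.2},
}

ratios = {setting: d["R-GCN"] / d["RASA"] for setting, d in drops.items()}
for setting, ratio in ratios.items():
    print(f"{setting}: R-GCN degrades {ratio:.2f}x more than RASA")
```

This gives ratios of about 3.8 and 4.1. Note that the comparison is about degradation, not absolute accuracy: R-GCN is still ahead with one relation held out (57.6 vs 53.7), and RASA only overtakes it when two are held out (52.0 vs 49.1).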

Attention concentration

Sparse masking turns near-uniform attention into a strongly structured pattern: normalised attention entropy drops from 0.89 (standard transformer) to ~0.30 across all RASA configurations — a 3× reduction that mirrors the theoretical search-space prediction (O(2) → O(2m)).

| Model | L0 | L1 | L2 | H ⁄ log(n) |
| --- | --- | --- | --- | --- |
| Vanilla Transformer | 2.41 | 2.38 | 2.35 | 0.89 |
| RASA · 1-hop | 0.64 | 0.48 | 0.54 | 0.31 |
| RASA · 2-hop | 0.70 | 0.70 | 0.73 | 0.32 |
| RASA · 3-hop | 0.79 | 0.81 | 0.82 | 0.29 |
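As a rough illustration of the H ⁄ log(n) column, the sketch below computes the normalised entropy of a single attention row. It assumes natural logarithms and normalisation by the number of positions n, following the column header; the function name and the toy rows are made up, and the paper's exact averaging over heads, layers, and queries is not specified here.

```python
import math


def normalised_entropy(weights):
    """Shannon entropy of one attention row, divided by log(n) so it lies in [0, 1]."""
    n = len(weights)
    entropy = -sum(w * math.log(w) for w in weights if w > 0)  # 0 * log 0 treated as 0
    return entropy / math.log(n)


n = 8
dense_row = [1 / n] * n                     # unmasked: uniform over all 8 positions
masked_row = [0.5, 0.5] + [0.0] * (n - 2)   # masked: uniform over self + one neighbour

dense_h = normalised_entropy(dense_row)
masked_h = normalised_entropy(masked_row)
print(f"dense: {dense_h:.2f}, masked: {masked_h:.2f}")
```

In this toy case the dense row sits at the maximum of 1.0, while restricting attention to two of eight positions gives log 2 ⁄ log 8 = 1/3, the same order of reduction as the table reports.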

Limitations and takeaways

The study uses purely graph-structural inputs and a frozen text encoder; LLM-augmented systems remain a different and stronger regime. Strict masking can hurt on incomplete graphs where adaptive sparsity would be preferable, and our dense adjacency-matrix construction is ~6× slower than vanilla attention (custom sparse kernels would close this). The architectural conclusion, however, is clear and replicates across three benchmarks and two independent experimental designs: for multi-hop reasoning over knowledge graphs, give the model the topology; let it learn the rest.

Citation

@article{petersen2026tabularasa,
  title  = {What Structural Inductive Bias Helps Transformers
            Reason Over Knowledge Graphs? A Study with Tabula RASA},
  author = {Petersen, Jonas and Mazzoleni, Camilla and
            Maggioni, Riccardo},
  journal = {arXiv preprint arXiv:2602.02834},
  year   = {2026}
}