92.6% · MetaQA 3-hop Hits@1
+72.5 pp · mask vs vanilla
4.0× · less zero-shot decay
91% · of gain from mask alone
Overview
Transformers are remarkable function approximators in language, vision, and
code, but they fail systematically on tasks that require following chains of
relationships through structured data. A sizeable literature has responded
by injecting graph structure into attention: spatial encodings, spectral
positional features, hybrid local–global attention, structure-aware
preprocessing. Each method couples several design choices, making it
difficult to tell which ingredient is carrying the improvement.
The question we investigate is not
"how do we design a better graph transformer?" but rather:
among the structural signals a transformer can be given about a
knowledge graph, which ones actually matter, and by how much?
To answer it we build a minimal vehicle, RASA (Relation-Aware Sparse
Attention), in which four candidate structural signals are exposed as
independently removable components: a binary adjacency mask, a
learnable edge-type bias, a relation-specific query scale, and a
relation-specific value gate. RASA itself is not the contribution; it is the
experimental apparatus.
TL;DR. Across MetaQA, WebQSP, and CWQ, the dominant factor
is almost always the binary adjacency mask. Masking alone adds
+72.5pp / +45.5pp / +53.9pp over an unmasked transformer, which on
MetaQA 3-hop is 91% of the full model's improvement. On CWQ, learned biases
without masking perform below the vanilla transformer
(2.1% vs 2.7%). A zero-shot experiment with held-out relations corroborates
this from an independent direction: masking-based attention degrades
4.0× less than R-GCN's relation-specific weights. The
useful inductive bias for multi-hop KGQA is predominantly
topological, not relational.
Why depth, and hence masking, matters
Each attention layer aggregates value vectors but cannot
compose paths within a single step: it can route B's
features to A, but only after layer 1 updates B with
C's features can layer 2 propagate the composed
A → B → C representation back to A. Each layer therefore
extends a node's effective receptive field by at most one hop, and
k-hop reachability for k growing with the input requires
Ω(k) layers. Among RASA's four signals, only the adjacency mask
directly enforces the one-hop-per-layer propagation this argument
demands; bias, scale, and gate only re-weight attention within
whatever pattern is already allowed.
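The one-hop-per-layer argument can be checked on a toy graph. The sketch below is an illustration only: the four-node path graph and the helper names are hypothetical, and it tracks which nodes can influence a target rather than running real attention.

```python
# Hypothetical 4-node path graph A - B - C - D, treated as undirected.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "D")]


def allowed_positions(nodes, edges):
    """Map each node to the set it may attend to: its neighbours plus itself."""
    allowed = {n: {n} for n in nodes}
    for u, v in edges:
        allowed[u].add(v)
        allowed[v].add(u)
    return allowed


def receptive_field(allowed, start, num_layers):
    """Nodes whose input features can reach `start` after `num_layers` masked layers."""
    field = {start}
    for _ in range(num_layers):
        # One masked layer: every node already in the field pulls from its neighbours.
        field = set().union(*(allowed[n] for n in field))
    return field


allowed = allowed_positions(nodes, edges)
for k in range(4):
    print(k, sorted(receptive_field(allowed, "A", k)))
```

Under the mask, node D only enters A's receptive field at the third layer, which is the Ω(k) dependence on depth described above.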
Four candidate structural signals
RASA exposes four signals as independently removable knobs. Only the first
restricts which positions can attend to which; the other three
re-weight attention scores or weights between positions that are already
allowed to interact.
1 · Mask · Topological
Sparse adjacency masking
A binary mask A restricts attention to graph-adjacent positions plus
self-attention. Non-edges receive −∞ scores. This is the only
component that enforces one-hop-per-layer propagation, and the only
one our ablation finds to dominate.
2 · Bias · Relational
Edge-type embeddings
A learnable scalar b_r per relation type and head, added
to attention scores between connected nodes. Lets the model express
relation-specific preferences for born_in,
acted_in, directed_by.
3 · Scale · Relational
Relation-specific query scaling
A learnable scalar s_r that modulates attention score
intensity per relation type. Sharpens or softens the softmax over
edges of each type independently.
4 · Gate · Relational
Relation-specific value gating
A learned gate g_r ∈ [0, 1] per relation type that
controls how much information flows through each edge type after the
attention weights are computed.
The change, in code
The full architectural delta is four extra lines over standard scaled
dot-product attention. Each can be switched on or off independently — the
controlled apparatus the ablation needs.
# Standard attention
S = Q @ K.T / sqrt(d)
output = softmax(S) @ V

# RASA attention (four independently removable signals)
S = Q @ K.T / sqrt(d)
S.masked_fill_(~adj, -inf)        # 1) mask (topological)
S += bias[edge_types]             # 2) bias (relational)
S *= scale[edge_types]            # 3) scale (relational)
W = softmax(S)
W *= sigmoid(gate[edge_types])    # 4) gate (relational)
output = W @ V
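The pseudocode above can be made concrete as a small NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation. It is single-head, it reserves relation id 0 for self-loops and for pairs with no edge, and it applies the mask last so that masked entries stay at −inf whatever the sign of the scale (for positive scales this yields the same scores as masking first). The name `rasa_attention`, the `use_*` flags, and the toy graph are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; rows must contain at least one finite entry."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def rasa_attention(Q, K, V, adj, edge_types, bias, scale, gate,
                   use_mask=True, use_bias=True, use_scale=True, use_gate=True):
    """Single-head attention with four independently removable signals.

    adj:        (n, n) bool, True where attention is allowed (edges + self-loops).
    edge_types: (n, n) int relation id per pair; id 0 = no edge / self-loop
                (an assumption of this sketch).
    bias, scale, gate: (num_relations,) per-relation parameters.
    """
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)
    if use_bias:
        S = S + bias[edge_types]              # 2) bias (relational)
    if use_scale:
        S = S * scale[edge_types]             # 3) scale (relational)
    if use_mask:
        S = np.where(adj, S, -np.inf)         # 1) mask (topological), applied last
    W = softmax(S)
    if use_gate:
        W = W * (1.0 / (1.0 + np.exp(-gate[edge_types])))  # 4) gate (relational)
    return W @ V


# Toy graph: edges 0-1 (relation 1) and 1-2 (relation 2); node 3 is isolated.
rng = np.random.default_rng(0)
n, d, num_rel = 4, 8, 3
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
edge_types = np.zeros((n, n), dtype=int)
edge_types[0, 1] = edge_types[1, 0] = 1
edge_types[1, 2] = edge_types[2, 1] = 2
adj = (edge_types > 0) | np.eye(n, dtype=bool)
bias, scale, gate = np.zeros(num_rel), np.ones(num_rel), np.zeros(num_rel)

off = dict(use_mask=False, use_bias=False, use_scale=False, use_gate=False)
vanilla_out = rasa_attention(Q, K, V, adj, edge_types, bias, scale, gate, **off)
mask_only_out = rasa_attention(Q, K, V, adj, edge_types, bias, scale, gate,
                               **{**off, "use_mask": True})
full_out = rasa_attention(Q, K, V, adj, edge_types, bias, scale, gate)
```

With every flag off the function reduces to standard scaled dot-product attention. Turning on `use_mask` alone corresponds to a mask-only configuration, in which the isolated node 3 can only attend to itself.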
Main results across three benchmarks
We evaluate on MetaQA, WebQSP, and CWQ under a matched protocol (DistilBERT
encoder, 3-layer attention, identical answer scorer across baselines). The
four-component RASA is competitive with graph-native baselines under matched
compute. The point of this table is not a leaderboard win — LLM-augmented
systems like SubgraphRAG reach 90.1% on WebQSP and operate in a different
regime; what this table establishes is that the ablation findings below are
not an artifact of a weak baseline.
| Model | MetaQA 3-hop | WebQSP | CWQ |
| --- | --- | --- | --- |
| Vanilla Transformer | 12.9 | 18.7 | 2.7 |
| R-GCN | 91.9 | 65.7 | 58.2 |
| Graphormer | 93.3 | 74.0 | 64.7 |
| RASA (ours) | 92.6 | 72.5 | 59.9 |
The staircase: mask alone explains it
Holding the same encoder fixed, we ablate the four components in two
informative configurations: Mask only (drops bias, scale,
gate) and Bias only (drops the mask — full dense attention
with learned per-relation biases). The pattern replicates across all three
datasets: masking is the staircase step that carries most of the gain.
| Variant | MetaQA 3-hop | WebQSP | CWQ |
| --- | --- | --- | --- |
| Vanilla Transformer | 12.9 | 18.7 | 2.7 |
| Bias only (no mask) | 12.9 | 49.1 | 2.1 |
| Mask only (no relation params) | 85.4 | 64.2 | 56.6 |
| Full RASA (mask + bias + scale + gate) | 92.6 | 72.5 | 59.9 |
Masking alone contributes +72.5pp / +45.5pp / +53.9pp over
the vanilla transformer; the three learned relation parameters together add
only +7.2pp / +8.3pp / +3.3pp on top. The CWQ datapoint is the sharpest:
Bias only (2.1%) actually drops below Vanilla (2.7%) —
without structural guidance, learned biases on dense attention act as noise
the model cannot productively use.
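The gains quoted above follow directly from the ablation table. The short script below reproduces the arithmetic; the scores are copied from the table and the variable names are illustrative.

```python
# Scores copied from the ablation table, in the order MetaQA 3-hop, WebQSP, CWQ.
datasets = ["MetaQA 3-hop", "WebQSP", "CWQ"]
vanilla = [12.9, 18.7, 2.7]
mask_only = [85.4, 64.2, 56.6]
full_rasa = [92.6, 72.5, 59.9]

mask_share = {}
for name, v, m, f in zip(datasets, vanilla, mask_only, full_rasa):
    mask_gain = m - v          # gain from the adjacency mask alone
    relational_gain = f - m    # extra gain from bias + scale + gate
    mask_share[name] = mask_gain / (f - v)
    print(f"{name}: mask +{mask_gain:.1f}pp, "
          f"relational +{relational_gain:.1f}pp, "
          f"mask share {mask_share[name]:.0%}")
```

On these numbers the mask accounts for roughly 91% of the total improvement on MetaQA 3-hop, 85% on WebQSP, and 94% on CWQ.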
Zero-shot: topological masks transfer, relational weights do not
If the useful bias is topological rather than relational, an architecture
whose primary bias is topological should generalise better to
unseen relation types than one whose primary bias is relational. We
train with selected relation types removed from the knowledge graph and
evaluate on the full graph at test time. The asymmetry is striking, and
it is evidence independent of the ablation.
| Model | Full KG | −starred (Δ) | −dir, −starred (Δ) |
| --- | --- | --- | --- |
| R-GCN | 78.3 | 57.6 (−20.7) | 49.1 (−29.2) |
| RASA (ours) | 59.2 | 53.7 (−5.5) | 52.0 (−7.2) |
R-GCN's relation-specific weight matrices cannot generalise to unseen
r and collapse to essentially random projections. Sparse adjacency
masking, by contrast, operates on graph structure rather than
relation identity: new edge types still provide connectivity information
that guides attention even when their learned bias terms are zero. Same
conclusion as the ablation, reached by varying the data distribution instead
of the architecture.
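A toy construction shows why an unseen relation still contributes to the mask. It is a sketch under assumptions: edges are treated as undirected, relations missing from training fall back to a zero bias, and the bias values, the relation name `written_by`, and the helper `build_mask_and_bias` are all made up for illustration.

```python
def build_mask_and_bias(num_nodes, triples, learned_bias):
    """Build an attention mask and a per-pair bias from (head, relation, tail) triples.

    mask[u][v] is True when u may attend to v (edges plus self-loops).
    bias[u][v] is the learned bias of the connecting relation, or 0.0 if the
    relation was never seen during training (an assumption of this sketch).
    """
    mask = [[i == j for j in range(num_nodes)] for i in range(num_nodes)]
    bias = [[0.0] * num_nodes for _ in range(num_nodes)]
    for head, relation, tail in triples:
        for u, v in ((head, tail), (tail, head)):  # undirected, by assumption
            mask[u][v] = True                      # topology ignores relation identity
            bias[u][v] = learned_bias.get(relation, 0.0)
    return mask, bias


# Made-up learned biases; "written_by" stands in for a held-out relation.
learned_bias = {"acted_in": 0.8, "directed_by": -0.3}
test_triples = [(0, "acted_in", 1), (1, "written_by", 2)]

mask, bias = build_mask_and_bias(3, test_triples, learned_bias)
print(mask[1][2], bias[1][2])  # edge is attendable, but carries no learned bias
```

The held-out edge between nodes 1 and 2 still opens a path in the mask, so attention can route along it even though nothing relation-specific was learned for it.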
Attention concentration
Sparse masking turns near-uniform attention into a strongly structured
pattern: normalised attention entropy drops from
0.89 (standard transformer) to ~0.30
across all RASA configurations, a 3× reduction that mirrors the theoretical
search-space prediction (O(2^(n²)) → O(2^m)).
| Model | L0 | L1 | L2 | H ⁄ log(n) |
| --- | --- | --- | --- | --- |
| Vanilla Transformer | 2.41 | 2.38 | 2.35 | 0.89 |
| RASA · 1-hop | 0.64 | 0.48 | 0.54 | 0.31 |
| RASA · 2-hop | 0.70 | 0.70 | 0.73 | 0.32 |
| RASA · 3-hop | 0.79 | 0.81 | 0.82 | 0.29 |
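The normalised entropy H ⁄ log(n) can be computed per attention row as below. This is a minimal sketch: it assumes n is the number of positions in the row and uses natural logarithms, and how rows, heads, and layers are averaged in the reported numbers is not specified here.

```python
import math


def normalised_entropy(weights):
    """Shannon entropy of one attention row divided by log(n), with n = len(weights)."""
    n = len(weights)
    if n < 2:
        return 0.0
    # Zero weights (for example masked positions) contribute nothing to the entropy.
    entropy = -sum(w * math.log(w) for w in weights if w > 0.0)
    return entropy / math.log(n)


uniform_row = [0.25, 0.25, 0.25, 0.25]  # dense attention spread over all positions
masked_row = [0.5, 0.5, 0.0, 0.0]       # mask leaves only two attendable positions

print(normalised_entropy(uniform_row))
print(normalised_entropy(masked_row))
```

A uniform row sits at the maximum of 1, and removing positions through the mask lowers the value, which is the direction of the drop reported in the table.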
Limitations and takeaways
The study uses purely graph-structural inputs and a frozen text encoder;
LLM-augmented systems remain a different and stronger regime. Strict masking
can hurt on incomplete graphs where adaptive sparsity would be preferable,
and our dense adjacency-matrix construction is ~6× slower than vanilla
attention (custom sparse kernels would close this). The architectural
conclusion, however, is clear and replicates across three benchmarks and two
independent experimental designs: for multi-hop reasoning over knowledge
graphs, give the model the
topology; let it learn the rest.
Citation
@article{petersen2026tabularasa,
title = {What Structural Inductive Bias Helps Transformers
Reason Over Knowledge Graphs? A Study with Tabula RASA},
author = {Petersen, Jonas and Mazzoleni, Camilla and
Maggioni, Riccardo},
journal = {arXiv preprint arXiv:2602.02834},
year = {2026}
}