🌐 Curvature-Stratified Relational Learning Benchmark

The Post-GCN Decade Revisited

A geometry-aware benchmark showing that relational learning performance is not a universal ranking: model preferences are stable within curvature regimes, but shift sharply across regimes.

14 datasets · 18 representative models · GCNs · GFMs · Non-Euclidean GNNs · Curvature-aware diagnostics

Shuo Wang1,4,*, Xiangyu Wang1,*, Quanxin Wang1, Bolin Wu1, Bokui Wang1, Shunyang Huang1, Boyan Deng1, Haonan Liu1, Ruiyi Fang2, Zhenxiang Xu1,3, Boyu Wang2, Zhao Kang1,†

1 University of Electronic Science and Technology of China  ·  2 Western University  ·  3 Zhejiang University  ·  4 Tsinghua University

* Equal contribution    † Corresponding author

From flat leaderboards to regime-conditioned diagnostics.

CURVBENCH replaces one-size-fits-all graph evaluation with curvature-stratified comparisons, making it visible when a model succeeds because its inductive bias matches the geometry of the data.

1. Measure geometry. Estimate midpoint curvature residuals and graph-level curvature profiles.
2. Stratify data. Group datasets into near-zero, positive, negative, and tail-driven regimes.
3. Compare fairly. Report regime-conditioned rankings, label elasticity, coverage, and feasibility.

Intrinsic geometry organizes model behavior.

[Figure: rank heatmap for GraphSAGE, GraphMoRE, PCNet, and HAT under the flat average and the near-zero, positive, negative, and table views.] Rank shifts appear once evaluation is conditioned on regime.

A project page for CURVBENCH. The design is self-contained in a single index.html file and can be directly deployed with GitHub Pages.

Overview

Why flat graph leaderboards can mislead.

Current relational-learning evaluations often average over heterogeneous datasets. CURVBENCH shows that such aggregation can hide geometry-dependent trade-offs: a model may look globally strong only because the benchmark mixture favors its preferred curvature regime.

🧭

Curvature-aware partitioning

Datasets are grouped by mean curvature and curvature skewness, revealing geometry-dependent model behavior.

📊

Partial-order diagnostics

Top-model rankings are compared within and across regimes to quantify preference stability.

🧪

Broad model suite

The benchmark covers Euclidean GNNs, hyperbolic methods, mixed-curvature models, adaptive Riemannian models, and GFMs.

🔁

Reproducible evaluation

The release is designed around code, splits, curvature computation, model evaluation, and diagnostic tools.

Method

A finite-metric view of graphs.

CURVBENCH treats each graph as a finite metric space and uses a midpoint curvature residual to measure local deviation from Euclidean geometry. Mean curvature captures the average signed profile, while skewness captures asymmetric curvature tails.

Curvature residual and graph-level descriptors

For a center node m, a neighbor pair {b,c}, and an anchor node a, the residual probes whether local graph triangles are fatter, thinner, or close to Euclidean. This gives a discrete signal for stratifying relational datasets.

midpoint residual · mean curvature · skewness · metric distortion

CURVBENCH curvature statistic

\[
\xi_G(a,b,c;m) = \frac{ d_G(a,m)^2 + \tfrac{1}{4}\, d_G(b,c)^2 - \tfrac{1}{2}\left( d_G(a,b)^2 + d_G(a,c)^2 \right) }{ 2\, d_G(a,m) }
\]

\[
\bar{\kappa}(G) = \text{mean node-level relative curvature}, \qquad
\gamma_\kappa(G) = \text{third standardized central moment}
\]
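As a minimal sketch of the residual above, assuming an unweighted graph where d_G is the hop distance, the statistic can be evaluated with breadth-first search. The helper names (`bfs_distances`, `midpoint_residual`, `from_edges`) and the two toy graphs are illustrative and are not taken from the CURVBENCH code.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from `source` in an unweighted graph {node: set(neighbors)}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def midpoint_residual(adj, a, b, c, m):
    """Evaluate xi_G(a, b, c; m) as written above; requires a != m."""
    d_a = bfs_distances(adj, a)
    d_b = bfs_distances(adj, b)
    numerator = d_a[m] ** 2 + 0.25 * d_b[c] ** 2 - 0.5 * (d_a[b] ** 2 + d_a[c] ** 2)
    return numerator / (2 * d_a[m])


def from_edges(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


# Toy graphs (hypothetical): a star centred at m, and a 4-cycle m-b-a-c-m.
star = from_edges([("m", "a"), ("m", "b"), ("m", "c")])
cycle4 = from_edges([("m", "b"), ("b", "a"), ("a", "c"), ("c", "m")])

xi_tree = midpoint_residual(star, "a", "b", "c", "m")
xi_cycle = midpoint_residual(cycle4, "a", "b", "c", "m")
print(xi_tree, xi_cycle)  # -1.0 1.0
```

The tree-like star gives a negative residual and the cycle a positive one, matching the thin-versus-fat triangle reading. In a Euclidean space with m the exact midpoint of b and c, the numerator vanishes by the median-length identity.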
Benchmark

14 datasets, 18 models, four geometric views.

CURVBENCH spans natural graphs and table-derived graphs, then evaluates models through the lens of geometry-conditioned inductive bias rather than a single aggregate score.

14 relational datasets
18 models and GFMs
3+1 curvature views

Curvature Distribution

Curvature distributions of three representative datasets

The histograms below show the node-level curvature distributions for Citeseer (near-zero), Actor (positive), and Disease (negative). These illustrate the geometric basis for our regime classification.

Citeseer (near-zero)
Actor (positive)
Disease (negative)

Dataset regimes used in the webpage summary

Natural graphs + table-derived graphs
| Regime | Representative datasets | Geometric signal | What it tests |
| --- | --- | --- | --- |
| Near-zero | Cora, Citeseer, PubMed | Balanced curvature profile | Whether flat aggregation and spectral filtering are sufficient. |
| Positive | Cornell, Airport, Actor | Compact or clustered geometry | When attributes and local clustering dominate relational structure. |
| Negative | Disease, Telecom, CS_Phds | Hierarchical or tree-like geometry | Whether non-Euclidean or adaptive models reduce metric mismatch. |
| Table-derived | Carcinogenesis, Hepatitis, PTE, Toxicology, F1 | Near-zero mean with strong curvature tails | Specialist–robustness trade-offs hidden by average curvature. |

Key results

Curvature reorganizes model rankings.

The webpage now surfaces the main experimental signals directly: rank consistency, family-by-regime interaction, few-shot GFM behavior, and table-derived graph specialization.

0.503: Top-3 Spearman within-minus-cross regime gap on node classification.
1/280: Exact regrouping significance for the curvature partition.
43.95%: Explained variance due to family-by-regime interaction.
15.95: Average 1-shot→5-shot gain for GFMs on near-zero graphs.
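One way to read the 1/280 figure, offered as an assumption because the page does not spell out the test: the nine natural graphs can be split into three unlabeled groups of three in exactly 280 ways. Under that reading, 1/280 is the smallest p-value an exhaustive regrouping test can return, reached when the curvature partition beats every alternative grouping. The sketch below only verifies the count; `partitions_into_triples` is a hypothetical helper.

```python
from itertools import combinations
from math import factorial

datasets = ["Cora", "Citeseer", "PubMed", "Cornell", "Airport", "Actor",
            "Disease", "Telecom", "CS_Phds"]


def partitions_into_triples(items):
    """Yield every split of `items` into unordered groups of three."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    # Fixing the first item in its group avoids counting reordered groups twice.
    for pair in combinations(rest, 2):
        group = (first, *pair)
        remaining = [x for x in rest if x not in pair]
        for tail in partitions_into_triples(remaining):
            yield [group] + tail


n_partitions = sum(1 for _ in partitions_into_triples(datasets))
closed_form = factorial(9) // (factorial(3) ** 3 * factorial(3))
print(n_partitions, closed_form)  # 280 280
```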

What changes across regimes?

Near-zero graphs favor Euclidean and spectral methods; positive graphs keep Euclidean methods competitive but increasingly expose feature-dominant behavior; negative graphs favor mixed or adaptive Riemannian methods, suggesting that Euclidean failures are regime-specific rather than universal.

GraphSAGE · PCNet · GraphMoRE · QGCN

Rank-shift intuition

A flat average collapses several different preference orders. The mini heatmap below illustrates how the same model can move up or down depending on the curvature view.

[Mini heatmap: rows GraphSAGE, GraphMoRE, HAT, MLP; columns Near-zero, Positive, Negative, Table.]

This is a compact visual summary; full numeric tables are provided below.

Observation 1 — coherent regime orders

Dataset-induced top-model rankings are substantially more consistent within a curvature regime than across regimes.

Observation 2 — inductive biases shift

Euclidean methods dominate near-zero graphs, while mixed and adaptive Riemannian methods become strongest on negative-curvature graphs.

Observation 3 — GFMs remain geometry-conditioned

Few-shot GFMs do not form one universal leaderboard; the leading method changes with the curvature regime and scalability constraints.

Observation 4 — table graphs expose curvature tails

Table-derived graphs show that mean curvature alone is insufficient; skewness and tail mass explain specialist behavior such as HAT on F1.

Experimental tables

Numbers from the paper, placed directly on the page.

The full paper remains the source of record, but the project page now includes the most useful benchmark tables so readers can understand the result pattern without opening the PDF.

Dataset statistics: natural graph regimes
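| Regime | Dataset | Domain | Nodes | Edges | Homophily | Avg Deg. | Features | Classes | Mean Curv. | Skewness |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Near-zero | Cora | Citation | 2,708 | 5,278 | 0.8100 | 3.90 | 1,433 | 7 | 0.00749 | 0.08401 |
| Near-zero | Citeseer | Citation | 3,327 | 4,552 | 0.7355 | 2.74 | 3,703 | 6 | 0.00222 | 0.38363 |
| Near-zero | PubMed | Citation | 19,717 | 44,324 | 0.8024 | 4.50 | 500 | 3 | 0.00678 | 0.43122 |
| Positive | Cornell | Webpage/WebKB | 183 | 298 | 0.1309 | 1.63 | 1,703 | 5 | 0.01050 | 0.81561 |
| Positive | Airport | Transportation | 7,543 | 18,508 | 0.4289 | 4.91 | 7,543 | 4 | 0.00213 | 1.33127 |
| Positive | Actor | Wikipedia | 7,600 | 30,019 | 0.2188 | 3.95 | 932 | 5 | 0.12039 | 1.30001 |
| Negative | Disease | Epidemiological | 1,044 | 1,042 | 0.8752 | 0.998 | 1,000 | 2 | -0.00335 | -1.48057 |
| Negative | Telecom | Telecommunication | 41,143 | 41,424 | 0.5620 | 1.01 | 240 | 3 | -1.14371 | -11.82744 |
| Negative | CS_Phds | Academic/Social | 1,025 | 1,043 | 0.2819 | 2.04 | 16 | 4 | -0.00301 | -1.53958 |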

Curvature regimes are defined using mean curvature κ̄(G) and curvature skewness γκ(G).
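The two descriptors can be sketched as follows, assuming a list of node-level curvature values is already available and using the population (biased) form of the third standardized central moment. Both the estimator choice and the toy values are assumptions; the page does not say which variant CURVBENCH uses.

```python
def mean_and_skewness(values):
    """Return (mean, third standardized central moment) of `values`."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n  # population variance
    m3 = sum((x - mu) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        return mu, 0.0  # constant profile: skewness treated as zero
    return mu, m3 / m2 ** 1.5


# Hypothetical node-level values, not taken from any benchmark graph.
balanced = [-1.0, 0.0, 1.0]
right_tail = [0.0, 0.0, 0.0, 0.0, 4.0]
left_tail = [-x for x in right_tail]

print(mean_and_skewness(balanced))
print(mean_and_skewness(right_tail))
print(mean_and_skewness(left_tail))
```

A symmetric profile has zero skewness, while a few large positive or negative values push the moment away from zero even when the mean stays small. This is the pattern that separates the tail-driven graphs from the near-zero ones.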

Dataset statistics: table-derived graphs
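| Dataset | Domain | #Tables | #Rows | #Cols | #Nodes | #Edges | Avg Deg. | Features | Classes | Mean Curv. | Skewness |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Carcinogenesis | Medicine | 6 | 27,570 | 23 | 28,027 | 8,982 | 0.6410 | 300 | 3 | 0.00034 | 9.42658 |
| Hepatitis | Medicine | 7 | 12,927 | 26 | 12,927 | 13,016 | 2.0138 | 300 | 3 | 0.00024 | 4.21239 |
| PTE | Medicine | 38 | 29,762 | 76 | 29,850 | 18,805 | 1.2600 | 300 | 3 | 0.00031 | 9.74080 |
| Toxicology | Medicine | 4 | 49,239 | 11 | 49,813 | 18,267 | 0.7334 | 300 | 3 | 0.00021 | 12.06911 |
| F1 | Sports | 9 | 97,606 | 77 | 97,606 | 192,560 | 3.9457 | 300 | 40 | 1.11301 | -2.26907 |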

Several medical table-derived graphs have near-zero mean curvature but strong positive skewness, exposing tail-driven geometry.

Node classification performance on natural graphs
| Model | Cora | Citeseer | PubMed | Airport | Cornell | Actor | Disease | Telecom | CS_Phds |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GCN | 80.36±0.71 | 68.68±0.65 | 78.12±0.28 | 79.18±0.98 | 38.37±3.52 | 31.31±0.62 | 83.82±5.58 | 85.85±0.64 | 35.51±2.87 |
| GAT | 80.72±0.70 | 67.50±1.64 | 77.08±0.32 | 82.82±0.78 | 44.32±4.52 | 28.67±0.60 | 90.62±1.41 | 79.73±0.19 | 26.83±0.00 |
| GraphSAGE | 88.30±0.21 | 74.89±0.65 | 88.48±0.05 | 48.80±0.27 | 73.51±3.52 | 32.84±0.56 | 95.60±1.45 | 92.90±3.08 | 26.73±6.15 |
| MLP | 56.12±1.05 | 54.18±0.87 | 71.27±0.38 | 85.07±0.55 | 68.10±2.26 | 37.46±0.62 | 79.90±0.00 | 88.15±0.04 | 26.83±0.00 |
| PCNet | 88.08±0.44 | 75.59±0.25 | 89.97±0.11 | 45.51±0.13 | 61.08±4.52 | 33.45±0.97 | 78.56±0.92 | 87.49±0.04 | 31.51±0.44 |
| HAT | 81.60±0.32 | 70.99±0.28 | 78.74±0.46 | 59.22±5.53 | 36.84±0.03 | 34.64±0.44 | 77.51±0.30 | 87.92±0.02 | 26.82±0.00 |
| HGNN | 78.52±0.63 | 67.62±0.81 | 76.54±0.43 | 83.51±2.47 | 61.08±1.32 | 28.92±0.68 | 77.72±2.15 | 93.16±0.97 | 24.41±2.87 |
| HyboNet | 75.16±0.84 | 70.23±1.20 | 73.58±0.45 | 60.88±4.17 | 36.22±1.06 | 26.67±1.32 | 77.01±4.59 | 62.03±7.32 | 26.73±0.19 |
| HGCN | 76.74±0.78 | 67.22±1.01 | 75.88±0.33 | 60.23±2.20 | 61.08±0.96 | 28.80±0.23 | 77.92±1.56 | 93.16±1.70 | 43.63±2.86 |
| CUSP | 76.94±0.95 | 68.20±1.28 | 66.36±2.31 | 58.65±2.24 | 40.54±1.00 | 24.81±1.26 | 85.79±1.87 | 66.73±5.01 | 29.65±3.47 |
| QGCN | 79.80±0.41 | 67.32±0.26 | 75.90±1.03 | 61.07±0.74 | 54.59±2.02 | 26.74±0.55 | 83.31±1.42 | 98.25±0.05 | 45.39±2.33 |
| GraphMoRE | 81.06±0.33 | 68.30±0.78 | 76.34±1.12 | 90.42±1.32 | 40.54±3.42 | 24.49±0.81 | 96.11±0.77 | 93.40±0.31 | 37.45±2.82 |

Highlighted cells mark the best mean performance in each dataset column.
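A rough sketch of a within-versus-cross regime consistency check on the mean accuracies in the table above. It ranks all 12 models per dataset, averages tied ranks, and computes Spearman correlation as Pearson correlation on those ranks. This is a simplified stand-in: the 0.503 figure reported on this page is a Top-3 variant whose exact definition is in the paper, so the printed gap is not expected to reproduce it. Helper names are hypothetical.

```python
from itertools import combinations

DATASETS = ["Cora", "Citeseer", "PubMed", "Airport", "Cornell", "Actor",
            "Disease", "Telecom", "CS_Phds"]
REGIME = dict(zip(DATASETS, ["near-zero"] * 3 + ["positive"] * 3 + ["negative"] * 3))
MEANS = {  # mean accuracy per dataset, copied from the table above
    "GCN":       [80.36, 68.68, 78.12, 79.18, 38.37, 31.31, 83.82, 85.85, 35.51],
    "GAT":       [80.72, 67.50, 77.08, 82.82, 44.32, 28.67, 90.62, 79.73, 26.83],
    "GraphSAGE": [88.30, 74.89, 88.48, 48.80, 73.51, 32.84, 95.60, 92.90, 26.73],
    "MLP":       [56.12, 54.18, 71.27, 85.07, 68.10, 37.46, 79.90, 88.15, 26.83],
    "PCNet":     [88.08, 75.59, 89.97, 45.51, 61.08, 33.45, 78.56, 87.49, 31.51],
    "HAT":       [81.60, 70.99, 78.74, 59.22, 36.84, 34.64, 77.51, 87.92, 26.82],
    "HGNN":      [78.52, 67.62, 76.54, 83.51, 61.08, 28.92, 77.72, 93.16, 24.41],
    "HyboNet":   [75.16, 70.23, 73.58, 60.88, 36.22, 26.67, 77.01, 62.03, 26.73],
    "HGCN":      [76.74, 67.22, 75.88, 60.23, 61.08, 28.80, 77.92, 93.16, 43.63],
    "CUSP":      [76.94, 68.20, 66.36, 58.65, 40.54, 24.81, 85.79, 66.73, 29.65],
    "QGCN":      [79.80, 67.32, 75.90, 61.07, 54.59, 26.74, 83.31, 98.25, 45.39],
    "GraphMoRE": [81.06, 68.30, 76.34, 90.42, 40.54, 24.49, 96.11, 93.40, 37.45],
}


def average_ranks(scores):
    """Rank 1 = best; tied scores share the average of their positions."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; applied to ranks it gives Spearman's rho."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


ranks = {d: average_ranks([MEANS[m][i] for m in MEANS]) for i, d in enumerate(DATASETS)}
within, cross = [], []
for d1, d2 in combinations(DATASETS, 2):
    rho = pearson(ranks[d1], ranks[d2])
    (within if REGIME[d1] == REGIME[d2] else cross).append(rho)

mean_within = sum(within) / len(within)
mean_cross = sum(cross) / len(cross)
print(f"within={mean_within:.3f} cross={mean_cross:.3f} gap={mean_within - mean_cross:.3f}")
```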

Graph Foundation Models: 1-shot scenario
| Model | Cora | Citeseer | PubMed | Airport | Cornell | Actor | Disease | Telecom | CS_Phds |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GCOPE | 33.19±6.05 | 37.38±7.46 | 41.49±4.35 | 19.22±8.35 | 24.62±9.36 | 24.30±1.85 | 73.08±12.69 | 54.82±13.10 | 26.21±2.11 |
| MDGPT | 44.58±7.83 | 39.04±10.53 | 53.36±10.72 | 18.28±17.07 | 29.26±6.27 | 20.01±4.33 | 52.42±9.43 | 36.56±12.55 | 25.29±2.30 |
| MDGFM | 43.27±7.28 | 41.20±6.31 | 51.52±9.34 | 18.70±5.03 | 35.14±9.02 | 20.74±2.15 | 57.84±10.77 | OOM | 25.56±2.11 |
| SAMGPT | 44.64±14.94 | 36.03±8.41 | 45.24±8.45 | 19.12±9.20 | 33.84±8.54 | 19.72±5.88 | 60.28±11.04 | 45.12±13.49 | 25.36±6.92 |
| GraphGluing | 32.22±1.33 | 28.48±6.59 | 45.90±4.70 | 41.37±2.77 | 32.51±11.25 | 24.10±2.25 | 79.67±0.18 | OOM | 26.15±2.45 |
| SA2GFM | 40.25±8.05 | 29.98±7.81 | 45.79±8.90 | 25.63±5.95 | 20.99±5.45 | 18.53±2.09 | 51.12±13.39 | OOM | 25.92±2.75 |

OOM denotes out-of-memory. Best available mean performance per dataset is highlighted.

Graph Foundation Models: 5-shot scenario
| Model | Cora | Citeseer | PubMed | Airport | Cornell | Actor | Disease | Telecom | CS_Phds |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GCOPE | 61.40±1.88 | 52.42±5.26 | 58.56±1.79 | 20.95±4.32 | 68.03±4.33 | 24.55±2.07 | 79.44±0.58 | 72.16±8.37 | 26.70±1.95 |
| MDGPT | 60.86±4.86 | 58.68±6.93 | 59.86±6.83 | 22.78±10.44 | 44.98±7.18 | 21.28±4.21 | 54.68±9.71 | 38.74±9.13 | 26.86±2.27 |
| MDGFM | 64.93±4.43 | 58.10±4.55 | 65.65±5.30 | 19.92±3.88 | 60.10±7.78 | 21.12±1.67 | 63.55±8.69 | OOM | 26.81±2.42 |
| SAMGPT | 64.62±9.89 | 53.76±5.70 | 56.16±7.27 | 21.28±6.74 | 52.24±6.18 | 19.92±6.24 | 68.32±9.88 | 58.56±11.62 | 27.12±6.40 |
| GraphGluing | 52.52±6.06 | 44.05±2.08 | 66.14±1.71 | 42.46±1.11 | 40.33±10.72 | 23.47±1.75 | 80.42±0.73 | OOM | 26.63±1.67 |
| SA2GFM | 50.91±6.57 | 38.25±4.18 | 53.40±8.94 | 25.95±9.77 | 22.83±7.34 | 19.35±2.96 | 56.77±10.84 | OOM | 26.05±1.87 |

The gain from 1-shot to 5-shot is uneven across regimes, with near-zero graphs benefiting most.
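The per-regime gain can be recomputed from the mean accuracies in the two GFM tables above. The sketch below assumes a plain average over every (model, dataset) cell in a regime and skips OOM entries; the paper's exact averaging convention is not stated on this page.

```python
REGIME_COLUMNS = {"near-zero": [0, 1, 2], "positive": [3, 4, 5], "negative": [6, 7, 8]}
# Column order: Cora, Citeseer, PubMed, Airport, Cornell, Actor, Disease, Telecom, CS_Phds.
# None marks an OOM cell.
ONE_SHOT = {
    "GCOPE":       [33.19, 37.38, 41.49, 19.22, 24.62, 24.30, 73.08, 54.82, 26.21],
    "MDGPT":       [44.58, 39.04, 53.36, 18.28, 29.26, 20.01, 52.42, 36.56, 25.29],
    "MDGFM":       [43.27, 41.20, 51.52, 18.70, 35.14, 20.74, 57.84, None, 25.56],
    "SAMGPT":      [44.64, 36.03, 45.24, 19.12, 33.84, 19.72, 60.28, 45.12, 25.36],
    "GraphGluing": [32.22, 28.48, 45.90, 41.37, 32.51, 24.10, 79.67, None, 26.15],
    "SA2GFM":      [40.25, 29.98, 45.79, 25.63, 20.99, 18.53, 51.12, None, 25.92],
}
FIVE_SHOT = {
    "GCOPE":       [61.40, 52.42, 58.56, 20.95, 68.03, 24.55, 79.44, 72.16, 26.70],
    "MDGPT":       [60.86, 58.68, 59.86, 22.78, 44.98, 21.28, 54.68, 38.74, 26.86],
    "MDGFM":       [64.93, 58.10, 65.65, 19.92, 60.10, 21.12, 63.55, None, 26.81],
    "SAMGPT":      [64.62, 53.76, 56.16, 21.28, 52.24, 19.92, 68.32, 58.56, 27.12],
    "GraphGluing": [52.52, 44.05, 66.14, 42.46, 40.33, 23.47, 80.42, None, 26.63],
    "SA2GFM":      [50.91, 38.25, 53.40, 25.95, 22.83, 19.35, 56.77, None, 26.05],
}


def mean_gain(columns):
    """Average 5-shot minus 1-shot accuracy over all non-OOM cells in `columns`."""
    gains = [
        FIVE_SHOT[model][j] - ONE_SHOT[model][j]
        for model in ONE_SHOT
        for j in columns
        if ONE_SHOT[model][j] is not None and FIVE_SHOT[model][j] is not None
    ]
    return sum(gains) / len(gains)


regime_gain = {regime: mean_gain(cols) for regime, cols in REGIME_COLUMNS.items()}
for regime, gain in regime_gain.items():
    print(f"{regime}: {gain:.2f}")
```

With the rounded means shown here, this gives about 15.93 points for near-zero graphs, 6.97 for positive, and 4.49 for negative. The near-zero value is close to the 15.95 headline figure; the small difference presumably comes from rounding or a different averaging convention.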

Performance on table-derived graphs
| Model | Carcinogenesis | Hepatitis | PTE | Toxicology | F1 |
| --- | --- | --- | --- | --- | --- |
| GCN | 57.27±5.07 | 83.19±0.44 | 79.66±1.82 | 54.78±1.58 | 4.70±0.70 |
| GAT | 60.30±4.59 | 79.80±1.29 | 78.33±3.11 | 52.75±1.29 | 4.25±0.14 |
| GraphSAGE | 65.45±1.27 | 81.80±1.30 | 81.67±0.00 | 55.07±1.02 | 4.10±0.14 |
| MLP | 54.55±0.00 | 70.80±1.78 | 79.00±0.91 | 55.07±0.20 | 3.96±0.40 |
| PCNet | 53.03±2.14 | 84.20±1.92 | 81.00±1.49 | 52.46±0.65 | 3.90±0.27 |
| HGNN | 62.42±2.42 | 66.80±0.40 | 77.00±1.25 | 53.33±2.13 | 4.02±0.46 |
| HAT | 70.84±1.47 | 59.19±0.44 | 85.66±3.02 | 40.57±4.09 | 40.84±5.77 |
| HGCN | 61.21±1.21 | 64.20±0.40 | 65.33±3.86 | 51.59±1.48 | 4.73±0.16 |
| HyboNet | 43.63±1.76 | 67.59±5.38 | 43.33±5.55 | 44.92±5.55 | 4.11±0.21 |
| CUSP | 57.57±7.66 | 80.40±1.20 | 51.66±10.90 | 54.87±0.57 | 5.04±0.25 |
| QGCN | 63.33±2.42 | 67.20±2.23 | 55.33±1.25 | 53.04±2.35 | 4.47±0.50 |
| GraphMoRE | 54.55±5.07 | 81.00±1.67 | 78.33±1.05 | 53.91±0.58 | 4.16±0.12 |

HAT behaves as a high-variance specialist: strong on several tail-driven cases, especially F1, but weaker on Hepatitis and Toxicology.
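The specialist pattern shows up directly in HAT's per-dataset rank among the 12 models. The sketch below uses the mean accuracies from the table above; `rank_of` is an illustrative helper in which rank 1 means the highest mean.

```python
DATASETS = ["Carcinogenesis", "Hepatitis", "PTE", "Toxicology", "F1"]
MEANS = {  # mean performance per dataset, copied from the table above
    "GCN":       [57.27, 83.19, 79.66, 54.78, 4.70],
    "GAT":       [60.30, 79.80, 78.33, 52.75, 4.25],
    "GraphSAGE": [65.45, 81.80, 81.67, 55.07, 4.10],
    "MLP":       [54.55, 70.80, 79.00, 55.07, 3.96],
    "PCNet":     [53.03, 84.20, 81.00, 52.46, 3.90],
    "HGNN":      [62.42, 66.80, 77.00, 53.33, 4.02],
    "HAT":       [70.84, 59.19, 85.66, 40.57, 40.84],
    "HGCN":      [61.21, 64.20, 65.33, 51.59, 4.73],
    "HyboNet":   [43.63, 67.59, 43.33, 44.92, 4.11],
    "CUSP":      [57.57, 80.40, 51.66, 54.87, 5.04],
    "QGCN":      [63.33, 67.20, 55.33, 53.04, 4.47],
    "GraphMoRE": [54.55, 81.00, 78.33, 53.91, 4.16],
}


def rank_of(model, column):
    """1 + number of models with a strictly higher mean in this column."""
    target = MEANS[model][column]
    return 1 + sum(1 for scores in MEANS.values() if scores[column] > target)


hat_ranks = {dataset: rank_of("HAT", j) for j, dataset in enumerate(DATASETS)}
print(hat_ranks)
# {'Carcinogenesis': 1, 'Hepatitis': 12, 'PTE': 1, 'Toxicology': 12, 'F1': 1}
```

HAT is either first or last on every table-derived dataset, which is the behavior a flat average across the five graphs would blur.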

Resources

Code, data, and reproducibility.

The GitHub repository is linked. Dataset and model links can be activated once the Hugging Face releases are public.

💻

Code

Training, evaluation, curvature computation, and diagnostic scripts.

🤗

Dataset splits

Curvature-stratified data partitions and table-derived graph construction files. Available on Hugging Face.

🚀

Models and logs

Optional checkpoints, precomputed features, and logs for reproducible comparison.

Citation

Cite CURVBENCH.

Please cite our work if you find the benchmark, splits, or diagnostic framework useful.

@misc{wang2026postgcndecaderevisited,
  title  = {The Post-GCN Decade Revisited: Curvature-Stratified Evaluation of Relational Learning},
  author = {Wang, Shuo and Wang, Xiangyu and Wang, Quanxin and Wu, Bolin and Wang, Bokui and Huang, Shunyang and Deng, Boyan and Liu, Haonan and Fang, Ruiyi and Xu, Zhenxiang and Wang, Boyu and Kang, Zhao},
  year   = {2026},
  note   = {Preprint}
}