🌐 Curvature-Stratified Relational Learning Benchmark

The Post-GCN Decade Revisited

A geometry-aware benchmark showing that relational learning has no universal model ranking: model preferences are stable within curvature regimes but shift sharply across them.

14 datasets | 18 representative models | GCNs · GFMs · Non-Euclidean GNNs | Curvature-aware diagnostics

Shuo Wang¹,²,*, Xiangyu Wang¹,*, Quanxin Wang¹, Bailin Wu¹, Bokui Wang¹, Shunyang Huang¹, Boyan Deng¹, Haonan Liu¹, Ruiyi Fang³, Zhenxiang Xu¹,⁴, Boyu Wang³, Zhao Kang¹,†

¹ University of Electronic Science and Technology of China · ² Tsinghua University · ³ Western University · ⁴ Zhejiang University

* Equal contribution † Corresponding author

📄 arXiv · 🧾 PDF · 💻 Code · 🤗 Dataset · 🚀 Models (soon)

From flat leaderboards to regime-conditioned diagnostics.

CURVBENCH replaces one-size-fits-all graph evaluation with curvature-stratified comparisons, making it visible when a model succeeds because its inductive bias matches the geometry of the data.

1. Measure geometry: Estimate midpoint curvature residuals and graph-level curvature profiles.

2. Stratify data: Group datasets into near-zero, positive, negative, and tail-driven regimes.

3. Compare fairly: Report regime-conditioned rankings, label elasticity, coverage, and feasibility.

Intrinsic geometry organizes model behavior.

Rank shifts appear once evaluation is conditioned on regime.

near-zero · positive · negative · tail-driven


Overview

Why flat graph leaderboards can mislead.

Current relational-learning evaluations often average over heterogeneous datasets. CURVBENCH shows that such aggregation can hide geometry-dependent trade-offs: a model may look globally strong only because the benchmark mixture favors its preferred curvature regime.

🧭

Curvature-aware partitioning

Datasets are grouped by mean curvature and curvature skewness, revealing geometry-dependent model behavior.

📊

Partial-order diagnostics

Top-model rankings are compared within and across regimes to quantify preference stability.

🧪

Broad model suite

The benchmark covers Euclidean GNNs, hyperbolic methods, mixed-curvature models, adaptive Riemannian models, and GFMs.

🔁

Reproducible evaluation

The release covers code, data splits, curvature computation, model evaluation, and diagnostic tools.

Method

A finite-metric view of graphs.

CURVBENCH treats each graph as a finite metric space and uses a midpoint curvature residual to measure local deviation from Euclidean geometry. Mean curvature captures the average signed profile, while skewness captures asymmetric curvature tails.

Curvature residual and graph-level descriptors

For a center node m, a neighbor pair {b,c}, and an anchor node a, the residual probes whether local graph triangles are fatter, thinner, or close to Euclidean. This gives a discrete signal for stratifying relational datasets.
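As a reference point for reading the residual, recall the Euclidean median identity (Apollonius's theorem). This is a standard fact, not something specific to CURVBENCH: if m is the exact midpoint of b and c in Euclidean space, then for any point a

```latex
\|a-m\|^2 \;=\; \tfrac{1}{2}\left(\|a-b\|^2 + \|a-c\|^2\right) \;-\; \tfrac{1}{4}\,\|b-c\|^2 .
```

The numerator of the CURVBENCH curvature statistic is the gap between the two sides of this identity with graph distances substituted, so a residual of zero corresponds to a Euclidean-like configuration. Under this reading, and assuming d_G is the hop-count shortest-path distance, a star with center m and leaves a, b, c gives d_G(a,m) = 1 and all leaf-to-leaf distances equal to 2, so the numerator is 1 + 1 − 4 = −2 and the residual is −1 (thin, tree-like). On the path a, b, m, c the distances are d_G(a,m) = 2, d_G(b,c) = 2, d_G(a,b) = 1, d_G(a,c) = 3, so the numerator is 4 + 1 − 5 = 0 (flat).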

midpoint residual · mean curvature · skewness · metric distortion

CURVBENCH curvature statistic

\[
\xi_G(a,b,c;m) = \frac{d_G(a,m)^2 + \tfrac{1}{4}\, d_G(b,c)^2 - \tfrac{1}{2}\left( d_G(a,b)^2 + d_G(a,c)^2 \right)}{2\, d_G(a,m)}
\]

κ̄(G) = mean node-level relative curvature
γ_κ(G) = third standardized central moment
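A minimal sketch of the residual and the skewness descriptor, under stated assumptions: d_G is taken to be the unweighted hop-count distance, graphs are plain adjacency dictionaries, and the function names (`bfs_distances`, `midpoint_residual`, `skewness`) are hypothetical rather than taken from the CURVBENCH code. How residuals are aggregated into the node-level relative curvature is not specified on this page, so the sketch stops at the per-configuration residual and the third standardized moment.

```python
from collections import deque
from statistics import mean, pstdev

def bfs_distances(adj, source):
    """Unweighted shortest-path (hop) distances from `source` (assumed metric)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def midpoint_residual(adj, a, b, c, m):
    """xi_G(a,b,c;m) as written in the formula above; requires a != m."""
    d_a = bfs_distances(adj, a)
    d_b = bfs_distances(adj, b)
    numerator = d_a[m] ** 2 + 0.25 * d_b[c] ** 2 - 0.5 * (d_a[b] ** 2 + d_a[c] ** 2)
    return numerator / (2 * d_a[m])

def skewness(values):
    """Third standardized central moment (population form)."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((x - mu) / sigma) ** 3 for x in values])

# Toy graphs: a star (tree-like), a path (flat), and a 4-clique (clustered).
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
clique = {i: [j for j in range(4) if j != i] for i in range(4)}

print(midpoint_residual(star, a=1, b=2, c=3, m=0))    # -1.0
print(midpoint_residual(path, a=0, b=1, c=3, m=2))    # 0.0
print(midpoint_residual(clique, a=1, b=2, c=3, m=0))  # 0.125
```

The three toy cases line up with the regime descriptions used on this page: the tree-like star is negative, the path is zero, and the clique is positive.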

Benchmark

Comprehensive experiments.

CURVBENCH spans natural graphs and table-derived graphs, then evaluates models through the lens of geometry-conditioned inductive bias rather than a single aggregate score.

14 relational datasets

18 models and GFMs

3+1 curvature views

Curvature Distribution

Curvature distributions of three representative datasets

The histograms below show the node-level curvature distributions for Citeseer (near-zero), Actor (positive), and Disease (negative). These illustrate the geometric basis for our regime classification.

[Figure: node-level curvature histograms with panels Citeseer (near-zero), Actor (positive), and Disease (negative).]

Dataset regimes used in the webpage summary

Natural graphs + table-derived graphs

Regime	Representative datasets	Geometric signal	What it tests
Near-zero	Cora, Citeseer, PubMed	Balanced curvature profile	Whether flat aggregation and spectral filtering are sufficient.
Positive	Cornell, Airport, Actor	Compact or clustered geometry	When attributes and local clustering dominate relational structure.
Negative	Disease, Telecom, CS_Phds	Hierarchical or tree-like geometry	Whether non-Euclidean or adaptive models reduce metric mismatch.
Table-derived	Carcinogenesis, Hepatitis, PTE, Toxicology, F1	Near-zero mean with strong curvature tails	Specialist–robustness trade-offs hidden by average curvature.

Key results

Curvature reorganizes model rankings.

This page surfaces the main experimental signals directly: rank consistency, family-by-regime interaction, few-shot GFM behavior, and table-derived graph specialization.

0.503: Top-3 Spearman within-minus-cross regime gap on node classification.

1/280: Exact regrouping significance for the curvature partition.

43.95%: Explained variance due to family-by-regime interaction.

15.95: Average 1-shot→5-shot gain for GFMs on near-zero graphs.
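The 1/280 figure matches a simple count, offered here as a plausible reading rather than the paper's stated procedure: nine natural-graph datasets can be split into three unlabeled groups of three in 9!/((3!)³·3!) = 280 ways, so if the curvature partition scores best among all such regroupings, the smallest attainable exact p-value is 1/280. The sketch below enumerates the regroupings to confirm the count; the helper name `equal_triples` is hypothetical.

```python
from itertools import combinations

DATASETS = ["Cora", "Citeseer", "PubMed", "Cornell", "Airport", "Actor",
            "Disease", "Telecom", "CS_Phds"]

def equal_triples(items):
    """Yield every way to split `items` into unordered groups of three."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    # Pinning `first` to the leading group avoids counting permuted copies.
    for pair in combinations(rest, 2):
        remaining = [x for x in rest if x not in pair]
        for tail in equal_triples(remaining):
            yield [frozenset((first, *pair)), *tail]

regroupings = [frozenset(p) for p in equal_triples(DATASETS)]
observed = frozenset([
    frozenset(["Cora", "Citeseer", "PubMed"]),      # near-zero
    frozenset(["Cornell", "Airport", "Actor"]),     # positive
    frozenset(["Disease", "Telecom", "CS_Phds"]),   # negative
])

print(len(regroupings))          # 280
print(observed in regroupings)   # True
```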

What changes across regimes?

Near-zero graphs favor Euclidean and spectral methods; positive graphs keep Euclidean methods competitive but increasingly expose feature-dominant behavior; negative graphs favor mixed or adaptive Riemannian methods, suggesting that Euclidean failures are regime-specific rather than universal.

GraphSAGE · PCNet · GraphMoRE · QGCN

Rank-shift intuition

A flat average collapses several different preference orders. The mini heatmap below illustrates how the same model can move up or down depending on the curvature view.

[Mini heatmap: rows GraphSAGE, GraphMoRE, HAT, MLP; columns Near-zero, Positive, Negative, Table; cells mark relative rank shifts with ↑ and ↓ arrows.]

This is a compact visual summary; full numeric tables are provided below.

Observation 1 — coherent regime orders

Dataset-induced top-model rankings are substantially more consistent within a curvature regime than across regimes.

Observation 2 — inductive biases shift

Euclidean methods dominate near-zero graphs, while mixed and adaptive Riemannian methods become strongest on negative-curvature graphs.

Observation 3 — GFMs remain geometry-conditioned

Few-shot GFMs do not form one universal leaderboard; the leading method changes with the curvature regime and scalability constraints.

Observation 4 — table graphs expose curvature tails

Table-derived graphs show that mean curvature alone is insufficient; skewness and tail mass explain specialist behavior such as HAT on F1.

Experimental tables

Experimental results and analyses.

This page reproduces the most useful benchmark tables so readers can follow the result pattern without opening the PDF; the full paper remains the source of record.

Dataset statistics: natural graph regimes

Regime	Dataset	Domain	Nodes	Edges	Homophily	Avg Deg.	Features	Classes	Mean Curv.	Skewness
Near-zero	Cora	Citation	2,708	5,278	0.8100	3.90	1,433	7	0.00749	0.08401
Near-zero	Citeseer	Citation	3,327	4,552	0.7355	2.74	3,703	6	0.00222	0.38363
Near-zero	PubMed	Citation	19,717	44,324	0.8024	4.50	500	3	0.00678	0.43122
Positive	Cornell	Webpage/WebKB	183	298	0.1309	1.63	1,703	5	0.01050	0.81561
Positive	Airport	Transportation	7,543	18,508	0.4289	4.91	7,543	4	0.00213	1.33127
Positive	Actor	Wikipedia	7,600	30,019	0.2188	3.95	932	5	0.12039	1.30001
Negative	Disease	Epidemiological	1,044	1,042	0.8752	0.998	1,000	2	-0.00335	-1.48057
Negative	Telecom	Telecommunication	41,143	41,424	0.5620	1.01	240	3	-1.14371	-11.82744
Negative	CS_Phds	Academic/Social	1,025	1,043	0.2819	2.04	16	4	-0.00301	-1.53958

Curvature regimes are defined using mean curvature κ̄(G) and curvature skewness γ_κ(G).

Dataset statistics: table-derived graphs

Dataset	Domain	#Tables	#Rows	#Cols	#Nodes	#Edges	Avg Deg.	Features	Classes	Mean Curv.	Skewness
Carcinogenesis	Medicine	6	27,570	23	28,027	8,982	0.6410	300	3	0.00034	9.42658
Hepatitis	Medicine	7	12,927	26	12,927	13,016	2.0138	300	3	0.00024	4.21239
PTE	Medicine	38	29,762	76	29,850	18,805	1.2600	300	3	0.00031	9.74080
Toxicology	Medicine	4	49,239	11	49,813	18,267	0.7334	300	3	0.00021	12.06911
F1	Sports	9	97,606	77	97,606	192,560	3.9457	300	40	1.11301	-2.26907

The four medical table-derived graphs have near-zero mean curvature but strong positive skewness (4.21 to 12.07), exposing tail-driven geometry. F1 is the exception, with mean curvature 1.11 and negative skewness (−2.27).

Node classification performance on natural graphs

Model	Cora	Citeseer	PubMed	Airport	Cornell	Actor	Disease	Telecom	CS_Phds
GCN	80.36±0.71	68.68±0.65	78.12±0.28	79.18±0.98	38.37±3.52	31.31±0.62	83.82±5.58	85.85±0.64	35.51±2.87
GAT	80.72±0.70	67.50±1.64	77.08±0.32	82.82±0.78	44.32±4.52	28.67±0.60	90.62±1.41	79.73±0.19	26.83±0.00
GraphSAGE	88.30±0.21	74.89±0.65	88.48±0.05	48.80±0.27	73.51±3.52	32.84±0.56	95.60±1.45	92.90±3.08	26.73±6.15
MLP	56.12±1.05	54.18±0.87	71.27±0.38	85.07±0.55	68.10±2.26	37.46±0.62	79.90±0.00	88.15±0.04	26.83±0.00
PCNet	88.08±0.44	75.59±0.25	89.97±0.11	45.51±0.13	61.08±4.52	33.45±0.97	78.56±0.92	87.49±0.04	31.51±0.44
HAT	81.60±0.32	70.99±0.28	78.74±0.46	59.22±5.53	36.84±0.03	34.64±0.44	77.51±0.30	87.92±0.02	26.82±0.00
HGNN	78.52±0.63	67.62±0.81	76.54±0.43	83.51±2.47	61.08±1.32	28.92±0.68	77.72±2.15	93.16±0.97	24.41±2.87
HyboNet	75.16±0.84	70.23±1.20	73.58±0.45	60.88±4.17	36.22±1.06	26.67±1.32	77.01±4.59	62.03±7.32	26.73±0.19
HGCN	76.74±0.78	67.22±1.01	75.88±0.33	60.23±2.20	61.08±0.96	28.80±0.23	77.92±1.56	93.16±1.70	43.63±2.86
CUSP	76.94±0.95	68.20±1.28	66.36±2.31	58.65±2.24	40.54±1.00	24.81±1.26	85.79±1.87	66.73±5.01	29.65±3.47
QGCN	79.80±0.41	67.32±0.26	75.90±1.03	61.07±0.74	54.59±2.02	26.74±0.55	83.31±1.42	98.25±0.05	45.39±2.33
GraphMoRE	81.06±0.33	68.30±0.78	76.34±1.12	90.42±1.32	40.54±3.42	24.49±0.81	96.11±0.77	93.40±0.31	37.45±2.82

Highlighted cells mark the best mean performance in each dataset column.
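The rank-consistency idea can be illustrated directly from the table above. The sketch below is not the paper's Top-3 procedure and will not reproduce the 0.503 figure: it takes a hand-picked subset (four models, two near-zero and two negative datasets), computes Spearman correlations between dataset-induced rankings, and compares the within-regime average to the cross-regime average. The helper names and the subset choice are assumptions made for illustration.

```python
from itertools import combinations

# Mean scores copied from the table; column order follows MODELS.
MODELS = ["GraphSAGE", "PCNet", "GraphMoRE", "QGCN"]
SCORES = {
    "Cora":     [88.30, 88.08, 81.06, 79.80],
    "Citeseer": [74.89, 75.59, 68.30, 67.32],
    "Disease":  [95.60, 78.56, 96.11, 83.31],
    "Telecom":  [92.90, 87.49, 93.40, 98.25],
}
REGIME = {"Cora": "near-zero", "Citeseer": "near-zero",
          "Disease": "negative", "Telecom": "negative"}

def ranks(values):
    """Rank 1 = best score; assumes no ties, which holds for this subset."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out

def spearman(x, y):
    """Spearman rho via the tie-free rank-difference formula."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    sq_diff = sum((p - q) ** 2 for p, q in zip(rx, ry))
    return 1 - 6 * sq_diff / (n * (n * n - 1))

within, cross = [], []
for ds_a, ds_b in combinations(SCORES, 2):
    rho = spearman(SCORES[ds_a], SCORES[ds_b])
    (within if REGIME[ds_a] == REGIME[ds_b] else cross).append(rho)

gap = sum(within) / len(within) - sum(cross) / len(cross)
print(f"within={sum(within) / len(within):.2f}")  # within=0.60
print(f"cross={sum(cross) / len(cross):.2f}")     # cross=-0.55
print(f"gap={gap:.2f}")                           # gap=1.15
```

On this small subset the rankings agree within a regime and tend to reverse across regimes, which is the qualitative pattern the within-minus-cross gap is meant to capture.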

Graph Foundation Models: 1-shot scenario

Model	Cora	Citeseer	PubMed	Airport	Cornell	Actor	Disease	Telecom	CS_Phds
GCOPE	33.19±6.05	37.38±7.46	41.49±4.35	19.22±8.35	24.62±9.36	24.30±1.85	73.08±12.69	54.82±13.10	26.21±2.11
MDGPT	44.58±7.83	39.04±10.53	53.36±10.72	18.28±17.07	29.26±6.27	20.01±4.33	52.42±9.43	36.56±12.55	25.29±2.30
MDGFM	43.27±7.28	41.20±6.31	51.52±9.34	18.70±5.03	35.14±9.02	20.74±2.15	57.84±10.77	OOM	25.56±2.11
SAMGPT	44.64±14.94	36.03±8.41	45.24±8.45	19.12±9.20	33.84±8.54	19.72±5.88	60.28±11.04	45.12±13.49	25.36±6.92
GraphGluing	32.22±1.33	28.48±6.59	45.90±4.70	41.37±2.77	32.51±11.25	24.10±2.25	79.67±0.18	OOM	26.15±2.45
SA2GFM	40.25±8.05	29.98±7.81	45.79±8.90	25.63±5.95	20.99±5.45	18.53±2.09	51.12±13.39	OOM	25.92±2.75

OOM denotes out-of-memory. Best available mean performance per dataset is highlighted.

Graph Foundation Models: 5-shot scenario

Model	Cora	Citeseer	PubMed	Airport	Cornell	Actor	Disease	Telecom	CS_Phds
GCOPE	61.40±1.88	52.42±5.26	58.56±1.79	20.95±4.32	68.03±4.33	24.55±2.07	79.44±0.58	72.16±8.37	26.70±1.95
MDGPT	60.86±4.86	58.68±6.93	59.86±6.83	22.78±10.44	44.98±7.18	21.28±4.21	54.68±9.71	38.74±9.13	26.86±2.27
MDGFM	64.93±4.43	58.10±4.55	65.65±5.30	19.92±3.88	60.10±7.78	21.12±1.67	63.55±8.69	OOM	26.81±2.42
SAMGPT	64.62±9.89	53.76±5.70	56.16±7.27	21.28±6.74	52.24±6.18	19.92±6.24	68.32±9.88	58.56±11.62	27.12±6.40
GraphGluing	52.52±6.06	44.05±2.08	66.14±1.71	42.46±1.11	40.33±10.72	23.47±1.75	80.42±0.73	OOM	26.63±1.67
SA2GFM	50.91±6.57	38.25±4.18	53.40±8.94	25.95±9.77	22.83±7.34	19.35±2.96	56.77±10.84	OOM	26.05±1.87

The gain from 1-shot to 5-shot is uneven across regimes, with near-zero graphs benefiting most.
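The uneven gain can be checked from the two tables above. The sketch below averages the 5-shot minus 1-shot difference over all six GFMs for the near-zero and positive regimes; the negative regime is left out because Telecom has OOM entries. Working from the rounded table means gives roughly 15.93 for near-zero graphs, close to the 15.95 headline figure (the small difference is presumably rounding), against roughly 6.97 for positive graphs. The variable and function names are illustrative.

```python
# Mean scores copied from the 1-shot and 5-shot tables.
# List order: GCOPE, MDGPT, MDGFM, SAMGPT, GraphGluing, SA2GFM.
ONE_SHOT = {
    "Cora":     [33.19, 44.58, 43.27, 44.64, 32.22, 40.25],
    "Citeseer": [37.38, 39.04, 41.20, 36.03, 28.48, 29.98],
    "PubMed":   [41.49, 53.36, 51.52, 45.24, 45.90, 45.79],
    "Airport":  [19.22, 18.28, 18.70, 19.12, 41.37, 25.63],
    "Cornell":  [24.62, 29.26, 35.14, 33.84, 32.51, 20.99],
    "Actor":    [24.30, 20.01, 20.74, 19.72, 24.10, 18.53],
}
FIVE_SHOT = {
    "Cora":     [61.40, 60.86, 64.93, 64.62, 52.52, 50.91],
    "Citeseer": [52.42, 58.68, 58.10, 53.76, 44.05, 38.25],
    "PubMed":   [58.56, 59.86, 65.65, 56.16, 66.14, 53.40],
    "Airport":  [20.95, 22.78, 19.92, 21.28, 42.46, 25.95],
    "Cornell":  [68.03, 44.98, 60.10, 52.24, 40.33, 22.83],
    "Actor":    [24.55, 21.28, 21.12, 19.92, 23.47, 19.35],
}
REGIMES = {
    "near-zero": ["Cora", "Citeseer", "PubMed"],
    "positive":  ["Airport", "Cornell", "Actor"],
}

def mean_gain(datasets):
    """Average 5-shot minus 1-shot score over every (model, dataset) pair."""
    gains = [five - one
             for ds in datasets
             for one, five in zip(ONE_SHOT[ds], FIVE_SHOT[ds])]
    return sum(gains) / len(gains)

for regime, datasets in REGIMES.items():
    print(f"{regime}: {mean_gain(datasets):.2f}")
# near-zero: 15.93
# positive: 6.97
```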

Performance on table-derived graphs

Model	Carcinogenesis	Hepatitis	PTE	Toxicology	F1
GCN	57.27±5.07	83.19±0.44	79.66±1.82	54.78±1.58	4.70±0.70
GAT	60.30±4.59	79.80±1.29	78.33±3.11	52.75±1.29	4.25±0.14
GraphSAGE	65.45±1.27	81.80±1.30	81.67±0.00	55.07±1.02	4.10±0.14
MLP	54.55±0.00	70.80±1.78	79.00±0.91	55.07±0.20	3.96±0.40
PCNet	53.03±2.14	84.20±1.92	81.00±1.49	52.46±0.65	3.90±0.27
HGNN	62.42±2.42	66.80±0.40	77.00±1.25	53.33±2.13	4.02±0.46
HAT	70.84±1.47	59.19±0.44	85.66±3.02	40.57±4.09	40.84±5.77
HGCN	61.21±1.21	64.20±0.40	65.33±3.86	51.59±1.48	4.73±0.16
HyboNet	43.63±1.76	67.59±5.38	43.33±5.55	44.92±5.55	4.11±0.21
CUSP	57.57±7.66	80.40±1.20	51.66±10.90	54.87±0.57	5.04±0.25
QGCN	63.33±2.42	67.20±2.23	55.33±1.25	53.04±2.35	4.47±0.50
GraphMoRE	54.55±5.07	81.00±1.67	78.33±1.05	53.91±0.58	4.16±0.12

HAT behaves as a high-variance specialist: it leads on Carcinogenesis, PTE, and especially F1 (40.84 versus roughly 4 to 5 for every other model), but is the weakest model on Hepatitis and Toxicology.

Resources

Code, data, and reproducibility.

The code is available in the linked GitHub repository. Dataset and model links go live as the corresponding Hugging Face releases become public.

💻

Code

Training, evaluation, curvature computation, and diagnostic scripts.

🤗

Dataset splits

Curvature-stratified data partitions and table-derived graph construction files. Available on Hugging Face.

🚀

Models and logs

Optional checkpoints, precomputed features, and logs for reproducible comparison.

Citation

Cite CURVBENCH.

Please cite our work if you find the benchmark, splits, or diagnostic framework useful.

@misc{wang2026postgcndecaderevisitedcurvaturestratified,
      title={The Post-GCN Decade Revisited: Curvature-Stratified Evaluation of Relational Learning}, 
      author={Shuo Wang and Xiangyu Wang and Quanxin Wang and Bailin Wu and Bokui Wang and Shunyang Huang and Boyan Deng and Haonan Liu and Ruiyi Fang and Zhenxiang Xu and Boyu Wang and Zhao Kang},
      year={2026},
      eprint={2606.06397},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2606.06397}, 
}