FedStruct: Federated Decoupled Learning over Interconnected Graphs

Javad Aliakbari Johan Östman Alexandre Graell i Amat

Abstract

We address the challenge of federated learning on graph-structured data distributed across multiple clients. Specifically, we focus on the prevalent scenario of interconnected subgraphs, where inter-connections between different clients play a critical role. We present a novel framework for this scenario, named FedStruct, that harnesses deep structural dependencies. To uphold privacy, unlike existing methods, FedStruct eliminates the necessity of sharing or generating sensitive node features or embeddings among clients. Instead, it leverages explicit global graph structure information to capture inter-node dependencies. We validate the effectiveness of FedStruct through experimental results conducted on six datasets for semi-supervised node classification, showcasing performance close to the centralized approach across various scenarios, including different data partitioning methods, varying levels of label availability, and number of clients.

Machine Learning, ICML

gnn short = GNN, long = graph neural network fl short = FL, long = federated learning ml short = ML, long = machine learning dnn short = DNN, long = deep neural network sdsfl short = SDSFL, long = structure decoupled subgraph federated learning sfv short = SFV, long = structure feature vector rsfv short=RSFV, long = random structure feature vector sfm short = SFM, long = structure feature matrix sfl short = SFL, long = subgraph federated learning fedavg short = FedAvg, long = federated averaging spm short = SPM, long = structure predictor model fpm short = FPM, long = feature predictor model gdv short = GDV, long = graphlet degree vector nfe short = NFE, long = node feature embedding nse short = NSE, long = node structure embedding sem short = SEM, long = structure embedding matrix fem short = FEM, long = feature embedding matrix sel short = SEL, long = structure embedding loss dgcn short = decoupled-GCN, long = decoupled-GCN mlp short = MLP, long = multi layer perceptron

1 Introduction

Many real-world data are graph-structured, where nodes represent entities, and edges encapsulate relationships or connections between them. For example, in demand prediction for the energy grid, nodes represent power stations or substations and edges embody power transmission lines. Similarly, the identification of vulnerabilities in global supply chains involves nodes as individual companies or production facilities and edges as their interconnections. Another real-world scenario is anti-money laundering, where nodes symbolize accounts and edges correspond to transactions, which are strictly confidential.

Graph Neural Networks (GNNs) are a class of neural network architectures tailored for graph-structured data with proven success in various domains, such as drug discovery, social networks, or traffic flow modeling (Stokes et al., 2020; Fan et al., 2019; Jiang & Luo, 2022). In practical real-world applications, the graph data is often naturally distributed across multiple clients, i.e., the global graph encompasses multiple, non-overlapping subgraphs. For instance, the examples of energy grids, global supply chains, and anti-money laundering involve local graphs delineating regional energy grids, internal supply chains of distinct corporations, and the internal transactional networks of financial institutions, respectively.

For these and many other applications, data sharing among clients is unfeasible due to privacy, regulations, or proprietary restrictions. In this context, federated learning (FL) (McMahan et al., 2017), emerges as a promising approach to exploit (global) graph-structured data while safeguarding data privacy. Various flavors of federated graph neural networks exist (Liu et al., 2022). In this work, our focus is on one of the most prevalent scenarios: subgraph FL (Zhang et al., 2021), where clients hold subgraphs of disjoint nodes that integrate into a comprehensive global graph with interconnections between the different local subgraphs.

Training a GNN involves aggregating feature representations of neighboring nodes to generate more expressive embeddings (Kipf & Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018). In subgraph FL, a key challenge is facilitating training when some neighboring nodes reside in other clients. Several approaches have been proposed to address this challenge (Zhang et al., 2021; Peng et al., 2022; Chen et al., 2021; Du & Wu, 2022; Lei et al., 2023). The prevailing issue with all these approaches lies in their reliance on sharing the original, or generated, node features or node embeddings with other clients. This raises serious privacy concerns in situations when the node features are sensitive. Moreover, these approaches rely on homophily, assuming strong similarities among connected nodes.

Our contribution. We address the problem of subgraph FL for node classification within a global graph containing multiple, non-overlapping, subgraphs belonging to different clients. For this scenario, we propose a novel subgraph FL framework, named FedStruct, that exploits deep structural dependencies. To safeguard privacy, FedStruct decouples graph structure and node feature information: Unlike existing approaches, FedStruct eliminates the need for sharing (or generating) node features and/or embeddings. Instead, it leverages global graph structure information to capture inter-node dependencies among clients. The source code is publicly available.¹¹1https://anonymous.4open.science/r/FedStruct-B70D/

Our key contributions include:

•

We present two versions of FedStruct tailored for different realistic scenarios: The first one accommodates global graphs with sensitive node features but non-sensitive and publicly accessible edges, while the other addresses cases where also the local subgraph edges are confidential. For the latter case, FedStruct incorporates a decoupled graph convolutional network (decoupled GCN) (Liu et al., 2020) as the underlying GNN, limiting information exchange between clients.
•

We introduce a method to generate task-dependent node structure embeddings, Hop2Vec, that adapts to the graph and demonstrates competitive performance compared to task-agnostic methods such as Node2Vec (Grover & Leskovec, 2016) and graphlet counting (Pržulj, 2007). Notably, in contrast to these methods, Hop2Vec does not require knowledge of the global graph.
•

We address the semi-supervised learning scenario by integrating a decoupled GCN and leveraging deep structural dependencies within the global graph. With these components, FedStruct yields outstanding performance in scenarios with limited number of labeled training nodes. In heavily semi-supervised settings, it significantly outperforms FedSage+. Further, FedStruct relies on non-local neighbor-extension and inter-layer combination, techniques that are commonly used to handle heterophilic graphs (Zheng et al., 2022). To the best of our knowledge, FedStruct is the first subgraph FL framework capable of handling heterophilic graphs.
•

We validate the effectiveness of FedStruct on six datasets for semi-supervised node classification, showcasing excellent performance close to a centralized approach in multiple scenarios with different data partitioning methods, availability of training labels, and number of clients.

2 Related Work

Subgraph federated learning. The works most closely related to ours are FedSage+ (Zhang et al., 2021), FedNI (Peng et al., 2022), FedCog (Lei et al., 2023), and the works of (Zhang et al., 2022; Liu et al., 2023; Zhang et al., 2024; Chen et al., 2021; Du & Wu, 2022). FedSage+ (Zhang et al., 2021), the first method to address subgraph FL, assumes no knowledge of cross-subgraph interconnections. To deal with missing links, it augments the local subgraphs by in-painting cross-subgraph $1$ -hop missing neighbors. Specifically, it estimates the number of missing neighbors for each node and generates features for these neighbors, relying on a strong homophily assumption. FedNI (Peng et al., 2022) proposes a similar in-painting strategy, utilizing a generative adversarial network (GAN) for generating the missing neighbors and their features, resulting in improved quality of generated node features. This in-painting idea has been extended to accommodate graphs with different node types (Zhang et al., 2022) and missing links between subgraphs (Liu et al., 2023). To exploit deeper structural dependencies, building on FedSage+, Zhang et al. (2024) generates node embeddings for missing neighbors, containing information about the node’s $k$ -hop neighbors. FedCog (Lei et al., 2023) proposes a different approach for subgraph FL, decoupling local subgraphs into an internal graph and a border graph. Correspondingly, the graph convolution in the underlying GCN is divided into two steps on the two decoupled graphs. FedCog requires sharing intermediate embeddings of local nodes with neighboring clients, in which case it is equivalent to a graph convolution on the global graph. Other works tackle the problem of subgraph FL through sampling techniques (Chen et al., 2021; Du & Wu, 2022). Importantly, all these prior methods (Zhang et al., 2021; Peng et al., 2022; Lei et al., 2023; Zhang et al., 2022; Liu et al., 2023; Zhang et al., 2024; Chen et al., 2021; Du & Wu, 2022) require the sharing of either original or generated node features or embeddings among clients and operate under a homophily assumption, which does not hold in many practical scenarios.

Structural information in GNNs.

Recent studies have revealed limitations in the capacity of GNNs to capture the structure of the underlying graph. To overcome this obstacle, some works explicitly incorporate structural information into the learning, showing superior performance compared to standard GNNs (Bouritsas et al., 2023; Tan et al., 2023). To increase the representation power of GNNs, (Bouritsas et al., 2023) introduce structural information to the aggregation function. The authors demonstrate that the proposed architecture is strictly more expressive than standard GNNs. In an FL setting, (Tan et al., 2023) share universal structural knowledge among clients to enhance local performance. To our knowledge, no work has integrated explicit structural information in subgraph FL.

3 Preliminaries

Graph notation. We consider a graph denoted by $\mathcal{G}=(\mathcal{V},\mathcal{E},\boldsymbol{X},\boldsymbol{Y})$ , where $\mathcal{V}$ is the set of $n$ nodes, $\mathcal{E}$ the set of edges, $\boldsymbol{X}\in\mathbb{R}^{n\times d}$ the node feature matrix, and $\boldsymbol{Y}\in\mathbb{R}^{n\times c}$ the label matrix. For each node $v\in\mathcal{V}$ , we denote by $\boldsymbol{x}_{v}\in\mathbb{R}^{d}$ its corresponding feature vector and by $\boldsymbol{y}_{v}\in\mathbb{R}^{c}$ its corresponding one-hot encoded label vector. We consider a semi-supervised learning scenario and denote by $\tilde{\mathcal{V}}\subseteq\mathcal{V}$ the set of nodes that possess labels. The labels of the remaining nodes are set to $\boldsymbol{0}$ . We also denote by $\mathcal{N}_{\mathcal{G}}(v)=\{u|(u,v)\in\mathcal{E}\}$ the neighbors of node $v$ .

For a given matrix $\boldsymbol{M}$ , let $m_{uv}$ be its $(u,v)$ -th element. The topological information of the whole graph is described by the adjacency matrix $\boldsymbol{A}\in\mathbb{R}^{n\times n}$ , where $a_{uv}=1$ if $(u,v)\in\mathcal{E}$ . We define the diagonal matrix of node degrees as $\boldsymbol{D}\in\mathbb{R}^{n\times n}$ , where $d_{uu}=\sum_{v}{a_{uv}}$ . Furthermore, we denote $\boldsymbol{\tilde{A}}=\boldsymbol{A}+\boldsymbol{I}$ as the self-loop adjacency matrix, and $\boldsymbol{\hat{A}}=\boldsymbol{\tilde{D}}^{-1}\boldsymbol{\tilde{A}}$ as the normalized self-loop adjacency matrix, where $\tilde{d}_{uu}=\sum_{v\in\mathcal{V}}\tilde{a}_{uv}$ . We also denote $[m]=\{1,\dots,m\}$ .

Graph neural networks. Modern GNNs use neighborhood aggregation (propagation step) followed by a learning transformation (transformation step) to iteratively update the node representation at each layer. Let $\boldsymbol{h}_{v}^{(l)}$ be the feature embedding of node $v$ at layer $l$ , with $\boldsymbol{h}_{v}^{(0)}=\boldsymbol{x}_{v}$ . At layer $l\in[L]$ of the GNN we have

	$\displaystyle\boldsymbol{h}_{\mathcal{N}_{\mathcal{G}}(v)}^{(l)}$	$\displaystyle=\textsc{Aggregate}^{(l)}\bigg{(}\Big{\{}\boldsymbol{h}_{u}^{(l-1)},~{}\forall u\in\mathcal{N}_{\mathcal{G}}(v)\Big{\}}\bigg{)}$		(1)
	$\displaystyle\boldsymbol{h}_{v}^{(l)}$	$\displaystyle=\textsc{Update}^{(l)}\Big{(}\boldsymbol{h}_{v}^{(l-1)}\,,\,\boldsymbol{h}_{\mathcal{N}_{\mathcal{G}}(v)}^{(l)}\,,\boldsymbol{\Theta}^{(l)}\Big{)}\,,$		(2)

where $\boldsymbol{\Theta}^{(l)},\textsc{Aggregate}^{(l)}$ and $\textsc{Update}^{(l)}$ denote the weight matrix, aggregation function, and transformation function associated with layer $l$ of the GNN, respectively.

Decoupled graph convolutional networks. A significant limitation of GNNs is over-smoothing (Li et al., 2018; Liu et al., 2020; Dong et al., 2021), which results in performance degradation when multiple layers are applied. Over-smoothing is characterized by node embeddings becoming inseparable, making it challenging to distinguish between them. Recent studies (Li et al., 2018; Liu et al., 2020; Dong et al., 2021) attribute this phenomenon to the interweaving of the propagation and transformation steps within each GNN layer. To address over-smoothing, a well-known solution (hereafter referred to as decoupled GCN) is to decouple the propagation and transformation steps (Liu et al., 2020; Dong et al., 2021). Let $\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{v})$ be a mlp network that takes the node feature $\boldsymbol{x}_{v}$ and learning parameters $\boldsymbol{\theta}$ and returns the label prediction $\hat{\boldsymbol{y}}^{(\text{MLP})}_{v}$ ,

\hat{\boldsymbol{y}}^{(\text{MLP})}_{v}=\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{v})

(3)

A decoupled GCN can be described with the model

\boldsymbol{H}^{(L)}=\boldsymbol{\bar{A}}\boldsymbol{F}_{\boldsymbol{\theta}}(\boldsymbol{X})\,,

(4)

where $\boldsymbol{\bar{A}}=\sum_{l=1}^{L}{\beta_{l}\boldsymbol{\hat{A}}^{l}}$ , $\boldsymbol{F}_{\boldsymbol{\theta}}(\boldsymbol{X})=||(\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{v})^{\mathsf{T}}\quad\forall v\in\mathcal{V})$ , with $||$ being the concatenation operation, and ^T denotes the transpose operation. We refer to $\boldsymbol{\bar{A}}$ as the $L$ -hop combined neighborhood adjacency matrix. Its element reflects the proximity of two nodes in the graph, with $\beta_{l}$ determining the contribution of each hop. Parameters $\{\beta_{l}\}_{l=1}^{L}$ can be set manually or learned during the training. Equations (3) and (4) correspond to the transformation and propagation step of the decoupled GCN, respectively.

4 System Model

We consider a scenario where data is structured according to a global graph $\mathcal{G}=(\mathcal{V},\mathcal{E},\boldsymbol{X},\boldsymbol{Y})$ . The global graph is distributed among $K$ clients such that each client owns a smaller local subgraph. We denote by $\mathcal{G}_{i}=(\mathcal{V}_{i},\mathcal{V}_{i}^{*},\mathcal{E}_{i},\mathcal{E}_{i}^{*},\boldsymbol{X}_{i},\boldsymbol{Y}_{i})$ the subgraph of client $i$ , where $\mathcal{V}_{i}\subseteq\mathcal{V}$ is the set of $n_{i}$ nodes that reside in client $i$ , referred to as internal nodes, for which client $i$ knows their features. $\mathcal{V}_{i}^{*}$ is the set of nodes that do not reside in client $i$ but have at least one connection to nodes in $\mathcal{V}_{i}$ . We call these nodes external nodes. Client $i$ does not have access to the features of nodes in $\mathcal{V}_{i}^{*}$ . Furthermore, $\mathcal{E}_{i}$ represents the set of edges between nodes owned by client $i$ (intra-connections), $\mathcal{E}_{i}^{*}$ is the set of edges between nodes of client $i$ and nodes of other clients (inter-connections), $\boldsymbol{X}_{i}\in\mathbb{R}^{n_{i}\times d}$ the node feature matrix, and $\boldsymbol{Y}_{i}\in\mathbb{R}^{n_{i}\times c}$ the label matrix for the nodes within subgraph $\mathcal{G}_{i}$ , and we denote by $\tilde{\mathcal{V}}_{i}$ the set of nodes that possess labels. Finally, similar to graph $\mathcal{G}$ , we denote by $\mathcal{N}_{\mathcal{G}_{i}}(v)$ , $\boldsymbol{A}^{(i)}$ , and $\boldsymbol{D}^{(i)}$ the set of local neighbors, the local adjacency matrix, and the local diagonal matrix, respectively, for subgraph $\mathcal{G}_{i}$ .

4.1 Node Embeddings

Our focus is on node classification, where the goal is to classify each node based on the underlying graph data. More precisely, we assume that each client employs a GNN model that generates embeddings for each node using the node features and subgraph connections. Formally, for client $i$ , the feature embedding matrix is obtained as

\displaystyle\boldsymbol{H}_{i}^{(L)}=\text{fGNN}_{\boldsymbol{\theta}_{\mathsf{f}}}(\boldsymbol{X}_{i},\mathcal{E}_{i})\,,

(5)

where fGNN is the GNN model to generate node feature embeddings (NFEs), $L$ is the number of propagation layers, and $\boldsymbol{\theta}_{{\mathsf{f}}}$ is the parameter vector of fGNN. Also $\boldsymbol{H}_{i}^{(L)}=||\big{(}(\boldsymbol{h}_{v}^{(L)})^{\mathsf{T}}\quad\forall v\in\mathcal{V}_{i}\big{)}$ , where $\boldsymbol{h}_{v}^{(L)}$ is the NFE of node $v$ .

4.2 Federated Learning

Refer to caption — Figure 1: General design of the FedStruct framework.

The FL problem can be formalized as learning the model parameters that minimize the aggregated loss across clients,

\boldsymbol{\theta}^{*}=\operatorname*{arg\,min}_{\boldsymbol{\theta}}\;\mathcal{L}(\boldsymbol{\theta})\,,

(6)

with

\displaystyle\mathcal{L}(\boldsymbol{\theta})=\frac{1}{|\tilde{\mathcal{V}}|}\sum_{i=1}^{K}{\mathcal{L}_{i}(\boldsymbol{\theta})},\quad\text{and}\quad\mathcal{L}_{i}(\boldsymbol{\theta})=\sum_{v\in\tilde{\mathcal{V}}_{i}}{\mathrm{CE}(\boldsymbol{y}_{v},\hat{\boldsymbol{y}}_{v})}\,,

(7)

where $\mathrm{CE}$ is the cross-entropy loss function between the true label $\boldsymbol{y}_{v}$ and the predicted label $\hat{\boldsymbol{y}}_{v}$ .

The model $\boldsymbol{\theta}$ is trained iteratively over multiple epochs. At each epoch, the clients compute the local gradients $\nabla_{\boldsymbol{\theta}}\mathcal{L}_{i}(\boldsymbol{\theta})$ and send them to the central server. The server updates the model through gradient descent,

\boldsymbol{\theta}\leftarrow\boldsymbol{\theta}-\lambda\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}),\quad\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta})=\frac{1}{|\tilde{\mathcal{V}}|}\sum_{i=1}^{K}{\nabla_{\boldsymbol{\theta}}\mathcal{L}_{i}(\boldsymbol{\theta})}\,,

(8)

where $\lambda$ is the learning rate.

5 FedStruct-A: Subgraph FL with Knowledge of the Global Graph

In this section, we introduce FedStruct, a novel subgraph FL framework designed to leverage inter-node dependencies among clients while safeguarding privacy. The central concept of FedStruct is to harness explicit information about the overall graph’s underlying structure while ensuring that neither the server nor the clients have access to the node features. Specifically, in this section, we introduce the first version of FedStruct, referred to as FedStruct-A, that operates under the assumption that the server has complete knowledge of the graph’s connections. This scenario is commonly encountered in applications such as smart grids and pandemic prediction. Importantly, in FedStruct-A the clients remain unaware of the local graphs of other clients. In Section 6, we propose a second version of FedStruct that assumes that the server lacks knowledge of the global graph’s connections. Fig. 1 illustrates the FedStruct framework. In the following, we describe FedStruct-A.

Node structure features. Our proposed method relies on node structure features (NSFs), which encapsulate structural information about the nodes, such as node degree and local neighborhood connection patterns, as well as positional information such as distance to other nodes in the graph. For each node $v\in\mathcal{V}$ , the corresponding NSF, $\boldsymbol{s}_{v}\in\mathbb{R}^{d_{\mathsf{s}}}$ , is a function of the set of edges $\mathcal{E}$ , $\boldsymbol{s}_{v}=\boldsymbol{Q}_{\mathsf{NSF}}(\mathcal{E})$ . We define the matrix containing all NSFs as $\boldsymbol{S}=||(\boldsymbol{s}_{v}^{\mathsf{T}},\quad\forall v\in\mathcal{V})$ .

Function $\boldsymbol{Q}_{\mathsf{NSF}}$ can be defined through various node embedding algorithms, e.g., one-hot degree vector (Xu et al., 2019; Cui et al., 2022), gdv (Pržulj, 2007), or Node2Vec (Grover & Leskovec, 2016). We describe these different node embeddings in Appendix A.

Node structure embeddings. NSFs derived from one-hot degree vectors, GDV, and Node2Vec, are inherently task-agnostic. To use NSFs for label prediction, a learning model is required. To this aim, the server employs a GNN, here referred to as sGNN, that generates node structure embeddings (NSEs) incorporating information from the NSFs and the global graph’s connections. Let $L_{\mathsf{s}}$ be the number of propagation layers of sGNN and $\boldsymbol{S}^{(L_{\mathsf{s}})}=||\big{(}(\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}){{}^{\mathsf{T}}}\quad\forall v\in\mathcal{V}\big{)}$ the matrix of NSEs at layer $L_{\mathsf{s}}$ , with $\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}$ being the NSE of node $v$ . Formally, $\boldsymbol{S}^{(L_{\mathsf{s}})}$ is obtained as

\displaystyle\boldsymbol{S}^{(L_{\mathsf{s}})}=\text{sGNN}_{\boldsymbol{\theta}_{\mathsf{s}}}(\boldsymbol{S},\mathcal{E})\,,

(9)

where $\boldsymbol{\theta}_{\mathsf{s}}$ are the parameters of sGNN.

Node prediction. Upon the creation of the NSEs, the server sends them to the respective clients. At each client $i\in[K]$ , node prediction is performed based on both the NFEs (which are computed locally at each client, see Section 4.1) and the NSEs. To this aim, we pass them through a fully connected layer with parameters $\boldsymbol{\Theta}_{\mathsf{c}}$ (obtained from the server),

	$\displaystyle\hat{\boldsymbol{y}}_{v}$	$\displaystyle=\mathrm{softmax}(\boldsymbol{\Theta}_{\mathsf{c}}^{\mathsf{T}}\big{(}\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}\|\|\boldsymbol{h}_{v}^{(L)}))$		(10)
		$\displaystyle=\mathrm{softmax}\big{(}(\boldsymbol{\Theta}_{\mathsf{c}}^{(\mathsf{s})})^{\mathsf{T}}\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}+(\boldsymbol{\Theta}_{\mathsf{c}}^{(\mathsf{f})})^{\mathsf{T}}\boldsymbol{h}_{v}^{(L)})\big{)}\quad\forall v\in\mathcal{V}_{i}\,,$

where $\boldsymbol{h}_{v}^{(L)}$ and $\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}$ are obtained based on (5) and (9), respectively. We denote the vectorized version of $\boldsymbol{\Theta}_{\mathsf{c}}$ by $\boldsymbol{\theta}_{\mathsf{c}}$ . From (10), we have that $\boldsymbol{\theta}_{\mathsf{c}}=\boldsymbol{\theta}_{\mathsf{c}}^{(\mathsf{s})}\,||\boldsymbol{\theta}_{\mathsf{c}}^{(\mathsf{f})}$ .

Note that, since the NSEs are generated by the server, they capture knowledge of all connections of the global graph. Consequently, the use of NSEs is anticipated to enhance node classification compared to the case of a classifier solely relying on NFEs.

Gradient of the loss function. We optimize the learning parameters of FedStruct-A, $\boldsymbol{\theta}=(\boldsymbol{\theta}_{\mathsf{f}}||\boldsymbol{\theta}_{\mathsf{s}}||\boldsymbol{\theta}_{\mathsf{c}})$ , solving (6)–(7) with $\hat{\boldsymbol{y}}_{v}$ in (10) using stochastic gradient descent.

Theorem 1.

Let $\mathcal{L}_{i}(\boldsymbol{\theta})$ , $\boldsymbol{\theta}=(\boldsymbol{\theta}_{\mathsf{f}}||\boldsymbol{\theta}_{\mathsf{s}}||\boldsymbol{\theta}_{\mathsf{c}})$ , be the local training loss for client $i$ in (7) and $\hat{\boldsymbol{y}}_{v}$ be given in (10). The gradient

\nabla_{\boldsymbol{\theta}}\mathcal{L}_{i}(\boldsymbol{\theta})=||\left(\frac{\partial{\mathcal{L}_{i}(\boldsymbol{\theta})}}{\partial{\theta_{j}}}\quad\forall j\in[|\boldsymbol{\theta}|]\right)\,

(11)

is given by

	$\displaystyle\frac{\partial{\mathcal{L}_{i}(\boldsymbol{\theta})}}{\partial{\theta_{\mathsf{s},j}}}$	$\displaystyle=\sum_{v\in\tilde{\mathcal{V}}}{\frac{\partial{\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}}}{\partial{\theta_{\mathsf{s},j}}}\boldsymbol{\Theta}_{\mathsf{c}}^{(\mathsf{s})}(\hat{\boldsymbol{y}}_{v}-\boldsymbol{y}_{v})}\quad\forall j\in[\|\boldsymbol{\theta}_{\mathsf{s}}\|],$
	$\displaystyle\frac{\partial{\mathcal{L}_{i}(\boldsymbol{\theta})}}{\partial{\theta_{\mathsf{f},j}}}$	$\displaystyle=\sum_{v\in\tilde{\mathcal{V}}}{\frac{\partial{\boldsymbol{h}_{v}^{(L)}}}{\partial{\theta_{\mathsf{f},j}}}\boldsymbol{\Theta}_{\mathsf{c}}^{(\mathsf{f})}(\hat{\boldsymbol{y}}_{v}-\boldsymbol{y}_{v})}\quad\forall j\in[\|\boldsymbol{\theta}_{\mathsf{f}}\|],$
	$\displaystyle\frac{\partial{\mathcal{L}_{i}(\boldsymbol{\theta})}}{\partial{\theta_{\mathsf{c},j}^{(\mathsf{s})}}}$	$\displaystyle=\sum_{v\in\tilde{\mathcal{V}}}{\frac{\partial{\big{(}(\boldsymbol{\Theta}_{\mathsf{c}}^{(\mathsf{s})})^{\mathsf{T}}\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}\big{)}}}{\partial{\theta_{\mathsf{c},j}^{(\mathsf{s})}}}(\hat{\boldsymbol{y}}_{v}-\boldsymbol{y}_{v})}\quad\forall j\in[\|\boldsymbol{\theta}_{\mathsf{c}}^{(\mathsf{s})}\|],$
	$\displaystyle\frac{\partial{\mathcal{L}_{i}(\boldsymbol{\theta})}}{\partial{\theta_{\mathsf{c},j}^{(\mathsf{f})}}}$	$\displaystyle=\sum_{v\in\tilde{\mathcal{V}}}{\frac{\partial{\big{(}(\boldsymbol{\Theta}_{\mathsf{c}}^{(\mathsf{f})})^{\mathsf{T}}\boldsymbol{h}_{v}^{(L)}\big{)}}}{\partial{\theta_{\mathsf{c},j}^{(\mathsf{f})}}}(\hat{\boldsymbol{y}}_{v}-\boldsymbol{y}_{v})}\quad\forall j\in[\|\boldsymbol{\theta}_{\mathsf{c}}^{(\mathsf{f})}\|]\,,$

where $\theta_{\cdot,j}$ denotes the $j$ -th entry of vector $\boldsymbol{\theta}_{\cdot,j}$ .

Proof.

See Appendix B. ∎

Note that client $i$ cannot compute $\nabla_{\boldsymbol{\theta}}\mathcal{L}_{i}(\boldsymbol{\theta})$ directly as calculating $\frac{\partial{\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}}}{\partial{\theta_{\mathsf{s},j}}}$ requires knowledge of the global graph connections. Hence, the server must provide $\frac{\partial{\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}}}{\partial{\theta_{\mathsf{s},j}}}$ along with $\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}$ for all $v\in\tilde{\mathcal{V}}_{i}$ . The FedStruct-A framework is described in Algorithm 1.

Algorithm 1 FedStruct-A algorithm

0: Global graph

\mathcal{G}

, sGNN and fGNN models, NSF generator function

\boldsymbol{Q}_{\mathsf{NSF}}

K

clients with respective subgraphs

\{\mathcal{G}_{i}\}_{i=1}^{K}

, model parameters

\boldsymbol{\theta}=(\boldsymbol{\theta}_{\mathsf{f}}||\boldsymbol{\theta}_{\mathsf{s}}||\boldsymbol{\theta}_{\mathsf{c}})

\boldsymbol{S}\leftarrow\boldsymbol{Q}_{\text{NSF}}(\mathcal{E})

for e=1 to Epochs do

\boldsymbol{S}^{(L_{\mathsf{s}})}\leftarrow\text{sGNN}_{\boldsymbol{\theta}_{\mathsf{s}}}(\boldsymbol{S},\mathcal{E})

for i=1 to K do

Client

i

collects

\boldsymbol{\theta}

from the server

Client

i

collects

\{\boldsymbol{s}_{v}^{(L_{\mathsf{s}})},\forall v\in\tilde{\mathcal{V}}_{i}\}

from the server

Client

i

collects

\{\frac{\partial{\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}}}{\partial{\boldsymbol{\theta}_{\mathsf{s}}}},\forall v\in\tilde{\mathcal{V}}_{i}\}

from the server

\boldsymbol{H}_{i}^{(L)}=\text{fGNN}_{\boldsymbol{\theta}_{\mathsf{f}}}(\boldsymbol{X}_{i},\mathcal{E}_{i})

for

v\in\tilde{\mathcal{V}}_{i}

\hat{\boldsymbol{y}}_{v}=\mathrm{softmax}(\boldsymbol{\Theta}_{\mathsf{c}}^{\mathsf{T}}.(\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}||\boldsymbol{h}_{v}^{(L)}))

end for

Calculate

\mathcal{L}_{i}(\boldsymbol{\theta})

based on (7)

Calculate

\nabla_{\boldsymbol{\theta}}\mathcal{L}_{i}(\boldsymbol{\theta})

from 1

Send

\nabla_{\boldsymbol{\theta}}\mathcal{L}_{i}(\boldsymbol{\theta})

to the server

end for

Calculate

\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta})

based on (8)

\boldsymbol{\theta}\leftarrow\boldsymbol{\theta}-\lambda\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta})

end for

5.1 Task-Dependent Node Structure Feature Vector

Ideally, NSFs should enhance node distinctiveness while exhibiting recognizable patterns that facilitate node classification. For graphs characterized by long-range dependencies, NSFs need to capture structural information beyond direct neighbors. Further, the NSFs should preferably be task-dependent. GDV and Node2Vec capture longer-range dependencies compared to one-hot degree vectors. However, they remain task-agnostic and hinge on a homophily assumption (see Appendix A). To overcome these shortcomings, we introduce a new method to generate NSFs, dubbed Hop2Vec, designed to be task-dependent and applicable to heterophilic graphs.

Hop2Vec constructs NSFs as follows. First, NSFs are initialized with a random vector, after which they are learned from the graph structure during the training of FedStruct. As a result, the NSFs are inherently task-dependent—formed to minimize misclassification. To optimize the NSFs, we modify the objective function (7) to be contingent upon matrix $\boldsymbol{S}$ , i.e., $\mathcal{L}(\boldsymbol{\theta},\boldsymbol{S})$ . The NSFs are optimized by solving

\boldsymbol{S}^{*}=\operatorname*{arg\,min}_{\boldsymbol{S}}{\mathcal{L}(\boldsymbol{\theta},\boldsymbol{S})}\,.

(12)

The optimization is carried out through gradient descent,

\boldsymbol{S}\leftarrow\boldsymbol{S}-\lambda_{\mathsf{s}}\nabla_{\boldsymbol{S}}\mathcal{L}(\boldsymbol{\theta},\boldsymbol{S})\,,

(13)

where $\lambda_{\mathsf{s}}$ denotes the learning rate.

Theorem 2.

Let $\mathcal{L}_{i}(\boldsymbol{\theta})$ denote the local training loss in (7) for client $i$ and $\boldsymbol{S}$ the matrix of NSFs resulting from the Hop2Vec method. Additionally, let $\hat{\boldsymbol{y}}_{v}$ be as in (10). The gradient

\displaystyle\nabla_{\boldsymbol{S}}\mathcal{L}_{i}(\boldsymbol{\theta}_{i},\boldsymbol{S})=||\big{(}(\nabla_{\boldsymbol{s}_{q}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S}))^{\mathsf{T}}\quad\forall q\in\mathcal{V}\big{)}

(14)

is given by

\displaystyle\nabla_{\boldsymbol{s}_{q}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})=\frac{\partial{\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}}}{\partial{\boldsymbol{s}_{q}}}\boldsymbol{\Theta}_{\mathsf{c}}^{(\mathsf{s})}(\hat{\boldsymbol{y}}_{v}-\boldsymbol{y}_{v})\quad\forall q\in\mathcal{V}\,.

(15)

Proof.

See Appendix C. ∎

To calculate (15), client $i$ must receive $\{\frac{\partial{\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}}}{\partial{\boldsymbol{s}_{q}}},\,\forall v\in\tilde{\mathcal{V}}_{i},q\in\mathcal{V}\}$ from the server. The FedStruct-A framework with Hop2Vec is described in Algorithm 2 in Appendix D.

5.2 Semi-Supervised Learning

In a semi-supervised learning scenario with a limited number of nodes, it is imperative for GNNs to employ more layers to effectively capture longer-range dependencies (Liu et al., 2020). However, standard GNNs suffer from over-smoothing when the number of layers is large. To address this issue, specifically in semi-supervised learning tasks, FedStruct-A employs a decoupled GCN as the underlying GNN.

For client $i\in[K]$ and node $v\in\mathcal{V}_{i}$ , let $\boldsymbol{f}_{\boldsymbol{\theta}_{\mathsf{f}}}(\boldsymbol{x}_{v})$ and $\boldsymbol{g}_{\boldsymbol{\theta}_{\mathsf{s}}}(\boldsymbol{s}_{v})$ be the MLP networks responsible for exploiting the node features and node structure features, respectively. To employ a decoupled GCN in FedStruct, we use (4) to define the sGNN and fGNN models as

	$\displaystyle\boldsymbol{H}_{i}^{(L)}$	$\displaystyle=\text{fGNN}_{\boldsymbol{\theta}_{\mathsf{f}}}(\boldsymbol{X}_{i},\mathcal{E}_{i})=\boldsymbol{\bar{A}}^{(i)}\;\boldsymbol{F}_{\boldsymbol{\theta}_{\mathsf{f}}}(\boldsymbol{X}_{i})$		(16)
	$\displaystyle\boldsymbol{S}^{(L_{\mathsf{s}})}$	$\displaystyle=\text{sGNN}_{\boldsymbol{\theta}_{\mathsf{s}}}(\boldsymbol{S},\mathcal{E})=\boldsymbol{\bar{A}}\;\boldsymbol{G}_{\boldsymbol{\theta}_{\mathsf{s}}}(\boldsymbol{S})$		(17)

where $\boldsymbol{F}_{\boldsymbol{\theta}_{\mathsf{f}}}(\boldsymbol{X}_{i})=||(\boldsymbol{f}_{\boldsymbol{\theta}_{\mathsf{f}}}(\boldsymbol{x}_{u})^{\mathsf{T}}\;\forall u\in\mathcal{V}_{i})$ , $\boldsymbol{G}_{\boldsymbol{\theta}_{\mathsf{s}}}(\boldsymbol{S})=||(\boldsymbol{g}_{\boldsymbol{\theta}_{\mathsf{s}}}(\boldsymbol{s}_{u})^{\mathsf{T}}\;\forall u\in\mathcal{V})$ , and $\boldsymbol{\bar{A}}$ and $\boldsymbol{\bar{A}}^{(i)}$ are the $L$ -hop combined adjacency matrices for graph $\mathcal{G}$ and $\mathcal{G}_{i}$ , respectively.

6 FedStruct-B: Subgraph FL with no Knowledge of the Global Graph

To predict node labels, FedStruct leverages both node features and structure features. In FedStruct-A, structure features are harnessed by the server leveraging its knowledge of the global graph connections. However, in many scenarios, the server does not possess this knowledge, rendering it unable to exploit structure features. To tackle this limitation, we introduce FedStruct-B, which eliminates the need for such global knowledge. In this case, for the learning framework to capitalize on inter-node dependencies among clients, structure features must be locally generated at each client. This necessitates clients to selectively share specific local graph structural information with other clients. To uphold privacy, this sharing of information must be done cautiously. In the following, we describe the different components of FedStruct-B.

Node structure feature vectors. It is important to note that Node2Vec and GDV, along with other well-known methods, require knowledge of the $L$ -hop neighborhood of a node. Hence, in scenarios where the sever lacks knowledge of the global graph connections, these methods cannot be used. In contrast, our method Hop2Vec does not require knowledge of the $L$ -hop neighborhood information. We employs Hop2Vec in FedStruct-B to generate NSFs.

Decoupled GCN. For the scenario where the server lacks knowledge of the global graph, NSEs need to be generated locally. Unfortunately, standard GNNs are ill-suited to locally generate NSEs. The intertwining of the propagation and transformation steps within each GNN layer poses challenges. Specifically, for clients to compute new embeddings for a given layer of the GNN, they require the embeddings of previous layers for neighboring nodes residing in other clients. This would allow the clients to infer the global graph structure. To address this privacy concern, FedStruct-B employs a decoupled GCN. Decoupled GCNs separate the propagation and transformation steps. The propagation step does not involve learning, enabling the computation of all propagation layers before training. This allows us to pinpoint the information needed by other clients for the training and limit the exchange of information to these specifics, thereby minimizing the exchange of information with each client compared to standard GNNs, as we show next.

Structural information sharing. For a given client $i$ , training FedStruct involves computing $\nabla_{\boldsymbol{\theta}}\mathcal{L}_{i}(\boldsymbol{\theta})$ and predicting node labels $\{\hat{\boldsymbol{y}}_{v},\forall v\in\tilde{\mathcal{V}}_{i}\}$ . Our objective is to devise a mechanism allowing the local training of FedStruct without relying on full knowledge of the global graph. In the following, we show that for node classification tasks utilizing the categorical cross-entropy loss function, the incorporation of decoupled GCNs enables the training of FedStruct with limited knowledge of the global graph. To this end, we denote by $\boldsymbol{\bar{A}}^{[i]}=||(\boldsymbol{\bar{a}}_{v}^{\mathsf{T}},\quad\forall v\in\tilde{\mathcal{V}}_{i})$ , client $i$ ’s local partition of $\boldsymbol{\bar{A}}$ where $\boldsymbol{\bar{a}}_{v}=||(\bar{a}_{vu},\quad\forall u\in\mathcal{V})$ .

Theorem 3.

To obtain $\{\hat{\boldsymbol{y}}_{v},\forall v\in\tilde{\mathcal{V}}_{i}\}$ in (10) using decoupled GCN, client $i$ only needs external inputs $\boldsymbol{\bar{A}}^{[i]}$ and $\boldsymbol{S}$ .

Proof.

For a node $v\in\tilde{\mathcal{V}}_{i}$ , expanding (16) and (17) leads to

	$\displaystyle\boldsymbol{h}_{v}^{(L)}$	$\displaystyle=\sum_{u\in\mathcal{V}_{i}}{\bar{a}_{vu}^{(i)}\boldsymbol{f}_{\boldsymbol{\theta}_{\mathsf{f}}}(\boldsymbol{x}_{u})}$		(18)
	$\displaystyle\boldsymbol{s}_{v}^{(L_{s})}$	$\displaystyle=\sum_{u\in\mathcal{V}}{\bar{a}_{vu}\boldsymbol{g}_{\boldsymbol{\theta}_{\mathsf{s}}}(\boldsymbol{s}_{u})}\,,$		(19)

where $\boldsymbol{h}_{v}^{(L)}$ and $\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}$ represent the NFE and NSE of node $v$ , respectively. Using (18) and (19) in (10) leads to

\displaystyle\hat{\boldsymbol{y}}_{v}

\displaystyle=\mathrm{softmax}\Big{(}\sum_{u\in\mathcal{V}}{\bar{a}_{vu}\boldsymbol{g}_{\boldsymbol{\theta}_{\mathsf{s}}}(\boldsymbol{s}_{u})}+\sum_{u\in\mathcal{V}_{i}}{\bar{a}_{vu}^{(i)}\boldsymbol{f}_{\boldsymbol{\theta}_{\mathsf{f}}}(\boldsymbol{x}_{u})}\Big{)}\,

(20)

where (20) comes from integrating $\boldsymbol{\Theta}_{\mathsf{c}}^{(\mathsf{f})}$ and $\boldsymbol{\Theta}_{\mathsf{c}}^{(\mathsf{s})}$ into $\boldsymbol{f}_{\boldsymbol{\theta}_{\mathsf{f}}}(\boldsymbol{x}_{u})$ and $\boldsymbol{g}_{\boldsymbol{\theta}_{\mathsf{s}}}(\boldsymbol{s}_{u})$ , respectively. Equation (20) depends only on $\boldsymbol{\bar{A}}^{[i]}$ and $\boldsymbol{S}$ , which concludes the proof. ∎

Theorem 4.

Let $\mathcal{L}_{i}(\boldsymbol{\theta})$ be the local training loss defined in (7) and let $\hat{\boldsymbol{y}}_{v}$ be as in (20). The local gradient

\displaystyle\nabla_{\boldsymbol{\theta}}\mathcal{L}_{i}(\boldsymbol{\theta})=||\left(\frac{\partial{\mathcal{L}_{i}(\boldsymbol{\theta})}}{\partial{\theta_{j}}}\quad\forall j\in[|\boldsymbol{\theta}|]\right)

is given by

	$\displaystyle\frac{\partial{\mathcal{L}_{i}(\boldsymbol{\theta})}}{\partial{\theta_{\mathsf{s},j}}}$	$\displaystyle=\sum_{v\in\tilde{\mathcal{V}}_{i},u\in\mathcal{V}}{\bar{a}_{vu}\frac{\partial{\boldsymbol{g}_{\boldsymbol{\theta}_{\mathsf{s}}}(\boldsymbol{s}_{u})}}{\partial{\theta_{\mathsf{s},j}}}(\hat{\boldsymbol{y}}_{v}-\boldsymbol{y}_{v})}$		(21)
	$\displaystyle\frac{\partial{\mathcal{L}_{i}(\boldsymbol{\theta})}}{\partial{\theta_{\mathsf{f},j}}}$	$\displaystyle=\sum_{v\in\tilde{\mathcal{V}}_{i},u\in\mathcal{V}_{i}}{\bar{a}^{(i)}_{vu}\frac{\partial{\boldsymbol{f}_{\boldsymbol{\theta}_{\mathsf{f}}}(\boldsymbol{x}_{u})}}{\partial{\theta_{\mathsf{f},j}}}(\hat{\boldsymbol{y}}_{v}-\boldsymbol{y}_{v})}.$		(22)

Proof.

See Appendix E. ∎

Theorem 5.

For client $i$ , the training of FedStruct-b, i.e., computing $\{\hat{\boldsymbol{y}}_{v},\forall v\in\tilde{\mathcal{V}}_{i}\}$ and $\nabla_{\boldsymbol{\theta}}\mathcal{L}_{i}(\boldsymbol{\theta})$ , requires only the local partition $\boldsymbol{\bar{A}}^{[i]}$ and $\boldsymbol{S}$ .

Proof.

It follows by combining Theorems 3 and 4. ∎

From Theorem 5, knowledge of the global adjacency matrix is not required to predict a node’s label or to compute the gradient at a given client. Instead, only the local partition of the global $L$ -hop combined adjacency matrix, $\boldsymbol{\bar{A}}^{[i]}$ , is necessary to operate FedStruct-b. Clients have to collaboratively calculate $\boldsymbol{\bar{A}}^{[i]}$ . In Appendix G, we provide an algorithm to obtain $\boldsymbol{\bar{A}}^{[i]}$ before the training begins without gaining any additional information about the global graph. Moreover, due to graph isomorphism, $\boldsymbol{\bar{A}}^{[i]}$ cannot be used to uniquely determine the adjacency matrix of other clients. The FedStruct-B framework is described in Algorithm 4 in Appendix H.

Use of task-dependent node structure feature vectors. Using Hop2Vec, we note that the learning of $\boldsymbol{g}_{\boldsymbol{\theta}_{\mathsf{s}}}(\boldsymbol{s}_{u})$ is equivalent to the learning of $\boldsymbol{s}_{u}$ . The purpose of passing the NSFs through the sGNN is to generate task-dependent NSEs. Since the NSFs generated by Hop2Vec are learned during the training, they are already task-dependent. Therefore, we do not need to consider the transformation step of sGNN, $\boldsymbol{g}_{\boldsymbol{\theta}_{\mathsf{s}}}(\boldsymbol{s}_{u})$ , but only the propagation step. This is based on the fact that both the MLP parameters $\boldsymbol{\theta}_{\mathsf{s}}$ and the structure features $\boldsymbol{s}_{u}$ are subject to optimization. Therefore, using Hop2Vec, (20) can be rewritten as

\hat{\boldsymbol{y}}_{v}=\mathrm{softmax}\Big{(}\sum_{u\in\mathcal{V}}{\bar{a}_{vu}\boldsymbol{s}_{u}}+\sum_{u\in\mathcal{V}_{i}}{\bar{a}_{vu}^{(i)}\boldsymbol{f}_{\boldsymbol{\theta}_{\mathsf{f}}}(\boldsymbol{x}_{u})}\Big{)}\,.

(23)

Theorem 6.

Let $\mathcal{L}_{i}(\boldsymbol{\theta})$ denote the local training loss in (7) for client $i$ and $\boldsymbol{S}=||(\boldsymbol{s}_{q}^{\mathsf{T}},\quad\forall q\in\mathcal{V})$ represent the matrix containing NSFs generated by Hop2Vec. Additionally, let $\hat{\boldsymbol{y}}_{v}$ be as in (23). The gradient $\nabla_{\boldsymbol{S}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})$ is given by

\displaystyle\nabla_{\boldsymbol{S}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})=\sum_{v\in\tilde{\mathcal{V}}_{i}}{\boldsymbol{\bar{a}}_{v}(\hat{\boldsymbol{y}}_{v}-\boldsymbol{y}_{v})^{\mathsf{T}}}\,.

(24)

Proof.

See Appendix F. ∎

To calculate (24), client $i$ only needs $\boldsymbol{\bar{A}}^{[i]}$ . The FedStruct-B framework with Hop2Vec is described in Algorithm 5 in Appendix I.

7 Evaluation

Table 1: Node classification accuracy for various subgraph FL methods with random partitioning. The nodes are split into train-val-test as 10%-10%-80%. For each method, the mean and standard deviation are shown for

10

independent runs.

	Cora			Citeseer			Pubmed
Central GNN	82.06 $\pm$ 1.09			69.17 $\pm$ 0.90			87.71 $\pm$ 0.34
	5 clients	10 clients	20 clients	5 clients	10 clients	20 clients	5 clients	10 clients	20 clients
FL GNN	68.47 $\pm$ 1.12	65.06 $\pm$ 1.92	64.59 $\pm$ 1.56	64.92 $\pm$ 1.84	63.47 $\pm$ 1.29	63.78 $\pm$ 1.46	86.14 $\pm$ 0.38	85.91 $\pm$ 0.42	85.31 $\pm$ 0.56
FedSage+	68.42 $\pm$ 1.73	66.32 $\pm$ 1.81	64.19 $\pm$ 1.54	64.65 $\pm$ 1.28	63.37 $\pm$ 1.45	62.69 $\pm$ 1.35	85.89 $\pm$ 0.39	85.37 $\pm$ 0.55	85.14 $\pm$ 0.28
FedSage-true	81.74 $\pm$ 1.29	80.85 $\pm$ 1.55	79.96 $\pm$ 1.27	69.52 $\pm$ 1.57	68.91 $\pm$ 1.30	69.26 $\pm$ 1.49	85.40 $\pm$ 0.77	85.66 $\pm$ 0.37	85.10 $\pm$ 0.80
FedStruct (Deg)	71.87 $\pm$ 2.52	68.64 $\pm$ 1.51	65.06 $\pm$ 2.12	66.33 $\pm$ 1.68	63.81 $\pm$ 1.31	61.50 $\pm$ 1.52	84.92 $\pm$ 0.53	85.09 $\pm$ 0.48	85.40 $\pm$ 0.52
FedStruct (GDV)	73.81 $\pm$ 1.99	72.19 $\pm$ 1.62	70.32 $\pm$ 1.64	66.56 $\pm$ 1.81	64.44 $\pm$ 0.84	62.89 $\pm$ 1.47	85.60 $\pm$ 0.44	86.12 $\pm$ 0.56	86.01 $\pm$ 0.49
FedStruct (N2V)	80.37 $\pm$ 1.13	80.68 $\pm$ 1.77	79.80 $\pm$ 1.36	67.84 $\pm$ 1.51	66.28 $\pm$ 1.04	65.33 $\pm$ 1.28	86.52 $\pm$ 0.43	87.09 $\pm$ 0.44	86.92 $\pm$ 0.52
FedStruct (H2V)	79.53 $\pm$ 1.52	80.28 $\pm$ 1.44	79.39 $\pm$ 0.90	67.33 $\pm$ 1.43	66.29 $\pm$ 0.87	65.24 $\pm$ 1.49	86.07 $\pm$ 0.35	86.65 $\pm$ 0.53	86.24 $\pm$ 0.56
Local GNN	49.60 $\pm$ 2.13	39.23 $\pm$ 1.21	32.25 $\pm$ 1.68	50.77 $\pm$ 1.78	38.99 $\pm$ 1.60	32.40 $\pm$ 1.57	82.58 $\pm$ 0.28	79.10 $\pm$ 0.54	73.31 $\pm$ 0.41

	Chameleon			Amazon Photo			Amazon Ratings
Central GNN	54.43 $\pm$ 2.44			93.99 $\pm$ 0.48			41.32 $\pm$ 1.03
	5 clients	10 clients	20 clients	5 clients	10 clients	20 clients	5 clients	10 clients	20 clients
FL GNN	39.98 $\pm$ 1.56	35.70 $\pm$ 2.29	34.72 $\pm$ 1.72	92.70 $\pm$ 0.28	90.74 $\pm$ 0.58	89.27 $\pm$ 0.55	37.11 $\pm$ 0.47	36.25 $\pm$ 0.46	36.22 $\pm$ 0.62
FedSage+	39.43 $\pm$ 1.65	35.48 $\pm$ 1.23	34.33 $\pm$ 1.49	92.57 $\pm$ 0.49	90.80 $\pm$ 0.48	89.39 $\pm$ 0.58	36.82 $\pm$ 0.53	36.18 $\pm$ 0.77	36.09 $\pm$ 0.33
FedSage-true	52.66 $\pm$ 1.73	50.75 $\pm$ 2.82	49.35 $\pm$ 1.54	92.94 $\pm$ 3.84	90.95 $\pm$ 5.27	92.60 $\pm$ 2.35	40.44 $\pm$ 0.92	39.98 $\pm$ 0.68	39.93 $\pm$ 0.55
FedStruct (Deg)	44.16 $\pm$ 1.82	41.01 $\pm$ 1.26	40.90 $\pm$ 1.83	90.83 $\pm$ 0.41	89.59 $\pm$ 0.88	88.39 $\pm$ 1.21	39.32 $\pm$ 0.49	39.08 $\pm$ 0.57	38.65 $\pm$ 0.61
FedStruct (GDV)	42.48 $\pm$ 2.17	39.54 $\pm$ 2.46	39.14 $\pm$ 2.35	90.93 $\pm$ 0.37	90.51 $\pm$ 0.45	89.23 $\pm$ 1.17	39.92 $\pm$ 0.61	39.53 $\pm$ 0.87	39.51 $\pm$ 0.36
FedStruct (N2v)	46.35 $\pm$ 1.83	43.49 $\pm$ 1.93	44.53 $\pm$ 2.05	91.62 $\pm$ 0.41	91.19 $\pm$ 1.05	90.06 $\pm$ 0.86	41.45 $\pm$ 0.43	41.43 $\pm$ 0.52	41.77 $\pm$ 0.44
FedStruct (H2V)	53.06 $\pm$ 1.28	52.36 $\pm$ 2.63	52.76 $\pm$ 1.69	91.99 $\pm$ 0.39	91.83 $\pm$ 0.39	90.41 $\pm$ 1.32	41.18 $\pm$ 0.31	41.23 $\pm$ 0.38	41.16 $\pm$ 0.52
Local GNN	35.52 $\pm$ 1.65	28.51 $\pm$ 1.59	24.89 $\pm$ 2.19	88.19 $\pm$ 0.61	79.66 $\pm$ 1.15	63.16 $\pm$ 0.98	34.52 $\pm$ 0.84	32.81 $\pm$ 0.46	31.29 $\pm$ 0.52

To demonstrate the performance and versatility of FedStruct, we conduct experiments on six datasets pertaining to node classification under scenarios with varying i) amounts of labeled training nodes, ii) number of clients, and iii) data partitioning methods. The interconnections between clients heavily depend on the latter. The datasets considered are: Cora (Sen et al., 2008), Citeseer (Sen et al., 2008), Pubmed (Namata et al., 2012), Chameleon (Pei et al., 2020), Amazon Photo (Shchur et al., 2018), and Amazon Ratings (Platonov et al., 2023). Statistics of the different datasets are given in Appendix J along with their edge homophily ratio, measuring the proportion of edges connecting nodes of similar labels, ranging between 0 (heterophilic) to 1 (homophilic) (Zhu et al., 2020).

Experimental setting. We focus on a strongly semi-supervised setting where data is split into training, validation, and test sets containing $10\%$ , $10\%$ , and $80\%$ of the nodes, respectively. We artificially partition the datasets into interconnected subgraphs that are then allocated to the clients. Here, we consider a random partitioning of the datasets, which assigns nodes to subgraphs uniformly at random, resulting in significant interconnectivity between subgraphs. This partitioning is relevant in, e.g., transaction networks where accounts do not necessarily have a preference to interact with nodes in the same subgraph. The random partitioning constitutes the most challenging scenario as the number of interconnections is high and, hence, the learning scheme must exploit such connections. In Appendix K, we also provide details and results for two other partitionings using the Louvain algorithm, as in (Zhang et al., 2021), and the K-means algorithm (Lei et al., 2023).

To provide upper and lower benchmarks, all experiments are also conducted for a centralized setting utilizing the global graph and a localized setting using only the local subgraphs. In Appendix K we also give results for an MLP approach to highlight the importance of utilizing the spatial structure within the data. For the GNN, we rely on GraphSage (Hamilton et al., 2017) with two or three layers, depending on the dataset. We further compare FedStruct to vanilla FL (FL GNN), which does not exploit interconnections, and FedSage+ (Zhang et al., 2021). FedSage+ does not know to what clients the missing neighbors belong to. To compare apples-to-apples, we provide a version of FedSage+ (and FedNI), denoted FedSage-true, where this information is known and the in-painting of missing one-hop neighbors is replaced by their true features. This method provides an upper bound on the performance of FedSage+ and FedNI. For the results pertaining to FedStruct, we consider four different methods to create NSFs: one-hot degree vector (Deg), GDV, Node2Vec (N2V), and the task-dependent approach proposed in Section 5.1 Hop2Vec (H2V), see Appendix A for details.

Overall performance. In Table 1, we report the node classification accuracy on the different datasets for a random partitioning over $10$ independent runs. The accuracy difference between central GNN and local GNN provides insights into the potential gains by utilizing FL. For example, on the Cora and Amazon Ratings datasets with $10$ clients, the gap is $43.78\%$ and $9.05\%$ , respectively. For these datasets, FL GNN closes the gap to $16.94\%$ and $5.24\%$ , respectively. FedSage+ yields similar performance as FL GNN for all datasets. Incorporating the true node features of the mended nodes (FedSage-true) brings performance close to the centralized approach. Note, however, that access to the node features in missing neighbors completely violates privacy.

For FedStruct, we consider a decoupled GCN as the underlying GNN, in which case FedStruct-A and FedStruct-B yield identical performance, which follows from Theorem 5. From Table 1, it can be seen that FedStruct further improves performance compared to FL GNN and FedSage+. The improvement is significant for the Cora and Chameleon datasets. For Chameleon and 20 clients, FedStruct with Hop2Vec achieves an accuracy of 52.76% compared to 34.33% for FedSage+. The table also shows that the performance improves with structure features able to collect more global information as can be seen by the improvements between DEG, GDV, and Node2Vec. Hop2Vec achieves similar performance to Node2Vec for all datasets except for Chameleon, for which it significantly outperforms Node2Vec. We further recall that Node2Vec cannot be used in the scenario without knowledge of the global graph connections.

Notably, FedStruct with Hop2Vec achieves performance close to a centralized approach. For example, on Cora and Amazon Photo, the gap between the central GNN and Hop2Vec is $2.56\%$ and $0.3\%$ , respectively. Gaps of similar sizes are observed across all datasets. Furthermore, compared to, e.g., FL GNN, FedStruct remains robust to an increasing number of clients and, as shown in Appendix K, across different partitioning methods, highlighting its ability to exploit the interconnections between clients. Additional results are provided in Appendix K for various partitionings, varying number of labeled training nodes, and varying degrees of heterophily.

8 Concluding Remarks

We presented FedStruct, a framework for FL over interconnected graphs that decouples node and structure features, explicitly exploiting deep structural dependencies. The distinctive advantages of FedStruct are outlined below.

Privacy. Unlike existing subgraph FL frameworks, which require sharing original or generated node features and/or embeddings among clients, FedStruct eliminates this need, sharing less sensitive information. In Appendix L.1, we discuss privacy considerations of FedStruct.

Semi-supervised learning. FedStruct shows excellent performance, close to that of a centralized approach, even in heavily semi-supervised learning scenarios with a low fraction of labeled nodes (see Appendix K) and significantly outperforms FedSage+ in these challenging scenarios.

Heterophily. FedStruct yields outstanding performance for heavily heterophilic datasets, significantly outperforming FedSage+ in these settings (see Appendix K).

Robustness across different partitionings. In contrast to other approaches like FedSage+, FedStruct exhibits robustness accross different partitionings (Appendix K) and varying number of clients.

Task-dependent NSF. The proposed Hop2Vec does not require knowledge of the global graph, contrary to well-known methods such as Node2Vec and GDV.

References

Abu-El-Haija et al. (2019) Abu-El-Haija, S., Perozzi, B., Kapoor, A., Alipourfard, N., Lerman, K., Harutyunyan, H., Ver Steeg, G., and Galstyan, A. Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In ICML, 2019.
Bouritsas et al. (2023) Bouritsas, G., Frasca, F., Zafeiriou, S., and Bronstein, M. M. Improving graph neural network expressivity via subgraph isomorphism counting. IEEE Transactions on Pattern Analysis & Machine Intelligence, 45(1), 2023.
Chen et al. (2021) Chen, F., Li, P., Miyazaki, T., and Wu, C. Fedgraph: Federated graph learning with intelligent sampling. IEEE Transactions on Parallel and Distributed Systems, 33(8):1775–1786, 2021.
Cui et al. (2022) Cui, H., Lu, Z., Li, P., and Yang, C. On positional and structural node features for graph neural networks on non-attributed graphs. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022.
Dong et al. (2021) Dong, H., Chen, J., Feng, F., He, X., Bi, S., Ding, Z., and Cui, P. On the equivalence of decoupled graph convolution network and label propagation. In Proceedings of the Web Conference, 2021.
Du & Wu (2022) Du, B. and Wu, C. Federated graph learning with periodic neighbour sampling. In IEEE International Symposium on Quality of Service (IWQoS), 2022.
Fan et al. (2019) Fan, W., Ma, Y., Li, Q., He, Y., Zhao, E., Tang, J., and Yin, D. Graph neural networks for social recommendation. In The world wide web conference, 2019.
Fox & Rajamanickam (2019) Fox, J. and Rajamanickam, S. How robust are graph neural networks to structural noise? arXiv:1912.10206, 2019.
Grover & Leskovec (2016) Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In International conference on Knowledge discovery and data mining (SIGKDD), 2016.
Hamilton et al. (2017) Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. Neurips, 30, 2017.
Jiang & Luo (2022) Jiang, W. and Luo, J. Graph neural network for traffic forecasting: A survey. Expert Systems with Applications, 207, 2022.
Kipf & Welling (2017) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
Lei et al. (2023) Lei, R., Wang, P., Zhao, J., Lan, L., Tao, J., Deng, C., Feng, J., Wang, X., and Guan, X. Federated learning over coupled graphs. IEEE Transactions on Parallel and Distributed Systems, 34(4), 2023.
Li et al. (2018) Li, Q., Han, Z., and Wu, X.-M. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI conference on artificial intelligence, volume 32, 2018.
Liu et al. (2023) Liu, L., Chen, J., Zhu, S., Zhang, J., and Zhang, C. Federated graph learning with cross-subgraph missing links recovery. In IEEE International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA), volume 3, 2023.
Liu et al. (2020) Liu, M., Gao, H., and Ji, S. Towards deeper graph neural networks. In International conference on knowledge discovery & data mining (SIGKDD), 2020.
Liu et al. (2022) Liu, R., Xing, P., Deng, Z., Li, A., Guan, C., and Yu, H. Federated graph neural networks: Overview, techniques and challenges. arXiv:2202.07256, 2022.
McMahan et al. (2017) McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, 2017.
Namata et al. (2012) Namata, G., London, B., Getoor, L., Huang, B., and Edu, U. Query-driven active surveying for collective classification. In International workshop on mining and learning with graphs (MLG), volume 8, 2012.
Pei et al. (2020) Pei, H., Wei, B., Chang, K. C.-C., Lei, Y., and Yang, B. Geom-GCN: Geometric graph convolutional networks. In ICLR, 2020.
Peng et al. (2022) Peng, L., Wang, N., Dvornek, N., Zhu, X., and Li, X. Fedni: Federated graph learning with network inpainting for population-based disease prediction. IEEE Transactions on Medical Imaging, 2022.
Platonov et al. (2023) Platonov, O., Kuznedelev, D., Diskin, M., Babenko, A., and Prokhorenkova, L. A critical look at evaluation of gnns under heterophily: Are we really making progress? In ICLR, 2023.
Pržulj (2007) Pržulj, N. Biological network comparison using graphlet degree distribution. Bioinformatics, 23(2), 2007.
Ribeiro et al. (2017) Ribeiro, L. F., Saverese, P. H., and Figueiredo, D. R. struc2vec: Learning node representations from structural identity. In International conference on knowledge discovery and data mining (SIGKDD), 2017.
Sen et al. (2008) Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI magazine, 29(3), 2008.
Shchur et al. (2018) Shchur, O., Mumme, M., Bojchevski, A., and Günnemann, S. Pitfalls of graph neural network evaluation. arXiv:1811.05868, 2018.
Stokes et al. (2020) Stokes, J. M., Yang, K., Swanson, K., Jin, W., Cubillos-Ruiz, A., Donghia, N. M., MacNair, C. R., French, S., Carfrae, L. A., Bloom-Ackermann, Z., et al. A deep learning approach to antibiotic discovery. Cell, 180(4), 2020.
Tan et al. (2023) Tan, Y., Liu, Y., Long, G., Jiang, J., Lu, Q., and Zhang, C. Federated learning on non-iid graphs via structural knowledge sharing. In AAAI conference on artificial intelligence, volume 37, 2023.
Veličković et al. (2018) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In ICLR, 2018.
Xu et al. (2019) Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In ICLR, 2019.
Zhang et al. (2021) Zhang, K., Yang, C., Li, X., Sun, L., and Yiu, S. M. Subgraph federated learning with missing neighbor generation. Neurips, 34, 2021.
Zhang et al. (2022) Zhang, K., Xie, H., Gu, Z., Li, X., Sun, L., Yiu, S. M., Yao, Y., and Yang, C. Subgraph federated learning over heterogeneous graphs. In FedGraph, 2022.
Zhang et al. (2024) Zhang, K., Sun, L., Ding, B., Yiu, S. M., and Yang, C. Deep efficient private neighbor generation for subgraph federated learning. In Siam Conference on Data Mining (SDM), 2024.
Zheng et al. (2022) Zheng, X., Liu, Y., Pan, S., Zhang, M., Jin, D., and Yu, P. S. Graph neural networks for graphs with heterophily: A survey. arXiv:2202.07082, 2022.
Zhu et al. (2020) Zhu, J., Yan, Y., Zhao, L., Heimann, M., Akoglu, L., and Koutra, D. Beyond homophily in graph neural networks: Current limitations and effective designs. Neurips, 33, 2020.

Appendix A Node Structure Feature Generation

In this appendix, we provide some background on the structure-feature creation. The simplest method considered in this paper is the one-hot degree encoding (Deg), which simply creates a vector with a single non-zero entry whose position indicates the node degree. Naturally, the one-hot encoding is unable to capture anything beyond 1-hop neighbors.

Graphlet degree vectors (GDVs) go beyond the 1-hop neighborhood by representing the local structure of a node within the node structure feature (NSF). This is achieved by, for each node, counting the number of occurrences within a pre-defined set of graphlets (Pržulj, 2007). Although GDV is more expressive than Deg, this approach suffers from high complexity as the number of graphlets grows very fast with the number of nodes.

Node2Vec is an unsupervised node embedding algorithm that creates NSFs from random walks in the graph by capturing both local and global properties (Grover & Leskovec, 2016). It uses controlled random walks to explore the graph, generating node sequences similar to sentences in a language. For each node in the graph, multiple random walks are performed creating a large corpus of node sequences. These sequences are then used to train a skip-gram model to predict the context of a given node, i.e., the neighboring nodes in the random walks. Once trained, the NSFs are obtained by querying the skip-gram model with a node and extracting its latent representation.

In the context of FedStruct, Node2Vec shows superior performance compared to Deg and GDV, see Section 7 and Appendix J, due to its ability to capture global properties of the graph, higher expressivity from using continuous representation, and is optimized for the given graph. Additionally, by proper tuning, the NSFs may rely on either homophily or on structural equivalence (Grover & Leskovec, 2016). However, the objective function of Node2Vec contains multiple hyper parameters and is task agnostic, i.e., its representations do not account for the node classification task at hand. Furthermore, Node2Vec requires access to the entire adjacency matrix in order to collect random walks. Therefore, it cannot be used for the scenario with no knowledge of the global graph.

To leverage the advantages of Node2Vec but also alleviate its shortcomings for our setting, we propose Hop2Vec. In Hop2Vec, instead of learning the NSFs before FedStruct initiates, we start from a random NSF and train a generator within FedStruct. As a result, we obtain NSFs that are tailored toward the task, e.g., to minimize miss-classification. Notably, in sharp contrast to Node2Vec, Hop2Vec does not require knowledge of the complete adjacency matrix. Moreover, compared to Node2Vec, Hop2Vec is faster and does not need any hyperparameter tuning. Finally, when paired with decoupled GCN (DGCN), it is able to capture the global properties of the graph. A summary of the different NSF-generating methods is given in Table 2.

Table 2: Comparing NSF generator algorithms

Data	Deg	GDV	Node2Vec	Hop2Vec
task dependent	✗	✗	✗	✓
global properties	✗	✗	✓	✓DGCN
fast	✓	✗	✗	✓
no parameter tuning	✓	✓	✗	✓
locally computable	✓	✗	✗	✓

Appendix B Proof of 1

To prove 1, we need the following lemma.

Lemma 1.

Let $\hat{\boldsymbol{y}}=\mathrm{softmax}(\boldsymbol{z})$ and $\mathcal{L}=\mathrm{CE}(\boldsymbol{y},\hat{\boldsymbol{y}})$ , where $\mathrm{CE}$ is the cross-entropy loss function and $\boldsymbol{y}$ is the label corresponding to the input vector $\boldsymbol{z}$ , with $\sum_{i=1}^{c}{y_{i}}=1$ and $c$ being the number of classes. The gradient of $\mathcal{L}$ with respect to $\boldsymbol{z}$ is equal to

\nabla_{\boldsymbol{z}}\mathcal{L}=\hat{\boldsymbol{y}}-\boldsymbol{y}\,.

(25)

Proof.

Using the definition of cross-entropy

\displaystyle\mathrm{CE}(\boldsymbol{y},\hat{\boldsymbol{y}})=-\sum_{i=1}^{c}{y_{i}\mathrm{log}(\hat{y}_{i})}\,,

(26)

and the definition of softmax,

\displaystyle\mathrm{softmax}(\boldsymbol{z})=||\left(\frac{e^{z_{i}}}{\sum_{j=1}^{c}{e^{z_{j}}}}\quad\forall i\in[c]\right)\,,

(27)

we can rewrite $\mathcal{L}$ as

$\displaystyle\mathcal{L}$	$\displaystyle=\sum_{i=1}^{c}{y_{i}\big{(}\mathrm{LSE}(\boldsymbol{z})-z_{i}\big{)}}$
	$\displaystyle=\mathrm{LSE}(\boldsymbol{z})\sum_{i=1}^{c}{y_{i}}-\sum_{i=1}^{c}{y_{i}z_{i}}$
	$\displaystyle=\mathrm{LSE}(\boldsymbol{z})-\boldsymbol{z}^{\mathsf{T}}\boldsymbol{y}\,,$	(28)

where $\mathrm{LSE}(\boldsymbol{z})=\mathrm{log}(\sum_{j=1}^{c}{e^{z_{j}}})$ is the log-sum-exp function. The partial derivative of $\mathrm{LSE}(\boldsymbol{z})$ with respect to $\boldsymbol{z}$ is the $\mathrm{softmax}$ function

$\displaystyle\nabla_{\boldsymbol{z}}\mathrm{LSE}(\boldsymbol{z})$	$\displaystyle=\|\|(\frac{\partial{\mathrm{LSE}(\boldsymbol{z})}}{\partial{z_{k}}}\quad\forall k\in[c])$
	$\displaystyle=\|\|(\frac{e^{z_{k}}}{\sum_{j=1}^{c}{e^{z_{j}}}}\quad\forall k\in[c])$
	$\displaystyle=\mathrm{softmax}(\boldsymbol{z})\,.$	(29)

Therefore, using (B) and taking the derivative of (B) with respect to $\boldsymbol{z}$ leads to

\displaystyle\nabla_{\boldsymbol{z}}\mathcal{L}

\displaystyle=\nabla_{\boldsymbol{z}}\mathrm{LSE}(\boldsymbol{z})-\nabla_{\boldsymbol{z}}(\boldsymbol{z}^{\mathsf{T}}\boldsymbol{y})=\hat{\boldsymbol{y}}-\boldsymbol{y}\,.

∎

Using 1, we can now prove 1. By the chain rule and 1, we have

	$\displaystyle\frac{\partial{\mathcal{L}_{i}(\boldsymbol{\theta})}}{\partial{\theta_{j}}}$	$\displaystyle=\sum_{v\in\tilde{\mathcal{V}}_{i}}{\frac{\partial{\boldsymbol{z}_{v}}}{\partial{\theta_{j}}}\nabla_{\boldsymbol{z}_{v}}\mathcal{L}_{i}(\boldsymbol{\theta})}$
		$\displaystyle=\sum_{v\in\tilde{\mathcal{V}}_{i}}{\frac{\partial{\boldsymbol{z}_{v}}}{\partial{\theta_{j}}}(\hat{\boldsymbol{y}}_{v}-\boldsymbol{y}_{v})}\quad\forall j\in[\|\boldsymbol{\theta}\|]\,,$		(30)

where $\boldsymbol{z}_{v}=(\boldsymbol{\Theta}_{\mathsf{c}}^{(\mathsf{s})})^{\mathsf{T}}\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}+(\boldsymbol{\Theta}_{\mathsf{c}}^{(\mathsf{f})})^{\mathsf{T}}\boldsymbol{h}_{v}^{(L)}$ . Taking the derivative of $\boldsymbol{z}_{v}$ with respect to different entries of $\boldsymbol{\theta}$ leads to

$\displaystyle\frac{\partial{\boldsymbol{z}_{v}}}{\partial{\theta_{\mathsf{s},j}}}$	$\displaystyle=\frac{\partial{\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}}}{\partial{\theta_{\mathsf{s},j}}}\boldsymbol{\Theta}_{\mathsf{c}}^{(\mathsf{s})}\quad\forall j\in[\|\boldsymbol{\theta}_{\mathsf{s}}\|]$	(31)
$\displaystyle\frac{\partial{\boldsymbol{z}_{v}}}{\partial{\theta_{\mathsf{f},j}}}$	$\displaystyle=\frac{\partial{\boldsymbol{h}_{v}^{(L)}}}{\partial{\theta_{\mathsf{f},j}}}\boldsymbol{\Theta}_{\mathsf{c}}^{(\mathsf{f})}\quad\forall j\in[\|\boldsymbol{\theta}_{\mathsf{f}}\|]$	(32)
$\displaystyle\frac{\partial{\boldsymbol{z}_{v}}}{\partial{\theta_{\mathsf{c},j}^{(\mathsf{s})}}}$	$\displaystyle=\frac{\partial{\big{(}(\boldsymbol{\Theta}_{\mathsf{c}}^{(\mathsf{s})})^{\mathsf{T}}\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}\big{)}}}{\partial{\theta_{\mathsf{c},j}^{(\mathsf{s})}}}\quad\forall j\in[\|\boldsymbol{\theta}_{\mathsf{c}}^{(\mathsf{s})}\|]$	(33)
$\displaystyle\frac{\partial{\boldsymbol{z}_{v}}}{\partial{\theta_{\mathsf{c},j}^{(\mathsf{f})}}}$	$\displaystyle=\frac{\partial{\big{(}(\boldsymbol{\Theta}_{\mathsf{c}}^{(\mathsf{f})})^{\mathsf{T}}\boldsymbol{h}_{v}^{(L)}\big{)}}}{\partial{\theta_{\mathsf{c},j}^{(\mathsf{f})}}}\quad\forall j\in[\|\boldsymbol{\theta}_{\mathsf{c}}^{(\mathsf{f})}\|]\,.$	(34)

Substituting (31), (32), (33), and (34) into (B) concludes the proof.

Appendix C Proof of 2

Using the chain rule and 1, we have

	$\displaystyle\nabla_{\boldsymbol{s}_{q}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})$	$\displaystyle=\sum_{v\in\tilde{\mathcal{V}}_{i}}{\frac{\partial{\boldsymbol{z}_{v}}}{\partial{\boldsymbol{s}_{q}}}\nabla_{\boldsymbol{z}_{v}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})}$
		$\displaystyle=\sum_{v\in\tilde{\mathcal{V}}_{i}}{\frac{\partial{\boldsymbol{z}_{v}}}{\partial{\boldsymbol{s}_{q}}}(\hat{\boldsymbol{y}}_{v}-\boldsymbol{y}_{v})}\quad\forall q\in\mathcal{V}\,,$		(35)

\displaystyle\frac{\partial{\boldsymbol{z}_{v}}}{\partial{\boldsymbol{s}_{q}}}=\frac{\partial{\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}}}{\partial{\boldsymbol{s}_{q}}}\boldsymbol{\Theta}_{\mathsf{c}}^{(\mathsf{s})}\quad\forall q\in\mathcal{V}\,.

(36)

Substituting (36) into (C) concludes the proof.

Appendix D Task-Dependent FedStruct-A Algorithm

The framework FedStruct-A with Hop2Vec is described in Algorithm 2.

Algorithm 2 Task-dependent FedStruct-A algorithm

0: global graph

\mathcal{G}

, sGNN and fGNN functions, NSF generator function

\boldsymbol{Q}_{NSF}

, K client with their respective subgraph

\{\mathcal{G}_{i}\}_{i=1}^{K}

FedStruct model parameters

\boldsymbol{\theta}=(\boldsymbol{\theta}_{\mathsf{f}}||\boldsymbol{\theta}_{\mathsf{s}}||\boldsymbol{\theta}_{\mathsf{c}})

Create

\boldsymbol{S}

by randomly initialize

\boldsymbol{s}_{v}\quad\forall v\in\mathcal{V}

for e=1 to Epochs do

\boldsymbol{S}^{(L_{\mathsf{s}})}\leftarrow\text{sGNN}_{\boldsymbol{\theta}_{\mathsf{s}}}(\boldsymbol{S},\mathcal{E})

for i=1 to K do

Client

i

collects

\boldsymbol{\theta}

from the server

Client

i

collects

\{\boldsymbol{s}_{v}^{(L_{\mathsf{s}})},\forall v\in\tilde{\mathcal{V}}_{i}\}

from the server

Client

i

collects

\{\frac{\partial{\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}}}{\partial{\boldsymbol{\theta}_{\mathsf{s}}}},\forall v\in\tilde{\mathcal{V}}_{i}\}

from the server

Client

i

collects

\{\frac{\partial{\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}}}{\partial{\boldsymbol{s}_{q}}},\forall v\in\tilde{\mathcal{V}}_{i},q\in\mathcal{V}\}

from the server

\boldsymbol{H}_{i}^{(L)}=\text{fGNN}_{\boldsymbol{\theta}_{\mathsf{f}}}(\boldsymbol{X}_{i},\mathcal{E}_{i})

for

v\in\tilde{\mathcal{V}}_{i}

\hat{\boldsymbol{y}}_{v}=\mathrm{softmax}(\boldsymbol{\Theta}_{\mathsf{c}}^{\mathsf{T}}.(\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}||\boldsymbol{h}_{v}^{(L)}))

end for

Calculate

\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})

based on (7)

Calculate

\nabla_{\boldsymbol{\theta}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})

from 1

Calculate

\nabla_{\boldsymbol{S}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})

from 2

Send

\nabla_{\boldsymbol{\theta}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})

and

\nabla_{\boldsymbol{S}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})

to the server

end for

Calculate

\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta},\boldsymbol{S})

based on (8)

Calculate

\nabla_{\boldsymbol{S}}\mathcal{L}(\boldsymbol{\theta},\boldsymbol{S})=\frac{1}{|\tilde{\mathcal{V}}|}\sum_{i=1}^{K}{\nabla_{\boldsymbol{S}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})}

\boldsymbol{\theta}\leftarrow\boldsymbol{\theta}-\lambda\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta},\boldsymbol{S})

\boldsymbol{S}\leftarrow\boldsymbol{S}-\lambda_{\mathsf{s}}\nabla_{\boldsymbol{S}}\mathcal{L}(\boldsymbol{\theta},\boldsymbol{S})

end for

Appendix E Proof of 4

Using (B) we have

\displaystyle\frac{\partial{\mathcal{L}_{i}(\boldsymbol{\theta})}}{\partial{\theta_{j}}}=\sum_{v\in\tilde{\mathcal{V}}_{i}}{\frac{\partial{\boldsymbol{z}_{v}}}{\partial{\theta_{j}}}(\hat{\boldsymbol{y}}_{v}-\boldsymbol{y}_{v})}\,,

where $\boldsymbol{z}_{v}=\sum_{u\in\mathcal{V}}{\bar{a}_{vu}\boldsymbol{g}_{\boldsymbol{\theta}_{\mathsf{s}}}(\boldsymbol{s}_{u})}+\sum_{u\in\mathcal{V}_{i}}{\bar{a}_{vu}^{(i)}\boldsymbol{f}_{\boldsymbol{\theta}_{\mathsf{f}}}(\boldsymbol{x}_{u})}$ . Taking the derivative of $\boldsymbol{z}_{v}$ with respect to different entries of $\boldsymbol{\theta}$ leads to

	$\displaystyle\frac{\partial{\boldsymbol{z}_{v}}}{\partial{\theta_{\mathsf{s},j}}}$	$\displaystyle=\sum_{u\in\mathcal{V}}{\bar{a}_{vu}\frac{\partial{\boldsymbol{g}_{\boldsymbol{\theta}_{\mathsf{s}}}(\boldsymbol{s}_{u})}}{\partial{\theta_{\mathsf{s},j}}}}\quad\forall j\in[\|\boldsymbol{\theta}_{\mathsf{s}}\|]$		(37)
	$\displaystyle\frac{\partial{\boldsymbol{z}_{v}}}{\partial{\theta_{\mathsf{f},j}}}$	$\displaystyle=\sum_{u\in\mathcal{V}}{\bar{a}_{vu}\frac{\partial{\boldsymbol{h}_{\boldsymbol{\theta}_{\mathsf{f}}}(\boldsymbol{s}_{u})}}{\partial{\theta_{\mathsf{f},j}}}}\quad\forall j\in[\|\boldsymbol{\theta}_{\mathsf{f}}\|]$		(38)

Substituting (37) and (38) into (B) and separating the summation over $\mathcal{V}$ over the different clients concludes the proof.

Appendix F Proof of 6

First notice that

\displaystyle\nabla_{\boldsymbol{S}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})=||\big{(}(\nabla_{\boldsymbol{s}_{q}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S}))^{\mathsf{T}}\quad\forall q\in\mathcal{V}\big{)}\,.

(39)

Using (C) we have

\displaystyle\nabla_{\boldsymbol{s}_{q}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})

\displaystyle=\sum_{v\in\tilde{\mathcal{V}}_{i}}{\frac{\partial{\boldsymbol{z}_{v}}}{\partial{\boldsymbol{s}_{q}}}(\hat{\boldsymbol{y}}_{v}-\boldsymbol{y}_{v})}\quad\forall q\in\mathcal{V}\,,

where $\boldsymbol{z}_{v}=\sum_{u\in\mathcal{V}}{\bar{a}_{vu}\boldsymbol{s}_{u}}+\sum_{u\in\mathcal{V}_{i}}{\bar{a}_{vu}^{(i)}\boldsymbol{f}_{\boldsymbol{\theta}_{\mathsf{f}}}(\boldsymbol{x}_{u})}$ . Taking the derivative of $\boldsymbol{z}_{v}$ with respect to $\boldsymbol{s}_{q}$ leads to

$\displaystyle\frac{\partial{\boldsymbol{z}_{v}}}{\partial{\boldsymbol{s}_{q}}}$	$\displaystyle=\sum_{u\in\mathcal{V}}{\bar{a}_{vu}\frac{\partial{\boldsymbol{s}_{u}}}{\partial{\boldsymbol{s}_{q}}}}$
	$\displaystyle=\bar{a}_{vq}\frac{\partial{\boldsymbol{s}_{q}}}{\partial{\boldsymbol{s}_{q}}}$
	$\displaystyle=\bar{a}_{vq}\boldsymbol{I}\quad\forall q\in\mathcal{V}$	(40)

Substitute (F) into (C) leads to

\displaystyle\nabla_{\boldsymbol{s}_{q}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})

\displaystyle=\sum_{v\in\tilde{\mathcal{V}}_{i}}{\bar{a}_{vq}(\hat{\boldsymbol{y}}_{v}-\boldsymbol{y}_{v})}\quad\forall q\in\mathcal{V}

(41)

Substitute (41) into (39) validate the theorem.

Appendix G Scheme to Obtain $\boldsymbol{\bar{A}}^{[i]}$

As discussed in Section 6, FedStruct with decoupled GCN only requires the clients to have access to the local partition of the global $L$ -hop combined adjacency matrix, i.e., $\boldsymbol{\bar{A}}^{[i]}$ . In this appendix, we present an algorithm to enable clients to privately acquire this matrix.

We denote by $\boldsymbol{\tilde{A}}^{[i]}=||(\boldsymbol{\tilde{a}}_{v}^{\mathsf{T}},\quad\forall v\in\tilde{\mathcal{V}}_{i})$ and $\boldsymbol{\hat{A}}^{[i]}=||(\boldsymbol{\hat{a}}_{v}^{\mathsf{T}},\quad\forall v\in\tilde{\mathcal{V}}_{i})$ the local partition of $\boldsymbol{\tilde{A}}$ and $\boldsymbol{\hat{A}}$ pertaining to the labeled nodes of client $i$ , respectively. Further, note that $\boldsymbol{\hat{A}}^{[i]}=\big{(}\boldsymbol{\tilde{D}}^{[i]}\big{)}^{-1}\boldsymbol{\tilde{A}}^{[i]}$ , where $\boldsymbol{\tilde{D}}^{[i]}\in\mathbb{R}^{|\tilde{\mathcal{V}}_{i}|\times|\tilde{\mathcal{V}}_{i}|}$ is the diagonal matrix of node degrees, with $\boldsymbol{\tilde{D}}^{[i]}(v,v)=\sum_{u\in\mathcal{V}}{\tilde{a}_{vu}}$ , $\forall v\in\tilde{\mathcal{V}}_{i}$ . Also, let $\boldsymbol{\tilde{A}}^{[i]}_{j}\in\mathbb{R}^{|\tilde{\mathcal{V}}_{i}|\times|\mathcal{V}_{j}|}$ denote the entries of $\boldsymbol{\tilde{A}}^{[i]}$ that connects client $i$ to client $j$ for $j\in[K]$ and define $\boldsymbol{\hat{A}}^{[i]}_{j}$ analogously. Consequently, we have $\boldsymbol{\hat{A}}^{[i]}_{j}=\big{(}\boldsymbol{\tilde{D}}^{[i]}\big{)}^{-1}\boldsymbol{\tilde{A}}^{[i]}_{j}$ . We note that client $i$ may compute $\boldsymbol{\hat{A}}^{[i]}_{j}$ locally as it knows the destination of its outgoing edges, i.e., $\boldsymbol{\tilde{A}}^{[i]}_{j}$ for all $j\in[K]$ , and can compute $\boldsymbol{\tilde{D}}^{[i]}$ .

In client $i$ , we are interested in acquiring its local partition of $\boldsymbol{\bar{A}}$ , i.e., $\boldsymbol{\bar{A}}^{[i]}$ . For $L_{s}=2$ , we have

\displaystyle\boldsymbol{\bar{A}}^{[i]}_{j}

\displaystyle=\beta_{1}\boldsymbol{\hat{A}}^{[i]}_{j}+\beta_{2}\sum_{k\in[K]}\boldsymbol{\hat{A}}^{[i]}_{k}\boldsymbol{\hat{A}}^{[k]}_{j},\quad j\in[K]\,.

(42)

We introduce the notation $[\boldsymbol{\hat{A}}^{[i]}_{j}]^{\ell}=\sum_{k\in[K]}\boldsymbol{\hat{A}}^{[i]}_{k}\big{[}\boldsymbol{\hat{A}}^{[k]}_{j}\big{]}^{\ell-1}$ and write

	$\displaystyle[\boldsymbol{\hat{A}}^{[i]}_{j}]^{2}$	$\displaystyle=\sum_{k\in[K]}\boldsymbol{\hat{A}}^{[i]}_{k}\boldsymbol{\hat{A}}^{[k]}_{j},\quad j\in[K]$		(43)
		$\displaystyle=(\boldsymbol{\tilde{D}}^{[i]})^{-1}\sum_{k\in[K]}\underbrace{\boldsymbol{\tilde{A}}^{[i]}_{k}\boldsymbol{\hat{A}}^{[k]}_{j}}_{\text{computed at client }k},\quad j\in[K]\,.$		(44)

From (44), we note that client $i$ will have stored $[\boldsymbol{\hat{A}}^{[i]}_{j}]^{2}$ for all $j\in[K]$ and, therefore, also $[\boldsymbol{\hat{A}}^{[i]}]^{2}$ . For integer $\ell\geq 2$ , we can leverage the information that was acquired in the previous rounds as

\displaystyle[\boldsymbol{\hat{A}}^{[i]}_{j}]^{\ell}=(\boldsymbol{\tilde{D}}_{i})^{-1}\sum_{k\in[K]}\underbrace{\boldsymbol{B}_{ijk}}_{\text{computed at client }k},\quad j\in[K]\,.

(45)

where $\boldsymbol{B}_{ijk}=\boldsymbol{\tilde{A}}^{[i]}_{k}[\boldsymbol{\hat{A}}^{[k]}_{j}]^{\ell-1}$ is an $n_{i}\times n_{j}$ matrix that can be calculated in client $k$ . Hence, the clients may compute $\boldsymbol{\bar{A}}^{[i]}_{j}$ sequentially as

\displaystyle\boldsymbol{\bar{A}}^{[i]}_{j}

\displaystyle=\beta_{1}\boldsymbol{\hat{A}}^{[i]}_{j}+\sum_{\ell=2}^{L_{s}}\beta_{\ell}[\boldsymbol{\hat{A}}^{[i]}_{j}]^{\ell},\quad j\in[K]\,.

(46)

The resulting algorithm is shown in Algorithm 3. Finally, we notice that Algorithm 3 requires each client to evaluate $L_{s}K^{2}$ matrix products and to communicate $L_{s}(K-1)K$ matrices. Although the complexity is large, we emphasize that this algorithm is performed only once before the training initiates.

Algorithm 3 Private acquisition of

\boldsymbol{\bar{A}}^{[i]}

\big{\{}\boldsymbol{\tilde{A}}^{[i]}_{j},\boldsymbol{\tilde{A}}^{[j]}_{i}\big{\}}_{j=1}^{K},\boldsymbol{\tilde{D}}^{[i]}

, power

L

, and list of weights

\{\beta_{\ell}\}_{\ell=1}^{L}

0: local matrix

\{\boldsymbol{\bar{A}}^{[i]}\}

for

i=1

K

for

j=1

K

Initialize

\boldsymbol{\bar{A}}^{[i]}_{j}=\beta_{1}(\boldsymbol{\tilde{D}}^{[i]})^{-1}\boldsymbol{\tilde{A}}^{[i]}_{j}

end for

for

\ell=2

L_{s}

for

i=1

K

Client

i

collects

\boldsymbol{B}_{ijk}=\boldsymbol{\tilde{A}}^{[i]}_{k}[\boldsymbol{\hat{A}}^{[k]}_{j}]^{\ell-1}

from clients

k\in[K]

Client

i

stores

[\boldsymbol{\hat{A}}^{[i]}_{j}]^{\ell}=(\boldsymbol{\tilde{D}}^{[i]})^{-1}\sum_{k=1}^{K}\boldsymbol{B}_{ijk}

for

j\in[K]

\boldsymbol{\bar{A}}^{[i]}_{j}\leftarrow\textasciiacute\boldsymbol{\bar{A}}^{[i]}_{j}+\beta_{\ell}[\boldsymbol{\hat{A}}^{[i]}_{j}]^{\ell}

for

j\in[K]

end for

Appendix H FedStruct-B Algorithm

The FedStruct-B framework is described in Algorithm 4.

Algorithm 4 FedStruct-B

0: K client with their respective subgraph

\{\mathcal{G}_{i}\}_{i=1}^{K}

FedStruct model parameters

\boldsymbol{\theta}=(\boldsymbol{\theta}_{\mathsf{f}}||\boldsymbol{\theta}_{\mathsf{s}})

for i=1 to K do

Collaboratively obtain

\boldsymbol{\bar{A}}^{[i]}

based on Algorithm 3

Locally compute NSFs

\mathcal{S}_{i}=\{\boldsymbol{s}_{u},\forall u\in\mathcal{V}_{i}\}

\mathcal{S}_{i}

with all the clients

end for

for e=1 to Epochs do

for i=1 to K do

Client

i

collects

\boldsymbol{\theta}

from the server

for

v\in\tilde{\mathcal{V}}_{i}

Calculate

\hat{\boldsymbol{y}}_{v}

based on (20)

end for

Calculate

\mathcal{L}_{i}(\boldsymbol{\theta})

based on (7)

Calculate

\nabla_{\boldsymbol{\theta}}\mathcal{L}_{i}(\boldsymbol{\theta})

based on 4

Send

\nabla_{\boldsymbol{\theta}}\mathcal{L}_{i}(\boldsymbol{\theta})

to the server

end for

Calculate

\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta})

based on (8)

\boldsymbol{\theta}\leftarrow\boldsymbol{\theta}-\lambda\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta})

end for

Appendix I FedStruct-B with Hop2Vec algorithm

The training process of FedStruct-B using Hop2Vec is described in Algorithm 5.

Algorithm 5 Task dependent FedStruct-B

0: K client with their respective subgraph

\{\mathcal{G}_{i}\}_{i=1}^{K}

FedStruct model parameters

\boldsymbol{\theta}=(\boldsymbol{\theta}_{\mathsf{f}}||\boldsymbol{\theta}_{\mathsf{s}})

Serrver initialize

\nabla_{\boldsymbol{S}}\mathcal{L}(\boldsymbol{\theta},\boldsymbol{S})=\boldsymbol{0}

for i=1 to K do

Collaboratively obtain

\boldsymbol{\bar{A}}^{[i]}

based on Algorithm 3

Create

\mathcal{S}_{i}=\{\boldsymbol{s}_{u},\forall u\in\mathcal{V}_{i}\}

by randomly initialize

\boldsymbol{s}_{u}\quad\forall u\in\mathcal{V}_{i}

\mathcal{S}_{i}

with all the clients

Collect

\mathcal{S}_{j}\quad\forall j\in[K]

and create

\boldsymbol{S}=||(\boldsymbol{s}_{v}^{\mathsf{T}},\quad\forall v\in\mathcal{V})

end for

for e=1 to Epochs do

for i=1 to K do

Client

i

collects

\boldsymbol{\theta}

from the server

Client

i

collects

\nabla_{\boldsymbol{S}}\mathcal{L}(\boldsymbol{\theta},\boldsymbol{S})

from the server

Client

i

updates

\boldsymbol{S}\leftarrow\boldsymbol{S}-\lambda_{\mathsf{s}}\nabla_{\boldsymbol{S}}\mathcal{L}(\boldsymbol{\theta},\boldsymbol{S})

for

v\in\tilde{\mathcal{V}}_{i}

Calculate

\hat{\boldsymbol{y}}_{v}

based on (23)

end for

Calculate

\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})

based on (7)

Calculate

\nabla_{\boldsymbol{\theta}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})

based on 4

Calculate

\nabla_{\boldsymbol{S}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})

based on 6

Send

\nabla_{\boldsymbol{\theta}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})

and

\nabla_{\boldsymbol{S}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})

to the server

end for

Calculate

\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta},\boldsymbol{S})

based on (8)

\boldsymbol{\theta}\leftarrow\boldsymbol{\theta}-\lambda\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta},\boldsymbol{S})

Calculate

\nabla_{\boldsymbol{S}}\mathcal{L}(\boldsymbol{\theta},\boldsymbol{S})=\frac{1}{|\tilde{\mathcal{V}}|}\sum_{i=1}^{K}{\nabla_{\boldsymbol{S}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})}

end for

Appendix J Experimental Setting

In this section, we provide more details about the experiments in Section 7 and Appendix K. In Table 3, statistics of the different datasets are shown. The edge homophily ratio, measuring the fraction of edges that connect nodes with the same label, provides a measure of the homophily within the dataset. Typically, a value above $0.5$ is considered homophilic (Zheng et al., 2022). According to this rule-of-thumb, among the datasets considered in this paper, two would be considered heterophilic, i.e., Chameleon and Amazon Ratings.

Next, we provide the hyper-parameters for the experiments. In Table 4, we provide the step sizes $\lambda$ and $\lambda_{\mathsf{s}}$ for the gradient descent step during the training, the weight decay in the L2 regularization, the number of training iterations (epochs), the number of layers $L$ in the node feature embedding, the number of layers $L_{\mathsf{s}}$ in the DGCN, the dimensionality of the NSFs, $d_{\mathsf{s}}$ , and the model architecture of the node feature and node structure feature predictors, $\boldsymbol{f}_{{\boldsymbol{\theta}}_{\mathsf{f}}}$ and $\boldsymbol{g}_{{\boldsymbol{\theta}}_{\mathsf{s}}}$ , respectively.

Table 3: Statistics of the datasets.

Data	Cora	CiteSeer	PubMed	Chameleon	Amazon Photo	Amazon Ratings
$\#$ Classes	7	6	3	5	8	5
$\|\mathcal{V}\|$	2708	3327	19717	2277	7650	24492
$\|\mathcal{E}\|$	5278	4676	44327	36101	238162	93050
$\#$ Features	1433	3703	500	2325	745	300
Edge homophily ratio	0.81	0.74	0.80	0.23	0.82	0.38

Table 4: Hyper-parameters of the datasets.

Data	Cora	CiteSeer	PubMed	Chameleon	Amazon Photo	Amazon Ratings
$\lambda$	0.002	0.002	0.008	0.003	0.005	0.002
$\lambda_{\mathsf{s}}$	0.002	0.002	0.008	0.003	0.005	0.002
Weight decay	0.0005	0.0005	0.001	0.0003	0.001	0.0002
Epochs	40	60	125	60	150	70
$L$	2	2	2	1	1	2
$L_{\mathsf{s}}$	10	20	20	1	3	5
$d_{\mathsf{s}}$	256	256	256	256	256	256
$\boldsymbol{\theta}_{\mathsf{f}}$ layers	[1433,64,7]	[3703,64,6]	[500,128,3]	[2325,64,5]	[745,256,8]	[300,64,5]
$\boldsymbol{\theta}_{\mathsf{s}}$ layers	[256,256,7]	[256,128,64,6]	[256,128,3]	[256,256,5]	[256,256,8]	[256,256,5]

Appendix K Additional Results

In this appendix, we provide additional results that we could not include in the main paper due to space constraints.

K.1 Different Partitionings

In this section, we provide results for different types of data partitioning. We consider three different partitioning methods: the Louvain algorithm, the K-means algorithm, and random partitioning.

The Louvain algorithm aims at finding communities with high link density, effectively creating subgraphs with limited interconnections. Here, we follow (Zhang et al., 2021) where the global graph is first divided into a number of subgraphs using the Louvain algorithm. As we strive to have an even number of nodes among the clients, we inspect the size of the subgraphs and if any exceeds $n/K$ nodes, it is split in two. This procedure is repeated until no subgraph has more than $n/K$ nodes. Next, as the number of subgraphs may be larger than $K$ , some subgraphs must be merged. To this end, we sort the subgraphs in descending order with respect to their size and assign the first $K$ subgraphs to the $K$ clients. If there are more than $K$ subgraphs, we assign the ( $K+1$ )-th subgraph by iterating the $K$ clients and assigning it to the first client for which the total number of nodes after a merge would not exceed $n/K$ . This process is repeated until there are only $K$ subgraphs remaining. This procedure results in subgraphs with approximately the same number of nodes and a low number of interconnections.

For the K-means algorithm, we follow (Lei et al., 2023) and partition the data into $K$ clusters by proximity in the node feature space. Similarly to the Louvain partitioning, we split any subgraph that exceeds $n/K$ nodes in two and assign it to another subgraph such that the resulting number of nodes is less than $n/K$ . Compared to the previous partitioning, the K-means approach does not consider the topology, hence, there will be more interconnections between the different subgraphs compared to the Louvain-based partitioning. Further, as highlighted in (Lei et al., 2023), as nodes with similar features tend to have the same label, this partitioning method creates a highly heterogeneous partitioning as each subgraph tends to have an over-representation of nodes of a given class.

Table 5: Classification accuracies for FedStruct with an underlying decoupled GCN. The results are shown for 10 clients with a 10–10–80 train-val-test split.

	Cora			Citeseer			Pubmed
Central GNN	82.06 $\pm$ 1.09			69.17 $\pm$ 0.90			87.71 $\pm$ 0.34
Central MLP	65.31 $\pm$ 1.75			64.18 $\pm$ 0.89			85.80 $\pm$ 0.33
	Louvain	random	kmeans	Louvain	random	kmeans	Louvain	random	kmeans
FL MLP	64.42 $\pm$ 0.54	63.71 $\pm$ 1.59	64.24 $\pm$ 2.02	63.39 $\pm$ 0.68	63.98 $\pm$ 1.53	64.92 $\pm$ 1.39	85.39 $\pm$ 0.33	85.58 $\pm$ 0.39	85.00 $\pm$ 0.32
FL GNN	80.99 $\pm$ 1.33	65.06 $\pm$ 1.92	67.12 $\pm$ 2.01	68.55 $\pm$ 1.25	63.47 $\pm$ 1.29	65.07 $\pm$ 1.39	85.31 $\pm$ 0.44	85.91 $\pm$ 0.42	86.02 $\pm$ 0.35
Fedsage+	80.75 $\pm$ 0.88	66.32 $\pm$ 1.81	67.08 $\pm$ 2.19	68.20 $\pm$ 1.34	63.37 $\pm$ 1.45	64.32 $\pm$ 1.25	85.55 $\pm$ 0.59	85.37 $\pm$ 0.55	85.83 $\pm$ 0.44
Fedsage-TRUE	82.38 $\pm$ 1.28	80.85 $\pm$ 1.55	80.58 $\pm$ 1.30	68.99 $\pm$ 1.33	68.91 $\pm$ 1.30	69.53 $\pm$ 1.25	84.69 $\pm$ 1.14	85.66 $\pm$ 0.37	85.81 $\pm$ 0.80
FedSruct (Deg)	81.64 $\pm$ 2.33	68.64 $\pm$ 1.51	70.25 $\pm$ 2.33	68.41 $\pm$ 1.20	63.81 $\pm$ 1.31	66.16 $\pm$ 1.31	84.52 $\pm$ 0.47	85.09 $\pm$ 0.48	85.22 $\pm$ 0.37
FedSruct (GDV)	80.60 $\pm$ 2.55	72.19 $\pm$ 1.62	73.24 $\pm$ 2.06	68.30 $\pm$ 1.06	64.44 $\pm$ 0.84	66.71 $\pm$ 1.62	84.63 $\pm$ 0.36	86.12 $\pm$ 0.56	85.79 $\pm$ 0.45
FedSruct (N2V)	82.78 $\pm$ 1.03	80.68 $\pm$ 1.77	80.59 $\pm$ 1.50	68.42 $\pm$ 1.24	66.28 $\pm$ 1.04	67.58 $\pm$ 1.18	84.72 $\pm$ 0.44	87.09 $\pm$ 0.44	86.64 $\pm$ 0.49
FedSruct (H2V)	81.23 $\pm$ 1.14	80.28 $\pm$ 1.44	79.80 $\pm$ 1.53	68.50 $\pm$ 0.93	66.29 $\pm$ 0.87	67.20 $\pm$ 1.52	83.74 $\pm$ 0.39	86.65 $\pm$ 0.53	86.01 $\pm$ 0.38
Local GNN	75.52 $\pm$ 1.45	39.23 $\pm$ 1.21	46.80 $\pm$ 4.59	58.75 $\pm$ 1.37	38.99 $\pm$ 1.60	49.15 $\pm$ 5.21	81.17 $\pm$ 0.37	79.10 $\pm$ 0.54	81.34 $\pm$ 0.63
Local MLP	66.33 $\pm$ 1.12	38.57 $\pm$ 1.55	45.81 $\pm$ 3.90	53.27 $\pm$ 1.02	39.50 $\pm$ 1.19	50.12 $\pm$ 5.80	82.78 $\pm$ 0.67	78.51 $\pm$ 0.57	80.81 $\pm$ 0.56

	Chameleon			Amazon Photo			Amazon Ratings
	Louvain	random	kmeans	Louvain	random	kmeans	Louvain	random	kmeans
Central GNN	54.43 $\pm$ 2.44			93.99 $\pm$ 0.48			41.32 $\pm$ 1.03
Central MLP	32.15 $\pm$ 1.42			88.68 $\pm$ 0.98			37.56 $\pm$ 0.48
FL MLP	31.77 $\pm$ 2.32	31.72 $\pm$ 1.93	32.33 $\pm$ 3.27	88.65 $\pm$ 1.11	88.04 $\pm$ 1.36	88.37 $\pm$ 0.88	37.43 $\pm$ 0.68	37.62 $\pm$ 0.61	37.48 $\pm$ 0.73
FL GNN	47.74 $\pm$ 2.11	35.70 $\pm$ 2.29	38.85 $\pm$ 2.41	93.75 $\pm$ 0.40	90.74 $\pm$ 0.58	90.92 $\pm$ 0.77	41.68 $\pm$ 0.76	36.25 $\pm$ 0.46	36.78 $\pm$ 0.49
Fedsage+	46.67 $\pm$ 1.13	35.48 $\pm$ 1.23	38.24 $\pm$ 1.86	92.86 $\pm$ 3.21	90.80 $\pm$ 0.48	90.87 $\pm$ 0.38	41.21 $\pm$ 0.38	36.18 $\pm$ 0.77	37.26 $\pm$ 0.82
Fedsage-TRUE	53.89 $\pm$ 1.72	50.75 $\pm$ 2.82	52.13 $\pm$ 2.78	93.65 $\pm$ 1.41	90.95 $\pm$ 5.27	93.58 $\pm$ 0.75	41.74 $\pm$ 0.60	39.98 $\pm$ 0.68	39.61 $\pm$ 0.55
FedSruct (Deg)	48.65 $\pm$ 2.39	41.01 $\pm$ 1.26	42.26 $\pm$ 2.60	91.20 $\pm$ 1.24	89.59 $\pm$ 0.88	89.84 $\pm$ 0.53	41.37 $\pm$ 0.61	39.08 $\pm$ 0.57	38.61 $\pm$ 0.58
FedSruct (GDV)	47.87 $\pm$ 1.88	39.54 $\pm$ 2.46	41.09 $\pm$ 2.17	91.69 $\pm$ 0.87	90.51 $\pm$ 0.45	90.60 $\pm$ 0.57	41.92 $\pm$ 0.53	39.53 $\pm$ 0.87	39.44 $\pm$ 0.61
FedSruct (N2V)	50.91 $\pm$ 1.88	43.49 $\pm$ 1.93	46.27 $\pm$ 2.26	91.68 $\pm$ 0.74	91.19 $\pm$ 1.05	91.60 $\pm$ 0.47	41.95 $\pm$ 0.69	41.43 $\pm$ 0.52	41.27 $\pm$ 0.51
FedSruct (H2V)	54.00 $\pm$ 1.71	52.36 $\pm$ 2.63	53.97 $\pm$ 2.73	90.45 $\pm$ 1.23	91.83 $\pm$ 0.39	91.44 $\pm$ 0.80	41.28 $\pm$ 0.65	41.23 $\pm$ 0.38	40.38 $\pm$ 0.59
Local GNN	45.26 $\pm$ 2.08	28.51 $\pm$ 1.59	31.08 $\pm$ 3.09	91.47 $\pm$ 0.66	79.66 $\pm$ 1.15	80.60 $\pm$ 1.57	40.88 $\pm$ 0.95	32.81 $\pm$ 0.46	35.57 $\pm$ 0.57
Local MLP	29.50 $\pm$ 2.03	21.06 $\pm$ 0.89	24.52 $\pm$ 3.22	88.34 $\pm$ 0.71	71.25 $\pm$ 1.30	75.07 $\pm$ 1.24	37.77 $\pm$ 0.67	33.24 $\pm$ 0.50	35.36 $\pm$ 0.59

Finally, we consider a random partitioning where each node is allocated to a client uniformly at random. As this partitioning does not take into account the topology or the features, we end up with a large number of interconnections where each subgraph has the same distribution over class labels, i.e., the data is homogeneous across clients. Notably, given the large number of interconnections, this setting arguably constitutes the most challenging scenario in subgraph FL as it is paramount to exploit the interconnections to achieve good performance.

In Table 5, we show the performance of FedStruct over each of the different partitionings for $10$ clients and the train-val-test split according to 10%-10%-80%. We use this split to focus on a challenging semi-supervised scenario. Furthermore, each of the results is reported using the mean and standard deviation obtained from $10$ independent runs.

First, as the central version of the training with GNN does not depend on the partitioning, it is the same as in Table 1. We also report the performance of an MLP approach. By comparing the performance of Central GNN with Central MLP, one may infer the gains of accounting for the spatial structure within the data. As seen in the table, exploiting the graph structure brings the largest benefits for Cora and Chameleon.

In similar spirit, the performance gap between Local GNN (Local MLP) and Central GNN (Central MLP) indicates the potential gains of employing collaborative learning between the clients. Notably, this gap depends on the partitioning. From Table 5, it can be seen that the gap is the largest for random partitioning followed by the K-means algorithm for all datasets. This is expected, as local training suffers from a large number of interconnections. For example, in Cora, the gap is 42.83%, 35.26%, and 6.54% for random, K-means, and Louvain partitioning, respectively.

Considering FL GNN, it can be seen that all scenarios benefit from collaborative learning. For the Louvain partitioning, due to its community-based partitioning with the low number of inter-connections, FL GNN performs well for all datasets. For K-means and random partitioning, FL GNN performs worse, with the worst performance seen for Cora, Chameleon, and Amazon Ratings. FedSage+ achieves similar performance to FL GNN. The FedSage-TRUE scheme, incorporating the knowledge of the node IDs of missing neighbors in other clients, is based on a mended graph with flawless in-painting of missing one-hop neighbors. Hence, this scheme serves as an upper bound on techniques based on in-painting of one-hop neighbors, such as FedSage+ and FedNI. Notably, this scheme completely violates privacy as the node features of missing clients cannot be shared between clients. From Table 5, we see that this scheme is robust to different partitionings by offering consistent performance close to the Central GNN.

We consider FedStruct with a decoupled GCN as the underlying GNN. Note that in this case FedStruct-A and FedStruct-B yield identical performance. We consider FedStruct with four different methods to generate NSFs: one-hot degree vectors (Deg), graphlet matching (GDV), node2vec (N2V), and our task-dependent method Hop2Vec (H2V), see Appendix A for information. The hyper-parameters are chosen as in Table 4. In Table 5, it can be seen that DEG and GDV, only being able to capture the local structure, perform worse than the methods able to capture more global properties of the graph. This is especially true for the K-means and random partitionings. Node2Vec and Hop2Vec achieve similar performance throughout all scenarios. It should be noted, however, that Node2Vec requires complete knowledge of the adjacency matrix to create the NSFs whereas our proposed Hop2Vec is performed locally and only requires knowledge of the $L$ -hop adjacency matrix. Hence, in situations where clients’ intra-edges are private, Hop2Vec is the preferred NSF generating method. Furthermore, it can be seen that Hop2Vec performs close to FedSage-TRUE for all scenarios and is sometimes superior, e.g., on the Chameleon datasets. This highlights the importance of going beyond the one-hop neighbors for some datasets, something that is not possible in FedSage-TRUE.

K.2 Impact of the Number of Labeled Training Nodes

Figure 2: Accuracy vs training-ratio for Cora with random partitioning.

As highlighted in (Liu et al., 2020), the decoupled GCN is suitable in situations where over-smoothing is a bottleneck. In (Dong et al., 2021), it was further shown that decoupled GCN is especially suited for semi-supervised learning due to its ability to leverage pseudo-labels from all the unlabeled nodes during training. In Fig. 2, we show the performance of FedStruct with underlying decoupled GCN (FedStruct-A and FedStruct-B yield identical performance in this case) on the Cora dataset using different NSF generators for different amounts of labeled nodes in the training set. In particular, we use a $x$ %-10%- $(1-x)\%$ train-val-test split where $x$ denotes the ratio of labeled training nodes.

From Fig 2, it can be seen that FedStruct with both N2V and H2V achieves a performance very close to the Central GNN for all fractions of labeled nodes in the training set. Notably, FedStruct performs very well even for a heavily semi-supervised scenario: it is performant already at 5% of labeled training nodes, achieving close to 80% top-1 accuracy. Federated approaches not utilizing global graph structure, like FedSage+, exhibit a sharper decline in performance for a low fraction of labeled nodes. For example, the top- $1$ accuracy of FedSage+ declines from about 75% for 50% labeled nodes in the training set to below 60% for 5% labeled nodes. The performance of FedStruct is superior for all fractions of level nodes and experiences a much shallower decline, from about 84% for 50% labeled nodes in the training set to about 78% for 5% labeled nodes for FedStruct with N2V. Finally, the performance of local GNN deteriorates significantly from scarce labels.

In Fig. 3, to demonstrate the advantages of incorporating a decoupled GCN into our framework, we provide results for FedStruct with both an underlying decoupled GCN and a standard GNN (GraphSage) for a semi-supervised learning scenario with varying fraction of labeled training nodes Specifically, we show the accuracy for scenarios involving 50%, 10%, and 1% labeled nodes. For 50% and 10% of labeled nodes, FedStruct with both underlying decoupled GCN and standard GNN yield similar results. However, in a heavily semi-supervised scenario (as encountered in applications like anti-money laundering), the accuracy of FedStruct with standard GNN experiences a significant decline. In contrast, FedStruct with decoupled GCN achieves performance close to that of central GNN even in such a challenging scenario. This highlights the critical role of employing a decoupled GCN to effectively tackle the semi-supervised learning scenarios.

Figure 3: Accuracy vs fraction of training nodes with labels for FedStruct with underlying decoupled GCN and underlying standard GNN (GraphSage) on the Amazon Photo dataset with K-means partitioning.

K.3 Impact of the number of Layers in the Decoupled GCN

In this section, we provide results on the impact of the number of layers used in the decoupled GCN, $L_{s}$ , in FedStruct on the Cora dataset. In Fig. 4, we show the top-1 accuracy after training FedStruct using both NFVs and NSFs but at inference, we use only NSFs (s) or both NSFs and NFVs (s+f). The Central GNN and FL GNN rely on GraphSage with $3$ layers and is, therefore, independent of the layers used in the decoupled GCN.

First, we observe that the top-1 accuracy for Deg s+f and GDV s+f do not vary significantly with respect to $L_{s}$ . For Node2Vec (N2V) and Hop2Vec (H2V), the situation is different. The performance of H2V is low for $L_{s}$ below 10 and above 30, achieving the best performance between $10$ and $20$ layers. This is because the NSF is unable to learn an extended neighborhood for small $L_{s}$ whereas, for large $L_{s}$ , over-smoothing becomes the bottleneck. In contrast, N2V shows strong performance also for small $L_{s}$ due to the already learned structure although, it still suffers from over-smoothing. We also visualize N2V s and H2V s to highlight the impact of NSFs alone. For $L_{s}=10$ , N2V s achieves $76\%$ top-1 accuracy whereas N2V s+f achieves close to $82\%$ . Similarly, H2V s achieves $75\%$ whereas H2V s+f achieves $80\%$ top1-accuracy. From here, we conclude that the NSFs alone convey useful information to predict labels but that NFVs are required to achieve performance close to Central GNN.

Figure 4: Accuracy vs number of decoupled GCN propagation layers

L_{s}

on Cora with K-means partitioning.

Appendix L Discussion

L.1 Privacy

Previous frameworks for subgraph FL (Zhang et al., 2021; Peng et al., 2022; Lei et al., 2023; Zhang et al., 2022; Liu et al., 2023; Zhang et al., 2024; Chen et al., 2021; Du & Wu, 2022) require the sharing of either original or generated node features and/or embeddings among clients. In stark contrast, our proposed framework, FedStruct, eliminates this need, sharing less sensitive information.

In the scenario where the server has knowledge of the global graph’s connections, our proposed framework FedStruct-A limits the server access to the NSFs of the nodes and the clients’ local gradients. In turn, clients only receive the gradients of the NSEs.

For the scenario with no knowledge of the global graph, clients need to share some information with other clients. Next, we discuss some privacy considerations associated with FedStruct-B. We begin our discussion by mounting a potential attack to show that some information may be revealed from other clients. Thereafter, we discuss a strategy to mitigate such attacks by refraining from sharing individual information about nodes. Specifically, we present a slightly different version of FedStruct-B that relies on sharing aggregated quantities between clients without affecting the performance of the original version.

In FedStruct-B using task-agnostic NSFs, each client is provided with $\boldsymbol{S}$ , i.e., the NSFs of all clients. Consider an honest-but-curious client $i$ with NSFs $\{\boldsymbol{s}_{u}\mathrel{\mathop{\mathchar 58\relax}}u\in\mathcal{V}_{i}\}$ . Client $i$ may target client $j$ by identifying NSFs within client $j$ that closely match some of its own NSFs. By this simple procedure, client $i$ may compare the topology of the matched local nodes and infer the topology of some of the nodes within $j$ . Furthermore, as $\boldsymbol{\theta}_{s}$ is shared among all clients, client $i$ would be able to guess the labels of the identified nodes within client $j$ . Given a node’s label and topology, it may be further possible to infer something about the node features. Using task-dependent NSF, i.e., Hop2Vec, the situation is similar.

Next, we discuss how to alter FedStruct-B to prevent reconstruction of the local NSFs, therefore mitigating the aforementioned attack, starting from the task-agnostic NSF generators. From 3 and 4, the summation over $\mathcal{V}$ may be split into a summation over clients. The prediction and local gradients for client $i$ may then be written as

	$\displaystyle\hat{\boldsymbol{y}}_{v}$	$\displaystyle=\mathrm{softmax}\Big{(}\sum_{j\in[K]}\sum_{u\in\mathcal{V}_{j}}{\bar{a}_{vu}\boldsymbol{g}_{\boldsymbol{\theta}_{\mathsf{s}}}(\boldsymbol{s}_{u})}+\sum_{u\in\mathcal{V}_{i}}{\bar{a}_{vu}^{(i)}\boldsymbol{f}_{\boldsymbol{\theta}_{\mathsf{f}}}(\boldsymbol{x}_{u})}\Big{)}\,,\quad v\in\tilde{\mathcal{V}}_{i},$		(47)
	$\displaystyle\frac{\partial{\mathcal{L}_{i}(\boldsymbol{\theta})}}{\partial{\theta_{\mathsf{s},q}}}$	$\displaystyle=\sum_{v\in\tilde{\mathcal{V}}_{i}}\sum_{j\in[K]}{\sum_{u\in\mathcal{V}_{j}}\bar{a}_{vu}\frac{\partial{\boldsymbol{g}_{\boldsymbol{\theta}_{\mathsf{s}}}(\boldsymbol{s}_{u})}}{\partial{\theta_{\mathsf{s},q}}}(\hat{\boldsymbol{y}}_{v}-\boldsymbol{y}_{v})},\quad\forall q\in[\|\boldsymbol{\theta}_{s}\|].$		(48)

Notably, client $i$ may evaluate its local gradient provided access to

	$\displaystyle\bar{\boldsymbol{s}}^{(j)}_{v}$	$\displaystyle=\sum_{u\in\mathcal{V}_{j}}{\bar{a}_{vu}\boldsymbol{g}_{\boldsymbol{\theta}_{\mathsf{s}}}(\boldsymbol{s}_{u})},\quad\forall v\in\tilde{\mathcal{V}}_{i}$		(49)
	$\displaystyle\tilde{\boldsymbol{s}}^{(j)}_{vq}$	$\displaystyle=\sum_{u\in\mathcal{V}_{j}}\bar{a}_{vu}\frac{\partial{\boldsymbol{g}_{\boldsymbol{\theta}_{\mathsf{s}}}(\boldsymbol{s}_{u})}}{\partial{\theta_{\mathsf{s},q}}},\quad\forall q\in[\|\boldsymbol{\theta}_{s}\|],\forall v\in\tilde{\mathcal{V}}_{i}$		(50)

for each client $j\in[K]$ . For client $j\neq i$ to be able to evaluate these quantities, it needs access to $\bar{a}_{vu}$ for all $v\in\tilde{\mathcal{V}}_{i}$ and $u\in\tilde{\mathcal{V}}$ , i.e., the weighted sum of the $\ell$ -hop paths, $\ell\in[L]$ , from nodes in client $i$ to nodes in client $j$ . Following Appendix G, client $j$ has access to $a_{vu}$ for $v\in\tilde{\mathcal{V}}_{j}$ and $u\in\mathcal{V}$ . Notably, $a_{vu}\neq a_{uv}$ due to different degree normalizations. Hence, to obtain the required quantitites in (49) and (50), client $i$ may share $\{\bar{a}_{vu}\mathrel{\mathop{\mathchar 58\relax}}v\in\tilde{\mathcal{V}}_{i},u\in\mathcal{V}_{j}\}$ with client $j\in[K]$ after the algorithm in Appendix G has finished.

Compared to FedStruct-B, the gradients of the above method cannot be calculated locally but bring the benefit of clients only knowing their own local NSFs rather than the individual NSFs of others. Because of this, the training has to be done in two phases for each iteration where phase 1 accounts for each client $i\in[K]$ to collect the quantities in (49) and (50) from all other clients and where each client computes the local gradients and shares these with the server during phase 2. Note that the communication complexity in each training iteration does not increase as (49) and (50) amounts to sharing $|\tilde{\mathcal{V}}_{i}|c$ and $|\boldsymbol{\theta}_{s}|c$ parameters, respectively, where $c$ is the number of class labels. Hence, in total, each client $i$ must share $(|\tilde{\mathcal{V}}\setminus\tilde{\mathcal{V}}_{i}|+|\boldsymbol{\theta}_{s}|)c$ parameters in each iteration with the other clients. Note that this sharing can be made either via the server or via peer-to-peer links.

Although a formal privacy analysis seems formidable, we may follow an approach similar to (Lei et al., 2023) to argue around the difficulty of perfectly reconstructing the local NSFs. To this end, we take a conservative approach and consider honest-but-curious clients where all but one client is colluding. Hence, we may consider two clients where client 1 is the attacker and client 2 is the target. In each iteration, client 1 receives $\tilde{\boldsymbol{s}}_{v}^{(2)}\in\mathbb{R}^{c}$ and $\tilde{\boldsymbol{s}}^{(2)}_{vq}\in\mathbb{R}^{c}$ for $v\in\tilde{\mathcal{V}}_{1}$ and $q\in[|\boldsymbol{\theta}_{2}|]$ . To evaluate $\tilde{\boldsymbol{s}}_{v}^{(2)}$ , client 2 utilizes $\mathinner{\!\left\lVert\boldsymbol{a}_{v}\right\rVert}_{0}$ $d_{s}$ -dimensional NSFs where the L0-norm counts the number of non-zero elements. Hence, for each received aggregate $\tilde{\boldsymbol{s}}_{v}^{(2)}$ , client $1$ obtains $c$ observations from $\mathinner{\!\left\lVert\boldsymbol{a}_{v}\right\rVert}_{0}d_{s}$ unknowns. In total, client $1$ will collect $|\tilde{\mathcal{V}}_{1}|$ such observations, i.e., it observes $c|\tilde{\mathcal{V}}_{1}|$ parameters. Assuming a, for client $2$ , the worst-case scenario with all $\boldsymbol{a}_{v}$ being non-zero in the same locations with $m$ non-zero elements, client $2$ will use the same inputs for each observation in client $1$ . Hence, if $nd_{s}>c|\tilde{\mathcal{V}}_{1}|$ , client $1$ is unable to recover individual NSFs from client $2$ . Notably, client $2$ has full control as it knows $m$ and can count the number of queries from client $1$ and refuse to answer if it exceeds $md_{s}/c$ . Similarly, to evaluate $\tilde{\boldsymbol{s}}^{(2)}_{vq}$ , client 2 utilizes $\mathinner{\!\left\lVert\boldsymbol{a}_{v}\right\rVert}_{0}$ $d_{s}$ -dimensional inputs to compute a $c$ -dimensional output. Hence, client $2$ will share $|\boldsymbol{\theta}_{s}|c$ parameters, computed from $\mathinner{\!\left\lVert\boldsymbol{a}_{v}\right\rVert}_{0}d_{s}$ parameters $|\tilde{\mathcal{V}}_{1}|$ times. Hence, again considering the worst-case view, if $md_{s}>c|\boldsymbol{\theta}_{s}||\tilde{\mathcal{V}}_{1}|$ , client $1$ cannot reconstruct the local gradient. Again, client $2$ may monitor the number of queries such that the condition is not violated.

When using Hop2Vec as the NSF generator, we no longer optimize over $\boldsymbol{\theta}_{s}$ but rather over the NSFs $\boldsymbol{S}$ directly. The problem we face in FedStruct-B is that each client has access to $\boldsymbol{S}$ . To alleviate this issue, we use Theorem 6 to split the summation over $\mathcal{V}$ to obtain

\displaystyle\bar{\boldsymbol{s}}^{(j)}_{v}

\displaystyle=\sum_{u\in\mathcal{V}_{j}}\bar{a}_{vu}\boldsymbol{s}_{u},\quad\forall v\in\tilde{\mathcal{V}}_{i}.

(51)

Furthermore, the local gradients for the NSFs in client $i$ are given as

\nabla_{\boldsymbol{S}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})=\sum_{v\in\tilde{\mathcal{V}}_{i}}{\boldsymbol{\bar{a}}_{v}(\hat{\boldsymbol{y}}_{v}-\boldsymbol{y}_{v})^{\mathsf{T}}}.\,

(52)

We note that (51) is required to compute the local predictions on $\tilde{\mathcal{V}}_{i}$ , and hence to obtain (52). In this version of FedStruct-B, the training procedure will be divided into three phases. Phase 1 amounts to each client evaluating (51) for all other clients and sharing the results. This results in the sharing of $|\tilde{\mathcal{V}}\setminus\tilde{\mathcal{V}}_{i}|c$ in total for each client. Phase 2 amounts to each client evaluating its local gradient in (52) and sharing it with the server, similar to the original FedStruct-B. Next, the server aggregates the local gradients over $\boldsymbol{S}$ and partitions the result with respect to the entries residing in each client (the server must know the unique node-IDs in the clients) and returns only the NSF gradients corresponding to the nodes residing in each client. Using this procedure, clients will only have information about the NSFs pertaining to their local nodes.

Next, we again consider honest-but-curious clients with all but one colluding against a single target. We denote the colluding clients as client $1$ and the target as client $2$ . Client $1$ observes (51) which constitutes $c$ observations from $\mathinner{\!\left\lVert\boldsymbol{a}_{v}\right\rVert}_{0}$ $c$ -dimensional inputs. Assuming a worst-case scenario as above with all $\boldsymbol{a}_{v}$ having $m$ non-zero entries in the same locations, client $1$ observes in total $c|\tilde{\mathcal{V}}_{1}|$ parameters from $m c$ inputs. Hence, to prevent client $1$ from perfectly reconstructing individual NSFs, client $2$ must ensure that $m>|\tilde{\mathcal{V}}_{1}|$ . Again, client $2$ has full control over this by counting the number of queries ensuring it is below $m$ .

To summarize, we are able to alter FedStruct-B to reveal less information about individual nodes without impacting the performance. Further, each client may prevent perfect reconstruction from being possible by limiting the amount of information that is shared with other clients. To go beyond perfect reconstruction and understand the risk of approximate reconstruction seems formidable.

Table 6: Communication Complexity of FedStruct.

Data	before training	training
FedStruct-A	0	$\mathcal{O}(E\cdot K\cdot\|\boldsymbol{\theta}\|+E\cdot n\cdot d\cdot\|\boldsymbol{\theta}\|)$
FedStruct-A + Hop2Vec	0	$\mathcal{O}(E\cdot K\cdot\|\boldsymbol{\theta}\|+E\cdot n\cdot d\cdot\|\boldsymbol{\theta}\|+E\cdot K\cdot n\cdot d+E\cdot n^{2}\cdot d^{2})$
FedStruct-B	$\mathcal{O}(K\cdot n\cdot d+L_{\mathsf{s}}\cdot n^{2})$	$\mathcal{O}(E\cdot K\cdot\|\boldsymbol{\theta}\|)$
FedStruct-B + Hop2Vec	$\mathcal{O}(K\cdot n\cdot d+L_{\mathsf{s}}\cdot n^{2})$	$\mathcal{O}(E\cdot K\cdot\|\boldsymbol{\theta}\|+E\cdot K\cdot n\cdot d)$
FedStruct-B pruning	$\mathcal{O}(K\cdot n\cdot d+L_{\mathsf{s}}\cdot p\cdot n)$	$\mathcal{O}(E\cdot K\cdot\|\boldsymbol{\theta}\|)$
FedSage	0	$\mathcal{O}(E\cdot K^{2}\cdot\|\boldsymbol{\theta}\|+E\cdot K\cdot n\cdot d)$

L.2 Communication Complexity

In Table 6, we present the communication complexity of FedStruct-A and FedStruct-B alongside that of FedSage+. We divide the communication complexity into two parts: an offline component, accounting for events taking place before the training is initiated, and an online component accounting for the actual training phase. In the table, parameters $E$ , $K$ , and $n$ represent the number of training rounds, clients, and nodes in the graph, respectively. Parameter $d$ is the node feature size (all feature dimensions are assumed to be equal to $d$ ), and $|\boldsymbol{\theta}|$ is the number of model parameters (all model parameters are assumed to be of the same order). In the following, we discuss the complexity of the different variants of FedStruct.

•

In FedStruct-A, there is no offline phase. During the training, in each training round, each client collects $\boldsymbol{\theta}$ and returns $\nabla_{\boldsymbol{\theta}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})$ to the server, totaling a communication of $2E\cdot K\cdot|\boldsymbol{\theta}|$ parameters during the training. This is consistent across all our frameworks. Additionally, in FedStruct-A, the server sends $\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}$ , $\forall v\in\tilde{\mathcal{V}}_{i}$ , and $\{\frac{\partial{\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}}}{\partial{\boldsymbol{\theta}_{\mathsf{s}}}},\forall v\in\tilde{\mathcal{V}}_{i}\}$ to each client, adding up to $E\cdot n\cdot d(|\boldsymbol{\theta}|+1)$ parameters.
•

FedStruct-A with Hop2Vec entails some additional communication complexity, corresponding to the collection of $\{\frac{\partial{\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}}}{\partial{\boldsymbol{s}_{q}}},\forall v\in\tilde{\mathcal{V}}_{i},q\in\mathcal{V}\}$ by client $i$ , constituting $E\cdot n^{2}\cdot d^{2}$ parameters, and sending $\nabla_{\boldsymbol{S}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})$ to the server, constituting $E\cdot K\cdot n\cdot d$ parameters. In practical scenarios with large graphs, $E\cdot n^{2}\cdot d^{2}$ is the dominant term out of the two.
•

FedStruct-B requires an offline phase where clients acquire $\boldsymbol{\bar{A}}^{[i]}$ and $\mathcal{S}_{i}=\{\boldsymbol{s}_{u},\forall u\in\mathcal{V}_{i}\}$ from other clients. The communication requirement of this phase is $L_{\mathsf{s}}\cdot n^{2}$ and $K\cdot n\cdot d$ parameters, respectively. Beyond the offline phase, similar to the previous methods, $\boldsymbol{\theta}$ is federated during the training phase.
•

FedStruct-B with Hop2Vec shares the same offline phase as without Hop2Vec. However, during the training phase, clients exchange gradients $\nabla_{\boldsymbol{S}}\mathcal{L}_{i}(\boldsymbol{\theta},\boldsymbol{S})$ , which entails an additional complexity of $E\cdot K\cdot n\cdot d$ .
•

In FedSage, feature generator parameters and node embedding vectors are shared between all clients, amounting to a communication overhead of $E\cdot K^{2}\cdot|\boldsymbol{\theta}|$ and $E\cdot K\cdot n\cdot d$ parameters, respectively.

Discussion. In FedStruct-A, where the server knows the global graph, clients only communicate with the server. Therefore, there is no pre-training information exchange. The dominant term of the online complexity is $E\cdot n\cdot d\cdot|\boldsymbol{\theta}|$ , which scales with $n$ . The complexity of FedStruct-A is therefore of the same order as that of FedSage.

In FedStruct-A with Hop2Vec, the dominant complexity term is on the order of $n^{2}$ , which is impractical for large networks. It should be noted, however, that real-world graph-structured datasets are sparse. Furthermore, each node only depends on its $L_{\mathsf{s}}$ -hop neighbors. Hence, many of the values $\{\frac{\partial{\boldsymbol{s}_{v}^{(L_{\mathsf{s}})}}}{\partial{\boldsymbol{s}_{q}}},\forall v\in\tilde{\mathcal{V}}_{i},q\in\mathcal{V}\}$ are zero and do not need to be communicated. Assuming an average node degree $\bar{d}$ and $L_{\mathsf{s}}$ layers, for which each node has access to $\bar{d}^{L_{\mathsf{s}}}$ nodes, the complexity of FedStruct-A with Hop2Vec is on the order of $\min\{n\cdot\bar{d}^{L_{\mathsf{s}}},n^{2}\}$ . For small average degree $\bar{d}$ and $L_{\mathsf{s}}$ , the complexity can therefore be reduced significantly.

FedStruct-B includes an offline phase before the training. The online complexity of FedStruct-B is significantly lower than that of FedSage, which is on the order of $n$ . On the other hand, the online complexity of FedStruct-B with Hop2Vec, is also on the order of $n$ . The offline complexity of FedStruct-B is on the order of $n^{2}$ . While the offline complexity is less critical than the online complexity, scaling as $n^{2}$ may be impractical for large networks. In Appendix L.2.1, we provide a pruning method to reduce this complexity without hampering the performance significantly.

Table 7: Classification accuracy of FedStruct-B with pruning (pruning parameter

p=30

) and without pruning. The results for pruning are presented in bold. The results are shown for 10 clients with a 10–10–80 train-val-test split.

	Cora			Citeseer			Pubmed
Central GNN	82.06 $\pm$ 1.09			69.17 $\pm$ 0.90			87.71 $\pm$ 0.34
Central MLP	65.31 $\pm$ 1.75			64.18 $\pm$ 0.89			85.80 $\pm$ 0.33
	Louvain	random	kmeans	Louvain	random	kmeans	Louvain	random	kmeans
FedStruct (Deg)	82.14 $\pm$ 0.76	67.83 $\pm$ 2.73	70.82 $\pm$ 0.88	67.98 $\pm$ 1.58	64.13 $\pm$ 1.24	65.46 $\pm$ 2.11	84.65 $\pm$ 0.41	84.87 $\pm$ 0.38	85.31 $\pm$ 0.60
FedStruct (GDV)	80.79 $\pm$ 2.05	72.34 $\pm$ 2.69	73.52 $\pm$ 0.78	68.02 $\pm$ 1.96	65.07 $\pm$ 0.92	66.28 $\pm$ 1.70	84.67 $\pm$ 0.52	85.94 $\pm$ 0.32	85.97 $\pm$ 0.54
FedStruct (N2V)	82.25 $\pm$ 0.86	80.53 $\pm$ 0.88	80.61 $\pm$ 0.91	68.06 $\pm$ 1.85	66.58 $\pm$ 1.16	67.13 $\pm$ 1.23	84.66 $\pm$ 0.55	86.75 $\pm$ 0.30	86.81 $\pm$ 0.45
FedStruct (H2V)	80.81 $\pm$ 1.32	80.00 $\pm$ 1.21	80.01 $\pm$ 1.51	67.99 $\pm$ 1.85	66.73 $\pm$ 1.45	66.93 $\pm$ 1.29	83.97 $\pm$ 0.49	86.05 $\pm$ 0.48	86.25 $\pm$ 0.50
FedStruct (Deg)	81.82 $\pm$ 0.89	69.23 $\pm$ 2.21	71.48 $\pm$ 1.16	68.01 $\pm$ 1.93	63.97 $\pm$ 1.31	65.83 $\pm$ 1.79	84.68 $\pm$ 0.43	84.74 $\pm$ 0.41	85.46 $\pm$ 0.60
FedStruct (GDV)	82.10 $\pm$ 0.63	70.49 $\pm$ 2.53	72.48 $\pm$ 0.82	67.98 $\pm$ 1.82	64.20 $\pm$ 1.06	65.36 $\pm$ 1.90	84.57 $\pm$ 0.50	84.89 $\pm$ 0.34	85.33 $\pm$ 0.54
FedStruct (N2V)	82.18 $\pm$ 1.30	80.38 $\pm$ 0.89	80.18 $\pm$ 0.94	68.00 $\pm$ 2.03	66.34 $\pm$ 1.00	67.29 $\pm$ 1.20	84.46 $\pm$ 0.53	85.77 $\pm$ 0.34	86.06 $\pm$ 0.49
FedStruct (H2V)	80.19 $\pm$ 1.28	79.36 $\pm$ 1.59	79.92 $\pm$ 1.40	67.85 $\pm$ 2.01	66.12 $\pm$ 1.36	66.96 $\pm$ 1.54	82.39 $\pm$ 0.76	84.18 $\pm$ 0.38	84.25 $\pm$ 0.70

	Chameleon			Amazon Photo			Amazon Ratings
	Louvain	random	kmeans	Louvain	random	kmeans	Louvain	random	kmeans
Central GNN	54.43 $\pm$ 2.44			93.99 $\pm$ 0.48			41.32 $\pm$ 1.03
Central MLP	32.15 $\pm$ 1.42			88.68 $\pm$ 0.98			37.56 $\pm$ 0.48
FedStruct (Deg)	41.38 $\pm$ 2.21	48.09 $\pm$ 1.58	42.97 $\pm$ 2.27	91.92 $\pm$ 0.37	89.84 $\pm$ 0.52	89.21 $\pm$ 1.09	41.12 $\pm$ 0.34	38.90 $\pm$ 0.50	39.04 $\pm$ 0.51
FedStruct (GDV)	40.02 $\pm$ 1.89	48.09 $\pm$ 1.05	41.42 $\pm$ 1.89	92.04 $\pm$ 0.46	90.47 $\pm$ 0.68	90.25 $\pm$ 0.97	41.39 $\pm$ 0.43	39.53 $\pm$ 0.34	39.80 $\pm$ 0.47
FedStruct (N2V)	44.72 $\pm$ 1.49	49.48 $\pm$ 0.74	45.84 $\pm$ 2.73	92.03 $\pm$ 0.44	91.65 $\pm$ 0.53	91.48 $\pm$ 0.73	41.70 $\pm$ 0.48	41.63 $\pm$ 0.50	41.48 $\pm$ 0.35
FedStruct (H2V)	52.02 $\pm$ 1.71	52.36 $\pm$ 1.54	53.01 $\pm$ 1.88	92.07 $\pm$ 0.42	91.98 $\pm$ 0.58	91.79 $\pm$ 0.70	41.07 $\pm$ 0.45	40.93 $\pm$ 0.71	40.72 $\pm$ 0.59
FedStruct (Deg)	41.40 $\pm$ 2.19	47.57 $\pm$ 1.18	42.58 $\pm$ 2.24	91.98 $\pm$ 0.41	89.72 $\pm$ 0.50	89.09 $\pm$ 1.10	40.97 $\pm$ 0.38	39.02 $\pm$ 0.42	39.07 $\pm$ 0.43
FedStruct (GDV)	39.67 $\pm$ 2.56	47.58 $\pm$ 1.03	41.34 $\pm$ 1.58	91.94 $\pm$ 0.27	89.75 $\pm$ 0.57	89.02 $\pm$ 0.95	40.95 $\pm$ 0.48	39.02 $\pm$ 0.54	39.29 $\pm$ 0.56
FedStruct (N2V)	44.66 $\pm$ 1.55	49.43 $\pm$ 1.57	45.70 $\pm$ 2.35	91.90 $\pm$ 0.41	90.72 $\pm$ 0.42	90.42 $\pm$ 0.99	41.65 $\pm$ 0.51	41.61 $\pm$ 0.40	41.46 $\pm$ 0.51
FedStruct (H2V)	52.80 $\pm$ 1.97	53.09 $\pm$ 1.59	52.83 $\pm$ 1.73	91.50 $\pm$ 0.40	91.16 $\pm$ 0.71	90.76 $\pm$ 0.81	40.76 $\pm$ 0.51	41.02 $\pm$ 0.56	40.44 $\pm$ 0.52

L.2.1 Complexity Reduction via Pruning

The offline complexity component of FedStruct-B scales with $n^{2}$ , making it impractical for very large networks. In this section, we introduce a modified version of FedStruct-B that significantly reduces the $n^{2}$ complexity while achieving similar performance. This modified version of FedStruct-B builds on the fact that real-world graph-structured datasets are often sparse.

The key idea is to apply a pruning technique when obtaining the shared matrix $\boldsymbol{\bar{A}}^{[i]}$ using Algorithm 3 in Appendix G. More specifically, we choose an integer $p\ll n$ and, in each offline communication round, retain only the top $p\cdot n_{i}$ values of matrix $\boldsymbol{B}_{ijk}$ (originally of dimension $n_{i}\times n_{j}$ ). Hence, the complexity is reduced from $L_{\mathsf{s}}\cdot n^{2}$ to $L_{\mathsf{s}}\cdot p\cdot n$ . Overall the complexity of FedStruct-B with pruning is the same of FedSage, i.e., on the order of $n$ . The communication complexity of FedStruct-B with pruning is provided in Table 6.

The performance of FedStruct-B with pruning is illustrated in Table 7 for pruning parameter $p=30$ . It can be observed that the performance of FedStruct-B with pruning is similar to that of FedStruct-B with no pruning across all datasets and partitionings. Table 7 also highlights the robustness of FedStruct-B against the presence of structural noise. That is, in the presence of unreliable or noisy connections, pruning can be used to limit to reliable connections whilst achieving comparable performance to the case where all connections are reliable (Fox & Rajamanickam, 2019).

L.3 Heterophily

Employing a decoupled GCN within the FedStruct framework allows the nodes to access larger receptive fields compared to standard GNNs. This proves advantageous in handling heterophily graphs. This is because in heterophilic graphs, neighboring nodes may not provide substantial information about the node labels. In such scenarios, the mixing of embeddings from higher-order neighbors involves a greater number of nodes with the same class, promoting increased similarity among nodes of the same class. Higher-order neighbor mixing is one of the well-known approaches to deal with heterophilic graphs (Zheng et al., 2022). Notably, (Abu-El-Haija et al., 2019) aggregates embeddings from multi-hop neighbors showing superior performance compared to one-hop neighbor aggregation. To show why higher-order neighborhoods help in heterophilic settings, (Zhu et al., 2020) theoretically establish that, on average, the labels of $2$ -hop neighbors exhibit greater similarity to the ego node label when the labels of $1$ -hop neighbors are conditionally independent from the ego node.

Furthermore, the parameters $\{\beta_{j}\}_{j=1}^{L_{\mathsf{s}}}$ can be trained or adjusted to control the mixing weight of different $l$ -hop neighbors. This allows the network to explicitly capture both local and global structural information within the graph. Specifically, different powers of the normalized adjacency matrix $\boldsymbol{\hat{A}}^{l},\ l\in[L_{\mathsf{s}}]$ , collect information from distinct localities in the graph. Smaller powers capture more local information, while larger powers tend to collect more global information. Consequently, selecting the appropriate parameters $\{\beta_{j}\}_{j=1}{L_{\mathsf{s}}}$ enables the model to adapt to various structural properties within graphs.

An additional well-established strategy for addressing heterophilic graphs involves the discovery of potential neighbors, as proposed by (Zheng et al., 2022). The concept of potential neighbor discovery broadens the definition of neighbors by identifying nodes that may not be directly linked to the ego nodes but share similar neighborhood patterns in different regions of the graph. Geom-GCN (Pei et al., 2020) introduces a novel approach by defining a geometric Euclidean space and establishing a new neighborhood for nodes that are proximate in this latent space. To this aim, FedStruct employs NSFs to foster long-range similarity between nodes that may not be in close proximity in the graph but exhibit proximity in the latent structural space. Specifically, the use of methods such as Struc2Vec (Ribeiro et al., 2017) for generating NSFs enables the creation of vectors that share similarity, even among nodes that may not be directly connected in the graph. Consequently, the NSEs produced by these NSFs also demonstrate similarity, contributing to consistent node label predictions. This approach showcases FedStruct’s capacity to capture latent structural relationships and enhance the model’s ability to make accurate predictions in the context of heterophilic graphs.

The robustness of FedStruct in handling heterophilic graphs is demonstrated in Table 1 and Table 5 for the Chameleon and Amazon Ratings datasets. Such datasets are heterophilic datasets, with Chameleon having a higher degree of heterophily than Amazon Ratings (see Table 3). For both datasets FedStruct yields performance close to Central GNN and significantly outperforms FedSage+. In Table 1, for the Chameleon dataset and 20 clients, FedStruct with Hop2Vec achieves an accuracy of 52.76% compared to 34.33% for FedSage+. For the Amazon Ratings dataset and 20 clients, FedStruct with Node2Vec achieves an accuracy of 41.77% compared to 36.09% for FedSage+. Note that the improvement is larger for the more heterophilic dataset. Similar results can be observed for different partitioning methods, see Table 5. For Chameleon, FedStruct with Hop2Vec outperforms FedSage+ with 6.33%, 16.88% and 15.73% for Louvain, random and K-means partitioning, respectively. For Amazon ratings, FedStruct with Node2Vec outperforms FedSage+ with 0.74%, 5.25%, and 4.01% for Louvain, random and K-means partitioning, respectively.

We highlight that we have not specifically optimized the parameters of FedStruct to handle heterophilic graph structures, hence it might be possible to improve the performance of FedStruct for these heterophilic graphs.

The robustness of FedStruct to different degrees of homophily/heterophily (in contrast to frameworks such as FedSage+), underscores the adaptability and effectiveness of our framework to diverse graph scenarios, affirming its potential for a wide range of applications.

FedStruct: Federated Decoupled Learning over Interconnected Graphs

Abstract

1 Introduction

2 Related Work

3 Preliminaries

4 System Model

4.1 Node Embeddings

4.2 Federated Learning

5 FedStruct-A: Subgraph FL with Knowledge of the Global Graph

Theorem 1.

Proof.

5.1 Task-Dependent Node Structure Feature Vector

Theorem 2.

Proof.

5.2 Semi-Supervised Learning

6 FedStruct-B: Subgraph FL with no Knowledge of the Global Graph

Theorem 3.

Proof.

Theorem 4.

Proof.

Theorem 5.

Proof.

Theorem 6.

Proof.

7 Evaluation

8 Concluding Remarks

References

Appendix A Node Structure Feature Generation

Appendix B Proof of 1

Lemma 1.

Proof.

Appendix C Proof of 2

Appendix D Task-Dependent FedStruct-A Algorithm

Appendix E Proof of 4

Appendix F Proof of 6

Appendix G Scheme to Obtain 𝑨¯[i]

Appendix H FedStruct-B Algorithm

Appendix I FedStruct-B with Hop2Vec algorithm

Appendix J Experimental Setting

Appendix K Additional Results

K.1 Different Partitionings

K.2 Impact of the Number of Labeled Training Nodes

K.3 Impact of the number of Layers in the Decoupled GCN

Appendix L Discussion

L.1 Privacy

L.2 Communication Complexity

L.2.1 Complexity Reduction via Pruning

L.3 Heterophily

Appendix G Scheme to Obtain $\boldsymbol{\bar{A}}^{[i]}$