BernNet: Learning Arbitrary Graph Spectral Filters via Bernstein Approximation

Mingguo He
Renmin University of China
mingguo@ruc.edu.cn Zhewei Wei
111Zhewei Wei and Hongteng Xu are the corresponding authors. Work partially done at Gaoling School of Artificial Intelligence, Beijing Key Laboratory of Big Data Management and Analysis Methods, MOE Key Lab of Data Engineering and Knowledge Engineering, Renmin University of China, and Pazhou Lab, Guangzhou, 510330, China.
Renmin University of China
zhewei@ruc.edu.cn Zengfeng Huang
Fudan University
huangzf@fudan.edu.cn Hongteng Xu
111Zhewei Wei and Hongteng Xu are the corresponding authors. Work partially done at Gaoling School of Artificial Intelligence, Beijing Key Laboratory of Big Data Management and Analysis Methods, MOE Key Lab of Data Engineering and Knowledge Engineering, Renmin University of China, and Pazhou Lab, Guangzhou, 510330, China.
Renmin University of China
hongtengxu@ruc.edu.cn
Abstract

Many representative graph neural networks, e.g., GPR-GNN and ChebNet, approximate graph convolutions with graph spectral filters. However, existing work either applies predefined filter weights or learns them without necessary constraints, which may lead to oversimplified or ill-posed filters. To overcome these issues, we propose BernNet, a novel graph neural network with theoretical support that provides a simple but effective scheme for designing and learning arbitrary graph spectral filters. In particular, for any filter over the normalized Laplacian spectrum of a graph, our BernNet estimates it by an order-K Bernstein polynomial approximation and designs its spectral property by setting the coefficients of the Bernstein basis. Moreover, we can learn the coefficients (and the corresponding filter weights) based on observed graphs and their associated signals and thus achieve the BernNet specialized for the data. Our experiments demonstrate that BernNet can learn arbitrary spectral filters, including complicated band-rejection and comb filters, and it achieves superior performance in real-world graph modeling tasks. Code is available at https://github.com/ivam-he/BernNet.

1 Introduction

Graph neural networks (GNNs) have received extensive attention from researchers due to their excellent performance on various graph learning tasks such as social analysis qiu2018deepinf ; li2019encoding ; tong2019leveraging , drug discovery jiang2021could ; rathi2019practical , traffic forecasting li2017diffusion ; bogaerts2020graph ; cui2019traffic , recommendation system ying2018graph ; wu2019session and computer vision zhao2019semantic ; chen2019multi . Recent studies suggest that many popular GNNs operate as polynomial graph spectral filters Chebnet ; kipf2016semi ; chien2021GPR-GNN ; levie2018cayleynets ; bianchi2021ARMA ; xu2018graphWavelet . Specifically, we denote an undirected graph with node set V and edge set E as G=(V,E), whose adjacency matrix is 𝐀. Given a signal 𝐱=[x]∈Rn on the graph, where n=|V| is the number of nodes, we can formulate its graph spectral filtering operation as βˆ‘k=0Kwk𝐋k𝐱, wk’s are the filter weights, 𝐋=πˆβˆ’πƒβˆ’1/2π€πƒβˆ’1/2 is the symmetric normalized Laplacian matrix of G, and 𝐃 is the diagonal degree matrix of 𝐀. Another equivalent polynomial filtering operation is βˆ‘k=0Kck𝐏k𝐱, where 𝐏=πƒβˆ’1/2π€πƒβˆ’1/2 is the normalized adjacency matrix and ck’s are the filter weights.

We can broadly categorize the GNNs applying the above filtering operation into two classes, depending on whether they design the filter weights or learn them based on observed graphs. Some representative models in these two classes are shown below.

  • β€’

    The GNNs driven by designing filters: GCN kipf2016semi uses a simplified first-order Chebyshev polynomial, which is proven to be a low-pass filter balcilar2021analyzing ; wu2019sgc ; xu2020graphheat ; zhu2021interpreting . APPNP appnp utilizes Personalized PageRank (PPR) to set the filter weights and achieves a low-pass filter as well klicpera2019diffusion ; zhu2021interpreting . GNN-LF/HF zhu2021interpreting designs filter weights from the perspective of graph optimization functions, which can simulate high- and low-pass filters.

  • β€’

    The GNNs driven by learning filters: ChebNet Chebnet approximates the filtering operation with Chebyshev polynomials, and learns a filter via trainable weights of the Chebyshev basis. GPR-GNN chien2021GPR-GNN learns a polynomial filter by directly performing gradient descent on the filter weights, which can derive high- or low-pass filters. ARMA bianchi2021ARMA learns a rational filter via the family of Auto-Regressive Moving Average filters narang2013signal .

Although the above GNNs achieve some encouraging results in various graph modeling tasks, they still suffer from two major drawbacks. Firstly, most existing methods focus on designing or learning simple filters (e.g., low- and/or high-pass filters), while real-world applications often require much more complex filters such as band-rejection and comb filters. To the best of our knowledge, none of the existing work supports designing arbitrary interpretable spectral filters. The GNNs driven by learning filters can learn arbitrary filters in theory, but they cannot intuitively show what filters they have learned. In other words, their interpretability is poor. For example, GPR-GNN chien2021GPR-GNN learns the filter weights wk’s but only proves a small subset of the learnt weight sequences corresponds to low- or high-pass filters. Secondly, the GNNs often design their filters empirically or learn the filter weights without any necessary constraints. As a result, their filter weights often have poor controllability. For example, GNN-LF/HF zhu2021interpreting designs its filters with a complex and non-intuitive polynomial with difficult-to-tune hyperparameters. The multi-layer GCN/SGC kipf2016semi ; wu2019sgc leads to β€œill-posed” filters (i.e., those deriving negative spectral responses). Additionally, the filters learned by GPR-GNN chien2021GPR-GNN or ChebNet Chebnet have a chance to be ill-posed as well.

Refer to caption
Figure 1: An illustration of the proposed BernNet.

To overcome the above issues, we propose a novel graph neural network called BernNet, which provides an effective algorithmic framework for designing and learning arbitrary graph spectral filters. As illustrated in Figure 1, for an arbitrary spectral filter h:[0,2]↦[0,1] over the spectrum of the symmetric normalized Laplacian 𝐋, our BernNet approximates h by a K-order Bernstein polynomial approximation, i.e., h(Ξ»)=βˆ‘k=0KΞΈkbkK(Ξ»). The non-negative coefficients {ΞΈk}k=0K of the Bernstein basis {bkK(Ξ»)}k=0K work as the model parameter, which can be interpreted as h(2k/K), k=0,…,K (i.e., the filter values uniformly sampled from [0,2]). By designing or learning the ΞΈk’s, we can obtain various spectral filters, whose filtering operation can be formulated as βˆ‘k=0KΞΈk12K(Kk)(2πˆβˆ’π‹)Kβˆ’k𝐋k𝐱, where 𝐱 is the graph signal. We further demonstrate the rationality of our BernNet from the perspective of graph optimization β€” any valid polynomial filers, i.e., those polynomial functions mapping [0,2] to [0,1], can always be expressed by our BernNet, and accordingly, the filters learned by our BernNet are always valid. Finally, we conduct experiments to demonstrate that 1) BernNet can learn arbitrary graph spectral filters (e.g., band-rejection, comb, low-band-pass, etc.), and 2) BernNet achieves superior performance on real-world datasets.

2 BernNet

2.1 Bernstein approximation of spectral filters

Given an arbitrary filter function h:[0,2]↦[0,1], let 𝐋=π”πš²π”T denote the eigendecomposition of the symmetric normalized Laplacian matrix 𝐋, where 𝐔 is the matrix of eigenvectors and 𝚲=π‘‘π‘–π‘Žπ‘”[Ξ»1,…,Ξ»n] is the diagonal matrix of eigenvalues. We use

h(𝐋)𝐱=𝐔h(𝚲)𝐔T𝐱=π”π‘‘π‘–π‘Žπ‘”[h(Ξ»1),…,h(Ξ»n)]𝐔T𝐱 (1)

to denote a spectral filter on graph signal 𝐱. The key of our work is approximate h(𝐋) (or, equivalently, h(Ξ»)). For this purpose, we leverage the Bernstein basis and Bernstein polynomial approximation defined below.

Definition 2.1 ( farouki2012bernstein ).

(Bernstein polynomial approximation) Given an arbitrary continuous function f(t) on t∈[0,1], the Bernstein polynomial approximation (of order K) for f is defined as

pK(t):=βˆ‘k=0KΞΈkβ‹…bkK(t)=βˆ‘k=0Kf(kK)β‹…(Kk)(1βˆ’t)Kβˆ’ktk. (2)

Here, for k=0,…,K, bkK(t)=(Kk)(1βˆ’t)Kβˆ’ktk is the k-th Bernstein base, and ΞΈk=f(kK) is the function value at k/K, which works as the coefficient of bkK(t).

Lemma 2.1 ( farouki2012bernstein ).

Given an arbitrary continuous function f(t) on t∈[0,1], let pK(t) denote the Bernstein approximation of f(t) as defined in Equation (2). We have pK(t)β†’f(t) as Kβ†’βˆž.

For the filter function h:[0,2]↦[0,1], we let t=Ξ»2 and f(t)=h(2t), so that the Bernstein polynomial approximation becomes applicable, where ΞΈk=f(k/K)=h(2k/K) and bkK(t)=bkK(Ξ»2)=(Kk)(1βˆ’Ξ»2)Kβˆ’k(Ξ»2)k for k=1,…,K. Consequently, we can approximate h(Ξ») by pK(Ξ»/2)=βˆ‘k=0KΞΈk(Kk)(1βˆ’Ξ»2)Kβˆ’k(Ξ»2)k=βˆ‘k=0KΞΈk12K(Kk)(2βˆ’Ξ»)Kβˆ’kΞ»k, and Lemma 2.1 ensures that pK(Ξ»/2)β†’h(Ξ») as Kβ†’βˆž.

Replacing {h(Ξ»i)}i=1n with {pK(Ξ»i/2)}i=1n, we approximate the spectral filter h(𝐋) in Equation (1) as π”π‘‘π‘–π‘Žπ‘”[pK(Ξ»1/2),…,pK(Ξ»n/2)]𝐔T and derive the proposed BernNet. In particular, given a graph signal 𝐱, the convolutional operator of our BernNet is defined as follows:

𝐳=π”π‘‘π‘–π‘Žπ‘”[pK(Ξ»1/2),…,pK(Ξ»n/2)]𝐔T⏟BernNet𝐱=βˆ‘k=0KΞΈk12K(Kk)(2πˆβˆ’π‹)Kβˆ’k𝐋k𝐱 (3)

where each coefficient ΞΈk can be either set to h(2k/K) to approximate a predetermined filter h, or learnt from the graph structure and signal in an end-to-end fashion. As a natural extension of Lemma 2.1, our BernNet owns the following proposition.

Proposition 2.1.

For an arbitrary continuous filter function h:[0,2]β†’[0,1], by setting ΞΈk=h(2k/K),k=0,…,K, the 𝐳 in Equation (3) satisfies 𝐳→h(𝐋)𝐱 as Kβ†’βˆž.

Proof.

According to the above derivation, we have pK(Ξ»/2)=βˆ‘k=0KΞΈk(Kk)(1βˆ’Ξ»2)Kβˆ’k(Ξ»2)k=βˆ‘k=0KΞΈk12K(Kk)(2βˆ’Ξ»)Kβˆ’kΞ»k, and Lemma 2.1 ensures that pK(Ξ»/2)β†’h(Ξ») as ΞΈk=h(2k/K) and Kβ†’βˆž.

Consequently, we have

𝐳=π”π‘‘π‘–π‘Žπ‘”[pK(Ξ»1/2),…,pK(Ξ»n/2)]𝐔Tπ±β†’π”π‘‘π‘–π‘Žπ‘”[h(Ξ»1),…,h(Ξ»n)]𝐔T𝐱=h(𝐋)

as ΞΈk=h(2k/K) and Kβ†’βˆž.

∎

2.2 Realizing existing filters with BernNet.

As shown in Proposition 2.1, our BernNet can approximate arbitrary continuous spectral filters with sufficient precision. Below we give some representative examples of how our BernNet exactly realizes existing filters that are commonly used in GNNs.

  • β€’

    All-pass filter h(Ξ»)=1. We set ΞΈk=1 for k=0,…,K, and the approximation pK(Ξ»2)=1 is exactly the same with h(Ξ»). Accordingly, our BernNet becomes an identity matrix, which realizes the all-pass filter perfectly.

  • β€’

    Linear low-pass filter h(Ξ»)=1βˆ’Ξ»/2. We set ΞΈk=1βˆ’k/K for k=0,…,K and obtain pK(Ξ»2)=1βˆ’Ξ»/2. The BernNet becomes βˆ‘k=0K(Kβˆ’k)K12K(Kk)(2πˆβˆ’π‹)Kβˆ’k𝐋k=πˆβˆ’12𝐋, which achieves the linear low-pass filter exactly. Note that πˆβˆ’12𝐋=12(𝐈+𝐏) is also the same as the graph convolutional network (GCN) before renormalization kipf2016semi .

  • β€’

    Linear high-pass filter h(Ξ»)=Ξ»/2. Similarly, we can set ΞΈk=k/K for k=0,…,K to get a perfect approximation pK(Ξ»2)=Ξ»2, and the BernNet becomes 12𝐋.

Note that even for those non-continuous spectral filters, e.g., the impulse low/high/band-pass filters, our BernNet can also provide good approximations (with sufficient large K).

  • β€’

    Impulse low-pass filter h(Ξ»)=Ξ΄0(Ξ»).222The impulse function Ξ΄x(Ξ»)=1 if Ξ»=x, otherwise Ξ΄x(Ξ»)=0 We set ΞΈ0=1 and ΞΈk=0 for kβ‰ 0, and pK(Ξ»2)=(1βˆ’Ξ»2)K. Accordingly, the BernNet becomes 12K(2πˆβˆ’π‹)K, deriving an K-layer linear low-pass filter.

  • β€’

    Impulse high-pass filter h(Ξ»)=Ξ΄2(Ξ»). We set ΞΈK=1 and ΞΈk=0 for kβ‰ K, and pK(Ξ»2)=(Ξ»2)K. The BernNet becomes 12K𝐋K, i.e., an K-layer linear high-pass filter.

  • β€’

    Impulse band-pass filter h(Ξ»)=Ξ΄1(Ξ»). Similarly, we set ΞΈK/2=1 and ΞΈk=0 for kβ‰ K/2, and pK(Ξ»2)=(KK/2)(1βˆ’Ξ»/2)K/2(Ξ»/2)K/2. The BernNet becomes 12K(KK/2)(2πˆβˆ’π‹)K/2𝐋K/2, which can be explained as stacking a K/2-layer linear low-pass filter and a K/2-layer linear high-pass filter. Obviously, K should be an even number in this case.

Table 1 summarizes the design of the BernNet for the filters above. We can find that an appealing advantage of our BernNet is that its coefficients are highly correlated with the spectral property of the target filter. In particular, we can determine to pass or reject the spectral signal with Ξ»β‰ˆ2kK by using a large or small ΞΈk because each Bernstein base bkK(Ξ») corresponds to a β€œbump” located at 2kK. This property provides useful guidance when designing filters, which enhances the interpretability of our BernNet.

Table 1: Realizing commonly used filters with BernNet.
Filter types Filter h(Ξ») ΞΈk for k=0,…,K Bernstein approximation pK(Ξ»2) BernNet
All-pass 1 θk=1 1 𝐈
Linear low-pass 1βˆ’Ξ»/2 ΞΈk=1βˆ’k/K 1βˆ’Ξ»/2 πˆβˆ’12𝐋
Linear high-pass Ξ»/2 ΞΈk=k/K Ξ»/2 12𝐋
Impulse low-pass Ξ΄0(Ξ») ΞΈ0=1 and other ΞΈk=0 (1βˆ’Ξ»/2)K 12K(2πˆβˆ’π‹)K
Impulse high-pass Ξ΄2(Ξ») ΞΈK=1 and other ΞΈk=0 (Ξ»/2)K 12K𝐋K
Impulse band-pass Ξ΄1(Ξ») ΞΈK/2=1 and other ΞΈk=0 (KK/2)(1βˆ’Ξ»/2)K/2(Ξ»/2)K/2 12K(KK/2)(2πˆβˆ’π‹)K/2𝐋K/2

2.3 Learning complex filters with BernNet

Besides designing the above typical filters, our BernNet can express more complex filters, such as band-pass, band-rejection, comb, low-band-pass filters, etc. Moreover, given the graph signals before and after applying such filters (i.e., the 𝐱’s and the corresponding 𝐳’s), our BernNet can learn their approximations in an end-to-end manner. Specifically, given the pairs {𝐱,𝐳}, we learn the coefficients {ΞΈk}k=0K of the BernNet by gradient descent. More implementation details can be found at the experimental section below. Figure 2 illustrates the four complex filters and the approximations we learned (The low-band pass filter is h(Ξ»)=I[0,0.5](Ξ»)+exp⁑(βˆ’100(Ξ»βˆ’0.5)2)I(0.5,1)(Ξ»)+exp⁑(βˆ’50(Ξ»βˆ’1.5)2)I[1,2](Ξ»), where IΞ©(Ξ»)=1 when λ∈Ω, otherwise IΞ©(Ξ»)=0). In general, our BernNet can learn a smoothed approximation of these complex filters, and the approximation precision improves with the increase of the order K. Note that although the BernNet cannot pinpoint the exact peaks of the comb filter or drop to 0 for the valleys of comb or low-band-pass filters due to the limitation of K, it still significantly outperforms other GNNs for learning such complex filters.

Refer to caption
(a) Band-pass
Refer to caption
(b) Band-rejection
Refer to caption
(c) Comb
Refer to caption
(d) Low-band-pass
Figure 2: Illustrations of four complex filters and their approximations learnt by BernNet.

3 BernNet in the Lens of Graph Optimization

In this section, we motivate BernNet from the perspective of graph optimization. In particular, we show that any polynomial filter that attempts to approximate a valid filter has to take the form of BernNet.

3.1 A generalized graph optimization problem

Given a n-dimensional graph signal 𝐱, we consider a generalized graph optimization problem

min𝐳⁑f(𝐳)=(1βˆ’Ξ±)𝐳TΞ³(𝐋)𝐳+Ξ±β€–π³βˆ’π±β€–22 (4)

where α∈[0,1) is a trade-off parameter, 𝐳∈Rn denotes the propagated representation of the input graph signal 𝐱, and Ξ³(𝐋) denotes an energy function of 𝐋, determining the rate of propagation spielman2019spectral . Generally, Ξ³(β‹…) operates on the spectral of 𝐋, and we have Ξ³(𝐋)=π”π‘‘π‘–π‘Žπ‘”[Ξ³(Ξ»1),…,Ξ³(Ξ»n)]𝐔T.

We can model the polynomial filtering operation of existing GNNs with the optimal solution of Equation (4). For example, if we set Ξ³(𝐋)=𝐋, then the optimization function (4) becomes f(𝐳)=(1βˆ’Ξ±)𝐳T𝐋𝐳+Ξ±β€–π³βˆ’π±β€–22, a well-known convex graph optimization function proposed by Zhou et al. zhouLearing . f(𝐳) takes the minimum when the derivative βˆ‚f(𝐳)βˆ‚π³=2(1βˆ’Ξ±)𝐋𝐳+2Ξ±(π³βˆ’π±)=𝟎, which solves to

π³βˆ—=Ξ±(πˆβˆ’(1βˆ’Ξ±)(πˆβˆ’π‹))βˆ’1𝐱=βˆ‘k=0∞α(1βˆ’Ξ±)k(πˆβˆ’π‹)k𝐱=βˆ‘k=0∞α(1βˆ’Ξ±)k𝐏k𝐱.

By taking a suffix sum βˆ‘k=0KΞ±(1βˆ’Ξ±)k𝐏k𝐱, we obtain the polynomial filtering operation for APPNP appnp . Zhu et al. zhu2021interpreting further show that GCN kipf2016semi , DAGNN liu2020DAGNN , and JKNet xu2018jknet can be interpreted by the optimization function (4) with Ξ³(𝐋)=𝐋.

The generalized form of Equation (4) allows us to simulate more complex polynomial filtering operation. For example, let Ξ±=0.5 and Ξ³(𝐋)=etπ‹βˆ’πˆ, a heat kernel with t as the temperature parameter. Then f(𝐳) takes the minimum when the derivative βˆ‚f(𝐳)βˆ‚π³=(etπ‹βˆ’πˆ)𝐳+π³βˆ’π±=𝟎, which solves to

π³βˆ—=eβˆ’t𝐋𝐱=eβˆ’t(πˆβˆ’π)𝐱=βˆ‘k=0∞eβˆ’ttkk!𝐏k𝐱.

By taking a suffix sum βˆ‘k=0Keβˆ’ttkk!𝐏k𝐱, we obtain the polynomial filtering operation for the heat kernal based GNN such as GDC klicpera2019diffusion and GraphHeat xu2020graphheat .

3.2 Non-negative constraint on polynomial filters

A natural question is that, does an arbitrary energy function Ξ³(𝐋) correspond to a valid or ill-posed spectral filter? Conversely, does any polynomial filtering operation βˆ‘k=0Kwk𝐋k𝐱 correspond to the optimal solution of the optimization function (4) for some energy function Ξ³(𝐋)?

As it turns out, there is a β€œminimum requirement” for the energy function Ξ³(𝐋); Ξ³(𝐋) has to be positive semidefinite. In particular, if Ξ³(𝐋) is not positive semidefinite, then the optimization function f(𝐳) is not convex, and the solution to βˆ‚f(𝐳)βˆ‚π³=0 may corresponds to a saddle point. Furthermore, without the positive semidefinite constraint on Ξ³(𝐋), f(𝐳) may goes to βˆ’βˆž as we set 𝐳 to be a multiple of the eigenvector corresponding to the negative eigenvalue.

Non-negative polynomial filters. Given a positive semidefinite energy function Ξ³(𝐋), we now consider how the corresponding polynomial filtering operation βˆ‘k=0Kwk𝐋k𝐱 should look like. Recall that we assume Ξ³(𝐋)=π”π‘‘π‘–π‘Žπ‘”[Ξ³(Ξ»1),…,Ξ³(Ξ»n)]𝐔T. By the positive semidefinite constraint, we have Ξ³(Ξ»)β‰₯0 for λ∈[0,2]. Since the objective function f(𝐳) is convex, it takes the minimum when βˆ‚f(𝐳)βˆ‚π³=2(1βˆ’Ξ±)Ξ³(𝐋)𝐳+2Ξ±(π³βˆ’π±)=𝟎. Accordingly, the optimum π’›βˆ— can be derived as

Ξ±(α𝐈+(1βˆ’Ξ±)Ξ³(𝐋))βˆ’1𝐱=π”π‘‘π‘–π‘Žπ‘”[Ξ±Ξ±+(1βˆ’Ξ±)Ξ³(Ξ»1),…,Ξ±Ξ±+(1βˆ’Ξ±)Ξ³(Ξ»n)]𝐔T𝐱. (5)

Let h(Ξ»)=Ξ±Ξ±+(1βˆ’Ξ±)Ξ³(Ξ») denote the exact spectral filter, and g(Ξ»)=βˆ‘k=0KwkΞ»k denote a polynomial approximation of h(Ξ») (e.g. the suffix sum of h(Ξ»)’s taylor expansion). Since Ξ³(Ξ»)β‰₯0 when λ∈[0,2], we have 0≀h(Ξ»)≀αα+(1βˆ’Ξ±)β‹…0=1 for λ∈[0,2]. Consequently, it is natural to assume the polynomial filter g(Ξ»)=βˆ‘k=0KwkΞ»k also satisfies 0≀g(Ξ»)≀1.

Constraint 3.1.

Assuming the energy function Ξ³(𝐋) is positive semidefinite, a polynomial filter g(Ξ»)=βˆ‘k=0KwkΞ»k approximating the optimal solution to Equation (4) has to satisfy

0≀g(Ξ»)=βˆ‘k=0KwkΞ»k≀1,βˆ€Ξ»βˆˆ[0,2]. (6)

While Constraint 3.1 seems to be simple and intuitive, some of the existing GNN may not satisfies this constraint. For example, GCN uses 𝐳=𝐏𝐱=(πˆβˆ’π‹)𝐱, which corresponds to a polynomial filter g(Ξ»)=1βˆ’Ξ» that takes negative value when Ξ»>1, violating Constraint 3.1. As shown in wu2019sgc , the renormalization trick 𝐏~=(𝐈+𝐃)βˆ’1/2(𝐈+𝐀)(𝐈+𝐃)βˆ’1/2 shrinks the spectral and thus reliefs the problem. However, g(Ξ») may still take negative value as the maximum eigenvalue of 𝐋~=πˆβˆ’π~ is still larger than 1.

3.3 Non-negative polynomials and Bernstein basis

Constraint 3.1 motivates us to design polynomial filters g(Ξ»)=βˆ‘k=0KwkΞ»k such that 0≀g(Ξ»)≀1 when λ∈[0,2]. The g(Ξ»)≀1 part is trivial, as we can always rescale each wk by a factor of βˆ‘k=0K|wk|2k. The g(Ξ»)β‰₯0 part, however, requires more elaboration. Note that we can not simply set wkβ‰₯0 for each k=0…,K, since it is shown in chien2021GPR-GNN that such polynomials only correspond to low-pass filters.

As it turns out, the Bernstein basis has the following nice property: a polynomial that is non-negative on a certain interval can always be expressed as a non-negative linear combination of Bernstein basis. Specifically, we have the following lemma.

Lemma 3.1 (powers2000polynomials ).

Assume a polynomial p(x)=βˆ‘k=0KΞΈkxk satisfies p(x)β‰₯0 for x∈[0,1]. Then there exists a sequence of non-negative coefficients ΞΈk, k=0,…,K, such that

p(x)=βˆ‘k=0KΞΈkbkK(x)=βˆ‘k=0KΞΈk(Kk)(1βˆ’x)Kβˆ’kxk

Lemma 3.1 suggests that to approximate a valid filter, the polynomial filter g(Ξ») has to be a non-negative linear combination of Bernstein basis. Specifically, by setting x=Ξ»/2, the filter g(Ξ») that satisfies g(Ξ»)β‰₯0 for λ∈[0,2] can be expressed as

g(Ξ»):=p(Ξ»2)=βˆ‘k=0KΞΈk12K(Kk)(2βˆ’Ξ»)Kβˆ’kΞ»k.

Consequently, any valid polynomial filter that approximate the optimal solution of (4) with positive semidefinite energy function Ξ³(𝐋) has to take the following form: 𝐳=βˆ‘k=0KΞΈk12K(Kk)(2πˆβˆ’π‹)Kβˆ’k𝐋k𝐱. This observation motivates our BernNet from the perspective of graph optimization β€” any valid polynomial filers, i.e., the g:[0,2]↦[0,1], can always be expressed by BernNet, and accordingly, the filters learned by our BernNet are always valid.

4 Related Work

Graph neural networks (GNNs) can be broadly divided into spectral-based GNNs and spatial-based GNNs wu2020comprehensive .

Spectral-based GNNs design spectral graph filters in the spectral domain. ChebNet Chebnet uses Chebyshev polynomial to approximate a filter. GCN kipf2016semi simplifies the Chebyshev filter with the first-order approximation. GraphHeat xu2020graphheat uses heat kernel to design a graph filter. APPNP appnp utilizes Personalized PageRank (PPR) to set the filter weights. GPR-GNN chien2021GPR-GNN learns the polynomial filters via gradient descent on the polynomial coefficients. ARMA bianchi2021ARMA learns a rational filter via the family of Auto-Regressive Moving Average filters narang2013signal . AdaGNN adagnn learns simple filters across multiple layers with a single parameter for each feature channel at each layer. As aforementioned, these methods mainly focus on designing low- or high-pass filters or learning filters without any constraints, which may lead to misspecified even ill-posed filters.

On the other hand, spatial-based GNNs directly propagate and aggregate graph information in the spatial domain. From this perspective, GCN kipf2016semi can be explained as the aggregation of the one-hop neighbor information on the graph. GAT gat uses the attention mechanism to learn aggregation weights. Recently, Balcilar et al. balcilar2021analyzing bridge the gap between spectral-based and spatial-based GNNs and unify GNNs in the same framework. Their work shows that the GNNs can be interpreted as sophisticated data-driven filters. This motivates the design of the proposed BernNet, which can learn arbitrary non-negative spectral filters from real-world graph signals.

5 Experiments

In this section, we conduct experiments to evaluate BernNet’s capability to learn arbitrary filters as well as the performance of BernNet on real datasets. All the experiments are conducted on a machine with an NVIDIA TITAN V GPU (12GB memory), Intel Xeon CPU (2.20 GHz), and 512GB of RAM.

Refer to caption
(a) Original
Refer to caption
(b) Low-pass
Refer to caption
(c) High-pass
Refer to caption
(d) Band-pass
Refer to caption
(e) Band-rejection
Refer to caption
(f) Comb
Figure 3: A input image and the filtering results.
Table 2: Average sum of squared error and R2 score in parentheses.
Low-pass High-pass Band-pass Band-rejection Comb
exp⁑(βˆ’10Ξ»2) 1βˆ’exp⁑(βˆ’10Ξ»2) exp⁑(βˆ’10(Ξ»βˆ’1)2) 1βˆ’exp⁑(βˆ’10(Ξ»βˆ’1)2) |sin⁑(πλ)|
GCN 3.4799(.9872) 67.6635(.2364) 25.8755(.1148) 21.0747(.9438) 50.5120(.2977)
GAT 2.3574(.9905) 21.9618(.7529) 14.4326(.4823) 12.6384(.9652) 23.1813(.6957)
GPR-GNN 0.4169(.9984) 0.0943(.9986) 3.5121(.8551) 3.7917(.9905) 4.6549(.9311)
ARMA 1.8478(.9932) 1.8632(.9793) 7.6922(.7098) 8.2732(.9782) 15.1214(.7975)
ChebNet 0.8220(.9973) 0.7867(.9903) 2.2722(.9104) 2.5296(.9934) 4.0735(.9447)
BernNet 0.0314(.9999) 0.0113(.9999) 0.0411(.9984) 0.9313(.9973) 0.9982(.9868)

5.1 Learning filters from the signal

We conduct an empirical analysis on 50 real images with the resolution of 100Γ—100 from the Image Processing Toolbox in Matlab. We conduct independent experiments on these 50 images and report the average of the evaluation index. Following the experimental setting in balcilar2021analyzing , we regard each image as a 2D regular 4-neighborhood grid graph. The graph structure translates to an 10,000Γ—10,000 adjacency matrix while the pixel intensity translates to a 10,000-dimensional signal vector.

For each of the 50 images, we apply 5 different filters (low-pass, high-pass, band-pass, band-rejection and comb) to the spectral domain of its signal. The formula of each filter is shown in Table 2. Recall that applying a low-pass filter exp⁑(βˆ’10Ξ»2) to the spectral domain 𝐋=π”π‘‘π‘–π‘Žπ‘”[Ξ»1,…,Ξ»n]π”βŠ€ means applying π”π‘‘π‘–π‘Žπ‘”[exp⁑(βˆ’10Ξ»12),…,exp⁑(βˆ’10Ξ»n2)]π”βŠ€ to the graph signal. Figure 3 shows the one of the input image and the corresponding filtering results.

In this task, we use the original graph signal as the input and the filtering signal to supervise the training process. The goal is to minimize the square error between output and the filtering signal by learning the correct filter. We evaluate BernNet against five popular GNN models: GCN kipf2016semi , GAT gat , GPR-GNN chien2021GPR-GNN , ARMA bianchi2021ARMA and ChebNet Chebnet . To ensure fairness, we use two convolutional units and a linear output layer for all models. We train all models with approximately 2k trainable parameters and tune the hidden units to ensure they have similar parameters. Following balcilar2021analyzing , we discard any regularization or dropout and simply force the GNN to learn the input-output relation. For all models, we set the maximum number of epochs to 2000 and stop the training if the loss does not drop for 100 consecutive times and use Adam optimization with a 0.01 learning rate without decay. Models do not use the position information of the picture pixels. We use a mask to cover the edge nodes of the picture, so the problem can be regarded as a simple regression problem. For BernNet, we use a two-layer model, with each layer sharing the same set of ΞΈk for k=0,…,K and set K=10. For GPR-GNN, we use the officially released code (see the supplementary materials for URL and commit numbers) and set the order of polynomial filter K=10. Other baseline models are based on Pytorch Geometric implementation fey2019fast . The more detailed experiments setting can be found in the Appendix.

Table 2 shows the average of the sum of squared error (lower the better) and the R2 scores (higher the better). We first observe that GCN and GAT can only handle low-pass filters, which concurs with the theoretical analysis in balcilar2021analyzing . GPR-GNN, ARMA and ChebNet can learn different filters from the signals. However, BernNet consistently outperformed these models by a large margin on all tasks in terms of both metrics. We attribute this quality to BernNet’s ability to tune the coefficients ΞΈk’s, which directly correspond to the uniformly sampled filter values.

5.2 Node classification on real-world datasets

We now evaluate the performance of BernNet against the competitors on real-world datasets. Following chien2021GPR-GNN , we include three citation graph Cora, CiteSeer and PubMed sen2008collective ; yang2016revisiting , and the Amazon co-purchase graph Computers and Photo mcauley2015image . As shown in chien2021GPR-GNN these 5 datasets are homophilic graphs on which the connected nodes tend to share the same label. We also include the Wikipedia graph Chameleon and Squirrel musae , the Actor co-occurrence graph, and webpage graphs Texas and Cornell from WebKB333http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/ pei2020geom . These 5 datasets are heterophilic datasets on which connected nodes tend to have different labels. We summarize the statistics of these datasets in Table 3.

Table 3: Dataset statistics.
Cora CiteSeer PubMed Computers Photo Chameleon Squirrel Actor Texas Cornell
Nodes 2708 3327 19717 13752 7650 2277 5201 7600 183 183
Edges 5278 4552 44324 245861 119081 31371 198353 26659 279 277
Features 1433 3703 500 767 745 2325 2089 932 1703 1703
Classes 7 6 5 10 8 5 5 5 5 5

Following chien2021GPR-GNN , we perform full-supervised node classification task with each model, where we randomly split the node set into train/validation/test set with ratio 60%/20%/20%. For fairness, we generate 10 random splits by random seeds and evaluate all models on the same splits, and report the average metric for each model.

We compare BernNet with 6 baseline models: MLP, GCN kipf2016semi , GAT gat , APPNP appnp , ChebNet Chebnet , and GPR-GNN chien2021GPR-GNN . For GPR-GNN, we use the officially released code (see the supplementary materials for URL and commit numbers) and set the order of polynomial filter K=10. For other models, we use the corresponding Pytorch Geometric library implementations fey2019fast . For BernNet, we use the following propagation process:

𝐙=βˆ‘k=0KΞΈk12K(Kk)(2πˆβˆ’π‹)Kβˆ’k𝐋kf(𝐗), (7)

where f(𝐗) is a 2-layer MLP with 64 hidden units on the feature matrix 𝐗. Note that this propagation process is almost identical to that of APPNP or GPR-GNN. The only difference is that we substitute the Generalized PageRank polynomial with Bernstein polynomial. We set the K=10 and use different learning rate and dropout for the linear layer and the propagation layer. For all models, we optimal leaning rate over {0.001,0.002,0.01,0.05} and weight decay {0.0,0.0005}. More detailed experimental settings are discussed in Appendix.

We use the micro-F1 score with a 95% confidence interval as the evaluation metric. The relevant results are summarized in Table 4. Boldface letters indicate the best result for the given confidence interval. We observe that BernNet provides the best results on seven out of the ten datasets. On the other three datasets, BernNet also achieves competitive results against SOTA methods.

Refer to caption
(a) Chameleon
Refer to caption
(b) Actor
Refer to caption
(c) Squirrel
Refer to caption
(d) Texas
Figure 4: Filters learnt from real-world datasets by BernNet.
Table 4: Results on real world benchmark datasets: Mean accuracy (%) Β± 95% confidence interval.
GCN GAT APPNP MLP ChebNet GPR-GNN BernNet
Cora 87.14Β±1.01 88.03Β±0.79 88.14Β±0.73 76.96Β±0.95 86.67Β±0.82 88.57Β±0.69 88.52Β±0.95
CiteSeer 79.86Β±0.67 80.52Β±0.71 80.47Β±0.74 76.58Β±0.88 79.11Β±0.75 80.12Β±0.83 80.09Β±0.79
PubMed 86.74Β±0.27 87.04Β±0.24 88.12Β±0.31 85.94Β±0.22 87.95Β±0.28 88.46Β±0.33 88.48Β±0.41
Computers 83.32Β±0.33 83.32Β±0.39 85.32Β±0.37 82.85Β±0.38 87.54Β±0.43 86.85Β±0.25 87.64Β±0.44
Photo 88.26Β±0.73 90.94Β±0.68 88.51Β±0.31 84.72Β±0.34 93.77Β±0.32 93.85Β±0.28 93.63Β±0.35
Chameleon 59.61Β±2.21 63.13Β±1.93 51.84Β±1.82 46.85Β±1.51 59.28Β±1.25 67.28Β±1.09 68.29Β±1.58
Actor 33.23Β±1.16 33.93Β±2.47 39.66Β±0.55 40.19Β±0.56 37.61Β±0.89 39.92Β±0.67 41.79Β±1.01
Squirrel 46.78Β±0.87 44.49Β±0.88 34.71Β±0.57 31.03Β±1.18 40.55Β±0.42 50.15Β±1.92 51.35Β±0.73
Texas 77.38Β±3.28 80.82Β±2.13 90.98Β±1.64 91.45Β±1.14 86.22Β±2.45 92.95Β±1.31 93.12Β±0.65
Cornell 65.90Β±4.43 78.21Β±2.95 91.81Β±1.96 90.82Β±1.63 83.93Β±2.13 91.37Β±1.81 92.13Β±1.64

More interestingly, this experiment also shows BernNet can learn complex filters from real-world datasets with only the supervision of node labels. Figure 4 plots some of the filters BernNet learnt in the training process. On Actor, BernNet learns an all-pass-alike filter, which concurs with the fact that MLP outperforms all other baselines on this dataset. On Chameleon and Squirrel, BernNet learns two comb-alike filters. Given that BernNet outperforms all competitors by at least 1% on these two datasets, it may suggest that comb-alike filters are necessary for Chameleon and Squirrel. Figure 5 shows the Coefficients ΞΈk learnt from real-world datasets by BernNet. When comparing Figures 4 and 5, we observe that the curves of filters and curves of coefficients are almost the same. This is because BernNet’s coefficients are highly correlated with the spectral property of the target filter, which indicates BernNet Bernnet has strong interpretability.

Refer to caption
(a) Chameleon
Refer to caption
(b) Actor
Refer to caption
(c) Squirrel
Refer to caption
(d) Texas
Figure 5: Coefficients ΞΈk learnt from real-world datasets by BernNet.

Finally, we present the train time for each method in Table 5. BernNet is slower than other methods due to its quadratic dependence on the degree K. However, compared to the SOTA method GPR-GNN, the margin is generally less than 2, which is often acceptable in practice. In theory, both ChebNet Chebnet and GPR-GNN chien2021GPR-GNN are linear time complexity related to propagation step K, but BernNet is quadratic time complexity related to K. Delgado et al. delgado2003linear show that Bernstein approximation can be evaluated in linear time related to K using the corner cutting algorithm. However, BernNet can not use this algorithm directly, because we need to multiply signal 𝐱. How to convert BernNet to linear complexity will be a problem worth studying in the future.

Table 5: Average running time per epoch (ms)/average total running time (s).
GCN GAT APPNP MLP ChebNet GPR-GNN BernNet
Cora 4.59/1.62 9.56/2.03 7.16/2.32 3.06/0.93 6.25/1.76 9.94/2.21 19.71/5.47
CiteSeer 4.63/1.95 9.93/2.21 7.79/2.77 2.95/1.09 8.28/2.56 11.16/2.37 22.36/6.32
PubMed 5.12/1.87 16.16/3.41 8.21/2.63 2.91/1.61 18.04/3.03 10.45/2.81 22.02/8.19
Computers 5.72/2.52 30.91/7.85 9.19/3.48 3.47/1.31 20.64/9.64 16.05/4.38 28.83/8.69
Photo 5.08/2.63 19.97/5.41 8.69/4.18 3.67/1.66 13.25/7.02 13.96/3.94 24.69/7.37
Chameleon 4.93/0.99 13.11/2.66 7.93/1.62 3.14/0.63 10.92/2.25 10.93/2.41 22.54/4.75
Actor 5.43/1.09 11.94/2.45 8.46/1.71 3.82/0.77 7.99/1.62 11.57/2.35 23.34/5.81
Squirrel 5.61/1.13 22.76/4.91 8.01/1.61 3.41/0.69 38.12/7.78 9.87/5.56 25.58/9.23
Texas 4.58/0.92 9.65/1.96 7.83/1.63 3.19/0.65 6.51/1.34 10.45/2.16 23.35/4.81
Cornell 4.83/0.97 9.79/1.99 8.23/1.68 3.25/0.66 5.85/1.22 9.86/2.05 22.23/5.26

6 Conclusion

This paper proposes BernNet, a graph neural network that provides a simple and intuitive mechanism for designing and learning an arbitrary spectral filter via Bernstein polynomial approximation. Compared to previous methods, BernNet can approximate complex filters such as band-rejection and comb filters, and can provide better interpretability. Furthermore, the polynomial filters designed and learned by BernNet are always valid. Experiments show that BernNet outperforms SOTA methods in terms of effectiveness on both synthetic and real-world datasets. For future work, an interesting direction is to improve the efficiency of BernNet.

Broader Impact

The proposed BernNet algorithm addresses the challenge of designing and learning arbitrary spectral filters on graphs. We consider this algorithm a general technical and theoretical contribution, without any foreseeable specific impacts. For applications in bioinformatics, computer vision, and natural language processing, applying the BernNet algorithm may improve the performance of existing GNN models. We leave the exploration of other potential impacts to future work.

Acknowledgments and Disclosure of Funding

Zhewei Wei was supported in part by National Natural Science Foundation of China (No. 61972401, No. 61932001 and No. 61832017), by Beijing Outstanding Young Scientist Program NO. BJJWZYJH012019100020098, by Alibaba Group through Alibaba Innovative Research Program, and by CCF-Baidu Open the Fund (NO.2021PP15002000). Zengfeng Huang was supported by National Natural Science Foundation of China Grant No. 61802069, and by Shanghai Science and Technology Commission Grant No. 17JC1420200. Hongteng Xu was supported by Tencent AI Lab Rhino-Bird Joint Research Program. This work is supported by China Unicom Innovation Ecological Cooperation Plan and by Intelligent Social Governance Platform, Major Innovation & Planning Interdisciplinary Platform for the β€œDouble-First Class” Initiative, Renmin University of China. We also wish to acknowledge the support provided and contribution made by Public Policy and Decision-making Research Lab of Renmin University of China.

References

  • [1] Muhammet Balcilar, Pierre HΓ©roux, Benoit GaΓΌzΓ¨re, SΓ©bastien Adam, and Paul Honeine. Analyzing the expressive power of graph neural networks in a spectral perspective. In ICLR, 2021.
  • [2] Filippo Maria Bianchi, Daniele Grattarola, Lorenzo Livi, and Cesare Alippi. Graph neural networks with convolutional arma filters. TPAMI, 2021.
  • [3] Toon Bogaerts, Antonio D Masegosa, Juan S Angarita-Zapata, Enrique Onieva, and Peter Hellinckx. A graph cnn-lstm neural network for short and long-term traffic forecasting based on trajectory data. Transportation Research Part C: Emerging Technologies, 2020.
  • [4] Zhao-Min Chen, Xiu-Shen Wei, Peng Wang, and Yanwen Guo. Multi-label image recognition with graph convolutional networks. In CVPR, 2019.
  • [5] Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. Adaptive universal generalized pagerank graph neural network. In ICLR, 2021.
  • [6] Zhiyong Cui, Kristian Henrickson, Ruimin Ke, and Yinhai Wang. Traffic graph convolutional recurrent neural network: A deep learning framework for network-scale traffic learning and forecasting. T-ITS, 21(11):4883–4894, 2019.
  • [7] MichaΓ«l Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, pages 3844–3852, 2016.
  • [8] Jorge Delgado and Juan Manuel Pena. A linear complexity algorithm for the bernstein basis. In GMP, pages 162–167, 2003.
  • [9] Yushun Dong, Kaize Ding, Brian Jalaian, Shuiwang Ji, and Jundong Li. Graph neural networks with adaptive frequency response filter. In CIKM, 2021.
  • [10] Rida T Farouki. The bernstein polynomial basis: A centennial retrospective. Computer Aided Geometric Design, 29(6):379–419, 2012.
  • [11] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with pytorch geometric. In ICLR, 2019.
  • [12] Dejun Jiang, Zhenxing Wu, Chang-Yu Hsieh, Guangyong Chen, Ben Liao, Zhe Wang, Chao Shen, Dongsheng Cao, Jian Wu, and Tingjun Hou. Could graph neural networks learn better molecular representation for drug discovery? a comparison study of descriptor-based and graph-based models. Journal of cheminformatics, 13(1):1–23, 2021.
  • [13] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
  • [14] Johannes Klicpera, Aleksandar Bojchevski, and Stephan GΓΌnnemann. Predict then propagate: Graph neural networks meet personalized pagerank. In ICLR, 2019.
  • [15] Johannes Klicpera, Stefan Weißenberger, and Stephan GΓΌnnemann. Diffusion improves graph learning. In NeurIPS, 2019.
  • [16] Ron Levie, Federico Monti, Xavier Bresson, and Michael M Bronstein. Cayleynets: Graph convolutional neural networks with complex rational spectral filters. Transactions on Signal Processing, 67(1):97–109, 2018.
  • [17] Chang Li and Dan Goldwasser. Encoding social information with graph convolutional networks forpolitical perspective detection in news media. In ACL, 2019.
  • [18] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In ICLR, 2018.
  • [19] Meng Liu, Hongyang Gao, and Shuiwang Ji. Towards deeper graph neural networks. In KDD, 2020.
  • [20] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. Image-based recommendations on styles and substitutes. In SIGIR, 2015.
  • [21] Sunil K Narang, Akshay Gadde, and Antonio Ortega. Signal processing techniques for interpolation in graph structured data. In ICASSP. IEEE, 2013.
  • [22] Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. Geom-gcn: Geometric graph convolutional networks. In ICLR, 2020.
  • [23] Victoria Powers and Bruce Reznick. Polynomials that are positive on an interval. Transactions of the American Mathematical Society, 352(10):4677–4692, 2000.
  • [24] Jiezhong Qiu, Jian Tang, Hao Ma, Yuxiao Dong, Kuansan Wang, and Jie Tang. Deepinf: Social influence prediction with deep learning. In KDD, 2018.
  • [25] Prakash Chandra Rathi, R Frederick Ludlow, and Marcel L Verdonk. Practical high-quality electrostatic potential surfaces for drug discovery using a graph-convolutional deep neural network. Journal of medicinal chemistry, 63(16):8778–8790, 2019.
  • [26] Benedek Rozemberczki, Carl Allen, and Rik Sarkar. Multi-scale attributed node embedding. Journal of Complex Networks, 9(2):cnab014, 2021.
  • [27] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008.
  • [28] Daniel Spielman. Spectral and algebraic graph theory. Yale lecture notes, draft of December, 4, 2019.
  • [29] Peihao Tong, Qifan Zhang, and Junjie Yao. Leveraging domain context for question answering over knowledge graph. Data Science and Engineering, 4(4):323–335, 2019.
  • [30] Petar VeličkoviΔ‡, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.
  • [31] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In ICML, 2019.
  • [32] Shu Wu, Yuyuan Tang, Yanqiao Zhu, Liang Wang, Xing Xie, and Tieniu Tan. Session-based recommendation with graph neural networks. In AAAI, 2019.
  • [33] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 2020.
  • [34] Bingbing Xu, Huawei Shen, Qi Cao, Keting Cen, and Xueqi Cheng. Graph convolutional networks using heat kernel for semi-supervised learning. In IJCAI, 2019.
  • [35] Bingbing Xu, Huawei Shen, Qi Cao, Yunqi Qiu, and Xueqi Cheng. Graph wavelet neural network. In ICLR, 2018.
  • [36] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In ICML, 2018.
  • [37] Zhilin Yang, William Cohen, and Ruslan Salakhudinov. Revisiting semi-supervised learning with graph embeddings. In ICML, pages 40–48. PMLR, 2016.
  • [38] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In KDD, 2018.
  • [39] Long Zhao, Xi Peng, Yu Tian, Mubbasir Kapadia, and Dimitris N Metaxas. Semantic graph convolutional networks for 3d human pose regression. In CVPR, 2019.
  • [40] Dengyong Zhou, Olivier Bousquet, Thomas Lal, Jason Weston, and Bernhard SchΓΆlkopf. Learning with local and global consistency. In NeurIPS, 2004.
  • [41] Meiqi Zhu, Xiao Wang, Chuan Shi, Houye Ji, and Peng Cui. Interpreting and unifying graph neural networks with an optimization framework. In WWW, 2021.

Appendix A Additional experimental details

A.1 Learning filters from the signal (Section 5.1)

For all models, we use two convolutional layers and a linear output layer that projects the final node representation onto the single output for each node. We train all models with approximately 2k trainable parameters and tune the hidden units to ensure they have similar parameters. We discard any regularization or dropout and simply force the GNN to learn the input-output relation. We stop the training if the loss does not drop for 100 consecutive epochs with a maximum limit of 2000 epochs and use Adam optimization with a 0.01 learning rate without decay. For GPR-GNN, we use the officially released code (URL and commit number in Table 6), and other baseline models are based on Pytorch Geometric implementation [11].

For GCN, we set the hidden units to 32 and set the linear units to 64. For GAT, the first layer has 4 attention heads and each head has 8 hidden; the second layer has 8 attention heads and each head has 8 hidden. And we set the linear units to 64. For ARMA, we set the ARMA layer and ARMA stacks to 1, and set the hidden units to 32 and set the linear units to 64. For ChebNet, we use 3 steps propagation for each layer with 32 hidden units and set the linear units to 64. For GPR-GNN, we set the hidden units to 32 with 10 steps propagation and set the linear units to 64, and use the random initialization for the GPR part. For BernNet, we use same set of ΞΈk for k=0,…,K in each layer and set K=10.

Table 6: URL and commit number of GPR-GNN codes
URL Commit
GRP-GNN https://github.com/jianhao2016/GPRGNN 2507f10

Regarding the analysis experiment in section 2.3, BenNet’s settings are the same as above and the image used is Figure 3.

A.2 Node classification on real-world datasets (Section 5.2)

In this experiment, we also use the officially released code (URL and commit number in Table 6) for GPR-GNN and Pytorch Geometric implementation [11] for other models. For all models, we use two convolutional layers and use early stopping 200 with a maximum of 1000 epochs for all datasets. We use the Adam optimizer to train the models and optimal leaning rate over {0.001,0.002,0.01,0.05} and weight decay {0.0,0.0005}.

Table 7: Hyperparameters for BernNet on real-world datasets.
Datasets Linear layer learning rate Propagation layer learning rate Hidden dimension Propagation layer dropout Linear layer dropout K Weight decay
Cora 0.01 0.01 64 0.0 0.5 10 0.0005
CiteSeer 0.01 0.01 64 0.5 0.5 10 0.0005
PubMed 0.01 0.01 64 0.0 0.5 10 0.0
Computers 0.01 0.05 64 0.6 0.5 10 0.0005
Photo 0.01 0.01 64 0.5 0.5 10 0.0005
Chameleon 0.05 0.01 64 0.7 0.5 10 0.0
Actor 0.05 0.01 64 0.9 0.5 10 0.0
Squirrel 0.05 0.01 64 0.6 0.5 10 0.0
Texas 0.05 0.002 64 0.5 0.5 10 0.0005
Cornell 0.05 0.001 64 0.5 0.5 10 0.0005

For GCN, we use 64 hidden units for each GCN convolutional layer. For MLP, we use the 2-layer full connected network with 64 hidden units. For GAT, we make the first layer have 8 attention heads and each heads have 8 hidden units, the second layer have 1 attention head and 64 hidden units. For APPNP, we use 2-layer MLP with 64 hidden units and set the propagation steps K to be 10. We search the optimal α within {0.1,0.2,0.5,0.9}. For ChebNet, we set the propagation steps K to be 2 and use 32 hidden units for each layer, which the number of equivalent hidden units is 64 for this case. For GPR-GNN, we use 2-layer MLP with 64 hidden units and set the propagation steps K to be 10, and use PPR initialization with α∈{0.1,0.2,0.5,0.9} for the GPR weights. For BernNet, we use 2-layer MLP with 64 hidden units and set the order K=10, and we optimize the learning rate for the linear layer and the propagation layer. We fix the dropout rate for the convolutional layer or the linear layer to be 0.5 for all models, and optimize the dropout rate for the propagation layer for GPR-GNN and BernNet. The Table 7 shows the hyperparameters of BernNet on real-world datasets.

Figure 6 plots the filters learnt from real-world datasets by BernNet. Besides some special filters discussed in section 5.2, we find that BernNet learns a low-pass filter on Cora and CiteSeer which concurs the analysis in [1]. Figure 7 shows the Coefficients ΞΈk learnt from real-world datasets by BernNet. We observe that the curves of filters and curves of coefficients are almost the same, this is because BernNet’s coefficients are highly correlated with the spectral property of the target filter which indicates BernNet Bernnet has strong interpretability.

Refer to caption
(a) Cora
Refer to caption
(b) CiteSeer
Refer to caption
(c) PubMed
Refer to caption
(d) Computers
Refer to caption
(e) Photo
Refer to caption
(f) Chameleon
Refer to caption
(g) Actor
Refer to caption
(h) Squirrel
Refer to caption
(i) Texas
Refer to caption
(j) Cornell
Figure 6: Filters learnt from real-world datasets by BernNet.
Refer to caption
(a) Cora
Refer to caption
(b) CiteSeer
Refer to caption
(c) PubMed
Refer to caption
(d) Computers
Refer to caption
(e) Photo
Refer to caption
(f) Chameleon
Refer to caption
(g) Actor
Refer to caption
(h) Squirrel
Refer to caption
(i) Texas
Refer to caption
(j) Cornell
Figure 7: Coefficients ΞΈk learnt from real-world datasets by BernNet.