What’s a Graph Neural Network?

February 15, 2024

This content is available as a PDF here.


There is a lot of expository material on Graph Neural Networks (GNNs) out there. Here, I want to focus on a short "mathematical" introduction, in the sense of quickly arriving at the concepts of invariance and equivariance to permutations and how GNNs are designed to handle them, in order to discuss in a future post our work (??, ????) on the convergence of GNNs to some "continuous" limit when the number of nodes goes to infinity.

1 Graphs

Graphs are ubiquitous structures¹ in mathematics and computer science. They can represent many kinds of interactions, such as brain connectivity, computer graphics meshes, protein interactions or social networks. In its most elementary definition, a graph is a couple $G=(V,E)$ where $V$ is a finite set of vertices or nodes, $V=\{v_1,\dots,v_n\}$, and $E$ is a subset of the unordered pairs $\mathcal{P}_2(V)$ of $V$, called edges, $E=\{e_{i_1 j_1},\dots,e_{i_m j_m}\}\subseteq\mathcal{P}_2(V)$.

(image)

Figure 1: A bipartite graph.

Beyond this formalism, it is quite frequent to enumerate the vertices $1,\dots,n$ and consider the associated adjacency matrix $A\in\mathbb{R}^{n\times n}$ defined by $A=(a_{i,j})_{1\le i,j\le n}$ where

$$a_{i,j}=\begin{cases}1 & \text{if } i\sim j,\\ 0 & \text{otherwise,}\end{cases}$$

where $i\sim j$ means that there is an edge between $i$ and $j$. Note that $A$ is a real symmetric matrix.

The graph in Figure 1 can be represented by the adjacency matrix

$$A=\begin{pmatrix}0&0&0&1&1&0\\0&0&0&0&1&1\\0&0&0&1&0&1\\1&0&1&0&0&0\\1&1&0&0&0&0\\0&1&1&0&0&0\end{pmatrix}.$$

A popular way to encode it concretely on a computer is to use the so-called adjacency list, which we can represent mathematically as a couple $G=(V,N)$ where $V$ is a finite set and $N:V\to\mathcal{P}(V)$ is the neighbor function, such that $j\in N(i)$ if, and only if, $\{i,j\}\in E$.

The graph in Figure 1 can be represented by the adjacency list

$$1\mapsto\{4,5\},\quad 2\mapsto\{5,6\},\quad 3\mapsto\{4,6\},\quad 4\mapsto\{1,3\},\quad 5\mapsto\{1,2\},\quad 6\mapsto\{2,3\}.$$

We also say that a vertex $i$ has degree $d_i=k$ when it has exactly $k$ neighbors. Alternatively, $d_i$ is defined as the cardinality of $N(i)$.
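
To fix ideas, here is a minimal NumPy sketch (not from any particular library; the names are mine) that builds these three representations – adjacency matrix, adjacency list and degrees – for the graph of Figure 1.

```python
import numpy as np

# Edges of the bipartite graph of Figure 1 (vertices numbered 1..6).
edges = [(1, 4), (1, 5), (2, 5), (2, 6), (3, 4), (3, 6)]
n = 6

# Adjacency matrix A (symmetric, since the graph is undirected).
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i - 1, j - 1] = 1
    A[j - 1, i - 1] = 1

# Adjacency list: the neighbor function N.
N = {i: sorted(j for j in range(1, n + 1) if A[i - 1, j - 1])
     for i in range(1, n + 1)}

# Degree of each vertex = cardinality of N(i) = row sum of A.
degrees = A.sum(axis=1)

print(A)
print(N)        # {1: [4, 5], 2: [5, 6], 3: [4, 6], 4: [1, 3], 5: [1, 2], 6: [2, 3]}
print(degrees)  # [2 2 2 2 2 2]
```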

Note that when speaking about graphs, one might also think of more general structures such as

  • Digraphs (directed graphs), where $E$ is not a subset of $\mathcal{P}_2(V)$ but a subset of $V\times V$;

  • Graphs with self-loops, i.e., $E\subseteq\mathcal{P}_2(V)\cup V$;

  • Multigraphs, where $E$ may contain repeated edges, or hypergraphs, where an edge may join more than two vertices;

  • or quivers, i.e., multi-digraphs, etc.

Another important generalization is the notion of weighted (di)graph: instead of dealing only with a couple $G=(V,E)$, we add a (symmetric, if undirected) weight function $\omega:V\times V\to\mathbb{R}_{>0}$. It is then common to use the weighted adjacency matrix $A^\omega$ and the weighted degrees $d_i^\omega$, defined as

$$a_{i,j}^\omega=\begin{cases}\omega(i,j) & \text{if } i\sim j,\\ 0 & \text{otherwise,}\end{cases}\qquad\text{and}\qquad d_i^\omega=\sum_{j\in N(i)}\omega(i,j).$$

Note that we recover the standard definition of an unweighted graph by taking $\omega(i,j)=1$ whenever $i\sim j$.

¹ I think at this point this sentence is a meme in machine learning. At least, it is one for me.

2 Graphs in supervised learning

As a reminder, we typically say that we are performing supervised learning when we have access to a training set

$$\mathcal{T}=\{(x_i,y_i)\}_{i=1}^n\subset\mathcal{X}\times\mathcal{Y},\qquad (x_i,y_i)\overset{\text{i.i.d.}}{\sim}\mu,$$

where $\mu$ is a probability measure over $\mathcal{X}\times\mathcal{Y}$. The goal is to learn from it a function $\Phi:\mathcal{X}\to\mathcal{Y}$ that generalizes well, in the sense that for $(x,y)\sim\mu$, $\Phi(x)\approx y$. This is typically – but not only – achieved using Empirical Risk Minimization (ERM), that is, solving the minimization problem

$$\operatorname*{argmin}_{\Phi\in\mathcal{F}}\ \frac{1}{n}\sum_{i=1}^n L(y_i,\Phi(x_i)),$$

where $L:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}$ is an adequate loss function, e.g., the $\ell^2$ loss $L(y,y')=\|y-y'\|_2^2$, and $\mathcal{F}\subseteq\mathcal{Y}^{\mathcal{X}}$ is a candidate set of functions. Modern (deep) learning is often considered in the interpolating regime, i.e., we seek highly parameterized models where $\Phi$ is expected to achieve zero error on the training set: for all $i\in\{1,\dots,n\}$, $\Phi(x_i)=y_i$.
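
As a toy illustration of ERM, here is a minimal sketch with the squared loss over the class of linear predictors, where the minimization is solved in closed form by least squares; the data-generating process is of course arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set T = {(x_i, y_i)} drawn i.i.d. from some measure mu
# (here: a noisy linear model, purely for illustration).
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# ERM over F = {x -> <w, x>} with the squared loss:
# argmin_w (1/n) sum_i (y_i - <w, x_i>)^2, solved here in closed form.
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]

train_risk = np.mean((y - X @ w_hat) ** 2)
print(train_risk)  # small, but not 0: this linear class does not interpolate noisy data
```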

Graph classification

Classification of graphs aims to answer questions such as "Is this protein $G$ useful ($+1$) or not ($-1$)?" Formally, it is (for binary classification) a map

$$\bar\Phi:\ \mathcal{G}\to\{-1,+1\},\qquad G\mapsto\bar\Phi(G),$$

where $\mathcal{G}$ is the set of all graphs with a finite number of vertices. Here, the object of interest is the domain itself, i.e., the graph. To learn this function $\bar\Phi$, we typically have access to a training set $(G_i,y_i)_{i=1}^n\in(\mathcal{G}\times\{-1,1\})^n$.

Node classification

In contrast, node classification aims to answer questions such as "Is this atom $i$ of charge $z(i)$ involved ($+1$) or not ($-1$) in the activity of the protein $G$?" Here, the map of interest has the form

$$\Phi:\ \mathcal{H}(V)\to\{-1,+1\},\qquad z\mapsto\Phi(z),$$

where $\mathcal{H}(V)$ is the set of all functions $z:V\to\mathbb{R}$, or equivalently, node features $z\in\mathbb{R}^{|V|}$. To learn this function $\Phi$, the training set has the form $(z_i,y_i)_{i=1}^n\in(\mathcal{H}(V)\times\{-1,1\})^n$. While both tasks are built around the idea of a graph $G$, they fundamentally differ: graph classification aims to learn a function of the domain itself, whereas node classification aims to learn a function of signals defined over this domain. Note that $\mathcal{H}(V)$ is isomorphic to $\mathbb{R}^{|V|}$, and can thus be endowed with a Hilbert structure using the canonical inner product.

Similarly to the Euclidean case, classification admits a continuous counterpart called regression, and one can define node regression as the task

$$\Phi:\ \mathcal{H}(V)\to\mathbb{R},\qquad z\mapsto\Phi(z).$$

A typical question would be "What is the energy $z(i)$ of the atom $i$ in the activity of the protein $G$?"

3 Invariance and Equivariance

In the context of graph learning, we are interested in the behavior of the function $\Phi$ with respect to the structure of the graph, and in particular with respect to permutations of the vertices.

Reminders on groups

A group is a couple $(G,\cdot)$ where $G$ is a set and $\cdot$ is a binary operation on $G$. This (internal) binary operation satisfies three properties:

  • Associativity. For any $g_1,g_2,g_3\in G$, one has

$$(g_1\cdot g_2)\cdot g_3=g_1\cdot(g_2\cdot g_3).$$

  • Existence of a unit element. There exists a unique element $e\in G$ such that for any $g\in G$, one has

$$g\cdot e=e\cdot g=g.$$

  • Existence of an inverse. For any $g\in G$, there exists a unique $g^{-1}\in G$ such that

$$g\cdot g^{-1}=g^{-1}\cdot g=e.$$

Groups have a central place in the study of symmetry in algebra, and the notion of invariance emerges directly from them. Basic examples of groups include

  • The real line endowed with the addition, $(\mathbb{R},+)$;

  • The set of rotations endowed with the composition, $(SO(n),\circ)$;

  • The set of permutations endowed with the composition, $(S_n,\circ)$.

Group action

Without diving into the subtleties of representation theory, let us recall that one of the most important tools of group theory is the notion of group action. A (left) group action of $(G,\cdot)$ on a set $X$ is a map from $G\times X$ to $X$, written $(g,x)\mapsto g\cdot x$, such that it is

  • associative: for all $g_1,g_2\in G$ and $x\in X$, one has $g_1\cdot(g_2\cdot x)=(g_1\cdot g_2)\cdot x$;

  • compatible with the identity: for all $x\in X$, one has $e\cdot x=x$.

A natural group action encountered in geometry and imaging is the action of the 2D (resp. 3D) rotations $SO(2)$ (resp. $SO(3)$) on the ambient space $\mathbb{R}^2$ (resp. $\mathbb{R}^3$). Intuitively, this is simply applying a rotation to a point of the ambient space, and by extension to a subset $S$ of it.

Another well-known example is the action of $S_n$ on $\mathbb{R}^n$ defined by

$$\sigma\cdot x=\begin{pmatrix}x_{\sigma(1)}\\ \vdots\\ x_{\sigma(n)}\end{pmatrix}.$$

Remark that a group can act on different sets; for instance, $S_n$ can also act on the finite set $\{1,\dots,n\}$ by $\sigma\cdot m=\sigma(m)$. A group can also act through two different actions on the same set (replace $\sigma$ by $\sigma^{-1}$ above).

These two examples of actions have a common property: they are linear group actions. Linear group actions are probably the most fundamental examples of actions, and a whole field is dedicated to their study: group representation theory. A group action is said to be linear if $X$ is a vector space and if, for all $g\in G$, $x,x'\in X$ and $\alpha,\beta\in\mathbb{R}$,

$$g\cdot(\alpha x+\beta x')=\alpha(g\cdot x)+\beta(g\cdot x').$$

Invariance and Equivariance

Consider a function $f:X\to Y$. We say that $f$ is $G$-invariant if for any $g\in G$ and $x\in X$, one has

$$f(g\cdot x)=f(x).$$

For instance, if $G=SO(2)$ is the group of 2D rotations and $X=\mathbb{R}^2$ the Euclidean plane, then the Euclidean norm $f(x)=\|x\|$ is $SO(2)$-invariant: given any rotation $r_\theta\in SO(2)$, one has $\|r_\theta(x)\|=\|x\|$.

Consider now a group $G$ acting on $X$ and also² acting on $Y$. We say that $f$ is $G$-equivariant if for any $g\in G$ and $x\in X$, one has

$$f(g\cdot x)=g\cdot f(x).$$

Important examples include index-based functions on $\mathbb{R}^n$. For instance, take $G=S_n$ acting on $X=\mathbb{R}^n$ as above and on $Y=\{1,\dots,n\}$ by $\sigma\cdot m=\sigma^{-1}(m)$; then $f(x)=\operatorname{argmin}_{1\le i\le n}x_i$ is $S_n$-equivariant: $f(\sigma\cdot x)=\sigma^{-1}\big(\operatorname{argmin}_{1\le i\le n}x_i\big)$ (recall the remark above on the two possible actions on $\{1,\dots,n\}$).
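
These notions are easy to check numerically. Below is a small sketch verifying the $SO(2)$-invariance of the norm and, instead of the argmin, the $S_n$-equivariance of an entrywise map (here ReLU), a fact we will reuse later when discussing the equivariance of GCNs.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=2)

# SO(2)-invariance of the Euclidean norm: ||R_theta x|| = ||x||.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.isclose(np.linalg.norm(R @ x), np.linalg.norm(x)))  # True

# S_n-equivariance of an entrywise map: f(sigma . z) = sigma . f(z),
# here with f = ReLU applied coordinate by coordinate.
z = rng.normal(size=5)
sigma = rng.permutation(5)              # sigma . z = (z_{sigma(1)}, ..., z_{sigma(n)})
relu = lambda t: np.maximum(t, 0.0)
print(np.allclose(relu(z[sigma]), relu(z)[sigma]))  # True
```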

² In full generality, we may consider different groups, but we will not need this generalization for our purpose of studying graph neural networks.

4 A Fourier look at graph signal processing

At this point, we are still quite far from concrete applications in machine learning. A first question is: which invariances or equivariances do we want to enforce? In what follows, we will consider the important example of shift-equivariance as a building block of Convolutional Neural Networks (CNNs).

Convolution

Convolution is a fundamental operation in signal processing. Given two real functions $g,h$, say Lebesgue-integrable, one defines their convolution $g*h$ as

$$(g*h)(t)=\int_{\mathbb{R}}g(t-\tau)\,h(\tau)\,\mathrm{d}\tau.$$

One may regard $g$ as a signal and $h$ as a filter, even though the two functions play symmetric roles. An important interpretation of the convolution is as a local averaging method: it is a generalization of a moving average.

For our purpose, the important property of convolution is that it is shift-equivariant. This means that if we shift the input signal, the output signal is shifted in the same way: if $v\in\mathbb{R}$ and $g,h\in L^1(\mathbb{R})$, then

$$(T_v g)*h=T_v(g*h),$$

where $T_v$ is the translation operator defined by $T_v g(t)=g(t-v)$.
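
Here is a quick numerical sanity check of this property, in the discrete and periodic setting (circular convolution on $\mathbb{Z}/n\mathbb{Z}$, which avoids boundary issues); the signal and filter below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
g = rng.normal(size=n)                                  # "signal"
h = np.exp(-0.5 * (np.arange(n) - n / 2) ** 2 / 9.0)    # a Gaussian-like "filter"

def circ_conv(g, h):
    """Circular (periodic) convolution: (g * h)[t] = sum_tau g[t - tau] h[tau]."""
    n = len(g)
    return np.array([sum(g[(t - tau) % n] * h[tau] for tau in range(n))
                     for t in range(n)])

# Shift-equivariance: convolving a shifted signal = shifting the convolution.
v = 7
lhs = circ_conv(np.roll(g, v), h)     # (T_v g) * h
rhs = np.roll(circ_conv(g, h), v)     # T_v (g * h)
print(np.allclose(lhs, rhs))          # True
```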

In traditional signal processing, such filters $h$ are given: Gaussian blur, sharpening, etc. In contrast, in machine learning, and especially for CNNs, such filters are learned during the training procedure (see the course Introduction to Deep Learning).

The questions we face here are

  • What is a convolution on a graph?

  • Even more basic, what is shift-equivariance on a graph?

There is no universal answer to these two questions.

Convolution and Fourier

For an integrable function $g:\mathbb{R}\to\mathbb{R}$, its Fourier transform $\mathcal{F}[g]:\mathbb{R}\to\mathbb{C}$ is defined as

$$\mathcal{F}[g]:\omega\mapsto\int_{\mathbb{R}}g(t)\,e^{-2i\pi\omega t}\,\mathrm{d}t.$$

On the space $L^2(\mathbb{R})$, $\mathcal{F}[g]$ may be viewed as the inner product between $g$ and the complex exponentials $\exp_{2i\pi\omega}:t\mapsto e^{2i\pi\omega t}$:

$$\mathcal{F}[g](\omega)=\langle g,\exp_{2i\pi\omega}\rangle_{L^2(\mathbb{R})}.$$

The key property used in signal processing is that the Fourier transform diagonalizes the convolution: for any $g,h\in L^2(\mathbb{R})$,

$$\mathcal{F}[g*h]=\mathcal{F}[g]\,\mathcal{F}[h].$$

I show in Figure 2 an example of the convolution of two functions and its Fourier transform.

(image)

Figure 2: Convolution of an image by a low-pass filter.
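
In the discrete periodic setting, this identity can be checked directly with the DFT: circular convolution becomes pointwise multiplication of the discrete Fourier transforms. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 64
g, h = rng.normal(size=n), rng.normal(size=n)

# Circular convolution computed directly from its definition.
conv = np.array([sum(g[(t - tau) % n] * h[tau] for tau in range(n))
                 for t in range(n)])

# The DFT turns circular convolution into pointwise multiplication:
# F[g * h] = F[g] . F[h]  (discrete analogue of the continuous identity).
print(np.allclose(np.fft.fft(conv), np.fft.fft(g) * np.fft.fft(h)))  # True
```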

The goal is hence to define a "Fourier transform" on graphs. To this end, we take yet another interpretation: the complex exponentials $\exp_{2i\pi\omega}:t\mapsto e^{2i\pi\omega t}$ are the eigenfunctions of the Laplacian operator on the torus $S^1=\mathbb{R}/\mathbb{Z}$. Indeed, if $f=\exp_{2i\pi\omega}$, then $f'(t)=2i\pi\omega f(t)$ and

$$f''(t)=(2i\pi\omega)^2 f(t)=-4\pi^2\omega^2 f(t).$$

5 Graph Laplacian

  • Definition 1 (Graph Laplacians). The (combinatorial) graph Laplacian of a graph $G$ is defined as

    $$L=D-A,$$

    where $A$ is the adjacency matrix of $G$, and $D=\operatorname{diag}(d_i)_{1\le i\le n}$ is the degree matrix, $d_i$ being the degree of the vertex $i$.

    The (normalized) graph Laplacian of a graph $G$ is defined as

    $$\mathcal{L}=D^{-1/2}LD^{-1/2}=\mathrm{Id}-D^{-1/2}AD^{-1/2}.$$

The combinatorial graph Laplacian is a positive semidefinite symmetric matrix, associated to the quadratic form

$$\langle Lz,z\rangle=\sum_{i\sim j}(z_i-z_j)^2.$$
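
Here is a small sketch computing both Laplacians for the graph of Figure 1 and checking the quadratic form and positive semidefiniteness numerically.

```python
import numpy as np

# Adjacency matrix of the bipartite graph of Figure 1.
A = np.array([[0, 0, 0, 1, 1, 0],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 1, 1, 0, 0, 0]], dtype=float)

d = A.sum(axis=1)                                       # degrees
L = np.diag(d) - A                                      # combinatorial Laplacian L = D - A
L_norm = np.diag(d ** -0.5) @ L @ np.diag(d ** -0.5)    # normalized Laplacian

# Quadratic form <Lz, z> = sum_{i ~ j} (z_i - z_j)^2  >= 0.
z = np.random.default_rng(0).normal(size=6)
edges = [(i, j) for i in range(6) for j in range(i + 1, 6) if A[i, j]]
print(np.isclose(z @ L @ z, sum((z[i] - z[j]) ** 2 for i, j in edges)))  # True
print(np.all(np.linalg.eigvalsh(L) >= -1e-12))                           # True (PSD)
```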

Spectral graph theory is more often concerned with the normalized version (??, a), which has several interesting properties, among many others:

  • Proposition 1. Let $G$ be a graph and $\mathcal{L}$ its associated normalized Laplacian. Then:

    • $\mathcal{L}$ acts as a differential operator on the space of functions $g:V\to\mathbb{R}$:

      $$(\mathcal{L}g)(i)=\frac{1}{\sqrt{d_i}}\sum_{j\sim i}\left(\frac{g(i)}{\sqrt{d_i}}-\frac{g(j)}{\sqrt{d_j}}\right);$$

    • There exists a matrix $M\in\mathbb{R}^{n\times m}$, called the normalized incidence matrix, such that $\mathcal{L}=MM^{\top}$, with

      $$M_{i,e}=\begin{cases}\pm 1/\sqrt{d_i} & \text{if } i\in e,\\ 0 & \text{otherwise;}\end{cases}$$

    • $\mathcal{L}$ is a real symmetric operator.

Another important fact is that, seeing $\mathcal{L}$ as a function of $G$, it is $S_n$-equivariant: if one reorders the nodes, it suffices to reorder the rows and columns accordingly to obtain the Laplacian of the relabelled graph.

Graph Fourier Transform

The last point of Proposition 1, namely the fact that $\mathcal{L}$ is a real symmetric operator, is fundamental. The spectral theorem tells us that there exist an orthonormal matrix $U=[u_0\mid\cdots\mid u_{n-1}]\in\mathbb{R}^{n\times n}$ and $n$ real values $\lambda_0\le\dots\le\lambda_{n-1}\in\mathbb{R}$ such that

$$\mathcal{L}=U\Lambda U^{\top}\qquad\text{where}\qquad\Lambda=\operatorname{diag}(\lambda_0,\dots,\lambda_{n-1})\in\mathbb{R}^{n\times n}.$$

Observe that $U$ is not necessarily uniquely defined, and that $\lambda_0=0$, with associated eigenvector proportional to $D^{1/2}\mathbb{1}_n$ (it is $\mathbb{1}_n$ for the combinatorial Laplacian $L$).

Recall that the Fourier transform on $L^2(\mathbb{R})$ is defined as

$$\mathcal{F}[g](\omega)=\langle g,\exp_{2i\pi\omega}\rangle_{L^2(\mathbb{R})},$$

where the $\exp_{2i\pi\omega}$ are the eigenfunctions of the Laplacian. Analogously, we define the Fourier transform on a graph (GFT) as follows:

  • Definition 2. Let $G$ be a graph, $\mathcal{L}$ its normalized Laplacian and $\mathcal{L}=U\Lambda U^{\top}$ its eigendecomposition. Let $z:V\to\mathbb{R}$, i.e., $z\in\mathbb{R}^n$, be a signal on $G$. Then the graph Fourier transform of $z$ is defined as

    $$\mathcal{F}[z](\lambda_k)=\langle z,u_k\rangle_{\mathbb{R}^n}=\sum_{i=1}^n z(i)\,u_k(i)\qquad\text{for } k=0,\dots,n-1.$$

We can rewrite it in matrix form as $\mathcal{F}[z]=U^{\top}z$. An important feature of the GFT is that it admits an inverse, defined similarly to the Euclidean case:

$$z(i)=\sum_{k=0}^{n-1}\mathcal{F}[z](\lambda_k)\,u_k(i)\qquad\text{for } i=1,\dots,n.$$
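
As an illustration, here is a minimal sketch of the GFT on a cycle graph, whose Laplacian eigenvectors are essentially a discrete Fourier basis; it checks that $\lambda_0=0$ and that the inverse GFT recovers the signal. The choice of graph is arbitrary.

```python
import numpy as np

# A cycle graph on n vertices: i ~ i+1 (mod n).
n = 8
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
d = A.sum(axis=1)
L = np.eye(n) - np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)   # normalized Laplacian

# Spectral theorem: L = U diag(lambda) U^T with U orthonormal and lambda_0 = 0.
lam, U = np.linalg.eigh(L)
print(np.isclose(lam[0], 0.0))                    # True

# Graph Fourier transform of a signal z and its inverse.
rng = np.random.default_rng(0)
z = rng.normal(size=n)
z_hat = U.T @ z                                   # F[z](lambda_k) = <z, u_k>
print(np.allclose(U @ z_hat, z))                  # True (inverse GFT recovers z)
```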

Filtering on graphs

Given the definition of the GFT, how do we filter a signal? The idea is to act on the eigenvalues of the eigendecomposition while leaving the eigenvectors untouched. Given a filter $h:\mathbb{R}\to\mathbb{R}$, the filtering of $z\in\mathbb{R}^n$ by $h$ is given by

$$h\star z=\mathcal{F}^{-1}\big[h(\Lambda)\,\mathcal{F}[z]\big]=U\,h(\Lambda)\,U^{\top}z,$$

where $h(\Lambda)=\operatorname{diag}\big(h(\lambda_0),\dots,h(\lambda_{n-1})\big)$, i.e., $h$ is applied pointwise to the eigenvalues. An important point: despite the notation, the filtering $h\star z$ is not a regular convolution. Observe also that $z\mapsto h\star z$ is $S_n$-equivariant.

In practice, diagonalizing the Laplacian is too costly. What people use instead are polynomial, or analytic, filters of the Laplacian. Given a polynomial, or formal series, $h=\sum_{k\ge 0}\beta_k X^k\in\mathbb{R}[[X]]$, we let (the sum being possibly ill-defined in the case of a formal series)

$$h\star z=\sum_{k\ge 0}\beta_k\,\mathcal{L}^k z.$$

Note that, from a theoretical perspective, this is still only an operation on the eigenvalues. Figure 3 illustrates the filtering of a graph signal by a low-pass filter.

(image)

Figure 3: Convolution of a graph by a low-pass filter.
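
The following sketch filters a signal on the Figure 1 graph with a degree-2 polynomial filter, computed without any eigendecomposition, and checks that this coincides with acting on the eigenvalues only; the coefficients $\beta_k$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Normalized Laplacian of the Figure 1 graph.
A = np.array([[0, 0, 0, 1, 1, 0],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 1, 1, 0, 0, 0]], dtype=float)
d = A.sum(axis=1)
L = np.eye(6) - np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)

z = rng.normal(size=6)
beta = [1.0, -0.5, 0.25]                 # coefficients of h(X) = 1 - 0.5 X + 0.25 X^2

# Polynomial filtering without any eigendecomposition: h * z = sum_k beta_k L^k z.
filtered = sum(b * np.linalg.matrix_power(L, k) @ z for k, b in enumerate(beta))

# Sanity check: this is the same as acting on the eigenvalues only, U h(Lambda) U^T z.
lam, U = np.linalg.eigh(L)
h_lam = sum(b * lam ** k for k, b in enumerate(beta))
print(np.allclose(filtered, U @ np.diag(h_lam) @ U.T @ z))   # True
```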

6 Graph Neural Network

(Spectral) Graph Convolutional Network

Up to some subtleties or notational differences, you have probably encountered CNNs as parametric functions defined layer-wise by a combination of

  • non-linearities such as ReLU;

  • convolution layers mapping a tensor to another;

  • pooling (or subsampling) reducing the dimension using for instance a max- or mean-pooling;

  • fully connected layers (Multi-Layer Perceptron, MLP) to produce an output.

In the context of GNNs, the non-linearities will be the same (typically a Lipschitz function different from the identity), the convolutions will be the graph convolutions defined in the previous section, and the MLPs will be the same. But what about pooling? How to produce good and efficient pooling operations, also known as graph coarsening, is (mostly) an open question for GNNs. These architectures were defined around 2015 by (??, a) and (??, a).

A GCN with $L+1$ layers is defined as follows:

  • 1. The signal at the input layer is given by node features $Z^{(0)}=Z\in\mathbb{R}^{n\times d_0}$, with dimension $d_0=d_z$ and columns $z_j^{(0)}\in\mathbb{R}^n$.

  • 2. For all $\ell=0,\dots,L-1$, the signal $Z^{(\ell+1)}\in\mathbb{R}^{n\times d_{\ell+1}}$ at layer $\ell+1$ is obtained by propagating the signal $Z^{(\ell)}\in\mathbb{R}^{n\times d_\ell}$ at layer $\ell$ through analytic filters of the normalized Laplacian:

    $$z_j^{(\ell+1)}=\rho\left(\sum_{i=1}^{d_\ell}h_{ij}^{(\ell)}(\mathcal{L})\,z_i^{(\ell)}+b_j^{(\ell)}\mathbb{1}_n\right)\in\mathbb{R}^n,\qquad j=1,\dots,d_{\ell+1}.\tag{1}$$

  • 3. The output depends on the notion of invariance that one wishes to preserve.

    • $S_n$-equivariance (node classification). In this case, the output of the GCN is a node signal $\Phi_G(Z)\in\mathbb{R}^{n\times d_{\mathrm{out}}}$ defined as

      $$\Phi_G(Z)=\mathrm{MLP}_\theta\big(Z^{(L)}\big),$$

      where $\mathrm{MLP}_\theta$ is an MLP applied row-wise to the signal $Z^{(L)}$.

    • $S_n$-invariance (graph classification). In this case, the output is a single vector $\bar\Phi_G(Z)\in\mathbb{R}^{d_{\mathrm{out}}}$ given by a global pooling:

      $$\bar\Phi_G(Z)=\frac{1}{n}\sum_{i=1}^n\Phi_G(Z)_i.$$
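
Here is a minimal sketch of one layer of the form (1), with polynomial filters $h_{ij}^{(\ell)}(\mathcal{L})=\sum_k\beta_{k,i,j}\mathcal{L}^k$, $\rho=\mathrm{ReLU}$, and a global average pooling for the invariant readout. The filter degree, widths and random initialization are arbitrary choices for illustration, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gcn_layer(L, Z, betas, b):
    """One spectral GCN layer: z_j <- rho( sum_i h_{ij}(L) z_i + b_j 1 ),
    with polynomial filters h_{ij}(L) = sum_k betas[k, i, j] L^k and rho = ReLU."""
    n, d_in = Z.shape
    K, _, d_out = betas.shape
    powers = [np.linalg.matrix_power(L, k) for k in range(K)]   # L^0, ..., L^{K-1}
    out = np.zeros((n, d_out))
    for j in range(d_out):
        acc = b[j] * np.ones(n)
        for i in range(d_in):
            h_ij_L = sum(betas[k, i, j] * powers[k] for k in range(K))
            acc = acc + h_ij_L @ Z[:, i]
        out[:, j] = np.maximum(acc, 0.0)                        # rho = ReLU
    return out

# Normalized Laplacian of the Figure 1 graph and random node features.
A = np.array([[0, 0, 0, 1, 1, 0], [0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 0, 1],
              [1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0], [0, 1, 1, 0, 0, 0]], float)
d = A.sum(axis=1)
L = np.eye(6) - np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)

Z = rng.normal(size=(6, 3))                  # d_0 = 3 input features per node
betas = rng.normal(size=(3, 3, 4)) * 0.1     # degree-2 filters, d_1 = 4 output features
b = rng.normal(size=4) * 0.1

Z1 = gcn_layer(L, Z, betas, b)
print(Z1.shape)                              # (6, 4): equivariant node-level output
print(Z1.mean(axis=0).shape)                 # (4,): invariant graph-level readout
```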

Invariance and Equivariance of GCNs

In the definition of GCNs, it was loosely said that we can make the output either $S_n$-equivariant or $S_n$-invariant. But what do we mean exactly by that? We say that a permutation $\sigma$ acts on the space of GCNs through the action $\sigma\cdot\Phi_G$ defined as

$$\forall Z\in\mathbb{R}^{n\times d_0},\qquad(\sigma\cdot\Phi_G)(Z)=\Phi_{\sigma\cdot G}(\sigma\cdot Z),$$

where the actions on the graph and on the signal are defined as

$$\sigma\cdot G=(\sigma\cdot V,\sigma\cdot E),\qquad\sigma\cdot Z=\begin{pmatrix}(z_1)_{\sigma(1)} & \cdots & (z_{d_0})_{\sigma(1)}\\ \vdots & & \vdots\\ (z_1)_{\sigma(n)} & \cdots & (z_{d_0})_{\sigma(n)}\end{pmatrix}.$$

The action on the graph is in fact more easily defined on the adjacency matrix. For any $\sigma\in S_n$, let $P_\sigma$ be the associated permutation matrix. Then one can define

$$\sigma\cdot A=P_\sigma A P_\sigma^{-1}=P_\sigma A P_\sigma^{\top}.$$

Glossing over the algebraic details, these two definitions are perfectly equivalent.

We can now state the result

  • Proposition 2. For any graph $G$ and any filters $h$, $\Phi_G$ is $S_n$-equivariant and $\bar\Phi_G$ is $S_n$-invariant: for all $\sigma\in S_n$, one has

    $$(\sigma\cdot\Phi_G)(Z)=\sigma\cdot\Phi_G(Z),\qquad(\sigma\cdot\bar\Phi_G)(Z)=\bar\Phi_G(Z).$$

  • Proof. See for instance (??, , Proposition 1). The proof of $S_n$-equivariance is based on three facts:

    • 1. The map $\mathcal{L}:A\mapsto\mathrm{Id}-D^{-1/2}AD^{-1/2}$ is $S_n$-equivariant;

    • 2. Hence the map $h\circ\mathcal{L}:A\mapsto h(\mathcal{L}(A))$ is $S_n$-equivariant;

    • 3. Permutation matrices commute with entrywise activation functions.

    To get the $S_n$-invariance, one has to see that aggregating with a sum is a universal way to construct an $S_n$-invariant function from an $S_n$-equivariant one.
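
These facts are easy to check numerically on a toy equivariant layer $Z\mapsto\rho\big(h(\mathcal{L})ZW\big)$ with $h(\mathcal{L})=\mathrm{Id}+\mathcal{L}$; the layer and its parameters are arbitrary, the point is only the permutation test.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Random symmetric adjacency matrix (no self-loops) and node features.
A = (rng.random((n, n)) < 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T
Z = rng.normal(size=(n, 2))

def normalized_laplacian(A):
    d = np.maximum(A.sum(axis=1), 1e-12)       # guard against isolated vertices
    return np.eye(len(A)) - np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)

def layer(A, Z, W):
    """A toy equivariant layer: Z -> ReLU(h(L) Z W) with h(L) = I + L."""
    L = normalized_laplacian(A)
    return np.maximum((np.eye(len(A)) + L) @ Z @ W, 0.0)

W = rng.normal(size=(2, 3))
sigma = rng.permutation(n)
P = np.eye(n)[sigma]                           # permutation matrix: (P Z)_i = Z_{sigma(i)}

# Equivariance: relabelling the nodes permutes the rows of the output.
out = layer(A, Z, W)
out_perm = layer(P @ A @ P.T, P @ Z, W)
print(np.allclose(out_perm, P @ out))                          # True

# Invariance of the pooled (graph-level) output.
print(np.allclose(out_perm.mean(axis=0), out.mean(axis=0)))    # True
```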

7 Another paradigm: Message-Passing Graph Neural Networks

Before ending this note, let me mention that around 2017, (??, a) and (??, a) proposed a new architecture, or at least a new interpretation, that "forgets" the spectral interpretation of GCNs. It is closer in spirit to the original graph neural network model proposed by (??, a), and is now the standard way to think of GNNs. I still believe that the spectral interpretation highlights important properties of GNNs and is a good way to understand their behavior.

Instead of a spectral construction, a neighborhood-aggregation, or so-called message-passing, architecture is proposed:

$$z_j^{(\ell+1)}=\gamma^{(\ell)}\left(z_j^{(\ell)},\ \bigoplus_{i\in N(j)}\rho^{(\ell)}\big(z_j^{(\ell)},z_i^{(\ell)},e_{j,i}\big)\right),$$

where

  • $z_j^{(\ell)}$ is the feature of node $j$ at layer $\ell$;

  • $\gamma^{(\ell)}$ (the update function) and $\rho^{(\ell)}$ (the message-passing function) are learnable functions, such as traditional MLPs;

  • $\bigoplus$ is a permutation-invariant aggregation function, such as the sum $\sum$ or the maximum $\max$;

  • $e_{j,i}$ is an (optional) edge feature.

In many papers, this is presented in the equivalent form (often with multisets for the aggregation)

$$m_j^{(\ell)}=\mathrm{MSG}^{(\ell)}\Big(\big\{z_i^{(\ell)}:i\in N(j)\cup\{j\}\big\}\Big),\qquad a_j^{(\ell)}=\mathrm{AGG}^{(\ell)}\Big(\big\{m_i^{(\ell)}:i\in N(j)\big\}\Big),\qquad z_j^{(\ell+1)}=\mathrm{UPD}^{(\ell)}\Big(z_j^{(\ell)},a_j^{(\ell)}\Big).$$
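
For concreteness, here is a minimal sketch of one message-passing layer where $\gamma^{(\ell)}$ and $\rho^{(\ell)}$ are linear maps followed by a ReLU and $\bigoplus$ is the sum; real implementations (e.g. in PyTorch Geometric) are batched, sparse and learnable, so this is only meant to make the notation concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def mp_layer(neighbors, Z, W_self, W_msg):
    """One message-passing layer with sum aggregation:
    z_j <- ReLU( W_self^T z_j + sum_{i in N(j)} W_msg^T z_i )."""
    n, _ = Z.shape
    out = np.zeros((n, W_self.shape[1]))
    for j in range(n):
        messages = sum(W_msg.T @ Z[i] for i in neighbors[j])    # aggregate over N(j)
        out[j] = np.maximum(W_self.T @ Z[j] + messages, 0.0)    # update + nonlinearity
    return out

# Neighbor function of the Figure 1 graph (0-indexed) and random node features.
neighbors = {0: [3, 4], 1: [4, 5], 2: [3, 5], 3: [0, 2], 4: [0, 1], 5: [1, 2]}
Z = rng.normal(size=(6, 3))
W_self = rng.normal(size=(3, 4)) * 0.1
W_msg = rng.normal(size=(3, 4)) * 0.1
print(mp_layer(neighbors, Z, W_self, W_msg).shape)   # (6, 4)
```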

Let me mention several important examples of MPGNNs:

  • Convolutional GNN. Wait, didn't we call them spectral GNNs? Yes, but we can look at them in a slightly more general way:

    $$z_j^{(\ell+1)}=\gamma^{(\ell)}\left(z_j^{(\ell)},\ \bigoplus_{i\in N(j)}c_{i,j}\,\rho^{(\ell)}\big(z_i^{(\ell)}\big)\right).$$

    In practice, the community considers a more constrained version:

    $$z_j^{(\ell+1)}=\mathrm{ReLU}\left(\frac{1}{|N(j)|}\sum_{i\in N(j)}W^{(\ell)}z_i^{(\ell)}\right),$$

    where $W^{(\ell)}$ is a learnable matrix. This can be written more compactly in matrix form as

    $$Z^{(\ell+1)}=\mathrm{ReLU}\left(D^{-1}AZ^{(\ell)}W^{(\ell)}\right),$$

    or, in normalized form, as

    $$Z^{(\ell+1)}=\mathrm{ReLU}\left(D^{-1/2}AD^{-1/2}Z^{(\ell)}W^{(\ell)}\right).$$

  • Attentional GNN.

    $$z_j^{(\ell+1)}=\gamma^{(\ell)}\left(z_j^{(\ell)},\ \bigoplus_{i\in N(j)}a\big(z_j^{(\ell)},z_i^{(\ell)}\big)\right),$$

    where $a$ is a learnable self-attention function.

Without diving into the details, I want to draw your attention to the fact that, when dealing with a graph with only self-loops, or when $E=\mathcal{P}_2(V)$, we obtain two popular architectures from the literature.

  • When $N(i)=\{i\}$ – so not a classical graph, you have to add a self-loop – the convolutional architecture boils down to the Deep Sets architecture (??, a).

  • At the other extreme, when $i\sim j$ for all $i,j$ (again with self-loops), the attentional GNN is linked to the Transformer architecture (??, a).