What's a Graph Neural Network?
A short introduction to Graph Neural Networks through a spectral lens.
by Samuel Vaiter
There is a lot of expository material on Graph Neural Networks (GNNs) out there. I want here to give a short “mathematical” introduction, in the sense of arriving quickly at the concepts of equivariance and invariance to permutations and at how GNNs are designed to deal with them, in order to discuss in a future post our work (??, ????) on the convergence of GNNs to some “continuous” limit when the number of nodes goes to infinity.
1 Graphs
Graphs are ubiquitous structures in mathematics and computer science. Graphs may represent many kinds of interactions, such as brain connectivity, computer graphics meshes, protein interactions or social networks. In its most elementary definition, a graph is a couple $G = (V, E)$, where $V$ is a finite set of vertices (or nodes) and $E \subseteq \{\{u, v\} : u, v \in V,\, u \neq v\}$ is a set of edges between them.
Beyond this formalism, it is quite frequent to enumerate the vertices as $V = \{1, \dots, n\}$, where $n = |V|$ is the number of nodes.
With this enumeration, a graph can be represented by its adjacency matrix $A \in \{0, 1\}^{n \times n}$, defined by $A_{ij} = 1$ if $\{i, j\} \in E$ and $A_{ij} = 0$ otherwise; for an undirected graph, $A$ is symmetric. The graph in Figure 1.1 can be represented by such an adjacency matrix.
A popular way to encode a graph concretely on a computer is the so-called adjacency list, which we can represent mathematically as the map $i \mapsto \mathcal{N}(i)$ assigning to each vertex the list of its neighbors. The graph in Figure 1.1 can equally be represented by its adjacency list.
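As a concrete illustration, here is a minimal sketch in Python/NumPy of both encodings for a small, hypothetical 4-node graph (made up for the example, not necessarily the one of Figure 1.1).

```python
import numpy as np

# A hypothetical 4-node graph with edges {1,2}, {1,3}, {2,3}, {3,4}
# (vertices enumerated 1..4, stored 0-indexed).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

# Adjacency matrix: A[i, j] = 1 iff {i, j} is an edge.
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1  # undirected graph: A is symmetric

# Adjacency list: each vertex maps to the list of its neighbors.
adj_list = {i: [j for j in range(n) if A[i, j]] for i in range(n)}

print(A)
print(adj_list)  # {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
```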
We also say that a vertex $j$ is a neighbor of a vertex $i$ whenever $\{i, j\} \in E$, written $j \sim i$; the set of neighbors of $i$ is denoted $\mathcal{N}(i)$, and its cardinality $d_i = |\mathcal{N}(i)|$ is called the degree of $i$.
Note that when speaking about graphs, one might also think of more general structures such as

- digraphs (directed graphs), where $E$ is not a subset of unordered pairs $\{u, v\}$ but a subset of ordered pairs $V \times V$;
- graphs with self-loops, i.e., allowing edges from a vertex $i$ to itself ($A_{ii} = 1$);
- multi-/hyper-graphs, where $E$ might have repeated edges;
- or quivers, i.e., multi-digraphs, etc.
Another important generalization is the consideration of weighted (di)graphs where, instead of dealing with a couple $(V, E)$, we deal with a triple $(V, E, w)$, where $w : E \to \mathbb{R}_{>0}$ assigns a weight to each edge.
2 Graphs in supervised learning

As a reminder, we typically say that we are performing supervised learning when we have access to a training set $\{(x_i, y_i)\}_{i=1}^N$ of inputs $x_i$ and labels $y_i$, and seek a function $f$ such that $f(x_i) \approx y_i$ while generalizing well to unseen inputs.
2.1 Graph classification

Classification of graphs aims to answer questions such as “Does this protein, seen as a graph, have a given property?”: the function to learn maps a whole graph $G$ to a single label $y$.
2.2 Node classification

In contrast, node classification aims to answer questions such as “Does this atom of the molecule belong to a given functional group?”: the function to learn maps a graph $G$ to one label $y_v$ per node $v \in V$.
Similarly to the Euclidean case, classification admits a continuous counterpart called regression, and one can define node regression as the task of predicting a real value $y_v \in \mathbb{R}$ for each node $v$.
3 Invariance and Equivariance

In the context of graph learning, we are interested in the behavior of the learned function when the nodes of the graph are relabeled: for graph classification the prediction should not change, and for node classification the predictions should be relabeled accordingly. The natural language to formalize this is that of groups and group actions.
3.1 Reminders on groups

A group is a couple $(G, \cdot)$ where $G$ is a set and $\cdot : G \times G \to G$ a binary operation satisfying:

- Associativity. For any $g, h, k \in G$, one has $(g \cdot h) \cdot k = g \cdot (h \cdot k)$.
- Existence of a unit element. There exists a unique element $e \in G$ such that for any $g \in G$, one has $e \cdot g = g \cdot e = g$.
- Existence of an inverse. For any $g \in G$, there exists a unique $g^{-1} \in G$ such that $g \cdot g^{-1} = g^{-1} \cdot g = e$.
Groups have a central place in the study of symmetry in algebra, and the notion of invariance emerges directly from them. Basic examples of groups include:

- the real line $\mathbb{R}$ endowed with the addition $+$;
- the set of rotations $SO(n)$ endowed with the composition $\circ$;
- the set of permutations $\mathfrak{S}_n$ of $\{1, \dots, n\}$ endowed with the composition $\circ$.
3.2 Group action

Without diving into the subtleties of representation theory, we can recall that one of the most important tools of group theory is the notion of group action. A (left) group action of a group $(G, \cdot)$ on a set $X$ is a map $G \times X \to X$, written $(g, x) \mapsto g \cdot x$, that is

- associative: for all $g, h \in G$ and $x \in X$, one has $g \cdot (h \cdot x) = (g \cdot h) \cdot x$;
- and satisfies the identity law: for all $x \in X$, one has $e \cdot x = x$.
A natural group action encountered in geometry and imaging is the (natural) action of the 2D (resp. 3D) rotations $SO(2)$ (resp. $SO(3)$) on $\mathbb{R}^2$ (resp. $\mathbb{R}^3$), given by $R \cdot x = Rx$. Another well-known example is the action of the permutations $\mathfrak{S}_n$ on vectors $x \in \mathbb{R}^n$ by relabeling of the coordinates, $(\sigma \cdot x)_i = x_{\sigma^{-1}(i)}$, i.e., $\sigma \cdot x = P_\sigma x$ where $P_\sigma$ is the permutation matrix of $\sigma$. Remark that a group can act on different sets: for instance, $\mathfrak{S}_n$ also acts on matrices $A \in \mathbb{R}^{n \times n}$ by simultaneous permutation of rows and columns, $\sigma \cdot A = P_\sigma A P_\sigma^\top$.

These two examples of actions have a common property: they are linear group actions. Linear group actions are probably the most fundamental examples of actions, and a whole field is dedicated to their study: group representations. A group action is said to be linear if $X$ is a vector space and, for every $g \in G$, the map $x \mapsto g \cdot x$ is linear.
3.3 Invariance and Equivariance

Consider a function $f : X \to Y$ between two sets $X$ and $Y$. Consider now a group $G$ acting on both $X$ and $Y$. We say that $f$ is invariant (under the action of $G$) if for all $g \in G$ and $x \in X$, $f(g \cdot x) = f(x)$, and that $f$ is equivariant if for all $g \in G$ and $x \in X$, $f(g \cdot x) = g \cdot f(x)$.
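To make these definitions concrete, here is a small NumPy sketch (my own illustration, not tied to any library) checking them for the action of $\mathfrak{S}_n$ on $\mathbb{R}^n$ by coordinate permutation: the sum is invariant, while an entrywise non-linearity is equivariant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
x = rng.normal(size=n)
sigma = rng.permutation(n)          # a random permutation of {0, ..., n-1}
P = np.eye(n)[sigma]                # its permutation matrix: (P @ x)[i] = x[sigma[i]]

# Invariance: f(P x) == f(x) for the sum.
f = np.sum
assert np.isclose(f(P @ x), f(x))

# Equivariance: g(P x) == P g(x) for an entrywise function (here ReLU).
g = lambda v: np.maximum(v, 0.0)
assert np.allclose(g(P @ x), P @ g(x))
```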
4 A Fourier look at graph signal processing

At this point, we are quite far from concrete applications in machine learning. A first question is: which invariances or equivariances do we want to enforce? In the following section, we will consider the important example of shift-invariance as a building block of Convolutional Neural Networks (CNNs).
4.1 Convolution

The convolution is a fundamental operation in signal processing. Given two real functions $f, g : \mathbb{R} \to \mathbb{R}$ (say, integrable), their convolution is defined as
$$(f * g)(x) = \int_{\mathbb{R}} f(t)\, g(x - t)\, dt.$$
For our purpose, the important property of convolution is that it is shift-equivariant. This means that if we shift the input signal, the output signal is shifted in the same way: if $T_\tau f = f(\cdot - \tau)$ denotes the shift by $\tau \in \mathbb{R}$, then $(T_\tau f) * g = T_\tau (f * g)$. In traditional signal processing, such filters $g$ are the building blocks of linear shift-invariant systems; in CNNs, they are learned from the data.
Our issues here are the following:

- What is a convolution on a graph?
- Even more basic: what is shift-equivariance on a graph?

There is no universal answer to these two questions.
4.2 Convolution and Fourier

For an integrable function $f : \mathbb{R} \to \mathbb{R}$, its Fourier transform is defined as
$$\hat{f}(\xi) = \int_{\mathbb{R}} f(x)\, e^{-2i\pi x \xi}\, dx.$$
On the discrete side, for a signal $x \in \mathbb{R}^n$ (thought of as living on $\mathbb{Z}/n\mathbb{Z}$), the discrete Fourier transform (DFT) reads $\hat{x}_k = \sum_{j=0}^{n-1} x_j e^{-2i\pi jk/n}$.

The key property used in signal processing is that the Fourier transform diagonalizes the convolution: for any (integrable) $f$ and $g$,
$$\widehat{f * g} = \hat{f} \cdot \hat{g}.$$
The goal hence is to define what a “Fourier transform” on graphs is. To this end, we take yet another interpretation: the complex exponentials $x \mapsto e^{2i\pi x \xi}$ are the eigenfunctions of the Laplacian operator $\Delta = -\frac{d^2}{dx^2}$, so the Fourier transform is a decomposition on the eigenbasis of the Laplacian. This is the point of view that carries over to graphs.
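On a finite, discrete domain this diagonalization can be checked directly; the sketch below verifies with NumPy that the DFT turns circular convolution into entrywise multiplication.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
f = rng.normal(size=n)
g = rng.normal(size=n)

# Circular convolution: (f * g)[k] = sum_j f[j] g[(k - j) mod n].
conv = np.array([sum(f[j] * g[(k - j) % n] for j in range(n)) for k in range(n)])

# The DFT diagonalizes it: hat(f * g) = hat(f) . hat(g) (entrywise product).
assert np.allclose(np.fft.fft(conv), np.fft.fft(f) * np.fft.fft(g))
```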
5 Graph Laplacian

Definition 1 (Graph Laplacians). The (combinatorial) graph Laplacian of a graph $G$ with adjacency matrix $A$ and degree matrix $D = \mathrm{diag}(d_1, \dots, d_n)$ is
$$L = D - A.$$
The (normalized) graph Laplacian of a graph $G$ is
$$\mathcal{L} = I - D^{-1/2} A D^{-1/2}.$$
The combinatorial graph Laplacian is a positive semidefinite symmetric matrix, associated to the quadratic form
$$x^\top L x = \frac{1}{2} \sum_{i \sim j} (x_i - x_j)^2.$$
Proposition 1. Let $G$ be a graph with combinatorial Laplacian $L$. Then:

- $L$ is a differential operator on the space of functions $x : V \to \mathbb{R}$, in the sense that $(Lx)_i = \sum_{j \sim i} (x_i - x_j)$;
- there exists a matrix $\nabla \in \mathbb{R}^{|E| \times n}$ such that $L = \nabla^\top \nabla$; $\nabla$ is called the normalized incidence matrix and reads, for an edge $e = \{i, j\}$ with an arbitrary orientation, $\nabla_{e,i} = +1$, $\nabla_{e,j} = -1$, and $0$ elsewhere;
- $L$ is a real symmetric operator.
Another important fact is that, seeing $L$ as a real symmetric matrix, the spectral theorem applies: $L$ admits an orthonormal basis of eigenvectors $u_1, \dots, u_n$ with real eigenvalues $0 = \lambda_1 \leq \lambda_2 \leq \dots \leq \lambda_n$, i.e., $L = U \Lambda U^\top$ with $U = (u_1 | \cdots | u_n)$ orthogonal and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$.
5.1 Graph Fourier Transform

The last point of Proposition 1, namely the fact that $L$ is a real symmetric operator playing the role of a Laplacian on $G$, suggests mimicking the continuous setting. Recall that the Fourier transform on $\mathbb{R}$ is a decomposition on the eigenfunctions of the Laplacian; on a graph, we decompose on the eigenvectors of $L$.

Definition 2. Let $G$ be a graph with Laplacian $L = U \Lambda U^\top$. The graph Fourier transform (GFT) of a signal $x : V \to \mathbb{R}$ is defined coefficient-wise as $\hat{x}_k = \langle u_k, x \rangle$.

We can rewrite it in matrix form as $\hat{x} = U^\top x$, with inverse transform $x = U \hat{x}$.
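Concretely, the GFT is one symmetric eigendecomposition away. A minimal NumPy sketch on the small graph used earlier (np.linalg.eigh returns eigenvalues in ascending order, so the first one is the trivial eigenvalue $0$):

```python
import numpy as np

# Combinatorial Laplacian L = D - A of the small 4-node graph from above.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A

# Eigendecomposition L = U diag(lams) U^T (eigh handles symmetric matrices).
lams, U = np.linalg.eigh(L)

# Graph Fourier transform of a node signal x, and its inverse.
x = np.array([1.0, -1.0, 2.0, 0.5])
x_hat = U.T @ x          # forward GFT: coefficients <u_k, x>
x_rec = U @ x_hat        # inverse GFT
assert np.allclose(x, x_rec)
```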
5.2 Filtering on graphs

Given the definition of the GFT, how do we filter a signal? The idea is to act on the eigenvalues of the eigendecomposition and leave the eigenvectors untouched. So, given a filter $h : \mathbb{R} \to \mathbb{R}$, the filtered version of a signal $x$ is
$$h(L)\, x = U\, h(\Lambda)\, U^\top x, \quad \text{where } h(\Lambda) = \mathrm{diag}(h(\lambda_1), \dots, h(\lambda_n)).$$
In practice, diagonalizing the Laplacian is too costly. What people use are polynomial, or analytic, filters of the Laplacian: given a polynomial, or formal series, $h(\lambda) = \sum_k \beta_k \lambda^k$, one has $h(L) = \sum_k \beta_k L^k$, which can be applied through matrix-vector products only.
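A sketch of this trick: the (hypothetical) helper poly_filter below applies a polynomial filter with matrix-vector products only, and is checked against the costly spectral route; the coefficients are arbitrary, chosen for illustration.

```python
import numpy as np

# Combinatorial Laplacian of the same small 4-node graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

def poly_filter(L, betas, x):
    """Apply h(L) x for h(t) = sum_k betas[k] * t**k using only
    matrix-vector products, i.e., without diagonalizing L."""
    out, Lkx = np.zeros_like(x), x.copy()
    for beta in betas:
        out += beta * Lkx      # add beta_k * L^k x
        Lkx = L @ Lkx          # prepare L^{k+1} x
    return out

# Check against the spectral route U h(Lambda) U^T x.
betas = [0.5, -0.2, 0.1]       # arbitrary filter coefficients
x = np.array([1.0, -1.0, 2.0, 0.5])
lams, U = np.linalg.eigh(L)
h_lams = sum(b * lams**k for k, b in enumerate(betas))
assert np.allclose(poly_filter(L, betas, x), U @ (h_lams * (U.T @ x)))
```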
6 Graph Neural Network

6.1 (Spectral) Graph Convolutional Network
Up to some subtle or notational differences, you have probably encountered CNNs as parametric functions defined layer-wise by a combination of

- non-linearities such as ReLU;
- convolution layers mapping a tensor to another;
- pooling (or subsampling) layers reducing the dimension, using for instance max- or mean-pooling;
- fully connected layers (Multi-Layer Perceptrons, MLPs) to produce an output.
In the context of GNNs, the non-linearities will be the same (typically a Lipschitz function different from the identity), the convolutions will be the filters defined in the previous section, and the MLPs will be the same. But what about pooling? How to produce a good and efficient pooling operation, also known as graph coarsening, is (mostly) still an open question for GNNs. These architectures were defined around 2015, by (??, a) and (??, a).
A GCN with $M$ layers is defined as follows.

- The signal at the input layer is some node features $Z^{(0)} = X \in \mathbb{R}^{n \times d_0}$, with dimension $d_0$ and built up of columns $z^{(0)}_1, \dots, z^{(0)}_{d_0}$.
- For all $\ell \in \{1, \dots, M\}$, the signal at layer $\ell$ is obtained by propagation of the signal at layer $\ell - 1$ with respect to analytic filters $h^{(\ell)}_{ij}$ of the normalized Laplacian:
$$z^{(\ell)}_j = \rho\Big( \sum_{i=1}^{d_{\ell-1}} h^{(\ell)}_{ij}(\mathcal{L})\, z^{(\ell-1)}_i \Big), \qquad j = 1, \dots, d_\ell,$$
where $\rho$ is a pointwise non-linearity.
- The output depends on the notion of invariance that one wishes to preserve.
  - $\mathfrak{S}_n$-equivariance (node classification). In this case, the output of the GCN is a node signal defined as $\Phi(G, X) = \Psi(Z^{(M)})$, where $\Psi$ is an MLP applied row-wise to the signal $Z^{(M)}$.
  - $\mathfrak{S}_n$-invariance (graph classification). In this case, the output is a single vector given by a global pooling, e.g., $\Phi(G, X) = \Psi\big(\tfrac{1}{n} \sum_{v=1}^{n} Z^{(M)}_{v}\big)$.
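To fix ideas, here is a minimal NumPy sketch of a forward pass matching this definition, with polynomial filters of the normalized Laplacian (one filter per input/output channel pair) and both output heads. The function names, graph, sizes and coefficients are all made up for illustration; a real implementation would learn the filter coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)

def normalized_laplacian(A):
    """Normalized Laplacian I - D^{-1/2} A D^{-1/2} (degrees assumed > 0)."""
    d = A.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(d))
    return np.eye(len(A)) - Dinv @ A @ Dinv

def gcn_forward(A, X, filters, invariant=False):
    """filters[l][i][j]: coefficients of the polynomial filter h_{ij}
    mapping input channel i to output channel j at layer l."""
    L = normalized_laplacian(A)
    Z = X
    for layer in filters:
        d_in, d_out = len(layer), len(layer[0])
        Z_new = np.zeros((Z.shape[0], d_out))
        for i in range(d_in):
            for j in range(d_out):
                Lkz = Z[:, i].astype(float)
                for beta in layer[i][j]:     # apply h_{ij}(L) via powers of L
                    Z_new[:, j] += beta * Lkz
                    Lkz = L @ Lkz
        Z = np.maximum(Z_new, 0.0)           # pointwise non-linearity (ReLU)
    # Equivariant head: node signal; invariant head: global mean pooling.
    return Z.mean(axis=0) if invariant else Z

# A made-up 6-node graph (self-loops added so every degree is positive),
# random features, and one layer of random degree-1 filters.
n, d0, d1 = 6, 3, 4
A = rng.integers(0, 2, (n, n)); A = np.triu(A, 1); A = A + A.T + np.eye(n, dtype=int)
X = rng.normal(size=(n, d0))
filters = [[[rng.normal(size=2) for _ in range(d1)] for _ in range(d0)]]
print(gcn_forward(A, X, filters).shape)                  # (6, 4): one row per node
print(gcn_forward(A, X, filters, invariant=True).shape)  # (4,): one vector per graph
```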
6.2 Invariance and Equivariance of GCNs

In the definition of GCNs, it was roughly said that we can make the output either equivariant or invariant with respect to permutations. Let us be more precise: a permutation $\sigma \in \mathfrak{S}_n$ acts on node signals $X \in \mathbb{R}^{n \times d}$ by permuting the rows, $\sigma \cdot X = P_\sigma X$, and on graphs through their adjacency matrix, $\sigma \cdot A = P_\sigma A P_\sigma^\top$.
We can now state the result.

Proposition 2. For any $\sigma \in \mathfrak{S}_n$, any graph $G$ and any node features $X$, the equivariant output satisfies $\Phi(\sigma \cdot G, \sigma \cdot X) = \sigma \cdot \Phi(G, X)$, and the invariant output satisfies $\Phi(\sigma \cdot G, \sigma \cdot X) = \Phi(G, X)$.

Proof. See for instance (??, Proposition 1). The proof of the equivariant case boils down to three observations:

- the map $A \mapsto \mathcal{L}$ is $\mathfrak{S}_n$-equivariant: the normalized Laplacian of $P_\sigma A P_\sigma^\top$ is $P_\sigma \mathcal{L} P_\sigma^\top$;
- hence the map $x \mapsto h(\mathcal{L})\, x$ is $\mathfrak{S}_n$-equivariant, since $h(P_\sigma \mathcal{L} P_\sigma^\top) = P_\sigma h(\mathcal{L}) P_\sigma^\top$;
- permutation matrices commute pointwise with entrywise activation functions: $\rho(P_\sigma Z) = P_\sigma \rho(Z)$.

To get the invariant case, it remains to remark that the global pooling $Z \mapsto \frac{1}{n} \sum_v Z_v$ is $\mathfrak{S}_n$-invariant, and that the composition of an invariant map with equivariant maps is invariant.
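Proposition 2 can also be checked numerically. The standalone sketch below uses a single polynomial-filter layer, as in the GCN definition, and verifies both properties on a random permutation (the coefficients are arbitrary placeholders).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6

# Random symmetric adjacency matrix with self-loops (so all degrees > 0).
A = rng.integers(0, 2, (n, n)); A = np.triu(A, 1); A = A + A.T + np.eye(n, dtype=int)
X = rng.normal(size=(n, 2))

def layer(A, X, betas=(0.3, -0.5, 0.2)):
    """One equivariant block: a polynomial filter of the normalized
    Laplacian applied channel-wise, followed by ReLU."""
    d = A.sum(axis=1)
    L = np.eye(len(A)) - A / np.sqrt(np.outer(d, d))  # I - D^{-1/2} A D^{-1/2}
    Z, LkX = np.zeros_like(X), X.astype(float)
    for beta in betas:
        Z += beta * LkX
        LkX = L @ LkX
    return np.maximum(Z, 0.0)

sigma = rng.permutation(n)
P = np.eye(n)[sigma]

# Equivariance: relabeling the nodes permutes the output rows.
assert np.allclose(layer(P @ A @ P.T, P @ X), P @ layer(A, X))
# Invariance: a global sum-pooling on top forgets the labeling.
assert np.allclose(layer(P @ A @ P.T, P @ X).sum(axis=0), layer(A, X).sum(axis=0))
```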
7 Another paradigm: Message-Passing Graph Neural Networks

Before ending this note, let me mention that around 2017, (??, a) and (??, a) proposed a new architecture, or at least a new interpretation, that “forgets” the spectral interpretation of GCNs. It is closer in spirit to the original graph neural network model proposed by (??, a), and is now the standard way to think of GNNs. I still believe that the spectral interpretation highlights important properties of GNNs and is a good way to understand their behavior.
Instead of a spectral interpretation, a neighborhood-aggregation, or so-called message-passing, architecture is proposed:
$$h_v^{(\ell+1)} = \phi\Big( h_v^{(\ell)},\ \bigoplus_{u \in \mathcal{N}(v)} \psi\big( h_v^{(\ell)}, h_u^{(\ell)}, e_{uv} \big) \Big),$$
where

- $h_v^{(\ell)}$ is the node feature at layer $\ell$ of node $v$;
- $\phi$ (the update function) and $\psi$ (the message-passing function) are learnable functions such as traditional MLPs;
- $\bigoplus$ is an aggregation function such as the sum or the maximum;
- $e_{uv}$ is a (potential) edge feature.

In many papers, this is presented in the equivalent form (often with multisets $\{\!\{\cdot\}\!\}$ for the aggregation)
$$h_v^{(\ell+1)} = \phi\Big( h_v^{(\ell)},\ \bigoplus \big\{\!\!\big\{ \psi\big( h_v^{(\ell)}, h_u^{(\ell)} \big) : u \in \mathcal{N}(v) \big\}\!\!\big\} \Big).$$
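Here is a minimal sketch of one message-passing layer in this paradigm, with sum aggregation and random single-layer MLPs standing in for $\phi$ and $\psi$; all names and dimensions are illustrative, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(4)

def mlp(W, b, x):
    """A one-layer MLP block, used for both phi (update) and psi (message)."""
    return np.maximum(x @ W + b, 0.0)

def mp_layer(adj_list, H, params):
    """One message-passing step:
       h_v <- phi(h_v, sum_{u in N(v)} psi(h_v, h_u))."""
    W_msg, b_msg, W_upd, b_upd = params
    H_new = np.zeros((H.shape[0], b_upd.shape[0]))
    for v, neighbors in adj_list.items():
        # Aggregate the messages from the neighbors of v (sum aggregation).
        m = sum(mlp(W_msg, b_msg, np.concatenate([H[v], H[u]])) for u in neighbors)
        H_new[v] = mlp(W_upd, b_upd, np.concatenate([H[v], m]))
    return H_new

# Illustration on the small 4-node graph with 3-dimensional features.
adj_list = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
d, d_msg, d_out = 3, 5, 4
H = rng.normal(size=(4, d))
params = (rng.normal(size=(2 * d, d_msg)), np.zeros(d_msg),
          rng.normal(size=(d + d_msg, d_out)), np.zeros(d_out))
print(mp_layer(adj_list, H, params).shape)  # (4, 4)
```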
Let me mention several important examples of MPGNNs.

- Convolutional GNN. Wait, what, didn’t we call them spectral GNNs? Yes, but we can look at them in a slightly more general way:
$$h_v^{(\ell+1)} = \rho\Big( \sum_{u \in \mathcal{N}(v)} c_{uv}\, \psi\big( h_u^{(\ell)} \big) \Big),$$
where the $c_{uv}$ are fixed weights depending only on the graph structure. In practice, the community considers a more constrained version:
$$h_v^{(\ell+1)} = \rho\Big( \sum_{u \in \mathcal{N}(v)} c_{uv}\, \Theta\, h_u^{(\ell)} \Big),$$
where $\Theta$ is a learnable matrix. This can be even more precisely specified (in matrix form) as $H^{(\ell+1)} = \rho(A H^{(\ell)} \Theta)$, or in normalized form as $H^{(\ell+1)} = \rho(D^{-1/2} A D^{-1/2} H^{(\ell)} \Theta)$; a sketch in code follows this list.
- Attentional GNN.
$$h_v^{(\ell+1)} = \rho\Big( \sum_{u \in \mathcal{N}(v)} a\big( h_v^{(\ell)}, h_u^{(\ell)} \big)\, \Theta\, h_u^{(\ell)} \Big),$$
where $a$ is a learnable self-attention function.
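As a sketch of the constrained convolutional form in matrix notation (a single layer with the symmetric normalization; the graph and weights are random placeholders rather than learned):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d_in, d_out = 5, 3, 4

# A random graph, with self-loops added as is customary for this rule.
A = rng.integers(0, 2, (n, n)); A = np.triu(A, 1); A = A + A.T
A_tilde = A + np.eye(n, dtype=int)
d = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(d, d))   # D^{-1/2} (A + I) D^{-1/2}

# One convolutional layer: H' = rho(A_hat H Theta).
H = rng.normal(size=(n, d_in))
Theta = rng.normal(size=(d_in, d_out))      # the learnable matrix
H_new = np.maximum(A_hat @ H @ Theta, 0.0)
print(H_new.shape)                           # (5, 4)
```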
Without diving into the details, I want to draw your attention to two extreme cases of this formulation.

- When $\mathcal{N}(v) = \{v\}$ for all $v$ (so not a classical graph: you have to add a self-loop), the convolutional architecture boils down to the Deep Sets architecture (Zaheer et al., 2017); see the sketch below.
- On the other extreme, when $\mathcal{N}(v) = V$ for all $v$ (again with self-loops), i.e., on the complete graph, the attentional GNN is linked to the Transformer architecture (Vaswani et al., 2017).
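The first degenerate case can be seen directly in the matrix form above; a small sketch, assuming the same notation:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d_in, d_out = 5, 3, 4
H = rng.normal(size=(n, d_in))
Theta = rng.normal(size=(d_in, d_out))

# Self-loops only: A + I = I, hence A_hat = I and the layer
# rho(A_hat H Theta) = rho(H Theta) processes every node independently
# with the same map -- composed with a sum pooling, this is Deep Sets.
A_hat = np.eye(n)   # normalized adjacency of the self-loops-only graph
layer_out = np.maximum(A_hat @ H @ Theta, 0.0)
assert np.allclose(layer_out, np.maximum(H @ Theta, 0.0))
```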