Graph Neural Networks
Neural Networks
• Thrives in situations where it is challenging to define rules
•Algorithms that improve automatically through experience.
•The algorithm has a (large) number of parameters whose values need to be learned from the data.
• Inspired by neurons in biology.
http://cs231n.stanford.edu/slides/winter1516_lecture5.pdf
Perceptron
What does a Perceptron do? (1)
• Suppose a NN is initialized with weights 𝑤 = (0.1, 0.2, 0.3) and bias 𝑏 = 0.05
• Step 0: Take an input 𝑥 = (0.3, 0.6, 0.9)
[Diagram: a single neuron with inputs (0.3, 0.6, 0.9), weights 𝑤1 = 0.1, 𝑤2 = 0.2, 𝑤3 = 0.3, bias 𝑏 = 0.05, weighted sum 𝑧, and output 𝑦ො]
What does a Perceptron do? (2)
• Step 1: Calculate a weighted sum
𝑧 = 𝑤ᵀ𝑥 + 𝑏 = 0.1×0.3 + 0.2×0.6 + 0.3×0.9 + 0.05 = 0.47
What does a Perceptron do? (3)
• Step 2: Apply an activation function
Mathematically: the computation of a neuron is shown below:
𝑧 = 𝑤𝑇𝑥 + 𝑏
We use a simple step rule as the activation function here.
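As a quick illustration, here is a minimal NumPy sketch of the computation above, using the example weights, bias, and input from the slides; the helper name `step` is just for illustration.

```python
# A minimal sketch of the perceptron computation above (NumPy),
# using the example weights, bias, and input from the slides.
import numpy as np

def step(z):
    # Simple step rule: output 1 if z > 0, else 0.
    return 1 if z > 0 else 0

w = np.array([0.1, 0.2, 0.3])   # weights w1, w2, w3
b = 0.05                        # bias
x = np.array([0.3, 0.6, 0.9])   # input

z = w @ x + b                   # weighted sum, approximately 0.47
y_hat = step(z)                 # activation output: 1
print(z, y_hat)
```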
Neural Networks
[Diagram: a black box with inputs X1, X2, X3 and output Y]
Output Y is 1 if at least two of the three inputs are equal to 1.
Neural Networks
[Diagram: input nodes X1, X2, X3, each connected with weight 0.3 to an output node Y with threshold t = 0.4]
Y = I(0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4 > 0),
where I(z) = 1 if z is true and 0 otherwise.
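To see that this single neuron really encodes the rule, the small sketch below (plain Python; the helper name `model` is hypothetical) enumerates all eight binary inputs and checks the indicator against the "at least two inputs are 1" rule.

```python
# A quick check that the indicator model above reproduces the rule
# "Y is 1 if at least two of the three inputs are equal to 1".
import itertools

def model(x1, x2, x3):
    return 1 if 0.3 * x1 + 0.3 * x2 + 0.3 * x3 - 0.4 > 0 else 0

for x1, x2, x3 in itertools.product([0, 1], repeat=3):
    rule = 1 if x1 + x2 + x3 >= 2 else 0
    assert model(x1, x2, x3) == rule   # the neuron matches the rule on all 8 inputs
```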
Neural Networks
• Model is an assembly of inter- connected nodes and weighted links
• The output node sums up its input values according to the weights of its links
• The sum is compared against some threshold t
• The sign function (activation function) outputs a value +1 if its argument is positive and -1 otherwise.
Perceptron Model
[Diagram: input nodes X1, X2, X3 with weights w1, w2, w3 feeding an output node Y]
Y = I(Σᵢ wᵢ Xᵢ − t > 0)
Y = sign(Σᵢ wᵢ Xᵢ − t)
Perceptron Learning
• 𝑦ො = sign(w_d x_d + w_{d−1} x_{d−1} + ⋯ + w_1 x_1 + w_0 x_0) = sign(𝒘 ∙ 𝒙), where w_0 = −t and x_0 = 1.
• 𝛌 is a parameter known as the learning rate and is between 0 and 1.
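The slides do not spell out the update rule itself, so the sketch below assumes the classical perceptron rule w ← w + λ(y − ŷ)x with labels in {−1, +1}; the function name and the toy AND-like dataset are made up purely for illustration.

```python
# A minimal sketch of the classical perceptron update rule,
# w <- w + lambda * (y - y_hat) * x, assuming labels in {-1, +1}.
import numpy as np

def train_perceptron(X, y, lam=0.1, epochs=10):
    # X: (n_samples, d) inputs; y: (n_samples,) labels in {-1, +1}.
    # Prepend x0 = 1 so that w0 plays the role of -t (the threshold).
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = 1.0 if w @ xi > 0 else -1.0
            w += lam * (yi - y_hat) * xi   # no change when the prediction is correct
    return w

# Toy, linearly separable AND-like data (labels in {-1, +1}).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1], dtype=float)
print(train_perceptron(X, y))
```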
Perceptron Decision Boundary
• The previous slide shows a perceptron model, which is linear.
• The figure on the right shows the decision boundary by 𝑦ො = 0.
• It is a linear hyperplane that separates the data into two classes, -1 and +1.
Nonlinear Hyperplane (XOR Problem)
• Consider an example of nonlinearly separable data given by the XOR function. The linear perceptron model cannot find a solution for it.
Two Layers for the XOR Problem
• It uses a two-layer, feed-forward ANN.
Classic Architecture (MLP)
z_1 = W_1ᵀ X,  h_1 = sigmoid(z_1)
z_2 = W_2 h_1,  h_2 = sigmoid(z_2)
…
Y = sigmoid(z_{n+1})
(W_i is the weight matrix of layer i.)
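Below is a minimal NumPy sketch of such a two-layer, feed-forward network applied to the XOR problem; the weights are hand-picked for illustration (they are not learned and are not from the slides).

```python
# A minimal sketch of a two-layer feed-forward network that solves XOR,
# with hand-picked (not learned) weights chosen purely for illustration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = np.array([[20.0, 20.0],    # hidden unit 1 ~ OR(x1, x2)
               [20.0, 20.0]])   # hidden unit 2 ~ AND(x1, x2)
b1 = np.array([-10.0, -30.0])
W2 = np.array([20.0, -20.0])    # output ~ OR AND (NOT AND) = XOR
b2 = -10.0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = sigmoid(W1 @ np.array(x, dtype=float) + b1)   # hidden layer
    y = sigmoid(W2 @ h + b2)                          # output layer
    print(x, round(float(y)))                         # 0, 1, 1, 0
```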
Classic Architecture
[Diagram: a network with an input layer (x1–x5), a hidden layer, and an output layer; each neuron computes a weighted sum S_i, applies an activation function g(S_i), and compares it against a threshold t]
Training an ANN means learning the weights of the neurons.
This is a feed-forward ANN.
Activation Functions
Increased Expressive Power
From Perceptrons to NN
• Perceptrons are the basic unit of a neural network.
• A 1-layer neural network is shown on the right.
Structure:
• Input layer and output layer
• Hidden layers in between
Design Issues in NN Learning
• The number of nodes in the input layer should be determined.
• Generally, the number of nodes in the output layer can be 2 for binary classification and k for k-class classification.
• The network topology
• The number of hidden layers and hidden nodes
• Feed-forward or recurrent network
• Activation function
• Weights and biases need to be initialized
Recap: Node Embeddings
Intuition: Map nodes to 𝑑-dimensional embeddings such that similar nodes in the graph are embedded close together
Input graph 2D node embeddings
Recap: Node Embeddings
We need to define the mapping from the input network to the d-dimensional embedding space.
Recap: Two Key Components
Encoder: maps each node in the input graph to a low-dimensional vector
Similarity function: specifies how the relationships in vector space map to the relationships in the original network
similarity(𝑢, 𝑣) ≈ 𝒛_𝑢 ∙ 𝒛_𝑣
(similarity of 𝑢 and 𝑣 in the original network ≈ dot product between their d-dimensional node embeddings)
Recap: “Shallow” Encoding
Simplest encoding approach: encoder is just an embedding-lookup
The embedding matrix has one column per node; the embedding vector for a specific node is the corresponding column, and the number of rows is the dimension/size of the embeddings.
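Here is a minimal sketch of this embedding lookup, assuming nodes are indexed by integers; the matrix sizes and random values are arbitrary stand-ins for a learned matrix.

```python
# A minimal sketch of "shallow" encoding: the encoder is just a lookup of a
# column in an embedding matrix Z (d rows, one column per node).
import numpy as np

num_nodes, d = 5, 3                 # |V| = 5 nodes, embedding size d = 3
Z = np.random.randn(d, num_nodes)   # stand-in for the learned embedding matrix

def encode(v):
    # v is an integer node index; its embedding is simply column v of Z.
    return Z[:, v]

print(encode(2))                    # d-dimensional embedding of node 2
```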
Recap: “Shallow” Encoding
Limitations of shallow embedding methods:
▪𝑶(|𝑽|) parameters are needed:
▪No sharing of parameters between nodes
▪Every node has its own unique embedding
▪Inherently “transductive”:
▪Cannot generate embeddings for nodes that are not seen during training
▪Do not incorporate node features:
▪Many graphs have features that we can and should leverage
Today: Deep Graph Encoders
Today: We will now discuss deep methods based on graph neural networks (GNNs):
ENC(𝑣) = multiple layers of non-linear transformations based on graph structure
Modern ML Toolbox
Images Text/Speech
Modern deep learning toolbox is designed for simple sequences & grids
Deep Graph Encoders
Output: node embeddings. We can also embed subgraphs and entire graphs.
But networks are far more complex!
▪Arbitrary size and complex topological structure (i.e., no spatial locality like grids)
▪No fixed node ordering or reference point
▪Often dynamic and have multimodal features
Tasks on Networks
Tasks we will be able to solve:
Node classification
▪Predict the type of a given node
Link prediction
▪Predict whether two nodes are linked
Community detection
▪Identify densely linked clusters of nodes
Network similarity
▪How similar are two (sub)networks
Assume we have a graph 𝑮:
▪𝑉 is the vertex set
▪𝑨 is the adjacency matrix (assume binary)
▪𝑿 ∈ ℝ^(m×|V|) is a matrix of node features
▪𝑣: a node in 𝑉; 𝑁 (𝑣): the set of neighbors of 𝑣.
▪Node features:
▪Social networks: User profile, User image
▪When there is no node feature in the graph dataset:
▪Indicator vectors (one-hot encoding of a node)
▪Vector of constant 1: [1, 1, …, 1]
A Naïve Approach
Join the adjacency matrix and the features, and feed them into a deep neural net:
[Table: the adjacency matrix over nodes A–E concatenated with node feature columns, one row per node, fed to the network to predict the label "?"]
Issues with this idea:
▪𝑂(|𝑉|) parameters
▪Not applicable to graphs of different sizes
▪Sensitive to node ordering
CNN on an image:
Goal: generalize convolutions beyond simple lattices and leverage node features/attributes (e.g., text, images).
What about Graphs?
Talk on Deep learning on graphs: successes, challenges by
Graphs look like this
1. No fixed notion of locality or sliding window on the graph
2. Graph is permutation invariant
Convolutional layer with 3×3 filter
Image Graph
Idea: transform information at the neighbors and combine it:
▪Transform “messages” hᵢ from neighbors: Wᵢhᵢ
▪Add them up: Σᵢ Wᵢhᵢ
A Computation Graph
Determine node computation graph; propagate and transform information
Learn how to propagate information across the graph to compute node features
Aggregate Neighbors
Key idea: Generate node embeddings based on local network neighborhoods
TARGET NODE
INPUT GRAPH
Aggregate Neighbors
Intuition: Nodes aggregate information from their neighbors using neural networks
TARGET NODE
INPUT GRAPH
Neural networks
Aggregate Neighbors
Intuition: Network neighborhood defines a computation graph
Every node defines a computation graph based on its neighborhood!
Deep: Many Layers
Model can be of arbitrary depth: Nodes have embeddings at each layer
Layer-0 embedding of node 𝑢 is its input feature, 𝑥𝑢
Layer-𝑘 embedding gets information from nodes that are 𝑘 hops away
TARGET NODE
INPUT GRAPH
Neighborhood Aggregation
Neighborhood aggregation: Key distinctions are in how different approaches aggregate information across the layers
INPUT GRAPH
? What is in the box?
TARGET NODE
Neighborhood Aggregation
Basic approach: Average information from neighbors and apply a neural network
TARGET NODE
(1) average messages from neighbors
INPUT GRAPH
(2) apply a neural network
A GNN Layer
GNN Layer = Message + Aggregation
• Different instantiations under this perspective
• GCN, GraphSAGE, GAT, …
GNN Layer 1
(2) Aggregation (1) Message
TARGET NODE
INPUT GRAPH
A Single GNN Layer
Message Computation
Message Aggregation
Message Aggregation Issue
A Single GNN Layer
Activation (Non-linearity)
Classical GNN Layers: GCN
The Maths: Deep Encoder
Basic approach: Average neighbor messages and apply a neural network
Model Parameters
We can feed these embeddings into any loss function and run SGD to train the weight parameters
h_𝑣^(𝑙): the hidden representation of node 𝑣 at layer 𝑙
𝑊_𝑘: weight matrix for neighborhood aggregation
𝐵_𝑘: weight matrix for transforming the hidden vector of the node itself
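Below is a minimal per-node sketch of this update in NumPy, assuming mean aggregation and a ReLU non-linearity; the tiny path graph and random weights are only for illustration.

```python
# A minimal per-node sketch of the update described above:
# h_v^{(l+1)} = sigma( W_l * mean_{u in N(v)} h_u^{(l)} + B_l * h_v^{(l)} ).
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gnn_layer(H, neighbors, W, B):
    # H: (|V|, d_in) current hidden states; neighbors: dict node -> list of neighbors;
    # W, B: (d_out, d_in) weight matrices for the neighborhood and self terms.
    H_new = np.zeros((H.shape[0], W.shape[0]))
    for v in range(H.shape[0]):
        agg = np.mean(H[neighbors[v]], axis=0) if neighbors[v] else np.zeros(H.shape[1])
        H_new[v] = relu(W @ agg + B @ H[v])
    return H_new

# Tiny example: a path graph 0 - 1 - 2 with 2-dimensional input features.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
H0 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W0, B0 = np.random.randn(2, 2), np.random.randn(2, 2)
print(gnn_layer(H0, neighbors, W0, B0))
```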
Matrix Formulation
Re-writing update function in matrix form:
Compute the output of the first graph convolutional layer based on the above formula
Example: the update in matrix form is 𝐻^(𝑙+1) = 𝜎(𝐷^(−1)𝐴𝐻^(𝑙)𝑊_𝑙ᵀ + 𝐻^(𝑙)𝐵_𝑙ᵀ), computed step by step:
▪the inverse degree matrix 𝐷^(−1) and the adjacency matrix 𝐴
▪the row-normalized adjacency matrix 𝐷^(−1)𝐴
▪the aggregated neighbor features 𝐷^(−1)𝐴𝐻
▪the transformed aggregation 𝐷^(−1)𝐴𝐻𝑊ᵀ (using 𝑊_0)
▪the self term 𝐻𝐵ᵀ (using 𝐵_0)
▪their sum 𝐷^(−1)𝐴𝐻𝑊ᵀ + 𝐻𝐵ᵀ
▪the non-linearity 𝜎(𝐷^(−1)𝐴𝐻𝑊ᵀ + 𝐻𝐵ᵀ), which gives 𝐻^(1)
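The same computation in matrix form, as a minimal NumPy sketch; the small path graph, feature values, and random W, B are arbitrary choices for illustration.

```python
# A minimal sketch of the matrix-form update:
# H^{(l+1)} = sigma( D^{-1} A H^{(l)} W^T + H^{(l)} B^T ).
import numpy as np

sigma = lambda x: np.maximum(0.0, x)                 # ReLU non-linearity

A = np.array([[0, 1, 0],                             # adjacency matrix of a path 0-1-2
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
D_inv = np.diag(1.0 / A.sum(axis=1))                 # inverse degree matrix D^{-1}

H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # H^{(0)}: node features, one row per node
W = np.random.randn(2, 2)                            # neighborhood-aggregation weights W_0
B = np.random.randn(2, 2)                            # self-transformation weights B_0

H_next = sigma(D_inv @ A @ H @ W.T + H @ B.T)        # H^{(1)}
print(H_next)
```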
Train a GNN
Node embedding 𝒛_𝑣 is a function of the input graph.
Supervised setting: we want to minimize the loss min_Θ L(𝒚, 𝑓(𝒛_𝑣)), where
▪𝒚: node label
▪L could be L2 if 𝒚 is a real number, or cross entropy if 𝒚 is categorical
Unsupervised setting:
▪No node label available
▪Use the graph structure as the supervision!
Supervised Training
Directly train the model for a supervised task (e.g., node classification)
Safe or toxic drug?
Safe or toxic drug?
E.g., a drug-drug interaction network
Supervised Training
Directly train the model for a supervised task (e.g., node classification)
– Use cross entropy loss
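A minimal sketch of this supervised objective, assuming a linear classification head and softmax cross entropy; the embeddings, labels, and head weights below are random placeholders, not values from the slides.

```python
# A minimal sketch of the supervised loss: feed node embeddings z_v through a
# classification head f and compute cross entropy against node labels y.
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(Z, y, W_head):
    # Z: (|V|, d) node embeddings; y: (|V|,) integer class labels;
    # W_head: (num_classes, d) weights of the classification head f.
    probs = softmax(Z @ W_head.T)
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

Z = np.random.randn(4, 3)       # embeddings of 4 nodes (placeholders)
y = np.array([0, 1, 1, 0])      # illustrative labels, e.g. safe (0) vs. toxic (1) drug
print(cross_entropy(Z, y, np.random.randn(2, 3)))
```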
Unsupervised Training
“Similar” nodes have similar embeddings: minimize L = Σ_{(𝑢,𝑣)} CE(𝑦_{𝑢,𝑣}, DEC(𝒛_𝑢, 𝒛_𝑣))
▪where 𝑦_{𝑢,𝑣} = 1 when nodes 𝑢 and 𝑣 are similar
▪CE is the cross entropy
▪DEC is the decoder, such as the inner product (sketched below)
Node similarity can be anything from previous lectures, e.g., a loss based on:
▪Random walks (node2vec, DeepWalk, struc2vec)
▪Node proximity in the graph
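A minimal sketch of this unsupervised objective, assuming an inner-product decoder followed by a sigmoid and binary cross entropy; the node pairs and similarity labels are made up for illustration.

```python
# A minimal sketch of the unsupervised loss: for node pairs (u, v), decode with the
# inner product of their embeddings and apply binary cross entropy against y_uv.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unsupervised_loss(Z, pairs, labels):
    # Z: (|V|, d) embeddings; pairs: list of (u, v); labels: y_uv in {0, 1}.
    loss = 0.0
    for (u, v), y in zip(pairs, labels):
        p = sigmoid(Z[u] @ Z[v])                                # DEC(z_u, z_v) = inner product
        loss += -(y * np.log(p) + (1 - y) * np.log(1 - p))      # cross entropy term
    return loss / len(pairs)

Z = np.random.randn(5, 3)                       # placeholder embeddings for 5 nodes
print(unsupervised_loss(Z, [(0, 1), (2, 4)], [1, 0]))
```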
Model Design: Overview
(1) Define a neighborhood aggregation function
(2) Define a loss function on the embeddings
Model Design: Overview
(3) Train on a set of nodes, i.e., a batch of compute graphs
Model Design: Overview
(4) Generate embeddings for nodes as needed
Even for nodes we never trained on!
Inductive Capability
The same aggregation parameters are shared for all nodes:
▪The number of model parameters is sublinear in |𝑉| and we can generalize to unseen nodes!
shared parameters
shared parameters
Compute graph for node A Compute graph for node B
INPUT GRAPH
Inductive Capability:
Inductive node embeddings generalize to entirely unseen graphs
E.g., train on protein interaction graph from model organism A and generate embeddings on newly collected data about organism B
Train on one graph
Generalize to new graph
Inductive Capability:
Many application settings constantly encounter previously unseen nodes:
▪ E.g., Reddit, YouTube, Google Scholar
Need to generate new embeddings “on the fly”
New node arrives
Generate embedding for new node
Train with snapshot
Stacking GNN Layers
How to connect GNN layers into a GNN?
1. Stack layers sequentially
TARGET NODE
(3) Layer connectivity
GNN Layer 1
INPUT GRAPH
GNN Layer 2
Stacking GNN Layers
An Over-smoothing Problem
The issue of stacking many GNN layers:
▪GNNs suffer from the over-smoothing problem
The over-smoothing problem: all the node embeddings converge to the same value
▪This is bad because we want to use node embeddings to differentiate nodes
Why does the over-smoothing problem happen?
Receptive Field of a GNN
Receptive field: the set of nodes that determine the embedding of a node of interest
▪In a 𝑲-layer GNN, each node has a receptive field of 𝑲-hop neighborhood
Receptive field for a 1-layer, 2-layer, and 3-layer GNN
Receptive Field of a GNN
Receptive field overlap for two nodes
▪The number of shared neighbors grows quickly when we increase the number of hops (i.e., the number of GNN layers)
▪1-hop neighbor overlap: only 1 node
▪2-hop neighbor overlap: about 20 nodes
▪3-hop neighbor overlap: almost all the nodes!
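A small NetworkX sketch makes this concrete on an arbitrary example graph (the karate-club graph and the chosen node pair are illustrative, not from the slides): it counts how many nodes the K-hop receptive fields of two nodes share as K grows.

```python
# A small sketch measuring how quickly the K-hop receptive fields of two nodes overlap.
import networkx as nx

def k_hop_neighborhood(G, node, k):
    # All nodes within k hops of `node` (including the node itself).
    return set(nx.single_source_shortest_path_length(G, node, cutoff=k))

G = nx.karate_club_graph()        # small example graph
u, v = 0, 33                      # two arbitrary nodes of interest
for k in (1, 2, 3):
    overlap = k_hop_neighborhood(G, u, k) & k_hop_neighborhood(G, v, k)
    print(k, len(overlap))        # the shared receptive field grows quickly with k
```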
Receptive Field &Over-smoothing
We can explain over-smoothing via the notion of receptive field
▪We know the embedding of a node is determined by its receptive field
▪If two nodes have highly-overlapped receptive fields, then their embeddings are highly similar
▪Stack many GNN layers → nodes will have highly-overlapped receptive fields → node embeddings will be highly similar → suffer from the over-smoothing problem
Next: how do we overcome the over-smoothing problem?
Over-smoothing
Design GNN Layer Connectivity
What do we learn from the over-smoothing problem?
Lesson 1: Be cautious when adding GNN layers
▪Unlike neural networks in other domains (e.g., CNNs for image classification), adding more GNN layers does not always help
▪Step 1: Analyze the necessary receptive field to solve your problem. E.g., by computing the diameter of the graph
▪ Step 2: Set number of GNN layers 𝐿 to be a bit more than the receptive field we like. Do not set 𝑳 to be unnecessarily large!
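A small NetworkX sketch of Step 1, assuming the graph is connected; note that the receptive field your task actually needs may be much smaller than the full diameter, so treat this only as an upper bound when choosing 𝐿.

```python
# A small sketch of Step 1: estimate the receptive field via the graph diameter,
# then pick the number of GNN layers L to be a bit more than what is needed.
import networkx as nx

G = nx.karate_club_graph()        # arbitrary example graph (assumed connected)
needed_hops = nx.diameter(G)      # longest shortest path in the graph
L = needed_hops + 1               # a bit more than the desired receptive field
print(needed_hops, L)
```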
Expressive Power for Shallow GNNs
Question: How to enhance the expressive power of a GNN, if the number of GNN layers is small?
Solution 1: Increase the expressive power within each GNN layer
▪In our previous examples, each transformation or aggregation function includes only one linear layer
▪We can make aggregation / transformation become a deep neural network!
If needed, each box could include a 3-layer MLP
(2) Aggregation (1) Transformation
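A minimal sketch of this idea, replacing the single linear transformation inside one GNN layer with a small MLP; the two-layer MLP, mean aggregation, and toy graph are illustrative choices, not the slides' exact design.

```python
# A minimal sketch: use a small MLP as the transformation inside a GNN layer
# (the slides suggest up to a 3-layer MLP; a 2-layer MLP is shown here).
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def mlp(x, W1, b1, W2, b2):
    # A small MLP used as the transformation function inside one GNN layer.
    return W2 @ relu(W1 @ x + b1) + b2

def expressive_gnn_layer(H, neighbors, params):
    W1, b1, W2, b2 = params
    H_new = np.zeros_like(H)
    for v in range(H.shape[0]):
        agg = np.mean(H[neighbors[v]], axis=0)        # mean aggregation, as before
        H_new[v] = relu(mlp(agg, W1, b1, W2, b2))     # deeper transformation per layer
    return H_new

# Tiny example on the path graph 0 - 1 - 2 with 2-d features.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
params = (np.random.randn(4, 2), np.zeros(4), np.random.randn(2, 4), np.zeros(2))
print(expressive_gnn_layer(H, neighbors, params))
```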
Learning Outcome
Generate node embeddings by aggregating neighborhood information
Key distinctions are in how different approaches aggregate information across the layers
Acknowledgement: , Stanford University