DATA 202 Module 7: Networks and Graph Data

Introduction

Many real-world systems are best understood as networks: social connections, biological interactions, transportation routes, financial transactions, knowledge structures. Graph data science provides tools to analyze these relational systems—finding communities, identifying influential nodes, predicting missing connections, and propagating information through networks.

This module explores network analysis and graph machine learning, from classical algorithms to modern graph neural networks.

Part 1: Network Fundamentals

Graphs and Networks

A graph G = (V, E) consists of:

Vertices (Nodes) V: Entities (people, proteins, cities)
Edges (Links) E: Relationships (friendships, interactions, routes)

Types:

Undirected: Symmetric relationships (friendship)
Directed: Asymmetric (following, citations)
Weighted: Edges have values (distance, strength)
Bipartite: Two node types, edges between types (users and products)
Multiplex: Multiple edge types between same nodes

Network Representations

Adjacency Matrix: N×N matrix A where A[i,j] = 1 if edge exists

Dense storage, fast operations
O(N²) space

Edge List: List of (source, target) pairs

Sparse storage
Efficient for large networks

Adjacency List: For each node, list of neighbors

Efficient neighbor access
Good balance

NetworkX Basics

import networkx as nx
import matplotlib.pyplot as plt

# Create graph
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)])

# Basic properties
print(f"Nodes: {G.number_of_nodes()}")
print(f"Edges: {G.number_of_edges()}")
print(f"Degree of node 3: {G.degree(3)}")

# Visualization
nx.draw(G, with_labels=True, node_color='lightblue')
plt.show()

Part 2: Network Analysis

Centrality Measures

Degree Centrality: Number of connections

nx.degree_centrality(G)

Betweenness Centrality: How often node lies on shortest paths

nx.betweenness_centrality(G)

Closeness Centrality: Average distance to all other nodes

nx.closeness_centrality(G)

PageRank: Importance based on importance of neighbors

nx.pagerank(G)

Eigenvector Centrality: Connected to well-connected nodes

nx.eigenvector_centrality(G)

Community Detection

Finding densely connected groups:

Louvain Algorithm: Optimize modularity greedily

from community import community_louvain
partition = community_louvain.best_partition(G)

Label Propagation: Iterative label spreading

communities = nx.community.label_propagation_communities(G)

Girvan-Newman: Edge betweenness removal

communities = nx.community.girvan_newman(G)

Network Properties

Clustering Coefficient: Probability neighbors are connected Path Length: Average distance between nodes Diameter: Maximum shortest path Density: Fraction of possible edges that exist Assortativity: Tendency of similar nodes to connect

Part 3: Graph Machine Learning

Node Embeddings

Learn vector representations of nodes:

DeepWalk: Random walks + Word2Vec

Generate random walks from each node
Treat walks as sentences
Apply Skip-gram to learn embeddings

Node2Vec: Biased random walks

Parameters control exploration vs. exploitation
Captures structural roles and communities

from node2vec import Node2Vec

node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200)
model = node2vec.fit(window=10, min_count=1)
embeddings = {node: model.wv[str(node)] for node in G.nodes()}

Link Prediction

Predict missing or future edges:

Classical Features:

Common Neighbors
Jaccard Coefficient
Adamic-Adar Index
Preferential Attachment

preds = nx.adamic_adar_index(G, [(1, 4), (2, 5)])
for u, v, score in preds:
    print(f"Edge ({u}, {v}): {score:.3f}")

Graph Neural Networks

Message Passing: Nodes aggregate information from neighbors

GCN (Graph Convolutional Network): \(H^{(l+1)} = \sigma(\tilde{A} H^{(l)} W^{(l)})\)

GraphSAGE: Sample and aggregate GAT (Graph Attention Network): Attention over neighbors GIN (Graph Isomorphism Network): More expressive aggregation

Part 4: Applications

Identify influencers
Detect communities
Track information spread
Recommend connections

Biological Networks

Protein-protein interactions
Gene regulatory networks
Drug-target interactions
Predict drug side effects

Knowledge Graphs

Entities and relationships
Question answering
Recommendation
Examples: Wikidata, DBpedia, Google Knowledge Graph

Financial Networks

Transaction networks for fraud detection
Interbank lending for systemic risk
Supply chain analysis

DEEP DIVE: The PageRank Revolution

How Google Ranked the Web

In 1996, the web was chaotic. Search engines ranked pages by keyword matching—easily gamed with invisible text and keyword stuffing. Finding quality information was a nightmare.

Two Stanford PhD students, Larry Page and Sergey Brin, had a different idea: use the structure of the web itself. Their insight was simple yet profound: a page is important if important pages link to it.

The Algorithm

PageRank treats the web as a graph where pages are nodes and hyperlinks are directed edges. The importance of a page is defined recursively:

\[PR(p) = \frac{1-d}{N} + d \sum_{q \in B_p} \frac{PR(q)}{L(q)}\]

Where:

d is a damping factor (~0.85)
B_p is the set of pages linking to p
L(q) is the number of outgoing links from q

Interpretation: Imagine a random surfer following links. PageRank is the probability they land on each page in the long run.

Why It Worked

PageRank was:

Democratic: Every link is a vote
Weighted: Votes from important pages matter more
Robust: Hard to game without getting many important sites to link to you
Scalable: Computed efficiently via iteration

Beyond Search

PageRank principles now appear everywhere:

Twitter’s “Who to Follow” suggestions
Academic citation analysis
Fraud detection
Biological network analysis

The insight that network structure encodes importance transformed not just search but our understanding of complex systems.

Part 1: Load and Explore Network

import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt

# Load a social network (Zachary's Karate Club)
G = nx.karate_club_graph()

print(f"Nodes: {G.number_of_nodes()}")
print(f"Edges: {G.number_of_edges()}")
print(f"Density: {nx.density(G):.4f}")
print(f"Clustering: {nx.average_clustering(G):.4f}")

# Degree distribution
degrees = [d for n, d in G.degree()]
plt.hist(degrees, bins=range(max(degrees)+2))
plt.xlabel('Degree')
plt.ylabel('Count')
plt.title('Degree Distribution')
plt.show()

Part 2: Centrality Analysis

# Compute centralities
centralities = pd.DataFrame({
    'degree': nx.degree_centrality(G),
    'betweenness': nx.betweenness_centrality(G),
    'closeness': nx.closeness_centrality(G),
    'eigenvector': nx.eigenvector_centrality(G),
    'pagerank': nx.pagerank(G)
})

print("Top 5 nodes by PageRank:")
print(centralities.nlargest(5, 'pagerank'))

# Visualize with size by centrality
pos = nx.spring_layout(G, seed=42)
node_size = [3000 * centralities.loc[n, 'pagerank'] for n in G.nodes()]
nx.draw(G, pos, node_size=node_size, with_labels=True)
plt.title('Node size = PageRank')
plt.show()

Part 3: Community Detection

from community import community_louvain

# Detect communities
partition = community_louvain.best_partition(G)

# Visualize
colors = [partition[n] for n in G.nodes()]
nx.draw(G, pos, node_color=colors, cmap=plt.cm.Set1, with_labels=True)
plt.title('Communities (Louvain)')
plt.show()

# Compare to ground truth (club split)
club_labels = [G.nodes[n]['club'] for n in G.nodes()]
print("Ground truth vs detected:")
for n in G.nodes():
    print(f"Node {n}: {club_labels[n-1]} -> Community {partition[n]}")

Recommended Resources

Libraries

NetworkX: Pure Python, extensive algorithms
igraph: Fast C library with Python bindings
PyTorch Geometric: Graph neural networks
DGL: Deep graph library
Gephi: Graph visualization software

Books

Networks, Crowds, and Markets by Easley and Kleinberg (free online)
Network Science by Barabási (free online)
Graph Representation Learning by Hamilton

Datasets

Stanford Network Analysis Project (SNAP)
Konect
Network Repository

Module 7 explores networks and graph data—from classical network analysis to modern graph neural networks. Understanding the structure of relationships unlocks insights impossible to gain from tabular data alone.