DATA 202 Module 7: Networks and Graph Data

Introduction

Many real-world systems are best understood as networks: social connections, biological interactions, transportation routes, financial transactions, knowledge structures. Graph data science provides tools to analyze these relational systems—finding communities, identifying influential nodes, predicting missing connections, and propagating information through networks.

This module explores network analysis and graph machine learning, from classical algorithms to modern graph neural networks.


Part 1: Network Fundamentals

Graphs and Networks

A graph G = (V, E) consists of:

Types:

Network Representations

Adjacency Matrix: N×N matrix A where A[i,j] = 1 if edge exists

Edge List: List of (source, target) pairs

Adjacency List: For each node, list of neighbors

NetworkX Basics

import networkx as nx
import matplotlib.pyplot as plt

# Create graph
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)])

# Basic properties
print(f"Nodes: {G.number_of_nodes()}")
print(f"Edges: {G.number_of_edges()}")
print(f"Degree of node 3: {G.degree(3)}")

# Visualization
nx.draw(G, with_labels=True, node_color='lightblue')
plt.show()

Part 2: Network Analysis

Centrality Measures

Degree Centrality: Number of connections

nx.degree_centrality(G)

Betweenness Centrality: How often node lies on shortest paths

nx.betweenness_centrality(G)

Closeness Centrality: Average distance to all other nodes

nx.closeness_centrality(G)

PageRank: Importance based on importance of neighbors

nx.pagerank(G)

Eigenvector Centrality: Connected to well-connected nodes

nx.eigenvector_centrality(G)

Community Detection

Finding densely connected groups:

Louvain Algorithm: Optimize modularity greedily

from community import community_louvain
partition = community_louvain.best_partition(G)

Label Propagation: Iterative label spreading

communities = nx.community.label_propagation_communities(G)

Girvan-Newman: Edge betweenness removal

communities = nx.community.girvan_newman(G)

Network Properties

Clustering Coefficient: Probability neighbors are connected Path Length: Average distance between nodes Diameter: Maximum shortest path Density: Fraction of possible edges that exist Assortativity: Tendency of similar nodes to connect


Part 3: Graph Machine Learning

Node Embeddings

Learn vector representations of nodes:

DeepWalk: Random walks + Word2Vec

  1. Generate random walks from each node
  2. Treat walks as sentences
  3. Apply Skip-gram to learn embeddings

Node2Vec: Biased random walks

from node2vec import Node2Vec

node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200)
model = node2vec.fit(window=10, min_count=1)
embeddings = {node: model.wv[str(node)] for node in G.nodes()}

Predict missing or future edges:

Classical Features:

preds = nx.adamic_adar_index(G, [(1, 4), (2, 5)])
for u, v, score in preds:
    print(f"Edge ({u}, {v}): {score:.3f}")

Graph Neural Networks

Message Passing: Nodes aggregate information from neighbors

GCN (Graph Convolutional Network): \(H^{(l+1)} = \sigma(\tilde{A} H^{(l)} W^{(l)})\)

GraphSAGE: Sample and aggregate GAT (Graph Attention Network): Attention over neighbors GIN (Graph Isomorphism Network): More expressive aggregation


Part 4: Applications

Social Network Analysis

Biological Networks

Knowledge Graphs

Financial Networks


DEEP DIVE: The PageRank Revolution

How Google Ranked the Web

In 1996, the web was chaotic. Search engines ranked pages by keyword matching—easily gamed with invisible text and keyword stuffing. Finding quality information was a nightmare.

Two Stanford PhD students, Larry Page and Sergey Brin, had a different idea: use the structure of the web itself. Their insight was simple yet profound: a page is important if important pages link to it.

The Algorithm

PageRank treats the web as a graph where pages are nodes and hyperlinks are directed edges. The importance of a page is defined recursively:

\[PR(p) = \frac{1-d}{N} + d \sum_{q \in B_p} \frac{PR(q)}{L(q)}\]

Where:

Interpretation: Imagine a random surfer following links. PageRank is the probability they land on each page in the long run.

Why It Worked

PageRank was:

PageRank principles now appear everywhere:

The insight that network structure encodes importance transformed not just search but our understanding of complex systems.


HANDS-ON EXERCISE: Social Network Analysis

Part 1: Load and Explore Network

import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt

# Load a social network (Zachary's Karate Club)
G = nx.karate_club_graph()

print(f"Nodes: {G.number_of_nodes()}")
print(f"Edges: {G.number_of_edges()}")
print(f"Density: {nx.density(G):.4f}")
print(f"Clustering: {nx.average_clustering(G):.4f}")

# Degree distribution
degrees = [d for n, d in G.degree()]
plt.hist(degrees, bins=range(max(degrees)+2))
plt.xlabel('Degree')
plt.ylabel('Count')
plt.title('Degree Distribution')
plt.show()

Part 2: Centrality Analysis

# Compute centralities
centralities = pd.DataFrame({
    'degree': nx.degree_centrality(G),
    'betweenness': nx.betweenness_centrality(G),
    'closeness': nx.closeness_centrality(G),
    'eigenvector': nx.eigenvector_centrality(G),
    'pagerank': nx.pagerank(G)
})

print("Top 5 nodes by PageRank:")
print(centralities.nlargest(5, 'pagerank'))

# Visualize with size by centrality
pos = nx.spring_layout(G, seed=42)
node_size = [3000 * centralities.loc[n, 'pagerank'] for n in G.nodes()]
nx.draw(G, pos, node_size=node_size, with_labels=True)
plt.title('Node size = PageRank')
plt.show()

Part 3: Community Detection

from community import community_louvain

# Detect communities
partition = community_louvain.best_partition(G)

# Visualize
colors = [partition[n] for n in G.nodes()]
nx.draw(G, pos, node_color=colors, cmap=plt.cm.Set1, with_labels=True)
plt.title('Communities (Louvain)')
plt.show()

# Compare to ground truth (club split)
club_labels = [G.nodes[n]['club'] for n in G.nodes()]
print("Ground truth vs detected:")
for n in G.nodes():
    print(f"Node {n}: {club_labels[n-1]} -> Community {partition[n]}")

Libraries

Books

Datasets


Module 7 explores networks and graph data—from classical network analysis to modern graph neural networks. Understanding the structure of relationships unlocks insights impossible to gain from tabular data alone.