DATA 202 Module 8: Foundation Models and Generative AI
Introduction
A paradigm shift has occurred in AI. Instead of training a specialized model for each task, we now train massive models on vast datasets and adapt them to downstream applications. These foundation models—GPT, BERT, CLIP, Stable Diffusion—provide a common base on which countless applications are built.
This module explores the foundation model paradigm: how these models work, how to use them effectively, and what their emergence means for data science practice.
Part 1: The Foundation Model Paradigm
What Makes a Foundation Model?
Foundation models share key characteristics:
- Trained on massive, diverse datasets
- Large scale (billions of parameters)
- Self-supervised learning (no manual labels)
- Transfer to many downstream tasks
- Emergent capabilities at scale
The Training Recipe
Pre-training: Learn general representations from unlabeled data
- Language models: Predict next token
- Vision models: Contrastive learning, masked image modeling
- Multimodal: Align representations across modalities
Fine-tuning: Adapt to specific tasks with labeled data
- Full fine-tuning: Update all parameters
- Parameter-efficient: Update subset (LoRA, adapters)
- Prompt tuning: Learn task-specific prompts
In-context learning: Task specification through examples
- No gradient updates
- The model picks up the task from the examples and instructions in the prompt
The Scaling Laws
Research has shown that performance improves predictably, roughly as a power law, with:
- More training data
- More parameters
- More compute
Chinchilla scaling: For a fixed compute budget, it is optimal to scale data and parameters together, roughly 20 training tokens per parameter.
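As a rough illustration, here is a back-of-the-envelope sketch of that rule of thumb; the 20-tokens-per-parameter figure is an approximation drawn from Hoffmann et al. (2022), and the function name is ours:
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    # Rough compute-optimal training-token budget for a model with n_params parameters
    return n_params * tokens_per_param

for n_params in [1e9, 7e9, 70e9]:
    tokens = chinchilla_optimal_tokens(n_params)
    print(f"{n_params / 1e9:.0f}B params -> ~{tokens / 1e12:.2f}T tokens")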
Part 2: Large Language Models
How LLMs Work
Tokenization: Break text into tokens
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokens = tokenizer.encode("Hello, world!")
print(tokens)                    # [15496, 11, 995, 0]
print(tokenizer.decode(tokens))  # "Hello, world!"
Architecture: Transformer decoder
- Self-attention: Each position attends to all previous positions (causal masking; sketched below)
- Feed-forward layers: Transform representations
- Layer normalization: Stabilize training
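To make the self-attention bullet concrete, here is a minimal sketch of masked (causal) scaled dot-product attention for a single head; the tensor sizes and weight matrices are illustrative, not a full transformer layer:
import torch
x = torch.randn(1, 4, 8)                              # batch=1, seq_len=4, d_model=8 (toy sizes)
Wq, Wk, Wv = (torch.randn(8, 8) for _ in range(3))    # query/key/value projections
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.transpose(-2, -1) / 8 ** 0.5           # similarity between positions
mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))      # hide future positions (causal mask)
weights = scores.softmax(dim=-1)
out = weights @ v                                     # each position mixes earlier positions only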
Training: Predict the next token by minimizing \(\mathcal{L} = -\sum_t \log P(x_t | x_{<t})\)
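The same objective in code: a short sketch of the shifted next-token cross-entropy, with random tensors standing in for a real model's logits and a real batch of token ids:
import torch
import torch.nn.functional as F
vocab_size, seq_len = 50257, 8                          # GPT-2 vocabulary size, toy sequence length
token_ids = torch.randint(0, vocab_size, (1, seq_len))  # stand-in for a tokenized batch
logits = torch.randn(1, seq_len, vocab_size)            # stand-in for the decoder's output
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),          # predictions at positions 0..T-2
    token_ids[:, 1:].reshape(-1),                       # targets are the tokens at 1..T-1
)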
Using LLMs
Through APIs:
from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in simple terms."}
    ]
)
print(response.choices[0].message.content)
Local inference:
from transformers import pipeline
# Note: Llama 2 weights are gated; accept the license on Hugging Face first
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")
output = generator("The key to machine learning is", max_new_tokens=50)
Prompt Engineering
The art of crafting effective prompts:
Zero-shot: Direct instruction
Classify this review as positive or negative: "Great product!"
Few-shot: Provide examples
Review: "Terrible service" → Negative
Review: "Love it!" → Positive
Review: "Waste of money" → ?
Chain-of-thought: Request reasoning
Solve step by step: If a train travels 60 mph for 2.5 hours, how far does it go?
Role prompting: Set context
You are an expert data scientist. Explain...
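These prompt styles are just strings. As a sketch, here is the few-shot example above sent through the chat API from earlier; it assumes the `client` object created in that snippet and that GPT-4 (or any chat model) is available:
few_shot_prompt = (
    'Review: "Terrible service" → Negative\n'
    'Review: "Love it!" → Positive\n'
    'Review: "Waste of money" → ?'
)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": few_shot_prompt}],
)
print(response.choices[0].message.content)  # expected answer: Negative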
Part 3: Multimodal Models
Vision-Language Models
CLIP (OpenAI, 2021): Connects images and text
- Train on 400M image-text pairs
- Contrastive learning aligns embeddings
- Zero-shot image classification via text prompts
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("photo.jpg")  # placeholder path: any local image
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image,
    return_tensors="pt",
    padding=True
)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # probability for each text prompt
Vision-Language Understanding
GPT-4V, Gemini, Claude: LLMs that understand images
- Describe images
- Answer questions about images
- Reason about visual content
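As a sketch of how this looks in practice with the OpenAI client used earlier: the model name and image URL below are placeholders, so check your provider's documentation for current vision-capable models.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: any vision-capable chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this chart and any trends you notice."},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)
print(response.choices[0].message.content)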
Text-to-Image Generation
Stable Diffusion, DALL-E, Midjourney: Generate images from text
- Diffusion models: Gradually denoise random noise
- Text conditioning: CLIP embeddings guide generation
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # generation is impractically slow on CPU
image = pipe("A serene lake at sunset, oil painting style").images[0]
image.save("output.png")
Part 4: RAG and Knowledge Integration
Retrieval-Augmented Generation
LLMs have knowledge cutoffs and can hallucinate. RAG grounds generation in retrieved documents:
- Index: Embed documents into a vector database
- Retrieve: Find the documents most relevant to the query
- Generate: Provide the retrieved documents as context for the LLM
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
# Create vector store from documents (assumes `documents` was loaded earlier)
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)
# Create a QA chain that retrieves context and passes it to the LLM
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever()
)
answer = qa.run("What is the company's return policy?")
Vector Databases
Store and search embeddings efficiently:
- Pinecone: Managed vector database
- Weaviate: Open-source, multimodal
- Chroma: Lightweight, Python-native
- Faiss: Facebook AI Research's similarity-search library (used directly in the sketch below)
- Milvus: Scalable open-source
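To show what these libraries do underneath, here is a minimal sketch of nearest-neighbor search with Faiss over random vectors; the dimension and data are illustrative stand-ins for real document embeddings:
import faiss
import numpy as np
d = 384                                             # embedding dimension (illustrative)
corpus = np.random.rand(1000, d).astype("float32")  # stand-ins for 1,000 document embeddings
index = faiss.IndexFlatL2(d)                        # exact L2 search; approximate indexes also exist
index.add(corpus)
query = np.random.rand(1, d).astype("float32")      # stand-in for an embedded query
distances, ids = index.search(query, 3)             # ids of the 3 nearest documents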
Agents and Tool Use
LLMs as reasoning engines that use tools:
- Search the web
- Execute code
- Query databases
- Call APIs
from langchain.agents import AgentExecutor, create_openai_functions_agent
# `llm`, `prompt`, and the tools below are assumed to be defined earlier
tools = [search_tool, calculator_tool, code_executor_tool]
agent = create_openai_functions_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools)
result = executor.invoke({"input": "What's the current weather in Beirut?"})
Part 5: Fine-Tuning and Customization
When to Fine-Tune
Fine-tuning makes sense when:
- Task requires specialized knowledge
- Consistent style or format needed
- Performance gains justify cost
- Data privacy requires local model
Parameter-Efficient Fine-Tuning
LoRA (Low-Rank Adaptation):
- Add small trainable matrices to attention layers
- Freeze original weights
- Dramatic reduction in trainable parameters
from peft import get_peft_model, LoraConfig
config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # only ~0.1% of parameters are trainable
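Why the reduction is so dramatic, as rough arithmetic (the 4096-wide projection is illustrative, roughly the size of one attention matrix in a 7B-parameter model):
d, r = 4096, 8                   # hidden size of one projection, LoRA rank (illustrative)
full = d * d                     # parameters in the frozen weight matrix
lora = 2 * d * r                 # parameters in the two low-rank factors A and B
print(full, lora, lora / full)   # 16777216 vs 65536 -> ~0.4% per adapted matrix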
Instruction Tuning
Train models to follow instructions:
- Collect instruction-output pairs
- Fine-tune on diverse instructions
- Results in more helpful, harmless models
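One widely used record format for instruction data (Alpaca-style; field names vary by dataset) looks roughly like this:
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Foundation models are trained on broad data at scale and adapted to many tasks...",
    "output": "Foundation models are large pre-trained models that can be adapted to many downstream tasks.",
}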
DEEP DIVE: The GPT Series and the AI Revolution
From GPT-1 to GPT-4
GPT-1 (2018): 117M parameters. Showed language model pre-training + fine-tuning works.
GPT-2 (2019): 1.5B parameters. Zero-shot task performance. OpenAI initially withheld due to misuse concerns.
GPT-3 (2020): 175B parameters. Few-shot learning. The paper “Language Models are Few-Shot Learners” demonstrated emergent capabilities at scale.
GPT-4 (2023): Unknown size (rumored 1T+). Multimodal (vision). Passes professional exams. Powers ChatGPT Plus.
The Emergence of Capabilities
At certain scales, new abilities have been reported to appear rather suddenly:
- Multi-digit arithmetic: reported around the 10B-parameter scale
- Chain-of-thought reasoning: reported around the 100B-parameter scale
- Theory of mind: debated, but some evidence in the largest models
This emergence is both exciting and concerning—we can’t predict what abilities will appear at the next scale.
The Industry Transformation
ChatGPT’s release in November 2022 was a watershed:
- Reached an estimated 100 million users within two months, then the fastest-growing consumer application in history
- Microsoft reportedly invested $10 billion in OpenAI
- Google declared “code red”
- Startups raised billions
- Every company exploring AI integration
The foundation model paradigm is reshaping software development, knowledge work, and potentially education, healthcare, and law.
HANDS-ON EXERCISE: Building with Foundation Models
Part 1: Text Generation with Open Models
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_name = "microsoft/phi-2"  # small but capable open model
tokenizer = AutoTokenizer.from_pretrained(model_name)
# float16 assumes a GPU; on CPU, drop torch_dtype (generation will just be slower)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
def generate(prompt, max_length=200):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs, max_length=max_length, do_sample=True,
        pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad-token warning
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generate("Explain machine learning to a 10-year-old:"))
Part 2: Image Understanding with CLIP
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
def classify_image(image_path, labels):
    image = Image.open(image_path)
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1)
    return {label: prob.item() for label, prob in zip(labels, probs[0])}
labels = ["a cat", "a dog", "a bird", "a car"]
result = classify_image("image.jpg", labels)
Part 3: RAG System
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
# Load and split documents
loader = TextLoader("documents.txt")
documents = loader.load()
splitter = CharacterTextSplitter(chunk_size=1000)
docs = splitter.split_documents(documents)
# Create vector store
embeddings = HuggingFaceEmbeddings()
db = FAISS.from_documents(docs, embeddings)
# Query
query = "What is the main topic?"
results = db.similarity_search(query, k=3)
for doc in results:
    print(doc.page_content[:200])
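To close the loop, here is a sketch of handing the retrieved chunks to a generator. It reuses the `generate` helper and model from Part 1, which is an assumption about how your notebook is organized; any LLM call would do here.
# Build a grounded prompt from the retrieved chunks and let the Part 1 model answer
context = "\n\n".join(doc.page_content for doc in results)
rag_prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}\nAnswer:"
)
print(generate(rag_prompt, max_length=400))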
Recommended Resources
Documentation
Courses
Papers
- “Attention Is All You Need” (Vaswani et al., 2017)
- “Language Models are Few-Shot Learners” (Brown et al., 2020)
- “Training Compute-Optimal Large Language Models” (Hoffmann et al., 2022)
Module 8 explores foundation models and generative AI—the paradigm shift from task-specific models to massive pre-trained systems that can be adapted for countless applications. Understanding how to use and customize these models is essential for modern data science.