DATA 202 Module 9: Model Deployment and MLOps

Introduction

A model in a Jupyter notebook helps no one. The value of machine learning comes when models make predictions in production—powering applications, informing decisions, serving users. This transition from prototype to production is where many data science projects fail.

This module covers the engineering practices needed to deploy and maintain ML systems: containerization, serving infrastructure, monitoring, and the emerging discipline of MLOps.


Part 1: From Notebook to Production

The Deployment Gap

Why do so many ML projects never reach production?

Technical Challenges:

  - Training environments differ from production in dependencies, hardware, and data pipelines
  - Models depend on data that keeps changing after deployment
  - Latency, throughput, and reliability requirements that notebooks never face

Organizational Challenges:

  - Data scientists and engineers use different tools and own different stages of the lifecycle
  - Unclear ownership of a model once it is handed off
  - Success measured by offline metrics rather than business outcomes

The ML Lifecycle

  1. Development: Experimentation, model training
  2. Validation: Testing, review, approval
  3. Deployment: Serving in production
  4. Monitoring: Performance tracking, drift detection
  5. Retraining: Model updates based on new data

MLOps applies DevOps principles to this lifecycle.


Part 2: Packaging and Containerization

Environment Management

Reproducible environments are essential:

requirements.txt:

scikit-learn==1.3.0
pandas==2.0.3
numpy==1.24.3

Conda environment.yml:

name: ml-app
dependencies:
  - python=3.10
  - scikit-learn=1.3.0
  - pandas=2.0.3

Docker for ML

Docker packages code with dependencies in isolated containers.

Dockerfile:

FROM python:3.10-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer across rebuilds
COPY requirements.txt .
RUN pip install -r requirements.txt

# Then copy the model artifact and application code
COPY model.pkl .
COPY app.py .

EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Build and run:

docker build -t ml-model .
docker run -p 8000:8000 ml-model

Part 3: Model Serving

Serving Approaches

Batch Inference: Score large datasets offline on a schedule (e.g., nightly churn scores); see the sketch after this list

Real-Time Inference: Serve on-demand predictions over an API with low latency

Edge Inference: Run the model on-device (mobile, IoT) to avoid network round-trips
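
As a point of contrast with the real-time API below, a batch job can be as simple as loading the model, scoring a file, and writing results. A minimal sketch (the file names and column layout here are assumptions for illustration, not part of the module's code):

import joblib
import pandas as pd

# Load the trained model once
model = joblib.load("model.pkl")

# Score an entire dataset offline; input/output paths are hypothetical
df = pd.read_parquet("daily_customers.parquet")
df["prediction"] = model.predict(df[["feature1", "feature2", "feature3"]])
df.to_parquet("daily_predictions.parquet")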

Building APIs with FastAPI

from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("model.pkl")  # loaded once at startup, shared across requests

class PredictionInput(BaseModel):
    feature1: float
    feature2: float
    feature3: float

class PredictionOutput(BaseModel):
    prediction: float
    probability: float

@app.post("/predict", response_model=PredictionOutput)
def predict(payload: PredictionInput):
    # Pydantic has already validated the types; build a 2D array for scikit-learn
    features = [[payload.feature1, payload.feature2, payload.feature3]]
    pred = model.predict(features)[0]
    prob = model.predict_proba(features)[0].max()
    return PredictionOutput(prediction=pred, probability=prob)
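
Run the service with uvicorn app:app and it can be exercised from any HTTP client. A quick sketch using the requests library (the host and feature values are assumptions):

import requests

# POST a JSON body matching the PredictionInput schema
resp = requests.post(
    "http://localhost:8000/predict",
    json={"feature1": 5.1, "feature2": 3.5, "feature3": 1.4},
)
print(resp.json())  # e.g. {"prediction": 0.0, "probability": 0.97}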

Scaling Inference

Horizontal scaling: Multiple instances behind a load balancer

Model optimization: Quantization, pruning, distillation

Caching: Cache repeated predictions (see the sketch below)

Batching: Process multiple requests together
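
Caching is the easiest of these to sketch: if identical inputs recur, memoize the answer. A minimal version using functools.lru_cache, assuming a model object is already in scope as in the API above:

from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_predict(feature1: float, feature2: float, feature3: float) -> float:
    # Runs the model only on a cache miss; repeated inputs return instantly
    return float(model.predict([[feature1, feature2, feature3]])[0])

This only helps when inputs repeat exactly; for continuous features, teams typically round inputs or use an external cache such as Redis.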

Serving Platforms

Dedicated serving systems handle model loading, versioning, and scaling; common options include TensorFlow Serving, TorchServe, NVIDIA Triton Inference Server, KServe, and BentoML.

Part 4: Monitoring and Maintenance

What to Monitor

System Metrics:

  - Latency, throughput, and error rates
  - CPU, memory, and GPU utilization

Model Metrics:

  - Prediction distribution and confidence over time
  - Accuracy against ground-truth labels as they arrive
  - Input feature distributions

Data and Model Drift

Data Drift: Input distribution changes

Concept Drift: Relationship between input and output changes

Detection:

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Compare recent production inputs against the training-time reference
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=prod_df)
report.save_html("drift_report.html")  # per-feature drift results and plots
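
For a lighter-weight check, a two-sample Kolmogorov-Smirnov test per feature can flag distribution shifts using only SciPy. A sketch, reusing the train_df and prod_df frames above and assuming numeric columns (the 0.05 threshold is a conventional choice, not a universal rule):

from scipy.stats import ks_2samp

# Compare each feature's training vs. production distribution
for col in train_df.columns:
    stat, p_value = ks_2samp(train_df[col], prod_df[col])
    if p_value < 0.05:
        print(f"Possible drift in {col}: KS={stat:.3f}, p={p_value:.4f}")

With large production samples even tiny shifts become statistically "significant", so in practice teams often threshold on the KS statistic itself rather than the p-value.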

When to Retrain

Triggers for retraining:

  - A fixed schedule (weekly, monthly)
  - Drift alerts on input features or prediction distributions
  - Accuracy falling below an agreed threshold once labels arrive
  - A significant batch of new labeled data


Part 5: MLOps Best Practices

Version Control

Track not just code but:

  - Data: dataset versions and snapshots (tools like DVC)
  - Models: trained artifacts and their lineage (tools like MLflow)
  - Configuration: hyperparameters and training settings

CI/CD for ML

Continuous Integration:

  - Run unit tests on feature engineering and preprocessing code
  - Validate data schemas before training
  - Evaluate candidate models against a quality threshold (see the test sketch below)

Continuous Deployment:

  - Automate packaging and release of approved models
  - Roll out gradually with canary or shadow deployments
  - Keep the previous model ready for fast rollback
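
A concrete CI quality gate can be an ordinary test that fails the pipeline when a retrained model underperforms. A sketch in pytest style (the file names, "label" column, and 0.90 threshold are assumptions):

import joblib
import pandas as pd
from sklearn.metrics import accuracy_score

def test_model_meets_accuracy_threshold():
    # A candidate model must beat an agreed floor before it can ship
    model = joblib.load("model.pkl")
    holdout = pd.read_csv("holdout.csv")
    X, y = holdout.drop(columns=["label"]), holdout["label"]
    assert accuracy_score(y, model.predict(X)) >= 0.90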

Feature Stores

Centralized feature management:

  - A single definition of each feature shared by training and serving
  - Consistent transformations offline and online, avoiding training/serving skew
  - Point-in-time correct historical values for building training sets

Platforms: Feast, Tecton, Hopsworks
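
As a taste of the API, fetching online features in Feast looks roughly like this. A minimal sketch: the feature view driver_stats, its fields, and the driver_id entity are hypothetical, and a configured Feast repo is assumed.

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Fetch the latest feature values for one entity at serving time
features = store.get_online_features(
    features=["driver_stats:trips_today", "driver_stats:avg_rating"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()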


DEEP DIVE: When ML Systems Fail in Production

The Knight Capital Disaster

On August 1, 2012, Knight Capital Group lost $440 million in 45 minutes due to a software deployment failure. It was not an ML system, but it illustrates the risks of any automated deployment.

A technician forgot to update a server with new code. Old code interpreted new trading signals incorrectly. Automated systems amplified the error, buying high and selling low at massive scale.

ML-Specific Failures

Tay Chatbot (Microsoft, 2016): Learned from user interactions. Within hours, trolls trained it to produce offensive content. Shut down within 16 hours.

Amazon Hiring Tool: Trained on historical hiring data. Learned to penalize women because past hires were mostly men. Abandoned after attempted fixes failed.

Healthcare Algorithm Bias: The Optum algorithm (discussed in ethics module) showed racial bias discovered years after deployment—only caught through external research.

Lessons

  1. Testing isn’t enough: Production conditions differ from test
  2. Monitor continuously: Catch problems early
  3. Plan for rollback: Quick reversion when things go wrong
  4. Human oversight: Critical decisions need human review
  5. Adversarial thinking: Consider how systems can be attacked or misused

HANDS-ON EXERCISE: Deploying a Model

Part 1: Create Serving Application
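
The API below expects a model.pkl beside it. One way to produce one, sketched with scikit-learn's built-in iris dataset (which matches the four feature names used in the app); the choice of LogisticRegression is arbitrary:

# train.py -- create the model artifact the API will serve
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import joblib

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)
joblib.dump(model, "model.pkl")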

# app.py
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI(title="ML Model API")

# Load model at startup
model = joblib.load("model.pkl")

class Features(BaseModel):
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float

@app.post("/predict")
def predict(features: Features):
    X = np.array([[
        features.sepal_length,
        features.sepal_width,
        features.petal_length,
        features.petal_width
    ]])
    prediction = model.predict(X)[0]
    probabilities = model.predict_proba(X)[0].tolist()
    return {
        "prediction": int(prediction),
        "probabilities": probabilities
    }

@app.get("/health")
def health():
    return {"status": "healthy"}

Part 2: Dockerize

FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model.pkl .
COPY app.py .

EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Part 3: Add Monitoring

# Additions to app.py: instrument /predict (this replaces the Part 1 handler)
import time
from prometheus_client import Counter, Histogram, start_http_server

# Metrics
REQUESTS = Counter('prediction_requests_total', 'Total prediction requests')
LATENCY = Histogram('prediction_latency_seconds', 'Prediction latency')

@app.post("/predict")
def predict(features: Features):
    REQUESTS.inc()
    start = time.time()

    # Prediction code from Part 1
    X = np.array([[features.sepal_length, features.sepal_width,
                   features.petal_length, features.petal_width]])
    result = {
        "prediction": int(model.predict(X)[0]),
        "probabilities": model.predict_proba(X)[0].tolist(),
    }

    LATENCY.observe(time.time() - start)
    return result

# Serve metrics on a separate port alongside the API
start_http_server(9090)
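
With this in place, a Prometheus server can scrape port 9090 (conventionally at /metrics) while the API itself keeps serving on port 8000; note that the Dockerfile above would need to expose port 9090 as well.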


Module 9 covers model deployment and MLOps—the engineering practices that turn notebook prototypes into production systems. From containerization to monitoring to handling failures, we learn what it takes to keep ML systems running reliably in the real world.