DATA 202 Module 4: OCR and Document AI

Introduction

Much of the world’s knowledge is locked in documents—scanned papers, PDFs, handwritten notes, historical records, receipts, and forms. Extracting information from these documents requires Optical Character Recognition (OCR) and increasingly sophisticated Document AI systems.

This module explores the technologies for converting document images to structured data: from classical OCR to modern deep learning approaches that understand document layout, tables, and meaning.


Part 1: From Images to Text

The OCR Pipeline

Traditional OCR follows a pipeline:

  1. Image Preprocessing: Deskew, denoise, binarize
  2. Layout Analysis: Detect text regions, columns, paragraphs
  3. Line Segmentation: Separate individual text lines
  4. Character Segmentation: Isolate characters
  5. Character Recognition: Classify each character
  6. Post-processing: Language models, spell checking
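As a toy illustration of the post-processing stage (step 6), an edit-distance lookup against a vocabulary can repair single-character recognition errors. The word list and distance threshold below are illustrative assumptions, not part of any real OCR engine:

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance (one rolling row)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # old dp[j] = above, new dp[j-1] = left, prev = diagonal
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

VOCAB = ["receipt", "invoice", "total", "amount"]  # toy vocabulary

def correct(word):
    """Snap an OCR output to the closest vocabulary word, if close enough."""
    best = min(VOCAB, key=lambda v: edit_distance(word, v))
    return best if edit_distance(word, best) <= 2 else word

print(correct("rece1pt"))  # receipt
```

Real systems use statistical language models rather than a fixed word list, but the principle is the same: prefer outputs that are plausible text.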

Classical Approaches

Template Matching: Compare character images against known templates

Feature-Based: Extract features (strokes, curves), classify

Hidden Markov Models: Model character sequences
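Template matching, the simplest of the three, can be sketched in a few lines: score a glyph against each stored template by pixel overlap and pick the best match. The 3x3 "templates" below are toy stand-ins for real character images:

```python
import numpy as np

# Toy binary glyph templates for two "characters"
TEMPLATES = {
    "I": np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    "-": np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]]),
}

def match_character(glyph):
    """Return the template label with the highest pixel-agreement score."""
    scores = {c: int(np.sum(glyph == t)) for c, t in TEMPLATES.items()}
    return max(scores, key=scores.get)

print(match_character(np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]])))  # I
```

Real template matchers use normalized cross-correlation and many templates per character, and the approach degrades quickly under font, size, and noise variation, which is what motivated feature-based and learned methods.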

The Tesseract Story

Tesseract, now the world's most widely used open-source OCR engine, has a remarkable history: it was developed at HP Labs between 1985 and 1994, open-sourced in 2005, and later maintained with Google's sponsorship; version 4 replaced the classical character classifier with an LSTM-based recognizer. From Python, it is typically accessed through the pytesseract wrapper:

import pytesseract
from PIL import Image

# Basic OCR
text = pytesseract.image_to_string(Image.open('document.png'))

# With language specification
text = pytesseract.image_to_string(
    Image.open('arabic_doc.png'),
    lang='ara'
)

# Get detailed information
data = pytesseract.image_to_data(
    Image.open('document.png'),
    output_type=pytesseract.Output.DICT
)

Part 2: Deep Learning Revolution

End-to-End Recognition

Modern OCR uses deep learning for end-to-end recognition:

CNNs for Visual Features: Extract visual representations
RNNs/Transformers for Sequences: Model character sequences
CTC Loss: Handle variable-length sequences without explicit alignment
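The CTC decoding rule itself fits in a few lines: take the per-frame best label, collapse consecutive repeats, then drop the blank symbol. This is a minimal greedy-decoding sketch; production decoders typically use beam search with a language model:

```python
def ctc_greedy_decode(frame_labels, blank="_"):
    """Collapse repeated labels, then drop blanks (CTC's decoding rule)."""
    decoded = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return "".join(decoded)

# Per-frame argmax labels over 7 frames:
print(ctc_greedy_decode(["c", "c", "_", "a", "a", "_", "t"]))  # cat
# The blank also separates genuine double letters:
print(ctc_greedy_decode(["h", "_", "e", "l", "_", "l", "o"]))  # hello
```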

Key Architectures

CRNN (Convolutional Recurrent Neural Network): A CNN extracts a sequence of visual features from the text-line image, a bidirectional LSTM models the sequence, and CTC loss aligns predictions with the label string.

Attention-Based Models: An encoder-decoder with attention emits one character at a time, learning the image-to-text alignment rather than assuming it.

Transformer-Based OCR: Models such as Microsoft's TrOCR pair a vision-transformer encoder with a text-transformer decoder, treating recognition as image-to-sequence generation.

Cloud OCR Services

Google Cloud Vision: General OCR, document parsing
AWS Textract: Form and table extraction
Azure Computer Vision: OCR and document analysis
Google Document AI: Specialized document processors


Part 3: Document Understanding

Beyond OCR: Document AI

OCR extracts text. Document AI understands structure:

Layout Analysis: Headers, paragraphs, captions, tables
Table Extraction: Rows, columns, cells
Form Understanding: Field-value pairs
Key Information Extraction: Specific data from documents

LayoutLM and Document Transformers

LayoutLM (Microsoft) revolutionized document understanding:

from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification

processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base"
)

# Process document image (the processor runs Tesseract OCR internally by default)
encoding = processor(image, return_tensors="pt")
outputs = model(**encoding)

# Per-token label predictions over the OCR'd words
predictions = outputs.logits.argmax(-1)

Handwriting Recognition

Handwriting recognition (HTR) is harder than printed text: writing styles vary widely between writers, cursive strokes connect letters, and slant and spacing shift even within a single page.

Modern approaches:

CNN + RNN models with CTC loss, trained on whole text lines
Transformer encoder-decoder models such as TrOCR
Synthetic handwriting generation and augmentation to cover stylistic variety


Part 4: Practical Document Processing

PDF Extraction

PDFs may be:

Born-digital: Text is embedded and can be extracted directly
Scanned: Pages are images and require OCR
Hybrid: A mix of embedded text and scanned pages

import pdfplumber
import pandas as pd

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        # Extract text
        text = page.extract_text()

        # Extract tables
        tables = page.extract_tables()
        for table in tables:
            df = pd.DataFrame(table[1:], columns=table[0])

Preprocessing for Better OCR

Image quality dramatically affects OCR:

import cv2
import numpy as np

def preprocess_for_ocr(image_path):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # Denoise
    img = cv2.fastNlMeansDenoising(img, h=10)

    # Binarize (Otsu's method)
    _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Deskew: estimate the skew angle from the text (dark) pixels
    coords = np.column_stack(np.where(img == 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # minAreaRect angle conventions changed in OpenCV 4.5; this follows
    # the older convention of angles in (-90, 0]
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    (h, w) = img.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    img = cv2.warpAffine(img, M, (w, h), borderValue=255)

    return img

Handling Arabic Script

Arabic presents unique challenges:

Right-to-left direction: Text runs right to left, often mixed with left-to-right numerals
Cursive script: Letter shapes change depending on position in the word
Diacritics and ligatures: Optional marks and joined forms multiply glyph variety

# Tesseract with Arabic
text = pytesseract.image_to_string(
    image,
    lang='ara',
    config='--psm 3'  # Automatic page segmentation
)
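One subtlety worth seeing directly: extracted Arabic text is stored in logical order (first letter first); right-to-left rendering is the display layer's job. Python's standard unicodedata module exposes the bidirectional class that drives this:

```python
import unicodedata

# Arabic letters carry the bidirectional class "AL" (Arabic Letter),
# while Latin letters are "L" (Left-to-Right)
print(unicodedata.bidirectional("ب"))  # AL
print(unicodedata.bidirectional("b"))  # L
print(unicodedata.name("ب"))  # ARABIC LETTER BEH
```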

DEEP DIVE: The Digitization of Human Knowledge

The Dream of Universal Access

In 2004, Google announced an audacious project: scan every book ever printed and make them searchable. The Google Books project would eventually scan over 40 million books—a significant fraction of humanity’s written knowledge.

But scanning is just imaging. Making books searchable required OCR at unprecedented scale: millions of pages in hundreds of languages, centuries of typefaces and spelling conventions, and scans degraded by warped pages, stains, and fading ink.

reCAPTCHA: Humans as OCR Workers

When computers struggled with difficult words, Google found an ingenious solution: reCAPTCHA.

Remember those distorted word puzzles used to prove you’re human? They served a dual purpose. One word tested you; the other was an OCR failure that needed human help. By distributing difficult words across millions of CAPTCHA challenges, Google crowdsourced OCR correction.

reCAPTCHA digitized 13 million articles from the New York Times archives and helped correct Google Books. Users unknowingly contributed to the digitization of human knowledge, one CAPTCHA at a time.

The Continuing Challenge

Despite advances, OCR remains imperfect:

Historical typefaces and archaic spellings confuse models trained on modern text
Degraded, stained, or warped scans lose characters outright
Handwritten material still resists full automation
Low-resource languages and scripts lack training data

Each archive, each language, each historical period presents new challenges. The dream of universal access to recorded knowledge remains a work in progress.


HANDS-ON EXERCISE: Building a Document Processing Pipeline

Overview

Students will:

  1. Preprocess document images
  2. Apply OCR with Tesseract
  3. Extract structured information
  4. Handle multi-language documents

Part 1: Basic OCR

import pytesseract
from PIL import Image
import cv2
import numpy as np

# Load and preprocess image
def preprocess_image(image_path):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    denoised = cv2.fastNlMeansDenoising(gray)
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

# OCR with confidence scores
def ocr_with_confidence(image_path):
    img = preprocess_image(image_path)
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

    results = []
    for i, word in enumerate(data['text']):
        if word.strip():
            results.append({
                'word': word,
                # 'conf' may be a string in older pytesseract versions
                'confidence': int(data['conf'][i]),
                'bbox': (data['left'][i], data['top'][i],
                        data['width'][i], data['height'][i])
            })
    return results

results = ocr_with_confidence('document.png')
high_conf = [r for r in results if r['confidence'] > 80]

Part 2: Table Extraction

import pdfplumber
import pandas as pd

def extract_tables_from_pdf(pdf_path):
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            page_tables = page.extract_tables()
            for j, table in enumerate(page_tables):
                if table and len(table) > 1:
                    df = pd.DataFrame(table[1:], columns=table[0])
                    df['page'] = i + 1
                    df['table_num'] = j + 1
                    tables.append(df)
    return tables

Part 3: Key Information Extraction

import re

def extract_invoice_info(text):
    """Extract key fields from invoice text."""
    info = {}

    # Invoice number
    inv_match = re.search(r'Invoice\s*#?\s*:?\s*(\w+)', text, re.IGNORECASE)
    if inv_match:
        info['invoice_number'] = inv_match.group(1)

    # Date
    date_match = re.search(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}', text)
    if date_match:
        info['date'] = date_match.group()

    # Total amount
    total_match = re.search(r'Total:?\s*\$?\s*([\d,]+\.?\d*)', text, re.IGNORECASE)
    if total_match:
        info['total'] = float(total_match.group(1).replace(',', ''))

    return info
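The regexes above capture raw strings; dates usually need normalizing before they are useful downstream. A minimal stdlib sketch (the format list and day-first preference are assumptions about the invoices at hand):

```python
from datetime import datetime

def normalize_date(raw):
    """Try common invoice date formats; return an ISO date string or None."""
    # Day-first formats are tried before month-first: an assumption that
    # only knowledge of the document source can justify
    for fmt in ("%d/%m/%Y", "%m/%d/%Y", "%d-%m-%Y", "%d/%m/%y"):
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(normalize_date("12/03/2024"))  # 2024-03-12 under the day-first assumption
```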

Libraries

pytesseract: Python wrapper for the Tesseract OCR engine
pdfplumber: Text and table extraction from PDFs
OpenCV: Image preprocessing (denoising, binarization, deskewing)
Hugging Face Transformers: LayoutLM, TrOCR, and other document models

Cloud Services

Google Cloud Vision / Document AI: General OCR and specialized document processors
AWS Textract: Form and table extraction
Azure Computer Vision: OCR and document analysis

Papers

Shi et al. (2016), "An End-to-End Trainable Neural Network for Image-based Sequence Recognition" (CRNN)
Xu et al. (2020), "LayoutLM: Pre-training of Text and Layout for Document Image Understanding"
Li et al. (2021), "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models"


Module 4 explores OCR and Document AI—the technologies for extracting information from document images. From classical character recognition to modern deep learning systems that understand document structure, we learn to unlock the knowledge trapped in unstructured documents.