DATA 202 Module 4: OCR and Document AI

Introduction

Much of the world’s knowledge is locked in documents—scanned papers, PDFs, handwritten notes, historical records, receipts, and forms. Extracting information from these documents requires Optical Character Recognition (OCR) and increasingly sophisticated Document AI systems.

This module explores the technologies for converting document images to structured data: from classical OCR to modern deep learning approaches that understand document layout, tables, and meaning.

Part 1: From Images to Text

The OCR Pipeline

Traditional OCR follows a pipeline:

Image Preprocessing: Deskew, denoise, binarize
Layout Analysis: Detect text regions, columns, paragraphs
Line Segmentation: Separate individual text lines
Character Segmentation: Isolate characters
Character Recognition: Classify each character
Post-processing: Language models, spell checking
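
The post-processing stage can be illustrated with a small dictionary-based corrector. This is a minimal sketch using only the standard library; the vocabulary `VOCAB`, the helper names `correct_word` and `correct_text`, and the `cutoff` value are assumptions made for illustration, and production systems typically rely on much larger lexicons or language models.

```python
import difflib

# Hypothetical vocabulary for illustration; a real lexicon would be far larger.
VOCAB = ["invoice", "total", "amount", "date", "payment"]

def correct_word(word, vocab=VOCAB, cutoff=0.75):
    """Return the closest vocabulary word, or the input unchanged if none is close."""
    if word.lower() in vocab:
        return word
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def correct_text(text):
    """Apply word-level correction to whitespace-separated OCR output."""
    return " ".join(correct_word(w) for w in text.split())

print(correct_text("t0tal amount on invo1ce"))
```

A higher `cutoff` makes the corrector more conservative, which matters for fields such as names or part numbers that should not be "corrected" into dictionary words.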

Classical Approaches

Template Matching: Compare character images against known templates

Fast but brittle
Fails with font variation
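
Template matching and its brittleness can be seen in a toy example. The sketch below assumes made-up 3x3 binary glyphs (`TEMPLATES`, `SERIF_I`) and a simple pixel-agreement score; real systems use larger bitmaps and normalized correlation, but the failure mode is the same.

```python
# Hypothetical 3x3 templates: '1' is ink, '0' is background.
TEMPLATES = {
    "I": ("010", "010", "010"),
    "L": ("100", "100", "111"),
    "T": ("111", "010", "010"),
}

def match_score(glyph, template):
    """Count pixels where the glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )

def classify(glyph, templates=TEMPLATES):
    """Return the label of the template with the highest agreement."""
    return max(templates, key=lambda label: match_score(glyph, templates[label]))

NOISY_I = ("010", "010", "011")   # one flipped pixel
SERIF_I = ("111", "010", "111")   # the same letter in a serif-style font

print(classify(NOISY_I))  # still closest to the "I" template
print(classify(SERIF_I))  # closer to "T" than to "I": font variation breaks it
```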

Feature-Based: Extract features (strokes, curves), classify

More robust to variation
Still struggles with noise

Hidden Markov Models: Model character sequences

Better context integration
Foundation for early commercial systems
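
The idea of context integration can be sketched with Viterbi decoding over a tiny character model. Everything numeric here is invented for illustration: `EMISSIONS` stands in for per-position classifier scores, `BIGRAM` for character transition probabilities, and `default` for the probability of unseen pairs.

```python
import math

def viterbi(emissions, bigram, default=0.05):
    """Find the most probable character sequence.

    emissions: list of {char: score} dicts, one per character position.
    bigram: {(prev, cur): P(cur | prev)}; unseen pairs get `default`.
    """
    # best[c] = (log-probability of the best path ending in c, that path)
    best = {c: (math.log(p), c) for c, p in emissions[0].items()}
    for scores in emissions[1:]:
        new_best = {}
        for cur, p_emit in scores.items():
            candidates = []
            for prev, (logp, path) in best.items():
                p_trans = bigram.get((prev, cur), default)
                candidates.append(
                    (logp + math.log(p_trans) + math.log(p_emit), path + cur)
                )
            new_best[cur] = max(candidates)
        best = new_best
    return max(best.values())[1]

# Made-up scores: the second character looks slightly more like 'b' than 'h'.
EMISSIONS = [{"t": 0.9, "f": 0.1}, {"h": 0.4, "b": 0.6}, {"e": 0.7, "c": 0.3}]
BIGRAM = {("t", "h"): 0.5, ("t", "b"): 0.01, ("h", "e"): 0.5, ("b", "e"): 0.2}

greedy = "".join(max(s, key=s.get) for s in EMISSIONS)
print(greedy)                      # per-character choice ignores context
print(viterbi(EMISSIONS, BIGRAM))  # sequence model prefers the likelier word
```

Picking each character independently yields "tbe", while the sequence model recovers "the" because the transition from "t" to "h" is far more probable than from "t" to "b".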

The Tesseract Story

Tesseract, now one of the most widely used open-source OCR engines, has a remarkable history:

Developed at HP Labs 1985-1994
Ranked among the top three engines in the 1995 UNLV accuracy test
Abandoned as HP exited the OCR market
Released as open source in 2005
Development sponsored by Google from 2006; recognition based on LSTM networks since version 4

import pytesseract
from PIL import Image

# Basic OCR
text = pytesseract.image_to_string(Image.open('document.png'))

# With language specification
text = pytesseract.image_to_string(
    Image.open('arabic_doc.png'),
    lang='ara'
)

# Get detailed information
data = pytesseract.image_to_data(
    Image.open('document.png'),
    output_type=pytesseract.Output.DICT
)

Part 2: Deep Learning Revolution

End-to-End Recognition

Modern OCR uses deep learning for end-to-end recognition:

CNNs for Visual Features: Extract visual representations
RNNs/Transformers for Sequences: Model character sequences
CTC Loss: Handle variable-length sequences without explicit alignment

Key Architectures

CRNN (Convolutional Recurrent Neural Network):

CNN extracts features from image strips
BiLSTM processes sequence
CTC decodes to text
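
The CTC decoding step can be sketched in a few lines. This assumes the network has already produced one best label per time step (`frame_labels`) and that `-` is used as the blank symbol; both are illustrative choices, and real decoders often use beam search rather than this greedy rule.

```python
BLANK = "-"  # assumed blank symbol for this sketch

def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse consecutive repeated labels, then drop blanks."""
    decoded = []
    prev = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return "".join(decoded)

print(ctc_greedy_decode("hh-e-ll-lo-"))
```

The blank between the two runs of "l" is what lets the decoder keep a genuine double letter instead of collapsing it.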

Attention-Based Models:

Encoder processes full image
Decoder attends to relevant regions
Generates characters autoregressively

Transformer-Based OCR:

Vision Transformer (ViT) for encoding
Text decoder for generation
TrOCR, Donut, etc.

Cloud OCR Services

Google Cloud Vision: General OCR, document parsing
AWS Textract: Form and table extraction
Azure Computer Vision: OCR and document analysis
Google Document AI: Specialized document processors

Part 3: Document Understanding

Beyond OCR: Document AI

OCR extracts text. Document AI understands structure:

Layout Analysis: Headers, paragraphs, captions, tables
Table Extraction: Rows, columns, cells
Form Understanding: Field-value pairs
Key Information Extraction: Specific data from documents

LayoutLM and Document Transformers

The LayoutLM family of models (Microsoft) was an influential step in document understanding:

Combines text, layout position, and visual features
Pre-trained on large document datasets
Fine-tunes for specific extraction tasks

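from PIL import Image
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification

# By default the processor runs Tesseract (via pytesseract) to get words and boxes
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")
# The base checkpoint has no trained classification head:
# fine-tune on labeled documents before trusting the predictions
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base"
)

# Process document image (the processor expects an RGB image)
image = Image.open("document.png").convert("RGB")
encoding = processor(image, return_tensors="pt")
outputs = model(**encoding)
predictions = outputs.logits.argmax(-1)  # one label id per token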
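
The "layout position" input deserves a concrete illustration. LayoutLM-family models take each word's bounding box as (x0, y0, x1, y1) scaled to a 0-1000 range, so boxes from your own OCR step need converting first. The helper names `ltwh_to_xyxy` and `normalize_bbox` below are hypothetical, and the sketch assumes pixel coordinates with the origin at the top-left corner.

```python
def ltwh_to_xyxy(left, top, width, height):
    """Convert a (left, top, width, height) box to (x0, y0, x1, y1)."""
    return (left, top, left + width, top + height)

def normalize_bbox(bbox, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

box = ltwh_to_xyxy(100, 50, 200, 50)    # a word box in pixel coordinates
print(normalize_bbox(box, 1000, 500))   # the same box on the 0-1000 grid
```

Normalizing makes box coordinates comparable across scans of different resolutions, which is what allows a single pre-trained model to be reused on new document sources.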

Handwriting Recognition

Handwritten text recognition (HTR) is harder than recognizing printed text:

Huge variation between writers
Connected cursive scripts
Historical documents have archaic styles

Modern approaches:

Transformer-based models
Writer adaptation techniques
Large-scale pretraining

Part 4: Practical Document Processing

PDF Extraction

PDFs may be:

Text-based: Text extractable directly
Image-based: Scanned images requiring OCR
Mixed: Combination of text and images

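import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        # Extract text (empty for scanned pages with no text layer)
        text = page.extract_text() or ""

        # Extract tables: each table is a list of rows, each row a list of cells
        tables = page.extract_tables()
        for table in tables:
            # Treat the first row as the header
            df = pd.DataFrame(table[1:], columns=table[0])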
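
The three PDF categories suggest a simple routing rule: send a page to OCR only when direct extraction yields almost nothing. The sketch below works on already-extracted page strings so it stays library-independent; the function names and the `min_chars` threshold are assumptions to tune on your own documents.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Heuristic: a page with almost no extractable text is probably a scan."""
    if extracted_text is None:
        return True
    return len(extracted_text.strip()) < min_chars

def classify_pdf(page_texts, min_chars=20):
    """Label a document as 'text-based', 'image-based', or 'mixed'."""
    if not page_texts:
        raise ValueError("document has no pages")
    flags = [needs_ocr(text, min_chars) for text in page_texts]
    if all(flags):
        return "image-based"
    if not any(flags):
        return "text-based"
    return "mixed"

# Hypothetical per-page results from a text extractor.
pages = ["Quarterly report with plenty of extractable text.", "", None]
print(classify_pdf(pages))
```

A character-count threshold is crude: a scanned page with a small text watermark can look "text-based", so spot-check the routing on a sample before running it at scale.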

Preprocessing for Better OCR

Image quality dramatically affects OCR:

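import cv2
import numpy as np

def preprocess_for_ocr(image_path):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")

    # Denoise
    img = cv2.fastNlMeansDenoising(img, h=10)

    # Binarize (Otsu's method): text becomes 0 (black), background 255 (white)
    _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Deskew: fit a rotated rectangle around the text (black) pixels.
    # np.where returns (row, col); minAreaRect expects (x, y) points.
    rows, cols = np.where(img == 0)
    if len(rows) == 0:
        return img  # blank page, nothing to deskew
    coords = np.column_stack((cols, rows)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # Fold the angle into (-45, 45]. The raw range depends on the OpenCV
    # version ([-90, 0) in older releases, (0, 90] in newer ones), so check
    # the rotation direction on a scan with a known skew.
    if angle < -45:
        angle += 90
    elif angle > 45:
        angle -= 90
    (h, w) = img.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    img = cv2.warpAffine(img, M, (w, h), borderValue=255)

    return img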
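
To see what Otsu's method computes in the binarization step, here is a dependency-free sketch. It picks the threshold that maximizes the between-class variance of dark and bright pixels. The function name `otsu_threshold` and the flat list-of-integers input are choices made for this illustration; in practice the OpenCV call above does the same job much faster.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0-255) that maximizes between-class variance.

    Pixels <= t form the dark class; pixels > t form the bright class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0     # number of pixels in the dark class
    sum0 = 0   # sum of intensities in the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty; variance is undefined
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Made-up intensities: three ink pixels and three paper pixels.
sample = [10, 12, 11, 200, 205, 198]
t = otsu_threshold(sample)
print(t, [p > t for p in sample])
```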

Handling Arabic Script

Arabic presents unique challenges:

Right-to-left direction
Connected letters with multiple forms
Diacritics (harakat)
Mixed with left-to-right elements

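# Tesseract with Arabic (requires the 'ara' language data to be installed)
# For pages that mix Arabic and English, use lang='ara+eng'
text = pytesseract.image_to_string(
    Image.open('arabic_doc.png'),
    lang='ara',
    config='--psm 3'  # Fully automatic page segmentation (the default mode)
)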
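
Because harakat are optional in most Arabic writing, OCR output and reference text often differ only in diacritics. One common normalization, sketched here with the standard library, is to strip combining marks before comparing or indexing text. The function name `strip_harakat` is hypothetical, and note that this removes every nonspacing mark, not only the Arabic ones.

```python
import unicodedata

def strip_harakat(text):
    """Remove nonspacing combining marks (Unicode category Mn).

    The Arabic harakat (fatha, damma, kasra, sukun, shadda, tanwin)
    all fall in this category.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# "kataba" written with fatha on each letter, using escapes for clarity.
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
plain = "\u0643\u062A\u0628"
print(strip_harakat(vocalized) == plain)
```

Whether to strip diacritics depends on the task: it helps search and matching, but for vocalized texts such as religious or educational material the diacritics carry meaning and should be preserved.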

DEEP DIVE: The Digitization of Human Knowledge

The Dream of Universal Access

In 2004, Google announced an audacious project: scan the world's printed books and make them searchable. The Google Books project would eventually scan over 40 million books, a significant fraction of humanity's written knowledge.

But scanning is just imaging. Making books searchable required OCR at unprecedented scale:

Degraded historical texts
Hundreds of languages
Gothic typefaces, handwriting, marginalia
Billions of pages

reCAPTCHA: Humans as OCR Workers

When computers struggled with difficult words, researchers at Carnegie Mellon University found an ingenious solution: reCAPTCHA, which Google acquired in 2009.

Remember those distorted word puzzles used to prove you're human? They served a dual purpose. One word tested you; the other was an OCR failure that needed human help. By distributing difficult words across millions of CAPTCHA challenges, reCAPTCHA crowdsourced OCR correction.

reCAPTCHA digitized 13 million articles from the New York Times archives and helped correct Google Books. Users unknowingly contributed to the digitization of human knowledge, one CAPTCHA at a time.

The Continuing Challenge

Despite advances, OCR remains imperfect:

Historical documents with unfamiliar fonts
Handwritten manuscripts
Low-quality scans
Languages without training data
Complex layouts

Each archive, each language, each historical period presents new challenges. The dream of universal access to recorded knowledge remains a work in progress.

HANDS-ON EXERCISE: Building a Document Processing Pipeline

Overview

Students will:

Preprocess document images
Apply OCR with Tesseract
Extract structured information
Handle multi-language documents

Part 1: Basic OCR

import pytesseract
from PIL import Image
import cv2
import numpy as np

# Load and preprocess image
def preprocess_image(image_path):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    denoised = cv2.fastNlMeansDenoising(gray)
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

# OCR with confidence scores
def ocr_with_confidence(image_path):
    img = preprocess_image(image_path)
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

    results = []
    for i, word in enumerate(data['text']):
        if word.strip():
            results.append({
                'word': word,
                'confidence': float(data['conf'][i]),  # conf may be a string in some versions
                'bbox': (data['left'][i], data['top'][i],
                        data['width'][i], data['height'][i])
            })
    return results

results = ocr_with_confidence('document.png')
high_conf = [r for r in results if r['confidence'] > 80]
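
Word-level results are often easier to use once regrouped into lines. The dictionary returned by `image_to_data` also carries `block_num`, `par_num`, and `line_num` for each entry, so words can be grouped on those keys. The sketch below runs on a small hand-written dictionary (`sample`, not real OCR output) so it can be tried without Tesseract installed; `words_to_lines` is a hypothetical helper.

```python
from collections import defaultdict

def words_to_lines(data, min_conf=0):
    """Group OCR words into text lines, skipping low-confidence words."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # structural rows (blocks, lines) have empty text
        if float(data["conf"][i]) < min_conf:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append(word)
    # Sort by (block, paragraph, line) to restore reading order
    return [" ".join(words) for _, words in sorted(lines.items())]

# Hand-written stand-in for pytesseract.image_to_data(..., Output.DICT)
sample = {
    "text": ["", "Invoice", "#123", "", "Total:", "$45.00"],
    "conf": ["-1", "96", "91", "-1", "88", "35"],
    "block_num": [1, 1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1, 1],
    "line_num": [1, 1, 1, 2, 2, 2],
}

print(words_to_lines(sample))
print(words_to_lines(sample, min_conf=50))
```

Dropping low-confidence words silently can hide exactly the fields you care about (here the amount), so in practice it is often better to flag them for review than to discard them.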

Part 2: Table Extraction

import pdfplumber
import pandas as pd

def extract_tables_from_pdf(pdf_path):
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            page_tables = page.extract_tables()
            for j, table in enumerate(page_tables):
                if table and len(table) > 1:
                    df = pd.DataFrame(table[1:], columns=table[0])
                    df['page'] = i + 1
                    df['table_num'] = j + 1
                    tables.append(df)
    return tables

Part 3: Key Information Extraction

import re

def extract_invoice_info(text):
    """Extract key fields from invoice text."""
    info = {}

    # Invoice number: require at least one digit so "Invoice Date" is not captured
    inv_match = re.search(
        r'Invoice\s*(?:No\.?|Number|#)?\s*:?\s*([A-Z0-9-]*\d[A-Z0-9-]*)',
        text, re.IGNORECASE)
    if inv_match:
        info['invoice_number'] = inv_match.group(1)

    # Date (numeric formats such as 03/07/2024 or 3-7-24)
    date_match = re.search(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}', text)
    if date_match:
        info['date'] = date_match.group()

    # Total amount: \b keeps "Subtotal" from matching
    total_match = re.search(
        r'\bTotal\b\s*:?\s*\$?\s*(\d[\d,]*(?:\.\d+)?)',
        text, re.IGNORECASE)
    if total_match:
        info['total'] = float(total_match.group(1).replace(',', ''))

    return info
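
Extracted fields usually need normalizing before they are stored. As one example, the date string captured above can be converted to ISO format. This sketch assumes US-style month-first dates, which is a guess about the invoices being processed; `DATE_FORMATS` and `normalize_date` are hypothetical names, and day-first documents would need a different format list.

```python
from datetime import datetime

# Assumed month-first formats; adjust for the locale of your documents.
DATE_FORMATS = ("%m/%d/%Y", "%m-%d-%Y", "%m/%d/%y", "%m-%d-%y")

def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(normalize_date("3/7/2024"))
print(normalize_date("31/12/2024"))  # not a valid month-first date
```

Returning None rather than guessing keeps ambiguous or misread dates visible, so they can be routed to manual review.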

Recommended Resources

Libraries

Tesseract OCR: https://github.com/tesseract-ocr/tesseract
EasyOCR: Simple deep learning OCR
PaddleOCR: Multilingual OCR
pdfplumber: PDF extraction
LayoutParser: Document layout analysis

Cloud Services

Google Cloud Vision API
AWS Textract
Azure AI Document Intelligence (formerly Form Recognizer)
Google Document AI

Papers

“LayoutLMv3” - Microsoft’s document understanding model
“TrOCR” - Transformer OCR
“Donut” - Document understanding transformer

Module 4 explores OCR and Document AI—the technologies for extracting information from document images. From classical character recognition to modern deep learning systems that understand document structure, we learn to unlock the knowledge trapped in unstructured documents.