# Batch Book Translation Workflow

Process books through the complete pipeline: Crop → OCR → Translate.
## Roadmap Reference

See `.claude/ROADMAP.md` for the translation priority list.

Priority 1 = UNTRANSLATED. These are the highest priority for processing:
- Kircher encyclopedias (Oedipus, Musurgia, Ars Magna Lucis)
- Fludd: Utriusque Cosmi Historia
- Theatrum Chemicum, Musaeum Hermeticum
- Cardano: De Subtilitate
- Della Porta: Magia Naturalis
- Lomazzo, Poliziano, Landino
```bash
# Get roadmap with priorities
curl -s "https://sourcelibrary.org/api/books/roadmap" | jq '.books[] | select(.priority == 1) | {title, notes}'
```
Roadmap source: `src/app/api/books/roadmap/route.ts`
## Overview

This workflow handles the full processing pipeline for historical book scans:

1. **Generate Cropped Images** - For split two-page spreads, extract individual pages
2. **OCR** - Extract text from page images using Gemini vision
3. **Translate** - Translate OCR'd text with prior-page context for continuity
## API Endpoints

| Endpoint | Purpose |
|---|---|
| `GET /api/books` | List all books |
| `GET /api/books/BOOK_ID` | Get book with all pages |
| `POST /api/jobs/queue-books` | Queue pages for Lambda worker processing (primary path) |
| `GET /api/jobs` | List processing jobs |
| `POST /api/jobs/JOB_ID/retry` | Retry failed pages in a job |
| `POST /api/jobs/JOB_ID/cancel` | Cancel a running job |
| `POST /api/books/BOOK_ID/batch-ocr-async` | Submit Gemini Batch API OCR job (50% cheaper, ~24h) |
| `POST /api/books/BOOK_ID/batch-translate-async` | Submit Gemini Batch API translation job |
## Processing Options

### Option 1: Lambda Workers via Job System (Primary Path)

The primary processing path uses AWS Lambda workers fed by SQS queues. Each page is processed independently, with automatic job tracking.

```bash
# Queue OCR for a book's pages
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d '{"bookIds": ["BOOK_ID"], "action": "ocr"}'

# Queue translation
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d '{"bookIds": ["BOOK_ID"], "action": "translation"}'

# Queue image extraction
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d '{"bookIds": ["BOOK_ID"], "action": "image_extraction"}'
```
**IMPORTANT:** Always use `gemini-3-flash-preview` for all OCR and translation tasks. Do NOT use `gemini-2.5-flash`.
### Option 2: Gemini Batch API (50% Cheaper, Automated Pipeline)

The post-import-pipeline cron uses the Gemini Batch API to process newly imported books automatically. Results arrive in ~24 hours at 50% of the realtime cost.

| Job Type | API | Model | Cost |
|---|---|---|---|
| Single page | Realtime (Lambda) | `gemini-3-flash-preview` | Full price |
| `batch_ocr` | Batch API | `gemini-3-flash-preview` | 50% off |
| `batch_translate` | Batch API | `gemini-3-flash-preview` | 50% off |
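A manual submission sketch using the batch endpoints from the API table above. The request body fields shown are assumptions (this doc doesn't document the batch routes' schemas); check the route handlers before relying on them:

```bash
# Submit a Batch API OCR job for one book
# (body fields are assumed; verify against the actual route handler)
curl -s -X POST "https://sourcelibrary.org/api/books/BOOK_ID/batch-ocr-async" \
  -H "Content-Type: application/json" \
  -d '{"model": "gemini-3-flash-preview"}'

# Once OCR results land (~24h), submit the matching translation batch
curl -s -X POST "https://sourcelibrary.org/api/books/BOOK_ID/batch-translate-async" \
  -H "Content-Type: application/json" \
  -d '{"model": "gemini-3-flash-preview"}'
```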
## OCR Output Format

OCR uses Markdown output with semantic tags.

### Markdown Formatting

- `#`, `##`, `###` for headings (bigger text = bigger heading)
- `**bold**`, `*italic*` for emphasis
- `->centered text<-` for centered lines (NOT for headings)
- `> blockquotes` for quotes/prayers
- `---` for dividers
- Tables only for actual tabular data
### Metadata Tags (hidden from readers)

| Tag | Purpose |
|---|---|
| `<lang>X</lang>` | Detected language |
| `<page-num>N</page-num>` | Page/folio number |
| `<header>X</header>` | Running headers |
| `<sig>X</sig>` | Printer's marks (A2, B1) |
| `<meta>X</meta>` | Hidden metadata |
| `<warning>X</warning>` | Quality issues |
| `<vocab>X</vocab>` | Key terms for indexing |
### Inline Annotations (visible to readers)

| Tag | Purpose |
|---|---|
| `<margin>X</margin>` | Marginal notes (before paragraph) |
| `<gloss>X</gloss>` | Interlinear annotations |
| `<insert>X</insert>` | Boxed text, additions |
| `<unclear>X</unclear>` | Illegible readings |
| `<note>X</note>` | Interpretive notes |
| `<term>X</term>` | Technical vocabulary |
| `<image-desc>X</image-desc>` | Describe illustrations |
### Critical OCR Rules

- Preserve original spelling, capitalization, and punctuation
- Page numbers/headers/signatures go in metadata tags only
- IGNORE partial text at edges (from the facing page in a spread)
- Describe images/diagrams with `<image-desc>`, never tables
- End with `<vocab>key terms, names, concepts</vocab>`
## Step 1: Analyze Book Status

First, check what work is needed for a book:

```bash
# Get book and analyze page status
curl -s "https://sourcelibrary.org/api/books/BOOK_ID" > /tmp/book.json

# Count pages by status (IMPORTANT: check length > 0, not just existence - empty strings are truthy!)
jq '{
  title: .title,
  total_pages: (.pages | length),
  split_pages: [.pages[] | select(.crop)] | length,
  needs_crop: [.pages[] | select(.crop) | select(.cropped_photo | not)] | length,
  has_ocr: [.pages[] | select((.ocr.data // "") | length > 0)] | length,
  needs_ocr: [.pages[] | select((.ocr.data // "") | length == 0)] | length,
  has_translation: [.pages[] | select((.translation.data // "") | length > 0)] | length,
  needs_translation: [.pages[] | select((.ocr.data // "") | length > 0) | select((.translation.data // "") | length == 0)] | length
}' /tmp/book.json
```
### Detecting Bad OCR

Pages that were OCR'd before cropped images were generated have incorrect OCR (it contains both pages of the spread). Detect these:

```bash
# Find pages with crop data + OCR but missing cropped_photo at OCR time
# These often contain "two-page" or "spread" in the OCR text
jq '[.pages[] | select(.crop) | select(.ocr.data) |
  select(.ocr.data | test("two-page|spread"; "i"))] | length' /tmp/book.json
```
## Step 2: Generate Cropped Images

For books with split two-page spreads, generate individual page images:

```bash
# Get page IDs needing crops
CROP_IDS=$(jq '[.pages[] | select(.crop) | select(.cropped_photo | not) | .id]' /tmp/book.json)

# Create crop job
curl -s -X POST "https://sourcelibrary.org/api/jobs" \
  -H "Content-Type: application/json" \
  -d "{
    \"type\": \"generate_cropped_images\",
    \"book_id\": \"BOOK_ID\",
    \"book_title\": \"BOOK_TITLE\",
    \"page_ids\": $CROP_IDS
  }"
```

Process the job:

```bash
# Trigger processing (40 pages per request, auto-continues)
curl -s -X POST "https://sourcelibrary.org/api/jobs/JOB_ID/process"
```
## Step 3: OCR Pages

### Option A: Using Job System (for large batches)

```bash
# Get page IDs needing OCR (check for empty strings, not just null)
OCR_IDS=$(jq '[.pages[] | select((.ocr.data // "") | length == 0) | .id]' /tmp/book.json)

# Create OCR job
curl -s -X POST "https://sourcelibrary.org/api/jobs" \
  -H "Content-Type: application/json" \
  -d "{
    \"type\": \"batch_ocr\",
    \"book_id\": \"BOOK_ID\",
    \"book_title\": \"BOOK_TITLE\",
    \"model\": \"gemini-3-flash-preview\",
    \"language\": \"Latin\",
    \"page_ids\": $OCR_IDS
  }"
```
### Option B: Using Lambda Workers with Page IDs

```bash
# OCR specific pages (including overwrite)
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d '{
    "bookIds": ["BOOK_ID"],
    "action": "ocr",
    "pageIds": ["PAGE_ID_1", "PAGE_ID_2"],
    "overwrite": true
  }'
```

Lambda workers automatically use `cropped_photo` when available.
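To sanity-check which image each split page will be read from, a jq sketch over the `/tmp/book.json` file fetched in Step 1 (the `crop` and `cropped_photo` field names come from this doc; the "original scan" label is just illustrative):

```bash
# For each split page, report whether the worker will use the cropped image
jq -c '.pages[] | select(.crop) |
  {id, image_source: (if .cropped_photo then "cropped_photo" else "original scan" end)}' /tmp/book.json
```

Any page still reporting "original scan" needs a crop job (Step 2) before re-OCR.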
## Step 4: Translate Pages

### Option A: Using Job System

```bash
# Get page IDs needing translation (must have OCR content, check for empty strings)
TRANS_IDS=$(jq '[.pages[] | select((.ocr.data // "") | length > 0) | select((.translation.data // "") | length == 0) | .id]' /tmp/book.json)

# Create translation job
curl -s -X POST "https://sourcelibrary.org/api/jobs" \
  -H "Content-Type: application/json" \
  -d "{
    \"type\": \"batch_translate\",
    \"book_id\": \"BOOK_ID\",
    \"book_title\": \"BOOK_TITLE\",
    \"model\": \"gemini-3-flash-preview\",
    \"language\": \"Latin\",
    \"page_ids\": $TRANS_IDS
  }"
```
### Option B: Using Lambda Workers (Recommended)

The Lambda FIFO queue automatically provides previous-page context for translation continuity:

```bash
# Queue translation for pages that have OCR but no translation
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d '{"bookIds": ["BOOK_ID"], "action": "translation"}'
```

The translation Lambda worker processes pages sequentially via the FIFO queue and fetches the previous page's translation for context.
## Complete Book Processing Script

Process a single book through the full pipeline using Lambda workers:

```bash
#!/bin/bash
BOOK_ID="YOUR_BOOK_ID"
BASE_URL="https://sourcelibrary.org"

# 1. Fetch book data
echo "Fetching book..."
BOOK=$(curl -s "$BASE_URL/api/books/$BOOK_ID")
TITLE=$(echo "$BOOK" | jq -r '.title[0:40]')
echo "Processing: $TITLE"

# 2. Queue OCR (Lambda workers handle all pages automatically)
NEEDS_OCR=$(echo "$BOOK" | jq '[.pages[] | select((.ocr.data // "") | length == 0)] | length')
if [ "$NEEDS_OCR" != "0" ]; then
  echo "Queueing OCR for $NEEDS_OCR pages..."
  curl -s -X POST "$BASE_URL/api/jobs/queue-books" \
    -H "Content-Type: application/json" \
    -d "{\"bookIds\": [\"$BOOK_ID\"], \"action\": \"ocr\"}"
  echo "OCR job queued!"
fi

# 3. Queue translation (run after OCR completes; check the /jobs page)
NEEDS_TRANS=$(echo "$BOOK" | jq '[.pages[] | select((.ocr.data // "") | length > 0) | select((.translation.data // "") | length == 0)] | length')
if [ "$NEEDS_TRANS" != "0" ]; then
  echo "Queueing translation for $NEEDS_TRANS pages..."
  curl -s -X POST "$BASE_URL/api/jobs/queue-books" \
    -H "Content-Type: application/json" \
    -d "{\"bookIds\": [\"$BOOK_ID\"], \"action\": \"translation\"}"
  echo "Translation job queued!"
fi

echo "Jobs queued! Monitor progress at $BASE_URL/jobs"
```
## Fixing Bad OCR

When pages were OCR'd before cropped images existed, they contain text from both pages. Fix with:

```bash
# 1. Generate cropped images first (Step 2 above)

# 2. Find pages with bad OCR
BAD_OCR_IDS=$(jq '[.pages[] | select(.crop) | select(.ocr.data) |
  select(.ocr.data | test("two-page|spread"; "i")) | .id]' /tmp/book.json)

# 3. Re-OCR with overwrite via Lambda workers
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d "{\"bookIds\": [\"BOOK_ID\"], \"action\": \"ocr\", \"pageIds\": $BAD_OCR_IDS, \"overwrite\": true}"
```
## Processing All Books

Use the Lambda worker job system for bulk processing:

```bash
#!/bin/bash
BASE_URL="https://sourcelibrary.org"

# Get all book IDs
BOOK_IDS=$(curl -s "$BASE_URL/api/books" | jq '[.[].id]')

# Queue OCR for all books (Lambda workers handle parallelism and rate limiting)
curl -s -X POST "$BASE_URL/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d "{\"bookIds\": $BOOK_IDS, \"action\": \"ocr\"}"

# After OCR completes, queue translation
curl -s -X POST "$BASE_URL/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d "{\"bookIds\": $BOOK_IDS, \"action\": \"translation\"}"
```

Monitor progress at https://sourcelibrary.org/jobs
## Monitoring Progress

Check overall library status:

```bash
curl -s "https://sourcelibrary.org/api/books" | jq '[.[] | {
  title: .title[0:30],
  pages: .pages_count,
  ocr: .ocr_count,
  translated: .translation_count
}] | sort_by(-.pages)'
```
## Troubleshooting

### Empty Strings vs Null (CRITICAL)

In jq, empty strings (`""`) are truthy! This means:

- `select(.ocr.data)` matches pages with `""` (WRONG)
- `select(.ocr.data | not)` does NOT match pages with `""` (WRONG)
- Use `select((.ocr.data // "") | length == 0)` to find missing/empty OCR
- Use `select((.ocr.data // "") | length > 0)` to find pages WITH OCR content
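A quick local demonstration of the pitfall (requires only `jq`):

```bash
# "" is truthy in jq (only false and null are falsy), so this match is WRONG:
echo '{"ocr": {"data": ""}}' | jq 'select(.ocr.data) | "matched empty string"'

# The length-based check correctly treats "" as missing OCR:
echo '{"ocr": {"data": ""}}' | jq 'select((.ocr.data // "") | length == 0) | "correctly flagged as empty"'
```

Both commands print output: the first because the empty string passes `select`, the second because the length check catches it.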
### Rate Limits (429 errors)

#### Gemini API Tiers

| Tier | RPM | How to Qualify |
|---|---|---|
| Free | 15 | Default |
| Tier 1 | 300 | Enable billing + $50 spend |
| Tier 2 | 1000 | $250 spend |
| Tier 3 | 2000 | $1000 spend |
#### Optimal Sleep Times by Tier

| Tier | Max RPM | Safe Sleep Time | Effective Rate |
|---|---|---|---|
| Free | 15 | 4.0s | ~15/min |
| Tier 1 | 300 | 0.4s | ~150/min |
| Tier 2 | 1000 | 0.12s | ~500/min |
| Tier 3 | 2000 | 0.06s | ~1000/min |

Note: On paid tiers, use ~50% of the max rate to leave headroom for bursts (the Free tier is slow enough that it runs at the full 15 RPM).
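For the paid tiers, the sleep values above follow directly from `60 / (max RPM × 0.5)`; a sketch for deriving the sleep time for any tier:

```bash
# Sleep needed to stay at ~50% of a tier's max RPM (Tier 1 shown)
MAX_RPM=300
SAFE_SLEEP=$(awk -v rpm="$MAX_RPM" 'BEGIN { printf "%.2f", 60 / (rpm * 0.5) }')
echo "sleep ${SAFE_SLEEP}s between requests"   # 0.40s, matching the Tier 1 row
```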
### API Key Rotation

The system supports multiple API keys for higher throughput:

- Set `GEMINI_API_KEY` (primary)
- Set `GEMINI_API_KEY_2`, `GEMINI_API_KEY_3`, ... up to `GEMINI_API_KEY_10`
- Keys rotate automatically with a 60s cooldown after a rate limit

With N keys at Tier 1, you get N × 300 RPM max, i.e. N × 150 safe requests/min.
### Function Timeouts

- Jobs have `maxDuration=300s` on Vercel Pro
- If hitting timeouts, reduce `CROP_CHUNK_SIZE` in job processing
### Missing Cropped Photos

- Check whether the crop job completed successfully
- Verify the page has `crop` data with `xStart` and `xEnd`
- Re-run crop generation for the specific pages
### Bad OCR Detection

Look for these patterns in OCR text, which indicate the wrong image was used:

- "two-page spread"
- "left page" / "right page" descriptions
- Duplicate text blocks
- References to facing pages