# Batch Book Translation Workflow

Process books through the complete pipeline: Crop → OCR → Translate.
## Roadmap Reference

See `.claude/ROADMAP.md` for the translation priority list.

Priority 1 = UNTRANSLATED. These are the highest priority for processing:
- Kircher encyclopedias (Oedipus, Musurgia, Ars Magna Lucis)
- Fludd: Utriusque Cosmi Historia
- Theatrum Chemicum, Musaeum Hermeticum
- Cardano: De Subtilitate
- Della Porta: Magia Naturalis
- Lomazzo, Poliziano, Landino
```bash
# Get roadmap with priorities
curl -s "https://sourcelibrary.org/api/books/roadmap" | jq '.books[] | select(.priority == 1) | {title, notes}'
```
Roadmap source: `src/app/api/books/roadmap/route.ts`
## Overview

This workflow handles the full processing pipeline for historical book scans:

1. **Generate Cropped Images** - For split two-page spreads, extract individual pages
2. **OCR** - Extract text from page images using Gemini vision
3. **Translate** - Translate OCR'd text with prior-page context for continuity
## API Endpoints

| Endpoint | Purpose |
|---|---|
| `GET /api/books` | List all books |
| `GET /api/books/BOOK_ID` | Get book with all pages |
| `POST /api/jobs/queue-books` | Queue pages for Lambda worker processing (primary path) |
| `GET /api/jobs` | List processing jobs |
| `POST /api/jobs/JOB_ID/retry` | Retry failed pages in a job |
| `POST /api/jobs/JOB_ID/cancel` | Cancel a running job |
| `POST /api/books/BOOK_ID/batch-ocr-async` | Submit Gemini Batch API OCR job (50% cheaper, ~24h) |
| `POST /api/books/BOOK_ID/batch-translate-async` | Submit Gemini Batch API translation job |
## Processing Options

### Option 1: Lambda Workers via Job System (Primary Path)

The primary processing path uses AWS Lambda workers fed by SQS queues. Each page is processed independently, with automatic job tracking.

```bash
# Queue OCR for a book's pages
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d '{"bookIds": ["BOOK_ID"], "action": "ocr"}'

# Queue translation
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d '{"bookIds": ["BOOK_ID"], "action": "translation"}'

# Queue image extraction
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d '{"bookIds": ["BOOK_ID"], "action": "image_extraction"}'
```
**IMPORTANT:** Always use `gemini-3-flash-preview` for all OCR and translation tasks. Do NOT use `gemini-2.5-flash`.
### Option 2: Gemini Batch API (50% Cheaper, Automated Pipeline)

The post-import-pipeline cron uses the Gemini Batch API to process newly imported books automatically. Results arrive in ~24 hours at 50% of the realtime cost.

| Job Type | API | Model | Cost |
|---|---|---|---|
| Single page | Realtime (Lambda) | `gemini-3-flash-preview` | Full price |
| `batch_ocr` | Batch API | `gemini-3-flash-preview` | 50% off |
| `batch_translate` | Batch API | `gemini-3-flash-preview` | 50% off |
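A manual submission sketch using the batch endpoints from the API table above. The request body fields shown are assumptions (this doc doesn't document the batch routes' schemas); check the route handlers before relying on them:

```bash
# Submit a Batch API OCR job for one book
# (body fields are assumed; verify against the actual route handler)
curl -s -X POST "https://sourcelibrary.org/api/books/BOOK_ID/batch-ocr-async" \
  -H "Content-Type: application/json" \
  -d '{"model": "gemini-3-flash-preview"}'

# Once OCR results land (~24h), submit the matching translation batch
curl -s -X POST "https://sourcelibrary.org/api/books/BOOK_ID/batch-translate-async" \
  -H "Content-Type: application/json" \
  -d '{"model": "gemini-3-flash-preview"}'
```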
## OCR Output Format

OCR uses Markdown output with semantic tags.

### Markdown Formatting

- `#`, `##`, `###` for headings (bigger text = bigger heading)
- `**bold**`, `*italic*` for emphasis
- `->centered text<-` for centered lines (NOT for headings)
- `> blockquotes` for quotes/prayers
- `---` for dividers
- Tables only for actual tabular data
### Metadata Tags (hidden from readers)

| Tag | Purpose |
|---|---|
| `<lang>X</lang>` | Detected language |
| `<page-num>N</page-num>` | Page/folio number |
| `<header>X</header>` | Running headers |
| `<sig>X</sig>` | Printer's marks (A2, B1) |
| `<meta>X</meta>` | Hidden metadata |
| `<warning>X</warning>` | Quality issues |
| `<vocab>X</vocab>` | Key terms for indexing |
### Inline Annotations (visible to readers)

| Tag | Purpose |
|---|---|
| `<margin>X</margin>` | Marginal notes (before paragraph) |
| `<gloss>X</gloss>` | Interlinear annotations |
| `<insert>X</insert>` | Boxed text, additions |
| `<unclear>X</unclear>` | Illegible readings |
| `<note>X</note>` | Interpretive notes |
| `<term>X</term>` | Technical vocabulary |
| `<image-desc>X</image-desc>` | Describe illustrations |
### Critical OCR Rules

- Preserve original spelling, capitalization, and punctuation
- Page numbers/headers/signatures go in metadata tags only
- IGNORE partial text at edges (from the facing page in a spread)
- Describe images/diagrams with `<image-desc>`, never tables
- End with `<vocab>key terms, names, concepts</vocab>`
## Step 1: Analyze Book Status

First, check what work is needed for a book:

```bash
# Get book and analyze page status
curl -s "https://sourcelibrary.org/api/books/BOOK_ID" > /tmp/book.json

# Count pages by status (IMPORTANT: check length > 0, not just existence - empty strings are truthy!)
jq '{
  title: .title,
  total_pages: (.pages | length),
  split_pages: [.pages[] | select(.crop)] | length,
  needs_crop: [.pages[] | select(.crop) | select(.cropped_photo | not)] | length,
  has_ocr: [.pages[] | select((.ocr.data // "") | length > 0)] | length,
  needs_ocr: [.pages[] | select((.ocr.data // "") | length == 0)] | length,
  has_translation: [.pages[] | select((.translation.data // "") | length > 0)] | length,
  needs_translation: [.pages[] | select((.ocr.data // "") | length > 0) | select((.translation.data // "") | length == 0)] | length
}' /tmp/book.json
```
### Detecting Bad OCR

Pages that were OCR'd before cropped images were generated have incorrect OCR (it contains both pages of the spread). Detect these:

```bash
# Find pages with crop data + OCR but missing cropped_photo at OCR time
# These often contain "two-page" or "spread" in the OCR text
jq '[.pages[] | select(.crop) | select(.ocr.data) |
  select(.ocr.data | test("two-page|spread"; "i"))] | length' /tmp/book.json
```
## Step 2: Generate Cropped Images

For books with split two-page spreads, generate individual page images:

```bash
# Get page IDs needing crops
CROP_IDS=$(jq '[.pages[] | select(.crop) | select(.cropped_photo | not) | .id]' /tmp/book.json)

# Create crop job
curl -s -X POST "https://sourcelibrary.org/api/jobs" \
  -H "Content-Type: application/json" \
  -d "{
    \"type\": \"generate_cropped_images\",
    \"book_id\": \"BOOK_ID\",
    \"book_title\": \"BOOK_TITLE\",
    \"page_ids\": $CROP_IDS
  }"
```

Process the job:

```bash
# Trigger processing (40 pages per request, auto-continues)
curl -s -X POST "https://sourcelibrary.org/api/jobs/JOB_ID/process"
```
## Step 3: OCR Pages

### Option A: Using Job System (for large batches)

```bash
# Get page IDs needing OCR (check for empty strings, not just null)
OCR_IDS=$(jq '[.pages[] | select((.ocr.data // "") | length == 0) | .id]' /tmp/book.json)

# Create OCR job
curl -s -X POST "https://sourcelibrary.org/api/jobs" \
  -H "Content-Type: application/json" \
  -d "{
    \"type\": \"batch_ocr\",
    \"book_id\": \"BOOK_ID\",
    \"book_title\": \"BOOK_TITLE\",
    \"model\": \"gemini-3-flash-preview\",
    \"language\": \"Latin\",
    \"page_ids\": $OCR_IDS
  }"
```
### Option B: Using Lambda Workers with Page IDs

```bash
# OCR specific pages (including overwrite)
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d '{
    "bookIds": ["BOOK_ID"],
    "action": "ocr",
    "pageIds": ["PAGE_ID_1", "PAGE_ID_2"],
    "overwrite": true
  }'
```

Lambda workers automatically use `cropped_photo` when available.
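To sanity-check which image each split page will be read from, a jq sketch over the `/tmp/book.json` file fetched in Step 1 (the `crop` and `cropped_photo` field names come from this doc; the "original scan" label is just illustrative):

```bash
# For each split page, report whether the worker will use the cropped image
jq -c '.pages[] | select(.crop) |
  {id, image_source: (if .cropped_photo then "cropped_photo" else "original scan" end)}' /tmp/book.json
```

Any page still reporting "original scan" needs a crop job (Step 2) before re-OCR.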
## Step 4: Translate Pages

### Option A: Using Job System

```bash
# Get page IDs needing translation (must have OCR content, check for empty strings)
TRANS_IDS=$(jq '[.pages[] | select((.ocr.data // "") | length > 0) | select((.translation.data // "") | length == 0) | .id]' /tmp/book.json)

# Create translation job
curl -s -X POST "https://sourcelibrary.org/api/jobs" \
  -H "Content-Type: application/json" \
  -d "{
    \"type\": \"batch_translate\",
    \"book_id\": \"BOOK_ID\",
    \"book_title\": \"BOOK_TITLE\",
    \"model\": \"gemini-3-flash-preview\",
    \"language\": \"Latin\",
    \"page_ids\": $TRANS_IDS
  }"
```
### Option B: Using Lambda Workers (Recommended)

The Lambda FIFO queue automatically provides previous-page context for translation continuity:

```bash
# Queue translation for pages that have OCR but no translation
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d '{"bookIds": ["BOOK_ID"], "action": "translation"}'
```

The translation Lambda worker processes pages sequentially via the FIFO queue and fetches the previous page's translation for context.
## Complete Book Processing Script

Process a single book through the full pipeline using Lambda workers:

```bash
#!/bin/bash
BOOK_ID="YOUR_BOOK_ID"
BASE_URL="https://sourcelibrary.org"

# 1. Fetch book data
echo "Fetching book..."
BOOK=$(curl -s "$BASE_URL/api/books/$BOOK_ID")
TITLE=$(echo "$BOOK" | jq -r '.title[0:40]')
echo "Processing: $TITLE"

# 2. Queue OCR (Lambda workers handle all pages automatically)
NEEDS_OCR=$(echo "$BOOK" | jq '[.pages[] | select((.ocr.data // "") | length == 0)] | length')
if [ "$NEEDS_OCR" != "0" ]; then
  echo "Queueing OCR for $NEEDS_OCR pages..."
  curl -s -X POST "$BASE_URL/api/jobs/queue-books" \
    -H "Content-Type: application/json" \
    -d "{\"bookIds\": [\"$BOOK_ID\"], \"action\": \"ocr\"}"
  echo "OCR job queued!"
fi

# 3. Queue translation (run after OCR completes; check the /jobs page)
NEEDS_TRANS=$(echo "$BOOK" | jq '[.pages[] | select((.ocr.data // "") | length > 0) | select((.translation.data // "") | length == 0)] | length')
if [ "$NEEDS_TRANS" != "0" ]; then
  echo "Queueing translation for $NEEDS_TRANS pages..."
  curl -s -X POST "$BASE_URL/api/jobs/queue-books" \
    -H "Content-Type: application/json" \
    -d "{\"bookIds\": [\"$BOOK_ID\"], \"action\": \"translation\"}"
  echo "Translation job queued!"
fi

echo "Jobs queued! Monitor progress at $BASE_URL/jobs"
```
## Fixing Bad OCR

When pages were OCR'd before cropped images existed, they contain text from both pages. Fix with:

```bash
# 1. Generate cropped images first (Step 2 above)

# 2. Find pages with bad OCR
BAD_OCR_IDS=$(jq '[.pages[] | select(.crop) | select(.ocr.data) |
  select(.ocr.data | test("two-page|spread"; "i")) | .id]' /tmp/book.json)

# 3. Re-OCR with overwrite via Lambda workers
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d "{\"bookIds\": [\"BOOK_ID\"], \"action\": \"ocr\", \"pageIds\": $BAD_OCR_IDS, \"overwrite\": true}"
```
## Processing All Books

Use the Lambda worker job system for bulk processing:

```bash
#!/bin/bash
BASE_URL="https://sourcelibrary.org"

# Get all book IDs
BOOK_IDS=$(curl -s "$BASE_URL/api/books" | jq '[.[].id]')

# Queue OCR for all books (Lambda workers handle parallelism and rate limiting)
curl -s -X POST "$BASE_URL/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d "{\"bookIds\": $BOOK_IDS, \"action\": \"ocr\"}"

# After OCR completes, queue translation
curl -s -X POST "$BASE_URL/api/jobs/queue-books" \
  -H "Content-Type: application/json" \
  -d "{\"bookIds\": $BOOK_IDS, \"action\": \"translation\"}"
```

Monitor progress at https://sourcelibrary.org/jobs
## Monitoring Progress

Check overall library status:

```bash
curl -s "https://sourcelibrary.org/api/books" | jq '[.[] | {
  title: .title[0:30],
  pages: .pages_count,
  ocr: .ocr_count,
  translated: .translation_count
}] | sort_by(-.pages)'
```
## Troubleshooting

### Empty Strings vs Null (CRITICAL)

In jq, empty strings (`""`) are truthy! This means:

- `select(.ocr.data)` matches pages with `""` (WRONG)
- `select(.ocr.data | not)` does NOT match pages with `""` (WRONG)
- Use `select((.ocr.data // "") | length == 0)` to find missing/empty OCR
- Use `select((.ocr.data // "") | length > 0)` to find pages WITH OCR content
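A quick local demonstration of the pitfall (requires only `jq`):

```bash
# "" is truthy in jq (only false and null are falsy), so this match is WRONG:
echo '{"ocr": {"data": ""}}' | jq 'select(.ocr.data) | "matched empty string"'

# The length-based check correctly treats "" as missing OCR:
echo '{"ocr": {"data": ""}}' | jq 'select((.ocr.data // "") | length == 0) | "correctly flagged as empty"'
```

Both commands print output: the first because the empty string passes `select`, the second because the length check catches it.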
### Rate Limits (429 errors)

#### Gemini API Tiers

| Tier | RPM | How to Qualify |
|---|---|---|
| Free | 15 | Default |
| Tier 1 | 300 | Enable billing + $50 spend |
| Tier 2 | 1000 | $250 spend |
| Tier 3 | 2000 | $1000 spend |
#### Optimal Sleep Times by Tier

| Tier | Max RPM | Safe Sleep Time | Effective Rate |
|---|---|---|---|
| Free | 15 | 4.0s | ~15/min |
| Tier 1 | 300 | 0.4s | ~150/min |
| Tier 2 | 1000 | 0.12s | ~500/min |
| Tier 3 | 2000 | 0.06s | ~1000/min |

Note: On paid tiers, use ~50% of the max rate to leave headroom for bursts (the Free tier is slow enough that it runs at the full 15 RPM).
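For the paid tiers, the sleep values above follow directly from `60 / (max RPM × 0.5)`; a sketch for deriving the sleep time for any tier:

```bash
# Sleep needed to stay at ~50% of a tier's max RPM (Tier 1 shown)
MAX_RPM=300
SAFE_SLEEP=$(awk -v rpm="$MAX_RPM" 'BEGIN { printf "%.2f", 60 / (rpm * 0.5) }')
echo "sleep ${SAFE_SLEEP}s between requests"   # 0.40s, matching the Tier 1 row
```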
### API Key Rotation

The system supports multiple API keys for higher throughput:

- Set `GEMINI_API_KEY` (primary)
- Set `GEMINI_API_KEY_2`, `GEMINI_API_KEY_3`, ... up to `GEMINI_API_KEY_10`
- Keys rotate automatically with a 60s cooldown after a rate limit

With N keys at Tier 1, you get N × 300 RPM max, i.e. N × 150 safe requests/min.
### Function Timeouts

- Jobs have `maxDuration=300s` on Vercel Pro
- If hitting timeouts, reduce `CROP_CHUNK_SIZE` in job processing
### Missing Cropped Photos

- Check whether the crop job completed successfully
- Verify the page has `crop` data with `xStart` and `xEnd`
- Re-run crop generation for the specific pages
### Bad OCR Detection

Look for these patterns in OCR text, which indicate the wrong image was used:

- "two-page spread"
- "left page" / "right page" descriptions
- Duplicate text blocks
- References to facing pages