What is extracting-pdf-text?

extracting-pdf-text is a skill for extracting text from PDFs for LLM consumption, using tools such as PyMuPDF, pdfplumber, and Mistral OCR. It is aimed at language model agents that need PDF text as input for downstream natural language processing tasks.

How do I install extracting-pdf-text?

Run the command `npx killer-skills add miwtoo/credit-card-extraction/extracting-pdf-text`. It works with Cursor, Windsurf, VS Code, Claude Code, and 19+ other IDEs.

What are the use cases for extracting-pdf-text?

Key use cases include extracting text from simple text PDFs for language model training data, parsing tables from PDFs with pdfplumber for data analysis, and running OCR on scanned or image PDFs with pytesseract.

Which IDEs are compatible with extracting-pdf-text?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

Are there any limitations for extracting-pdf-text?

Requires a Python environment. Handles PDF files only. Depends on compatible versions of the underlying libraries (PyMuPDF, pdfplumber, pytesseract).

Extracting PDF Text for LLMs

Name: extracting-pdf-text
Author: miwtoo

This skill provides tools and guidance for extracting text from PDFs in formats suitable for language model consumption.

Quick Decision Guide

| PDF Type | Best Approach | Script |
| --- | --- | --- |
| Simple text PDF | PyMuPDF | `scripts/extract_pymupdf.py` |
| PDF with tables | pdfplumber | `scripts/extract_pdfplumber.py` |
| Scanned/image PDF (local) | pytesseract | `scripts/extract_with_ocr.py` |
| Complex layout, highest accuracy | Mistral OCR API | `scripts/extract_mistral_ocr.py` |
| End-to-end RAG pipeline | marker-pdf | `pip install marker-pdf` |

Recommended Workflow

1. Try PyMuPDF first: it is the fastest option and handles most text-based PDFs well.
2. If tables are mangled, switch to pdfplumber.
3. If the PDF is scanned or image-based, use the Mistral OCR API (best accuracy) or local OCR (free but slower).
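
The workflow above can be sketched as a small routing function. This is a minimal sketch and not part of the skill's scripts: `choose_extractor`, its parameters, and the 50-character cutoff are hypothetical names and values chosen for illustration. Only the returned script paths come from the decision guide.

```python
def choose_extractor(avg_chars_per_page: float,
                     tables_mangled: bool = False,
                     api_key_available: bool = False,
                     min_chars: int = 50) -> str:
    """Pick a script following the recommended workflow.

    avg_chars_per_page: text-layer characters per page from a first PyMuPDF pass.
    min_chars: hypothetical cutoff below which the PDF is treated as scanned.
    """
    if avg_chars_per_page < min_chars:
        # Little or no text layer: the PDF is probably scanned or image-based.
        if api_key_available:
            return "scripts/extract_mistral_ocr.py"
        return "scripts/extract_with_ocr.py"
    if tables_mangled:
        # Text layer exists, but the first pass garbled the tables.
        return "scripts/extract_pdfplumber.py"
    return "scripts/extract_pymupdf.py"


print(choose_extractor(1800.0))
print(choose_extractor(1800.0, tables_mangled=True))
print(choose_extractor(3.0, api_key_available=True))
```

The cutoff would need tuning per corpus, since sparse pages such as title pages or figure-only pages can fall below it even in a text-based PDF.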

Local Extraction (No API Required)

PyMuPDF - Fast General Extraction

Best for: Text-heavy PDFs, speed-critical workflows, basic structure preservation.

```bash
uv run scripts/extract_pymupdf.py input.pdf output.md
```

The script outputs markdown with headings and paragraphs preserved. For LLM-optimized output it uses pymupdf4llm, which formats text for RAG systems.
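
For readers who want to call pymupdf4llm directly, a minimal sketch follows. The contents of `scripts/extract_pymupdf.py` are not shown here, so this is an assumption about the general approach rather than a copy of that script. `pdf_to_markdown` and `convert` are hypothetical helper names; `pymupdf4llm.to_markdown` is the library call.

```python
from pathlib import Path


def pdf_to_markdown(pdf_path: str) -> str:
    """Return the PDF's text as markdown via pymupdf4llm."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF: {pdf_path}")
    # Imported lazily so the missing-file check above works without the package.
    import pymupdf4llm
    return pymupdf4llm.to_markdown(str(path))


def convert(pdf_path: str, out_path: str) -> int:
    """Write the markdown to out_path and return the number of characters written."""
    markdown = pdf_to_markdown(pdf_path)
    Path(out_path).write_text(markdown, encoding="utf-8")
    return len(markdown)
```

Calling `convert("input.pdf", "output.md")` would mirror the command above, assuming `pymupdf4llm` is installed in the environment.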

pdfplumber - Table Extraction

Best for: PDFs with tables, financial documents, structured data.

```bash
uv run scripts/extract_pdfplumber.py input.pdf output.md
```

Tables are converted to markdown format. Note: pdfplumber works best on machine-generated PDFs, not scanned documents.
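
The table-to-markdown step can be sketched as below. This is an illustrative sketch and not the skill's actual script: `table_to_markdown` and `extract_tables_as_markdown` are hypothetical names, and the first row of each table is assumed to be the header. `pdfplumber.open` and `page.extract_tables()` are the library calls; the latter returns each table as a list of rows, with `None` for empty cells.

```python
from typing import List, Optional

Row = List[Optional[str]]


def table_to_markdown(rows: List[Row]) -> str:
    """Render one extracted table (first row assumed to be the header) as markdown."""
    if not rows:
        return ""

    def clean(cell: Optional[str]) -> str:
        # Empty cells arrive as None; newlines and pipes would break the row.
        return (cell or "").replace("\n", " ").replace("|", "\\|").strip()

    width = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of columns.
    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, body = padded[0], padded[1:]
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def extract_tables_as_markdown(pdf_path: str) -> List[str]:
    """Collect every table pdfplumber finds, page by page."""
    import pdfplumber  # lazy import: only needed when a real PDF is processed

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                tables.append(table_to_markdown(table))
    return tables


sample = [["Date", "Amount"], ["2024-01-05", "42.00"], ["2024-01-09", None]]
print(table_to_markdown(sample))
```

Keeping the rendering function separate from the PDF access makes it possible to check the markdown output on hand-written rows before running it on real documents.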

Local OCR - Scanned Documents

Best for: Scanned PDFs when API access is unavailable.

```bash
uv run scripts/extract_with_ocr.py input.pdf output.txt
```

Requires: pytesseract, pdf2image, and a Tesseract installation (`brew install tesseract` on macOS).
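
A local OCR pass with these libraries might look like the sketch below. It is an assumption about the general approach, not the contents of `scripts/extract_with_ocr.py`. `join_pages` and `ocr_pdf` are hypothetical names, and the `[Page N]` markers and the 300 DPI default are illustrative choices. Note that pdf2image also relies on Poppler being installed on the system.

```python
from pathlib import Path
from typing import Iterable


def join_pages(page_texts: Iterable[str]) -> str:
    """Join per-page OCR text, labelling each page so boundaries survive."""
    parts = [f"[Page {i}]\n{text.strip()}"
             for i, text in enumerate(page_texts, start=1)]
    return "\n\n".join(parts)


def ocr_pdf(pdf_path: str, dpi: int = 300, lang: str = "eng") -> str:
    """Rasterize each page with pdf2image, then OCR it with pytesseract."""
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(f"No such PDF: {pdf_path}")
    # Lazy imports: both packages (plus Tesseract and Poppler) are only
    # needed once a real file is processed.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)
    return join_pages(pytesseract.image_to_string(image, lang=lang)
                      for image in images)
```

Higher DPI values tend to help with small print at the cost of speed and memory, so the default here is a starting point rather than a recommendation.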

API-Based Extraction

Mistral OCR API

Best for: Complex layouts, scanned documents, highest accuracy, multilingual content, math formulas.

Pricing: ~1000 pages per dollar (very cost-effective)
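
To put the quoted rate in concrete terms, the arithmetic can be sketched as follows. The rate is the approximate figure stated above and may not reflect current pricing; `estimate_cost_usd` is a hypothetical helper.

```python
PAGES_PER_DOLLAR = 1000  # approximate rate quoted above; check current pricing


def estimate_cost_usd(pages: int, pages_per_dollar: float = PAGES_PER_DOLLAR) -> float:
    """Estimate OCR cost in dollars for a given page count."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages / pages_per_dollar


print(f"${estimate_cost_usd(250):.2f}")    # a 250-page report
print(f"${estimate_cost_usd(12000):.2f}")  # a 12,000-page archive
```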

```bash
export MISTRAL_API_KEY="your-key"
uv run scripts/extract_mistral_ocr.py input.pdf output.md
```

Features:

- Outputs clean markdown
- Preserves document structure (headings, lists, tables)
- Handles images, math equations, multilingual text
- 95%+ accuracy on complex documents

For detailed API options and other services, see references/api-services.md.

Output Format Recommendations

For LLM consumption, markdown is preferred:

- Preserves semantic structure (headings become context boundaries)
- Tables remain readable
- Compatible with most RAG chunking strategies
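
The idea of headings acting as context boundaries can be sketched as a simple heading-based splitter. This is one possible chunking approach under stated assumptions, not something the skill ships: `split_by_headings` is a hypothetical name, only ATX-style `#` headings are recognized, and `#` lines inside fenced code blocks are not treated specially.

```python
import re
from typing import List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> List[Tuple[str, str]]:
    """Return (heading, body) pairs; text before the first heading gets heading ''."""
    chunks: List[Tuple[str, str]] = []
    heading, body = "", []

    def flush() -> None:
        # Skip the leading chunk when it has no heading and no visible text.
        if heading or any(line.strip() for line in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # a new heading closes the previous chunk
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


doc = "intro\n# A\ntext a\n## B\n\ntext b\n"
for title, text in split_by_headings(doc):
    print(repr(title), repr(text))
```

Each chunk keeps its heading alongside the body, so a retrieval step can show where a passage came from.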

For detailed comparisons of local tools, see references/local-tools.md.

