extracting-pdf-text: PDF text extraction

v1.0.0

About this Skill

extracting-pdf-text is a skill for extracting text from PDFs using tools like PyMuPDF, pdfplumber, and Mistral OCR for LLM consumption. It suits language model agents that need PDF text for natural language processing tasks.

Features

Supports PyMuPDF for simple text PDFs
Utilizes pdfplumber for PDFs with tables
Employs pytesseract for scanned document OCR
Integrates with Mistral OCR for complex layouts
Provides scripts for extraction using extract_pymupdf.py and extract_pdfplumber.py

Author: miwtoo
Updated: 3/12/2026

Installation

Universal install (auto-detects your IDE or agent):

> npx killer-skills add miwtoo/credit-card-extraction/extracting-pdf-text

Supports 19+ platforms, including Cursor, Windsurf, VS Code, Trae, Claude, OpenClaw, and 12 more.

Agent Capability Analysis

The extracting-pdf-text skill by miwtoo is an open-source community AI agent skill for Claude Code and other IDE workflows. It gives agents context, repeatable steps, and domain-specific guidance for PDF text extraction, Mistral OCR integration, and PyMuPDF scripting.

Ideal Agent Persona

Perfect for Language Model Agents needing to extract text from PDFs for advanced natural language processing tasks.

Core Value

Empowers agents to extract text from PDFs using libraries like PyMuPDF, pdfplumber, and pytesseract, supporting various formats and tools for language model consumption, including simple text PDFs, PDFs with tables, and scanned/image PDFs.

Capabilities Granted for extracting-pdf-text

Extracting text from simple text PDFs for language model training data
Parsing tables from PDFs using pdfplumber for data analysis
Performing OCR on scanned/image PDFs with pytesseract for text extraction

Prerequisites & Limits

  • Requires Python environment
  • Limited to PDF formats
  • Dependent on library compatibility (PyMuPDF, pdfplumber, pytesseract)
SKILL.md

Extracting PDF Text for LLMs

This skill provides tools and guidance for extracting text from PDFs in formats suitable for language model consumption.

Quick Decision Guide

| PDF Type | Best Approach | Script |
| --- | --- | --- |
| Simple text PDF | PyMuPDF | scripts/extract_pymupdf.py |
| PDF with tables | pdfplumber | scripts/extract_pdfplumber.py |
| Scanned/image PDF (local) | pytesseract | scripts/extract_with_ocr.py |
| Complex layout, highest accuracy | Mistral OCR API | scripts/extract_mistral_ocr.py |
| End-to-end RAG pipeline | marker-pdf | pip install marker-pdf |

Recommended Workflow

  1. Try PyMuPDF first - fastest, handles most text-based PDFs well
  2. If tables are mangled - switch to pdfplumber
  3. If scanned/image-based - use Mistral OCR API (best accuracy) or local OCR (free but slower)
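
The workflow above can be sketched as a small routing function. This is a minimal illustration rather than the skill's own logic: the function names, the `min_chars_per_page` threshold, and the `have_api_key` flag are assumptions introduced here, and only the script paths come from the decision guide. A page with almost no extractable characters is treated as a hint that the PDF is scanned.

```python
from typing import List

# Script paths come from the Quick Decision Guide; everything else is illustrative.
PYMUPDF = "scripts/extract_pymupdf.py"
PDFPLUMBER = "scripts/extract_pdfplumber.py"
LOCAL_OCR = "scripts/extract_with_ocr.py"
MISTRAL_OCR = "scripts/extract_mistral_ocr.py"


def page_char_counts(pdf_path: str) -> List[int]:
    """Count extractable characters per page with PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return [len(page.get_text().strip()) for page in doc]


def choose_script(
    char_counts: List[int],
    tables_mangled: bool = False,
    have_api_key: bool = False,
    min_chars_per_page: int = 20,  # assumed threshold, tune for your documents
) -> str:
    """Pick an extraction script following the three workflow steps."""
    if not char_counts:
        raise ValueError("PDF has no pages")
    average = sum(char_counts) / len(char_counts)
    if average < min_chars_per_page:
        # Step 3: little or no text layer, so the PDF is probably scanned.
        return MISTRAL_OCR if have_api_key else LOCAL_OCR
    if tables_mangled:
        # Step 2: a text layer exists but tables came out wrong.
        return PDFPLUMBER
    # Step 1: default to the fastest option.
    return PYMUPDF
```

In practice `tables_mangled` would come from inspecting the PyMuPDF output, either by an agent or by a person.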

Local Extraction (No API Required)

PyMuPDF - Fast General Extraction

Best for: Text-heavy PDFs, speed-critical workflows, basic structure preservation.

bash
uv run scripts/extract_pymupdf.py input.pdf output.md

The script outputs markdown with preserved headings and paragraphs. For LLM-optimized output, it uses pymupdf4llm which formats text for RAG systems.
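
The source of `extract_pymupdf.py` is not shown here, so the following is only a sketch of how a conversion step built on pymupdf4llm might look. The function name `pdf_to_markdown` and its injectable `to_markdown` parameter are assumptions made for illustration; `pymupdf4llm.to_markdown` is the library call that returns a markdown string for a PDF path.

```python
from pathlib import Path
from typing import Callable, Optional


def pdf_to_markdown(
    pdf_path: str,
    out_path: str,
    to_markdown: Optional[Callable[[str], str]] = None,
) -> int:
    """Convert a PDF to markdown and write it to out_path.

    Returns the number of characters written. The converter is injectable
    (a hypothetical design choice) so the function can be exercised without
    a real PDF.
    """
    if to_markdown is None:
        import pymupdf4llm  # imported lazily so the module loads without it

        to_markdown = pymupdf4llm.to_markdown
    markdown = to_markdown(str(pdf_path))
    Path(out_path).write_text(markdown, encoding="utf-8")
    return len(markdown)
```

A call such as `pdf_to_markdown("input.pdf", "output.md")` would mirror the command above, assuming pymupdf4llm is installed.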

pdfplumber - Table Extraction

Best for: PDFs with tables, financial documents, structured data.

bash
uv run scripts/extract_pdfplumber.py input.pdf output.md

Tables are converted to markdown format. Note: pdfplumber works best on machine-generated PDFs, not scanned documents.
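
As a rough illustration of the table-to-markdown step, the sketch below renders the row lists that `page.extract_tables()` returns as a pipe table. It is not the skill's script: the function names are hypothetical, and treating the first row as the header is an assumption that will not hold for every PDF.

```python
from typing import List, Optional, Sequence


def table_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render rows as a markdown pipe table, assuming the first row is the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell: Optional[str]) -> str:
        # pdfplumber yields None for empty cells; newlines and pipes would break a row.
        return (cell or "").replace("\n", " ").replace("|", "\\|").strip()

    def line(cells: Sequence[Optional[str]]) -> str:
        padded = list(cells) + [None] * (width - len(cells))
        return "| " + " | ".join(clean(c) for c in padded) + " |"

    header, *body = rows
    lines = [line(header), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(line(row) for row in body)
    return "\n".join(lines)


def extract_tables_as_markdown(pdf_path: str) -> List[str]:
    """Extract every table in the PDF and render each one as markdown."""
    import pdfplumber  # imported lazily so table_to_markdown works without it

    with pdfplumber.open(pdf_path) as pdf:
        return [
            table_to_markdown(table)
            for page in pdf.pages
            for table in page.extract_tables()
        ]
```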

Local OCR - Scanned Documents

Best for: Scanned PDFs when API access is unavailable.

bash
uv run scripts/extract_with_ocr.py input.pdf output.txt

Requires: pytesseract, pdf2image, and Tesseract installed (brew install tesseract on macOS).
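
A local OCR pass with these dependencies might look like the sketch below, which rasterizes each page with pdf2image and runs Tesseract on the result. The function names, the page separator format, and the default `dpi` and `lang` values are assumptions, not details taken from `extract_with_ocr.py`. Note that pdf2image also relies on the poppler utilities being installed.

```python
from typing import Iterable


def join_pages(page_texts: Iterable[str]) -> str:
    """Join per-page OCR text with a visible page marker (illustrative format)."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(parts) + "\n"


def ocr_pdf(pdf_path: str, dpi: int = 300, lang: str = "eng") -> str:
    """Rasterize each page and OCR it with Tesseract."""
    # Lazy imports: both packages need system binaries (poppler, tesseract).
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)
    return join_pages(pytesseract.image_to_string(image, lang=lang) for image in images)
```

Higher `dpi` values tend to help with small print at the cost of speed and memory.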

API-Based Extraction

Mistral OCR API

Best for: Complex layouts, scanned documents, highest accuracy, multilingual content, math formulas.

Pricing: ~1000 pages per dollar (very cost-effective)

bash
export MISTRAL_API_KEY="your-key"
uv run scripts/extract_mistral_ocr.py input.pdf output.md

Features:

  • Outputs clean markdown
  • Preserves document structure (headings, lists, tables)
  • Handles images, math equations, multilingual text
  • 95%+ accuracy on complex documents

For detailed API options and other services, see references/api-services.md.
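
To make the pricing figure above concrete, here is a small estimator based on the quoted rate of roughly 1000 pages per dollar. The rate is approximate and may change, so treat the constant as an assumption and check current pricing before budgeting; the function name is hypothetical.

```python
PAGES_PER_DOLLAR = 1000  # approximate rate quoted in this document, not a guaranteed price


def estimate_cost_usd(page_count: int, pages_per_dollar: float = PAGES_PER_DOLLAR) -> float:
    """Estimate OCR cost in US dollars for a given number of pages."""
    if page_count < 0 or pages_per_dollar <= 0:
        raise ValueError("page_count must be >= 0 and pages_per_dollar must be > 0")
    return page_count / pages_per_dollar
```

At this rate a 250-page statement archive would cost about a quarter of a dollar.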

Output Format Recommendations

For LLM consumption, markdown is preferred:

  • Preserves semantic structure (headings become context boundaries)
  • Tables remain readable
  • Compatible with most RAG chunking strategies

For detailed comparisons of local tools, see references/local-tools.md.
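
The point about headings acting as context boundaries can be illustrated with a simple heading-based splitter. This is one possible chunking strategy under stated assumptions (ATX-style `#` headings, triple-backtick code fences), not something the skill ships; the function name is hypothetical.

```python
import re
from typing import List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # code fence marker


def split_by_headings(markdown: str) -> List[Tuple[str, str]]:
    """Split markdown into (heading, body) chunks.

    Text before the first heading gets an empty heading. Lines inside
    fenced code blocks are never treated as headings.
    """
    chunks: List[Tuple[str, str]] = []
    heading = ""
    body: List[str] = []
    in_fence = False

    def flush() -> None:
        # Skip the empty chunk that would otherwise precede the first heading.
        if heading or any(line.strip() for line in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading, so a retriever can show where a passage came from.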

FAQ & Installation Steps

Frequently Asked Questions

What is extracting-pdf-text?

extracting-pdf-text is a skill for extracting text from PDFs using tools like PyMuPDF, pdfplumber, and Mistral OCR for LLM consumption. It suits language model agents that need PDF text for natural language processing tasks.

How do I install extracting-pdf-text?

Run the command: npx killer-skills add miwtoo/credit-card-extraction/extracting-pdf-text. It works with Cursor, Windsurf, VS Code, Claude Code, and other supported IDEs (19+ platforms in total).

What are the use cases for extracting-pdf-text?

Key use cases include: Extracting text from simple text PDFs for language model training data, Parsing tables from PDFs using pdfplumber for data analysis, Performing OCR on scanned/image PDFs with pytesseract for text extraction.

Which IDEs are compatible with extracting-pdf-text?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

Are there any limitations for extracting-pdf-text?

Requires Python environment. Limited to PDF formats. Dependent on library compatibility (PyMuPDF, pdfplumber, pytesseract).

How To Install

  1. Open your terminal

    Open the terminal or command line in your project directory.

  2. Run the install command

    Run: npx killer-skills add miwtoo/credit-card-extraction/extracting-pdf-text. The CLI will automatically detect your IDE or AI agent and configure the skill.

  3. Start using the skill

    The skill is now active. Your AI agent can use extracting-pdf-text immediately in the current project.
