PDF Processing Guide
Overview
This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see reference.md. If you need to fill out a PDF form, read forms.md and follow its instructions.
Quick Start
```python
from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()
```
Python Libraries
pypdf - Basic Operations
Merge PDFs
```python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
```
Split PDF
```python
from pypdf import PdfReader, PdfWriter

# Write each page of input.pdf to its own file
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
```
Extract Metadata
```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata  # None when the file carries no document info

if meta is not None:
    print(f"Title: {meta.title}")
    print(f"Author: {meta.author}")
    print(f"Subject: {meta.subject}")
    print(f"Creator: {meta.creator}")
```
Rotate Pages
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)
```
pdfplumber - Text and Table Extraction
Extract Text with Layout
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
```
Extract Tables
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
```
Advanced Table Extraction
```python
import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:  # Check if table is not empty
                # First row is treated as the header
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

# Combine all tables (writing .xlsx requires an Excel engine such as openpyxl)
if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    combined_df.to_excel("extracted_tables.xlsx", index=False)
```
reportlab - Create PDFs
Basic PDF Creation
```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter

# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")

# Add a line
c.line(100, height - 140, 400, height - 140)

# Save
c.save()
```
Create PDF with Multiple Pages
```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))

body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())

# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))

# Build PDF
doc.build(story)
```
Command-Line Tools
pdftotext (poppler-utils)
```bash
# Extract text
pdftotext input.pdf output.txt

# Extract text preserving layout
pdftotext -layout input.pdf output.txt

# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt  # Pages 1-5
```
qpdf
```bash
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf

# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1  # Rotate page 1 by 90 degrees

# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
```
pdftk (if available)
```bash
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf

# Split
pdftk input.pdf burst

# Rotate
pdftk input.pdf rotate 1east output rotated.pdf
```
Common Tasks
Extract Text from Scanned PDFs
```python
# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path

# Convert PDF to images
images = convert_from_path('scanned.pdf')

# OCR each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)
```
Add Watermark
```python
from pypdf import PdfReader, PdfWriter

# Create watermark (or load existing)
watermark = PdfReader("watermark.pdf").pages[0]

# Apply to all pages
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)
```
Extract Images
```bash
# Using pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix

# With -j, JPEG-encoded images are written as output_prefix-000.jpg,
# output_prefix-001.jpg, etc.; other images are written in PNM formats.
```
Password Protection
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# Add password
writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)
```
Quick Reference
| Task | Best Tool | Command/Code |
|---|---|---|
| Merge PDFs | pypdf | writer.add_page(page) |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | page.extract_text() |
| Extract tables | pdfplumber | page.extract_tables() |
| Create PDFs | reportlab | Canvas or Platypus |
| Command line merge | qpdf | qpdf --empty --pages ... |
| OCR scanned PDFs | pytesseract | Convert to image first |
| Fill PDF forms | pdf-lib or pypdf (see forms.md) | See forms.md |
Next Steps
- For advanced pypdfium2 usage, see reference.md
- For JavaScript libraries (pdf-lib), see reference.md
- If you need to fill out a PDF form, follow the instructions in forms.md
PDF Processing v1.1 - Enhanced
🔄 Workflow
Source: pypdf Documentation & pdfplumber Guide
Stage 1: Analysis & Selection
- Library Choice: Use pdfplumber for text extraction, pypdf for manipulation (merge/split), and reportlab for creation.
- Encryption Check: Is the file encrypted? Check reader.is_encrypted.
- Layout Inspection: Is the PDF text-based or image-based (scanned)? Will OCR be needed?
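The Stage 1 checks can be sketched as one small helper. This is an illustrative sketch, not part of pypdf: the name classify_pdf, the min_chars_per_page parameter, its default of 20 characters, and the "at least half the pages" rule are all assumptions made for this example. The helper relies only on is_encrypted, pages, and extract_text(), which a pypdf PdfReader provides.

```python
def classify_pdf(reader, min_chars_per_page=20):
    """Label a pypdf-style reader as 'encrypted', 'empty', 'text-based', or 'image-based'.

    `reader` is assumed to expose `is_encrypted`, `pages`, and a per-page
    `extract_text()`, as pypdf's PdfReader does. The threshold is a guess,
    not a library default.
    """
    # Check encryption first: pages of a locked file cannot be read yet.
    if reader.is_encrypted:
        return "encrypted"

    total = 0
    with_text = 0
    for page in reader.pages:
        total += 1
        text = page.extract_text() or ""  # tolerate extractors that return None
        if len(text.strip()) >= min_chars_per_page:
            with_text += 1

    if total == 0:
        return "empty"
    # Hypothetical rule: text-based when at least half of the pages carry text.
    return "text-based" if with_text * 2 >= total else "image-based"
```

With pypdf installed, classify_pdf(PdfReader("document.pdf")) would return one of the four labels. An "image-based" result suggests taking the OCR route from Extract Text from Scanned PDFs.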
Stage 2: Processing
- Extraction: Process text page by page (lazy loading) to avoid bloating memory.
- Table Handling: Use extract_tables() for table extraction and convert the result to a pandas DataFrame.
- Transformation: Perform page rotation or merging in memory on a PdfWriter object.
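The page-by-page extraction advice above can be sketched with a generator. The function names iter_page_text and write_text_incrementally, and the page separator format, are hypothetical and chosen for this example. The only assumption about each page object is an extract_text() method, which both pdfplumber and pypdf pages provide.

```python
def iter_page_text(pages):
    """Yield (page_number, text) one page at a time instead of building one big string."""
    for number, page in enumerate(pages, start=1):
        # Fall back to an empty string if the extractor returns None.
        yield number, page.extract_text() or ""


def write_text_incrementally(pages, out_path):
    """Stream extracted text to disk so only one page's text is held at a time.

    Returns the number of pages written.
    """
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for number, text in iter_page_text(pages):
            out.write(f"--- Page {number} ---\n{text}\n")
            written += 1
    return written
```

One way to use it is to pass pdf.pages from inside a pdfplumber.open("document.pdf") block, so the text is written out as each page is processed.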
Stage 3: Validation & Output
- Integrity: Is the generated PDF corrupt? (Run a simple PdfReader(output) test.)
- Metadata: Clean or update metadata such as Author and Subject.
- Access: Encrypt PDFs that contain sensitive data (encrypt).
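Before the PdfReader(output) round-trip, a cheaper structural pre-check can catch truncated or empty output files. The sketch below is an assumption-laden illustration, not a pypdf feature: looks_like_pdf and tail_bytes are hypothetical names, and the check requires the %PDF- header at byte 0, which is stricter than some readers are. It does not replace parsing the file with PdfReader.

```python
import os


def looks_like_pdf(path, tail_bytes=1024):
    """Return True if the file starts with '%PDF-' and has '%%EOF' near its end.

    Only the first 5 bytes and the last `tail_bytes` bytes are read, so the
    check stays cheap for large files. A True result does not prove the
    file is a valid PDF.
    """
    with open(path, "rb") as f:
        header = f.read(5)
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - tail_bytes))
        tail = f.read()
    return header == b"%PDF-" and b"%%EOF" in tail
```

A file that fails this check can be rejected straight away. A file that passes should still be opened with PdfReader and its page count compared against the expected value.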
Checkpoints
| Stage | Verification |
|---|---|
| 1 | If OCR is needed, is tesseract installed? |
| 2 | Are there column shifts (misalignment) in the extracted tables? |
| 3 | Do memory leaks occur with large files (100MB+)? |
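The three checkpoints can be approximated with standard-library helpers. The function names below are hypothetical, and each check is a rough proxy under stated assumptions: the first only looks for a tesseract executable on PATH, the second only flags rows whose cell count differs from the header row, and the third reports peak memory as seen by tracemalloc, which tracks Python allocations and may miss memory held by native libraries.

```python
import shutil
import tracemalloc


def tesseract_available():
    """Checkpoint 1: is a `tesseract` executable on PATH?"""
    return shutil.which("tesseract") is not None


def find_ragged_rows(table):
    """Checkpoint 2: indexes of rows whose cell count differs from the header row.

    `table` is a list of rows (lists of cells), the shape returned for each
    table by pdfplumber's extract_tables().
    """
    if not table:
        return []
    expected = len(table[0])
    return [i for i, row in enumerate(table) if len(row) != expected]


def peak_memory(func, *args, **kwargs):
    """Checkpoint 3: run `func` and return (result, peak traced bytes)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak
```

Comparing the peak value across runs on increasingly large inputs gives a first hint of unbounded growth. A steadily rising peak for page-by-page processing would warrant a closer look with a dedicated profiler.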