content-hash-cache-pattern

Verified · v1.0.0 · GitHub
About this Skill

Ideal for Data Processing Agents handling expensive file operations like PDF parsing and text extraction. content-hash-cache-pattern is an MCP skill that caches expensive file processing results using SHA-256 content hashes as cache keys. This provides a path-independent cache that survives file moves/renames and automatically invalidates when file content changes.

Features

  • Uses SHA-256 content hashes as cache keys for deterministic lookups
  • Provides path-independent caching that survives file moves and renames
  • Auto-invalidates cache entries when source file content changes
  • Designed for caching expensive PDF parsing and text extraction results
  • Enables implementation of --cache/--no-cache CLI options
  • Separates caching logic into a service layer for cleaner architecture

Author: affaan-m
Updated: 3/6/2026

Quality Score

86 — Excellent (Top 5%), based on code quality & docs
Installation

Universal install (auto-detects Cursor, Windsurf, and VS Code):

> npx killer-skills add affaan-m/everything-claude-code/content-hash-cache-pattern

Agent Capability Analysis

The content-hash-cache-pattern MCP server by affaan-m is an open-source integration for Claude and other AI agents, enabling seamless task automation and capability expansion.

Ideal Agent Persona

Ideal for Data Processing Agents handling expensive file operations like PDF parsing and text extraction.

Core Value

Enables agents to implement a robust caching system using SHA-256 content hashes instead of file paths, ensuring cache persistence through file moves/renames and automatic invalidation when content changes. This pattern separates caching logic from service layers and is particularly valuable for building efficient file processing pipelines with optional --cache/--no-cache CLI control.

Capabilities Granted for content-hash-cache-pattern MCP Server

Optimizing PDF parsing pipelines
Accelerating text extraction workflows
Building content-based cache systems
Implementing CLI tools with cache control

! Prerequisites & Limits

  • Adds SHA-256 hash computation overhead on every lookup
  • Depends on filesystem access to read file contents
  • Needs a cache storage management implementation
Project files: SKILL.md (5.3 KB), .cursorrules (1.2 KB), package.json (240 B)

SKILL.md

Content-Hash File Cache Pattern

Cache expensive file processing results (PDF parsing, text extraction, image analysis) using SHA-256 content hashes as cache keys. Unlike path-based caching, this approach survives file moves/renames and auto-invalidates when content changes.

When to Activate

  • Building file processing pipelines (PDF, images, text extraction)
  • Processing cost is high and same files are processed repeatedly
  • Need a --cache/--no-cache CLI option
  • Want to add caching to existing pure functions without modifying them

Core Pattern

1. Content-Hash Based Cache Key

Use file content (not path) as the cache key:

```python
import hashlib
from pathlib import Path

_HASH_CHUNK_SIZE = 65536  # 64KB chunks for large files

def compute_file_hash(path: Path) -> str:
    """SHA-256 of file contents (chunked for large files)."""
    if not path.is_file():
        raise FileNotFoundError(f"File not found: {path}")
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(_HASH_CHUNK_SIZE)
            if not chunk:
                break
            sha256.update(chunk)
    return sha256.hexdigest()
```

Why content hash? File rename/move = cache hit. Content change = automatic invalidation. No index file needed.
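Both properties can be checked directly. A minimal self-contained sketch (inlining the chunked hash helper and using a temporary directory with made-up file contents):

```python
import hashlib
import tempfile
from pathlib import Path

def compute_file_hash(path: Path) -> str:
    # Same chunked SHA-256 helper as above, inlined so the demo stands alone
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            sha256.update(chunk)
    return sha256.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "report.pdf"
    original.write_bytes(b"fake pdf bytes")
    h1 = compute_file_hash(original)

    renamed = original.rename(Path(tmp) / "renamed.pdf")
    assert compute_file_hash(renamed) == h1   # move/rename: same key, cache hit

    renamed.write_bytes(b"edited pdf bytes")
    assert compute_file_hash(renamed) != h1   # content change: key changes, auto-invalidation
```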

2. Frozen Dataclass for Cache Entry

```python
from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class CacheEntry:
    file_hash: str
    source_path: str
    document: ExtractedDocument  # The cached result
```

3. File-Based Cache Storage

Each cache entry is stored as {hash}.json — O(1) lookup by hash, no index file required.

```python
import json
from pathlib import Path

def write_cache(cache_dir: Path, entry: CacheEntry) -> None:
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{entry.file_hash}.json"
    data = serialize_entry(entry)
    cache_file.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")

def read_cache(cache_dir: Path, file_hash: str) -> CacheEntry | None:
    cache_file = cache_dir / f"{file_hash}.json"
    if not cache_file.is_file():
        return None
    try:
        raw = cache_file.read_text(encoding="utf-8")
        data = json.loads(raw)
        return deserialize_entry(data)
    except (json.JSONDecodeError, ValueError, KeyError):
        return None  # Treat corruption as cache miss
```

4. Service Layer Wrapper (SRP)

Keep the processing function pure. Add caching as a separate service layer.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

def extract_with_cache(
    file_path: Path,
    *,
    cache_enabled: bool = True,
    cache_dir: Path = Path(".cache"),
) -> ExtractedDocument:
    """Service layer: cache check -> extraction -> cache write."""
    if not cache_enabled:
        return extract_text(file_path)  # Pure function, no cache knowledge

    file_hash = compute_file_hash(file_path)

    # Check cache
    cached = read_cache(cache_dir, file_hash)
    if cached is not None:
        logger.info("Cache hit: %s (hash=%s)", file_path.name, file_hash[:12])
        return cached.document

    # Cache miss -> extract -> store
    logger.info("Cache miss: %s (hash=%s)", file_path.name, file_hash[:12])
    doc = extract_text(file_path)
    entry = CacheEntry(file_hash=file_hash, source_path=str(file_path), document=doc)
    write_cache(cache_dir, entry)
    return doc
```
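One way to expose the --cache/--no-cache option mentioned earlier is argparse's BooleanOptionalAction, which generates both flags from a single argument. This is a sketch, not part of the skill; the extract_with_cache call is left commented because it depends on the skill's own functions:

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser(description="Extract text with optional caching")
parser.add_argument("file", type=Path)
# BooleanOptionalAction (Python 3.9+) creates both --cache and --no-cache
parser.add_argument("--cache", action=argparse.BooleanOptionalAction, default=True)
parser.add_argument("--cache-dir", type=Path, default=Path(".cache"))

args = parser.parse_args(["report.pdf", "--no-cache"])
assert args.cache is False  # --no-cache flips the default
# doc = extract_with_cache(args.file, cache_enabled=args.cache, cache_dir=args.cache_dir)
```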

Key Design Decisions

| Decision | Rationale |
| --- | --- |
| SHA-256 content hash | Path-independent, auto-invalidates on content change |
| `{hash}.json` file naming | O(1) lookup, no index file needed |
| Service layer wrapper | SRP: extraction stays pure, cache is a separate concern |
| Manual JSON serialization | Full control over frozen dataclass serialization |
| Corruption returns None | Graceful degradation, re-processes on next run |
| cache_dir.mkdir(parents=True) | Lazy directory creation on first write |

Best Practices

  • Hash content, not paths — paths change, content identity doesn't
  • Chunk large files when hashing — avoid loading entire files into memory
  • Keep processing functions pure — they should know nothing about caching
  • Log cache hit/miss with truncated hashes for debugging
  • Handle corruption gracefully — treat invalid cache entries as misses, never crash

Anti-Patterns to Avoid

```python
# BAD: Path-based caching (breaks on file move/rename)
cache = {"/path/to/file.pdf": result}

# BAD: Adding cache logic inside the processing function (SRP violation)
def extract_text(path, *, cache_enabled=False, cache_dir=None):
    if cache_enabled:  # Now this function has two responsibilities
        ...

# BAD: Using dataclasses.asdict() with nested frozen dataclasses
# (can cause issues with complex nested types)
data = dataclasses.asdict(entry)  # Use manual serialization instead
```

When to Use

  • File processing pipelines (PDF parsing, OCR, text extraction, image analysis)
  • CLI tools that benefit from --cache/--no-cache options
  • Batch processing where the same files appear across runs
  • Adding caching to existing pure functions without modifying them
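Caching can be bolted onto an existing pure function without modifying it via a decorator. The content_cached helper below is a hypothetical sketch (not from the skill) that keys the cache on the file's content hash and assumes JSON-serializable results:

```python
import functools
import hashlib
import json
import tempfile
from pathlib import Path

def content_cached(cache_dir: Path):
    """Decorator: cache a pure file-processing function by content hash."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(path: Path):
            sha256 = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    sha256.update(chunk)
            cache_file = cache_dir / f"{sha256.hexdigest()}.json"
            if cache_file.is_file():
                return json.loads(cache_file.read_text(encoding="utf-8"))
            result = fn(path)  # the wrapped function stays pure
            cache_dir.mkdir(parents=True, exist_ok=True)
            cache_file.write_text(json.dumps(result), encoding="utf-8")
            return result
        return wrapper
    return decorator

with tempfile.TemporaryDirectory() as tmp:
    calls = []

    @content_cached(Path(tmp) / "cache")
    def extract_text(path: Path) -> dict:
        calls.append(path)  # track how often the expensive work actually runs
        return {"text": path.read_text()}

    doc = Path(tmp) / "a.txt"
    doc.write_text("hello")
    assert extract_text(doc) == {"text": "hello"}
    assert extract_text(doc) == {"text": "hello"}  # second call served from cache
    assert len(calls) == 1
```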

When NOT to Use

  • Data that must always be fresh (real-time feeds)
  • Cache entries that would be extremely large (consider streaming instead)
  • Results that depend on parameters beyond file content (e.g., different extraction configs)
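When results do depend on parameters beyond file content, one workaround (an assumption, not part of the skill) is to fold a canonical serialization of the config into the cache key alongside the content hash, so each config gets its own entry. The cache_key helper here is hypothetical:

```python
import hashlib
import json

def cache_key(content_hash: str, config: dict) -> str:
    # Canonical JSON (sorted keys) so logically equal configs hash identically
    config_blob = json.dumps(config, sort_keys=True).encode("utf-8")
    config_hash = hashlib.sha256(config_blob).hexdigest()
    return f"{content_hash}-{config_hash[:12]}"

k1 = cache_key("abc", {"ocr": True, "lang": "en"})
k2 = cache_key("abc", {"lang": "en", "ocr": True})  # same config, different key order
k3 = cache_key("abc", {"ocr": False, "lang": "en"})
assert k1 == k2  # key order doesn't matter
assert k1 != k3  # different config -> different cache entry
```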

Related Skills

Looking for an alternative to content-hash-cache-pattern, or building an official AI agent? Explore these related open-source MCP servers.

  • flags (facebook) — a feature flag management system that enables developers to check flag states, compare channels, and debug feature behavior differences across release channels.
  • extract-errors (facebook) — extracts and manages error codes in React applications using the yarn extract-errors command.
  • fix (facebook) — resolves lint errors and formatting issues, and ensures code quality in declarative, frontend, and UI projects.
  • flow (facebook) — a type-checking system for JavaScript, used to validate React code and ensure consistency across applications.