content-hash-cache-pattern

Verified · v1.0.0 · GitHub
About this Skill

Ideal for Data Processing Agents handling expensive file operations like PDF parsing and text extraction. content-hash-cache-pattern is an MCP skill that caches expensive file processing results using SHA-256 content hashes as cache keys. This provides a path-independent cache that survives file moves/renames and automatically invalidates when file content changes.

Features

  • Uses SHA-256 content hashes as cache keys for deterministic lookups
  • Provides path-independent caching that survives file moves and renames
  • Auto-invalidates cache entries when source file content changes
  • Designed for caching expensive PDF parsing and text extraction results
  • Enables implementation of --cache/--no-cache CLI options
  • Separates caching logic into a service layer for cleaner architecture

Author: affaan-m
Updated: 3/6/2026

Quality Score

86 — Excellent (Top 5%), based on code quality & docs
Installation

Universal install (auto-detects Cursor, Windsurf, and VS Code):

> npx killer-skills add affaan-m/everything-claude-code/content-hash-cache-pattern

Agent Capability Analysis

The content-hash-cache-pattern MCP server by affaan-m is an open-source integration for Claude and other AI agents, enabling seamless task automation and capability expansion.

Ideal Agent Persona

Ideal for Data Processing Agents handling expensive file operations like PDF parsing and text extraction.

Core Value

Enables agents to implement a robust caching system using SHA-256 content hashes instead of file paths, ensuring cache persistence through file moves/renames and automatic invalidation when content changes. This pattern separates caching logic from service layers and is particularly valuable for building efficient file processing pipelines with optional --cache/--no-cache CLI control.

Capabilities Granted for content-hash-cache-pattern MCP Server

Optimizing PDF parsing pipelines
Accelerating text extraction workflows
Building content-based cache systems
Implementing CLI tools with cache control

! Prerequisites & Limits

  • Adds SHA-256 hash computation overhead on every lookup
  • Depends on filesystem access to read file contents
  • Needs a cache storage management implementation
Project files: SKILL.md (5.3 KB), .cursorrules (1.2 KB), package.json (240 B)

SKILL.md

Content-Hash File Cache Pattern

Cache expensive file processing results (PDF parsing, text extraction, image analysis) using SHA-256 content hashes as cache keys. Unlike path-based caching, this approach survives file moves/renames and auto-invalidates when content changes.

When to Activate

  • Building file processing pipelines (PDF, images, text extraction)
  • Processing cost is high and same files are processed repeatedly
  • Need a --cache/--no-cache CLI option
  • Want to add caching to existing pure functions without modifying them

Core Pattern

1. Content-Hash Based Cache Key

Use file content (not path) as the cache key:

```python
import hashlib
from pathlib import Path

_HASH_CHUNK_SIZE = 65536  # 64KB chunks for large files

def compute_file_hash(path: Path) -> str:
    """SHA-256 of file contents (chunked for large files)."""
    if not path.is_file():
        raise FileNotFoundError(f"File not found: {path}")
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(_HASH_CHUNK_SIZE)
            if not chunk:
                break
            sha256.update(chunk)
    return sha256.hexdigest()
```

Why content hash? File rename/move = cache hit. Content change = automatic invalidation. No index file needed.
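Both properties can be checked directly. A minimal self-contained sketch (inlining the chunked hash helper and using a temporary directory with made-up file contents):

```python
import hashlib
import tempfile
from pathlib import Path

def compute_file_hash(path: Path) -> str:
    # Same chunked SHA-256 helper as above, inlined so the demo stands alone
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            sha256.update(chunk)
    return sha256.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "report.pdf"
    original.write_bytes(b"fake pdf bytes")
    h1 = compute_file_hash(original)

    renamed = original.rename(Path(tmp) / "renamed.pdf")
    assert compute_file_hash(renamed) == h1   # move/rename: same key, cache hit

    renamed.write_bytes(b"edited pdf bytes")
    assert compute_file_hash(renamed) != h1   # content change: key changes, auto-invalidation
```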

2. Frozen Dataclass for Cache Entry

```python
from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class CacheEntry:
    file_hash: str
    source_path: str
    document: ExtractedDocument  # The cached result
```

3. File-Based Cache Storage

Each cache entry is stored as {hash}.json — O(1) lookup by hash, no index file required.

```python
import json
from pathlib import Path

def write_cache(cache_dir: Path, entry: CacheEntry) -> None:
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{entry.file_hash}.json"
    data = serialize_entry(entry)
    cache_file.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")

def read_cache(cache_dir: Path, file_hash: str) -> CacheEntry | None:
    cache_file = cache_dir / f"{file_hash}.json"
    if not cache_file.is_file():
        return None
    try:
        raw = cache_file.read_text(encoding="utf-8")
        data = json.loads(raw)
        return deserialize_entry(data)
    except (json.JSONDecodeError, ValueError, KeyError):
        return None  # Treat corruption as cache miss
```

4. Service Layer Wrapper (SRP)

Keep the processing function pure. Add caching as a separate service layer.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

def extract_with_cache(
    file_path: Path,
    *,
    cache_enabled: bool = True,
    cache_dir: Path = Path(".cache"),
) -> ExtractedDocument:
    """Service layer: cache check -> extraction -> cache write."""
    if not cache_enabled:
        return extract_text(file_path)  # Pure function, no cache knowledge

    file_hash = compute_file_hash(file_path)

    # Check cache
    cached = read_cache(cache_dir, file_hash)
    if cached is not None:
        logger.info("Cache hit: %s (hash=%s)", file_path.name, file_hash[:12])
        return cached.document

    # Cache miss -> extract -> store
    logger.info("Cache miss: %s (hash=%s)", file_path.name, file_hash[:12])
    doc = extract_text(file_path)
    entry = CacheEntry(file_hash=file_hash, source_path=str(file_path), document=doc)
    write_cache(cache_dir, entry)
    return doc
```
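One way to expose the --cache/--no-cache option mentioned earlier is argparse's BooleanOptionalAction, which generates both flags from a single argument. This is a sketch, not part of the skill; the extract_with_cache call is left commented because it depends on the skill's own functions:

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser(description="Extract text with optional caching")
parser.add_argument("file", type=Path)
# BooleanOptionalAction (Python 3.9+) creates both --cache and --no-cache
parser.add_argument("--cache", action=argparse.BooleanOptionalAction, default=True)
parser.add_argument("--cache-dir", type=Path, default=Path(".cache"))

args = parser.parse_args(["report.pdf", "--no-cache"])
assert args.cache is False  # --no-cache flips the default
# doc = extract_with_cache(args.file, cache_enabled=args.cache, cache_dir=args.cache_dir)
```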

Key Design Decisions

| Decision | Rationale |
| --- | --- |
| SHA-256 content hash | Path-independent, auto-invalidates on content change |
| `{hash}.json` file naming | O(1) lookup, no index file needed |
| Service layer wrapper | SRP: extraction stays pure, cache is a separate concern |
| Manual JSON serialization | Full control over frozen dataclass serialization |
| Corruption returns None | Graceful degradation, re-processes on next run |
| cache_dir.mkdir(parents=True) | Lazy directory creation on first write |

Best Practices

  • Hash content, not paths — paths change, content identity doesn't
  • Chunk large files when hashing — avoid loading entire files into memory
  • Keep processing functions pure — they should know nothing about caching
  • Log cache hit/miss with truncated hashes for debugging
  • Handle corruption gracefully — treat invalid cache entries as misses, never crash

Anti-Patterns to Avoid

```python
# BAD: Path-based caching (breaks on file move/rename)
cache = {"/path/to/file.pdf": result}

# BAD: Adding cache logic inside the processing function (SRP violation)
def extract_text(path, *, cache_enabled=False, cache_dir=None):
    if cache_enabled:  # Now this function has two responsibilities
        ...

# BAD: Using dataclasses.asdict() with nested frozen dataclasses
# (can cause issues with complex nested types)
data = dataclasses.asdict(entry)  # Use manual serialization instead
```

When to Use

  • File processing pipelines (PDF parsing, OCR, text extraction, image analysis)
  • CLI tools that benefit from --cache/--no-cache options
  • Batch processing where the same files appear across runs
  • Adding caching to existing pure functions without modifying them
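Caching can be bolted onto an existing pure function without modifying it via a decorator. The content_cached helper below is a hypothetical sketch (not from the skill) that keys the cache on the file's content hash and assumes JSON-serializable results:

```python
import functools
import hashlib
import json
import tempfile
from pathlib import Path

def content_cached(cache_dir: Path):
    """Decorator: cache a pure file-processing function by content hash."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(path: Path):
            sha256 = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    sha256.update(chunk)
            cache_file = cache_dir / f"{sha256.hexdigest()}.json"
            if cache_file.is_file():
                return json.loads(cache_file.read_text(encoding="utf-8"))
            result = fn(path)  # the wrapped function stays pure
            cache_dir.mkdir(parents=True, exist_ok=True)
            cache_file.write_text(json.dumps(result), encoding="utf-8")
            return result
        return wrapper
    return decorator

with tempfile.TemporaryDirectory() as tmp:
    calls = []

    @content_cached(Path(tmp) / "cache")
    def extract_text(path: Path) -> dict:
        calls.append(path)  # track how often the expensive work actually runs
        return {"text": path.read_text()}

    doc = Path(tmp) / "a.txt"
    doc.write_text("hello")
    assert extract_text(doc) == {"text": "hello"}
    assert extract_text(doc) == {"text": "hello"}  # second call served from cache
    assert len(calls) == 1
```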

When NOT to Use

  • Data that must always be fresh (real-time feeds)
  • Cache entries that would be extremely large (consider streaming instead)
  • Results that depend on parameters beyond file content (e.g., different extraction configs)
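When results do depend on parameters beyond file content, one workaround (an assumption, not part of the skill) is to fold a canonical serialization of the config into the cache key alongside the content hash, so each config gets its own entry. The cache_key helper here is hypothetical:

```python
import hashlib
import json

def cache_key(content_hash: str, config: dict) -> str:
    # Canonical JSON (sorted keys) so logically equal configs hash identically
    config_blob = json.dumps(config, sort_keys=True).encode("utf-8")
    config_hash = hashlib.sha256(config_blob).hexdigest()
    return f"{content_hash}-{config_hash[:12]}"

k1 = cache_key("abc", {"ocr": True, "lang": "en"})
k2 = cache_key("abc", {"lang": "en", "ocr": True})  # same config, different key order
k3 = cache_key("abc", {"ocr": False, "lang": "en"})
assert k1 == k2  # key order doesn't matter
assert k1 != k3  # different config -> different cache entry
```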

Related Skills

Looking for an alternative to content-hash-cache-pattern, or building an official AI agent? Explore these related open-source MCP servers.

  • flags (facebook) — a feature flag management system that enables developers to check flag states, compare channels, and debug feature behavior differences across release channels.
  • extract-errors (facebook) — extracts and manages error codes in React applications using the yarn extract-errors command.
  • fix (facebook) — resolves lint errors and formatting issues, and ensures code quality in declarative, frontend, and UI projects.
  • flow (facebook) — a type-checking system for JavaScript, used to validate React code and ensure consistency across applications.