# Cost-Aware LLM Pipeline
Patterns for controlling LLM API costs while maintaining quality. Combines model routing, budget tracking, retry logic, and prompt caching into a composable pipeline.
## When to Activate
- Building applications that call LLM APIs (Claude, GPT, etc.)
- Processing batches of items with varying complexity
- Need to stay within a budget for API spend
- Optimizing cost without sacrificing quality on complex tasks
## Core Concepts

### 1. Model Routing by Task Complexity
Automatically select cheaper models for simple tasks, reserving expensive models for complex ones.
```python
MODEL_SONNET = "claude-sonnet-4-6"
MODEL_HAIKU = "claude-haiku-4-5-20251001"

_SONNET_TEXT_THRESHOLD = 10_000  # chars
_SONNET_ITEM_THRESHOLD = 30      # items

def select_model(
    text_length: int,
    item_count: int,
    force_model: str | None = None,
) -> str:
    """Select model based on task complexity."""
    if force_model is not None:
        return force_model
    if text_length >= _SONNET_TEXT_THRESHOLD or item_count >= _SONNET_ITEM_THRESHOLD:
        return MODEL_SONNET  # Complex task
    return MODEL_HAIKU  # Simple task (3-4x cheaper)
```
### 2. Immutable Cost Tracking
Track cumulative spend with frozen dataclasses. Each API call returns a new tracker — never mutates state.
```python
from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class CostRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float

@dataclass(frozen=True, slots=True)
class CostTracker:
    budget_limit: float = 1.00
    records: tuple[CostRecord, ...] = ()

    def add(self, record: CostRecord) -> "CostTracker":
        """Return new tracker with added record (never mutates self)."""
        return CostTracker(
            budget_limit=self.budget_limit,
            records=(*self.records, record),
        )

    @property
    def total_cost(self) -> float:
        return sum(r.cost_usd for r in self.records)

    @property
    def over_budget(self) -> bool:
        return self.total_cost > self.budget_limit
```
### 3. Narrow Retry Logic
Retry only on transient errors. Fail fast on authentication or bad request errors.
```python
import time

from anthropic import (
    APIConnectionError,
    InternalServerError,
    RateLimitError,
)

_RETRYABLE_ERRORS = (APIConnectionError, RateLimitError, InternalServerError)
_MAX_RETRIES = 3

def call_with_retry(func, *, max_retries: int = _MAX_RETRIES):
    """Retry only on transient errors, fail fast on others."""
    for attempt in range(max_retries):
        try:
            return func()
        except _RETRYABLE_ERRORS:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
    # AuthenticationError, BadRequestError, etc. propagate immediately
```
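To see the retry behavior without the SDK installed, here is a self-contained sketch using a stand-in `TransientError` (a hypothetical exception, not an `anthropic` class) and a short base delay:

```python
import time

class TransientError(Exception):
    """Stand-in for a retryable error (network, rate limit, 5xx)."""

def call_with_retry(func, *, max_retries: int = 3, base_delay: float = 0.01):
    """Retry only on TransientError; anything else propagates at once."""
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # Exponential backoff

attempts = 0

def flaky():
    # Fails twice, then succeeds -- mimics a transient outage
    global attempts
    attempts += 1
    if attempts < 3:
        raise TransientError("temporary failure")
    return "ok"

result = call_with_retry(flaky)  # succeeds on the third attempt
```

A non-transient exception raised by `func` is never caught, so a bad API key fails on the first attempt instead of burning three retries.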
### 4. Prompt Caching
Cache long system prompts to avoid resending them on every request.
```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # Cache this
            },
            {
                "type": "text",
                "text": user_input,  # Variable part
            },
        ],
    }
]
```
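The composition example calls a `build_cached_messages` helper; a plausible sketch, mirroring the structure above (the exact shape is an assumption, not the document's definition):

```python
def build_cached_messages(system_prompt: str, user_input: str) -> list[dict]:
    """Pair a cacheable static prompt with the variable user input."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": system_prompt,
                    "cache_control": {"type": "ephemeral"},  # static, cached
                },
                {"type": "text", "text": user_input},  # variable, not cached
            ],
        }
    ]

msgs = build_cached_messages("Long static instructions...", "Summarize item 42.")
```

Keeping the static prompt first and the variable text last matters: the cache matches on a prefix, so anything before the `cache_control` marker must be byte-identical across requests.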
## Composition
Combine all four techniques in a single pipeline function:
```python
def process(text: str, config: Config, tracker: CostTracker) -> tuple[Result, CostTracker]:
    # 1. Route model
    model = select_model(len(text), estimated_items, config.force_model)

    # 2. Check budget
    if tracker.over_budget:
        raise BudgetExceededError(tracker.total_cost, tracker.budget_limit)

    # 3. Call with retry + caching
    response = call_with_retry(lambda: client.messages.create(
        model=model,
        messages=build_cached_messages(system_prompt, text),
    ))

    # 4. Track cost (immutable)
    record = CostRecord(model=model, input_tokens=..., output_tokens=..., cost_usd=...)
    tracker = tracker.add(record)

    return parse_result(response), tracker
```
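`BudgetExceededError` is raised above but not defined anywhere in this document; a minimal sketch might look like:

```python
class BudgetExceededError(RuntimeError):
    """Raised before an API call once cumulative spend crosses the limit."""

    def __init__(self, total_cost: float, budget_limit: float) -> None:
        super().__init__(
            f"spent ${total_cost:.4f}, exceeding budget of ${budget_limit:.2f}"
        )
        self.total_cost = total_cost
        self.budget_limit = budget_limit

err = BudgetExceededError(1.25, 1.00)
```

Carrying the numbers as attributes (not just in the message) lets callers decide whether to stop the batch entirely or finish in-flight items.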
## Pricing Reference (2025-2026)
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Relative Cost |
|---|---|---|---|
| Haiku 4.5 | $0.80 | $4.00 | 1x |
| Sonnet 4.6 | $3.00 | $15.00 | ~4x |
| Opus 4.5 | $15.00 | $75.00 | ~19x |
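The table above can be turned into a small cost estimator. A sketch, with prices copied from the table (the dict and function names are illustrative, not an official API; Opus is omitted because its full model ID does not appear in this document):

```python
# USD per 1M tokens (input, output), from the pricing table above
PRICING = {
    "claude-haiku-4-5-20251001": (0.80, 4.00),
    "claude-sonnet-4-6": (3.00, 15.00),
}

def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Compute the dollar cost of one call from its token counts."""
    input_rate, output_rate = PRICING[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

cost = estimate_cost_usd("claude-sonnet-4-6", 10_000, 1_000)
```

A 10k-input, 1k-output call comes to $0.045 on Sonnet versus $0.012 on Haiku, which is the 3-4x gap the router exploits.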
## Best Practices
- Start with the cheapest model and only route to expensive models when complexity thresholds are met
- Set explicit budget limits before processing batches — fail early rather than overspend
- Log model selection decisions so you can tune thresholds based on real data
- Use prompt caching for system prompts over 1024 tokens — saves both cost and latency
- Never retry on authentication or validation errors — only transient failures (network, rate limit, server error)
## Anti-Patterns to Avoid
- Using the most expensive model for all requests regardless of complexity
- Retrying on all errors (wastes budget on permanent failures)
- Mutating cost tracking state (makes debugging and auditing difficult)
- Hardcoding model names throughout the codebase (use constants or config)
- Ignoring prompt caching for repetitive system prompts
## When to Use
- Any application calling Claude, OpenAI, or similar LLM APIs
- Batch processing pipelines where cost adds up quickly
- Multi-model architectures that need intelligent routing
- Production systems that need budget guardrails