Claude Prompt Caching Guide: Cut API Costs by Up to 90%

As you build production applications with Claude, you will notice a pattern: many API requests send exactly the same content at the start. Your system prompt is the same. The company knowledge base you include is the same. The background documentation is the same. Only the user's specific question at the end differs.
Without prompt caching, you pay to process those thousands of identical tokens on every single API call. With prompt caching, you pay full price once to cache that content, and then subsequent requests that reuse the same cached prefix cost 90% less for those tokens. For document-heavy applications, agent systems, and high-volume customer service deployments, this difference is transformative.
What is Claude Prompt Caching?
Claude prompt caching lets you mark stable sections of your prompt — typically the system prompt and large reference documents — with cache_control: {"type": "ephemeral"}. On the first request, Claude caches those sections. On every subsequent request that includes the same content in the same position, Claude reads from the cache at 10% of the standard input token cost. Cache entries live for at least 5 minutes, resetting each time they are accessed.
How Prompt Caching Works
Prompt caching stores a snapshot of specified parts of your prompt on Anthropic's infrastructure after the first request that uses those sections. When a subsequent request begins with the same cached content, Claude retrieves the cached state rather than reprocessing the tokens from scratch.
The pricing model:
- Cache write: 25% more expensive than standard input tokens (the first request that creates the cache)
- Cache read: 90% cheaper than standard input tokens (every subsequent request that hits the cache)
- Cache lifetime: Cached content is retained for a minimum of 5 minutes. Every time the cache is accessed, the lifetime resets for another 5 minutes
For a 100,000-token system prompt that is reused across 1,000 requests, the maths is compelling:
- Without caching: 1,000 × 100,000 tokens = 100 million tokens billed at input rate
- With caching: 1 × 100,000 at cache write rate + 999 × 100,000 at 10% of input rate = approximately 10 million equivalent tokens
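The arithmetic above can be checked directly. A quick sketch using the rates from the pricing model (1.25× for the cache write, 0.10× for each cache read):

```python
TOKENS = 100_000    # system prompt size in tokens
REQUESTS = 1_000    # total requests reusing the same prefix

# Without caching: every request pays full input price.
without_cache = REQUESTS * TOKENS  # 100,000,000 tokens

# With caching: one write at 1.25x, the remaining 999 reads at 0.10x.
with_cache = TOKENS * 1.25 + (REQUESTS - 1) * TOKENS * 0.10

print(f"{without_cache:,} vs {with_cache:,.0f} equivalent tokens")
```

That works out to 10,115,000 equivalent tokens against 100,000,000 — just under a 90% saving, which is where the headline number comes from.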
Marking Content for Caching
You enable caching on specific content blocks by adding a cache_control parameter with type: "ephemeral". This tells Claude to cache up to and including that content block.
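As a minimal sketch of what that looks like with the Python SDK's request shape (the model ID is an assumption — substitute whichever model you use), the cache_control field goes on the last stable content block:

```python
def build_request(system_text: str, question: str) -> dict:
    """Build a messages.create payload with the system prompt marked for caching."""
    return {
        "model": "claude-sonnet-4-20250514",  # assumed model ID; use your own
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_text,
                # Everything up to and including this block is cached.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # The variable part comes after the cached prefix.
        "messages": [{"role": "user", "content": question}],
    }

payload = build_request(
    "You are Acme's support assistant. Policies: ...",  # imagine several thousand tokens
    "How do I reset my password?",
)
```

With the official SDK you would send this as `client.messages.create(**payload)`; the first call writes the cache, and later requests with an identical prefix read from it.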
Caching Strategies for Different Applications
Strategy 1 — Cache the System Prompt Only
Best for: Applications with a fixed, large system prompt and variable user messages.
Strategy 2 — Cache a Long Document
Best for: Document Q&A systems where many users ask questions about the same document.
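A sketch of this pattern (model ID assumed, as before): the document goes in the user message as its own content block with cache_control, and only the trailing question block varies per request.

```python
def document_qa_payload(document_text: str, question: str) -> dict:
    """Cache a large reference document; only the question changes per request."""
    return {
        "model": "claude-sonnet-4-20250514",  # assumed model ID
        "max_tokens": 1024,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": document_text,
                        # The cached prefix ends here.
                        "cache_control": {"type": "ephemeral"},
                    },
                    # The question sits after the breakpoint and is never cached.
                    {"type": "text", "text": question},
                ],
            }
        ],
    }
```

Every user asking about the same document then reads the document portion from cache, provided the document block is byte-for-byte identical across requests.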
Cache the Stable Parts, Not the Variable Parts
The key to effective caching is placing the cache_control marker at the boundary between stable and variable content. Everything before the marker gets cached. The user's specific question — which is different on every request — comes after the marker and is not cached. Only add cache_control to the parts of your prompt that are genuinely stable across multiple requests.
Strategy 3 — Cache Conversation History for Long Agents
Best for: Multi-turn agent conversations where the history grows long and the stable system prompt repeats.
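One way to sketch this (helper name and model ID are my own): mark the system prompt and the final block of the prior history, so the entire conversation so far becomes the cached prefix and only the newest message is processed fresh.

```python
def agent_turn_payload(system_text: str, history: list, new_message: str) -> dict:
    """Cache the system prompt plus all prior turns; only the new message is fresh."""
    messages = []
    for i, turn in enumerate(history):
        content = turn["content"]
        if isinstance(content, str):
            content = [{"type": "text", "text": content}]
        content = [dict(block) for block in content]  # avoid mutating the caller's history
        if i == len(history) - 1:
            # Mark the last prior block: everything up to here is the cached prefix.
            content[-1]["cache_control"] = {"type": "ephemeral"}
        messages.append({"role": turn["role"], "content": content})
    messages.append({"role": "user", "content": new_message})
    return {
        "model": "claude-sonnet-4-20250514",  # assumed model ID
        "max_tokens": 1024,
        "system": [{"type": "text", "text": system_text,
                    "cache_control": {"type": "ephemeral"}}],
        "messages": messages,
    }
```

On each new turn the breakpoint moves forward, so the previous turn's cache entry is reused as a prefix of the new, longer one (subject to the 5-minute lifetime between turns).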
Cache Control Rules and Limits
- Minimum cacheable size: Cached content must be at least 1,024 tokens for Sonnet and Opus models, and at least 2,048 tokens for Haiku models. Below this threshold, cache_control is accepted but caching does not occur
- Maximum cache breakpoints: You can have up to 4 cache_control markers in a single request, allowing you to cache different sections independently
- Cache prefix matching: Caching works by exact prefix matching. The cached content must appear identically at the same position in the request. Even a single character difference creates a cache miss
- Images can be cached: Base64-encoded images, PDFs, and other media included in the prompt can be marked with cache_control and will be cached along with surrounding text
Multiple Cache Breakpoints
For complex prompts with multiple stable sections followed by variable content:
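A sketch of a three-breakpoint request (the constants stand in for content that would really be thousands of tokens, and the model ID is an assumption). Each marker caches everything up to and including its block, so sections that change on different schedules can be invalidated independently:

```python
CORE_INSTRUCTIONS = "You are Acme's support assistant."         # hypothetical; rarely changes
KNOWLEDGE_BASE = "Product FAQ and policy reference ..."         # imagine 50k tokens
SESSION_DOCUMENT = "The customer's uploaded contract text ..."  # reused within one session
user_question = "Does clause 4 allow early termination?"        # changes every request

payload = {
    "model": "claude-sonnet-4-20250514",  # assumed model ID
    "max_tokens": 1024,
    "system": [
        # Breakpoint 1: instructions that almost never change.
        {"type": "text", "text": CORE_INSTRUCTIONS,
         "cache_control": {"type": "ephemeral"}},
        # Breakpoint 2: a knowledge base updated on its own schedule, so
        # refreshing it does not invalidate the instructions prefix above.
        {"type": "text", "text": KNOWLEDGE_BASE,
         "cache_control": {"type": "ephemeral"}},
    ],
    "messages": [
        {"role": "user", "content": [
            # Breakpoint 3: a document reused across this session's turns.
            {"type": "text", "text": SESSION_DOCUMENT,
             "cache_control": {"type": "ephemeral"}},
            # The variable question sits after the last breakpoint, uncached.
            {"type": "text", "text": user_question},
        ]},
    ],
}
```

This uses three of the four available markers, leaving one spare for growing conversation history.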
Monitor Cache Hit Rate in Production
Use the cache_read_input_tokens field in the response usage object to monitor your cache hit rate. If you are sending requests you expect to hit the cache but cache_read_input_tokens remains zero, your prompt content is not matching exactly — check for subtle differences like trailing spaces, dynamic timestamps, or variable content you did not realise was in the cached section.
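A small helper makes this concrete. This sketch treats the usage object as a plain dict (the SDK exposes the same fields as attributes) and computes the fraction of input tokens served from cache:

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of input tokens served from cache, from a response's usage object."""
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)  # uncached input tokens
    total = read + written + fresh
    return read / total if total else 0.0

# A warm-cache response might report something like:
usage = {"input_tokens": 42, "cache_read_input_tokens": 20_000,
         "cache_creation_input_tokens": 0, "output_tokens": 310}
print(f"hit rate: {cache_hit_rate(usage):.1%}")
```

A healthy deployment should see this ratio close to 1.0 after the first request; if cache_read_input_tokens stays at zero while cache_creation_input_tokens keeps climbing, you are rewriting the cache on every call.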
Combining Caching with the Batch API
For maximum cost reduction on non-time-sensitive workloads, combine prompt caching with the Batch API:
- Prompt caching: 90% reduction on repeated input tokens
- Batch API: 50% reduction on all token costs for requests you can process asynchronously
Running large document analysis workloads through Batch API requests with cached system prompts can reduce costs by over 90% compared to individual standard requests.
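As a sketch of the combination (the custom_id scheme and model ID are my own): every item in the batch shares one system prompt block marked for caching, so requests processed close together can reuse the same cache entry — though because batch items run asynchronously, cache hits within a batch are best-effort rather than guaranteed.

```python
def batch_requests(system_text: str, questions: list) -> list:
    """Build a Batch API request list where every item shares one cached system prompt."""
    shared_system = [{"type": "text", "text": system_text,
                      "cache_control": {"type": "ephemeral"}}]
    return [
        {
            "custom_id": f"q-{i}",  # hypothetical ID scheme for matching results later
            "params": {
                "model": "claude-sonnet-4-20250514",  # assumed model ID
                "max_tokens": 1024,
                "system": shared_system,
                "messages": [{"role": "user", "content": q}],
            },
        }
        for i, q in enumerate(questions)
    ]
```

With the official SDK you would submit this as `client.messages.batches.create(requests=batch_requests(...))` and poll the batch for results.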
Practical Impact: Cost Calculation
For a customer support application handling 10,000 daily conversations, each including a 20,000-token knowledge base:
- Without caching: 10,000 × 20,000 = 200 million input tokens daily
- With caching (assuming a 90% cache hit rate): 9,000 hits × 2,000 effective tokens + 1,000 cache writes × 25,000 effective tokens = roughly 43 million effective input tokens daily
- Result: roughly a 78% reduction in input token costs for the knowledge base portion, approaching 90% as the hit rate climbs
Summary
Prompt caching is one of the highest-leverage optimisations available in the Claude API. For any application that sends repeated content — system prompts, knowledge bases, documents, conversation history — implement caching before you worry about any other optimisation.
Implementation checklist:
- Add cache_control: {"type": "ephemeral"} to stable system prompt sections
- Ensure cached content exceeds the minimum token threshold (1,024 for Sonnet)
- Place cache markers at the boundary between static and dynamic content
- Monitor cache_read_input_tokens in responses to verify caching is working
- Consider combining with Batch API for non-time-sensitive workloads
With prompt caching and the full agents module covered, let us do a rapid-fire refresher before moving into projects: AI Agents Refresher: Key Concepts, Patterns, and Pitfalls.
Prompt caching pairs especially well with Claude RAG applications — cache the system prompt and retrieved document chunks that repeat across user queries. For the full cost picture including Batch API savings, review the Claude API pricing guide.
The Anthropic prompt caching documentation has the authoritative details on minimum token thresholds, maximum cache breakpoints, and the extended 1-hour cache option for less frequently accessed content.
This post is part of the Anthropic AI Tutorial Series. Previous post: Claude Model Context Protocol (MCP): Connect Claude to Any Tool.