Prompt Caching with Claude: Cut API Costs by Up to 90%

As you build production applications with Claude, you will notice a pattern: many API requests send exactly the same content at the start. Your system prompt is the same. The company knowledge base you include is the same. The background documentation is the same. Only the user's specific question at the end differs.
Without prompt caching, you pay to process those thousands of identical tokens on every single API call. With prompt caching, you pay full price once to cache that content, and then subsequent requests that reuse the same cached prefix cost 90% less for those tokens. For document-heavy applications, agent systems, and high-volume customer service deployments, this difference is transformative.
How Prompt Caching Works
Prompt caching stores a snapshot of specified parts of your prompt on Anthropic's infrastructure after the first request that uses those sections. When a subsequent request begins with the same cached content, Claude retrieves the cached state rather than reprocessing the tokens from scratch.
The pricing model:
- Cache write: 25% more expensive than standard input tokens (the first request that creates the cache)
- Cache read: 90% cheaper than standard input tokens (every subsequent request that hits the cache)
- Cache lifetime: Cached content is retained for a minimum of 5 minutes. Every time the cache is accessed, the lifetime resets for another 5 minutes
For a 100,000-token system prompt that is reused across 1,000 requests, the maths is compelling:
- Without caching: 1,000 × 100,000 tokens = 100 million tokens billed at input rate
- With caching: 1 × 100,000 at cache write rate + 999 × 100,000 at 10% of input rate = approximately 10 million equivalent tokens
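That comparison can be checked with a few lines of arithmetic, using the rates from the pricing list above expressed relative to the standard input-token price:

```python
# Cost of reusing a 100,000-token prompt across 1,000 requests,
# measured in standard input-token equivalents.
PROMPT_TOKENS = 100_000
REQUESTS = 1_000

# Without caching: every request pays full price for the prompt
uncached = REQUESTS * PROMPT_TOKENS

# With caching: one cache write at 1.25x, then cache reads at 0.10x
cached = 1.25 * PROMPT_TOKENS + (REQUESTS - 1) * 0.10 * PROMPT_TOKENS

print(f"Uncached: {uncached:,.0f} token-equivalents")
print(f"Cached:   {cached:,.0f} token-equivalents")
print(f"Saving:   {1 - cached / uncached:.1%}")
```

The saving lands just under 90% because the single cache write costs 25% more than a standard request, a premium that is amortised across the 999 cheap reads.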
Marking Content for Caching
You enable caching on specific content blocks by adding a cache_control parameter with type: "ephemeral". This tells Claude to cache up to and including that content block.
```python
import anthropic

client = anthropic.Anthropic()

# Example: Caching a large system prompt
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": """You are an expert IT support agent for Meridian Systems.

--- COMPANY KNOWLEDGE BASE ---
[INSERT YOUR FULL 50,000 WORD KNOWLEDGE BASE HERE]
--- END KNOWLEDGE BASE ---

Your role is to help customers resolve technical issues quickly and accurately.
Always cite the relevant knowledge base section in your responses.""",
            "cache_control": {"type": "ephemeral"}  # Cache this entire system prompt
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "My laptop can't connect to the company VPN. What should I try?"
        }
    ]
)

# Check cache usage in the response
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
```
Caching Strategies for Different Applications
Strategy 1 — Cache the System Prompt Only
Best for: Applications with a fixed, large system prompt and variable user messages.
```python
system_with_cache = [
    {
        "type": "text",
        "text": large_system_prompt,
        "cache_control": {"type": "ephemeral"}
    }
]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=system_with_cache,
    messages=[{"role": "user", "content": user_question}]
)
```
Strategy 2 — Cache a Long Document
Best for: Document Q&A systems where many users ask questions about the same document.
```python
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Here is the full technical specification document:\n\n{full_document_text}",
                    "cache_control": {"type": "ephemeral"}  # Cache the document
                },
                {
                    "type": "text",
                    "text": "What are the network latency requirements specified in section 4?"
                    # This question is not cached — it varies per user
                }
            ]
        }
    ]
)
```
Cache the Stable Parts, Not the Variable Parts
The key to effective caching is placing the cache_control marker at the boundary between stable and variable content. Everything before the marker gets cached. The user's specific question — which is different on every request — comes after the marker and is not cached. Only add cache_control to the parts of your prompt that are genuinely stable across multiple requests.
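One way to keep that boundary explicit is a small helper (a hypothetical convenience function, not part of the SDK) that attaches cache_control only to the stable block, so the variable question can never invalidate the cached prefix:

```python
def build_user_content(stable_text: str, variable_text: str) -> list:
    """Build a user content list with only the stable part marked for caching.

    Hypothetical helper: the cache_control marker goes on the stable block,
    and the variable question follows it uncached.
    """
    return [
        {
            "type": "text",
            "text": stable_text,
            "cache_control": {"type": "ephemeral"},
        },
        {"type": "text", "text": variable_text},
    ]

content = build_user_content(
    "DOCUMENT:\n<full document text>",
    "What are the latency requirements in section 4?",
)
```

Centralising the split in one function also makes it harder for dynamic content, such as timestamps, to creep into the cached section by accident.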
Strategy 3 — Cache Conversation History for Long Agents
Best for: Multi-turn agent conversations where the history grows long and the stable system prompt repeats.
```python
def run_cached_agent_turn(messages: list, new_user_message: str) -> dict:
    """
    Run one turn of an agent conversation with caching applied to history.
    """
    cached_messages = list(messages)

    # Mark the last historical message as the cache breakpoint
    if cached_messages:
        last = cached_messages[-1]
        if isinstance(last["content"], str):
            cached_messages[-1] = {
                **last,
                "content": [
                    {
                        "type": "text",
                        "text": last["content"],
                        "cache_control": {"type": "ephemeral"}
                    }
                ]
            }

    # Add the new user message (not cached)
    cached_messages.append({"role": "user", "content": new_user_message})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        system=[{"type": "text", "text": system_prompt,
                 "cache_control": {"type": "ephemeral"}}],
        messages=cached_messages
    )

    return response
```
Cache Control Rules and Limits
- Minimum cacheable size: Cached content must be at least 1,024 tokens for Sonnet and Opus models, and at least 2,048 tokens for Haiku models. Below this threshold, cache_control is accepted but caching does not occur
- Maximum cache breakpoints: You can have up to 4 cache_control markers in a single request, allowing you to cache different sections independently
- Cache prefix matching: Caching works by exact prefix matching. The cached content must appear identically at the same position in the request. Even a single character difference creates a cache miss
- Images can be cached: Base64-encoded images, PDFs, and other media included in the prompt can be marked with cache_control and will be cached along with surrounding text
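Because the four-breakpoint limit applies across the whole request, a quick sanity check before sending can catch violations early. A simple sketch that walks the system and message content blocks:

```python
def count_cache_breakpoints(system: list, messages: list) -> int:
    """Count cache_control markers across system and message content blocks."""
    count = sum(1 for block in system
                if isinstance(block, dict) and "cache_control" in block)
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            count += sum(1 for block in content
                         if isinstance(block, dict) and "cache_control" in block)
    return count

system = [{"type": "text", "text": "instructions",
           "cache_control": {"type": "ephemeral"}}]
messages = [{"role": "user", "content": [
    {"type": "text", "text": "document text",
     "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": "the user's question"},
]}]

assert count_cache_breakpoints(system, messages) <= 4
```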
Multiple Cache Breakpoints
For complex prompts with multiple stable sections followed by variable content:
```python
system = [
    {
        "type": "text",
        "text": "You are an expert legal document analyst.\n\n" + base_instructions,
        "cache_control": {"type": "ephemeral"}  # Cache 1: System instructions
    }
]

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"CONTRACT TEXT:\n\n{contract_text}",
                "cache_control": {"type": "ephemeral"}  # Cache 2: The specific contract
            },
            {
                "type": "text",
                "text": user_specific_question  # Not cached — varies per user
            }
        ]
    }
]
```
Monitor Cache Hit Rate in Production
Use the cache_read_input_tokens field in the response usage object to monitor your cache hit rate. If you are sending requests you expect to hit the cache but cache_read_input_tokens remains zero, your prompt content is not matching exactly — check for subtle differences like trailing spaces, dynamic timestamps, or variable content you did not realise was in the cached section.
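For example, you could accumulate the usage counters over many responses and report the share of input tokens served from cache. A monitoring sketch, using the same usage field names shown in the earlier example (SimpleNamespace stands in for real response.usage objects here):

```python
from types import SimpleNamespace


class CacheStats:
    """Accumulate usage counters and report the overall cache hit rate."""

    def __init__(self):
        self.cache_read = 0
        self.cache_write = 0
        self.uncached = 0

    def record(self, usage) -> None:
        self.cache_read += usage.cache_read_input_tokens or 0
        self.cache_write += usage.cache_creation_input_tokens or 0
        self.uncached += usage.input_tokens or 0

    @property
    def hit_rate(self) -> float:
        total = self.cache_read + self.cache_write + self.uncached
        return self.cache_read / total if total else 0.0


stats = CacheStats()
# Simulate two responses: first a cache write, then a cache read
stats.record(SimpleNamespace(input_tokens=50,
                             cache_creation_input_tokens=20_000,
                             cache_read_input_tokens=0))
stats.record(SimpleNamespace(input_tokens=60,
                             cache_creation_input_tokens=0,
                             cache_read_input_tokens=20_000))
print(f"Cache hit rate: {stats.hit_rate:.1%}")
```

A hit rate that stays near zero while traffic is steady is the signal to start hunting for the prefix differences described above.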
Combining Caching with the Batch API
For maximum cost reduction on non-time-sensitive workloads, combine prompt caching with the Batch API:
- Prompt caching: 90% reduction on repeated input tokens
- Batch API: 50% reduction on both input and output token costs for asynchronous, non-urgent workloads
Running large document analysis workloads through Batch API requests with cached system prompts can reduce costs by over 90% compared to individual standard requests.
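A minimal sketch of that combination, assuming the Batch API request shape from the anthropic Python SDK (client.messages.batches.create with a list of custom_id/params entries); every entry shares the same cacheable system prompt, though cache hits within a batch are best-effort since batched requests may process concurrently:

```python
def build_batch_requests(system_prompt: str, questions: list) -> list:
    """Build Batch API request entries that share one cacheable system prompt."""
    cached_system = [{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},
    }]
    return [
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "system": cached_system,
                "messages": [{"role": "user", "content": question}],
            },
        }
        for i, question in enumerate(questions)
    ]


requests = build_batch_requests("<knowledge base text>", [
    "Summarise the VPN setup procedure.",
    "List the supported laptop models.",
])

# Submitting (requires a configured client):
# batch = client.messages.batches.create(requests=requests)
```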
Practical Impact: Cost Calculation
For a customer support application handling 10,000 daily conversations, each including a 20,000-token knowledge base:
- Without caching: 10,000 × 20,000 = 200 million input tokens daily
- With caching (assuming 90% cache hit rate): 10,000 × 2,000 effective tokens = 20 million input tokens daily
- Result: 90% reduction in input token costs for the knowledge base portion
Summary
Prompt caching is one of the highest-leverage optimisations available in the Claude API. For any application that sends repeated content — system prompts, knowledge bases, documents, conversation history — implement caching before you worry about any other optimisation.
Implementation checklist:
- Add cache_control: {"type": "ephemeral"} to stable system prompt sections
- Ensure cached content exceeds the minimum token threshold (1,024 for Sonnet)
- Place cache markers at the boundary between static and dynamic content
- Monitor cache_read_input_tokens in responses to verify caching is working
- Consider combining with Batch API for non-time-sensitive workloads
With prompt caching and the full agents module covered, let us do a rapid-fire refresher before moving into projects: AI Agents Refresher: Key Concepts, Patterns, and Pitfalls.
This post is part of the Anthropic AI Tutorial Series. Previous post: Model Context Protocol (MCP): Connect Claude to Any Tool.
