Prompt Caching with Claude: Cut API Costs by Up to 90%

As you build production applications with Claude, you will notice a pattern: many API requests send exactly the same content at the start. Your system prompt is the same. The company knowledge base you include is the same. The background documentation is the same. Only the user's specific question at the end differs.
Without prompt caching, you pay to process those thousands of identical tokens on every single API call. With prompt caching, you pay full price once to cache that content, and then subsequent requests that reuse the same cached prefix cost 90% less for those tokens. For document-heavy applications, agent systems, and high-volume customer service deployments, this difference is transformative.
How Prompt Caching Works
Prompt caching stores a snapshot of specified parts of your prompt on Anthropic's infrastructure after the first request that uses those sections. When a subsequent request begins with the same cached content, Claude retrieves the cached state rather than reprocessing the tokens from scratch.
The pricing model:
- Cache write: 25% more expensive than standard input tokens (the first request that creates the cache)
- Cache read: 90% cheaper than standard input tokens (every subsequent request that hits the cache)
- Cache lifetime: Cached content is retained for a minimum of 5 minutes. Every time the cache is accessed, the lifetime resets for another 5 minutes
For a 100,000-token system prompt that is reused across 1,000 requests, the maths is compelling:
- Without caching: 1,000 × 100,000 tokens = 100 million tokens billed at input rate
- With caching: 1 × 100,000 at cache write rate + 999 × 100,000 at 10% of input rate = approximately 10 million equivalent tokens
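That comparison can be checked with a few lines of arithmetic, using the rates from the pricing list above expressed relative to the standard input-token price:

```python
# Cost of reusing a 100,000-token prompt across 1,000 requests,
# measured in standard input-token equivalents.
PROMPT_TOKENS = 100_000
REQUESTS = 1_000

# Without caching: every request pays full price for the prompt
uncached = REQUESTS * PROMPT_TOKENS

# With caching: one cache write at 1.25x, then cache reads at 0.10x
cached = 1.25 * PROMPT_TOKENS + (REQUESTS - 1) * 0.10 * PROMPT_TOKENS

print(f"Uncached: {uncached:,.0f} token-equivalents")
print(f"Cached:   {cached:,.0f} token-equivalents")
print(f"Saving:   {1 - cached / uncached:.1%}")
```

The saving lands just under 90% because the single cache write costs 25% more than a standard request, a premium that is amortised across the 999 cheap reads.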
Marking Content for Caching
You enable caching on specific content blocks by adding a cache_control parameter with type: "ephemeral". This tells Claude to cache up to and including that content block.
```python
import anthropic

client = anthropic.Anthropic()

# Example: Caching a large system prompt
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": """You are an expert IT support agent for Meridian Systems.

--- COMPANY KNOWLEDGE BASE ---
[INSERT YOUR FULL 50,000 WORD KNOWLEDGE BASE HERE]
--- END KNOWLEDGE BASE ---

Your role is to help customers resolve technical issues quickly and accurately.
Always cite the relevant knowledge base section in your responses.""",
            "cache_control": {"type": "ephemeral"}  # Cache this entire system prompt
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "My laptop can't connect to the company VPN. What should I try?"
        }
    ]
)

# Check cache usage in the response
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
```
Caching Strategies for Different Applications
Strategy 1 — Cache the System Prompt Only
Best for: Applications with a fixed, large system prompt and variable user messages.
```python
system_with_cache = [
    {
        "type": "text",
        "text": large_system_prompt,
        "cache_control": {"type": "ephemeral"}
    }
]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=system_with_cache,
    messages=[{"role": "user", "content": user_question}]
)
```
Strategy 2 — Cache a Long Document
Best for: Document Q&A systems where many users ask questions about the same document.
```python
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Here is the full technical specification document:\n\n{full_document_text}",
                    "cache_control": {"type": "ephemeral"}  # Cache the document
                },
                {
                    "type": "text",
                    "text": "What are the network latency requirements specified in section 4?"
                    # This question is not cached — it varies per user
                }
            ]
        }
    ]
)
```
Cache the Stable Parts, Not the Variable Parts
The key to effective caching is placing the cache_control marker at the boundary between stable and variable content. Everything before the marker gets cached. The user's specific question — which is different on every request — comes after the marker and is not cached. Only add cache_control to the parts of your prompt that are genuinely stable across multiple requests.
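One way to keep that boundary explicit is a small helper (a hypothetical convenience function, not part of the SDK) that attaches cache_control only to the stable block, so the variable question can never invalidate the cached prefix:

```python
def build_user_content(stable_text: str, variable_text: str) -> list:
    """Build a user content list with only the stable part marked for caching.

    Hypothetical helper: the cache_control marker goes on the stable block,
    and the variable question follows it uncached.
    """
    return [
        {
            "type": "text",
            "text": stable_text,
            "cache_control": {"type": "ephemeral"},
        },
        {"type": "text", "text": variable_text},
    ]

content = build_user_content(
    "DOCUMENT:\n<full document text>",
    "What are the latency requirements in section 4?",
)
```

Centralising the split in one function also makes it harder for dynamic content, such as timestamps, to creep into the cached section by accident.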
Strategy 3 — Cache Conversation History for Long Agents
Best for: Multi-turn agent conversations where the history grows long and the stable system prompt repeats.
```python
def run_cached_agent_turn(messages: list, new_user_message: str) -> dict:
    """
    Run one turn of an agent conversation with caching applied to history.
    """
    cached_messages = list(messages)

    # Mark the last historical message as the cache breakpoint
    if cached_messages:
        last = cached_messages[-1]
        if isinstance(last["content"], str):
            cached_messages[-1] = {
                **last,
                "content": [
                    {
                        "type": "text",
                        "text": last["content"],
                        "cache_control": {"type": "ephemeral"}
                    }
                ]
            }

    # Add the new user message (not cached)
    cached_messages.append({"role": "user", "content": new_user_message})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        system=[{"type": "text", "text": system_prompt,
                 "cache_control": {"type": "ephemeral"}}],
        messages=cached_messages
    )

    return response
```
Cache Control Rules and Limits
- Minimum cacheable size: Cached content must be at least 1,024 tokens for Sonnet and Opus models, and at least 2,048 tokens for Haiku models. Below this threshold, cache_control is accepted but caching does not occur
- Maximum cache breakpoints: You can have up to 4 cache_control markers in a single request, allowing you to cache different sections independently
- Cache prefix matching: Caching works by exact prefix matching. The cached content must appear identically at the same position in the request. Even a single character difference creates a cache miss
- Images can be cached: Base64-encoded images, PDFs, and other media included in the prompt can be marked with cache_control and will be cached along with surrounding text
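Because the four-breakpoint limit applies across the whole request, a quick sanity check before sending can catch violations early. A simple sketch that walks the system and message content blocks:

```python
def count_cache_breakpoints(system: list, messages: list) -> int:
    """Count cache_control markers across system and message content blocks."""
    count = sum(1 for block in system
                if isinstance(block, dict) and "cache_control" in block)
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            count += sum(1 for block in content
                         if isinstance(block, dict) and "cache_control" in block)
    return count

system = [{"type": "text", "text": "instructions",
           "cache_control": {"type": "ephemeral"}}]
messages = [{"role": "user", "content": [
    {"type": "text", "text": "document text",
     "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": "the user's question"},
]}]

assert count_cache_breakpoints(system, messages) <= 4
```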
Multiple Cache Breakpoints
For complex prompts with multiple stable sections followed by variable content:
```python
system = [
    {
        "type": "text",
        "text": "You are an expert legal document analyst.\n\n" + base_instructions,
        "cache_control": {"type": "ephemeral"}  # Cache 1: System instructions
    }
]

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"CONTRACT TEXT:\n\n{contract_text}",
                "cache_control": {"type": "ephemeral"}  # Cache 2: The specific contract
            },
            {
                "type": "text",
                "text": user_specific_question  # Not cached — varies per user
            }
        ]
    }
]
```
Monitor Cache Hit Rate in Production
Use the cache_read_input_tokens field in the response usage object to monitor your cache hit rate. If you are sending requests you expect to hit the cache but cache_read_input_tokens remains zero, your prompt content is not matching exactly — check for subtle differences like trailing spaces, dynamic timestamps, or variable content you did not realise was in the cached section.
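For example, you could accumulate the usage counters over many responses and report the share of input tokens served from cache. A monitoring sketch, using the same usage field names shown in the earlier example (SimpleNamespace stands in for real response.usage objects here):

```python
from types import SimpleNamespace


class CacheStats:
    """Accumulate usage counters and report the overall cache hit rate."""

    def __init__(self):
        self.cache_read = 0
        self.cache_write = 0
        self.uncached = 0

    def record(self, usage) -> None:
        self.cache_read += usage.cache_read_input_tokens or 0
        self.cache_write += usage.cache_creation_input_tokens or 0
        self.uncached += usage.input_tokens or 0

    @property
    def hit_rate(self) -> float:
        total = self.cache_read + self.cache_write + self.uncached
        return self.cache_read / total if total else 0.0


stats = CacheStats()
# Simulate two responses: first a cache write, then a cache read
stats.record(SimpleNamespace(input_tokens=50,
                             cache_creation_input_tokens=20_000,
                             cache_read_input_tokens=0))
stats.record(SimpleNamespace(input_tokens=60,
                             cache_creation_input_tokens=0,
                             cache_read_input_tokens=20_000))
print(f"Cache hit rate: {stats.hit_rate:.1%}")
```

A hit rate that stays near zero while traffic is steady is the signal to start hunting for the prefix differences described above.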
Combining Caching with the Batch API
For maximum cost reduction on non-time-sensitive workloads, combine prompt caching with the Batch API:
- Prompt caching: 90% reduction on repeated input tokens
- Batch API: 50% reduction on both input and output token costs for asynchronous, non-urgent workloads
Running large document analysis workloads through Batch API requests with cached system prompts can reduce costs by over 90% compared to individual standard requests.
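A minimal sketch of that combination, assuming the Batch API request shape from the anthropic Python SDK (client.messages.batches.create with a list of custom_id/params entries); every entry shares the same cacheable system prompt, though cache hits within a batch are best-effort since batched requests may process concurrently:

```python
def build_batch_requests(system_prompt: str, questions: list) -> list:
    """Build Batch API request entries that share one cacheable system prompt."""
    cached_system = [{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},
    }]
    return [
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "system": cached_system,
                "messages": [{"role": "user", "content": question}],
            },
        }
        for i, question in enumerate(questions)
    ]


requests = build_batch_requests("<knowledge base text>", [
    "Summarise the VPN setup procedure.",
    "List the supported laptop models.",
])

# Submitting (requires a configured client):
# batch = client.messages.batches.create(requests=requests)
```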
Practical Impact: Cost Calculation
For a customer support application handling 10,000 daily conversations, each including a 20,000-token knowledge base:
- Without caching: 10,000 × 20,000 = 200 million input tokens daily
- With caching (assuming 90% cache hit rate): 10,000 × 2,000 effective tokens = 20 million input tokens daily
- Result: 90% reduction in input token costs for the knowledge base portion
Summary
Prompt caching is one of the highest-leverage optimisations available in the Claude API. For any application that sends repeated content — system prompts, knowledge bases, documents, conversation history — implement caching before you worry about any other optimisation.
Implementation checklist:
- Add cache_control: {"type": "ephemeral"} to stable system prompt sections
- Ensure cached content exceeds the minimum token threshold (1,024 for Sonnet)
- Place cache markers at the boundary between static and dynamic content
- Monitor cache_read_input_tokens in responses to verify caching is working
- Consider combining with Batch API for non-time-sensitive workloads
With prompt caching and the full agents module covered, let us do a rapid-fire refresher before moving into projects: AI Agents Refresher: Key Concepts, Patterns, and Pitfalls.
This post is part of the Anthropic AI Tutorial Series. Previous post: Model Context Protocol (MCP): Connect Claude to Any Tool.
