Building complex AI features—like interactive code assistants, codebase parsers, and long-form document agents—traditionally carried a heavy financial burden.
If you wanted to build an assistant that understood your codebase, you had to resend the entire codebase in the prompt for every single follow-up question. This meant you were billed for the same context tokens on every request, so cumulative token usage grew with each turn of the conversation, leading to massive bills.
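To get a feel for the numbers, here is a rough sketch of how input tokens add up without caching. The function name and the token counts are illustrative assumptions, not measurements from any real API:

```javascript
// Toy model: every request resends the static context plus all turns so far.
function cumulativeInputTokens(contextTokens, tokensPerTurn, turns) {
  let total = 0;
  for (let turn = 1; turn <= turns; turn++) {
    // Request number `turn` carries the full context and `turn` turns of history.
    total += contextTokens + tokensPerTurn * turn;
  }
  return total;
}

// A 50,000-token codebase, 500 tokens per turn, 20 turns:
console.log(cumulativeInputTokens(50000, 500, 20)); // 1105000
```

In this toy model, about one million of the 1.1 million billed input tokens are the same codebase sent twenty times, which is the portion caching targets.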
Enter Prompt Caching.
Recently popularized by Anthropic, DeepSeek, and OpenAI, prompt caching allows developers to reuse static context blocks across multiple requests for a fraction of the cost. Here is how it works under the hood.
How Prompt Caching Works Under the Hood
When you send a request to an LLM API, the server has to process your entire input prompt: it runs every token through the model's attention layers and stores the resulting key and value states (called the KV Cache).
In a traditional setup, this KV Cache is discarded immediately after the response is returned. The next time you ask a question, the server has to recalculate everything from scratch.
With Prompt Caching:
1. You mark a large, static block of text at the start of your prompt (e.g., your database schema, system instructions, or an entire code repository) by placing a cache boundary at its end.
2. The server computes the KV Cache for this block once and keeps it in memory for a limited time.
3. For subsequent requests, the server checks whether the beginning of the incoming prompt exactly matches the cached prefix.
4. Cache Hit: The model skips the expensive recalculation step and reads the states directly from memory. Depending on the provider, cached input tokens are billed at a discount of up to 90%, and time-to-first-token (latency) can drop by 50% or more on long prompts.
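The four steps above can be sketched as a toy simulation. This is only an illustration under simplifying assumptions: `ToyPrefixCache` is a hypothetical class, the "tokenizer" just counts whitespace-separated words, and a real server matches token prefixes and stores attention states rather than strings.

```javascript
// Toy prefix cache: stands in for the server-side store of computed states.
class ToyPrefixCache {
  constructor() {
    this.store = new Map(); // prefix text -> placeholder for precomputed state
  }

  // Crude stand-in for a tokenizer: counts whitespace-separated words.
  static countTokens(text) {
    return text.split(/\s+/).filter(Boolean).length;
  }

  // Reports whether the prefix was cached and how many tokens had to be processed.
  handleRequest(staticPrefix, userSuffix) {
    const suffixTokens = ToyPrefixCache.countTokens(userSuffix);
    if (this.store.has(staticPrefix)) {
      // Cache hit: only the new suffix needs processing.
      return { cacheHit: true, tokensProcessed: suffixTokens };
    }
    // Cache miss: process everything, then remember the prefix.
    const prefixTokens = ToyPrefixCache.countTokens(staticPrefix);
    this.store.set(staticPrefix, { tokens: prefixTokens });
    return { cacheHit: false, tokensProcessed: prefixTokens + suffixTokens };
  }
}

const cache = new ToyPrefixCache();
const docs = "table users id name email";
console.log(cache.handleRequest(docs, "list all users"));
// { cacheHit: false, tokensProcessed: 8 }
console.log(cache.handleRequest(docs, "count the users"));
// { cacheHit: true, tokensProcessed: 3 }
```

Note that any change to the static block, even a single character, produces a different prefix and therefore a miss, which is why the cached content should stay byte-for-byte stable between requests.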
Anthropic vs. OpenAI Caching Implementations
- Anthropic (Claude API): Gives developers explicit control. You define your cache boundaries by setting the cache_control parameter on content blocks in your request payload. Cached input tokens are read at a 90% discount, while writing a block to the cache costs somewhat more than a normal input token.
- OpenAI (GPT-4o API): Automated caching. The API automatically caches prompt prefixes for prompts of 1,024 tokens or more without requiring manual code changes, offering a 50% discount on cached tokens.
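Whichever provider you use, it is worth checking that the cache is actually being hit. The sketch below normalizes the cache counters from a response's usage object. `cacheStats` is a hypothetical helper, and the field names reflect the usage shapes the two providers document at the time of writing, so treat them as assumptions and confirm them against the current API reference:

```javascript
// Normalize cache statistics from a response's `usage` object.
function cacheStats(provider, usage) {
  let totalInput;
  let cachedInput;

  if (provider === "anthropic") {
    // Anthropic reports uncached, cache-write, and cache-read tokens separately.
    const written = usage.cache_creation_input_tokens ?? 0;
    cachedInput = usage.cache_read_input_tokens ?? 0;
    totalInput = usage.input_tokens + written + cachedInput;
  } else if (provider === "openai") {
    // OpenAI reports cached tokens as a subset of prompt_tokens.
    cachedInput = usage.prompt_tokens_details?.cached_tokens ?? 0;
    totalInput = usage.prompt_tokens;
  } else {
    throw new Error(`Unknown provider: ${provider}`);
  }

  const hitRatio = totalInput === 0 ? 0 : cachedInput / totalInput;
  return { totalInput, cachedInput, hitRatio };
}

// Example with a mocked OpenAI-style usage object:
console.log(cacheStats("openai", {
  prompt_tokens: 2048,
  prompt_tokens_details: { cached_tokens: 1024 },
}));
// { totalInput: 2048, cachedInput: 1024, hitRatio: 0.5 }
```

A hit ratio that stays near zero across follow-up requests usually means the prefix is changing between calls or is too short to qualify for caching.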
A Practical Implementation Example
If you are building an AI-powered code scratchpad (like Devpads) where users write and debug scripts, you should cache your system rules and library references:
// Anthropic request payload with prompt caching.
// The Messages API takes the system prompt as a top-level `system` field,
// not as a message with role "system".
const payload = {
  // `model` and `max_tokens` are also required; omitted here for brevity.
  system: [
    {
      type: "text",
      text: "You are a senior DevOps assistant. Here is the entire system reference documentation...",
      // Tag this large static block to be cached
      cache_control: { type: "ephemeral" }
    }
  ],
  messages: [
    {
      role: "user",
      content: "How do I secure the database credentials in my docker-compose template?"
    }
  ]
};
By utilizing prompt caching, developers can now build highly interactive, contextual AI features that are both fast and commercially sustainable.