LLM Prompt Caching

👉 Understand the working principle of prompt caching and how it helps reduce latency and cost in LLM applications.

👉 Deep dive into implicit and explicit caching strategies for production-grade applications.


Optimization

When building real-world LLM applications, performance matters as much as accuracy.

  • Users expect faster responses.
  • Businesses expect lower cost.
  • Developers prefer predictable behavior.

This is exactly where optimization techniques come in, helping turn the same model into something much faster, cheaper, and smarter without changing the model itself.

One of the most impactful optimization techniques is Prompt Caching, which reduces compute time by identifying and caching the static parts of prompts.


Prompt Caching

Prompt caching is based on a simple idea: “Do not repeat work the model already did.”

Every prompt sent to an LLM contains a mixture of:

  • Instructions
  • Conversation history
  • Examples
  • Retrieved Documents (in RAG)

If this structure stays the same across many requests, the model should not have to re-process those repeated parts every time.

Prompt caching solves this by processing and storing the static part once and reusing it for later requests. This saves compute, reduces latency, and lowers token cost.

Production-quality LLM systems that use caching often see:

  • 20–50% faster responses
  • 25–70% cost reduction
  • More stable performance under load

Prompt caching becomes especially valuable in applications where a large part of the prompt rarely changes, such as:

  • RAG systems
  • Chatbots
  • Agent workflows
  • Long reasoning pipelines

Why prompt caching matters in practice?

Many applications repeatedly use the same context, for example:

  • A chatbot always begins with the same system instructions.
  • A RAG pipeline always injects the same knowledge base chunks for similar questions.
  • An AI agent repeatedly uses the same tool definitions or schemas.
  • A reasoning pipeline reuses long, static explanation blocks.

Without caching, the LLM recalculates the internal attention memory for these static tokens again and again, wasting compute and cost.

With caching, only the new tokens (usually the user query) are reprocessed.

There are two popular ways to cache this repetitive data.

Implicit Caching (Provider-Side)

Some LLM providers automatically detect repeated prompt prefixes; if they find a long section that matches a previous request, they reuse the cached internal state.

From the developer’s perspective:

  • No configuration
  • No code changes
  • Faster responses automatically
  • Discounted cost for cached tokens

This makes implicit caching a simple and powerful optimization for most apps.

Insight:
Some providers (e.g., OpenAI, Anthropic) automatically cache system prompts, long instructions, and long context messages.
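
For illustration, here is a minimal sketch using the openai Python SDK. The model name, the system_prompt.txt file, and the usage field layout are assumptions and vary by SDK version; the only technique is keeping the long static prefix byte-for-byte identical across calls so the provider's implicit cache can reuse it.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Long, static prefix (system instructions, tool descriptions, examples).
# Implicit caching usually applies only above a minimum size (e.g., 1,024 tokens).
STATIC_SYSTEM_PROMPT = open("system_prompt.txt").read()

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any cache-enabled model
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # identical on every call
            {"role": "user", "content": question},                # only this part changes
        ],
    )
    # Newer SDK versions report how much of the prompt was served from cache.
    details = getattr(response.usage, "prompt_tokens_details", None)
    if details is not None:
        print("cached prompt tokens:", getattr(details, "cached_tokens", 0))
    return response.choices[0].message.content

print(ask("What is our refund policy?"))   # first call: cache miss
print(ask("How do I reset my password?"))  # same prefix: likely a cache hit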

Explicit Caching (Developer-Side)

Explicit caching gives developers more control. You choose what to cache, how long to keep it, and when to reuse it.

The developer controls what gets cached: embeddings, context chunks, fixed instructions, etc.

This approach is ideal when:

  • You work with very long documents
  • You want predictable cache hits
  • Your app reuses the same large context across many requests
    • e.g. Knowledge Bases, RAG pipelines.
  • You want caching beyond the provider’s TTL window

Below are the common tools and frameworks used for caching:

  • Redis-based caching
  • LangChain’s cache layer
  • LlamaIndex document caching
  • Vector DBs combined with content-hash keys

Note:
Explicit caching requires more setup but offers far more control, especially for scalable workloads.
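
A minimal sketch of the developer-side idea, using only the standard library (the names and the 30-minute TTL are assumptions): the static context is hashed, and whatever you choose to cache (a response, an embedding, a pre-processed chunk) is stored under that hash. In production the dictionary would typically be a Redis instance or a framework cache layer.

import hashlib
import time

CACHE_TTL_SECONDS = 30 * 60                    # how long entries stay valid (assumption)
_cache: dict[str, tuple[float, str]] = {}      # prefix hash -> (stored_at, cached value)

def _key(static_context: str) -> str:
    # Content-hash key: any change to the static context yields a different key.
    return hashlib.sha256(static_context.encode("utf-8")).hexdigest()

def get_cached(static_context: str) -> str | None:
    entry = _cache.get(_key(static_context))
    if entry is None:
        return None                             # cache miss
    stored_at, value = entry
    if time.time() - stored_at > CACHE_TTL_SECONDS:
        _cache.pop(_key(static_context), None)  # expired entries behave like a miss
        return None
    return value                                # cache hit

def put_cached(static_context: str, value: str) -> None:
    _cache[_key(static_context)] = (time.time(), value)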


How prompt caching works?

LLMs process tokens sequentially, and this sequential processing is the most expensive part of handling a long prompt, especially a long static section (like system instructions or a document).

Under the hood, LLMs compute internal Key–Value (KV) states when processing tokens. These KV states represent the model’s memory of the prompt.

The caching mechanism relies on storing these KV states.

Cache Miss (First Request)

The first time you send a long static prefix (e.g., system prompt and tool schema), the model processes it token-by-token and builds a KV cache.

This KV state (the internal memory/context of the prefix) is then stored in a fast cache, associated with a unique hash of the prefix.

Cache Hit (Next Requests)

When a new request arrives with the exact same static prefix plus new dynamic content (the user’s query), the system:

  • Identifies the static prefix
  • Retrieves the pre-computed KV cache for that segment
  • Loads it directly into the model’s memory and
  • The LLM only processes the new tokens

This reduces compute dramatically. In many cases, the cached part is billed at 50–90% discount, depending on the provider.

Limitations

Caching is powerful, but not unlimited. It has its own limitations:

  • Exact prefix match required
    • Caching typically requires an exact string prefix match for a cache hit.
    • Even a slight formatting change (e.g., a single extra space) can break the cache (see the sketch after this list).
  • Cache TTL (Time-To-Live)
    • Caches are ephemeral and expire after a short period (e.g., 5-60 minutes) of inactivity.
    • This limits their benefit for low-frequency or highly variable requests.
  • Explicit caching adds complexity
    • Developers must manage cache keys, TTLs, and data shaping.
    • Managing cache keys and expirations for non-API-managed caches adds infrastructure complexity.
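
To see how fragile the exact-match requirement is, compare the content hashes of two prefixes that differ only by one space; a hash like this is effectively what a prefix lookup keys on.

import hashlib

prefix_a = "You are a helpful assistant.\nAnswer concisely."
prefix_b = "You are a helpful assistant. \nAnswer concisely."  # one extra space

print(hashlib.sha256(prefix_a.encode()).hexdigest()[:16])  # e.g. a completely
print(hashlib.sha256(prefix_b.encode()).hexdigest()[:16])  # different digest
# Different digests mean the second request cannot reuse the first one's cache entry.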

A typical request flow

The request flow below shows how a typical cache behaves for the first request and for subsequent requests.

flowchart TD

    %% User Request
    A[User Request] --> B[Build Prompt<br/>- static prefix + user query]

    %% Prefix Extraction
    B --> C[Extract Static Prefix]
    C --> D[Hash Prefix]

    %% Check Cache
    D --> E{Prefix Hash<br/>found in Cache?}

    %% Cache Miss Path
    E -- No --> F[Cache Miss:<br/>LLM processes full prefix]
    F --> G[Compute KV State<br/>for static prefix]
    G --> H[Store KV in Cache]
    H --> I[Process new tokens<br/>- user query]
    I --> J[Generate Response]

    %% Cache Hit Path
    E -- Yes --> K[Cache Hit:<br/>Load KV State]
    K --> I

    %% Output
    J --> L[Return Response to User]

    %% Notes section
    classDef note fill:#f0f7ff,stroke:#b2d3ff,stroke-width:1px,color:#000;
    N1[Note:<br/>Cache hit requires an exact prefix match,<br/>even whitespace may break it.]:::note
    N2[Note:<br/>KV cache accelerates only the static prefix.<br/>User input still processed normally.]:::note

    L --> N1
    L --> N2

Provider Implementations

Leading providers like OpenAI, Gemini, and others have built-in caching, though their approaches differ.

Explicit caching options (like Gemini and Bedrock) are excellent for apps with large, stable prompt sections.

| Provider | Mechanism | Minimum Prompt Size | Configuration | Key Detail |
| --- | --- | --- | --- | --- |
| OpenAI (GPT-4o, etc.) | Implicit caching | 1,024 tokens | Automatic (no code changes) | Caches typically last 5-10 minutes of inactivity; a cost discount and latency reduction apply to the cached prefix. |
| Google Gemini (Vertex AI) | Implicit and explicit caching | 2,048 tokens (implicit) | Implicit is automatic; explicit requires API calls (e.g., CachedContent.create()) to manage the cache lifetime and content. | Explicit caching offers guaranteed cache hits for a configurable TTL, giving more control for large, long-lived contexts. |
| Amazon Bedrock (Claude, etc.) | Explicit caching (context caching) | Model-dependent (e.g., 1,024 tokens) | Requires a cache_control parameter or checkpoint markers in the prompt to define the prefix to be cached. | Gives granular control over which part of a prompt is cached using markers. |
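
As an illustration of marker-based explicit caching, the sketch below uses the anthropic Python SDK's cache_control block; the model name and the policy.txt file are assumptions, and the Bedrock integration exposes the same idea through its own request format.

import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

# Large, rarely changing context we want the provider to cache.
LONG_POLICY_DOCUMENT = open("policy.txt").read()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # assumption: any prompt-caching-capable model
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_POLICY_DOCUMENT,
            # Marks this block as a cacheable prefix checkpoint.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarise the refund policy."}],
)
print(response.content[0].text)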

How to design prompts for maximum cache hits?

We can greatly improve cache efficiency by structuring prompts carefully, following the guidelines below.

1. Put All Static Content First

Always place your fixed system instructions, schemas, examples, or retrieved documents before the user input.

If the user query appears first, the shared prefix breaks and there is no cache hit.

Optimized prompt:

  [System Instructions]
  [Tool Definitions]
  [RAG Document Chunk]
  [User Query]

Bad prompt:

  [User Query]
  [System Instructions]
  [RAG Document Chunk]
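
One way to make the ordering hard to get wrong is to centralise prompt assembly in a single helper so the static blocks always come first and the per-request parts always come last. The constant names below are placeholders.

SYSTEM_INSTRUCTIONS = "You are a support assistant for ACME Corp..."  # fixed
TOOL_DEFINITIONS = "Available tools: search_orders(order_id), ..."    # fixed

def build_messages(rag_chunk: str, user_query: str) -> list[dict]:
    # Static, byte-for-byte identical content first; variable content last.
    return [
        {"role": "system", "content": SYSTEM_INSTRUCTIONS},
        {"role": "system", "content": TOOL_DEFINITIONS},
        {"role": "user", "content": f"Context:\n{rag_chunk}\n\nQuestion:\n{user_query}"},
    ]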

2. Explicit Caching for Long or Expensive Context

If the app frequently queries the same large document (e.g., policy PDF, product catalog, codebase), explicit caching ensures the document stays in the cache for its TTL window.

This avoids recomputing thousands of tokens.
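
A sketch of this, based on the Gemini CachedContent.create() call mentioned in the table above; module paths, the model name, the catalog file, and parameters are assumptions and may differ by SDK version.

import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="...")  # assumption: key supplied via config or environment

# Cache the large, stable document once, with an explicit TTL.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",   # assumption: a cache-capable model
    system_instruction="Answer questions using only the attached product catalog.",
    contents=[open("product_catalog.txt").read()],
    ttl=datetime.timedelta(hours=1),
)

# Subsequent requests reuse the cached context instead of resending thousands of tokens.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
print(model.generate_content("Which products ship internationally?").text)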

3. Monitor Cache Hit Rate and Token Savings

LLM providers expose fields to monitor cache status, such as:

  • cached_tokens
  • cache_hit
  • cache_miss

By tracking these metrics, we can see how effectively the caching design is actually working (a small tracking sketch follows the list below). If the hit rate is low, it’s usually because the prefix changed slightly.

Common issues include:

  • Inconsistent formatting
  • Random whitespace
  • Inserting dynamic content before static content
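
A lightweight way to track this is to accumulate the reported token counts per request and log an overall ratio. The sketch below follows the OpenAI-style usage object; treat the field names as an assumption for other providers.

total_prompt_tokens = 0
total_cached_tokens = 0

def record_usage(usage) -> None:
    """Accumulate token counts from one response's usage object."""
    global total_prompt_tokens, total_cached_tokens
    total_prompt_tokens += usage.prompt_tokens
    details = getattr(usage, "prompt_tokens_details", None)
    total_cached_tokens += getattr(details, "cached_tokens", 0) if details else 0

def cache_hit_ratio() -> float:
    # Fraction of all prompt tokens that were served from cache.
    return total_cached_tokens / total_prompt_tokens if total_prompt_tokens else 0.0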