Caching
Learn how to enable and configure caching in the AnotherAI API.
Think of caching as free key-value storage: the LLM input acts as the "key" and the generated output becomes the "value". Once a result is cached, identical requests return it instantly and at no cost.
AnotherAI's caching lets you reuse the results of identical requests, saving both time and money. When enabled, matching requests return the stored result for free instead of triggering a redundant call to the LLM.
How caching works
Input Hash
A unique hash is calculated based on the input provided to the model. This can be:
- The list of `messages`, if no specific input variables are used.
- A combination of the defined `input` variables (passed via `extra_body`) and the list of `messages` (relevant for replies, or when messages supplement templated prompts).
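The exact hashing scheme is an AnotherAI internal, but you can think of it as hashing a canonical serialization of the request input. A minimal sketch of the idea, assuming JSON serialization and SHA-256 (both illustrative choices, not AnotherAI's actual implementation):

```python
import hashlib
import json

def input_hash(messages: list[dict], input_variables: dict | None = None) -> str:
    """Illustrative only: build a cache key from the request input.

    With input variables, the hash covers the variables plus the messages;
    without them, the messages alone determine the hash.
    """
    payload: dict = {"messages": messages}
    if input_variables:
        payload["input"] = input_variables
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```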
Version Hash
A hash representing the agent's configuration is computed. This typically includes:
- The model identifier (e.g., `gpt-4o`).
- The `temperature` setting.
- Other version parameters (such as `top_p`, `max_tokens`, etc.).
- For calls using input variables: the message templates are also factored in.
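The Version Hash can be pictured the same way, just over the configuration instead of the input. Continuing the illustrative sketch above (the real set of parameters and their canonical form are AnotherAI internals):

```python
import hashlib
import json

def version_hash(model: str, temperature: float,
                 other_params: dict | None = None,
                 message_templates: list | None = None) -> str:
    """Illustrative only: build a hash of the agent's configuration."""
    payload: dict = {"model": model, "temperature": temperature, **(other_params or {})}
    if message_templates:  # only relevant when input variables / templates are used
        payload["templates"] = message_templates
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```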
Cache Check
Before calling the LLM provider, and depending on the caching option (see below), AnotherAI checks whether a previous run exists with the exact same Input Hash and Version Hash.
Cache Hit
If a matching run is found, its saved output is returned immediately, bypassing the actual model call.
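Putting the steps together, the control flow conceptually looks like the sketch below. It reuses the `input_hash` and `version_hash` sketches above; the in-memory dictionary and the `call_llm` callable stand in for AnotherAI's run storage and provider call and are not real APIs:

```python
_runs: dict[tuple[str, str], object] = {}  # illustrative stand-in for stored runs

def run_with_cache(messages, version, input_variables=None,
                   use_cache="auto", tools=None, call_llm=None):
    """Illustrative only: check the cache before calling the provider."""
    key = (input_hash(messages, input_variables),
           version_hash(version["model"], version["temperature"]))

    lookup = use_cache == "always" or (
        use_cache == "auto" and version["temperature"] == 0 and not tools
    )
    if lookup and key in _runs:
        return _runs[key]  # cache hit: the saved output is returned, no model call

    output = call_llm(messages, version, tools)  # cache miss: real provider call
    _runs[key] = output  # the run is stored, so an identical request can hit the cache
    return output
```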
Caching options
The behavior is controlled by the `use_cache` parameter, which can be passed in the `extra_body` of your API request. It accepts the following values:
| Option | Description | Conditions |
|---|---|---|
| `"auto"` (default) | The cache is checked only under specific conditions | `temperature` must be 0 and no tools are provided in the request |
| `"always"` | The cache is always checked | Regardless of `temperature` setting or use of tools |
| `"never"` | The cache is never checked | No caching occurs |
When does AnotherAI look up the cache?
Caching behavior depends on three things:
- `use_cache` (`"auto"`, `"always"`, `"never"`)
- `temperature`
- Whether you include `tools` in the request body
The table below shows whether a cache lookup occurs for every combination that matters:
| `use_cache` value | `temperature` | `tools` present? | Cache lookup |
|---|---|---|---|
| `"auto"` (default) | 0 | No | ✅ Yes |
| `"auto"` | 0 | Yes | ❌ No |
| `"auto"` | > 0 | Any | ❌ No |
| `"always"` | Any | Any | ✅ Yes |
| `"never"` | Any | Any | ❌ No |
Why the cache is OFF by default when using the OpenAI-compatible endpoint
The OpenAI `chat/completions` endpoint defaults to `temperature = 1`, and AnotherAI inherits that default. Combined with `use_cache = "auto"`, the first row that matches in the table is the third one (`temperature > 0` → ❌ No).
Therefore, a request will not use the cache unless you either:
- set `temperature = 0` and omit `tools`, or
- set `use_cache = "always"`.
Examples:
```python
import openai

# Configure the OpenAI client to use AnotherAI
client = openai.OpenAI(
    api_key="YOUR_ANOTHERAI_API_KEY",
    base_url="https://api.anotherai.dev/v1"
)

# Example 1: Cache is NEVER hit (default behavior)
# Reason: temperature defaults to 1, and use_cache defaults to "auto"
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Describe the meaning of life"}],
    metadata={"agent_id": "my-chatbot"}
)

# Example 2: Cache CAN be hit
# Reason: temperature is explicitly 0, meeting the "auto" cache condition.
# A subsequent identical request will hit the cache.
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Describe the meaning of life"}],
    temperature=0,
    metadata={"agent_id": "my-chatbot"}
)

# Example 3: Cache CAN be hit (using a deployment and "always")
# Reason: use_cache="always" forces a cache check regardless of temperature.
completion = client.chat.completions.create(
    model="anotherai/deployment/travel-assistant:production#1",  # Using an AnotherAI deployment
    # messages might be empty if the prompt is fully server-side
    messages=[],  # the SDK requires messages, so pass an empty list
    extra_body={
        "input": {
            "destination": "Paris",
            "traveler_type": "business"
        },
        "use_cache": "always"  # Force cache check
    }
    # response_format=MyPydanticModel  # If expecting structured output
)
# Example 4: Caching with structured outputs
from pydantic import BaseModel

class TravelAdvice(BaseModel):
    destination: str
    tips: list[str]
    warnings: list[str]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Give me travel advice for Tokyo"}
    ],
    response_format=TravelAdvice,
    temperature=0,  # Required for "auto" caching
    extra_body={
        "metadata": {"agent_id": "travel-advisor"}
    }
)
```
Caching with AnotherAI features
Caching and Deployments
When using deployments, the version hash automatically includes all deployment parameters:
```python
# Both requests will use the same cache if inputs match
# The deployment version determines model, temperature, and prompt
completion1 = client.chat.completions.create(
    model="anotherai/deployment/customer-support:production#1",
    messages=[{"role": "user", "content": "How do I reset my password?"}],
    extra_body={
        "use_cache": "always"
    }
)

# Subsequent identical request hits the cache
completion2 = client.chat.completions.create(
    model="anotherai/deployment/customer-support:production#1",
    messages=[{"role": "user", "content": "How do I reset my password?"}],
    extra_body={
        "use_cache": "always"
    }
)
```
Monitoring cache performance
Track cache hit rates using AnotherAI's observability tools.
Ask questions in natural language using Claude Code:
- "Show me the cache hit rate for each of my agents over the last 7 days"
- "What's the total requests, cache hits, and cache hit rate percentage for all agents this week?"
Caching with images
When using images as input to your models, it's important to understand how the caching mechanism handles different image formats:
Image input formats and cache behavior
The cache hash is computed based on the exact input provided, not the actual content of the image. This means:
- If you provide an image as base64-encoded data, the cache hash will be calculated from that base64 string.
- If you provide an image as a URL (e.g., S3 URL), the cache hash will be calculated from the URL string itself.
Important: Even if both inputs represent the same image content, they will produce different cache hashes because the input format differs. The system does not download and compare image contents when computing cache hashes, as this would defeat the performance benefits of caching.
Examples
```python
# These two requests will have DIFFERENT cache hashes, even if the image is the same

# Request 1: Using base64
completion1 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}}
        ]
    }],
    temperature=0,
    metadata={"agent_id": "image-analyzer"}
)

# Request 2: Using S3 URL for the same image
completion2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://s3.amazonaws.com/bucket/image.jpg"}}
        ]
    }],
    temperature=0,
    metadata={"agent_id": "image-analyzer"}
)

# These will NOT hit the same cache entry
```
Best practices for image caching
To maximize cache hits when working with images:
- Be consistent with your image format: Choose either base64 or URL format and stick to it across your application.
- Use stable URLs: If using URLs, ensure they don't contain changing parameters (like timestamps or signatures) that would alter the cache hash.
- Consider preprocessing: If your image sources vary, standardize them to one format before making API calls (see the sketch below).
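For example, one way to standardize is to always send base64 data, no matter where an image originates. A minimal sketch of such a helper (the function name and the use of `requests` are illustrative assumptions, not part of AnotherAI):

```python
import base64
import requests  # any HTTP client works; requests is assumed here for brevity

def image_as_data_url(source: str, mime_type: str = "image/jpeg") -> str:
    """Normalize a local path or an HTTP(S) URL into a base64 data URL.

    Sending the same canonical form every time keeps the cache hash stable
    for identical image bytes, regardless of where the image is hosted.
    """
    if source.startswith(("http://", "https://")):
        raw = requests.get(source, timeout=30).content
    else:
        with open(source, "rb") as f:
            raw = f.read()
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

You would then pass the returned data URL as the `image_url` value in your message content, so the cache hash depends only on the image bytes and the rest of the input.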