Caching
Learn how to enable and configure caching in the AnotherAI API.
Think of caching as free key-value storage: the LLM input acts as the "key" and the generated output becomes the "value". Once a result is cached, identical requests return it instantly and at no cost.
AnotherAI's caching lets you reuse the results of identical requests, saving both time and money. When enabled, matching requests return the stored result for free instead of triggering a redundant call to the LLM.
How caching works
Input Hash
A unique hash is calculated based on the input provided to the model. This can be:
- The list of `messages`, if no specific input variables are used.
- A combination of the defined `input` variables (passed via `extra_body`) and the list of `messages` (relevant for replies, or when messages supplement templated prompts).
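The exact hashing scheme is an AnotherAI internal, but you can think of it as hashing a canonical serialization of the request input. A minimal sketch of the idea, assuming JSON serialization and SHA-256 (both illustrative choices, not AnotherAI's actual implementation):

```python
import hashlib
import json

def input_hash(messages: list[dict], input_variables: dict | None = None) -> str:
    """Illustrative only: build a cache key from the request input.

    With input variables, the hash covers the variables plus the messages;
    without them, the messages alone determine the hash.
    """
    payload: dict = {"messages": messages}
    if input_variables:
        payload["input"] = input_variables
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```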
Version Hash
A hash representing the agent's configuration is computed. This typically includes:
- The model identifier (e.g., `gpt-4o`).
- The `temperature` setting.
- Other version parameters (such as `top_p`, `max_tokens`, etc.).
- For calls using input variables: the message templates are also factored in.
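The Version Hash can be pictured the same way, just over the configuration instead of the input. Continuing the illustrative sketch above (the real set of parameters and their canonical form are AnotherAI internals):

```python
import hashlib
import json

def version_hash(model: str, temperature: float,
                 other_params: dict | None = None,
                 message_templates: list | None = None) -> str:
    """Illustrative only: build a hash of the agent's configuration."""
    payload: dict = {"model": model, "temperature": temperature, **(other_params or {})}
    if message_templates:  # only relevant when input variables / templates are used
        payload["templates"] = message_templates
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```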
Cache Check
Before calling the LLM provider, and depending on the caching option (see below), AnotherAI checks whether a previous run exists with the exact same Input Hash and Version Hash.
Cache Hit
If a matching run is found, its saved output is returned immediately, bypassing the actual model call.
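Putting the steps together, the control flow conceptually looks like the sketch below. It reuses the `input_hash` and `version_hash` sketches above; the in-memory dictionary and the `call_llm` callable stand in for AnotherAI's run storage and provider call and are not real APIs:

```python
_runs: dict[tuple[str, str], object] = {}  # illustrative stand-in for stored runs

def run_with_cache(messages, version, input_variables=None,
                   use_cache="auto", tools=None, call_llm=None):
    """Illustrative only: check the cache before calling the provider."""
    key = (input_hash(messages, input_variables),
           version_hash(version["model"], version["temperature"]))

    lookup = use_cache == "always" or (
        use_cache == "auto" and version["temperature"] == 0 and not tools
    )
    if lookup and key in _runs:
        return _runs[key]  # cache hit: the saved output is returned, no model call

    output = call_llm(messages, version, tools)  # cache miss: real provider call
    _runs[key] = output  # the run is stored, so an identical request can hit the cache
    return output
```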
Caching options
The behavior is controlled by the `use_cache` parameter, which can be passed in the `extra_body` of your API request. It accepts the following values:
| Option | Description | Conditions |
|---|---|---|
| `"auto"` (default) | The cache is checked only under specific conditions | `temperature` must be 0 and no tools are provided in the request |
| `"always"` | The cache is always checked | Regardless of `temperature` setting or use of tools |
| `"never"` | The cache is never checked | No caching occurs |
When does AnotherAI look up the cache?
Caching behavior depends on three things:
- `use_cache` (`"auto"`, `"always"`, `"never"`)
- `temperature`
- Whether you include `tools` in the request body
The table below shows whether a cache lookup occurs for every combination that matters:
| `use_cache` value | `temperature` | `tools` present? | Cache lookup |
|---|---|---|---|
| `"auto"` (default) | 0 | No | ✅ Yes |
| `"auto"` | 0 | Yes | ❌ No |
| `"auto"` | > 0 | Any | ❌ No |
| `"always"` | Any | Any | ✅ Yes |
| `"never"` | Any | Any | ❌ No |
Why the cache is OFF by default when using the OpenAI-compatible endpoint
The OpenAI `chat/completions` endpoint defaults to `temperature = 1`, and AnotherAI inherits that default. Combined with `use_cache = "auto"`, the first row that matches in the table is the third one (`temperature > 0` → ❌ No).
Therefore, a request will not use the cache unless you either:
- set `temperature = 0` and omit `tools`, or
- set `use_cache = "always"`.
Examples:
```python
import openai

# Configure the OpenAI client to use AnotherAI
client = openai.OpenAI(
    api_key="YOUR_ANOTHERAI_API_KEY",
    base_url="https://api.anotherai.dev/v1"
)

# Example 1: Cache is NEVER hit (default behavior)
# Reason: temperature defaults to 1, and use_cache defaults to "auto"
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Describe the meaning of life"}],
    metadata={"agent_id": "my-chatbot"}
)

# Example 2: Cache CAN be hit
# Reason: temperature is explicitly 0, meeting the "auto" cache condition.
# A subsequent identical request will hit the cache.
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Describe the meaning of life"}],
    temperature=0,
    metadata={"agent_id": "my-chatbot"}
)

# Example 3: Cache CAN be hit (using a deployment and "always")
# Reason: use_cache="always" forces a cache check regardless of temperature.
completion = client.chat.completions.create(
    model="anotherai/deployment/travel-assistant:production#1",  # Using an AnotherAI deployment
    # messages might be empty if the prompt is fully server-side
    messages=[],  # the SDK requires messages, so pass an empty list
    extra_body={
        "input": {
            "destination": "Paris",
            "traveler_type": "business"
        },
        "use_cache": "always"  # Force cache check
    }
    # response_format=MyPydanticModel  # If expecting structured output
)
# Example 4: Caching with structured outputs
from pydantic import BaseModel

class TravelAdvice(BaseModel):
    destination: str
    tips: list[str]
    warnings: list[str]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Give me travel advice for Tokyo"}
    ],
    response_format=TravelAdvice,
    temperature=0,  # Required for "auto" caching
    extra_body={
        "metadata": {"agent_id": "travel-advisor"}
    }
)
```
Caching with AnotherAI features
Caching and Deployments
When using deployments, the version hash automatically includes all deployment parameters:
```python
# Both requests will use the same cache if inputs match
# The deployment version determines model, temperature, and prompt
completion1 = client.chat.completions.create(
    model="anotherai/deployment/customer-support:production#1",
    messages=[{"role": "user", "content": "How do I reset my password?"}],
    extra_body={
        "use_cache": "always"
    }
)

# Subsequent identical request hits the cache
completion2 = client.chat.completions.create(
    model="anotherai/deployment/customer-support:production#1",
    messages=[{"role": "user", "content": "How do I reset my password?"}],
    extra_body={
        "use_cache": "always"
    }
)
```
Monitoring cache performance
Track cache hit rates using AnotherAI's observability tools.
Ask questions in natural language using Claude Code:
- "Show me the cache hit rate for each of my agents over the last 7 days"
- "What's the total requests, cache hits, and cache hit rate percentage for all agents this week?"
Caching with images
When using images as input to your models, it's important to understand how the caching mechanism handles different image formats:
Image input formats and cache behavior
The cache hash is computed based on the exact input provided, not the actual content of the image. This means:
- If you provide an image as base64-encoded data, the cache hash will be calculated from that base64 string.
- If you provide an image as a URL (e.g., S3 URL), the cache hash will be calculated from the URL string itself.
Important: Even if both inputs represent the same image content, they will produce different cache hashes because the input format differs. The system does not download and compare image contents when computing cache hashes, as this would defeat the performance benefits of caching.
Examples
```python
# These two requests will have DIFFERENT cache hashes, even if the image is the same

# Request 1: Using base64
completion1 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}}
        ]
    }],
    temperature=0,
    metadata={"agent_id": "image-analyzer"}
)

# Request 2: Using S3 URL for the same image
completion2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://s3.amazonaws.com/bucket/image.jpg"}}
        ]
    }],
    temperature=0,
    metadata={"agent_id": "image-analyzer"}
)

# These will NOT hit the same cache entry
```
Best practices for image caching
To maximize cache hits when working with images:
- Be consistent with your image format: Choose either base64 or URL format and stick to it across your application.
- Use stable URLs: If using URLs, ensure they don't contain changing parameters (like timestamps or signatures) that would alter the cache hash.
- Consider preprocessing: If your image sources vary, standardize them to one format before making API calls (see the sketch below).
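For example, one way to standardize is to always send base64 data, no matter where an image originates. A minimal sketch of such a helper (the function name and the use of `requests` are illustrative assumptions, not part of AnotherAI):

```python
import base64
import requests  # any HTTP client works; requests is assumed here for brevity

def image_as_data_url(source: str, mime_type: str = "image/jpeg") -> str:
    """Normalize a local path or an HTTP(S) URL into a base64 data URL.

    Sending the same canonical form every time keeps the cache hash stable
    for identical image bytes, regardless of where the image is hosted.
    """
    if source.startswith(("http://", "https://")):
        raw = requests.get(source, timeout=30).content
    else:
        with open(source, "rb") as f:
            raw = f.read()
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

You would then pass the returned data URL as the `image_url` value in your message content, so the cache hash depends only on the image bytes and the rest of the input.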