Evaluating an Agent
Learn how to use LLMs and datasets to create a robust evaluation system for your AI agents.
Evaluating your AI agents is important for ensuring they meet quality standards and perform reliably in production. However, there are no one-size-fits-all evaluation criteria: some agents only need to be evaluated on accuracy, while others have additional criteria such as tone, clarity, and verbosity.
Because of these differences, AnotherAI does not provide a single rigid evaluation tool, but instead gives you the tools to create a custom evaluation system.
Using LLMs as a Judge
LLM-as-a-judge style evaluation uses one AI model to evaluate the outputs of another. This approach enables scalable, consistent evaluation of your agents' responses across multiple criteria. Setting up an LLM-as-a-judge evaluation system is especially useful if you want to evaluate most or all completions as they are created in your product.
Key Benefits
- Scalability: Evaluate hundreds or thousands of outputs automatically
- Consistency: Apply the same evaluation criteria uniformly using a single judge model
- Structured Feedback: Get detailed scores and explanations for each criterion
- Continuous Monitoring: Track quality over time as you iterate
Example: Email Summarizer Evaluation
Let's walk through evaluating an email summarization agent. In this example, our agent:
- Takes an email as input
- Returns a summary
# Example agent usage
from openai import AsyncOpenAI
import os

# Initialize the AnotherAI client
client = AsyncOpenAI(
    base_url="https://api.anotherai.dev/v1",
    api_key=os.environ.get("ANOTHERAI_API_KEY")
)

async def summarize_email(email_content, model="gpt-4o-mini"):
    """Summarize an email using the specified model"""
    response = await client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize this email concisely."},
            {"role": "user", "content": email_content}
        ]
    )
    return response.choices[0].message.content
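For a quick sanity check, you can call this helper directly before building any evaluation around it. A minimal sketch; the sample email below is made up:

import asyncio

# Spot-check the summarizer on a single made-up email
sample_email = "Subject: Standup moved\n\nHi all, tomorrow's standup moves to 10 AM. Same video link as usual.\n\n- Priya"
print(asyncio.run(summarize_email(sample_email)))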
Decide on Your Dataset
Before evaluating, you need to determine the test data for your agent. You have two options:
- Static Dataset: Pre-defined test cases for consistent, reproducible testing
- Production Data: Recent completions from your live agent for real-world evaluation
To learn more about both approaches, see Static Dataset Evaluation and Production Data Evaluation below.
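Whichever option you choose, the evaluation pipeline below expects each test case as a record with an "id" and the raw email "content", so both approaches should end up producing the same shape of data. A minimal sketch with placeholder values:

# Both dataset options should yield records shaped like this (placeholder values)
emails = [
    {"id": "example-001", "content": "Subject: ...\n\nFirst email body"},
    {"id": "example-002", "content": "Subject: ...\n\nSecond email body"}
]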
Create the Judge Agent
Define evaluation criteria and scoring structure:
from pydantic import BaseModel, Field

class CriterionScore(BaseModel):
    """Score for a single evaluation criterion"""
    score: int = Field(ge=1, le=10, description="Score from 1-10")
    explanation: str = Field(description="Detailed explanation for the score")

class SummaryEvaluation(BaseModel):
    """Complete evaluation of an email summary"""
    completeness: CriterionScore
    accuracy: CriterionScore
    clarity: CriterionScore
    conciseness: CriterionScore
    overall_score: float = Field(ge=1, le=10)
    overall_feedback: str

async def create_summary_judge(client, original_email, summary):
    """Judge the quality of an email summary"""
    return await client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # Use a consistent judge model for fair evaluation
        messages=[{
            "role": "system",
            "content": """You are an expert evaluator of email summaries.

Evaluate the summary on these criteria:
1. Completeness (1-10): Does it capture the key information?
2. Accuracy (1-10): Is the information correctly represented?
3. Clarity (1-10): Is it clear and well-structured?
4. Conciseness (1-10): Is it appropriately brief?

Provide detailed explanations for each score."""
        }, {
            "role": "user",
            "content": f"""Original email:\n{original_email}\n\nSummary to evaluate:\n{summary}"""
        }],
        response_format=SummaryEvaluation
    )
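Before wiring the judge into the full pipeline, you can run it on a single email/summary pair to confirm the structured output parses as expected. A minimal sketch, reusing the client and summarize_email helper defined above (the email text is made up):

import asyncio

async def check_judge():
    # Made-up email for a one-off check of the judge's structured output
    email = "Subject: Offsite confirmed\n\nHi team, the offsite is confirmed for March 12 in Lyon. Please RSVP by Friday.\n\n- Ana"
    summary = await summarize_email(email)
    eval_response = await create_summary_judge(client, email, summary)
    evaluation = eval_response.choices[0].message.parsed
    print(evaluation.completeness.score, evaluation.overall_feedback)

asyncio.run(check_judge())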
Run the Evaluation Pipeline
Evaluate your agent across multiple models:
import asyncio
import httpx
import json
import os
from datetime import datetime

async def process_email(email, model, client, experiment_id):
    """Process a single email with a specific model"""
    # Generate summary using the model
    summary_response = await client.chat.completions.create(
        model=model,
        messages=[{
            "role": "system",
            "content": "Summarize this email concisely."
        }, {
            "role": "user",
            "content": email['content']
        }],
        extra_body={
            "metadata": {
                "experiment_id": experiment_id,
                "agent_id": "email-summarizer"
            }
        }
    )
    summary = summary_response.choices[0].message.content
    completion_id = summary_response.id

    # Evaluate the summary
    eval_response = await create_summary_judge(client, email['content'], summary)
    evaluation = eval_response.choices[0].message.parsed

    # Build annotations to send to AnotherAI
    annotations = []

    # Individual criterion scores
    for criterion in ['completeness', 'accuracy', 'clarity', 'conciseness']:
        score_data = getattr(evaluation, criterion)
        annotations.append({
            "target": {
                "completion_id": completion_id,
                "key_path": criterion
            },
            "metric": {
                "name": criterion,
                "value": score_data.score
            },
            "text": score_data.explanation,
            "metadata": {
                "model": model,
                "email_id": email['id'],
                "experiment_id": experiment_id
            },
            "author_name": "summary-judge"
        })

    # Overall score
    annotations.append({
        "target": {"completion_id": completion_id},
        "metric": {
            "name": "overall_score",
            "value": evaluation.overall_score
        },
        "text": evaluation.overall_feedback,
        "metadata": {
            "model": model,
            "experiment_id": experiment_id
        },
        "author_name": "summary-judge"
    })

    # Send annotations to the AnotherAI API
    async with httpx.AsyncClient() as http_client:
        response = await http_client.post(
            "https://api.anotherai.dev/v1/annotations",
            json={"annotations": annotations},
            headers={"Authorization": f"Bearer {os.environ.get('ANOTHERAI_API_KEY')}"}
        )
        if response.status_code != 200:
            print(f"Failed to send annotations: {response.text}")

    return {
        "email_id": email['id'],
        "model": model,
        "success": True,
        "evaluation": evaluation
    }

async def evaluate_email_summaries(emails, models, client):
    """Run the complete evaluation pipeline"""
    # Create an experiment ID to group results
    experiment_id = f"email-eval-{datetime.now().strftime('%Y%m%d-%H%M%S')}"

    # Process all email/model combinations concurrently
    tasks = [
        process_email(email, model, client, experiment_id)
        for email in emails
        for model in models
    ]
    results = await asyncio.gather(*tasks)

    # Calculate per-model statistics
    model_stats = {}
    for model in models:
        model_results = [r for r in results if r['model'] == model and r['success']]
        if model_results:
            avg_score = sum(r['evaluation'].overall_score for r in model_results) / len(model_results)
            model_stats[model] = {
                "average_score": avg_score,
                "success_count": len(model_results),
                "total_count": len([r for r in results if r['model'] == model])
            }

    return {
        "experiment_id": experiment_id,
        "model_stats": model_stats,
        "detailed_results": results
    }
Analyze Results
After running your evaluation, review the results and use Claude to help analyze them:
# Example usage
async def main():
    # Initialize the client
    client = AsyncOpenAI(
        base_url="https://api.anotherai.dev/v1",
        api_key=os.environ.get("ANOTHERAI_API_KEY")
    )

    # Load the test dataset
    with open('dataset.json', 'r') as f:
        emails = json.load(f)['emails']

    # Models to compare
    models = ["gpt-4o-mini", "gpt-4.1-nano-latest"]

    # Run the evaluation
    results = await evaluate_email_summaries(emails, models, client)

    # Print a summary
    print(f"\nExperiment ID: {results['experiment_id']}")
    print("\nModel Performance:")
    for model, stats in results['model_stats'].items():
        print(f"  {model}:")
        print(f"    Average Score: {stats['average_score']:.2f}/10")
        print(f"    Success Rate: {stats['success_count']}/{stats['total_count']}")

# Run the evaluation
asyncio.run(main())
This code:
- Runs your email summarizer agent to create completions (summaries)
- Runs the judge agent to evaluate each summary
- Stores the evaluation scores as annotations on the summarizer's completions in AnotherAI
You can then ask Claude to analyze these evaluation results by querying the annotations:
Which model configuration of email-summarizer performed best overall according to the judge?
What were the most common issues the judge found in the email summaries?
Track Performance Over Time
If you want to run your evaluation regularly, you can set up continuous monitoring by asking Claude to create custom views that show the evaluation results:
Create a view that shows all completions of @[your-agent-name] with annotations.
The view should display the completion ID, input, outputs, and the annotation left on the completion.
Dataset Options:
Static Dataset Evaluation
Static dataset evaluation uses pre-defined test cases stored in files (JSON, CSV, etc.) to test your agent consistently. We recommend starting with at least 20 test cases, though the right number may be lower or much higher depending on the complexity of your agent.
Benefits
- Reproducibility: Run the exact same tests across different agent versions
- Edge Case Coverage: Include specific scenarios you know are challenging
- Regression Testing: Ensure updates don't break existing functionality
Drawbacks
- Maintenance: Test cases can become outdated as requirements change
- Limited Coverage: May not reflect actual production usage patterns
- Artificial Examples: Test cases might not capture real-world complexity
Example Implementation
Create a JSON file with diverse test cases. For example:
{
  "emails": [
    {
      "id": "budget-meeting-001",
      "content": "Subject: Q4 Budget Review Meeting\n\nHi team,\n\nI'd like to schedule our Q4 budget review for next Friday at 2 PM. Please come prepared with your department's spending reports and projections for next quarter.\n\nThe meeting will cover:\n- Current quarter spending analysis\n- Next quarter budget allocations\n- Cost-saving initiatives\n\nPlease confirm your attendance by Wednesday.\n\nBest,\nSarah",
      "category": "meeting-request"
    },
    {
      "id": "urgent-support-002",
      "content": "Subject: URGENT: Production Server Down\n\nTeam,\n\nOur main production server went down at 3:47 AM. Customer transactions are failing.\n\nImmediate actions needed:\n1. All hands on deck - cancel morning meetings\n2. DevOps team investigate root cause\n3. Customer Success prepare communications\n\nJoin emergency call: meet.company.com/emergency\n\n- Mark (CTO)",
      "category": "urgent-technical"
    }
  ]
}
Then load and use it in your evaluation:
# Load static dataset
import json
import os
from openai import AsyncOpenAI

with open('test_emails.json', 'r') as f:
    emails = json.load(f)['emails']

# Initialize client
client = AsyncOpenAI(
    base_url="https://api.anotherai.dev/v1",
    api_key=os.environ.get("ANOTHERAI_API_KEY")
)

# Define models to test
models = ["gpt-4o-mini", "gpt-4.1-nano-latest"]

# Run evaluation (from inside an async function)
results = await evaluate_email_summaries(emails, models, client)
Production Data Evaluation
Production data evaluation uses recent completions from your live agent to test performance on real-world data.
Benefits
- Real-World Testing: Evaluate on actual data your users are submitting
- Current Relevance: Always testing on the latest usage patterns
- No Maintenance: No need to update test cases manually
Drawbacks
- Inconsistency: Different data each time makes comparison harder
- No Edge Cases: Might miss rare but important scenarios
- Privacy Concerns: May need to filter sensitive production data
- Availability: Requires existing completions to evaluate
Example Implementation
To evaluate production data, you have two options:
Option 1: Use the playground MCP tool (Recommended)
Ask Claude to create an experiment with recent completions:
Create an experiment using the playground tool that tests gpt-4o-mini on the last 50
completions from my email-summarizer agent from the past day.
Option 2: Query and evaluate manually
Modify the code from the Run the Evaluation Pipeline step to query recent completions first:
# Query recent completions - ask Claude to fetch them:
# "Get the last 50 completions from my email-summarizer agent from the past day"
# `response` below stands for the API response returned by that query (not shown here).
completions = response.json()['results']

# Extract inputs from the completions
emails = []
for completion in completions:
    # Pull the original email out of each completion's input
    emails.append({
        "id": completion['id'],
        "content": completion['input']['email_content']
    })

# Now run the evaluation pipeline from the Run the Evaluation Pipeline step.
# Note: in AnotherAI, an experiment_id is typically a string identifier you create
# rather than something registered via an API.
experiment_id = f"prod-eval-{datetime.now().strftime('%Y%m%d-%H%M%S')}"

# Run the evaluation on the production data
results = await evaluate_email_summaries(emails, ["gpt-4o-mini"], client)
Choosing Your Approach
Use Static Datasets when:
- You need consistent benchmarks across versions
- You have specific edge cases to test
- You're comparing model performance
- You need reproducible results for stakeholders
Use Production Data when:
- You want to understand current performance
- You're monitoring quality over time
- You have sufficient production volume
- You want to catch real-world issues