Testing New Models on an Existing Agent
How to test and compare new models before switching
Suppose you're currently using GPT-4o-mini for a given agent, but you've heard rave reviews about the newly released GPT-5. You want to test whether the quality improvement justifies the higher cost before switching.
New models are released so often that this scenario is extremely common, so we wanted to make it as easy as possible for our users to test new models on existing agents.
Creating a side-by-side model comparison
Ask your AI assistant to create an experiment comparing your current model and the new one:
```
Compare how anotherai/agent/calendar-event-extractor performs using current GPT-4o-mini
vs the new GPT-5 model
```

If you want to be sure to test with real data, you can modify your prompt to include that instruction:

```
Compare how anotherai/agent/calendar-event-extractor performs using current GPT-4o-mini
vs the new GPT-5 model. Use the inputs from the last 20 completions.
```

Alternatively, if you have a dataset of standard test inputs that you like to use to validate changes, you can modify your prompt to include that instruction:

```
Compare how anotherai/agent/calendar-event-extractor performs using current GPT-4o-mini
vs the new GPT-5 model. Use the inputs from @email_test_cases.txt
```

Tip: Based on our testing, we've found Claude Opus to be our preferred model for evaluating the side-by-side performance of two other models.
Your AI assistant will analyze the results and provide a clear comparison:

Or you can view the experiment in the AnotherAI experiments view to see a side-by-side comparison of how each version in the experiment handles each input.

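If you want a rough sense of what such an experiment is doing under the hood, the core idea is simply to run the same inputs through both models and compare the outputs side by side. The sketch below illustrates that idea with the OpenAI Python SDK; it is not AnotherAI's API, and the model identifiers, system prompt, and sample inputs are placeholders standing in for whatever your calendar-event-extractor agent actually uses.

```python
# Illustrative sketch only: run the same inputs through two models and
# print the outputs side by side. The model names, system prompt, and
# sample inputs below are assumptions, not AnotherAI-specific values.
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # placeholder credential

MODELS = ["gpt-4o-mini", "gpt-5"]  # current model vs. candidate model

test_inputs = [
    "Lunch with Sam next Tuesday at noon at Cafe Rio",
    "Quarterly review moved to Friday 3pm, invite the finance team",
]

for text in test_inputs:
    print(f"\nInput: {text}")
    for model in MODELS:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "Extract the calendar event from the text."},
                {"role": "user", "content": text},
            ],
        )
        print(f"  {model}: {response.choices[0].message.content}")
```

In practice you'll usually let your AI assistant build the experiment as described above; a quick script like this is only a gut check on what the comparison amounts to.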
Tip: Use real production data for testing, not artificial examples, to ensure you're testing the new model against real-life scenarios.