Use cases
Comparing Different Prompts and Models
This use case demonstrates how to compare different prompt variations and model performance using AnotherAI's playground feature.
Overview
Suppose you're building a new AI agent that extracts calendar events from emails. You want to make sure your agent performs well on a variety of emails, but you're not sure which model will work best or what your prompt needs to say to cover all the use cases. To build the best agent possible, you may need to:
- Compare performance across different models (GPT-4, Claude, Gemini, etc.)
- Test multiple prompt variations to find the most effective approach
- Optimize for specific metrics like cost, speed, and accuracy
To compare performance across different models, use the playground feature to test your agent with different models and prompts.
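Throughout these examples, calendar_event_extractor.py stands in for your own agent code. As a point of reference, a minimal sketch of such an agent might look like the following; it assumes an OpenAI-compatible chat completions client, and the environment variables, model name, and prompt wording are all illustrative rather than anything AnotherAI requires:

```python
# calendar_event_extractor.py -- hypothetical sketch of the agent used in these examples.
# Assumes an OpenAI-compatible chat completions endpoint; adjust base_url, api_key,
# and model to whatever your setup actually uses.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("LLM_BASE_URL"),  # assumption: an OpenAI-compatible gateway
    api_key=os.environ["LLM_API_KEY"],
)

SYSTEM_PROMPT = (
    "Extract every calendar event mentioned in the email below. "
    "Return a JSON object with an `events` list; each event has "
    "`title`, `start`, `end`, `location`, and `attendees` fields."
)

def extract_events(email_body: str, model: str = "gpt-4o-mini") -> str:
    """Run the extraction prompt against the given model and return the raw JSON output."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": email_body},
        ],
        response_format={"type": "json_object"},
    )
    return response.choices[0].message.content
```

Keeping the prompt and model choice in one file like this makes it easy to point Claude at exactly the code you want tested.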
Ask Claude (or your preferred AI agent) to create an experiment that compares the content you're interested in. You can also ask Claude to pick the best models for you based on different criteria (e.g. cost, speed, intelligence) or to create a new version of your prompt for you.
Use AnotherAI to compare how Claude Sonnet 4, GPT 4.1, and Gemini 2.5 Flash handle @calendar_event_extractor.py
or
Use AnotherAI to compare how @calendar_event_extractor.py performs with @test_prompt_1 and @test_prompt_2
Once Claude has created the experiment, it will send you a link, and you can view the results in the playground.
Tips:
- Always reference the specific files of your agent when requesting experiments to avoid any ambiguity about what should be tested
- Pick one variable to test at a time (e.g. models or prompts) so you can easily attribute any change in the agent's performance to that variable.
AI-Powered Reviews and Improvements For Your Agent
This use case demonstrates how to iteratively improve your agent by having Claude (or your preferred AI agent) review runs, leave feedback, and suggest improvements.
Overview
Suppose you have created an experiment that compares two prompts for an agent that extracts calendar events from emails, and you want to understand which prompt version performs better.
To understand how your prompt performs:
Ask Claude to review the completions in the experiment and leave feedback
Review the completions in anotherai/experiment/019885bb-24ea-70f8-c41b-0cbb22cc3c00 and leave feedback about the overall tone and accuracy of each completion, and give me a summary of your findings.
After Claude has provided feedback, you can ask it to create a new version of the agent that addresses the issues you want to fix.
Update calendar_event_extractor.py prompt to address all issues you found while reviewing
Deploy Agent Fixes
This use case demonstrates how to deploy prompt improvements and model changes without modifying your production code.
Overview
Suppose you've identified and fixed issues with your calendar event extraction system message. Now you need to deploy the improvement to production. For non-breaking changes, AnotherAI's deployment feature lets you update your agent's behavior instantly without touching any production code.
After validating improvements in an experiment, deploy the best-performing version:
Deploy version anotherai/version/a9f1fc5ab11299a9fee5604e51fe7b6e to production for calendar_event_extractor.py
For non-breaking changes, you can deploy the new version to your existing deployment without making any code changes. Just ask Claude to update the deployment for you:
Update deployment calendar-event-extractor/production#1 to use anotherai/version/a9f1fc5ab11299a9fee5604e51fe7b6e
For breaking changes, Claude will automatically create a new deployment for you. In these cases, you will need to update your code to use the new deployment id.
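To illustrate why non-breaking changes require no code changes, here is a hypothetical sketch of a production call site that references a deployment. The exact way the deployment id is passed to the client is an assumption, not AnotherAI's documented API; the point is that the deployment id is the only reference your code holds.

```python
# Hypothetical production call site -- a sketch, not AnotherAI's documented API.
# Non-breaking changes update which version the deployment points to on AnotherAI's
# side, so this file never changes. Only a breaking change (which creates a new
# deployment) means editing DEPLOYMENT_ID below.
import os
from openai import OpenAI

client = OpenAI(base_url=os.environ.get("LLM_BASE_URL"), api_key=os.environ["LLM_API_KEY"])

DEPLOYMENT_ID = "calendar-event-extractor/production#1"

def extract_events_in_production(email_body: str) -> str:
    response = client.chat.completions.create(
        model=DEPLOYMENT_ID,  # assumption: the deployment id stands in for the model name
        messages=[{"role": "user", "content": email_body}],
    )
    return response.choices[0].message.content
```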
Review Performance Metrics (Latency, Cost)
This use case demonstrates how you can create specialized views to monitor your agents' performance metrics and completions.
Overview
Suppose your calendar event extraction agent is becoming increasingly popular in your product. You want to keep an eye on the overall spend on that agent to make sure you don't go over budget, and you also want to track the speed of the models you're using so your customers aren't left with long wait times.
You can create custom views to monitor:
- The cost of one or more agents
- The average latency of one or more agents
- Specific completion data from specific agents
and more!
To create a view:
Describe the view you want, or the goal you're trying to achieve, to Claude (or your preferred AI agent):
Create a view in AnotherAI that shows me the daily cost of calendar_event_extractor.py
or
Create a view in AnotherAI that shows me the total number of completions per day across all agents
or
Create a view in AnotherAI that shows all the completion outputs and annotations for @calendar_event_extractor.py
After creating a view, you can always make adjustments to it:
Update anotherai/views/calendar-event-extractor-completions to include the completion inputs as well
Debug Agent Issues
This use case demonstrates how to identify, analyze, and resolve issues when your AI agent is not working as expected using AnotherAI's debugging tools.
Overview
Suppose your calendar event extraction agent is missing some of the events discussed in an email thread. You need to figure out what's wrong and fix it quickly before more users are affected. Luckily, AnotherAI can help you and your AI agent do this.
To debug issues with your agent:
Locate the completion or completions with the issues present in AnotherAI's web app.
Ask Claude (or your preferred AI agent) to investigate the problem and identify patterns
Why does this completion not include the backup rain date event? anotherai/completion/019885bb-24ea-70b7-d769-2e90792e0b6d
Claude will review the completion and offer some suggestions:
It looks like the rain date event is a backup event that is contingent on the original event being cancelled, and the prompt is missing clear instructions on how to handle cases like this. Would you like me to create a new version of the prompt that handles this case properly?
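For illustration only, a fix along these lines might add an explicit instruction about contingent events to the system prompt; the wording and the field name below are hypothetical, not something AnotherAI generates for you:

```python
# Hypothetical revision to the agent's system prompt -- wording is illustrative.
SYSTEM_PROMPT = (
    "Extract every calendar event mentioned in the email below, including "
    "conditional or backup events such as rain dates. For a backup event, "
    "include it in the output and set a `contingent_on` field naming the "
    "event it depends on, rather than omitting it."
)
```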
After a fix has been proposed, you can create a new experiment that tests the new version against the old one to confirm it works without introducing regressions:
Create a new experiment to compare the new prompt side by side with the old prompt and validate that it's working as expected
Tips:
- Always provide a link to at least one completion in AnotherAI's web app that exhibits the issue; it makes it much easier for Claude to understand the problem.
- Always test your updates with a few different inputs so you don't miss unexpected regressions.
Compare New Model vs Production Model
This use case demonstrates how you can quickly evaluate new AI models against your current production model to help you make informed decisions about model updates.
Overview
Suppose you're currently using GPT-4o-mini for your calendar event extraction agent, but you've heard rave reviews about the newly released GPT-5. You want to test whether the quality improvement justifies the higher cost before switching.
Ask Claude to create an experiment comparing your current model with the new one:
Compare how @calendar_event_extractor.py performs using current GPT-4o-mini vs the new GPT-5 model
If you want to be sure to test with real data, include that instruction in your request:
Compare how @calendar_event_extractor.py performs using current GPT-4o-mini vs the new GPT-5 model. Use the inputs from the last 20 completions.
Alternatively, if you have a dataset of standard test inputs you like to use to validate changes, include that instruction instead:
Compare how @calendar_event_extractor.py performs using current GPT-4o-mini vs the new GPT-5 model. Use the inputs from @email_test_cases.txt
Tips:
- Use real production data for testing, not artificial examples