
Using Experiments to Improve Agents

Learn how to use experiments to compare different agent configurations, test variations, and optimize for cost, speed, and accuracy

Experiments

Throughout the process of creating a new agent, making it as good as possible may require you to:

  • Compare quality, cost, and speed across different models (GPT-5, Claude 4 Sonnet, Gemini 2.0 Flash, etc.)
  • Test multiple prompt variations to find which produces the most accurate, useful, or appropriately-toned outputs
  • Optimize for specific metrics like cost, speed, and accuracy

Experiments allow you to systematically compare these parameters across one or more test inputs (the starting data you give your agent to process) and find the optimal setup for your use case.

Creating Experiments

To create an experiment:

Configure MCP

Make sure you have the AnotherAI MCP configured and enabled. You can view the setup steps here.
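
For example, if you use Claude Code, registering an MCP server is a single command. This is only a sketch: the actual server URL (and any authentication) comes from the setup steps linked above, so treat the URL below as a placeholder.

# Placeholder URL: substitute the real AnotherAI MCP endpoint from the setup guide
claude mcp add --transport http anotherai https://example.com/mcp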

Create Experiment

Then just ask your preferred AI assistant to set up experiments for you. We'll cover some common examples and sample messages you can use below.

The most common parameters to experiment with are prompts and models; however, you can also experiment with other parameters, like temperature.

Prompts

Comparing different prompts is one of the most effective ways to improve your agent's performance. Small changes in wording, structure, or examples can lead to significant improvements. If you notice an issue with an existing prompt, you can even ask your AI assistant to generate prompt variations to use in the experiment.

Example:

Look at the prompt of anotherai/agent/email-rewriter and create an experiment in AnotherAI 
that compares the current prompt with a new prompt that better emphasizes adopting 
the tone listed in the input

Your AI assistant will create the experiment and give you an initial analysis of the results, as well as a URL to view them in the AnotherAI web app.

Terminal output showing experiment creation

You can open the provided URL in the AnotherAI web app to manually analyze the results.

Experiment view showing two prompts side by side

Models

Different models excel at different tasks. AnotherAI supports over 100 different models, and experiments can help you choose the right model for your agent, depending on its needs.

Example:

Create an AnotherAI experiment to help me find a faster model for anotherai/agent/email-rewriter
that still maintains the same tone and verbosity as my current model.

If you have a specific model in mind that you want to try, such as a newly released model, you can ask your AI assistant to help you test it against your existing agent version. You can always request that your AI assistant use inputs from existing completions to ensure that you're testing with real production data.

Example:

Can you retry the last 5 completions of anotherai/agent/email-rewriter and compare the outputs with 
GPT-5 mini?

Experiment view showing Claude 3.5 Sonnet vs GPT-5 Mini comparison

Other Parameters

Beyond prompts and models, adjusting other parameters can impact your agent's behavior and output quality. Temperature in particular can have a significant effect on the quality of the output.

Example:

Test my email-rewriter agent with temperatures 0.2, 0.5, and 0.8 to find 
the right balance between creativity and professionalism

Email Rewriter Temperature Experiment

Managing Large Experiments with Claude Code

If you're testing an agent that has a large system prompt and/or very long inputs, you may encounter token limit issues with the get_experiment MCP tool that impact Claude Code's ability to provide accurate insights on your agent.

Claude Code Token Limit Error

In this case, you can manually increase Claude Code's MCP tool output token limit.

To set the limit permanently for all terminal sessions:

For zsh (default on macOS):

echo 'export MAX_MCP_OUTPUT_TOKENS=150000' >> ~/.zshrc && source ~/.zshrc

For bash:

echo 'export MAX_MCP_OUTPUT_TOKENS=150000' >> ~/.bashrc && source ~/.bashrc

For temporary use in the current session only:

export MAX_MCP_OUTPUT_TOKENS=150000

Notes:

  • If you forget or don't realize you need to set a higher limit, you can quit your existing session, run the command to increase the limit, and then use claude --resume to continue your previous session with the increased limit applied.
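
For example, after quitting your Claude Code session:

# Raise the limit for this shell, then resume the previous session
export MAX_MCP_OUTPUT_TOKENS=150000
claude --resume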

You can learn more about tool output limits for Claude Code in their documentation.

Tips:

  • When creating experiments from your codebase, always reference the specific files of your agent when requesting experiments to avoid any ambiguity about what should be tested
  • When not in your codebase (for example, when using ChatGPT), you can reference the agent by the agent_id found in AnotherAI (ex. anotherai/agent/email-rewriter) to avoid any ambiguity about what should be tested
  • Pick one variable to test at a time (ex. models, prompts) so that you can easily attribute any changes in the agent's performance to that variable.

Analyzing Experiment Results

Once your experiment has been created, you can:

  1. Review your AI assistant's analysis of the results (and ask follow-up questions if needed)
  2. Review side-by-side comparisons in the AnotherAI experiments view
  3. Use annotations to mark which outputs are better and why (keep reading to learn more about annotations!)

Other Ways to Improve Your Agents
