Evaluating Agents

2025-09-03 23:32 · aunhumano.com

No amount of evals will replace the need to look at the data. Once your evals have good coverage you'll be able to spend less time on this, but it will always be a must to read the agent traces to identify possible issues or things to improve.

You must create evals for your agents; stop relying solely on manual testing. Not sure where to start?

Add e2e evals: define a success criterion (did the agent meet the user's goal?) and make the evals output a simple yes/no value.

This is much better than no evals.
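A yes/no e2e eval can be sketched in a few lines. This is a minimal, hypothetical illustration: `run_agent` stands in for your agent's entry point, and `judge` stands in for an LLM-as-judge call, faked here with a substring check so the sketch runs on its own.

```python
from dataclasses import dataclass

@dataclass
class E2ECase:
    user_goal: str      # what the user is trying to achieve
    first_message: str  # the message that kicks off the conversation

def judge(user_goal: str, transcript: str) -> bool:
    # Stand-in for an LLM-as-judge call asked a single question:
    # "Did the agent meet the user's goal? Answer yes or no."
    # Faked with a substring check to keep the sketch self-contained.
    return "order confirmed" in transcript.lower()

def run_e2e_evals(cases, run_agent):
    # Run each case end to end and record a plain yes/no (True/False).
    results = {}
    for case in cases:
        transcript = run_agent(case.first_message)
        results[case.user_goal] = judge(case.user_goal, transcript)
    return results

# Usage with a fake agent:
fake_agent = lambda msg: "Sure! Your order confirmed, arriving Friday."
cases = [E2ECase("place an order", "I want two pizzas")]
print(run_e2e_evals(cases, fake_agent))  # {'place an order': True}
```

The point is the shape, not the judge: a table of goal/outcome pairs you can rerun after every prompt change.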

By performing simple end-to-end agent evaluations you can quickly:

– identify problematic edge cases
– update, trim and refine the agent prompts
– make sure you are not breaking the already-working cases
– compare the performance of the current LLM vs. cheaper ones

Once you've created the e2e evals you can move on to “N – 1” evals, that is, evals that “simulate” previous interactions between the system and the user.

Suppose that, either by looking at the data or by running a set of e2e evals, you find that there is a problem when the user asks for the brand's open stores in their area. It'd be better to create an eval that targets this directly: if you keep exercising it through e2e evals you won't always be able to reproduce the error, and your evals will take too much time and cost too much money.

It’d be much better to “simulate” the previous interactions and then get to the point.

There's one issue with this: you'll have to be careful to keep the “N – 1” interactions updated whenever you make changes, because you will be “simulating” something that will never happen again in your agent.
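An “N – 1” eval can be sketched as seeding the agent with a fixed history and evaluating only the final turn. Everything here is hypothetical: `call_agent` stands in for a function taking an OpenAI-style message list, and the success check is just one plausible criterion for the open-stores failure mode.

```python
# Simulated prior turns. These must be kept in sync with the current
# prompts: they describe interactions the live agent may no longer produce.
SIMULATED_HISTORY = [
    {"role": "user", "content": "Hi, do you deliver to my area?"},
    {"role": "assistant", "content": "Yes! We deliver city-wide. Anything else?"},
]

def eval_open_stores(call_agent) -> bool:
    # Skip straight to turn N by prepending the simulated history.
    messages = SIMULATED_HISTORY + [
        {"role": "user", "content": "Which of your stores are open near me?"}
    ]
    reply = call_agent(messages)
    # Success criterion for this failure mode: the agent should ask for
    # (or use) the user's location instead of guessing.
    return "location" in reply.lower() or "address" in reply.lower()

# Usage with a fake agent:
fake_agent = lambda msgs: "Could you share your address so I can check nearby stores?"
print(eval_open_stores(fake_agent))  # True
```

Because only the last turn hits the model, the eval is cheap, fast, and reproduces the problematic state every time.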

It's really difficult and time-intensive to evaluate agent outputs when you are trying to validate complex conversation patterns that you want the LLM to strictly follow. I usually put “checkpoints” inside the prompts: words that I ask the LLM to output verbatim.

This allows me to write evals that simply check for exact strings. If at some point in the conversation the string is not present, I can be fairly sure that the system is not working as expected.
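The checkpoint check itself is just ordered substring matching over the transcript. The marker strings below are made up for illustration; in practice they'd be whatever verbatim tokens your prompts instruct the model to emit at each stage.

```python
# Hypothetical markers the prompt asks the model to output verbatim.
CHECKPOINTS = ["[GREETED]", "[ORDER_CONFIRMED]", "[PAYMENT_DONE]"]

def checkpoints_pass(transcript: str, checkpoints=CHECKPOINTS) -> bool:
    # Require each marker to appear, in order, after the previous one.
    pos = 0
    for marker in checkpoints:
        found = transcript.find(marker, pos)
        if found == -1:
            return False  # marker missing or out of order: flow broke here
        pos = found + len(marker)
    return True

ok = "[GREETED] Hello! ... [ORDER_CONFIRMED] two pizzas ... [PAYMENT_DONE]"
print(checkpoints_pass(ok))                        # True
print(checkpoints_pass("[GREETED] Hello there."))  # False
```

Checking order, not just presence, also catches conversations that skip or reshuffle stages.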

Tools can help you by simplifying the setup/infra and maybe giving you a nice interface, but you still have to look at the data and build the specific evaluations for your use case.

Don’t rely solely on standard evals, build your own.

Comments

  • By localbuilder 2025-09-04 2:13 | 1 reply

    > There’s one issue with this, you’ll have to be careful to keep the “N – 1” interactions updated whenever you make some changes because you will be “simulating” something that will never happen again in your agent.

    This is the biggest problem I've encountered with evals for agents so far. Especially with agents that might do multiple turns of user input > perform task > more user input > perform another task > etc.

    Creating evals for these flows has been difficult because I've found mocking the conversation to a certain point runs into the drift problem you highlighted as the system changes. I've also explored using an LLM to create dynamic responses to points that require additional user input in E2E flows, which adds its own levels of complexity and indeterministic behavior. Both approaches are time consuming and difficult to setup in their own ways.

    • By mfalcon 2025-09-04 3:43

      Yes, and these problems are more present in the first iterations, when you are still trying to get a good enough agent behaviour.

      I'm still thinking about good ways to mitigate this issue, will share.

  • By CuriouslyC 2025-09-04 4:03

    Feed your failure traces into gemini to get a distillate then use DSPy to optimize the tools/prompts that are failing.

  • By mailswept_dev 2025-09-04 3:15

    Totally agree with this — especially the part about end-to-end evals. I’ve seen too many teams rely only on manual testing and miss obvious regressions. Checkpoints + lightweight e2e evals feel like the sweet spot before things get too costly.

HackerNews