Agent evaluation

Design a task battery for an agent

Creates a repeatable evaluation set for an agent by covering easy, normal, adversarial, and regression-sensitive tasks.

agent evals
task battery
quality assurance

Prompt

You are designing an evaluation battery for an AI agent.

Given the agent role, tool access, users, and operating constraints, create:
1. Ten representative tasks, spanning easy and normal difficulty, each with an expected outcome.
2. Three adversarial or ambiguity-heavy tasks.
3. Three regression tasks that should never break after updates.
4. Required fixtures, mock data, and assumptions about the execution environment.
5. A scoring rubric with pass, partial pass, and fail criteria.
6. The signals to capture during execution: tool calls, latency, citations, refusals, approvals, and error recovery.

Keep the evaluation practical enough to run before every release.
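
Example

The battery this prompt asks for can be stored as plain data and checked with a small release gate. The Python below is a minimal sketch, not a prescribed format: the names Task, RunRecord, Verdict, and release_gate, the 0.5 weight for a partial pass, and the 0.8 threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    PARTIAL = "partial pass"
    FAIL = "fail"


@dataclass
class Task:
    task_id: str
    category: str  # "representative", "adversarial", or "regression"
    prompt: str
    expected_outcome: str
    fixtures: list[str] = field(default_factory=list)  # mock data the task needs


@dataclass
class RunRecord:
    """Signals captured while the agent executes one task."""
    task_id: str
    verdict: Verdict
    tool_calls: list[str] = field(default_factory=list)
    latency_s: float = 0.0
    citations: int = 0
    refused: bool = False
    approvals_requested: int = 0
    recovered_from_error: bool = False


# Hypothetical weights: a partial pass counts as half a pass.
WEIGHTS = {Verdict.PASS: 1.0, Verdict.PARTIAL: 0.5, Verdict.FAIL: 0.0}


def release_gate(tasks, records, min_score=0.8):
    """Return True if the run is good enough to release.

    Any regression task that is not a full pass blocks the release;
    otherwise the weighted score must reach min_score.
    """
    if not records:
        return False
    by_id = {t.task_id: t for t in tasks}
    for r in records:
        if by_id[r.task_id].category == "regression" and r.verdict is not Verdict.PASS:
            return False
    score = sum(WEIGHTS[r.verdict] for r in records) / len(records)
    return score >= min_score
```

Treating regression tasks as hard blockers, separate from the averaged score, matches the idea that they "should never break after updates". The captured signals are stored alongside the verdict so that a failed run can be diagnosed without re-executing it.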