Agent evaluation
Design a task battery for an agent
Creates a repeatable evaluation set for an agent, covering representative, adversarial, and regression-sensitive tasks.
- agent evals
- task battery
- quality assurance
Prompt
You are designing an evaluation battery for an AI agent. Given the agent role, tool access, users, and operating constraints, create:
1. Ten representative tasks with expected outcomes.
2. Three adversarial or ambiguity-heavy tasks.
3. Three regression tasks that should never break after updates.
4. Required fixtures, mock data, and environmental assumptions.
5. A scoring rubric with pass, partial pass, and fail criteria.
6. The signals to capture during execution: tool calls, latency, citations, refusals, approvals, and error recovery.
Keep the evaluation practical enough to run before every release.
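Example (illustrative only)
One way the battery this prompt asks for might be represented in code is sketched below. Everything in it is an assumption made for illustration: the names Task, RunRecord, score, and release_gate are hypothetical, the split into required and optional checks is just one possible reading of "pass, partial pass, and fail", and the release policy (no failures anywhere, full passes on regression tasks) is a guess at what "should never break" could mean in practice.

```python
from dataclasses import dataclass, field
from enum import Enum


class Category(Enum):
    REPRESENTATIVE = "representative"
    ADVERSARIAL = "adversarial"
    REGRESSION = "regression"


class Verdict(Enum):
    PASS = "pass"
    PARTIAL = "partial pass"
    FAIL = "fail"


@dataclass
class Task:
    """One battery entry (hypothetical schema)."""
    task_id: str
    category: Category
    prompt: str
    required_checks: list[str]  # any miss here means fail
    optional_checks: list[str] = field(default_factory=list)  # any miss here means partial pass


@dataclass
class RunRecord:
    """Signals captured from a single execution of a task."""
    task_id: str
    checks_passed: set[str]
    tool_calls: list[str] = field(default_factory=list)
    latency_s: float = 0.0
    citations: int = 0
    refused: bool = False
    approvals_requested: int = 0
    errors_recovered: int = 0


def score(task: Task, run: RunRecord) -> Verdict:
    """Apply the assumed three-level rubric to one run."""
    if not all(c in run.checks_passed for c in task.required_checks):
        return Verdict.FAIL
    if all(c in run.checks_passed for c in task.optional_checks):
        return Verdict.PASS
    return Verdict.PARTIAL


def release_gate(tasks: list[Task], runs: dict[str, RunRecord]) -> bool:
    """Assumed policy: no task may fail, and regression tasks must fully pass."""
    for task in tasks:
        run = runs.get(task.task_id)
        if run is None:
            return False  # a task that was never run blocks the release
        verdict = score(task, run)
        if verdict is Verdict.FAIL:
            return False
        if task.category is Category.REGRESSION and verdict is not Verdict.PASS:
            return False
    return True
```

A harness built along these lines would run each task against the agent, fill in a RunRecord from the execution trace, and call release_gate before shipping. Keeping the captured signals on the record, separate from the verdict, lets latency or tool-call budgets be added later as extra checks without changing the rubric.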