# Run Evaluation Suite

Execute all 10 evaluations in parallel and report scores.
You MUST use a todo list to complete these steps in order. Never move on to a step until you have completed the previous one. If you have multiple CONSECUTIVE read steps, perform those reads all at once (in parallel). Otherwise, do not read a file until you reach the step that calls for it.
Add all steps to your todo list now and begin executing.
## Steps
1. Create the evaluation workspace directory; each evaluation's artifacts will be written to `session/eval/[eval_id]/[artifact_name].md`.
2. Spawn 10 subagents in parallel using @tool/task, one for each evaluation.
Each subagent should run its specific evaluation task and write its results
to `session/eval/[eval_id].json` (a sketch of one such subagent prompt follows this step).
The evaluations to run:
1. `skills/sauna/[skill_id]/references/recipes/sauna.eval.multistep.md` - Multi-step instruction following
2. `skills/sauna/[skill_id]/references/recipes/sauna.eval.fileread.md` - File read timing
3. `skills/sauna/[skill_id]/references/recipes/sauna.eval.needle.md` - Context needle finding
4. `skills/sauna/[skill_id]/references/recipes/sauna.eval.handoff.md` - Session file handoff
5. `skills/sauna/[skill_id]/references/recipes/sauna.eval.requirements.md` - Requirements compliance
6. `skills/sauna/[skill_id]/references/recipes/sauna.eval.external.md` - External action pattern
7. `skills/sauna/[skill_id]/references/recipes/sauna.eval.consistency.md` - Judgment consistency
8. `skills/sauna/[skill_id]/references/recipes/sauna.eval.errorhandle.md` - Error handling
9. `skills/sauna/[skill_id]/references/recipes/sauna.eval.toolselect.md` - Tool selection
10. `skills/sauna/[skill_id]/references/recipes/sauna.eval.resilience.md` - Context resilience
Spawn ALL 10 in a single @tool/task batch for parallel execution.
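A minimal sketch of one such subagent prompt, assuming the recipe file carries the detailed instructions (the wording and the requested result fields below are illustrative, not prescribed; `1_multistep` stands in for that evaluation's `[eval_id]`):
```text
Run the evaluation defined in
skills/sauna/[skill_id]/references/recipes/sauna.eval.multistep.md.
Follow that recipe exactly. When finished, write your result as JSON to
session/eval/1_multistep.json, recording whether the task completed,
what behavior was expected, what was observed, and any errors.
```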
3. Wait for all subagents to complete, then read all result files
`session/eval/[eval_id].json` (i.e., `1_multistep.json` through `10_resilience.json`).
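Each recipe determines what its result file actually contains; as a rough sketch of the shape to expect when judging (the field names here are an assumption, not a fixed schema):
```json
{
  "eval_id": "1_multistep",
  "completed": true,
  "expected_behavior": "All instructions executed in the stated order",
  "observed_behavior": "All instructions executed in the stated order",
  "errors": []
}
```
Whatever the real fields are, the judgment in step 5 only needs answers to the three questions listed there.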
4. [Read Evaluation Criteria]: Read the documentation in `skills/sauna/[skill_id]/references/sauna.evaluation.criteria.md`.
5. For each evaluation result, judge whether it PASSED or FAILED based on
the criteria in the evaluation criteria slice read in step 4.
Consider:
- Did the subagent complete the task?
- Did the outcome match the expected behavior?
- Were there any errors or unexpected behaviors?
Be strict but fair. The goal is to identify genuine capability gaps.
6. Write the final evaluation report to `session/eval/report.md` with:
## Sauna Evaluation Results
| # | Evaluation | Result | Notes |
|---|------------|--------|-------|
| 1 | Multi-step Following | PASS/FAIL | Brief explanation |
| 2 | File Read Timing | PASS/FAIL | Brief explanation |
... (all 10 evaluations)
**Score: X/10 (XX%)**
### Details
(Brief analysis of any failures and patterns observed)
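For example, 8 passes out of 10 would be recorded as **Score: 8/10 (80%)**.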
7. Present the evaluation results to the user with:
- The overall score (X/10)
- A summary of any failures
- Link to the full report at `session/eval/report.md`