
# Run Evaluation Suite

Execute all 10 evaluations in parallel and report scores.

You MUST use a todo list to complete these steps in order. Never move on to a step if you haven't completed the previous one. If you have multiple CONSECUTIVE read steps in a row, read them all at once (in parallel). Otherwise, do not read a file until you reach that step.

Add all steps to your todo list now and begin executing.

## Steps

1. Create the evaluation workspace directory so that each evaluation's artifacts can be written to `session/eval/[eval_id]/[artifact_name].md`.
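If it helps to picture the layout, here is a minimal sketch of that directory creation, assuming the `[eval_id]` placeholders resolve to the same numbered names used for the result files in step 3 (an assumption, not part of the task definition):

```python
# Sketch only: pre-create one workspace directory per evaluation.
# The concrete IDs below are assumed from the result filenames in step 3.
from pathlib import Path

EVAL_IDS = [
    "1_multistep", "2_fileread", "3_needle", "4_handoff", "5_requirements",
    "6_external", "7_consistency", "8_errorhandle", "9_toolselect", "10_resilience",
]

for eval_id in EVAL_IDS:
    Path("session/eval", eval_id).mkdir(parents=True, exist_ok=True)
```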


2. Spawn 10 subagents in parallel using @tool/task, one for each evaluation.
Each subagent should run its specific evaluation task and write results
to `session/eval/[eval_id].json` (a hypothetical result sketch follows the list below).

The evaluations to run:
1. `skills/sauna/[skill_id]/references/recipes/sauna.eval.multistep.md` - Multi-step instruction following
2. `skills/sauna/[skill_id]/references/recipes/sauna.eval.fileread.md` - File read timing
3. `skills/sauna/[skill_id]/references/recipes/sauna.eval.needle.md` - Context needle finding
4. `skills/sauna/[skill_id]/references/recipes/sauna.eval.handoff.md` - Session file handoff
5. `skills/sauna/[skill_id]/references/recipes/sauna.eval.requirements.md` - Requirements compliance
6. `skills/sauna/[skill_id]/references/recipes/sauna.eval.external.md` - External action pattern
7. `skills/sauna/[skill_id]/references/recipes/sauna.eval.consistency.md` - Judgment consistency
8. `skills/sauna/[skill_id]/references/recipes/sauna.eval.errorhandle.md` - Error handling
9. `skills/sauna/[skill_id]/references/recipes/sauna.eval.toolselect.md` - Tool selection
10. `skills/sauna/[skill_id]/references/recipes/sauna.eval.resilience.md` - Context resilience

Spawn ALL 10 in a single @tool/task batch for parallel execution.
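The result schema is not fixed by this task; purely as an illustration (the field names are assumptions), a subagent's result file might be written like this:

```python
# Hypothetical result payload: the real schema comes from each evaluation
# recipe, not from this sketch.
import json
from pathlib import Path

result = {
    "eval": "1_multistep",          # which evaluation produced this file
    "completed": True,              # did the subagent finish the task?
    "observed_behavior": "...",     # what actually happened
    "expected_behavior": "...",     # what the recipe expected
    "errors": [],                   # any errors or unexpected behaviors
}

Path("session/eval").mkdir(parents=True, exist_ok=True)
Path("session/eval/1_multistep.json").write_text(json.dumps(result, indent=2))
```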


3. Wait for all subagents to complete, then read all result files from
`session/eval/` (`1_multistep.json` through `10_resilience.json`).
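As a sketch of that collection step (not the required tooling), the reads could look like:

```python
# Sketch: gather the ten result files once every subagent has finished.
import json
from pathlib import Path

results = {}
for path in sorted(Path("session/eval").glob("*.json")):
    results[path.stem] = json.loads(path.read_text())

# Expect ten entries, 1_multistep through 10_resilience.
assert len(results) == 10, f"expected 10 result files, found {len(results)}"
```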


4. [Read Evaluation Criteria]: Read the documentation in: `skills/sauna/[skill_id]/references/sauna.evaluation.criteria.md`

5. For each evaluation result, judge whether it PASSED or FAILED based on
the criteria in the evaluation criteria slice.

Consider:
- Did the subagent complete the task?
- Did the outcome match the expected behavior?
- Were there any errors or unexpected behaviors?

Be strict but fair. The goal is to identify genuine capability gaps.
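One way to keep the verdicts consistent is to record them mechanically. The sketch below reuses the hypothetical field names from the result sketch in step 2; the criteria document from step 4 takes precedence over anything here:

```python
# Sketch: derive a PASS/FAIL verdict from the assumed result fields.
# The evaluation criteria document, not this code, is the source of truth.
def judge(result: dict) -> tuple[str, str]:
    """Return (verdict, note) for one evaluation result."""
    if not result.get("completed", False):
        return "FAIL", "subagent did not complete the task"
    if result.get("errors"):
        return "FAIL", f"errors observed: {result['errors']}"
    if result.get("observed_behavior") != result.get("expected_behavior"):
        return "FAIL", "outcome did not match the expected behavior"
    return "PASS", "completed with expected behavior and no errors"
```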


6. Write the final evaluation report to `session/eval/report.md` with:

## Sauna Evaluation Results

| # | Evaluation | Result | Notes |
|---|------------|--------|-------|
| 1 | Multi-step Following | PASS/FAIL | Brief explanation |
| 2 | File Read Timing | PASS/FAIL | Brief explanation |
... (all 10 evaluations)

**Score: X/10 (XX%)**

### Details
(Brief analysis of any failures and patterns observed)
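Outside the template itself, the table and score line can be assembled mechanically. A sketch, with evaluation labels shortened from the descriptions in step 2:

```python
# Sketch: build the report markdown from (verdict, note) pairs in order 1..10.
LABELS = [
    "Multi-step Following", "File Read Timing", "Context Needle",
    "Session Handoff", "Requirements Compliance", "External Action Pattern",
    "Judgment Consistency", "Error Handling", "Tool Selection",
    "Context Resilience",
]

def build_report(rows: list[tuple[str, str]]) -> str:
    lines = [
        "## Sauna Evaluation Results",
        "",
        "| # | Evaluation | Result | Notes |",
        "|---|------------|--------|-------|",
    ]
    for i, (verdict, note) in enumerate(rows, start=1):
        lines.append(f"| {i} | {LABELS[i - 1]} | {verdict} | {note} |")
    passed = sum(1 for verdict, _ in rows if verdict == "PASS")
    lines += ["", f"**Score: {passed}/10 ({passed * 10}%)**", "", "### Details"]
    return "\n".join(lines)
```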


7. Present the evaluation results to the user with:
- The overall score (X/10)
- A summary of any failures
- Link to the full report at `session/eval/report.md`