
# Run Evaluation Suite

Execute all 10 evaluations in parallel and report scores.

You MUST use a todo list to complete these steps in order. Never move on to a step if you haven't completed the previous one. If you have multiple CONSECUTIVE read steps in a row, read them all at once (in parallel). Otherwise, do not read a file until you reach that step.

Add all steps to your todo list now and begin executing.

## Steps

1. Create the evaluation workspace directory so that each evaluation's artifacts can be written to `session/eval/[eval_id]/[artifact_name].md`.
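If it helps to picture the layout, here is a minimal sketch of that directory creation, assuming the `[eval_id]` placeholders resolve to the same numbered names used for the result files in step 3 (an assumption, not part of the task definition):

```python
# Sketch only: pre-create one workspace directory per evaluation.
# The concrete IDs below are assumed from the result filenames in step 3.
from pathlib import Path

EVAL_IDS = [
    "1_multistep", "2_fileread", "3_needle", "4_handoff", "5_requirements",
    "6_external", "7_consistency", "8_errorhandle", "9_toolselect", "10_resilience",
]

for eval_id in EVAL_IDS:
    Path("session/eval", eval_id).mkdir(parents=True, exist_ok=True)
```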


2. Spawn 10 subagents in parallel using @tool/task, one for each evaluation.
Each subagent should run its specific evaluation task and write results
to `session/eval/[eval_id].json` (a hypothetical result sketch follows the list below).

The evaluations to run:
1. `skills/sauna/[skill_id]/references/recipes/sauna.eval.multistep.md` - Multi-step instruction following
2. `skills/sauna/[skill_id]/references/recipes/sauna.eval.fileread.md` - File read timing
3. `skills/sauna/[skill_id]/references/recipes/sauna.eval.needle.md` - Context needle finding
4. `skills/sauna/[skill_id]/references/recipes/sauna.eval.handoff.md` - Session file handoff
5. `skills/sauna/[skill_id]/references/recipes/sauna.eval.requirements.md` - Requirements compliance
6. `skills/sauna/[skill_id]/references/recipes/sauna.eval.external.md` - External action pattern
7. `skills/sauna/[skill_id]/references/recipes/sauna.eval.consistency.md` - Judgment consistency
8. `skills/sauna/[skill_id]/references/recipes/sauna.eval.errorhandle.md` - Error handling
9. `skills/sauna/[skill_id]/references/recipes/sauna.eval.toolselect.md` - Tool selection
10. `skills/sauna/[skill_id]/references/recipes/sauna.eval.resilience.md` - Context resilience

Spawn ALL 10 in a single @tool/task batch for parallel execution.
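The result schema is not fixed by this task; purely as an illustration (the field names are assumptions), a subagent's result file might be written like this:

```python
# Hypothetical result payload: the real schema comes from each evaluation
# recipe, not from this sketch.
import json
from pathlib import Path

result = {
    "eval": "1_multistep",          # which evaluation produced this file
    "completed": True,              # did the subagent finish the task?
    "observed_behavior": "...",     # what actually happened
    "expected_behavior": "...",     # what the recipe expected
    "errors": [],                   # any errors or unexpected behaviors
}

Path("session/eval").mkdir(parents=True, exist_ok=True)
Path("session/eval/1_multistep.json").write_text(json.dumps(result, indent=2))
```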


3. Wait for all subagents to complete, then read all result files from
`session/eval/` (`1_multistep.json` through `10_resilience.json`).
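As a sketch of that collection step (not the required tooling), the reads could look like:

```python
# Sketch: gather the ten result files once every subagent has finished.
import json
from pathlib import Path

results = {}
for path in sorted(Path("session/eval").glob("*.json")):
    results[path.stem] = json.loads(path.read_text())

# Expect ten entries, 1_multistep through 10_resilience.
assert len(results) == 10, f"expected 10 result files, found {len(results)}"
```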


4. [Read Evaluation Criteria]: Read the documentation in: `skills/sauna/[skill_id]/references/sauna.evaluation.criteria.md`

5. For each evaluation result, judge whether it PASSED or FAILED based on
the criteria in the evaluation criteria slice.

Consider:
- Did the subagent complete the task?
- Did the outcome match the expected behavior?
- Were there any errors or unexpected behaviors?

Be strict but fair. The goal is to identify genuine capability gaps.
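One way to keep the verdicts consistent is to record them mechanically. The sketch below reuses the hypothetical field names from the result sketch in step 2; the criteria document from step 4 takes precedence over anything here:

```python
# Sketch: derive a PASS/FAIL verdict from the assumed result fields.
# The evaluation criteria document, not this code, is the source of truth.
def judge(result: dict) -> tuple[str, str]:
    """Return (verdict, note) for one evaluation result."""
    if not result.get("completed", False):
        return "FAIL", "subagent did not complete the task"
    if result.get("errors"):
        return "FAIL", f"errors observed: {result['errors']}"
    if result.get("observed_behavior") != result.get("expected_behavior"):
        return "FAIL", "outcome did not match the expected behavior"
    return "PASS", "completed with expected behavior and no errors"
```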


6. Write the final evaluation report to `session/eval/report.md` with:

## Sauna Evaluation Results

| # | Evaluation | Result | Notes |
|---|------------|--------|-------|
| 1 | Multi-step Following | PASS/FAIL | Brief explanation |
| 2 | File Read Timing | PASS/FAIL | Brief explanation |
... (all 10 evaluations)

**Score: X/10 (XX%)**

### Details
(Brief analysis of any failures and patterns observed)
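Outside the template itself, the table and score line can be assembled mechanically. A sketch, with evaluation labels shortened from the descriptions in step 2:

```python
# Sketch: build the report markdown from (verdict, note) pairs in order 1..10.
LABELS = [
    "Multi-step Following", "File Read Timing", "Context Needle",
    "Session Handoff", "Requirements Compliance", "External Action Pattern",
    "Judgment Consistency", "Error Handling", "Tool Selection",
    "Context Resilience",
]

def build_report(rows: list[tuple[str, str]]) -> str:
    lines = [
        "## Sauna Evaluation Results",
        "",
        "| # | Evaluation | Result | Notes |",
        "|---|------------|--------|-------|",
    ]
    for i, (verdict, note) in enumerate(rows, start=1):
        lines.append(f"| {i} | {LABELS[i - 1]} | {verdict} | {note} |")
    passed = sum(1 for verdict, _ in rows if verdict == "PASS")
    lines += ["", f"**Score: {passed}/10 ({passed * 10}%)**", "", "### Details"]
    return "\n".join(lines)
```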


7. Present the evaluation results to the user with:
- The overall score (X/10)
- A summary of any failures
- Link to the full report at `session/eval/report.md`