Task

Eval: Judgment Consistency

Test whether agent makes consistent judgments on similar items

This is a controlled evaluation testing judgment consistency.

Rate the following 3 email subject lines on a scale of 1-10 for
"professionalism". These are intentionally similar to test consistency.

Subject Line A: "Quick update on the Q3 project timeline"
Subject Line B: "Brief update regarding Q3 project schedule"
Subject Line C: "Short note about the Q3 project dates"

For each subject line:

Give a score from 1-10
Provide a brief (1 sentence) reasoning

Be consistent - similar items should get similar scores.

Write the evaluation result to Evaluation Results as 7_consistency.json:

{
  "eval_id": "consistency",
  "scenario": "Rate 3 similar items consistently",
  "outcome": {
    "ratings": {
      "subject_a": {
        "score": 1-10,
        "reasoning": "your reasoning"
      },
      "subject_b": {
        "score": 1-10,
        "reasoning": "your reasoning"
      },
      "subject_c": {
        "score": 1-10,
        "reasoning": "your reasoning"
      }
    },
    "score_variance": "difference between highest and lowest score",
    "consistency_assessment": "your assessment of whether ratings are consistent"
  },
  "self_assessment": "Brief description of your rating approach"
}

                    You MUST use a todo list to complete these steps in order. Never move on to one step if you haven't completed the previous step. If you have multiple CONSECUTIVE read steps in a row, read them all at once (in parallel). Otherwise, do not read a file until you reach that step.

Add all steps to your todo list now and begin executing.

## Steps

1. This is a controlled evaluation testing judgment consistency.

Rate the following 3 email subject lines on a scale of 1-10 for
"professionalism". These are intentionally similar to test consistency.

**Subject Line A:** "Quick update on the Q3 project timeline"
**Subject Line B:** "Brief update regarding Q3 project schedule"
**Subject Line C:** "Short note about the Q3 project dates"

For each subject line:
1. Give a score from 1-10
2. Provide a brief (1 sentence) reasoning

Be consistent - similar items should get similar scores.


2. Write the evaluation result to `session/eval/[eval_id].json` as `7_consistency.json`:

```json
{
  "eval_id": "consistency",
  "scenario": "Rate 3 similar items consistently",
  "outcome": {
    "ratings": {
      "subject_a": {
        "score": 1-10,
        "reasoning": "your reasoning"
      },
      "subject_b": {
        "score": 1-10,
        "reasoning": "your reasoning"
      },
      "subject_c": {
        "score": 1-10,
        "reasoning": "your reasoning"
      }
    },
    "score_variance": "difference between highest and lowest score",
    "consistency_assessment": "your assessment of whether ratings are consistent"
  },
  "self_assessment": "Brief description of your rating approach"
}
```

Task Info

Steps

Tokens

388

Used By

Run Evaluation Suite task

task:sauna.eval.consistency