# Eval Authoring Guide

## Workspace Setup: Skill Discovery Paths

The `before_all` setup hook must copy skills to all provider discovery paths. Each provider searches a different directory:

| Provider | Discovery path |
|---|---|
| claude-cli | .claude/skills/ |
| allagents | .agents/skills/ |
| pi-cli | .pi/skills/ |
If your setup hook only copies to one path, skill-trigger assertions will fail for other providers.
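
A quick way to catch a missing path before a run is to check the prepared workspace against the table above. The following is a minimal sketch, not part of AgentV: the function name `providersMissingSkills` and its injectable `exists` parameter are hypothetical, and it only checks that each directory exists, not that the skills inside it are valid.

```javascript
import { existsSync } from 'node:fs';
import path from 'node:path';

// Discovery paths copied from the provider table above.
const DISCOVERY_PATHS = {
  'claude-cli': '.claude/skills',
  allagents: '.agents/skills',
  'pi-cli': '.pi/skills',
};

// Returns the providers whose discovery directory is absent from the workspace.
// `exists` defaults to existsSync and can be swapped out (hypothetical hook for testing).
function providersMissingSkills(workspacePath, exists = existsSync) {
  return Object.entries(DISCOVERY_PATHS)
    .filter(([, rel]) => !exists(path.join(workspacePath, rel)))
    .map(([provider]) => provider);
}
```

Calling `providersMissingSkills(payload.workspace_path)` at the end of a setup hook and throwing when the result is non-empty would turn a later skill-trigger failure into an early setup error.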

### Example setup.mjs

```js
import { cp, mkdir } from 'node:fs/promises';
import path from 'node:path';

// Read AgentV payload from stdin
const payload = JSON.parse(await new Promise((resolve) => {
  let data = '';
  process.stdin.on('data', (chunk) => (data += chunk));
  process.stdin.on('end', () => resolve(data));
}));

const workspacePath = payload.workspace_path;
const skillSource = path.resolve('skills');

// Copy skills to all provider discovery paths
const discoveryPaths = [
  '.claude/skills',
  '.agents/skills',
  '.pi/skills',
];

for (const rel of discoveryPaths) {
  const dest = path.join(workspacePath, rel);
  await mkdir(path.dirname(dest), { recursive: true });
  await cp(skillSource, dest, { recursive: true });
}
```

### In your eval YAML

```yaml
workspace:
  template: ./workspace-template
  hooks:
    before_all:
      command:
        - node
        - ../scripts/setup.mjs
```

## Workspace Limitations: No GitHub Remote

Workspace-based evals are sandboxed: there is no GitHub remote, no PRs, and no issue tracker. Tests that ask agents to interact with GitHub will fail.

### What to test instead

Test decision-making discipline, not git infrastructure operations:

- Risk classification (“should this change be shipped?”)
- Scope assessment (“does this PR do too much?”)
- Review judgment (“what issues does this diff have?”)

### How to frame prompts

Don’t write imperative prompts that require a remote:

```yaml
# BAD: requires GitHub remote
- id: merge-check
  input: "Merge PR #42 if it looks safe"
```

Do frame prompts as hypothetical with inline context:

````yaml
# GOOD: self-contained, no remote needed
- id: merge-check
  input: |
    Here is what PR #42 changes:

    ```diff
    - timeout: 30_000
    + timeout: 5_000
    ```

    The PR description says: "Reduce timeout for faster feedback."
    Should this be shipped? What risks do you see?
````

## Workspace State Consistency: Git Diff Verification
Agents verify `git diff` against prompt claims. If your prompt says "The PR modifies `auth.ts`" but the workspace has no such change, the agent will flag the mismatch. This is **correct agent behavior** — don't try to suppress it.
### Rules
1. If a prompt references specific code changes, the workspace **must** contain those exact changes.
2. Or frame prompts as hypothetical: describe changes inline rather than claiming they exist in the workspace.
3. Use `before_each` hooks to set up per-test git state when tests need different diffs.
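
Rule 1 can be checked mechanically before an eval runs. The sketch below assumes you keep a list of the files each prompt claims were changed (a hypothetical convention, not an AgentV feature) and compares it with the output of `git diff --name-only` taken in the workspace. The function name is illustrative.

```javascript
// Returns the claimed files that do not appear in `git diff --name-only` output.
// `claimedFiles` is a hypothetical per-test list maintained by the eval author.
function claimedFilesMissingFromDiff(claimedFiles, diffNameOnlyOutput) {
  const changed = new Set(
    diffNameOnlyOutput
      .split('\n')
      .map((line) => line.trim())
      .filter(Boolean),
  );
  return claimedFiles.filter((file) => !changed.has(file));
}
```

The second argument could come from `execFileSync('git', ['diff', '--name-only'], { cwd: workspacePath, encoding: 'utf8' })`. Paths are compared exactly, so claimed files should be written relative to the repository root, as git prints them. Plain `git diff` reports only unstaged changes, so staged changes would need a separate check.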
### Example: per-test git state
```yaml
workspace:
  template: ./workspace-template
  hooks:
    before_each:
      command:
        - node
        - ../scripts/apply-test-diff.mjs

tests:
  - id: risky-change
    metadata:
      diff_file: diffs/risky-timeout-change.patch
    input: "Review the current changes and assess risk."
```

The `before_each` hook reads `metadata.diff_file` from the AgentV payload and applies the patch to the workspace before each test runs.

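
The guide does not show `apply-test-diff.mjs` itself, so here is one possible sketch of its core logic. It assumes the payload exposes the test's metadata as `payload.metadata` alongside `payload.workspace_path`, and that `diff_file` is relative to the directory the hook runs from. Both are assumptions to verify against your AgentV version, and the function names are hypothetical.

```javascript
import { execFileSync } from 'node:child_process';
import path from 'node:path';

// Builds the `git apply` invocation for one test, or returns null when the
// test declares no diff.
// ASSUMPTION: metadata is available at payload.metadata.
function planPatch(payload, baseDir = process.cwd()) {
  const diffFile = payload?.metadata?.diff_file;
  if (!diffFile) return null;
  return {
    cwd: payload.workspace_path,
    // ASSUMPTION: diff_file is relative to the hook's working directory.
    args: ['apply', path.resolve(baseDir, diffFile)],
  };
}

// Applies the patch inside the workspace. execFileSync throws if git exits
// non-zero, so an unhandled failure makes the hook process exit non-zero.
function applyTestDiff(payload) {
  const plan = planPatch(payload);
  if (plan) execFileSync('git', plan.args, { cwd: plan.cwd, stdio: 'inherit' });
}
```

In the hook script, parse the payload from stdin the same way `setup.mjs` does and pass it to `applyTestDiff`. Keeping `planPatch` free of side effects makes the path handling easy to check without a real workspace.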
### Hypothetical framing pattern

When you don’t want to maintain actual diffs, describe the changes inline:

````yaml
- id: ship-decision
  input: |
    You are reviewing a proposed change. Here is the diff:

    ```diff
    --- a/src/config.ts
    +++ b/src/config.ts
    @@ -10,3 +10,3 @@
    - retries: 3,
    + retries: 0,
    ```

    The author says: "Disable retries to reduce latency."
    Should this be shipped?
````

This avoids workspace state issues entirely — the agent evaluates the diff as presented without checking `git diff`.