End-to-end evaluation harness for eforge. Runs eforge against fixture projects and validates that the output compiles and its tests pass.
- Node.js >= 22.6.0 (for native SQLite support)
- eforge on PATH (or set EFORGE_BIN)
- pnpm: installs harness deps and is also invoked by fixture validate: steps (pnpm install, pnpm type-check, pnpm test)
pnpm install

This repo includes a project-local pi extension in .pi/extensions/eforge-eval/. It is auto-discovered by pi when you open pi from the eval repo; run /reload if pi is already open.
Useful commands:
- /eval: menu for running evals and browsing results
- /eval-run: interactive run wizard with profile/scenario multi-select plus repeat count, quality, dry-run, compare, and env-file options; or pass CLI-like args such as /eval-run --profile claude-sdk-opus --repeat 3 --skip-quality todo-api-errand-health-check
- /eval-status: active/latest run status
- /eval-runs: pick a completed run and render summary.json in pi
- /eval-result: pick a scenario and render its result.json highlights
- /eval-compare: render comparison.json highlights
--profile is required and names one or more profiles from eforge/profiles/. Comma-separated profiles run in parallel per scenario.
./run.sh --profile claude-sdk-opus todo-api-errand-health-check # One scenario, one profile
./run.sh --profile claude-sdk-opus,pi-opus todo-api-errand-health-check # Same scenario, two profiles in parallel
./run.sh --profile claude-sdk-opus todo-api-errand-health-check--claude-sdk-opus # Exact expanded ID
./run.sh --profile claude-sdk-opus --all # Every scenario
./run.sh --profile claude-sdk-opus --all --env-file .env # With extra env vars (e.g. Langfuse creds)
./run.sh --profile claude-sdk-opus --all --repeat 3 # Run each scenario 3 times, aggregate pass rate
./run.sh --profile claude-sdk-opus --all --compare 2026-04-15T12-00-00 # Diff against a prior run
./run.sh --profile claude-sdk-opus --dry-run todo-api-errand-health-check # Set up workspace only, skip eforge
./run.sh --profile claude-sdk-opus,pi-opus --skip-quality --all # Skip LLM-as-judge quality scoring (default: enabled)
./run.sh --cleanup # Remove all results
./open-monitor.sh # Open monitor UI over the shared DB

Scenario filters match on the base scenario ID (prefix-expanded across all selected profiles) or the fully expanded <scenario-id>--<profile> form.
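As a rough illustration of the filter rule above, the following TypeScript sketch expands base scenario IDs across the selected profiles and then applies a filter. The type and function names are hypothetical; the harness's actual matching code is not shown in this README and may differ.

```typescript
// Hypothetical shapes and helpers, for illustration only.
interface ExpandedScenario {
  baseId: string;
  profile: string;
  expandedId: string; // "<scenario-id>--<profile>"
}

// Build one expanded entry per (scenario, profile) combination.
function expandScenarios(baseIds: string[], profiles: string[]): ExpandedScenario[] {
  const out: ExpandedScenario[] = [];
  for (const baseId of baseIds) {
    for (const profile of profiles) {
      out.push({ baseId, profile, expandedId: `${baseId}--${profile}` });
    }
  }
  return out;
}

// A filter selects an entry if it names the base ID (all profiles)
// or the exact expanded ID (one profile).
function selectScenarios(all: ExpandedScenario[], filters: string[]): ExpandedScenario[] {
  return all.filter((s) => filters.some((f) => f === s.baseId || f === s.expandedId));
}
```

Under these assumptions, a base ID selects one entry per profile, while an expanded ID selects exactly one entry.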
Eval runs pin the chosen profile into the workspace at step 1 of eforge's 3-step profile resolution chain by copying the profile file into the workspace's eforge/profiles/ and writing a project-scope eforge/.active-profile marker. This means eval results are not affected by whatever profile a developer has set in ~/.config/eforge/ (user-scope marker or profile files).
| Variable | Default | Description |
|---|---|---|
| EFORGE_BIN | eforge | Path to eforge binary. Use this to test a local build (e.g. EFORGE_BIN=~/projects/eforge/dist/cli.js) |
| EFORGE_MONITOR_DB | (auto-set) | Shared SQLite DB for metrics. Set automatically by the harness. |
| EFORGE_TRACE_TAGS | (auto-set) | Langfuse trace tags. Set automatically per scenario. |
--env-file sources an additional dotenv-style file into the eforge child process (useful for Langfuse credentials or other global secrets). Per-profile secrets belong in the env-file mapping in profile-envs.yaml instead.
profile-envs.yaml accepts a list of env files per profile (sourced in order, later files win on key collision):
profiles:
  my-profile:
    envFiles:
      - env/primary.env
      - env/secondary.env # keys here override primary.env

A single-file shorthand is also accepted: envFile: env/my.env.
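The "later files win" rule can be sketched in TypeScript as below. This is a minimal illustration rather than the harness's code: parseEnv and mergeEnvFiles are hypothetical names, and the parser handles only plain KEY=VALUE lines (no quoting or export prefixes).

```typescript
type EnvMap = Record<string, string>;

// Minimal KEY=VALUE parser; blank lines and "#" comments are skipped.
function parseEnv(text: string): EnvMap {
  const env: EnvMap = {};
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no key present
    env[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return env;
}

// Merge file contents in list order; a key in a later file
// overwrites the same key from an earlier file.
function mergeEnvFiles(contents: string[]): EnvMap {
  let merged: EnvMap = {};
  for (const text of contents) {
    merged = { ...merged, ...parseEnv(text) };
  }
  return merged;
}
```

Here the contents array stands in for the files listed under envFiles, already read from disk in the order they appear.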
Pi-backed profiles authenticate in one of two ways:
- API-key providers (e.g. anthropic, openrouter) read credentials from environment variables. Declare a per-profile env file in profile-envs.yaml if needed; see env/pi.env for the OpenRouter-style template.
- OAuth providers (e.g. openai-codex used by pi-gpt) rely on cached credentials at ~/.pi/agent/auth.json. Run pi login once in your user environment before evaluating.
In profile files, provider/model live under agents.models.<class> (usually max). There is no pi.provider or pi.model key — those are not part of eforge's Pi config schema.
mixed-opus-planner-pi-builder.yaml exercises the agentRuntimes map: planning, review, and evaluation tiers run on claude-sdk + opus-4-7, while the builder role is offloaded to a local mlx-lm Qwen model via Pi. Run a smoke test comparing it with the single-runtime claude-sdk-opus baseline:
./run.sh --profile claude-sdk-opus,mixed-opus-planner-pi-builder todo-api-errand-health-check

Requires the local mlx-lm server to be reachable; no API key needed.
- Each scenario copies a fixture to a temp directory in /tmp/ and initializes a fresh git repo.
- The selected profile is copied into the workspace as eforge/profiles/<name>.yaml, and eforge/.active-profile is written with the profile name, pinning step 1 of eforge's profile precedence.
- Runs eforge run <prd> --auto --verbose --foreground --no-monitor from the workspace.
- Events are recorded to a shared SQLite DB (results/monitor.db) via EFORGE_MONITOR_DB.
- Validation commands run against the workspace (type-check, tests, etc.).
- Results are aggregated into results/<timestamp>/summary.json.
A monitor server starts from the eval repo root, providing a stable web UI for observing runs. Individual eforge runs use --no-monitor (foreground mode, writing directly to the shared DB). When multiple profiles are requested for the same scenario, they execute concurrently; scenarios themselves run sequentially.
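The scheduling described above (profiles concurrent within a scenario, scenarios one after another) could look roughly like the following TypeScript sketch. The runAll function and the RunFn callback are hypothetical stand-ins; the real harness spawns eforge processes, which is omitted here.

```typescript
// Hypothetical callback that runs one scenario under one profile
// and resolves to whether it passed.
type RunFn = (scenarioId: string, profile: string) => Promise<boolean>;

async function runAll(
  scenarioIds: string[],
  profiles: string[],
  runOne: RunFn,
): Promise<Map<string, boolean>> {
  const results = new Map<string, boolean>();
  // Scenarios run sequentially: the loop awaits each one before starting the next.
  for (const scenarioId of scenarioIds) {
    // Profiles for the same scenario start together and are awaited as a group.
    const settled = await Promise.all(
      profiles.map(async (profile) => {
        const passed = await runOne(scenarioId, profile);
        return [`${scenarioId}--${profile}`, passed] as const;
      }),
    );
    for (const [expandedId, passed] of settled) {
      results.set(expandedId, passed);
    }
  }
  return results;
}
```

Note that Promise.all rejects as soon as any run rejects, so a real runner would likely catch errors inside runOne and report them as failures instead.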
Scenarios describe what to build. Edit scenarios.yaml:
scenarios:
  - id: my-scenario
    fixture: my-fixture # Directory under fixtures/
    prd: docs/my-prd.md # PRD path within the fixture
    description: "What this tests"
    validate:
      - pnpm install
      - pnpm type-check
      - pnpm test
    expect: # Optional
      mode: errand
      buildStagesContain: [implement]
      # skip: true # Opt in when the PRD is expected to be already satisfied

Create the fixture under fixtures/my-fixture/ with source code and the PRD file.
Expectation checks are recorded on result.json under expectations.checks. mode and build-stage checks are informational (judgment calls). The skip check is a gating expectation: a mismatch fails the scenario. Scenarios that set expect.mode or declare non-empty validate steps implicitly expect skip: false; the synthesized check is tagged implicit: true on result.json so you can tell it apart from an explicit expect.skip.
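The rule for the implicit skip expectation can be sketched in TypeScript as follows. This is an assumption-laden illustration: the Scenario shape mirrors the scenarios.yaml fields above, while SkipCheck, its expected field, and resolveSkipExpectation are hypothetical names (only the implicit tag is taken from the description of result.json).

```typescript
// Mirrors the relevant scenarios.yaml fields.
interface Scenario {
  validate?: string[];
  expect?: { mode?: string; buildStagesContain?: string[]; skip?: boolean };
}

// Hypothetical shape for the gating skip check.
interface SkipCheck {
  expected: boolean;
  implicit: boolean;
}

function resolveSkipExpectation(scenario: Scenario): SkipCheck | undefined {
  // An explicit expect.skip always takes precedence.
  if (scenario.expect?.skip !== undefined) {
    return { expected: scenario.expect.skip, implicit: false };
  }
  // expect.mode or non-empty validate steps imply that the build should not be skipped.
  const hasValidateSteps = (scenario.validate?.length ?? 0) > 0;
  if (scenario.expect?.mode !== undefined || hasValidateSteps) {
    return { expected: false, implicit: true };
  }
  // Otherwise no skip expectation is synthesized.
  return undefined;
}
```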
Profiles describe how to build — harness, models, optional env file. They are plain eforge profile files living under eforge/profiles/. Drop a new file in that directory:
# eforge/profiles/my-profile.yaml
agentRuntimes:
  default:
    harness: pi # or: claude-sdk
defaultAgentRuntime: default
agents:
  models:
    max:
      provider: openrouter # provider keys are harness-specific
      id: some-model-id

The filename (minus .yaml) becomes the profile name and is used as the <scenario-id>--<profile> suffix on expanded scenario IDs. Profiles of the same base scenario auto-group for side-by-side comparison; no extra field is required.
If the profile needs an env file (for API keys, etc.), add an entry to profile-envs.yaml:
profiles:
  my-profile:
    envFiles:
      - env/my.env

Profiles without an entry in profile-envs.yaml run without a custom env file (OAuth profiles like pi-gpt fall into this bucket because they rely on cached credentials).
Because profile files are native eforge format, you can also copy one from your own ~/.config/eforge/profiles/ into eval/eforge/profiles/ to measure it in the eval harness.
Results are stored in results/<timestamp>/ (gitignored). Only the last 50 runs are kept; older runs are pruned automatically.
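A pruning pass like the one described could be sketched as below. This assumes run directories are named with sortable timestamps in the style of 2026-04-15T12-00-00, so that lexicographic order equals chronological order; runsToPrune is a hypothetical helper, not the harness's actual function.

```typescript
const MAX_RUNS = 50; // retention limit stated above

// Given run directory names, return the ones that should be deleted
// so that only the newest `keep` runs remain.
function runsToPrune(runDirs: string[], keep: number = MAX_RUNS): string[] {
  // Assumption: timestamp-style names sort chronologically as plain strings.
  const oldestFirst = [...runDirs].sort();
  const excess = Math.max(0, oldestFirst.length - keep);
  return oldestFirst.slice(0, excess);
}
```

The function only selects names; actually removing the directories is left to the caller.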
Per run:
- summary.json: aggregate metrics across all scenarios
- analysis.json: observations/warnings produced by lib/analyze.ts
- comparison.json: side-by-side profile comparison (written when a scenario ran with multiple profiles)
Per scenario (<timestamp>/<scenario-id>--<profile>/):
- result.json: metrics, validation results, expectations, and the profile used. By default, also contains a quality.absolute block (per-dimension scores + weighted overall); pass --skip-quality to disable.
- eforge.log: full eforge output
- orchestration.yaml: preserved plan metadata
- validate-*.log: per-validation-command output (one file per validate: step)
- workspace-path.txt: path to the temp workspace that was used (deleted after the run)
- quality/ (omitted with --skip-quality): prd.md and diff.patch snapshots taken before workspace cleanup, used by compare.ts to re-score pairwise without re-running eforge
With --repeat N > 1, each scenario directory additionally contains run-1/, run-2/, … with their own result.json; the top-level result.json becomes an aggregate with passRate and per-run pass flags.
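The aggregation for repeated runs might be computed along these lines. The sketch assumes each per-run result exposes a boolean passed flag; apart from passRate, the field and function names are hypothetical and the real result.json layout may differ.

```typescript
// Hypothetical minimal view of a per-run result.json.
interface RunResult {
  passed: boolean;
}

// Hypothetical aggregate shape: pass rate plus per-run pass flags.
interface AggregateResult {
  passRate: number;
  runs: boolean[];
}

function aggregateRuns(runs: RunResult[]): AggregateResult {
  const flags = runs.map((run) => run.passed);
  const passCount = flags.filter((flag) => flag).length;
  // Avoid dividing by zero when no runs were recorded.
  const passRate = flags.length === 0 ? 0 : passCount / flags.length;
  return { passRate, runs: flags };
}
```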
Quality scoring runs by default on every eval, adding an LLM-as-judge layer on top of the correctness/cost metrics. Pass --skip-quality to disable it (useful when you don't have judge auth available or want to keep a run cheap).
- Absolute (per scenario, inline): graded on a 4-dimension rubric (PRD adherence, code quality, test quality, change discipline) with anchored 1–5 scales. Output lands in result.json.quality.absolute.
- Pairwise (during compare.ts, for each scenario group with ≥2 profiles): judges each profile pair per dimension and emits a winner/tie. A/B order is randomized per pair to mitigate position bias. Output lands in comparison.json.groups[].dimensions.quality.
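To illustrate the pairwise setup, the following TypeScript sketch enumerates every profile pair in a group and randomizes which profile is presented as A and which as B. It is not taken from compare.ts: JudgePair and profilePairs are hypothetical names, and the random source is injectable so the behavior can be checked deterministically.

```typescript
// Hypothetical pair shape: `a` is shown to the judge first, `b` second.
interface JudgePair {
  a: string;
  b: string;
}

function profilePairs(profiles: string[], rng: () => number = Math.random): JudgePair[] {
  const pairs: JudgePair[] = [];
  profiles.forEach((first, i) => {
    profiles.slice(i + 1).forEach((second) => {
      // Flip a coin per pair so neither profile is systematically in position A.
      const swap = rng() < 0.5;
      pairs.push(swap ? { a: second, b: first } : { a: first, b: second });
    });
  });
  return pairs;
}
```

With n profiles this yields n(n-1)/2 pairs, so a group of two profiles produces a single comparison per dimension.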
compare.ts includes the quality dimension whenever any input result.json has populated quality.absolute data — re-running npx tsx lib/compare.ts <existing-results-dir> regenerates pairwise scores from <scenario>/quality/{prd.md,diff.patch} snapshots without re-running eforge. Pass --skip-quality to that invocation to suppress new pairwise scoring (existing absolute data is still surfaced).
Configuration lives in judge.yaml at the eval root:
model: claude-opus-4-7
maxOutputTokens: 2048
weights:
  prdAdherence: 0.4
  codeQuality: 0.25
  testQuality: 0.25
  changeDiscipline: 0.1 # weights must sum to 1.0
maxDiffBytes: 80000 # diffs above this are truncated with a marker

Auth: judge calls go through @anthropic-ai/claude-agent-sdk, which inherits Claude Code's host auth (subscription if logged in) and falls back to ANTHROPIC_API_KEY. If neither is available, scoring logs a non-fatal warning and the eval run continues without a quality block; pass --skip-quality upfront to silence it. The judge runs with allowedTools: [] (no file, shell, or MCP access), so it sees only the prompt + diff text passed in.
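For reference, a weighted overall score from the judge.yaml weights could be computed as in this TypeScript sketch. It assumes the overall score is a plain weighted sum of the four per-dimension scores; weightedOverall is a hypothetical helper and the harness may round or validate differently.

```typescript
const DIMENSIONS = ["prdAdherence", "codeQuality", "testQuality", "changeDiscipline"] as const;
type Dimension = (typeof DIMENSIONS)[number];
type PerDimension = Record<Dimension, number>;

function weightedOverall(scores: PerDimension, weights: PerDimension): number {
  // Enforce the "weights must sum to 1.0" rule, with a small float tolerance.
  const totalWeight = DIMENSIONS.reduce((sum, d) => sum + weights[d], 0);
  if (Math.abs(totalWeight - 1) > 1e-9) {
    throw new Error(`weights must sum to 1.0, got ${totalWeight}`);
  }
  // Weighted sum of the 1-5 dimension scores.
  return DIMENSIONS.reduce((sum, d) => sum + scores[d] * weights[d], 0);
}

// Weights copied from the judge.yaml example above.
const exampleWeights: PerDimension = {
  prdAdherence: 0.4,
  codeQuality: 0.25,
  testQuality: 0.25,
  changeDiscipline: 0.1,
};
```

Because PRD adherence carries the largest weight, a one-point change there moves the overall score four times as much as the same change in change discipline.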