Queue Pilot is a Node.js workspace for inspecting queue pressure on the SiFive Slurm/Jenkins EDA
farm. Today it is primarily a read-heavy triage tool: it helps engineers understand why jobs are
pending, where regression flows are logjammed, and which parents or external flows are blocking
progress. Queue-management functions are planned later, but mutating Slurm actions are currently
gated off behind ENABLE_ACTIONS=false.
It ships with:
- a React web dashboard
- a Fastify REST API backed by cached Slurm snapshots in SQLite
- an MCP server so agents can query the same diagnostics surfaces
What it does:
- Summarizes queue pressure by account and partition.
- Splits diagnostics into focused views for logjams, pending, running, and control plane traffic.
- Groups jobs by flow, WCKey, and workdir root so related runs can be traced together.
- Surfaces fan-out srun logjams where running parent flows are waiting on re-queued children.
- Annotates logjams with external queue pressure: higher-priority jobs from other flows ahead in the same scheduling lane and an estimated drain latency.
- Provides watchlist matching for jobs of interest by user, account, WCKey, workdir, name, or job id.
- Estimates ETA-to-start / ETA-to-finish from historical bucket statistics and live queue shape.
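The watchlist matching described above can be pictured with a small sketch. This is not the repo's implementation: the `Job` and `Matcher` shapes, the `matchesJob` name, and the rule that workdir matches by path prefix while other fields match exactly are all assumptions made for illustration.

```typescript
// Hypothetical job shape; field names are illustrative, not the repo's types.
interface Job {
  id: string;
  user: string;
  account: string;
  wckey: string;
  workdir: string;
  name: string;
}

// A matcher sets only the fields it cares about; unset fields match anything.
type Matcher = Partial<Job>;

// Returns true when every field set on the matcher agrees with the job.
// Assumption: workdir matches by prefix, all other fields by exact equality.
function matchesJob(matcher: Matcher, job: Job): boolean {
  const keys = (Object.keys(matcher) as (keyof Job)[]).filter(
    (key) => matcher[key] !== undefined,
  );
  // An empty matcher is treated as matching nothing, not everything.
  if (keys.length === 0) return false;
  return keys.every((key) => {
    const wanted = matcher[key] as string;
    return key === "workdir"
      ? job.workdir.startsWith(wanted)
      : job[key] === wanted;
  });
}
```

A saved matcher such as `{ account: "dv", workdir: "/work/nightly" }` would then select every job from that account whose workdir sits under that root.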
The web app currently exposes these pages:
- Pressure: account and partition hotspot summary from the latest collector snapshot.
- Logjams: grouped D3 graph view of blocked flows, origin parents, active runners, external queue pressure, and blocked reason buckets.
- Control Plane: isolates /root and nullish orchestration flows from normal verification traffic.
- Pending: aggregated graph or list view of waiting jobs, with WCKey grouping, parent blockers, and clickable workdir links.
- Running: aggregated graph or list view of active jobs by flow and WCKey.
- Watchlist: saved matchers with diagnosis and ETA context.
Repository layout:
- packages/shared - REASON taxonomy, ETA math, shared types and helpers
- packages/server - Fastify API, Slurm adapters, collector, diagnostics, watchlist, SQLite
- packages/mcp - MCP server exposing queue diagnostics tools over stdio
- packages/web - Vite + React dashboard
- docs - architecture notes, Slurm query recipes, ETA notes, triage workflow references
- AGENTS.md - authoritative implementation spec and operating manual for this repo
Architecture notes:
- Slurm access is transport-agnostic: cli, restd, or mock.
- The diagnostics endpoints read from the latest cached snapshot when available instead of hitting live Slurm on every page refresh.
- SQLite stores snapshots and historical rollups used by the ETA heuristic.
- Queue actions are not implemented in the shipped code paths today; ENABLE_ACTIONS remains a safety gate for future work.
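The cache-first read described in these notes can be sketched as follows. The `Snapshot` and `SnapshotSource` shapes and the `readSnapshot` name are hypothetical, and the sketch deliberately leaves out any staleness check because the text above does not specify one.

```typescript
// Hypothetical snapshot shape; the real schema is defined by the repo.
interface Snapshot {
  takenAtMs: number;
  jobs: { id: string; state: string }[];
}

// Hypothetical source: one side stands in for the SQLite cache,
// the other for a live read through whichever Slurm adapter is active.
interface SnapshotSource {
  latestCached(): Snapshot | undefined;
  fetchLive(): Snapshot;
}

// Prefer the latest cached snapshot; fall back to a live read only
// when the cache has nothing to offer.
function readSnapshot(source: SnapshotSource): {
  snapshot: Snapshot;
  fromCache: boolean;
} {
  const cached = source.latestCached();
  if (cached !== undefined) {
    return { snapshot: cached, fromCache: true };
  }
  return { snapshot: source.fetchLive(), fromCache: false };
}
```

Keeping the two reads behind one interface means a page refresh costs a local lookup in the common case, and the live path is exercised only on a cold cache.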
Install dependencies and create a local .env:

    npm install
    cp .env.example .env

Common settings:
- SLURM_ADAPTER=mock for offline development
- SLURM_ADAPTER=cli for live reads via local Slurm commands or SSH
- SLURM_SSH_HOST, SLURM_SSH_USER, SLURM_SSH_KEY when reading through a login node
- DB_PATH to control where the SQLite snapshot cache is stored
- ENABLE_ACTIONS=false to keep all Slurm access read-only
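As an illustration of how these settings might be read, here is a minimal sketch. The `parseConfig` function and `QueuePilotConfig` type are hypothetical, and two behaviors are assumptions rather than documented facts: defaulting to the mock adapter when SLURM_ADAPTER is unset, and enabling actions only for the literal string "true".

```typescript
type SlurmAdapter = "cli" | "restd" | "mock";

// Hypothetical config shape built from the variables listed above.
interface QueuePilotConfig {
  adapter: SlurmAdapter;
  sshHost: string | undefined;
  sshUser: string | undefined;
  sshKey: string | undefined;
  dbPath: string | undefined;
  actionsEnabled: boolean;
}

const ADAPTERS: readonly SlurmAdapter[] = ["cli", "restd", "mock"];

function parseConfig(env: Record<string, string | undefined>): QueuePilotConfig {
  // Assumption: fall back to the offline mock adapter when nothing is set.
  const raw = env["SLURM_ADAPTER"] ?? "mock";
  const adapter = ADAPTERS.find((candidate) => candidate === raw);
  if (adapter === undefined) {
    throw new Error(`Unknown SLURM_ADAPTER: ${raw}`);
  }
  return {
    adapter,
    sshHost: env["SLURM_SSH_HOST"],
    sshUser: env["SLURM_SSH_USER"],
    sshKey: env["SLURM_SSH_KEY"],
    dbPath: env["DB_PATH"],
    // Assumption: anything other than the exact string "true" stays read-only.
    actionsEnabled: env["ENABLE_ACTIONS"] === "true",
  };
}
```

In a Node.js process this would typically be called as `parseConfig(process.env)`. Failing closed on ENABLE_ACTIONS keeps a typo in .env from silently unlocking mutating actions.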
For offline development against the mock adapter, in one shell:

    npm run dev:mock

In another shell:

    npm run dev:web

Optional MCP server:

    npm run dev:mcp

For live reads, set the adapter and connection details in .env, then start the server and web app:

    npm run dev
    npm run dev:web

The server defaults to port 8080.
API endpoints:
- GET /api/clusters
- GET /api/pressure
- GET /api/diagnose
- GET /api/jobs/:id
- GET /api/eta/:id
- GET|POST|DELETE /api/watch
- GET /api/watch/:id/status
The diagnose endpoint powers the page-specific views and supports:
- section=summary|logjams|control|pending|running
- view=graph|list for pending/running
- search=... for job ids, WCKeys, blockers, users, accounts, and workdirs
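A client could assemble these query parameters with a small helper like the one below. The `diagnosePath` function and `DiagnoseQuery` type are hypothetical, and rejecting `view` on sections other than pending and running is a client-side choice made for this sketch, not documented server behavior.

```typescript
type Section = "summary" | "logjams" | "control" | "pending" | "running";
type View = "graph" | "list";

// Hypothetical request shape mirroring the three documented parameters.
interface DiagnoseQuery {
  section: Section;
  view?: View;
  search?: string;
}

// Builds the path and query string for GET /api/diagnose.
function diagnosePath(query: DiagnoseQuery): string {
  const parts: string[] = [`section=${query.section}`];
  if (query.view !== undefined) {
    // Assumption: view is only meaningful for the pending and running sections.
    if (query.section !== "pending" && query.section !== "running") {
      throw new Error("view applies only to the pending and running sections");
    }
    parts.push(`view=${query.view}`);
  }
  if (query.search !== undefined && query.search !== "") {
    // Percent-encode free text so spaces and slashes in workdirs survive.
    parts.push(`search=${encodeURIComponent(query.search)}`);
  }
  return `/api/diagnose?${parts.join("&")}`;
}
```

For example, `diagnosePath({ section: "pending", view: "list", search: "wckey nightly" })` yields `/api/diagnose?section=pending&view=list&search=wckey%20nightly`, which can be appended to the server origin (port 8080 by default; the host is whatever the server runs on).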
Run the shipped tests:

    npm test

Build the web app:

    npm run build

Further documentation:
- docs/SLURM-QUERIES.md - exact read-only Slurm commands used by the app
- docs/REGRESSION-SLURM-JOB-TRIAGE-WORKFLOW.md - reference workflow for stalled regression jobs
- docs/ETA-MODEL.md - ETA heuristic notes and caveats
- docs/ARCHITECTURE.md - service and transport overview
Read AGENTS.md first. It is the authoritative spec for the Slurm queries, diagnostics rules, ETA methodology, watchlist behavior, and safety constraints in this repo.

