feat(webhook): configurable queue selection for matching runners#5190

Open: guicaulada wants to merge 3 commits into main from feat/webhook-queue-selection-strategy

@guicaulada guicaulada commented Jun 29, 2026

Contributor

Description

A workflow_job whose labels match several runner configurations is always dispatched to the first matching queue (after the exactMatch sort). When multiple pools intentionally share a generic label — e.g. an "any architecture" or "this size or larger" label spanning several runner configs — every cold scale-up funnels to a single queue, overloading one pool while equally-valid pools sit idle. There is currently no way to spread that load.

This adds a configurable queue selection strategy, applied to the equally-best matches (those sharing the top exactMatch priority tier):

  • first (default): unchanged — deterministic first match.
  • random: pick one uniformly, spreading jobs across the matching queues so a single pool's queue does not become a bottleneck.
  • all: dispatch to every matching queue — scaling up one runner per matching pool and letting the first available runner take the job (speed over cost). GitHub assigns the queued job to exactly one runner; the losers are reaped by scale-down.

exactMatch priority is preserved: random/all only ever operate within the highest-priority matching tier, never a lower-priority match. The strategy applies to standard jobs; dynamic (ghr-) label jobs continue to use the first compliant queue.
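
For reviewers, the selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `MatchedQueue` shape, the injectable `rng` parameter, and the assumption that the input is already sorted with exactMatch queues first are all hypothetical simplifications.

```typescript
// Hypothetical types for illustration; the real dispatcher's types may differ.
type QueueSelectionStrategy = 'first' | 'random' | 'all';

interface MatchedQueue {
  id: string;
  exactMatch: boolean;
}

// Assumes `matches` only contains queues whose labels match the job and is
// already sorted so that exactMatch queues come first.
function selectQueues(
  matches: MatchedQueue[],
  strategy: QueueSelectionStrategy,
  rng: () => number = Math.random, // injectable so tests can be deterministic
): MatchedQueue[] {
  const [best] = matches;
  if (!best) return [];

  // The top tier is every match sharing the best match's exactMatch priority.
  const topTier = matches.filter((m) => m.exactMatch === best.exactMatch);

  switch (strategy) {
    case 'all':
      return topTier;
    case 'random': {
      const index = Math.floor(rng() * topTier.length);
      // Fall back to the first match if a custom rng returns an out-of-range value.
      return [topTier[index] ?? best];
    }
    default:
      return [best];
  }
}
```

With two exactMatch queues and one non-exact queue as input, `all` returns only the two exactMatch queues, and `random` never returns the non-exact one.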

Caveats for all (deliberate opt-in)

  • Multiplies instance launches per job: each losing runner sits idle until scale-down reaps it once minimum_running_time_in_minutes has elapsed.
  • Multiplies runner registrations per job, increasing GitHub API usage — relevant where API rate limits are already a concern.
  • Only truly races when enable_job_queued_check = false (otherwise later scale-ups see the job already taken and skip).

Changes

Lambda — new QUEUE_SELECTION_STRATEGY env var (validated; defaults to first), read by both the direct webhook and the EventBridge dispatcher; selectQueues() implements first/random/all within the top-priority matching tier.
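
As a rough sketch of what "validated; defaults to first" could look like at config load: the function name `parseQueueSelectionStrategy` is hypothetical, and treating an empty string the same as an unset variable is an assumption rather than something this PR states.

```typescript
const STRATEGIES = ['first', 'random', 'all'] as const;
type QueueSelectionStrategy = (typeof STRATEGIES)[number];

// Hypothetical helper: `raw` stands in for the value of the
// QUEUE_SELECTION_STRATEGY environment variable.
function parseQueueSelectionStrategy(raw: string | undefined): QueueSelectionStrategy {
  // Assumption: unset or empty falls back to the default.
  if (raw === undefined || raw === '') return 'first';

  const found = STRATEGIES.find((s) => s === raw);
  if (found === undefined) {
    throw new Error(
      `Invalid QUEUE_SELECTION_STRATEGY "${raw}"; expected one of: ${STRATEGIES.join(', ')}`,
    );
  }
  return found;
}
```

Failing at load time rather than at dispatch time means a typo in the Terraform variable or env var surfaces immediately instead of silently falling back to `first`.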

Terraform — a queue_selection_strategy variable (validated first/random/all, default first) on the root and multi-runner modules, threaded through the webhook module config into the direct/eventbridge lambda env var, plus regenerated terraform-docs.

RFC note: Per CONTRIBUTING (discuss major changes first), open questions for maintainers — happy to adjust:

  • global setting (as implemented) vs. per-runner-config option?
  • naming (queue_selection_strategy; values first/random/all)?
  • should all (and random) extend to the dynamic-label path?

Test Plan

  • Added unit tests in dispatch.test.ts: default picks first; random spreads across equally-matching queues (Math.random mocked); random preserves exactMatch priority; all dispatches to every top-tier match but not lower-priority ones; invalid strategy rejected at config load.
  • yarn test (webhook): 40/40 pass. yarn build (ncc typecheck) passes. ESLint: 0 errors. Prettier: clean.
  • terraform fmt -check and terraform validate pass on the root and multi-runner modules; terraform-docs regenerated.

Related Issues

Motivation is load distribution across pools that share generic labels (avoiding single-queue hotspots), and a speed-over-cost option for large-scale environments. No existing upstream issue — happy to open one to track the discussion if preferred.

@guicaulada guicaulada requested a review from a team as a code owner June 29, 2026 16:56
@github-actions

Contributor

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

A workflow_job whose labels match several runner configs is always
dispatched to the first matching queue (after the exactMatch sort). When
multiple pools share a generic label (e.g. an "any architecture" label),
every cold scale-up funnels to a single queue, overloading one pool while
equally-valid pools sit idle.

Add a queue selection strategy applied to the equally-best matches (those
sharing the top exactMatch priority tier):
- `first` (default): unchanged, deterministic first match.
- `random`: pick one uniformly, spreading jobs across the matching queues.
- `all`: dispatch to every matching queue, scaling up one runner per pool
  and letting the first available take the job (speed over cost). This
  multiplies instance launches and runner registrations per job.

exactMatch priority is preserved — random/all never select a lower-priority
match. Configured via a new QUEUE_SELECTION_STRATEGY env var (validated;
defaults to `first`), read by the direct webhook and EventBridge dispatcher.
The strategy applies to standard jobs; dynamic (ghr-) label jobs continue to
use the first compliant queue.
@guicaulada guicaulada force-pushed the feat/webhook-queue-selection-strategy branch from b1a0731 to 95aaac3 Compare June 29, 2026 19:16
Expose the queue_selection_strategy lambda setting as a public Terraform
variable on the root, multi-runner and webhook modules, validated to
first/random/all. Thread it through to both the direct webhook and the
eventbridge dispatcher lambdas via the QUEUE_SELECTION_STRATEGY env var
so the dispatch behaviour added in the previous commit is configurable.
@guicaulada guicaulada requested a review from a team as a code owner June 29, 2026 20:27
@guicaulada guicaulada changed the title feat(webhook): optional random queue selection for matching runners feat(webhook): configurable queue selection for matching runners Jun 29, 2026