feat(webhook): configurable queue selection for matching runners #5190
Open
guicaulada wants to merge 3 commits into
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: None
A workflow_job whose labels match several runner configs is always dispatched to the first matching queue (after the exactMatch sort). When multiple pools share a generic label (e.g. an "any architecture" label), every cold scale-up funnels to a single queue, overloading one pool while equally-valid pools sit idle.

Add a queue selection strategy applied to the equally-best matches (those sharing the top exactMatch priority tier):

- `first` (default): unchanged, deterministic first match.
- `random`: pick one uniformly, spreading jobs across the matching queues.
- `all`: dispatch to every matching queue, scaling up one runner per pool and letting the first available take the job (speed over cost). This multiplies instance launches and runner registrations per job.

exactMatch priority is preserved: random/all never select a lower-priority match.

Configured via a new QUEUE_SELECTION_STRATEGY env var (validated; defaults to `first`), read by the direct webhook and EventBridge dispatcher. The strategy applies to standard jobs; dynamic (ghr-) label jobs continue to use the first compliant queue.
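The three strategies can be illustrated with a minimal sketch. It is not the code from this PR: the `MatchedQueue` shape, the `selectQueuesSketch` name, and the injectable `random` parameter are assumptions made for illustration, and the input is assumed to be already sorted so that the top exactMatch priority tier comes first.

```typescript
// Hypothetical types for illustration; the real lambda's shapes may differ.
type QueueSelectionStrategy = "first" | "random" | "all";

interface MatchedQueue {
  id: string; // assumed identifier field for the target queue
  exactMatch: boolean; // the flag used by the priority sort
}

// Assumes `matches` is already sorted by exactMatch priority, so the
// top tier is the run of entries sharing the first entry's exactMatch value.
function selectQueuesSketch(
  matches: MatchedQueue[],
  strategy: QueueSelectionStrategy,
  random: () => number = Math.random, // injectable so tests can be deterministic
): MatchedQueue[] {
  const [head] = matches;
  if (head === undefined) return [];

  // Restrict every strategy to the top tier so that `random` and `all`
  // never pick a lower-priority match.
  const topTier = matches.filter((m) => m.exactMatch === head.exactMatch);

  switch (strategy) {
    case "random": {
      // Uniform pick: random() is in [0, 1), so the index is in [0, length).
      const index = Math.floor(random() * topTier.length);
      return topTier.slice(index, index + 1);
    }
    case "all":
      return topTier;
    default:
      // "first": deterministic first match, the pre-existing behaviour.
      return topTier.slice(0, 1);
  }
}
```

Returning an array for every strategy keeps the caller uniform: it dispatches to each returned queue, which is one queue for `first` and `random` and possibly several for `all`.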
b1a0731 to 95aaac3
Expose the queue_selection_strategy lambda setting as a public Terraform variable on the root, multi-runner and webhook modules, validated to first/random/all. Thread it through to both the direct webhook and the eventbridge dispatcher lambdas via the QUEUE_SELECTION_STRATEGY env var so the dispatch behaviour added in the previous commit is configurable.
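On the lambda side, reading and validating the env var might look roughly like the sketch below. The function name `parseQueueSelectionStrategy`, the `StrategyName` type, and the choice to treat an empty string like an unset variable are assumptions for illustration; only the variable name, the allowed values, and the `first` default come from the commit messages.

```typescript
// Allowed values as stated in the commit message.
const STRATEGIES = ["first", "random", "all"] as const;
type StrategyName = (typeof STRATEGIES)[number];

// Hypothetical helper: takes an env-like record so it can be tested
// without touching the real process environment.
function parseQueueSelectionStrategy(
  env: Record<string, string | undefined>,
): StrategyName {
  const raw = env["QUEUE_SELECTION_STRATEGY"];

  // Assumption: an unset or empty variable falls back to the default.
  if (raw === undefined || raw === "") return "first";

  // Reject anything outside the allowed set at config load time.
  const found = STRATEGIES.find((s) => s === raw);
  if (found === undefined) {
    throw new Error(
      `Invalid QUEUE_SELECTION_STRATEGY "${raw}"; expected one of: ${STRATEGIES.join(", ")}`,
    );
  }
  return found;
}
```

In a Node.js lambda this would typically be called once at config load with `process.env`, so that a bad value fails fast instead of surfacing during dispatch.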
Description
A `workflow_job` whose labels match several runner configurations is always dispatched to the first matching queue (after the `exactMatch` sort). When multiple pools intentionally share a generic label (e.g. an "any architecture" or "this size or larger" label spanning several runner configs), every cold scale-up funnels to a single queue, overloading one pool while equally-valid pools sit idle. There is currently no way to spread that load.

This adds a configurable queue selection strategy, applied to the equally-best matches (those sharing the top `exactMatch` priority tier):

- `first` (default): unchanged, deterministic first match.
- `random`: pick one uniformly, spreading jobs across the matching queues so a single pool's queue does not become a bottleneck.
- `all`: dispatch to every matching queue, scaling up one runner per matching pool and letting the first available runner take the job (speed over cost). GitHub assigns the queued job to exactly one runner; the losers are reaped by scale-down.

`exactMatch` priority is preserved: `random`/`all` only ever operate within the highest-priority matching tier, never a lower-priority match. The strategy applies to standard jobs; dynamic (`ghr-`) label jobs continue to use the first compliant queue.

Caveats for `all` (deliberate opt-in)

- … `minimum_running_time_in_minutes`).
- `enable_job_queued_check = false` (otherwise later scale-ups see the job already taken and skip).

Changes

- Lambda: new `QUEUE_SELECTION_STRATEGY` env var (validated; defaults to `first`), read by both the direct webhook and the EventBridge dispatcher; `selectQueues()` implements `first`/`random`/`all` within the top-priority matching tier.
- Terraform: a `queue_selection_strategy` variable (validated `first`/`random`/`all`, default `first`) on the root and `multi-runner` modules, threaded through the `webhook` module `config` into the `direct`/`eventbridge` lambda env var, plus regenerated terraform-docs.

Test Plan

- `dispatch.test.ts`: default picks first; `random` spreads across equally-matching queues (`Math.random` mocked); `random` preserves `exactMatch` priority; `all` dispatches to every top-tier match but not lower-priority ones; invalid strategy rejected at config load.
- `yarn test` (webhook): 40/40 pass. `yarn build` (ncc typecheck) passes. ESLint: 0 errors. Prettier: clean.
- `terraform fmt -check` and `terraform validate` pass on the root and `multi-runner` modules; terraform-docs regenerated.

Related Issues
Motivation is load distribution across pools that share generic labels (avoiding single-queue hotspots), and a speed-over-cost option for large-scale environments. No existing upstream issue — happy to open one to track the discussion if preferred.