Skip to content

fix(supervisor): cancel pending delayed snapshots when the run completes or disconnects#3894

Open
myftija wants to merge 4 commits into
mainfrom
tri-10293-cancel-stale-delayed-snapshots
Open

fix(supervisor): cancel pending delayed snapshots when the run completes or disconnects#3894
myftija wants to merge 4 commits into
mainfrom
tri-10293-cancel-stale-delayed-snapshots

Conversation

@myftija

@myftija myftija commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

The compute suspend flow delays snapshots by snapshotDelayMs (~30s) so short-lived waitpoints skip the snapshot entirely, with the intent that a run continuing before the delay expires cancels the pending snapshot. But the only cancel() call site was the /continue action, which runners only invoke when restoring from an already-taken snapshot — so pending snapshots were never cancelled (zero snapshot.canceled events ever emitted in prod). When a run resumed and completed inside the window, the stale snapshot fired ~30s later anyway, pausing the VM 6–13s mid warm-start long-poll; the frozen guest couldn't fire its abort timer or send a FIN, causing stalls and run-engine driven retries.

Change

  • Cancel the pending snapshot on attempt.complete — after the platform accepts the completion, before the HTTP reply (so it can't reorder with the runner's next /suspend).
  • Cancel on runDisconnected (crash, exit, or run replaced on the socket).
  • Both cancels are guarded by a runnerId match (new TimerWheel.peek()): a stale duplicate runner for a reassigned run must not cancel the fresh runner's pending snapshot. A missing runnerId falls through to an unconditional cancel (the pre-existing /continue behavior is unchanged).

Waitpoint suspensions keep the runner socket connected and the attempt incomplete, so neither hook touches a snapshot that is still wanted.

Known limitation (fail-safe direction): socket.data.runnerId is frozen at the websocket handshake, so after a same-supervisor restore the disconnect-path guard refuses the cancel. The attempt.complete path uses the runner's current header id and is unaffected.

@changeset-bot

changeset-bot Bot commented Jun 10, 2026

Copy link
Copy Markdown

⚠️ No Changeset found

Latest commit: 3b4c9db

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

This PR adds runner-aware snapshot cancellation to prevent stale runner instances from cancelling snapshots after redeployment or reconnection. A new TimerWheel.peek(key) method inspects pending entries without removing them, enabling ComputeSnapshotService.cancel() to check the stored runnerId and refuse cancellation if it differs. The implementation is covered by comprehensive tests for scheduling, cancellation guards, and snapshot replacement semantics. Two workloadServer handlers—attempt.complete (HTTP) and runDisconnected (WebSocket)—now pass the runner identifier when cancelling snapshots.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: canceling pending delayed snapshots when runs complete or disconnect, which is the core fix addressed throughout the changeset.
Description check ✅ Passed The description provides a clear problem statement, explains the root cause, documents the specific changes made, and notes known limitations, but does not include testing steps or changelog section as specified in the template.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch tri-10293-cancel-stale-delayed-snapshots

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot]

This comment was marked as resolved.

@myftija myftija force-pushed the tri-10293-cancel-stale-delayed-snapshots branch from a76626c to 829fec6 Compare June 11, 2026 13:50
…nnects

The compute suspend flow delays snapshots by snapshotDelayMs to avoid
wasted work on short-lived waitpoints, with the intent that a run which
continues before the delay expires cancels the pending snapshot. But the
only cancel() call site is the /continue workload action, which runners
only invoke when restoring from an already-taken snapshot - so a pending
snapshot is never actually cancelled (zero snapshot.canceled events in
prod). When a run resumes and completes within the delay window, the
stale snapshot fires anyway and fcrun pauses the VM for ~6-13s while its
controller is mid warm-start long-poll. The frozen guest can't fire its
abort timer or send a FIN, so firestarter keeps the connection claimable
past the client deadline and dispatches runs into it - each one a ~300s
stall (TRI-10293).

Cancel the pending snapshot when the attempt completes and when the run
socket disconnects. Genuine waitpoint suspensions keep the runner socket
connected and the attempt incomplete, so neither hook cancels a snapshot
that is still wanted. Cancellation is guarded by runnerId so a stale
duplicate runner for a reassigned run can't cancel the new runner's
pending snapshot.
@myftija myftija force-pushed the tri-10293-cancel-stale-delayed-snapshots branch from 829fec6 to 2f5b8db Compare June 11, 2026 14:08
@myftija myftija marked this pull request as ready for review June 11, 2026 14:16

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

Open in Devin Review

myftija added 2 commits June 11, 2026 16:40
A runner calling attempts/complete has finished executing either way -
on a transient completion failure it retries the call, so cancelling
only on success leaves the stale snapshot armed in the meantime. The
runnerId guard already covers the stale-duplicate-runner case whose
completion fails server-side validation.

Addresses CodeRabbit review feedback on the PR.
devin-ai-integration[bot]

This comment was marked as resolved.

The timer wheel can tick during the awaited completeRunAttempt HTTP
round trip, so cancelling after it leaves a window where a due snapshot
dispatches anyway and pauses a VM that has moved on. Cancel first - the
runnerId guard is ordering-independent, and the runner can't schedule a
new suspend until it receives this route's reply, so nothing legitimate
can be cancelled early. Matches the existing /continue ordering.

Addresses Devin review feedback.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants