fix(cloud-agent): harden workspace bootstrap #3937

Merged: eshurakov merged 3 commits into main from persistent-rumba on Jun 12, 2026

eshurakov (Contributor) commented on Jun 10, 2026

Summary

Why

Large repositories can remain healthy while cloning or checking out for longer than the wrapper's previous two-minute wall-clock limit. When those operations were interrupted, the remaining .git directory could also make a later attempt reuse an incomplete workspace, turning one timeout into repeated setup failures.

What was done

  • Replace fixed long-command timeouts with a two-minute output-inactivity watchdog and a five-minute hard limit for Git and setup commands.
  • Bound the complete workspace phase to eight minutes and propagate cancellation through Git, setup, snapshot restore, and patch application, preserving headroom inside the existing ten-minute wrapper readiness budget.
  • Mark bootstrap attempts as pending and write a Git completion marker only after branch preparation, session restore, and setup commands finish; incomplete new workspaces are removed and cloned again.
  • Report sanitized Git percentages and generic setup activity without exposing credentials or setup output, then advance the status to Starting Kilo... while the runtime starts.
  • Bound captured subprocess output and failed-workspace cleanup so noisy or interrupted commands cannot grow memory or cleanup time without limit.
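
The first item above combines two clocks per command: one that resets on every output chunk and one that never resets. A minimal sketch of that decision logic follows, assuming the caller feeds it timestamps. The names `CommandWatchdog`, `noteOutput`, and `check` are hypothetical and are not taken from the wrapper's source; only the two-minute and five-minute values come from this PR.

```typescript
// Illustrative only: hypothetical names, not the wrapper's actual code.
type WatchdogLimits = {
  inactivityMs: number; // longest allowed silence between output chunks
  hardLimitMs: number; // longest allowed total runtime, active or not
};

type WatchdogVerdict = 'running' | 'inactivity_timeout' | 'hard_timeout';

const DEFAULT_LIMITS: WatchdogLimits = {
  inactivityMs: 2 * 60 * 1000,
  hardLimitMs: 5 * 60 * 1000,
};

class CommandWatchdog {
  private readonly startedAt: number;
  private readonly limits: WatchdogLimits;
  private lastActivityAt: number;

  constructor(startedAt: number, limits: WatchdogLimits = DEFAULT_LIMITS) {
    this.startedAt = startedAt;
    this.limits = limits;
    this.lastActivityAt = startedAt;
  }

  // Call whenever the child process writes to stdout or stderr.
  noteOutput(now: number): void {
    this.lastActivityAt = now;
  }

  // Call from a periodic timer; any verdict other than 'running'
  // means the caller should terminate the child process.
  check(now: number): WatchdogVerdict {
    if (now - this.startedAt >= this.limits.hardLimitMs) return 'hard_timeout';
    if (now - this.lastActivityAt >= this.limits.inactivityMs) return 'inactivity_timeout';
    return 'running';
  }
}

// A clone that printed progress at 1m30s is still alive at 3m20s,
// because only 110 seconds have passed since its last output.
const exampleWatchdog = new CommandWatchdog(0);
exampleWatchdog.noteOutput(90 * 1000);
const exampleVerdict = exampleWatchdog.check(200 * 1000); // 'running'
```

Passing timestamps in explicitly, rather than reading the clock inside the class, keeps the policy testable without real timers; a real implementation would presumably wire `noteOutput` to the child's output streams.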

High-level architecture

sequenceDiagram
  participant Orchestrator
  participant Wrapper
  participant Workspace
  participant Kilo
  Orchestrator->>Wrapper: POST /session/ready (10-minute outer budget)
  Wrapper->>Workspace: Prepare repository and session (8-minute shared budget)
  alt Workspace is complete
    Workspace-->>Wrapper: Reuse and refresh credentials
  else Workspace is incomplete or cold
    Wrapper->>Workspace: Mark pending, remove stale state, clone, restore, and run setup
    Wrapper->>Workspace: Write completion marker
  end
  Wrapper->>Kilo: Start runtime
  Wrapper-->>Orchestrator: Session ready
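
The two branches of the diagram can be sketched with an in-memory stand-in for the workspace directory. This is a simplified illustration, not the wrapper's implementation: the `Workspace` type, the `prepareWorkspace` function, and the marker's location under `.git` are assumptions, and the sketch models only a single completion marker.

```typescript
// Hypothetical in-memory model of a workspace directory.
type Workspace = { files: Set<string> };

// Assumed marker location; the real path is not shown in this PR description.
const COMPLETE_MARKER = '.git/kilo-bootstrap-complete';

// A bootstrap step (clone, restore, setup) that throws on failure.
type Step = (ws: Workspace) => void;

function isComplete(ws: Workspace): boolean {
  // The marker, not the mere presence of .git, is the evidence of success.
  return ws.files.has(COMPLETE_MARKER);
}

function prepareWorkspace(ws: Workspace, steps: Step[]): 'reused' | 'bootstrapped' {
  if (isComplete(ws)) return 'reused';

  // Leftovers from an interrupted attempt are untrusted, so start clean.
  ws.files.clear();

  // If any step throws, the function exits here and no marker is written.
  for (const step of steps) step(ws);

  // Written last, only after every step has succeeded.
  ws.files.add(COMPLETE_MARKER);
  return 'bootstrapped';
}

// An interrupted clone left a partial .git behind; it is re-bootstrapped.
const interrupted: Workspace = { files: new Set(['.git/HEAD']) };
const cloneStep: Step = (ws) => { ws.files.add('.git/HEAD'); ws.files.add('README.md'); };
const firstOutcome = prepareWorkspace(interrupted, [cloneStep]); // 'bootstrapped'
const secondOutcome = prepareWorkspace(interrupted, [cloneStep]); // 'reused'
```

Writing the marker as the final action means a crash at any earlier point leaves the workspace looking cold, which is the property that stops one timeout from turning into repeated setup failures.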

Architecture decision

Decision: Keep bootstrap lifecycle policy in the wrapper and combine activity-aware command watchdogs with a shared workspace deadline and explicit completion markers.

Context: A single elapsed timeout could not distinguish a slow, active checkout from a stalled process, while .git existence alone could not distinguish a usable workspace from an interrupted clone.

Rationale: The wrapper owns the subprocesses and persisted workspace state, so it can observe output, cancel child processes, and write completion state at the point where bootstrap actually succeeds. Layered two-minute inactivity, five-minute command, eight-minute workspace, and ten-minute readiness limits keep each boundary finite without colliding with startup cleanup.
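
To make the layering concrete, the sketch below checks that each limit sits strictly inside the next and clamps a command's hard limit to whatever remains of the shared workspace budget. The four durations come from the paragraph above; the names `LIMITS`, `limitsAreNested`, and `commandBudgetMs` are hypothetical, and the clamping rule is an assumption about how a shared deadline would typically be applied.

```typescript
const MINUTE_MS = 60 * 1000;

// Durations from the PR description; field names are illustrative.
const LIMITS = {
  inactivityMs: 2 * MINUTE_MS, // silence allowed per command
  commandMs: 5 * MINUTE_MS, // hard limit per Git or setup command
  workspaceMs: 8 * MINUTE_MS, // shared budget for the whole workspace phase
  readinessMs: 10 * MINUTE_MS, // outer wrapper readiness budget
};

// Each layer must be strictly shorter than the layer that encloses it.
function limitsAreNested(l: typeof LIMITS): boolean {
  return (
    l.inactivityMs < l.commandMs &&
    l.commandMs < l.workspaceMs &&
    l.workspaceMs < l.readinessMs
  );
}

// Time a command starting at `now` may run: its own hard limit,
// clamped to what is left of the shared workspace budget.
function commandBudgetMs(workspaceStartedAt: number, now: number, l = LIMITS): number {
  const remainingWorkspaceMs = workspaceStartedAt + l.workspaceMs - now;
  return Math.max(0, Math.min(l.commandMs, remainingWorkspaceMs));
}

// A setup command that starts four minutes into the workspace phase
// gets four minutes, not five, because the shared budget ends at eight.
const lateCommandBudget = commandBudgetMs(0, 4 * MINUTE_MS); // 240000
```

With these values the workspace phase always ends at least two minutes before the readiness budget does, which is the headroom the description refers to.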

Alternatives considered:

  • Only increase the old Git timeout. This would allow larger repositories but would still wait too long on silent hangs and reuse interrupted workspaces.
  • Continue treating any .git directory as warm. This avoids recloning but preserves the failure mode where partial checkouts are mistaken for complete workspaces.

Consequences: Active long-running operations receive more time and interrupted new workspaces recover deterministically. Silent commands can still time out after two minutes, markerless legacy workspaces use a compatibility heuristic, and cleanup remains best-effort so the original setup failure is preserved.
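
The last consequence, best-effort cleanup that preserves the original failure, follows a common pattern that can be sketched as below. The function name `runWithBestEffortCleanup` is hypothetical, and the sketch is synchronous for brevity; the wrapper's real code presumably works with asynchronous subprocesses.

```typescript
// Runs `setup`. If it throws, attempts `cleanup`, ignores any cleanup
// error, and rethrows the original setup error unchanged.
function runWithBestEffortCleanup<T>(setup: () => T, cleanup: () => void): T {
  try {
    return setup();
  } catch (setupError) {
    try {
      cleanup();
    } catch {
      // Deliberately ignored: a cleanup failure must not mask the root cause.
    }
    throw setupError;
  }
}

// Both callbacks fail here, but the caller only ever sees the setup error.
let reportedMessage = '';
try {
  runWithBestEffortCleanup(
    () => { throw new Error('setup failed'); },
    () => { throw new Error('cleanup failed'); },
  );
} catch (error) {
  reportedMessage = error instanceof Error ? error.message : String(error);
}
```

The trade-off is that a failed cleanup goes unreported to the caller, so a real implementation would likely log it separately.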

Verification

  • Verified locally

Visual Changes

[Image: Screenshot 2026-06-10 at 15 02 36]

Reviewer Notes

  • Review the nested timeout policy and cancellation propagation through repository preparation and snapshot restore.
  • Markerless legacy workspaces are migrated when Kilo auth exists and Git validates the worktree. A narrowly interrupted pre-marker workspace can still satisfy that heuristic; this is accepted because sessions are predominantly new.
  • The separate App Builder Unauthorized: Invalid token clone failure is intentionally not addressed by this PR.

kilo-code-bot (Bot, Contributor) commented on Jun 10, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Executive Summary

All previously identified issues have been resolved: restore-session.ts now uses isTimeoutTermination() (fixed in 9c0f319be), and the same commit simplifies warm-workspace detection to rely solely on the kilo-bootstrap-complete git marker, eliminating the fragile BOOTSTRAP_PENDING_MARKER two-file dance.

Resolved Issues
| File | Line | Status |
| --- | --- | --- |
| services/cloud-agent-next/wrapper/src/restore-session.ts | 601 | Fixed: isTimeoutTermination(importResult) now used instead of terminationReason === 'timeout' |

New commit reviewed

9c0f319be fix(cloud-agent): require completed bootstrap marker

  • restore-session.ts — Fixes the previously flagged WARNING: imports and applies isTimeoutTermination() for the kilo import timeout check, now returning a structured kilo_import_timeout subtype with process diagnostics. No new issues.
  • session-bootstrap.ts — Removes BOOTSTRAP_PENDING_MARKER / bootstrapPendingMarkerPath() entirely. isCompleteGitWorkspace() is simplified to a single exists(gitBootstrapMarkerPath(...)) check. The marker is deleted at the start of a bootstrap and written only after restore + setup commands complete, making it the sole, reliable evidence of a completed workspace. The legacy migration code path (which trusted auth.json + rev-parse) is gone. Logic is clean and correct.
  • session-bootstrap.test.ts — Removes the pending-marker assertion and the now-invalid "migrate legacy warm workspace" test. Replaces it with a "reclone legacy markerless workspaces" test that confirms workspaces with a bare .git + auth.json but no kilo-bootstrap-complete marker are treated as cold and re-bootstrapped. Coverage is appropriate.
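
The review does not show the body of isTimeoutTermination(). As a rough illustration of why a predicate is more robust than comparing against the single string 'timeout', the sketch below assumes a result type with more than one timeout-like reason. Only the `terminationReason` field and the 'timeout' value appear in the review; the other reason values, the `ProcessResult` type, and the stand-in name `isTimeoutLike` are assumptions made for this example.

```typescript
// Assumed result shape; 'inactivity_timeout' and 'aborted' are hypothetical values.
type ProcessResult = {
  exitCode: number | null;
  terminationReason?: 'timeout' | 'inactivity_timeout' | 'aborted';
};

// Stand-in for the real helper: one place that knows every timeout variant.
function isTimeoutLike(result: ProcessResult): boolean {
  return (
    result.terminationReason === 'timeout' ||
    result.terminationReason === 'inactivity_timeout'
  );
}

// A direct string comparison would miss the watchdog-driven variant.
const stalledImport: ProcessResult = { exitCode: null, terminationReason: 'inactivity_timeout' };
const missedByStringCheck = stalledImport.terminationReason === 'timeout'; // false
const caughtByHelper = isTimeoutLike(stalledImport); // true
```

Centralising the check means a new timeout variant only needs to be added in one place instead of at every call site.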
Files Reviewed (12 files)
  • services/cloud-agent-next/src/execution/orchestrator.ts: comment update
  • services/cloud-agent-next/test/unit/execution/orchestrator.test.ts: new 8/10-minute timer test
  • services/cloud-agent-next/test/unit/wrapper/utils.test.ts: isTimeoutTermination coverage
  • services/cloud-agent-next/wrapper/src/main.ts: shutdown abort propagation
  • services/cloud-agent-next/wrapper/src/restore-session.ts: previously flagged issue fixed
  • services/cloud-agent-next/wrapper/src/session-bootstrap.ts: bootstrap markers, deadline, watchdogs, signal propagation, pending-marker removal
  • services/cloud-agent-next/wrapper/src/session-bootstrap.test.ts: updated test coverage
  • services/cloud-agent-next/wrapper/src/utils.ts: isTimeoutTermination() helper

Reviewed by claude-sonnet-4.6 · 423,536 tokens

Review guidance: REVIEW.md from base branch main

Comment thread services/cloud-agent-next/wrapper/src/session-bootstrap.ts Outdated
eshurakov merged commit ef994c1 into main on Jun 12, 2026
16 checks passed
eshurakov deleted the persistent-rumba branch on June 12, 2026 at 10:12