Skip to content

[CI][DO-NOT-MERGE] Modularize the per-file test runner into planner/executor/aggregator#6160

Draft
hujc7 wants to merge 25 commits into
isaac-sim:developfrom
hujc7:jichuanh/test-runner-modular
Draft

[CI][DO-NOT-MERGE] Modularize the per-file test runner into planner/executor/aggregator#6160
hujc7 wants to merge 25 commits into
isaac-sim:developfrom
hujc7:jichuanh/test-runner-modular

Conversation

@hujc7

@hujc7 hujc7 commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

DO NOT MERGE — stacked on #5823. This branch is based on #5823's head, so the diff includes #5823's changes. The refactor is the 3 commits after the merge (Add pure device-plan planner, Drive the test runner from planner + executor, Extract _build_unit_cmd and guard the executor's device selection). If approved it folds into #5823 rather than merging standalone. Opened as an iteration PR for review + CI.

Summary

  • Splits the per-file CI test runner (tools/conftest.py) into three single-purpose modules and unifies device selection onto test_devices() + the device mask, removing the parallel -k (device_split) and mgpu_shard_select mechanisms.
  • tools/conftest.py shrinks ~1240 → 499 lines; the runner becomes a thin plan→execute driver.
  • Adds 25 deterministic unit tests for the new modules — runner logic that previously had no unit coverage.
  • Behavior-preserving: the same tests run on the same devices on both the single-GPU and multi-GPU lanes.

1. Modules

  1. tools/_device_plan.py — pure planner: turns (files, runtime mask, marker) into (file, mask) units. A device_isolated file splits into one single-device unit per device only when the runtime spans more than one device.
  2. tools/_device_exec.py — executor: per-unit subprocess lifecycle (timeouts, startup-hang/crash retries, JUnit parsing) plus run_unit. Sets the unit's mask as ISAACLAB_TEST_DEVICES.
  3. tools/_agnostic_select.py — a small plugin that drops device-agnostic tests, injected by the executor only for a cpu-less mask (mgpu shards and the cuda unit of a split). Replaces mgpu_shard_select's agnostic-drop; its variant filtering is now done by test_devices() + mask.

2. Removed / renamed

  1. Removed tools/_device_split.py (DEVICE_SPLIT_PASSES, -k) and .github/actions/multi-gpu/mgpu_shard_select.py.
  2. Renamed the device_split marker to device_isolated across the ovphysx tests and the pytest marker registration.

3. Behavior preservation

  1. The single-GPU lane sets ISAACLAB_TEST_DEVICES explicitly (equivalent to the prior unset default of cpu + cuda:0); the Kit GPU pin stays mgpu-only via the shell, unchanged.
  2. -k is gone; device-variant selection is the mask via test_devices(). The agnostic-drop plugin reproduces the prior cpu-pass-keeps / cuda-and-shard-drops behavior.

4. Validation

  1. 25 unit tests green; ruff clean; conftest import smoke.
  2. A local end-to-end run confirmed run_unit drives a real pytest subprocess, parses its report, and the agnostic plugin deselects paramless tests on a cpu-less mask.
  3. Real cuda:N shard device-variant selection runs in CI here.

hujc7 added 22 commits June 5, 2026 08:55
Most test callers pass both ``sim_cfg=`` and ``device=`` to
:func:`isaaclab.sim.build_simulation_context`, implicitly expecting the
``device`` kwarg to win. The helper previously dropped the kwarg silently
when ``sim_cfg`` was provided, causing warp kernel-launch device
mismatches on non-default GPUs: the test fixture allocated ``env_ids``
on the requested device while the articulation's ``self.device``
resolved from the untouched ``sim_cfg`` default (``cuda:0``), and
``wp.launch(..., device=self.device)`` failed with::

    RuntimeError: Error launching kernel 'set_root_link_pose_to_sim_index',
    trying to launch on device='cuda:0',
    but input array for argument 'env_ids' is on device=cuda:2.

Change ``device``'s default to ``None`` (sentinel) and apply it as an
override after sim_cfg construction in both branches. The one test that
asserted the old "sim_cfg overrides everything" contract is updated to
cover the new override semantics.
Add an ISAACLAB_PIN_KIT_GPU env var to AppLauncher. When truthy, it
appends Kit command-line overrides that pin the renderer to a single
GPU (renderer.multiGpu.enabled=False, autoEnable=False, maxGpuCount=1)
and disable the fabric GPU-interop path (physics.fabricUseGPUInterop=
false), so each Kit process touches only its assigned GPU instead of
enumerating every visible GPU at startup.

Used by the multi-GPU CI workflow to avoid a shared GPU-interop context
across concurrent sibling shards, which otherwise surfaces as
"Stage X already attached" errors and SimulationApp.close hangs (see
isaac-sim#3475). Off by default;
single-GPU and user-facing rendering paths are unchanged.
Adds the scope-intersect-runtime device selection helper (test_devices) and its unit test, so unit tests can declare the devices they are valid on and the multi-GPU lane can narrow them via ISAACLAB_TEST_DEVICES.
Pins torch and Warp to the target device before allocations and scopes the CUDA graph capture to it, so Newton runs correctly on cuda:1+ (issue isaac-sim#5132).
…test

tools/conftest.py gains a directory-rename work queue (claim/inflight/done) for work-stealing across shards and a per-file report slug to avoid JUnit collisions between same-basename files. A workspace-root conftest skips non-device-parametrized tests on non-default cuda shards, since single-GPU CI already covers them on cuda:0.
…ions

Adds an extra-env-vars input so the multi-GPU workflow can inject ISAACLAB_TEST_DEVICES / ISAACLAB_SIM_DEVICE into the test container.
Adds a workflow that runs the unit-test suite across non-default cuda devices in one container with N parallel pytest shards pulling from a shared work queue, plus the inside-container shard runner it mounts.
Switches device-parametrized unit tests to test_devices() so they also run on the non-default GPUs in the multi-GPU lane. Mechanical scope change only; no test logic changes.
…cher

When the caller does not pass an explicit device, AppLauncher reads ISAACLAB_SIM_DEVICE and uses it as the device. Lets the multi-GPU CI lane boot Kit on a non-default GPU without editing every test's AppLauncher() call site.
Speeds up iteration on the multi-GPU lane: forces run_docker_tests=false in build.yaml and gates docs.yaml / install-ci.yml behind a DO-NOT-MERGE PR-title check. Revert this commit before the PR is merged.
The test job's runs-on carried the `gpu` label, which in the
self-hosted fleet tags single-GPU runners. Requiring both `gpu` and
`multi-gpu` routed the job onto a single-GPU box, so the shard runner
aborted with "Need at least 2 visible devices; found 1"; the
over-constrained label also left the job queued for hours. Drop `gpu`
so the job targets the multi-GPU pool.
Move the host-orchestration bash (symlink, work-queue seed, MIG
detection, docker run, reconciler) and the JUnit-XML aggregation Python
out of the workflow's inline run: blocks into version-controlled,
lint-able scripts under .github/actions/multi-gpu/, alongside the
existing multi_gpu_shard_runner.sh. The workflow drops from 456 to 181
lines and each step becomes a one-line call.

Behavior-preserving: data still flows via step env, $GITHUB_OUTPUT,
$GITHUB_ENV, and $GITHUB_STEP_SUMMARY; the container --name now uses
the runner built-ins $GITHUB_RUN_ID/$GITHUB_RUN_ATTEMPT in place of the
YAML-only ${{ github.run_id }} templating. Also documents why the test
job must not carry the gpu label.
The ECR cache repo is resolved per runner pool (single-GPU runners ->
gitci-docker-cache; multi-GPU runners -> multigpu-docker-cache). Building
on [self-hosted, gpu] pushed the image to gitci-docker-cache, which the
multi-GPU test job cannot see, so it rebuilt from scratch (~27 min) on the
multi-GPU runner inside its pull step. Build on the multi-GPU pool so the
image lands in multigpu-docker-cache and the test job's pull hits.
Move non-default-shard device selection out of a repo-root conftest.py
into a scoped pytest plugin (mgpu_shard_select) that tools/conftest.py
injects per file only on multi-GPU shards. The plugin keys off
ISAACLAB_TEST_DEVICES so keep/drop matches each test's device
parametrization, deselects out-of-scope variants (cpu, cuda:0,
other-index), and maps the all-deselected NO_TESTS_COLLECTED exit to
OK. Confines the behavior to the lane instead of a global root conftest.
Annotate the uncommon bash idioms in the two multi-GPU orchestration
scripts for non-bash readers. Remove the now-dead py-spy/gdb
hang-capture remnants -- the SYS_PTRACE cap-add and the py-spy pip
install -- along with the matching changelog bullet, since the
conftest-side capture they supported was removed.
# Conflicts:
#	source/isaaclab_ovphysx/test/assets/test_articulation.py
#	source/isaaclab_ovphysx/test/assets/test_rigid_object.py
#	source/isaaclab_ovphysx/test/assets/test_rigid_object_collection.py
#	tools/conftest.py
Introduce tools/_device_plan.py: a pure planner that turns a runner's files
plus its concrete runtime device mask into (file, mask) work units, splitting a
device_isolated file into one single-device unit per device only when the
runtime spans more than one device. Includes is_isolated() marker detection by
source regex (no import). Covered by 14 unit tests.

This is the first extraction of the multi-GPU / device-split runner refactor;
nothing is wired into conftest yet, so runner behavior is unchanged.
Replace the device_split (-k) and mgpu_shard_select mechanisms with one model:
the planner emits (file, mask) units and the executor sets the unit's mask as
ISAACLAB_TEST_DEVICES so test_devices() does all device-variant selection. A
device_isolated file (renamed from device_split) splits into one single-device
unit per device only when the runtime spans more than one; an mgpu shard, with a
single-device runtime, runs each file once.

- Add tools/_device_exec.py: the executor (subprocess lifecycle, timeouts,
  retries, JUnit parsing) plus run_unit(); conftest's run_individual_tests is now
  a thin planner+executor driver.
- Add tools/_agnostic_select.py: a dumb plugin that drops device-agnostic tests,
  injected by the executor only for cpu-less masks (shards and the cuda unit of a
  split). Replaces mgpu_shard_select's agnostic-drop; its variant filtering is
  now handled by test_devices()+mask.
- Remove tools/_device_split.py (DEVICE_SPLIT_PASSES, -k) and
  .github/actions/multi-gpu/mgpu_shard_select.py.
- Rename the device_split marker to device_isolated across the ovphysx tests and
  the pytest marker registration.

Behavior is preserved: the same tests run on the same devices on both lanes; the
single-GPU lane still uses no Kit GPU pin and mgpu still pins via the shell.
Pull the per-unit command/env construction out of run_unit into a pure
_build_unit_cmd() and cover it with deterministic tests (no subprocess, GPU, or
Kit): the unit's mask becomes ISAACLAB_TEST_DEVICES, the _agnostic_select plugin
is injected only for a cpu-less mask, -k is never used, and split units get a
suffixed report name while single units keep the aggregator-friendly one. This
is the behavior-equivalence guard for the planner+executor refactor.
@github-actions github-actions Bot added isaac-lab Related to Isaac Lab team infrastructure labels Jun 12, 2026
hujc7 added 3 commits June 12, 2026 01:26
The first refactor assumed test_devices()+mask did all device-variant selection,
so the plugin only dropped device-agnostic tests. But tests with a literal device
list (e.g. parametrize("device", ["cuda:0", "cpu"])) ignore the mask, so their
out-of-scope variant leaked into the wrong split unit and tripped ovphysx's
process-global device lock (RuntimeError: locked to 'gpu', cannot switch to 'cpu').

Restore per-unit device filtering: rename the plugin to _device_select and have it
keep only the tests whose device param is in the unit's mask (plus the agnostic
rule), reading the structured param so it catches literal lists too. The executor
injects it for both halves of a split and for shards. This reproduces the old
-k 'cpu or not cuda' / -k 'cuda' selection exactly.
Begin rebuilding the per-file CI runner as a tools/test_runner/ package, one
module per lifecycle stage, driven by a single RunnerConfig that turns the
scattered constants and env reads into configuration:

- planning.py: RunnerConfig (with from_env), Unit, Planner (device_isolated split).
- process.py: ProcessCapture (subprocess + hang/timeout + diagnostics), JUnitReport.
- execution.py: ExecContext, UnitRunner (build cmd, drive capture, classify, retry),
  merge_statuses.
- selector.py: DeviceSelect, the in-process device-mask plugin as a class with
  documented hooks and a one-line registrar.

Additive only — conftest still uses the current modules; the wiring and removal
of the old _device_* modules follow. 25 unit tests, ruff clean.
conftest.py is now two documented hooks that build a RunnerConfig from the
environment and run the Session lifecycle (collect -> plan -> execute -> report),
exiting with the runner's code. Add session.py (WorkQueue, Collector, Reporter,
Session) completing the framework, and delete the flat _device_plan/_device_exec/
_device_select modules and their tests.

Behavior is preserved: the same tests run on the same devices on both lanes,
verified by a literal-device-param e2e (the cpu unit drops cuda variants, the
cuda unit drops cpu + agnostic) and 32 framework unit tests. ruff clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

infrastructure isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant