[CI][DO-NOT-MERGE] Modularize the per-file test runner into planner/executor/aggregator by hujc7 · Pull Request #6160 · isaac-sim/IsaacLab

hujc7 · 2026-06-12T00:39:12Z

DO NOT MERGE — stacked on #5823. This branch is based on #5823's head, so the diff includes #5823's changes. The refactor is the 3 commits after the merge (Add pure device-plan planner, Drive the test runner from planner + executor, Extract _build_unit_cmd and guard the executor's device selection). If approved it folds into #5823 rather than merging standalone. Opened as an iteration PR for review + CI.

Summary

Splits the per-file CI test runner (tools/conftest.py) into three single-purpose modules and unifies device selection onto test_devices() + the device mask, removing the parallel -k (device_split) and mgpu_shard_select mechanisms.
tools/conftest.py shrinks ~1240 → 499 lines; the runner becomes a thin plan→execute driver.
Adds 25 deterministic unit tests for the new modules — runner logic that previously had no unit coverage.
Behavior-preserving: the same tests run on the same devices on both the single-GPU and multi-GPU lanes.

1. Modules

tools/_device_plan.py — pure planner: turns (files, runtime mask, marker) into (file, mask) units. A device_isolated file splits into one single-device unit per device only when the runtime spans more than one device.
tools/_device_exec.py — executor: per-unit subprocess lifecycle (timeouts, startup-hang/crash retries, JUnit parsing) plus run_unit. Sets the unit's mask as ISAACLAB_TEST_DEVICES.
tools/_agnostic_select.py — a small plugin that drops device-agnostic tests, injected by the executor only for a cpu-less mask (mgpu shards and the cuda unit of a split). Replaces mgpu_shard_select's agnostic-drop; its variant filtering is now done by test_devices() + mask.

2. Removed / renamed

Removed tools/_device_split.py (DEVICE_SPLIT_PASSES, -k) and .github/actions/multi-gpu/mgpu_shard_select.py.
Renamed the device_split marker to device_isolated across the ovphysx tests and the pytest marker registration.

3. Behavior preservation

The single-GPU lane sets ISAACLAB_TEST_DEVICES explicitly (equivalent to the prior unset default of cpu + cuda:0); the Kit GPU pin stays mgpu-only via the shell, unchanged.
-k is gone; device-variant selection is the mask via test_devices(). The agnostic-drop plugin reproduces the prior cpu-pass-keeps / cuda-and-shard-drops behavior.

4. Validation

25 unit tests green; ruff clean; conftest import smoke.
A local end-to-end run confirmed run_unit drives a real pytest subprocess, parses its report, and the agnostic plugin deselects paramless tests on a cpu-less mask.
Real cuda:N shard device-variant selection runs in CI here.

Most test callers pass both ``sim_cfg=`` and ``device=`` to :func:`isaaclab.sim.build_simulation_context`, implicitly expecting the ``device`` kwarg to win. The helper previously dropped the kwarg silently when ``sim_cfg`` was provided, causing warp kernel-launch device mismatches on non-default GPUs: the test fixture allocated ``env_ids`` on the requested device while the articulation's ``self.device`` resolved from the untouched ``sim_cfg`` default (``cuda:0``), and ``wp.launch(..., device=self.device)`` failed with:: RuntimeError: Error launching kernel 'set_root_link_pose_to_sim_index', trying to launch on device='cuda:0', but input array for argument 'env_ids' is on device=cuda:2. Change ``device``'s default to ``None`` (sentinel) and apply it as an override after sim_cfg construction in both branches. The one test that asserted the old "sim_cfg overrides everything" contract is updated to cover the new override semantics.

Add an ISAACLAB_PIN_KIT_GPU env var to AppLauncher. When truthy, it appends Kit command-line overrides that pin the renderer to a single GPU (renderer.multiGpu.enabled=False, autoEnable=False, maxGpuCount=1) and disable the fabric GPU-interop path (physics.fabricUseGPUInterop= false), so each Kit process touches only its assigned GPU instead of enumerating every visible GPU at startup. Used by the multi-GPU CI workflow to avoid a shared GPU-interop context across concurrent sibling shards, which otherwise surfaces as "Stage X already attached" errors and SimulationApp.close hangs (see isaac-sim#3475). Off by default; single-GPU and user-facing rendering paths are unchanged.

Adds the scope-intersect-runtime device selection helper (test_devices) and its unit test, so unit tests can declare the devices they are valid on and the multi-GPU lane can narrow them via ISAACLAB_TEST_DEVICES.

Pins torch and Warp to the target device before allocations and scopes the CUDA graph capture to it, so Newton runs correctly on cuda:1+ (issue isaac-sim#5132).

…test tools/conftest.py gains a directory-rename work queue (claim/inflight/done) for work-stealing across shards and a per-file report slug to avoid JUnit collisions between same-basename files. A workspace-root conftest skips non-device-parametrized tests on non-default cuda shards, since single-GPU CI already covers them on cuda:0.

…ions Adds an extra-env-vars input so the multi-GPU workflow can inject ISAACLAB_TEST_DEVICES / ISAACLAB_SIM_DEVICE into the test container.

Adds a workflow that runs the unit-test suite across non-default cuda devices in one container with N parallel pytest shards pulling from a shared work queue, plus the inside-container shard runner it mounts.

Switches device-parametrized unit tests to test_devices() so they also run on the non-default GPUs in the multi-GPU lane. Mechanical scope change only; no test logic changes.

…cher When the caller does not pass an explicit device, AppLauncher reads ISAACLAB_SIM_DEVICE and uses it as the device. Lets the multi-GPU CI lane boot Kit on a non-default GPU without editing every test's AppLauncher() call site.

Speeds up iteration on the multi-GPU lane: forces run_docker_tests=false in build.yaml and gates docs.yaml / install-ci.yml behind a DO-NOT-MERGE PR-title check. Revert this commit before the PR is merged.

The test job's runs-on carried the `gpu` label, which in the self-hosted fleet tags single-GPU runners. Requiring both `gpu` and `multi-gpu` routed the job onto a single-GPU box, so the shard runner aborted with "Need at least 2 visible devices; found 1"; the over-constrained label also left the job queued for hours. Drop `gpu` so the job targets the multi-GPU pool.

Move the host-orchestration bash (symlink, work-queue seed, MIG detection, docker run, reconciler) and the JUnit-XML aggregation Python out of the workflow's inline run: blocks into version-controlled, lint-able scripts under .github/actions/multi-gpu/, alongside the existing multi_gpu_shard_runner.sh. The workflow drops from 456 to 181 lines and each step becomes a one-line call. Behavior-preserving: data still flows via step env, $GITHUB_OUTPUT, $GITHUB_ENV, and $GITHUB_STEP_SUMMARY; the container --name now uses the runner built-ins $GITHUB_RUN_ID/$GITHUB_RUN_ATTEMPT in place of the YAML-only ${{ github.run_id }} templating. Also documents why the test job must not carry the gpu label.

The ECR cache repo is resolved per runner pool (single-GPU runners -> gitci-docker-cache; multi-GPU runners -> multigpu-docker-cache). Building on [self-hosted, gpu] pushed the image to gitci-docker-cache, which the multi-GPU test job cannot see, so it rebuilt from scratch (~27 min) on the multi-GPU runner inside its pull step. Build on the multi-GPU pool so the image lands in multigpu-docker-cache and the test job's pull hits.

Move non-default-shard device selection out of a repo-root conftest.py into a scoped pytest plugin (mgpu_shard_select) that tools/conftest.py injects per file only on multi-GPU shards. The plugin keys off ISAACLAB_TEST_DEVICES so keep/drop matches each test's device parametrization, deselects out-of-scope variants (cpu, cuda:0, other-index), and maps the all-deselected NO_TESTS_COLLECTED exit to OK. Confines the behavior to the lane instead of a global root conftest.

Annotate the uncommon bash idioms in the two multi-GPU orchestration scripts for non-bash readers. Remove the now-dead py-spy/gdb hang-capture remnants -- the SYS_PTRACE cap-add and the py-spy pip install -- along with the matching changelog bullet, since the conftest-side capture they supported was removed.

…ests actions" This reverts commit 6cef107.

…m#5823" This reverts commit c514199.

# Conflicts: # source/isaaclab_ovphysx/test/assets/test_articulation.py # source/isaaclab_ovphysx/test/assets/test_rigid_object.py # source/isaaclab_ovphysx/test/assets/test_rigid_object_collection.py # tools/conftest.py

Introduce tools/_device_plan.py: a pure planner that turns a runner's files plus its concrete runtime device mask into (file, mask) work units, splitting a device_isolated file into one single-device unit per device only when the runtime spans more than one device. Includes is_isolated() marker detection by source regex (no import). Covered by 14 unit tests. This is the first extraction of the multi-GPU / device-split runner refactor; nothing is wired into conftest yet, so runner behavior is unchanged.

Replace the device_split (-k) and mgpu_shard_select mechanisms with one model: the planner emits (file, mask) units and the executor sets the unit's mask as ISAACLAB_TEST_DEVICES so test_devices() does all device-variant selection. A device_isolated file (renamed from device_split) splits into one single-device unit per device only when the runtime spans more than one; an mgpu shard, with a single-device runtime, runs each file once. - Add tools/_device_exec.py: the executor (subprocess lifecycle, timeouts, retries, JUnit parsing) plus run_unit(); conftest's run_individual_tests is now a thin planner+executor driver. - Add tools/_agnostic_select.py: a dumb plugin that drops device-agnostic tests, injected by the executor only for cpu-less masks (shards and the cuda unit of a split). Replaces mgpu_shard_select's agnostic-drop; its variant filtering is now handled by test_devices()+mask. - Remove tools/_device_split.py (DEVICE_SPLIT_PASSES, -k) and .github/actions/multi-gpu/mgpu_shard_select.py. - Rename the device_split marker to device_isolated across the ovphysx tests and the pytest marker registration. Behavior is preserved: the same tests run on the same devices on both lanes; the single-GPU lane still uses no Kit GPU pin and mgpu still pins via the shell.

Pull the per-unit command/env construction out of run_unit into a pure _build_unit_cmd() and cover it with deterministic tests (no subprocess, GPU, or Kit): the unit's mask becomes ISAACLAB_TEST_DEVICES, the _agnostic_select plugin is injected only for a cpu-less mask, -k is never used, and split units get a suffixed report name while single units keep the aggregator-friendly one. This is the behavior-equivalence guard for the planner+executor refactor.

The first refactor assumed test_devices()+mask did all device-variant selection, so the plugin only dropped device-agnostic tests. But tests with a literal device list (e.g. parametrize("device", ["cuda:0", "cpu"])) ignore the mask, so their out-of-scope variant leaked into the wrong split unit and tripped ovphysx's process-global device lock (RuntimeError: locked to 'gpu', cannot switch to 'cpu'). Restore per-unit device filtering: rename the plugin to _device_select and have it keep only the tests whose device param is in the unit's mask (plus the agnostic rule), reading the structured param so it catches literal lists too. The executor injects it for both halves of a split and for shards. This reproduces the old -k 'cpu or not cuda' / -k 'cuda' selection exactly.

Begin rebuilding the per-file CI runner as a tools/test_runner/ package, one module per lifecycle stage, driven by a single RunnerConfig that turns the scattered constants and env reads into configuration: - planning.py: RunnerConfig (with from_env), Unit, Planner (device_isolated split). - process.py: ProcessCapture (subprocess + hang/timeout + diagnostics), JUnitReport. - execution.py: ExecContext, UnitRunner (build cmd, drive capture, classify, retry), merge_statuses. - selector.py: DeviceSelect, the in-process device-mask plugin as a class with documented hooks and a one-line registrar. Additive only — conftest still uses the current modules; the wiring and removal of the old _device_* modules follow. 25 unit tests, ruff clean.

conftest.py is now two documented hooks that build a RunnerConfig from the environment and run the Session lifecycle (collect -> plan -> execute -> report), exiting with the runner's code. Add session.py (WorkQueue, Collector, Reporter, Session) completing the framework, and delete the flat _device_plan/_device_exec/ _device_select modules and their tests. Behavior is preserved: the same tests run on the same devices on both lanes, verified by a literal-device-param e2e (the cpu unit drops cuda variants, the cuda unit drops cpu + agnostic) and 32 framework unit tests. ruff clean.

hujc7 added 22 commits June 5, 2026 08:55

[Tests] Add test_devices helper for device-parametrized unit tests

f1c0d18

Adds the scope-intersect-runtime device selection helper (test_devices) and its unit test, so unit tests can declare the devices they are valid on and the multi-GPU lane can narrow them via ISAACLAB_TEST_DEVICES.

[Newton] Fix Newton/Warp init on non-default CUDA devices

878990a

Pins torch and Warp to the target device before allocations and scopes the CUDA graph capture to it, so Newton runs correctly on cuda:1+ (issue isaac-sim#5132).

[CI] Forward extra env vars through run-tests / run-package-tests act…

6cef107

…ions Adds an extra-env-vars input so the multi-GPU workflow can inject ISAACLAB_TEST_DEVICES / ISAACLAB_SIM_DEVICE into the test container.

[CI] Add multi-GPU pytest workflow

71658df

Adds a workflow that runs the unit-test suite across non-default cuda devices in one container with N parallel pytest shards pulling from a shared work queue, plus the inside-container shard runner it mounts.

[Tests] Parametrize unit tests over multi-GPU device scope

cef4569

Switches device-parametrized unit tests to test_devices() so they also run on the non-default GPUs in the multi-GPU lane. Mechanical scope change only; no test logic changes.

TEMP: skip docker/docs/install-ci CI while iterating isaac-sim#5823

c514199

Speeds up iteration on the multi-GPU lane: forces run_docker_tests=false in build.yaml and gates docs.yaml / install-ci.yml behind a DO-NOT-MERGE PR-title check. Revert this commit before the PR is merged.

Revert "[CI] Forward extra env vars through run-tests / run-package-t…

1e710b1

…ests actions" This reverts commit 6cef107.

Revert "TEMP: skip docker/docs/install-ci CI while iterating isaac-si…

4441f68

…m#5823" This reverts commit c514199.

Merge remote-tracking branch 'origin/develop' into jichuanh/multi-gpu-ci

3694e7f

github-actions Bot added isaac-lab Related to Isaac Lab team infrastructure labels Jun 12, 2026

hujc7 added 3 commits June 12, 2026 01:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI][DO-NOT-MERGE] Modularize the per-file test runner into planner/executor/aggregator#6160

[CI][DO-NOT-MERGE] Modularize the per-file test runner into planner/executor/aggregator#6160
hujc7 wants to merge 25 commits into
isaac-sim:developfrom
hujc7:jichuanh/test-runner-modular

hujc7 commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hujc7 commented Jun 12, 2026

Summary

1. Modules

2. Removed / renamed

3. Behavior preservation

4. Validation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant