Fix: z-image regional guidance split mismatch by Pfannkuchensack · Pull Request #9273 · invoke-ai/InvokeAI

Pfannkuchensack · 2026-06-07T10:28:50Z

Summary

fix(z-image): repair & realign Regional Guidance after diffusers refactor

Using "Regional Guidance" layers with Z-Image crashed with:

RuntimeError: split_with_sizes expects split_sizes to sum exactly to 162
(input tensor's size at dimension 0), but got split_sizes=[160]

Why: The regional-prompting patch (z_image_transformer_patch.py) was a hand-copied snapshot of an older ZImageTransformer2DModel.forward. The installed diffusers version refactored that method — in particular _pad_with_ids now produces caption pos_ids that are longer than the caption feature tensor (the pos grid is built at the padded length and extra pad pos-ids are appended). The model copes by splitting RoPE embeddings by pos_ids lengths and truncating, but the stale patch split by feature lengths → size mismatch crash. FLUX.2 was unaffected because it doesn't use this code path.
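The mismatch can be reproduced in miniature. The sketch below is a pure-Python stand-in for splitting a tensor by a list of sizes; the helper name `split_with_sizes` and the toy row list are illustrative only, not code from InvokeAI or diffusers. It takes the two lengths from the error message (162 RoPE rows derived from pos_ids, 160 caption feature rows) and shows that splitting by the feature length fails, while splitting by the pos_ids length and then truncating succeeds.

```python
def split_with_sizes(rows, sizes):
    """Pure-Python stand-in for a sized split: sizes must sum to len(rows)."""
    if sum(sizes) != len(rows):
        raise RuntimeError(
            f"split_with_sizes expects split_sizes to sum exactly to {len(rows)}, "
            f"but got split_sizes={sizes}"
        )
    chunks, start = [], 0
    for size in sizes:
        chunks.append(rows[start:start + size])
        start += size
    return chunks


# Lengths taken from the error message: RoPE rows follow pos_ids (162),
# while the caption feature tensor has only 160 rows.
pos_ids_len, feat_len = 162, 160
rope_rows = list(range(pos_ids_len))

# Stale patch: split by feature length, which does not sum to 162.
try:
    split_with_sizes(rope_rows, [feat_len])
    stale_error = None
except RuntimeError as exc:
    stale_error = str(exc)

# Behaviour described above: split by pos_ids length, then truncate.
(per_item,) = split_with_sizes(rope_rows, [pos_ids_len])
aligned = per_item[:feat_len]
```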

How:

- No more drift: create_regional_forward now delegates to the model's own helpers (patchify_and_embed, _prepare_sequence, _build_unified_sequence) instead of re-implementing patchify/RoPE/padding logic. It only overrides the main-layer attention mask, so it stays in sync with upstream diffusers.
- Mask alignment: The unified sequence pads the image and caption blocks individually to a multiple of 32, so the real per-item layout is [img_real | img_pad | txt_real | txt_pad]. The four regional sub-blocks (img→img, img→txt, txt→img, txt→txt) are now scattered into their padding-aware positions instead of being placed in a contiguous top-left block (which only happened to align at square 1024×1024). This fixes regional guidance silently having no effect at most other resolutions.
- CFG/negative pass: The patched forward also runs for the negative prompt (different text length). The regional mask was built for the positive prompt only, so it is now applied only to passes whose caption length matches the positive prompt; other passes fall back to the plain padding mask.
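The mask-alignment point can be sketched with plain Python lists. This is a minimal illustration under stated assumptions, not the patch itself: `pad_to_multiple` and `scatter_regional_mask` are hypothetical names, the mask is boolean rather than a float attention bias, and padding positions are simply left False. A single index mapping from the compact [img, txt] order to the padded [img_real | img_pad | txt_real | txt_pad] layout places all four sub-blocks at once.

```python
def pad_to_multiple(n, multiple=32):
    """Round n up to the next multiple."""
    return n + (-n) % multiple


def scatter_regional_mask(regional, img_len, txt_len, multiple=32):
    """Place a regional mask ordered [img, txt] into the padded unified layout.

    `regional` is a square list of lists of bools with side img_len + txt_len.
    The result has side pad(img_len) + pad(txt_len); padding rows and columns
    stay False in this sketch.
    """
    img_padded = pad_to_multiple(img_len, multiple)
    total = img_padded + pad_to_multiple(txt_len, multiple)
    # Unified-sequence positions of the real image and text tokens.
    img_pos = list(range(img_len))
    txt_pos = list(range(img_padded, img_padded + txt_len))
    unified_pos = img_pos + txt_pos  # compact index -> unified index

    unified = [[False] * total for _ in range(total)]
    for r, ur in enumerate(unified_pos):
        for c, uc in enumerate(unified_pos):
            unified[ur][uc] = regional[r][c]
    return unified


# Toy case: 40 image tokens (padded to 64) and 5 text tokens (padded to 32).
img_len, txt_len = 40, 5
regional = [[True] * (img_len + txt_len) for _ in range(img_len + txt_len)]
unified = scatter_regional_mask(regional, img_len, txt_len)
```

In this toy case the text block starts at index 64, not 40. A contiguous top-left placement would have put text columns at 40 to 44, which in the padded layout are image padding.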

Note: the remaining "soft" region edge (content bleeding slightly past the mask) is intended behavior of the FLUX-style regional approach — unrestricted image self-attention plus alternating full-attention layers preserve global coherence. Token ordering and grid dimensions were verified to match exactly; there is no positional offset.

Related Issues / Discussions

Closes #9251

QA Instructions

1. Load a Z-Image main model.
2. Add one or more Regional Guidance layers with masks + prompts on the canvas.
3. Generate. Previously this crashed with the split_with_sizes error; it should now complete successfully.
4. Verify the prompted content appears within the masked regions.
5. Test at a non-square resolution (e.g. 832×1216) to confirm the mask aligns (previously it was effectively ignored at non-multiple-of-32 image token counts).
6. Test with CFG enabled (guidance > 1.0) and a negative prompt to confirm the negative pass is unaffected.
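As a rough check on why 832×1216 exercises the alignment fix while 1024×1024 does not, the arithmetic below counts image tokens per resolution. It assumes one token per 16×16 pixel block (8× VAE downsampling followed by 2×2 patchify); that factor is an assumption for illustration, chosen because it reproduces the S ≈ 4096 figure quoted for 1024² in the review, and `image_token_count` is a hypothetical helper.

```python
def image_token_count(width, height, pixels_per_token=16):
    # ASSUMPTION: 8x VAE downsampling followed by 2x2 patchify,
    # i.e. one token per 16x16 pixel block.
    return (width // pixels_per_token) * (height // pixels_per_token)


for width, height in [(1024, 1024), (832, 1216)]:
    tokens = image_token_count(width, height)
    pad = (-tokens) % 32  # padding needed to reach a multiple of 32
    print(f"{width}x{height}: {tokens} image tokens, {pad} padding tokens")
```

Under that assumption the square case needs no image padding, so a contiguous top-left block happens to line up, while the non-square case inserts padding between the image and caption blocks.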

Merge Plan

Standard merge. Touches only invokeai/backend/z_image/z_image_transformer_patch.py.

Checklist

The PR has a short but descriptive title, suitable for a changelog
Tests added / updated (if applicable)
❗Changes to a redux slice have a corresponding migration
Documentation added / updated (if applicable)
Updated What's New copy (if doing a release after this PR)

Z-Image Regional Guidance crashed with "split_with_sizes expects split_sizes to sum exactly to 162 ... but got split_sizes=[160]". The regional-prompting patch was a hand-copied snapshot of an outdated ZImageTransformer2DModel.forward. The installed diffusers version changed _pad_with_ids so caption pos_ids are now longer than the caption feature tensor, while the stale patch split RoPE embeddings by feature lengths instead of pos_ids lengths. Rewrite create_regional_forward to delegate to the model's own helpers (patchify_and_embed, _prepare_sequence, _build_unified_sequence) and only override the main-layer attention mask to inject the regional mask. This keeps the patch in sync with upstream diffusers and stops re-implementing the drift-prone patchify/RoPE/padding logic.

…ctor

Z-Image Regional Guidance crashed with "split_with_sizes expects split_sizes to sum exactly to 162 ... but got split_sizes=[160]".

The regional-prompting patch was a hand-copied snapshot of an outdated ZImageTransformer2DModel.forward; the installed diffusers version changed _pad_with_ids so caption pos_ids are longer than the caption feature tensor, while the stale patch split RoPE embeddings by feature lengths instead of pos_ids lengths.

Rewrite create_regional_forward to delegate to the model's own helpers (patchify_and_embed, _prepare_sequence, _build_unified_sequence) so it stays in sync with upstream diffusers, and only override the main-layer attention mask.

Also fix two reasons regional guidance had no visible effect:

- Mask alignment: the unified sequence pads the image and caption blocks individually to a multiple of 32, so the real layout is [img_real | img_pad | txt_real | txt_pad]. Scatter the four regional sub-blocks into their padding-aware positions instead of assuming a contiguous top-left block (which only matched square 1024x1024).
- CFG pass: the patched forward also runs for the negative prompt; only apply the regional mask to passes whose caption length matches the positive prompt, otherwise fall back to the plain padding mask.

lstein · 2026-06-24T21:36:54Z

Code review findings

Reviewed the regional-guidance realignment against the installed diffusers transformer_z_image.py. The core fix is correct: helper call signatures/return unpacking match, the basic-mode sequence order [img, cap] matches the regional mask order [img, txt], and the four-sub-block scatter lands text at the padded offset x_len instead of the old contiguous top-left assumption (the actual "split mismatch"). Three findings worth surfacing:

1. Negative/CFG pass can collide with the positive layout and get the wrong regional mask

invokeai/backend/z_image/z_image_transformer_patch.py:144-148

The positive vs. negative pass is distinguished solely by cap_len != expected_cap_len, where expected_cap_len = txt_seq_len + ((-txt_seq_len) % 32). This only compares the caption length rounded up to a multiple of 32, so any negative prompt whose token count rounds to the same multiple as the positive regional embeds is treated as the positive pass and has the positive mask injected into the unconditional prediction.

Concrete: single region with a 20-token positive prompt → padded cap_len=32; a 25-token negative prompt → padded cap_len=32. The negative forward (cap_feats=[neg_prompt_embeds], z_image_denoise.py:625) then matches, applied_regional[0]=True, and regional[it, it] / regional[ii, it] index a mask built for the positive embedding ranges — corrupting CFG. With short prompts and a single region this is reachable, and the failure is silent. The comments acknowledge the heuristic, but consider a more robust discriminator.
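The collision in the example above follows directly from the rounding formula. A small sketch, with `padded_len` as an illustrative helper name:

```python
def padded_len(n, multiple=32):
    """Same formula as expected_cap_len: round n up to a multiple of 32."""
    return n + ((-n) % multiple)


positive_tokens, negative_tokens = 20, 25
expected_cap_len = padded_len(positive_tokens)
negative_cap_len = padded_len(negative_tokens)

# The length heuristic cannot tell these two passes apart.
looks_like_positive = negative_cap_len == expected_cap_len

# A longer negative prompt rounds to a different multiple and is not confused.
long_negative_cap_len = padded_len(33)
```

Any two prompts whose token counts fall in the same 32-wide bucket (1 to 32, 33 to 64, and so on) are indistinguishable to a padded-length comparison.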

Editorial note from Lincoln: The Canvas UI does not currently provide the option for a negative prompt in regional guidance. Up to you to decide whether it is worth making the discriminator more robust in the event that negative prompting is added in the future.

2. Fragile coupling: regional guidance silently no-ops if `cap_feats` ever diverges from the mask source

z_image_transformer_patch.py:144-152 (altitude)

The feature only works because pos_prompt_embeds is regional_text_conditioning.prompt_embeds (z_image_denoise.py:306), so its padded length equals expected_cap_len. There's no assertion of this invariant inside the patch. If a caller ever passes a different cap_feats for the positive pass, use_regional becomes False and regional prompting silently does nothing rather than erroring. Passing an explicit "this is the conditioned pass" flag (or applying the patch per-call) would be more robust than inferring identity from a length match — original_forward is already threaded through but unused, so the plumbing exists.

3. `float_mask` is fully materialized and cloned even on passes that never use it

z_image_transformer_patch.py:126-134 (efficiency)

float_mask is built with torch.where(...).expand(bsz, 1, S, S).clone() before the matching loop, but on any pass where no item matches (every negative pass, the common case), use_regional is False and all layers fall back to unified_mask — the cloned (bsz, 1, S, S) tensor is discarded unused. For a 1024² generation S ≈ 4096, that's a ~33 MB bf16 clone plus a full-tensor where wasted on roughly half of all forward calls. The match conditions (lines 145-150) are cheap and depend only on x_seqlens/cap_seqlens; compute applied_regional first and build float_mask only when any(applied_regional).
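The suggested reordering can be sketched as follows. This is an illustration only: `masks_for_pass` and `build_float_mask` are hypothetical names, the match condition is reduced to the caption-length check discussed in finding 1, and the expensive mask construction is replaced by a stub callable. The first helper reproduces the size estimate for a dense mask at 2 bytes per element.

```python
def bf16_mask_bytes(seq_len, batch_size=1):
    """Bytes in a dense (batch_size, 1, S, S) mask at 2 bytes per element."""
    return batch_size * seq_len * seq_len * 2


def masks_for_pass(cap_seqlens, expected_cap_len, build_float_mask):
    """Compute the cheap per-item match first, then build the mask lazily.

    `build_float_mask` is a hypothetical zero-argument callable standing in
    for the expensive where/expand/clone step.
    """
    applied_regional = [cap_len == expected_cap_len for cap_len in cap_seqlens]
    float_mask = build_float_mask() if any(applied_regional) else None
    return applied_regional, float_mask


build_calls = []


def fake_builder():
    build_calls.append(1)
    return "float_mask"


# Non-matching pass: the builder is never invoked.
applied_neg, mask_neg = masks_for_pass([64], 32, fake_builder)
# Matching pass: the mask is built once.
applied_pos, mask_pos = masks_for_pass([32], 32, fake_builder)
```

With S = 4096 and a batch of one, `bf16_mask_bytes(4096)` gives 33,554,432 bytes, which matches the ~33 MB figure above.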

Finding #1 is the only correctness issue; #2 and #3 are robustness/efficiency. None are hard blockers given the documented single-region/separate-pass usage, but #1 is worth hardening if short-prompt regional + CFG is a supported combination.

🤖 Generated with Claude Code

lstein · 2026-06-24T21:56:10Z

Tested with several regional guidance layers and it works as advertised. Please have a look at the Claude review and address any of the issues that you feel are valid. From my point of view, none of these are blockers.

The regional attention patch ran for both the conditioned and negative/CFG forward passes and distinguished them by comparing the padded caption length against the positive prompt's expected length. Two short prompts that round up to the same multiple of 32 collided, so the positive regional mask could be injected into the unconditional prediction and silently corrupt CFG.

Discriminate the conditioned pass by tensor identity (cap_feats is the exact positive_cap_feats the mask was built for) instead of a length heuristic, so the positive and negative passes can never be confused. The context manager now requires positive_cap_feats whenever a regional mask is provided, turning the previously inferred invariant into an enforced one rather than a silent no-op.

Also build the (bsz, 1, S, S) float mask lazily: compute applied_regional from cheap scalar checks first and skip materializing/cloning the full mask on passes that never match (every negative pass), avoiding a ~33 MB bf16 clone per call.
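A minimal sketch of identity-based discrimination, assuming a hypothetical `RegionalMaskContext` class rather than the actual context manager in the patch. Plain lists stand in for embedding tensors; the point is that two objects with the same padded length are still told apart by `is`, and that a mask without its source embeddings is rejected up front.

```python
class RegionalMaskContext:
    """Hypothetical holder for a regional mask and the embeddings it was built for."""

    def __init__(self, regional_mask=None, positive_cap_feats=None):
        # Enforce the invariant instead of silently doing nothing later.
        if regional_mask is not None and positive_cap_feats is None:
            raise ValueError(
                "positive_cap_feats is required when a regional mask is provided"
            )
        self.regional_mask = regional_mask
        self.positive_cap_feats = positive_cap_feats

    def is_conditioned_pass(self, cap_feats):
        # Identity check: only the exact object the mask was built for matches.
        return self.regional_mask is not None and cap_feats is self.positive_cap_feats


# Stand-ins for two embeddings with the same padded length (32).
positive = [0.0] * 32
negative = [0.0] * 32
ctx = RegionalMaskContext(regional_mask="mask", positive_cap_feats=positive)

positive_matches = ctx.is_conditioned_pass(positive)
negative_matches = ctx.is_conditioned_pass(negative)
```

The identity check relies on the caller passing the same object through to the forward call; a copy or re-cast of the embeddings would no longer match, which is the trade-off of this design compared with an explicit flag.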

Pfannkuchensack added 3 commits June 7, 2026 11:42

Chore Ruff + Typegen

2318170

Pfannkuchensack requested review from JPPhoto, blessedcoolant, dunkeroni and lstein as code owners June 7, 2026 10:28

github-actions Bot added the python and backend labels Jun 7, 2026

lstein self-assigned this Jun 17, 2026

lstein added the 6.13.5 Library Updates label Jun 17, 2026

lstein added this to Invoke - Community Roadmap Jun 17, 2026

lstein moved this to 6.13.5 LIBRARY UPDATES in Invoke - Community Roadmap Jun 17, 2026

github-actions Bot added the invocations label Jun 25, 2026

Merge branch 'main' into fix/z-image-regional-guidance-split-mismatch

f830aa8


Pfannkuchensack wants to merge 5 commits into invoke-ai:main from Pfannkuchensack:fix/z-image-regional-guidance-split-mismatch
