Fix: z-image regional guidance split mismatch #9273

Open
Pfannkuchensack wants to merge 5 commits into invoke-ai:main from Pfannkuchensack:fix/z-image-regional-guidance-split-mismatch
Conversation

@Pfannkuchensack

Collaborator

Summary

fix(z-image): repair & realign Regional Guidance after diffusers refactor

Using "Regional Guidance" layers with Z-Image crashed with:

RuntimeError: split_with_sizes expects split_sizes to sum exactly to 162
(input tensor's size at dimension 0), but got split_sizes=[160]

Why: The regional-prompting patch (z_image_transformer_patch.py) was a hand-copied snapshot of an older ZImageTransformer2DModel.forward. The installed diffusers version refactored that method — in particular _pad_with_ids now produces caption pos_ids that are longer than the caption feature tensor (the pos grid is built at the padded length and extra pad pos-ids are appended). The model copes by splitting RoPE embeddings by pos_ids lengths and truncating, but the stale patch split by feature lengths → size mismatch crash. FLUX.2 was unaffected because it doesn't use this code path.
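
The mismatch described above can be reproduced without diffusers. The following is a minimal sketch, not the patch itself: `split_with_sizes_py` is a hypothetical pure-Python stand-in for splitting a tensor by a list of sizes, and the lengths 162 and 160 are read off the error message, assuming 162 is the caption pos_ids length and 160 the caption feature length.

```python
def split_with_sizes_py(rows, sizes):
    """Hypothetical stand-in for splitting a sequence by explicit sizes."""
    if sum(sizes) != len(rows):
        raise RuntimeError(
            f"split_with_sizes expects split_sizes to sum exactly to {len(rows)}, "
            f"but got split_sizes={sizes}"
        )
    chunks, start = [], 0
    for size in sizes:
        chunks.append(rows[start:start + size])
        start += size
    return chunks


# Assumed lengths, taken from the error message in this PR.
pos_ids_len = 162   # one RoPE embedding row per pos_id
feature_len = 160   # the caption feature tensor is shorter

rope = list(range(pos_ids_len))  # placeholder for the RoPE embedding rows

# Stale patch: split by feature lengths, which no longer sum to the input size.
try:
    split_with_sizes_py(rope, [feature_len])
    stale_error = None
except RuntimeError as exc:
    stale_error = str(exc)

# Approach described above: split by pos_ids lengths, then truncate to the features.
chunks = split_with_sizes_py(rope, [pos_ids_len])
aligned = [chunk[:feature_len] for chunk in chunks]
```

Splitting by the pos_ids length always succeeds because that is the length the embeddings were computed for; the truncation afterwards restores agreement with the feature tensor.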

How:

  1. No more drift: create_regional_forward now delegates to the model's own helpers (patchify_and_embed, _prepare_sequence, _build_unified_sequence) instead of re-implementing patchify/RoPE/padding logic. It only overrides the main-layer attention mask, so it stays in sync with upstream diffusers.

  2. Mask alignment — The unified sequence pads the image and caption blocks individually to a multiple of 32, so the real per-item layout is [img_real | img_pad | txt_real | txt_pad]. The four regional sub-blocks (img→img, img→txt, txt→img, txt→txt) are now scattered into their padding-aware positions instead of being placed in a contiguous top-left block (which only happened to align at square 1024×1024). This fixes regional guidance silently having no effect at most other resolutions.

  3. CFG/negative pass — The patched forward also runs for the negative prompt (different text length). The regional mask was built for the positive prompt only, so it is now applied only to passes whose caption length matches the positive prompt; other passes fall back to the plain padding mask.
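
The padding-aware placement from item 2 can be sketched in plain Python. This is an illustration under stated assumptions, not the code in z_image_transformer_patch.py: the function names are hypothetical, the mask is a nested list of booleans rather than a tensor, and padding rows and columns are simply left False here, whereas the real patch combines the regional mask with the model's own padding mask.

```python
def pad_to_multiple(n, multiple=32):
    """Round n up to the next multiple (32 in the unified sequence)."""
    return n + (-n) % multiple


def regional_index_map(img_len, txt_len, multiple=32):
    """Map positions of a contiguous [img | txt] regional mask onto the
    padded [img_real | img_pad | txt_real | txt_pad] layout."""
    img_padded = pad_to_multiple(img_len, multiple)
    txt_padded = pad_to_multiple(txt_len, multiple)
    img_idx = list(range(img_len))
    txt_idx = list(range(img_padded, img_padded + txt_len))  # text starts after img_pad
    return img_idx + txt_idx, img_padded + txt_padded


def scatter_regional_mask(regional, img_len, txt_len, multiple=32):
    """Scatter all four sub-blocks (img/img, img/txt, txt/img, txt/txt) at once."""
    index, total = regional_index_map(img_len, txt_len, multiple)
    full = [[False] * total for _ in range(total)]
    for r_src, r_dst in enumerate(index):
        for c_src, c_dst in enumerate(index):
            full[r_dst][c_dst] = regional[r_src][c_src]
    return full


# Tiny example: 3 image tokens, 2 text tokens, padding to a multiple of 4.
small = scatter_regional_mask([[True] * 5 for _ in range(5)], 3, 2, multiple=4)
```

With 4096 image tokens the text block starts at index 4096, which coincides with the contiguous layout; with 3952 image tokens it starts at 3968, so a contiguous top-left placement would be off by 16 positions.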

Note: the remaining "soft" region edge (content bleeding slightly past the mask) is intended behavior of the FLUX-style regional approach — unrestricted image self-attention plus alternating full-attention layers preserve global coherence. Token ordering and grid dimensions were verified to match exactly; there is no positional offset.

Related Issues / Discussions

Closes #9251

QA Instructions

  1. Load a Z-Image main model.
  2. Add one or more Regional Guidance layers with masks + prompts on the canvas.
  3. Generate — previously this crashed with the split_with_sizes error; it should now complete successfully.
  4. Verify the prompted content appears within the masked regions.
  5. Test at a non-square resolution (e.g. 832×1216) to confirm the mask aligns (previously it was effectively ignored at non-multiple-of-32 image token counts).
  6. Test with CFG enabled (guidance > 1.0) and a negative prompt to confirm the negative pass is unaffected.
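
To pick test resolutions for step 5, the image token count can be estimated ahead of time. The sketch below assumes an 8x VAE downscale and a 2x2 patch size; neither value is stated in this PR, so treat both as assumptions and adjust them if the model config differs.

```python
def image_token_count(width, height, vae_scale=8, patch_size=2):
    """Estimate image tokens; vae_scale and patch_size are assumed values."""
    latent_w, latent_h = width // vae_scale, height // vae_scale
    return (latent_w // patch_size) * (latent_h // patch_size)


def pad_tokens(tokens, multiple=32):
    """Padding tokens needed to reach the next multiple of 32."""
    return (-tokens) % multiple


for width, height in [(1024, 1024), (832, 1216)]:
    tokens = image_token_count(width, height)
    print(f"{width}x{height}: {tokens} image tokens, {pad_tokens(tokens)} pad tokens")
```

Under these assumptions 1024×1024 needs no image padding, while 832×1216 needs 16 pad tokens, which makes it a useful resolution for exercising the realigned mask.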

Merge Plan

Standard merge. Touches only invokeai/backend/z_image/z_image_transformer_patch.py.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • ❗Changes to a redux slice have a corresponding migration
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

Z-Image Regional Guidance crashed with "split_with_sizes expects
split_sizes to sum exactly to 162 ... but got split_sizes=[160]". The
regional-prompting patch was a hand-copied snapshot of an outdated
ZImageTransformer2DModel.forward. The installed diffusers version
changed _pad_with_ids so caption pos_ids are now longer than the
caption feature tensor, while the stale patch split RoPE embeddings by
feature lengths instead of pos_ids lengths.

Rewrite create_regional_forward to delegate to the model's own helpers
(patchify_and_embed, _prepare_sequence, _build_unified_sequence) and
only override the main-layer attention mask to inject the regional
mask. This keeps the patch in sync with upstream diffusers and stops
re-implementing the drift-prone patchify/RoPE/padding logic.
…ctor

Z-Image Regional Guidance crashed with "split_with_sizes expects
split_sizes to sum exactly to 162 ... but got split_sizes=[160]". The
regional-prompting patch was a hand-copied snapshot of an outdated
ZImageTransformer2DModel.forward; the installed diffusers version
changed _pad_with_ids so caption pos_ids are longer than the caption
feature tensor, while the stale patch split RoPE embeddings by feature
lengths instead of pos_ids lengths.

Rewrite create_regional_forward to delegate to the model's own helpers
(patchify_and_embed, _prepare_sequence, _build_unified_sequence) so it
stays in sync with upstream diffusers, and only override the main-layer
attention mask.

Also fix two reasons regional guidance had no visible effect:
- Mask alignment: the unified sequence pads the image and caption
  blocks individually to a multiple of 32, so the real layout is
  [img_real | img_pad | txt_real | txt_pad]. Scatter the four regional
  sub-blocks into their padding-aware positions instead of assuming a
  contiguous top-left block (which only matched square 1024x1024).
- CFG pass: the patched forward also runs for the negative prompt; only
  apply the regional mask to passes whose caption length matches the
  positive prompt, otherwise fall back to the plain padding mask.
@github-actions github-actions Bot added python PRs that change python files backend PRs that change backend files labels Jun 7, 2026
@lstein lstein self-assigned this Jun 17, 2026
@lstein lstein added the 6.13.5 Library Updates label Jun 17, 2026
@lstein lstein moved this to 6.13.5 LIBRARY UPDATES in Invoke - Community Roadmap Jun 17, 2026
@lstein

lstein commented Jun 24, 2026

Collaborator

Code review findings

Reviewed the regional-guidance realignment against the installed diffusers transformer_z_image.py. The core fix is correct: helper call signatures/return unpacking match, the basic-mode sequence order [img, cap] matches the regional mask order [img, txt], and the four-sub-block scatter lands text at the padded offset x_len instead of the old contiguous top-left assumption (the actual "split mismatch"). Three findings worth surfacing:

1. Negative/CFG pass can collide with the positive layout and get the wrong regional mask

invokeai/backend/z_image/z_image_transformer_patch.py:144-148

The positive vs. negative pass is distinguished solely by cap_len != expected_cap_len, where expected_cap_len = txt_seq_len + ((-txt_seq_len) % 32). This only compares the caption length rounded up to a multiple of 32, so any negative prompt whose token count rounds to the same multiple as the positive regional embeds is treated as the positive pass and has the positive mask injected into the unconditional prediction.

Concrete: single region with a 20-token positive prompt → padded cap_len=32; a 25-token negative prompt → padded cap_len=32. The negative forward (cap_feats=[neg_prompt_embeds], z_image_denoise.py:625) then matches, applied_regional[0]=True, and regional[it, it] / regional[ii, it] index a mask built for the positive embedding ranges — corrupting CFG. With short prompts and a single region this is reachable, and the failure is silent. The comments acknowledge the heuristic, but consider a more robust discriminator.
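
The collision is easy to check by hand. A small sketch using the rounding expression quoted above (the helper name `padded_cap_len` is illustrative):

```python
def padded_cap_len(txt_seq_len, multiple=32):
    """Caption length rounded up to a multiple of 32, as in expected_cap_len."""
    return txt_seq_len + ((-txt_seq_len) % multiple)


positive_tokens, negative_tokens = 20, 25
collides = padded_cap_len(positive_tokens) == padded_cap_len(negative_tokens)
print(padded_cap_len(positive_tokens), padded_cap_len(negative_tokens), collides)
```

Any two prompts whose token counts fall in the same 32-wide bucket (1 to 32, 33 to 64, and so on) are indistinguishable to a length-only check.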

Editorial note from Lincoln: The Canvas UI does not currently provide the option for a negative prompt in regional guidance. Up to you to decide whether it is worth making the discriminator more robust in the event that negative prompting is added in the future.

2. Fragile coupling: regional guidance silently no-ops if cap_feats ever diverges from the mask source

z_image_transformer_patch.py:144-152 (robustness)

The feature only works because pos_prompt_embeds is regional_text_conditioning.prompt_embeds (z_image_denoise.py:306), so its padded length equals expected_cap_len. There's no assertion of this invariant inside the patch. If a caller ever passes a different cap_feats for the positive pass, use_regional becomes False and regional prompting silently does nothing rather than erroring. Passing an explicit "this is the conditioned pass" flag (or applying the patch per-call) would be more robust than inferring identity from a length match — original_forward is already threaded through but unused, so the plumbing exists.

3. float_mask is fully materialized and cloned even on passes that never use it

z_image_transformer_patch.py:126-134 (efficiency)

float_mask is built with torch.where(...).expand(bsz, 1, S, S).clone() before the matching loop, but on any pass where no item matches (every negative pass, the common case), use_regional is False and all layers fall back to unified_mask, so the cloned (bsz, 1, S, S) tensor is discarded unused. For a 1024² generation S ≈ 4096, so at bsz=1 that's a ~33 MB bf16 clone plus a full-tensor torch.where wasted on roughly half of all forward calls. The match conditions (lines 145-150) are cheap and depend only on x_seqlens/cap_seqlens; compute applied_regional first and build float_mask only when any(applied_regional).
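
The size estimate and the suggested ordering can be sketched as follows. Both helpers are hypothetical illustrations; `build_fn` stands in for whatever constructs the float mask in the real patch.

```python
def float_mask_bytes(seq_len, batch_size=1, bytes_per_element=2):
    """Size of a (batch_size, 1, S, S) mask; bf16 uses 2 bytes per element."""
    return batch_size * 1 * seq_len * seq_len * bytes_per_element


def build_mask_if_needed(applied_regional, build_fn):
    """Run the expensive build only when at least one batch item uses the mask."""
    return build_fn() if any(applied_regional) else None


size_mb = float_mask_bytes(4096) / 1e6  # about 33.5 MB for S = 4096

# Negative pass: nothing matches, so the builder is never called.
calls = []
skipped = build_mask_if_needed([False], lambda: calls.append("built") or "mask")
# Conditioned pass: one item matches, so the mask is built once.
built = build_mask_if_needed([True], lambda: calls.append("built") or "mask")
```

Because the match checks only read sequence lengths, moving them ahead of the mask construction costs nothing on the conditioned pass and saves the whole allocation on every other pass.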


Finding #1 is the only correctness issue; #2 and #3 are robustness/efficiency. None are hard blockers given the documented single-region/separate-pass usage, but #1 is worth hardening if short-prompt regional + CFG is a supported combination.

🤖 Generated with Claude Code

@lstein

lstein commented Jun 24, 2026

Collaborator

Tested with several regional guidance layers and it works as advertised. Please have a look at the Claude review and address any of the issues that you feel are valid. From my point of view, none of these are blockers.

The regional attention patch ran for both the conditioned and negative/CFG
forward passes and distinguished them by comparing the padded caption length
against the positive prompt's expected length. Two short prompts that round up
to the same multiple of 32 collided, so the positive regional mask could be
injected into the unconditional prediction and silently corrupt CFG.

Discriminate the conditioned pass by tensor identity (cap_feats is the exact
positive_cap_feats the mask was built for) instead of a length heuristic, so
the positive and negative passes can never be confused. The context manager now
requires positive_cap_feats whenever a regional mask is provided, turning the
previously inferred invariant into an enforced one rather than a silent no-op.

Also build the (bsz, 1, S, S) float mask lazily: compute applied_regional from
cheap scalar checks first and skip materializing/cloning the full mask on passes
that never match (every negative pass), avoiding a ~33 MB bf16 clone per call.
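
The identity-based discriminator described in this commit message can be illustrated roughly as below. This is a sketch, not the merged code: the class `RegionalMaskState` and its method are hypothetical, and plain objects stand in for the embedding tensors.

```python
class RegionalMaskState:
    """Hypothetical holder for the regional mask and the embeds it was built for."""

    def __init__(self, regional_mask=None, positive_cap_feats=None):
        # Enforce the invariant up front instead of silently doing nothing later.
        if regional_mask is not None and positive_cap_feats is None:
            raise ValueError("positive_cap_feats is required when a regional mask is provided")
        self.regional_mask = regional_mask
        self.positive_cap_feats = positive_cap_feats

    def is_conditioned_pass(self, cap_feats):
        # Identity, not length: only the exact object the mask was built for matches.
        return self.regional_mask is not None and cap_feats is self.positive_cap_feats


positive_embeds = object()  # stand-in for the positive prompt embeds
negative_embeds = object()  # stand-in for a negative prompt with the same padded length

state = RegionalMaskState(regional_mask="mask", positive_cap_feats=positive_embeds)
```

Since the check compares object identity, two prompts with equal padded lengths can no longer be mistaken for each other, and omitting positive_cap_feats fails loudly at setup time.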
@github-actions github-actions Bot added the invocations PRs that change invocations label Jun 25, 2026
Labels

6.13.5 Library Updates backend PRs that change backend files invocations PRs that change invocations python PRs that change python files

Projects

Status: 6.13.5 LIBRARY UPDATES

Development

Successfully merging this pull request may close these issues.

[bug]: Z-Image Turbo models fail in Canvas when using Regional Guidance

2 participants