Fix: z-image regional guidance split mismatch #9273
Z-Image Regional Guidance crashed with "split_with_sizes expects split_sizes to sum exactly to 162 ... but got split_sizes=[160]". The regional-prompting patch was a hand-copied snapshot of an outdated ZImageTransformer2DModel.forward. The installed diffusers version changed _pad_with_ids so caption pos_ids are now longer than the caption feature tensor, while the stale patch split RoPE embeddings by feature lengths instead of pos_ids lengths. Rewrite create_regional_forward to delegate to the model's own helpers (patchify_and_embed, _prepare_sequence, _build_unified_sequence) and only override the main-layer attention mask to inject the regional mask. This keeps the patch in sync with upstream diffusers and stops re-implementing the drift-prone patchify/RoPE/padding logic.
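The failure mode described above can be reproduced without diffusers. The sketch below is only an illustration: `split_with_sizes` here is a hypothetical pure-Python stand-in for the tensor operation named in the error, and the lengths 160 and 162 are read off the error message (160 caption feature rows, 162 position ids), not taken from the actual model code.

```python
def split_with_sizes(seq, sizes):
    """Split seq into consecutive chunks; sizes must sum to len(seq)."""
    if sum(sizes) != len(seq):
        raise ValueError(
            f"split sizes {sizes} sum to {sum(sizes)}, expected {len(seq)}"
        )
    chunks, start = [], 0
    for size in sizes:
        chunks.append(seq[start:start + size])
        start += size
    return chunks


# Assumed lengths, taken from the error message in the report.
cap_feat_len = 160      # rows in the caption feature tensor
cap_pos_ids_len = 162   # rows of pos_ids, hence rows of RoPE embeddings
rope_rows = list(range(cap_pos_ids_len))

# Stale patch: split the RoPE rows by feature length, which does not sum to 162.
try:
    split_with_sizes(rope_rows, [cap_feat_len])
    stale_split_ok = True
except ValueError:
    stale_split_ok = False

# Behaviour described for upstream: split by pos_ids length, then truncate
# each chunk to the feature length.
(per_item,) = split_with_sizes(rope_rows, [cap_pos_ids_len])
truncated = per_item[:cap_feat_len]
```

Delegating to the model's own helpers avoids having to keep this split-then-truncate detail in sync by hand.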
…ctor

Z-Image Regional Guidance crashed with "split_with_sizes expects split_sizes to sum exactly to 162 ... but got split_sizes=[160]". The regional-prompting patch was a hand-copied snapshot of an outdated ZImageTransformer2DModel.forward; the installed diffusers version changed _pad_with_ids so caption pos_ids are longer than the caption feature tensor, while the stale patch split RoPE embeddings by feature lengths instead of pos_ids lengths.

Rewrite create_regional_forward to delegate to the model's own helpers (patchify_and_embed, _prepare_sequence, _build_unified_sequence) so it stays in sync with upstream diffusers, and only override the main-layer attention mask.

Also fix two reasons regional guidance had no visible effect:

- Mask alignment: the unified sequence pads the image and caption blocks individually to a multiple of 32, so the real layout is [img_real | img_pad | txt_real | txt_pad]. Scatter the four regional sub-blocks into their padding-aware positions instead of assuming a contiguous top-left block (which only matched square 1024x1024).
- CFG pass: the patched forward also runs for the negative prompt; only apply the regional mask to passes whose caption length matches the positive prompt, otherwise fall back to the plain padding mask.
Code review findings

Reviewed the regional-guidance realignment against the installed diffusers

1. Negative/CFG pass can collide with the positive layout and get the wrong regional mask

The positive vs. negative pass is distinguished solely by

Concrete: single region with a 20-token positive prompt → padded

Editorial note from Lincoln: The Canvas UI does not currently provide the option for a negative prompt in regional guidance. Up to you to decide whether it is worth making the discriminator more robust in the event that negative prompting is added in the future.

2. Fragile coupling: regional guidance silently no-ops if
Tested with several regional guidance layers and it works as advertised. Please have a look at the Claude review and address any of the issues that you feel are valid. From my point of view, none of these are blockers.
The regional attention patch ran for both the conditioned and negative/CFG forward passes and distinguished them by comparing the padded caption length against the positive prompt's expected length. Two short prompts that round up to the same multiple of 32 collided, so the positive regional mask could be injected into the unconditional prediction and silently corrupt CFG. Discriminate the conditioned pass by tensor identity (cap_feats is the exact positive_cap_feats the mask was built for) instead of a length heuristic, so the positive and negative passes can never be confused. The context manager now requires positive_cap_feats whenever a regional mask is provided, turning the previously inferred invariant into an enforced one rather than a silent no-op. Also build the (bsz, 1, S, S) float mask lazily: compute applied_regional from cheap scalar checks first and skip materializing/cloning the full mask on passes that never match (every negative pass), avoiding a ~33 MB bf16 clone per call.
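A small sketch of why the length heuristic collides and how an identity check avoids it. Everything here is illustrative: `FakeTensor`, `select_mask`, the 7-token negative prompt, and the sequence length `S = 4096` are assumptions made for the example, not names or values from the patch.

```python
def pad_to_multiple(n, multiple=32):
    """Round n up to the next multiple (ceiling division)."""
    return -(-n // multiple) * multiple


class FakeTensor:
    """Stand-in for a caption tensor; only length and object identity matter."""
    def __init__(self, length):
        self.length = length


def matches_by_length(cap_feats, positive_cap_feats):
    # Old heuristic: compare padded caption lengths.
    return pad_to_multiple(cap_feats.length) == pad_to_multiple(positive_cap_feats.length)


def matches_by_identity(cap_feats, positive_cap_feats):
    # New check: the exact object the regional mask was built for.
    return cap_feats is positive_cap_feats


def select_mask(cap_feats, positive_cap_feats, build_regional_mask, padding_mask):
    """Build the expensive regional mask only on the conditioned pass."""
    if cap_feats is positive_cap_feats:
        return build_regional_mask()
    return padding_mask


positive = FakeTensor(20)   # 20-token positive prompt, pads to 32
negative = FakeTensor(7)    # hypothetical short negative prompt, also pads to 32

collides = matches_by_length(negative, positive)
confused = matches_by_identity(negative, positive)

build_calls = []
def build_regional_mask():
    build_calls.append(1)
    return "regional"

negative_mask = select_mask(negative, positive, build_regional_mask, "padding")
calls_after_negative = len(build_calls)
positive_mask = select_mask(positive, positive, build_regional_mask, "padding")

# Size of one (S, S) bf16 mask for an assumed S = 4096 (2 bytes per element).
S = 4096
mask_bytes = S * S * 2
```

With the assumed `S`, one mask comes to roughly 33.5 MB, which is in line with the per-call clone the commit message says is now skipped on negative passes.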
Summary
fix(z-image): repair & realign Regional Guidance after diffusers refactor
Using "Regional Guidance" layers with Z-Image crashed with:
Why: The regional-prompting patch (z_image_transformer_patch.py) was a hand-copied snapshot of an older ZImageTransformer2DModel.forward. The installed diffusers version refactored that method: in particular, _pad_with_ids now produces caption pos_ids that are longer than the caption feature tensor (the pos grid is built at the padded length and extra pad pos-ids are appended). The model copes by splitting RoPE embeddings by pos_ids lengths and truncating, but the stale patch split by feature lengths, which caused the size-mismatch crash. FLUX.2 was unaffected because it doesn't use this code path.

How:

- No more drift: create_regional_forward now delegates to the model's own helpers (patchify_and_embed, _prepare_sequence, _build_unified_sequence) instead of re-implementing patchify/RoPE/padding logic. It only overrides the main-layer attention mask, so it stays in sync with upstream diffusers.
- Mask alignment: The unified sequence pads the image and caption blocks individually to a multiple of 32, so the real per-item layout is [img_real | img_pad | txt_real | txt_pad]. The four regional sub-blocks (img→img, img→txt, txt→img, txt→txt) are now scattered into their padding-aware positions instead of being placed in a contiguous top-left block (which only happened to align at square 1024×1024). This fixes regional guidance silently having no effect at most other resolutions.
- CFG/negative pass: The patched forward also runs for the negative prompt (different text length). The regional mask was built for the positive prompt only, so it is now applied only to passes whose caption length matches the positive prompt; other passes fall back to the plain padding mask.
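The padding-aware scatter described under mask alignment can be sketched in plain Python. This is a simplified illustration, not the patch's implementation: the function names are hypothetical, the compact regional mask is assumed to be ordered [img | txt], and padding positions are simply left masked out rather than merged with the model's padding mask.

```python
def pad_to_multiple(n, multiple=32):
    """Round n up to the next multiple (ceiling division)."""
    return -(-n // multiple) * multiple


def real_positions(img_len, txt_len, multiple=32):
    """Indices of real tokens in [img_real | img_pad | txt_real | txt_pad]."""
    img_padded = pad_to_multiple(img_len, multiple)
    img_idx = list(range(img_len))
    txt_idx = list(range(img_padded, img_padded + txt_len))
    total = img_padded + pad_to_multiple(txt_len, multiple)
    return img_idx, txt_idx, total


def scatter_regional_mask(regional, img_len, txt_len, multiple=32):
    """Place a compact (img_len + txt_len)^2 mask into the padded layout.

    Covers all four sub-blocks (img->img, img->txt, txt->img, txt->txt)
    because rows and columns are both remapped through the same index list.
    """
    img_idx, txt_idx, total = real_positions(img_len, txt_len, multiple)
    positions = img_idx + txt_idx  # compact index -> unified index
    unified = [[False] * total for _ in range(total)]
    for r, unified_r in enumerate(positions):
        for c, unified_c in enumerate(positions):
            unified[unified_r][unified_c] = regional[r][c]
    return unified


# Hypothetical sizes: 40 image tokens and 20 caption tokens.
img_len, txt_len = 40, 20
regional = [[True] * (img_len + txt_len) for _ in range(img_len + txt_len)]
unified = scatter_regional_mask(regional, img_len, txt_len)
```

In this example the caption tokens start at index 64, not 40, so a contiguous top-left placement would write the img→txt block into image padding columns. The two placements coincide only when the image length is already a multiple of 32.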
Related Issues / Discussions
Closes #9251
QA Instructions
split_with_sizes error; it should now complete successfully.

Merge Plan

Standard merge. Touches only invokeai/backend/z_image/z_image_transformer_patch.py.

Checklist

What's New copy (if doing a release after this PR)