









Z-Image-Turbo-SDA is a parameter-efficient LoKr (Low-Rank Kronecker Product) adapter designed to mitigate the "diversity collapse" problem in few-step distilled Flow Matching / diffusion models.
By applying our Semantic Directional Alignment (SDA) loss, this LoKr adapter recovers 70.2% of the original teacher model's compositional diversity (measured via LPIPS) while preserving the extreme sharpness and fast inference speed of the 8-step baseline model.
When distilling a 50-step diffusion model (Teacher) into an 8-step model (Student), the student often learns to shortcut the manifold, converging to the "mean" of the distribution. As a result, changing the random noise seed yields almost identical compositions (same poses, same layouts).
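A toy example (illustrative only, not part of the model's training) shows why shortcutting toward the mean kills seed-to-seed diversity: a student forced to produce one output regardless of seed minimizes MSE by predicting the mean of the modes, which belongs to neither mode.

```python
# Two "compositions" the teacher can produce: pose A (-1.0) and pose B (+1.0).
targets = [-1.0, 1.0]

def mse(pred: float) -> float:
    """Average squared error of a single (seed-independent) prediction."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)

# Search over constant predictions: the MSE-optimal one is the mean (0.0),
# i.e. a collapsed composition that matches neither pose.
candidates = [i / 100 for i in range(-150, 151)]
best = min(candidates, key=mse)
print(best)
```

The same pressure in high dimensions produces the "same poses, same layouts" symptom described above.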
Standard SFT (Supervised Fine-Tuning) fails to fix this because forcing the student to match the teacher's absolute velocity `v` destroys the rectified straight-line trajectory, leading to blurry, gray, or scrambled images.
Instead of matching raw velocity, we fine-tuned this model using a custom 4-pillar physics-based architecture, backed by the parameter-efficient LoKr structure:
Instead of matching the raw velocity `v`, we project predictions into the clean space (`x0 = z - v`, as in the training code below). We then apply spatial low-pass filtering (`AvgPool2d`, kernel size 8) to prevent the model from cheating via high-frequency noise. Finally, we use a standard cosine loss to align the semantic direction of the student's variance with the teacher's, leaving the student free to maintain its high-contrast magnitude.
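In symbols (our notation, matching the reference training code later in this card): with two noise samples `z1`, `z2` at the same timestep, clean-space projections `x0 = z - v`, and `P8` denoting 8×8 average pooling, the SDA objective is

```latex
\Delta \hat{x}_0 = \hat{x}_0(z_2) - \hat{x}_0(z_1), \qquad
\mathcal{L}_{\mathrm{SDA}} = 1 - \cos\!\left( P_8\!\left(\Delta \hat{x}_0^{S}\right),\; P_8\!\left(\Delta \hat{x}_0^{T}\right) \right)
```

where superscripts `S` and `T` denote the student and teacher predictions. Only the direction of the pooled delta is penalized; its magnitude is untouched.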
To prevent the diversity push from destroying the 8-step trajectory, we introduce a continuous anchor:

```python
sft_loss = F.huber_loss(v_student, v_baseline_detached, delta=0.08)
```

This acts as an elastic tether: if the model strays far enough to change the composition, the Huber loss applies only a constant (L1) pull that polishes high-frequency details without violently snapping the model back to the collapsed mean.
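To illustrate the "elastic tether" behavior (a self-contained sketch in plain Python, independent of the training code), the Huber penalty is quadratic inside `delta` and linear outside it, so its gradient magnitude is capped at `delta`:

```python
def huber(residual: float, delta: float = 0.08) -> float:
    """Pointwise Huber penalty: quadratic inside delta, linear (L1-like) outside."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r                # gentle pull near the anchor
    return delta * (r - 0.5 * delta)      # constant-slope pull far from it

# Far from the anchor the slope is capped at delta, so a large compositional
# change is "taxed" at a constant rate instead of being violently snapped back.
grad_far = (huber(1.0 + 1e-6) - huber(1.0)) / 1e-6    # ~delta = 0.08
grad_near = (huber(0.01 + 1e-6) - huber(0.01)) / 1e-6  # ~residual = 0.01
print(grad_far, grad_near)
```

An MSE anchor, by contrast, would pull back with force proportional to the deviation, crushing exactly the large compositional moves the diversity loss is trying to encourage.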
If a pre-trained assistant adapter is used, we apply it asymmetrically: it stays ON for the SFT (quality) branch and is temporarily switched OFF for the diversity branch, as shown in the training code below.
This adapter is trained as a LoKr (LyCORIS ecosystem). Modern versions of `diffusers` coupled with `peft` natively support loading LoKr weights.
```python
from diffusers import DiffusionPipeline
import torch

# 1. Load the base 8-step distilled model
pipeline = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.float16,
).to("cuda")

# 2. Load the SDA Diversity LoKr Adapter
# (Ensure you have the latest `diffusers` and `peft` installed)
pipeline.load_lora_weights("F16/z-image-turbo-sda", adapter_name="sda_diversity")

prompt = "A lone traveler standing on a mountain peak, epic fantasy lighting"

# 3. Generate diverse images with different seeds!
# You will now get different poses, camera angles, and layouts across seeds.
for seed in [42, 123, 777, 999]:
    generator = torch.Generator(device="cuda").manual_seed(seed)
    image = pipeline(
        prompt=prompt,
        num_inference_steps=8,
        guidance_scale=1,
        generator=generator,
    ).images[0]
    image.save(f"traveler_seed_{seed}.png")
```
Evaluated across complex prompts and 16 random seeds per prompt at Step 2500:
| Model Variant | Avg LPIPS (Perceptual Diversity) | Pixel StdDev (Magnitude Variance) |
|---|---|---|
| 8-Step Baseline (Collapsed) | 0.564 | 0.190 |
| 8-Step + SDA LoKr (Ours) | 0.691 (+0.127) | 0.209 (Pristine Sharpness) |
| 50-Step Teacher | 0.745 | 0.243 |
Why this matters: The SDA LoKr successfully decoupled structural diversity from destructive noise. It bumped the macro-compositional diversity (LPIPS) close to the 50-step teacher, while strictly constraining the Pixel StdDev to prevent the "darkening/blurring" effect typical in continuous ODE fine-tuning.
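The 70.2% recovery figure quoted earlier follows directly from the table; as a quick sanity check using only the reported LPIPS numbers:

```python
# Avg LPIPS values from the results table above.
baseline, ours, teacher = 0.564, 0.691, 0.745

# Fraction of the baseline-to-teacher diversity gap closed by the SDA LoKr.
recovery = (ours - baseline) / (teacher - baseline)
print(f"{recovery:.1%}")  # 70.2%
```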
The core loss mechanism can be easily integrated into any Diffusers / Flow Matching training loop.
```python
import torch
import torch.nn.functional as F

def compute_sda_loss(student_model, teacher_model, z1, z2, t_continuous, v_self_detached):
    # 1. Teacher predictions (no grad): project velocities into clean space (x0 = z - v)
    with torch.no_grad():
        x0_T_z1 = z1 - teacher_model(z1, t_continuous)
        x0_T_z2 = z2 - teacher_model(z2, t_continuous)

    # 2. Student baseline (from the detached SFT forward, adapter ON)
    x0_S_z1 = z1 - v_self_detached

    # 3. Asymmetric adapter masking: adapter OFF for the diversity branch.
    #    Temporarily disable the pre-trained assistant adapter to expose
    #    the raw collapsed manifold and generate strong diversity gradients.
    student_model.assistant_adapter.is_active = False
    try:
        x0_S_z2 = z2 - student_model(z2, t_continuous)  # student forward, adapter OFF
    finally:
        # Crucial: turn it back ON for subsequent SFT / validation steps
        student_model.assistant_adapter.is_active = True

    # Deltas across the two noise samples
    delta_T = x0_T_z2 - x0_T_z1
    delta_S = x0_S_z2 - x0_S_z1

    # 4. Spatial low-pass filter (force macro-composition changes, prevent high-freq cheating)
    pooled_T = F.avg_pool2d(delta_T, kernel_size=8).view(delta_T.size(0), -1)
    pooled_S = F.avg_pool2d(delta_S, kernel_size=8).view(delta_S.size(0), -1)

    # 5. Cosine loss (direction alignment without magnitude penalty)
    cos_sim = F.cosine_similarity(pooled_S, pooled_T, dim=-1)
    return (1.0 - cos_sim).mean()


# --- Training-loop integration (v_pol_z1 is the student's adapter-ON prediction at z1) ---
# SFT baseline forward (adapter ON) -> protects pristine image quality
with torch.no_grad():
    v_self = student_baseline(xt_z1, t)

# Self-reference Huber loss (elastic anchor to preserve the 8-step trajectory)
sft_loss = F.huber_loss(v_pol_z1, v_self.detach(), delta=0.08)

# Time-segmented diversity: enabled only above LOW_NOISE_THRESHOLD.
# Skip index 7 (extremely low noise) to lock in final pixel sharpness.
if t > LOW_NOISE_THRESHOLD:
    div_loss = compute_sda_loss(student_model, teacher_model, z1, z2, t, v_pol_z1.detach())
    loss = sft_loss + diversity_lambda * div_loss
else:
    loss = sft_loss  # pure SFT
```
Training configuration highlights: LoKr with `full_rank` and `factor=8`, `weight_decay=0.001`, and `div_skip` disabling the diversity loss at extremely low noise (index 7).

The adapter can be combined with other adapters, but with potential compatibility risks: you will need to balance the weights manually (e.g., 0.5 ~ 0.7). Note the inherent trade-off: reducing the weight weakens the "semantic push," causing diversity to regress back toward the collapsed mean. You will need to find the "sweet spot" based on your specific needs.

How much diversity a given prompt recovers depends on the "topological constraints" the prompt places on the composition space.
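For readers unfamiliar with LoKr, here is a minimal sketch (with illustrative layer sizes, not the model's actual shapes) of why a Kronecker-factored update with `factor=8` is so parameter-efficient: the weight delta is expressed as a Kronecker product of a small `factor x factor` matrix with a second factor covering the rest of the dimensions.

```python
import numpy as np

# A hypothetical 1024x1024 linear layer (illustrative only).
out_dim, in_dim, factor = 1024, 1024, 8

# LoKr-style update: delta_W = kron(W1, W2), W1 of size (factor, factor),
# W2 full-rank (as in the `full_rank` configuration above).
W1 = np.random.randn(factor, factor)
W2 = np.random.randn(out_dim // factor, in_dim // factor)
delta_W = np.kron(W1, W2)

dense_params = out_dim * in_dim      # 1,048,576 for a dense update
lokr_params = W1.size + W2.size      # 64 + 16,384 = 16,448
print(delta_W.shape, lokr_params / dense_params)
```

The factored update spans the full layer shape while storing roughly 1.6% of the dense parameter count in this example.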
SDA is a result of my personal study and experimental journey into the underlying mechanics of Flow Matching models. As such, it is far from perfect.
The current v1 release was derived from our best internal checkpoint (div_27_2500, which achieved up to 70% diversity recovery), followed by an additional 360 steps of quality-focused fine-tuning.
Solving the zero-sum game between "macro-compositional diversity" and "micro-anatomical rigidity" within an 8-step inference framework remains an open challenge. I hope to find a more elegant solution in the future, and I welcome the community to build upon this exploration.
If you use this model or the SDA methodology in your research, please cite:
```bibtex
@misc{sda_diversity_loss_2026,
  title  = {Teacher-Guided Semantic Directional Alignment (SDA) for Restoring Diversity in Few-Step Distilled Models},
  author = {Fok},
  year   = {2026},
  url    = {https://huggingface.co/F16/z-image-turbo-sda}
}
```