









Z-Image-Turbo-SDA is a parameter-efficient LoKr (Low-Rank Kronecker Product) adapter designed to mitigate the "diversity collapse" problem in few-step distilled Flow Matching / diffusion models.
By applying our Semantic Directional Alignment (SDA) loss, this LoKr adapter recovers 70.2% of the original teacher model's compositional diversity (measured via LPIPS) while preserving the extreme sharpness and fast inference speed of the 8-step baseline model.
When distilling a 50-step diffusion model (Teacher) into an 8-step model (Student), the student often learns to shortcut the manifold, converging to the "mean" of the distribution. As a result, changing the random noise seed yields almost identical compositions (same poses, same layouts).
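A toy example (illustrative only, not part of the model's training) shows why shortcutting toward the mean kills seed-to-seed diversity: a student forced to produce one output regardless of seed minimizes MSE by predicting the mean of the modes, which belongs to neither mode.

```python
# Two "compositions" the teacher can produce: pose A (-1.0) and pose B (+1.0).
targets = [-1.0, 1.0]

def mse(pred: float) -> float:
    """Average squared error of a single (seed-independent) prediction."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)

# Search over constant predictions: the MSE-optimal one is the mean (0.0),
# i.e. a collapsed composition that matches neither pose.
candidates = [i / 100 for i in range(-150, 151)]
best = min(candidates, key=mse)
print(best)
```

The same pressure in high dimensions produces the "same poses, same layouts" symptom described above.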
Standard SFT (Supervised Fine-Tuning) fails to fix this because forcing the student to match the teacher's absolute velocity `v` destroys the rectified straight-line trajectory, leading to blurry, gray, or scrambled images.
Instead of matching raw velocity, we fine-tuned this model using a custom 4-pillar physics-based architecture, backed by the parameter-efficient LoKr structure:
Instead of matching the raw velocity `v`, we project predictions into the clean space (`x0 = z - v`, as in the training code below). We then apply spatial low-pass filtering (`AvgPool2d`, kernel size 8) to prevent the model from cheating via high-frequency noise. Finally, we use a standard cosine loss to align the semantic direction of the student's variance with the teacher's, leaving the student free to maintain its high-contrast magnitude.
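In symbols (our notation, matching the reference training code later in this card): with two noise samples `z1`, `z2` at the same timestep, clean-space projections `x0 = z - v`, and `P8` denoting 8×8 average pooling, the SDA objective is

```latex
\Delta \hat{x}_0 = \hat{x}_0(z_2) - \hat{x}_0(z_1), \qquad
\mathcal{L}_{\mathrm{SDA}} = 1 - \cos\!\left( P_8\!\left(\Delta \hat{x}_0^{S}\right),\; P_8\!\left(\Delta \hat{x}_0^{T}\right) \right)
```

where superscripts `S` and `T` denote the student and teacher predictions. Only the direction of the pooled delta is penalized; its magnitude is untouched.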
To prevent the diversity push from destroying the 8-step trajectory, we introduce a continuous anchor:

```python
sft_loss = F.huber_loss(v_student, v_baseline_detached, delta=0.08)
```

This acts as an elastic tether: if the model strays far enough to change the composition, the Huber loss applies only a constant (L1) pull that polishes high-frequency details without violently snapping the model back to the collapsed mean.
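To illustrate the "elastic tether" behavior (a self-contained sketch in plain Python, independent of the training code), the Huber penalty is quadratic inside `delta` and linear outside it, so its gradient magnitude is capped at `delta`:

```python
def huber(residual: float, delta: float = 0.08) -> float:
    """Pointwise Huber penalty: quadratic inside delta, linear (L1-like) outside."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r                # gentle pull near the anchor
    return delta * (r - 0.5 * delta)      # constant-slope pull far from it

# Far from the anchor the slope is capped at delta, so a large compositional
# change is "taxed" at a constant rate instead of being violently snapped back.
grad_far = (huber(1.0 + 1e-6) - huber(1.0)) / 1e-6    # ~delta = 0.08
grad_near = (huber(0.01 + 1e-6) - huber(0.01)) / 1e-6  # ~residual = 0.01
print(grad_far, grad_near)
```

An MSE anchor, by contrast, would pull back with force proportional to the deviation, crushing exactly the large compositional moves the diversity loss is trying to encourage.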
If a pre-trained assistant adapter is used, we apply it asymmetrically: it stays ON for the SFT (quality) branch and is temporarily switched OFF for the diversity branch, as shown in the training code below.
This adapter is trained as a LoKr (LyCORIS ecosystem). Modern versions of `diffusers` coupled with `peft` natively support loading LoKr weights.
```python
from diffusers import DiffusionPipeline
import torch

# 1. Load the base 8-step distilled model
pipeline = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.float16,
).to("cuda")

# 2. Load the SDA Diversity LoKr Adapter
# (Ensure you have the latest `diffusers` and `peft` installed)
pipeline.load_lora_weights("F16/z-image-turbo-sda", adapter_name="sda_diversity")

prompt = "A lone traveler standing on a mountain peak, epic fantasy lighting"

# 3. Generate diverse images with different seeds!
# You will now get different poses, camera angles, and layouts across seeds.
for seed in [42, 123, 777, 999]:
    generator = torch.Generator(device="cuda").manual_seed(seed)
    image = pipeline(
        prompt=prompt,
        num_inference_steps=8,
        guidance_scale=1,
        generator=generator,
    ).images[0]
    image.save(f"traveler_seed_{seed}.png")
```
Evaluated across complex prompts and 16 random seeds per prompt at Step 2500:
| Model Variant | Avg LPIPS (Perceptual Diversity) | Pixel StdDev (Magnitude Variance) |
|---|---|---|
| 8-Step Baseline (Collapsed) | 0.564 | 0.190 |
| 8-Step + SDA LoKr (Ours) | 0.691 (+0.127) | 0.209 (Pristine Sharpness) |
| 50-Step Teacher | 0.745 | 0.243 |
Why this matters: The SDA LoKr successfully decoupled structural diversity from destructive noise. It bumped the macro-compositional diversity (LPIPS) close to the 50-step teacher, while strictly constraining the Pixel StdDev to prevent the "darkening/blurring" effect typical in continuous ODE fine-tuning.
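The 70.2% recovery figure quoted earlier follows directly from the table; as a quick sanity check using only the reported LPIPS numbers:

```python
# Avg LPIPS values from the results table above.
baseline, ours, teacher = 0.564, 0.691, 0.745

# Fraction of the baseline-to-teacher diversity gap closed by the SDA LoKr.
recovery = (ours - baseline) / (teacher - baseline)
print(f"{recovery:.1%}")  # 70.2%
```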
The core loss mechanism can be easily integrated into any Diffusers / Flow Matching training loop.
```python
import torch
import torch.nn.functional as F

def compute_sda_loss(student_model, teacher_model, z1, z2, t_continuous, v_self_detached):
    # 1. Teacher predictions (no grad): project velocities into clean space (x0 = z - v)
    with torch.no_grad():
        x0_T_z1 = z1 - teacher_model(z1, t_continuous)
        x0_T_z2 = z2 - teacher_model(z2, t_continuous)

    # 2. Student baseline (from the detached SFT forward, adapter ON)
    x0_S_z1 = z1 - v_self_detached

    # 3. Asymmetric adapter masking: adapter OFF for the diversity branch.
    #    Temporarily disable the pre-trained assistant adapter to expose
    #    the raw collapsed manifold and generate strong diversity gradients.
    student_model.assistant_adapter.is_active = False
    try:
        x0_S_z2 = z2 - student_model(z2, t_continuous)  # student forward, adapter OFF
    finally:
        # Crucial: turn it back ON for subsequent SFT / validation steps
        student_model.assistant_adapter.is_active = True

    # Deltas across the two noise samples
    delta_T = x0_T_z2 - x0_T_z1
    delta_S = x0_S_z2 - x0_S_z1

    # 4. Spatial low-pass filter (force macro-composition changes, prevent high-freq cheating)
    pooled_T = F.avg_pool2d(delta_T, kernel_size=8).view(delta_T.size(0), -1)
    pooled_S = F.avg_pool2d(delta_S, kernel_size=8).view(delta_S.size(0), -1)

    # 5. Cosine loss (direction alignment without magnitude penalty)
    cos_sim = F.cosine_similarity(pooled_S, pooled_T, dim=-1)
    return (1.0 - cos_sim).mean()


# --- Training-loop integration (v_pol_z1 is the student's adapter-ON prediction at z1) ---
# SFT baseline forward (adapter ON) -> protects pristine image quality
with torch.no_grad():
    v_self = student_baseline(xt_z1, t)

# Self-reference Huber loss (elastic anchor to preserve the 8-step trajectory)
sft_loss = F.huber_loss(v_pol_z1, v_self.detach(), delta=0.08)

# Time-segmented diversity: enabled only above LOW_NOISE_THRESHOLD.
# Skip index 7 (extremely low noise) to lock in final pixel sharpness.
if t > LOW_NOISE_THRESHOLD:
    div_loss = compute_sda_loss(student_model, teacher_model, z1, z2, t, v_pol_z1.detach())
    loss = sft_loss + diversity_lambda * div_loss
else:
    loss = sft_loss  # pure SFT
```
Training configuration highlights: LoKr with `full_rank` and `factor=8`, `weight_decay=0.001`, and `div_skip` disabling the diversity loss at extremely low noise (index 7).

The adapter can be combined with other adapters, but with potential compatibility risks: you will need to balance the weights manually (e.g., 0.5 ~ 0.7). Note the inherent trade-off: reducing the weight weakens the "semantic push," causing diversity to regress back toward the collapsed mean. You will need to find the "sweet spot" based on your specific needs.

How much diversity a given prompt recovers depends on the "topological constraints" the prompt places on the composition space.
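For readers unfamiliar with LoKr, here is a minimal sketch (with illustrative layer sizes, not the model's actual shapes) of why a Kronecker-factored update with `factor=8` is so parameter-efficient: the weight delta is expressed as a Kronecker product of a small `factor x factor` matrix with a second factor covering the rest of the dimensions.

```python
import numpy as np

# A hypothetical 1024x1024 linear layer (illustrative only).
out_dim, in_dim, factor = 1024, 1024, 8

# LoKr-style update: delta_W = kron(W1, W2), W1 of size (factor, factor),
# W2 full-rank (as in the `full_rank` configuration above).
W1 = np.random.randn(factor, factor)
W2 = np.random.randn(out_dim // factor, in_dim // factor)
delta_W = np.kron(W1, W2)

dense_params = out_dim * in_dim      # 1,048,576 for a dense update
lokr_params = W1.size + W2.size      # 64 + 16,384 = 16,448
print(delta_W.shape, lokr_params / dense_params)
```

The factored update spans the full layer shape while storing roughly 1.6% of the dense parameter count in this example.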
SDA is a result of my personal study and experimental journey into the underlying mechanics of Flow Matching models. As such, it is far from perfect.
The current v1 release was derived from our best internal checkpoint (div_27_2500, which achieved up to 70% diversity recovery), followed by an additional 360 steps of quality-focused fine-tuning.
Solving the zero-sum game between "macro-compositional diversity" and "micro-anatomical rigidity" within an 8-step inference framework remains an open challenge. I hope to find a more elegant solution in the future, and I welcome the community to build upon this exploration.
If you use this model or the SDA methodology in your research, please cite:
```bibtex
@misc{sda_diversity_loss_2026,
  title  = {Teacher-Guided Semantic Directional Alignment (SDA) for Restoring Diversity in Few-Step Distilled Models},
  author = {Fok},
  year   = {2026},
  url    = {https://huggingface.co/F16/z-image-turbo-sda}
}
```