The Z-Engineer returns — now with a PhD in "not being mid."
This is Z-Engineer V4, the culmination of extensive research into what makes an AI prompt engineer actually good at its job. Built on the Qwen 3 architecture and trained using a novel SMART Training methodology, this 4B parameter model doesn't just describe scenes—it understands the craft of visual storytelling down to the lens flare.
Z-Engineer V4 is a fully fine-tuned (not LoRA, we went all in) version of the text encoder from Tongyi-MAI/Z-Image-Turbo. It's been specifically trained to understand the nuances of AI Image Generation workflows.
This version introduces SMART Training (Smart Mode with Adaptive Regularization Topologer)—a custom training methodology that goes beyond standard cross-entropy optimization.
The secret sauce? Four auxiliary regularizers that operate on hidden states, logits, and weight matrices:
| Regularizer | What It Does | Why It Matters |
|---|---|---|
| Entropic | Prevents mode collapse, encourages diversity | No more repetitive "cinematic, 8k, masterpiece" loops |
| Holographic | Enforces depth-wise information compression | Clean feature hierarchy from surface to abstract |
| Topological | Encourages coherent latent trajectories | Prompts flow logically instead of word salad |
| Manifold | Stabilizes weight distributions | Rock-solid training dynamics |
The result? A model that generalizes better, outputs more varied responses, and doesn't collapse into repetitive patterns even after 55,000 training examples.
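The SMART regularizers themselves aren't published in code form, so as a rough illustration of the idea behind the entropic one: penalize low-entropy token distributions so the optimizer is rewarded for diversity. A minimal numpy sketch (the function names and exact formulation here are my own, not the actual training code):

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def entropic_penalty(logits):
    """Negative mean entropy of the predicted token distributions.

    Added to the cross-entropy loss with a small coefficient, this
    term pushes the model toward higher-entropy (more varied) outputs,
    discouraging collapse into repetitive "cinematic, 8k" loops.
    """
    p = softmax(logits)
    entropy = -(p * np.log(p + 1e-9)).sum(axis=-1)
    return -entropy.mean()  # more negative = more diverse

# A uniform distribution is rewarded (large negative penalty);
# a peaked, mode-collapsed one is not.
flat = np.array([[0.0, 0.0, 0.0, 0.0]])
peaked = np.array([[10.0, 0.0, 0.0, 0.0]])
assert entropic_penalty(flat) < entropic_penalty(peaked)
```

The other three regularizers (holographic, topological, manifold) operate on hidden states and weight matrices rather than logits, so they don't reduce to a one-liner like this.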
I have a custom node for seamless integration with ComfyUI.
For best results, use this system prompt:
```
Interpret the user seed as production intent, then build a definitive 200-250 word single-paragraph image prompt that preserves every explicit constraint while intelligently expanding missing details. First infer the core subject, action, setting, and emotional tone; treat these as non-negotiable anchors. Then enhance with precise visual staging (explicit foreground, midground, background), clear visual hierarchy and eye path, physically plausible lighting (source, direction, softness, color temperature), and optical strategy (if lens/aperture are provided, preserve exactly; if absent, choose fitting lens and aperture and imply their depth-of-field effect). Integrate organic, manufactured, and environmental textures with realistic material behavior, add motion/atmospheric cues only when they support the scene, and apply a coherent color grade consistent with mood and environment. Keep the prose vivid but controlled: no contradictions, no overstuffing, no generic filler. Do not mention camera body brands. Output one polished paragraph only, no bullets, no line breaks, no meta commentary.
```
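Outside ComfyUI, the system prompt slots into any chat-style API as a `system` message. A sketch of building a request payload for Ollama's `/api/chat` endpoint (the helper function and abbreviated prompt string are illustrative; paste in the full system prompt above):

```python
import json

# Abbreviated here for readability -- use the full system prompt above.
SYSTEM_PROMPT = (
    "Interpret the user seed as production intent, then build a definitive "
    "200-250 word single-paragraph image prompt..."
)

def build_chat_request(user_seed: str) -> dict:
    """Payload for POST /api/chat on a local Ollama server."""
    return {
        "model": "BennyDaBall/Qwen3-4b-Z-Image-Engineer-V4",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_seed},
        ],
        "stream": False,
    }

payload = build_chat_request("neon-lit ramen stall in the rain")
print(json.dumps(payload, indent=2))
```

The same message structure works with LM Studio's OpenAI-compatible server; only the endpoint URL changes.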
I believe in open science. Here's exactly how this was built:
Hardware:
Dataset:
Training Configuration:
| Parameter | Value |
|---|---|
| Method | Full Fine-Tune (not LoRA) |
| Base Model | Qwen3-4b-Z-Image-Turbo-AbliteratedV1 |
| Optimizer Steps | 7,500+ |
| Batch Size | 2 (micro-batch) × 8 (gradient accumulation) = 16 effective |
| Learning Rate | 1e-5 (cosine decay with 5% warmup) |
| Precision | BFloat16 |
| Sequence Length | 640 tokens |
| Total Training Time | ~90 hours |
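The learning-rate schedule from the table (1e-5 peak, cosine decay, 5% warmup over 7,500 steps) is a standard shape; whether SMART decays all the way to zero isn't stated, so this sketch assumes it does:

```python
import math

PEAK_LR = 1e-5
TOTAL_STEPS = 7500
WARMUP = int(0.05 * TOTAL_STEPS)  # 5% warmup = 375 steps

def lr_at(step: int) -> float:
    """Linear warmup to the peak LR, then cosine decay to zero."""
    if step < WARMUP:
        return PEAK_LR * step / WARMUP
    progress = (step - WARMUP) / (TOTAL_STEPS - WARMUP)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

assert lr_at(0) == 0.0                      # starts cold
assert abs(lr_at(WARMUP) - PEAK_LR) < 1e-12 # peaks after warmup
assert lr_at(TOTAL_STEPS) < 1e-12           # fully decayed
```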
I provide a full suite of GGUF quantizations for use with llama.cpp, Ollama, and LM Studio:
| Quantization | Size | Notes |
|---|---|---|
| F16 | 8.0 GB | Full precision, maximum quality |
| Q8_0 | 4.3 GB | Near-lossless, recommended for most users |
| Q6_K | 3.3 GB | Great balance of quality and size |
| Q5_K_M | 2.9 GB | Good quality, smaller footprint |
| Q5_K_S | 2.8 GB | Slightly smaller Q5 variant |
| Q4_K_M | 2.5 GB | Solid 4-bit, good for VRAM-limited setups |
| Q4_K_S | 2.4 GB | Smaller 4-bit variant |
| Q3_K_L | 2.2 GB | Lower quality 3-bit, for the desperate |
| Q3_K_M | 2.1 GB | Medium 3-bit |
| Q2_K | 1.7 GB | Emergency-only tier. But it exists! |
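A quick way to pick a file from the table given your memory budget. The sizes are hardcoded from the table above; the rule of thumb that you need the file size plus some headroom for KV cache and activations is an approximation, not an official requirement:

```python
QUANTS = [  # (name, size in GB) from the table, largest first
    ("F16", 8.0), ("Q8_0", 4.3), ("Q6_K", 3.3), ("Q5_K_M", 2.9),
    ("Q5_K_S", 2.8), ("Q4_K_M", 2.5), ("Q4_K_S", 2.4),
    ("Q3_K_L", 2.2), ("Q3_K_M", 2.1), ("Q2_K", 1.7),
]

def pick_quant(vram_gb: float, headroom_gb: float = 1.0) -> str:
    """Largest quant whose file fits in VRAM with headroom to spare."""
    for name, size in QUANTS:
        if size + headroom_gb <= vram_gb:
            return name
    return "Q2_K"  # emergency-only tier, per the table

assert pick_quant(6.0) == "Q8_0"    # 4.3 + 1.0 fits in 6 GB
assert pick_quant(3.2) == "Q3_K_L"  # 2.2 + 1.0 just fits
```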
With Ollama:
```shell
ollama run BennyDaBall/Qwen3-4b-Z-Image-Engineer-V4
```
With LM Studio: the GGUF files load directly in the app.
This model generates text for image prompts. While I have filtered the dataset to the best of my ability, users should exercise their own judgment. I am not responsible for the content you generate.
Also, if you use this to generate prompts for images that get you in trouble, that's a you problem. The model is just vibing.
Built with ❤️ and way too much GPU time by BennyDaBall