These examples demonstrate how to quantize MoE models using llm-compressor. We'll walk through the GLM-4.7 example which applies AWQ quantization to create a W4A16 (4-bit weights, 16-bit activations) model.
To get started, install llm-compressor from source:
git clone https://github.com/vllm-project/llm-compressor.git
cd llm-compressor
pip install -e .
You can run the complete example with:
python3 glm4_7_example.py
This example demonstrates quantizing the zai-org/GLM-4.7 MoE model using AWQ (Activation-aware Weight Quantization) to 4-bit precision. The process automatically handles MoE-specific calibration requirements.
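To make the W4A16 label concrete, the sketch below shows plain symmetric round-to-nearest quantization of weights in groups, in pure Python. It is not AWQ itself: AWQ additionally rescales weight channels using activation statistics before rounding, which is not shown here. The function names, the symmetric scheme, and the tiny group size are illustrative assumptions; the "G128" suffix used later in the save directory suggests the real example groups 128 weights per scale.

```python
def quantize_group(weights, num_bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (num_bits - 1) - 1      # 7 for 4-bit
    qmin = -(2 ** (num_bits - 1))       # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    # One floating-point scale is shared by the whole group
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [v * scale for v in q]


def quantize_row(row, group_size=128, num_bits=4):
    """Quantize a row of weights group by group; returns a list of (codes, scale)."""
    return [
        quantize_group(row[start:start + group_size], num_bits)
        for start in range(0, len(row), group_size)
    ]


# Toy usage with a group size of 4 so the result is easy to inspect
row = [0.7, -0.35, 0.1, 0.0, 2.0, -1.0, 0.5, 0.25]
groups = quantize_row(row, group_size=4)
for codes, scale in groups:
    print(codes, scale)
```

Each group stores 4-bit integer codes plus one scale, so the rounding error per weight is bounded by half of that group's scale. Activations stay in 16-bit and are not touched by this step.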
First, load the GLM-4.7 model and its tokenizer from the Hugging Face Hub:
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modeling.glm4_moe import CalibrationGlm4MoeMoE # noqa: F401
model_id = "zai-org/GLM-4.7"
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
Important: The import of CalibrationGlm4MoeMoE is crucial for proper MoE calibration. This custom module automatically replaces the original Glm4MoeMoE class during calibration to ensure all experts are properly calibrated, even those that wouldn't normally be activated for certain tokens. More details on this can be found in Quantizing MoEs with a custom definition.
Load and preprocess a calibration dataset. In this example, we use ultrachat_200k:
from datasets import load_dataset
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
# Load and shuffle the dataset
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)
# Apply chat template
def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }
ds = ds.map(preprocess)
# Tokenize
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )
ds = ds.map(tokenize, remove_columns=ds.column_names)
Note: 512 calibration samples is a good starting point. Increasing the number of samples can improve quantization accuracy.
Define which layers to quantize and which to ignore. GLM-4.7 has dense layers at the beginning that should be excluded:
from llmcompressor.modifiers.awq import AWQModifier
moe_ignores = [
    # Layers 0-2: Dense layers - ignore entire layers
    "model.layers.0.*",
    "model.layers.1.*",
    "model.layers.2.*",
    # Ignore the output head
    "lm_head",
]
# Configure AWQ with W4A16 (4-bit weights, 16-bit activations)
recipe = AWQModifier(targets="Linear", scheme="W4A16", ignore=moe_ignores)
Why ignore these layers?
lm_head (language model head) is typically kept at higher precision for better output quality.
Apply the quantization recipe using the oneshot method:
from llmcompressor import oneshot
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
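The oneshot method runs the calibration samples through the model, using the calibration-friendly MoE definition, and applies the AWQ recipe to the targeted layers.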
Save the compressed model to disk:
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
The model will be saved in a compressed format with 4-bit weights, ready for vLLM inference.
A calibration-friendly MoE block definition is required whenever an MoE model is quantized with a scheme that needs calibration data (for example, schemes where activations are not dynamic, such as FP8 or INT8 per-tensor activations, or NVFP4) or with an algorithm that needs data (such as GPTQ, AWQ, or AutoRound).
Examples of calibration-friendly definitions can be found in the modeling folder. Each definition enables an MoE calibration context by inheriting from the MoECalibrationModule class and registering the MoE block that should be replaced with a custom definition.
In particular, each model-specific definition includes an updated forward pass that ensures all tokens are routed through all experts during calibration, including experts that would not normally be activated. Only the activated experts contribute to the final output of the MoE block. This behavior ensures proper calibration of all expert layers.
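The routing behavior described above can be illustrated with a toy MoE block in pure Python. This is a minimal sketch under stated assumptions, not the library's implementation: the scalar "tokens", the fixed router, and the names `moe_forward` and `calibrate_all_experts` are all hypothetical.

```python
def route_top_k(router_scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(
        range(len(router_scores)), key=lambda i: router_scores[i], reverse=True
    )
    return ranked[:k]


def moe_forward(tokens, router, experts, k, calibrate_all_experts=False):
    """Toy MoE block over scalar tokens.

    Returns (outputs, seen), where seen[i] counts the tokens expert i processed.
    """
    seen = [0] * len(experts)
    outputs = []
    for x in tokens:
        scores = router(x)
        active = route_top_k(scores, k)
        total = sum(scores[i] for i in active)
        out = 0.0
        for i, expert in enumerate(experts):
            if i in active or calibrate_all_experts:
                y = expert(x)  # a calibration observer would see this call
                seen[i] += 1
                if i in active:
                    # Only routed experts contribute to the block output
                    out += (scores[i] / total) * y
        outputs.append(out)
    return outputs, seen


# Four toy experts that scale their input by different factors
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]


def router(x):
    # Hypothetical router that always prefers experts 0 and 1
    return [0.4, 0.3, 0.2, 0.1]


tokens = [1.0, 2.0, 3.0]
normal_out, normal_seen = moe_forward(tokens, router, experts, k=2)
calib_out, calib_seen = moe_forward(
    tokens, router, experts, k=2, calibrate_all_experts=True
)

print(normal_seen)  # [3, 3, 0, 0]
print(calib_seen)   # [3, 3, 3, 3]
print(normal_out == calib_out)  # True
```

With standard routing, experts 2 and 3 never see a token, so a calibration pass would collect no statistics for them. In the calibration mode every expert processes every token, while the block output is unchanged.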
These custom definitions replace the existing MoE implementations during oneshot processing. The replacement can be either temporary or permanent; in the temporary case, the original definition is restored after calibration. In the GLM-4.7 example above, the CalibrationGlm4MoeMoE custom definition registers a replacement of all Glm4MoeMoE instances from the transformers library with the calibration-friendly version. You can see this definition replacement applied in llmcompressor/modeling/glm4_moe.py.
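The register-and-replace pattern can be sketched in a few lines of plain Python. Everything here is hypothetical and chosen for illustration (the registry, the `register_calibration` decorator, the `calibration_context` manager, and the dict standing in for a model); it does not mirror llm-compressor's actual API, which lives in the modeling folder.

```python
from contextlib import contextmanager

# Hypothetical registry: original class name -> calibration-friendly class
CALIBRATION_REGISTRY = {}


def register_calibration(original_name, permanent=False):
    """Register a calibration-friendly replacement for a named MoE class."""
    def decorator(cls):
        cls.is_permanent = permanent
        CALIBRATION_REGISTRY[original_name] = cls
        return cls
    return decorator


class ToyMoE:
    """Stand-in for a model's original MoE block."""
    def __init__(self, num_experts):
        self.num_experts = num_experts


@register_calibration("ToyMoE")
class CalibrationToyMoE:
    """Stand-in for a calibration-friendly wrapper around ToyMoE."""
    def __init__(self, original):
        self.original = original
        self.num_experts = original.num_experts


@contextmanager
def calibration_context(model):
    """Swap registered blocks in `model` (a dict of name -> module) for calibration."""
    replaced = {}
    for name, module in list(model.items()):
        calib_cls = CALIBRATION_REGISTRY.get(type(module).__name__)
        if calib_cls is not None:
            replaced[name] = module
            model[name] = calib_cls(module)
    try:
        yield model
    finally:
        # Temporary replacements are restored; permanent ones stay in place
        for name, original in replaced.items():
            if not model[name].is_permanent:
                model[name] = original


model = {"layers.3.mlp": ToyMoE(num_experts=8), "lm_head": object()}
with calibration_context(model):
    during = type(model["layers.3.mlp"]).__name__
after = type(model["layers.3.mlp"]).__name__

print(during)  # CalibrationToyMoE
print(after)   # ToyMoE
```

In this sketch the replacement is temporary, so the original block is back in place once the context exits; registering with `permanent=True` would leave the calibration-friendly block in the model.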
Without a custom calibration-friendly definition, MoE experts may be calibrated incorrectly, which can result in numerical instability or NaNs.