llmcompressor supports quantizing weights and activations to fp8 for memory savings and inference acceleration with vllm.
fp8 computation is supported on NVIDIA GPUs with compute capability 8.9 or higher (Ada Lovelace, Hopper).
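Before running the example, it can help to confirm that the local GPU meets this requirement. The following is a minimal sketch, assuming PyTorch is installed. `supports_fp8` is a hypothetical helper name, not part of llmcompressor, and the `(8, 9)` threshold treats compute capability 8.9 (Ada Lovelace) as the minimum.

```python
def supports_fp8(capability):
    """Return True if a (major, minor) compute capability is at least 8.9."""
    # Hypothetical helper: tuple comparison orders (major, minor) pairs correctly.
    return tuple(capability) >= (8, 9)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        # get_device_capability returns a (major, minor) tuple for the device.
        capability = torch.cuda.get_device_capability(0)
        print(f"Compute capability {capability}: fp8 supported = {supports_fp8(capability)}")
    else:
        print("No CUDA device visible to PyTorch; cannot check fp8 support.")
```

An Ampere card such as an A100 reports `(8, 0)` and would not pass this check, while an H100 reports `(9, 0)` and would.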
To get started, install:
pip install llmcompressor
The example includes an end-to-end script for applying the quantization algorithm.
python3 llama3_example.py
The resulting model Meta-Llama-3-8B-Instruct-FP8-Dynamic is ready to be loaded into vLLM.
Now, we will step through the code in the example. There are three steps: load the model, apply quantization, and evaluate accuracy in vLLM.
Load the model using AutoModelForCausalLM
from transformers import AutoTokenizer, AutoModelForCausalLM
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
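For fp8 quantization, we can recover accuracy with simple post-training quantization (PTQ).
We recommend targeting all Linear layers using the FP8_DYNAMIC scheme, which uses static per-channel quantization on the weights and dynamic per-token quantization on the activations.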
Since simple PTQ does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
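To make the "no calibration data" point concrete, here is a simplified sketch of how quantization scales could be derived in such a scheme. It is an illustration only, not llmcompressor's implementation: it assumes the float8 E4M3 format (maximum finite magnitude 448), uses plain Python lists, and computes scales without rounding any values to fp8. `per_row_scales` is a hypothetical helper.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in float8 E4M3 (assumed target format)


def per_row_scales(matrix):
    """Compute one scale per row so the row's largest magnitude maps to FP8_E4M3_MAX."""
    scales = []
    for row in matrix:
        max_abs = max(abs(x) for x in row)
        # Fall back to 1.0 for an all-zero row to avoid dividing by zero later.
        scales.append(max_abs / FP8_E4M3_MAX if max_abs > 0 else 1.0)
    return scales


# Weights: shape [out_features, in_features]. Per-channel scales depend only on
# the weights themselves, so they can be computed once, offline, without data.
weights = [[0.5, -2.0], [0.1, 0.05]]
weight_scales = per_row_scales(weights)

# Activations: shape [num_tokens, hidden_size]. Per-token scales are recomputed
# from whatever activations arrive at inference time.
activations = [[1.0, -3.5], [224.0, 10.0]]
activation_scales = per_row_scales(activations)

print(weight_scales)
print(activation_scales)
```

The weight scales are fixed at quantization time, while the activation scales are a function of the input. This is why neither side needs a calibration dataset in this flow.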
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
# Configure the simple PTQ quantization
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)
# Apply the quantization algorithm.
oneshot(model=model, recipe=recipe)
# Save the model.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
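As an optional sanity check, the saved directory can be inspected before loading it elsewhere. The sketch below assumes that the quantization settings are written under a `quantization_config` key in `config.json`. That key name is an assumption about the saved format, and `read_quantization_config` is a hypothetical helper.

```python
import json
from pathlib import Path


def read_quantization_config(save_dir):
    """Return the "quantization_config" entry from config.json, or None if absent."""
    config_path = Path(save_dir) / "config.json"
    if not config_path.is_file():
        return None
    with config_path.open() as f:
        config = json.load(f)
    # Assumed key name; returns None if the checkpoint carries no such entry.
    return config.get("quantization_config")


if __name__ == "__main__":
    print(read_quantization_config("Meta-Llama-3-8B-Instruct-FP8-Dynamic"))
```

A result of `None` would suggest the directory is missing or the checkpoint was saved without quantization metadata.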
We have successfully created an fp8 model!
Install vllm and lm-evaluation-harness:
pip install vllm lm_eval==0.4.3
Load and run the model in vllm:
from vllm import LLM
model = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
model.generate("Hello my name is")
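The call above returns request outputs rather than printing text. A slightly fuller sketch, assuming vLLM's `SamplingParams` and the `outputs[0].text` field on each returned request, might look like the following. The sampling values are illustrative, and `completion_texts` and `run_demo` are hypothetical helper names.

```python
def completion_texts(request_outputs):
    """Extract the first generated completion from each vLLM request output."""
    return [request.outputs[0].text for request in request_outputs]


def run_demo(model_path="./Meta-Llama-3-8B-Instruct-FP8-Dynamic"):
    # Imported lazily so the helper above can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model_path)
    # Illustrative values: greedy decoding, short completion.
    params = SamplingParams(temperature=0.0, max_tokens=32)
    outputs = llm.generate(["Hello my name is"], params)
    for text in completion_texts(outputs):
        print(text)
```

Calling `run_demo()` on a machine with vLLM installed and a supported GPU should print one completion for the prompt.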
Evaluate accuracy with lm_eval (for example on 250 samples of gsm8k):
Note: quantized models can be sensitive to the presence of the bos token. lm_eval does not add a bos token by default, so make sure to include the add_bos_token=True argument when running your evaluations.
MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
lm_eval \
--model vllm \
--model_args pretrained=$MODEL,add_bos_token=True \
--tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250
We can see the resulting scores look good:
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.768|±  |0.0268|
|     |       |strict-match    |     5|exact_match|↑  |0.768|±  |0.0268|
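With only 250 samples, the reported standard error is worth keeping in mind when comparing against the unquantized model. As a rough sketch, assuming the Stderr column is the standard error of the mean of 250 per-sample 0/1 scores (an assumption about how the figure is computed), a value close to the reported 0.0268 can be reproduced and turned into an approximate 95% interval. `sample_stderr` is a hypothetical helper.

```python
import math


def sample_stderr(accuracy, n):
    """Standard error of the mean of n 0/1 scores, using the sample (n - 1) variance."""
    return math.sqrt(accuracy * (1.0 - accuracy) / (n - 1))


accuracy, n = 0.768, 250
stderr = sample_stderr(accuracy, n)

# Normal-approximation interval; 1.96 is the two-sided 95% z-value.
low, high = accuracy - 1.96 * stderr, accuracy + 1.96 * stderr
print(f"stderr ~ {stderr:.4f}, approx. 95% interval: [{low:.3f}, {high:.3f}]")
```

An interval this wide means that differences of a point or two relative to a baseline run are within noise at this sample size. Raising or removing `--limit` narrows it.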
If you have questions or feature requests, please open an issue on vllm-project/llm-compressor.