---
tags:
- fp4
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: apache-2.0
base_model: Qwen/Qwen3-32B
---

# Qwen3-32B-NVFP4A16

## Model Overview
- **Model Architecture:** Qwen/Qwen3-32B
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP4
  - **Activation quantization:** FP16
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- **Release Date:** 6/25/2025
- **Version:** 1.0
- **Model Developers:** RedHatAI

This model is a quantized version of [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B). It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

### Model Optimizations

This model was obtained by quantizing the weights of [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) to the FP4 data type, ready for inference with vLLM>=0.9.1.
This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.

Only the weights of the linear operators within transformer blocks are quantized using [LLM Compressor](https://github.com/vllm-project/llm-compressor).

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Qwen3-32B-NVFP4A16"
number_gpus = 2

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

## Creation

This model was created by applying [LLM Compressor with calibration samples from UltraChat](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a4_fp4/llama3_example.py), as presented in the code snippet below.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation

MODEL_ID = "Qwen/Qwen3-32B"

# Load model.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)

# Tokenize inputs.
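# Note: apply_chat_template() in preprocess() has already rendered the chat
# special tokens into "text", so add_special_tokens=False below avoids
# inserting them a second time.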
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to fp4 with per group 16 via ptq
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4A16", ignore=["lm_head"])

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4A16"

# Apply quantization.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    output_dir=SAVE_DIR,
)

print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

## Evaluation

This model was evaluated on the OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval_64 benchmarks. All evaluations were conducted using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness).
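The exact evaluation commands are not included in this card. As a minimal sketch, a single benchmark could be re-run through the lm-evaluation-harness Python API roughly as follows; the task selection, few-shot setting, and vLLM arguments below are illustrative assumptions, not the exact configuration used to produce the numbers in the table.

```python
# Minimal sketch, not the exact evaluation configuration used for this card:
# runs one benchmark (GSM8K, 8-shot) against the quantized model via vLLM.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=RedHatAI/Qwen3-32B-NVFP4A16,tensor_parallel_size=2",
    tasks=["gsm8k"],
    num_fewshot=8,
)
print(results["results"]["gsm8k"])
```

The measured results for the full benchmark suite are summarized in the table below.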
| Category | Metric | Qwen/Qwen3-32B | RedHatAI/Qwen3-32B-NVFP4A16 (this model) | Recovery (%) | 
|---|---|---|---|---|
| OpenLLM V1 | MMLU | 80.94 | 80.57 | 99.55% | 
| | ARC Challenge (0-shot) | 68.34 | 68.43 | 100.12% |
| | GSM8K (8-shot, strict-match) | 87.34 | 87.72 | 100.43% |
| | Hellaswag (10-shot) | 71.16 | 70.48 | 99.05% |
| | Winogrande (5-shot) | 69.93 | 70.09 | 100.23% |
| | TruthfulQA (0-shot, mc2) | 58.63 | 58.96 | 100.56% |
| | Average | 72.72 | 72.71 | 99.98% |
| OpenLLM V2 | MMLU-Pro (5-shot) | 54.48 | 51.61 | 94.73% | 
| | IFEval (0-shot) | 88.85 | 88.49 | 99.59% |
| | BBH (3-shot) | 62.61 | 62.14 | 99.25% |
| | Math-lvl-5 (4-shot) | 56.87 | 56.27 | 98.94% |
| | GPQA (0-shot) | 30.45 | 30.29 | 99.47% |
| | MuSR (0-shot) | 39.15 | 40.48 | 103.40% |
| | Average | 55.40 | 54.88 | 99.06% |
| Coding | HumanEval Instruct pass@1 | 88.41 | 87.20 | 98.63% | 
| | HumanEval 64 Instruct pass@2 | 90.27 | 89.66 | 99.32% |
| | HumanEval 64 Instruct pass@8 | 92.20 | 92.13 | 99.92% |
| | HumanEval 64 Instruct pass@16 | 92.96 | 93.27 | 100.33% |
| | HumanEval 64 Instruct pass@32 | 93.58 | 94.47 | 100.95% |
| | HumanEval 64 Instruct pass@64 | 93.90 | 95.73 | 101.95% |
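
The Recovery column appears to be the quantized score expressed as a percentage of the unquantized baseline; for MMLU, for example, 80.57 / 80.94 ≈ 99.5%, matching the reported 99.55% up to rounding of the underlying scores.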