---
language: en
license: apache-2.0
pipeline_tag: text-generation
tags:
- quantization
- nvfp4
- qwen
base_model: Qwen/Qwen3-0.6B
model_name: Qwen3-0.6B-NVFP4
---

# Qwen3-0.6B-NVFP4

NVFP4-quantized version of `Qwen/Qwen3-0.6B`, produced with [llmcompressor](https://github.com/neuralmagic/llm-compressor).

## Notes

- Quantization scheme: NVFP4, applied to linear layers with `lm_head` excluded
- Calibration samples: 512
- Maximum sequence length during calibration: 2048

A sketch of how these settings map onto an llmcompressor run is given under Creation below.

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "llmat/Qwen3-0.6B-NVFP4"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat template to a plain prompt string; vLLM tokenizes internally.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
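For example, the model can be launched with `vllm serve` and queried through any OpenAI-compatible client. Below is a minimal sketch; the localhost URL, placeholder API key, and sampling settings are assumptions rather than part of this card.

```python
# Start the server first, e.g.:
#   vllm serve llmat/Qwen3-0.6B-NVFP4
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on http://localhost:8000/v1 by default;
# the API key is a placeholder unless the server is started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llmat/Qwen3-0.6B-NVFP4",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    max_tokens=256,
)
print(response.choices[0].message.content)
```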
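## Creation

The exact quantization script is not included on this card. The sketch below shows how the settings listed under Notes (NVFP4 on linear layers, `lm_head` ignored, 512 calibration samples, maximum sequence length 2048) would typically be expressed with llmcompressor's `oneshot` API; the `open_platypus` calibration dataset is an assumption, since the actual calibration data is not stated.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-0.6B"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# NVFP4 on all Linear layers, keeping lm_head in full precision,
# matching the scheme listed under Notes.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",      # assumed calibration set; not stated on this card
    recipe=recipe,
    max_seq_length=2048,          # matches the calibration sequence length above
    num_calibration_samples=512,  # matches the calibration sample count above
)

model.save_pretrained("Qwen3-0.6B-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Qwen3-0.6B-NVFP4")
```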