---
license: apache-2.0
base_model: cerebras/GLM-4.6-REAP-218B-A32B
library_name: transformers
tags:
- glm
- moe
- mixture-of-experts
- autoround
- quantized
- 4-bit
- w4a16
- vllm
- sglang
- cerebras
model_type: glm4
pipeline_tag: text-generation
quantized_by: 0xSero
inference: false
---

# GLM-4.6-REAP-218B-A32B W4A16 (AutoRound Quantization)

This is a **4-bit quantized** version of [cerebras/GLM-4.6-REAP-218B-A32B](https://huggingface.co/cerebras/GLM-4.6-REAP-218B-A32B) using [Intel AutoRound](https://github.com/intel/auto-round).

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | [cerebras/GLM-4.6-REAP-218B-A32B](https://huggingface.co/cerebras/GLM-4.6-REAP-218B-A32B) |
| **Quantization** | W4A16 (4-bit weights, 16-bit activations) |
| **Method** | Intel AutoRound |
| **Format** | auto_round (compatible with vLLM, SGLang) |
| **Architecture** | GLM-4 Mixture of Experts |
| **Total Parameters** | 218B |
| **Active Parameters** | 32B (A32B) |
| **Original Size** | ~436 GB (BF16) |
| **Quantized Size** | ~116 GB |

## Performance Benchmarks

Tested on 8x NVIDIA RTX 3090 (24GB each) with vLLM:

### Speed Test (~20k context)

| Metric | Value |
|--------|-------|
| **Prompt Tokens** | ~21,178 |
| **Completion Tokens** | 393 |
| **Time to First Token (TTFT)** | 23.82s |
| **Total Generation Time** | 36.45s |
| **Prefill Speed** | ~889 tok/s |
| **Generation Speed** | ~31 tok/s |

### Coherence Test

The model correctly recalled all embedded facts from a long context:

- Character name: Aurelia
- Product code: ZX-42-ALPHA
- Transaction amount: 7,530,000 credits
- Scientist name: Dr. Linh Tran
- Date: 2025-12-15

## Usage

### vLLM (Recommended)

```bash
vllm serve /GLM-4.6-REAP-218B-A32B-W4A16-AutoRound \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 4 --pipeline-parallel-size 2 \
  --quantization auto-round \
  --kv-cache-dtype fp8 \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.88 \
  --cpu-offload-gb 4 \
  --block-size 32 \
  --max-num-seqs 8 \
  --max-num-batched-tokens 8192 \
  --swap-space 32 \
  --enable-expert-parallel \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --disable-custom-all-reduce \
  --disable-log-requests \
  --trust-remote-code
```

### SGLang

```bash
python3 -m sglang.launch_server \
  --model-path 0xSero/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound \
  --tp-size 8 \
  --trust-remote-code
```

### Hardware Requirements

| Configuration | VRAM Required | Notes |
|---------------|---------------|-------|
| **8x 24GB GPUs** | ~192GB total | TP=4, PP=2, recommended |
| **4x 48GB GPUs** | ~192GB total | TP=4, no PP needed |
| **8x 48GB GPUs** | ~384GB total | Full speed, larger batches |

**Minimum**: 8x 24GB GPUs (RTX 3090/4090) or equivalent ~192GB total VRAM.
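### Example Request (Python)

Once the vLLM server above is running, it exposes an OpenAI-compatible API. The snippet below is a minimal sketch, assuming the server is reachable at `http://localhost:8000/v1` and that the model is registered under the same path passed to `vllm serve`; query `GET /v1/models` to confirm the exact name in your deployment.

```python
# Minimal smoke test against the vLLM OpenAI-compatible endpoint started above.
# Assumptions: server on localhost:8000; MODEL matches the path given to
# `vllm serve` (adjust it if your deployment reports a different name).
from openai import OpenAI

MODEL = "/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound"

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": "In two sentences, what does W4A16 quantization mean?"},
    ],
    max_tokens=256,
    temperature=0.6,
)
print(response.choices[0].message.content)
```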
## Quantization Details

### Method

Quantized using [Intel AutoRound](https://github.com/intel/auto-round) with the following configuration:

- **Scheme**: W4A16 (4-bit weights, 16-bit activations)
- **Calibration samples**: 64
- **Sequence length**: 512
- **Batch size**: 1

### Quantization Script

```python
#!/usr/bin/env python3
"""
GLM-4.6-REAP-218B W4A16 quantization using Intel AutoRound.

Produces an SGLang/vLLM-compatible 4-bit quantized model.
"""

import logging
from datetime import datetime
from pathlib import Path

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)

MODEL_ID = "/mnt/llm_models/GLM-4.6-REAP-218B-A32B"  # or "cerebras/GLM-4.6-REAP-218B-A32B"
OUTPUT_DIR = "/mnt/llm_models/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound"


def main():
    logger.info("=" * 80)
    logger.info("GLM-4.6-REAP-218B W4A16 Quantization (Intel AutoRound)")
    logger.info("=" * 80)
    start = datetime.now()

    from auto_round import AutoRound

    logger.info(f"Model: {MODEL_ID}")
    logger.info(f"Output: {OUTPUT_DIR}")
    logger.info("Scheme: W4A16 (4-bit weights, 16-bit activations)")

    Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

    logger.info("Initializing AutoRound (CPU mode)...")
    autoround = AutoRound(
        MODEL_ID,
        scheme="W4A16",
        device="cpu",
        device_map="cpu",
        trust_remote_code=True,
        batch_size=1,
        seqlen=512,
        nsamples=64,
    )

    logger.info("Starting quantization...")
    autoround.quantize_and_save(OUTPUT_DIR, format="auto_round")

    elapsed = datetime.now() - start
    logger.info("=" * 80)
    logger.info(f"Done in {elapsed}")
    logger.info(f"Output: {OUTPUT_DIR}")
    logger.info("=" * 80)


if __name__ == "__main__":
    main()
```

## About the Base Model

**GLM-4.6-REAP-218B-A32B** is a Mixture of Experts (MoE) model from Cerebras with:

- 218 billion total parameters
- 32 billion active parameters per forward pass
- Strong performance on reasoning and long-context tasks
- Native support for 128k+ context windows

For more details, see the [base model card](https://huggingface.co/cerebras/GLM-4.6-REAP-218B-A32B).

## Limitations

- Quantization may slightly reduce quality compared to BF16
- Requires significant VRAM (~192GB minimum across GPUs)
- Best results with tensor parallelism across 4-8 GPUs

## License

This quantized model inherits the license from the base model. See [cerebras/GLM-4.6-REAP-218B-A32B](https://huggingface.co/cerebras/GLM-4.6-REAP-218B-A32B) for licensing details.

## Acknowledgments

- [Cerebras](https://cerebras.ai/) for the base GLM-4.6-REAP model
- [Intel](https://github.com/intel/auto-round) for the AutoRound quantization toolkit
- [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang) teams for inference support

## Citation

If you use this model, please cite the original:

```bibtex
@misc{glm46reap,
  title={GLM-4.6-REAP-218B-A32B},
  author={Cerebras},
  year={2024},
  url={https://huggingface.co/cerebras/GLM-4.6-REAP-218B-A32B}
}
```