---
license: apache-2.0
base_model: cerebras/GLM-4.6-REAP-218B-A32B
library_name: transformers
tags:
- glm
- moe
- mixture-of-experts
- autoround
- quantized
- 4-bit
- w4a16
- vllm
- sglang
- cerebras
model_type: glm4
pipeline_tag: text-generation
quantized_by: 0xSero
inference: false
---

# GLM-4.6-REAP-218B-A32B W4A16 (AutoRound Quantization)

This is a **4-bit quantized** version of [cerebras/GLM-4.6-REAP-218B-A32B](https://huggingface.co/cerebras/GLM-4.6-REAP-218B-A32B) using [Intel AutoRound](https://github.com/intel/auto-round).

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | [cerebras/GLM-4.6-REAP-218B-A32B](https://huggingface.co/cerebras/GLM-4.6-REAP-218B-A32B) |
| **Quantization** | W4A16 (4-bit weights, 16-bit activations) |
| **Method** | Intel AutoRound |
| **Format** | auto_round (compatible with vLLM, SGLang) |
| **Architecture** | GLM-4 Mixture of Experts |
| **Total Parameters** | 218B |
| **Active Parameters** | 32B (A32B) |
| **Original Size** | ~436 GB (BF16) |
| **Quantized Size** | ~116 GB |

## Performance Benchmarks

Tested on 8x NVIDIA RTX 3090 (24GB each) with vLLM:

### Speed Test (~20k context)

| Metric | Value |
|--------|-------|
| **Prompt Tokens** | ~21,178 |
| **Completion Tokens** | 393 |
| **Time to First Token (TTFT)** | 23.82s |
| **Total Generation Time** | 36.45s |
| **Prefill Speed** | ~889 tok/s |
| **Generation Speed** | ~31 tok/s |

### Coherence Test

The model correctly recalled all embedded facts from a long context:

- Character name: Aurelia
- Product code: ZX-42-ALPHA
- Transaction amount: 7,530,000 credits
- Scientist name: Dr. Linh Tran
- Date: 2025-12-15

## Usage

### vLLM (Recommended)

```bash
vllm serve /GLM-4.6-REAP-218B-A32B-W4A16-AutoRound \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 4 --pipeline-parallel-size 2 \
  --quantization auto-round \
  --kv-cache-dtype fp8 \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.88 \
  --cpu-offload-gb 4 \
  --block-size 32 \
  --max-num-seqs 8 \
  --max-num-batched-tokens 8192 \
  --swap-space 32 \
  --enable-expert-parallel \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --disable-custom-all-reduce \
  --disable-log-requests \
  --trust-remote-code
```

### SGLang

```bash
python3 -m sglang.launch_server \
  --model-path 0xSero/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound \
  --tp-size 8 \
  --trust-remote-code
```

### Hardware Requirements

| Configuration | VRAM Required | Notes |
|---------------|---------------|-------|
| **8x 24GB GPUs** | ~192GB total | TP=4, PP=2, recommended |
| **4x 48GB GPUs** | ~192GB total | TP=4, no PP needed |
| **8x 48GB GPUs** | ~384GB total | Full speed, larger batches |

**Minimum**: 8x 24GB GPUs (RTX 3090/4090) or equivalent ~192GB total VRAM.
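### Example Request (Python)

Once the vLLM server above is running, it exposes an OpenAI-compatible API. The snippet below is a minimal sketch, assuming the server is reachable at `http://localhost:8000/v1` and that the model is registered under the same path passed to `vllm serve`; query `GET /v1/models` to confirm the exact name in your deployment.

```python
# Minimal smoke test against the vLLM OpenAI-compatible endpoint started above.
# Assumptions: server on localhost:8000; MODEL matches the path given to
# `vllm serve` (adjust it if your deployment reports a different name).
from openai import OpenAI

MODEL = "/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound"

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": "In two sentences, what does W4A16 quantization mean?"},
    ],
    max_tokens=256,
    temperature=0.6,
)
print(response.choices[0].message.content)
```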
## Quantization Details

### Method

Quantized using [Intel AutoRound](https://github.com/intel/auto-round) with the following configuration:

- **Scheme**: W4A16 (4-bit weights, 16-bit activations)
- **Calibration samples**: 64
- **Sequence length**: 512
- **Batch size**: 1

### Quantization Script

```python
#!/usr/bin/env python3
"""
GLM-4.6-REAP-218B W4A16 quantization using Intel AutoRound.

Produces an SGLang/vLLM-compatible 4-bit quantized model.
"""

import logging
from datetime import datetime
from pathlib import Path

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)

MODEL_ID = "/mnt/llm_models/GLM-4.6-REAP-218B-A32B"  # or "cerebras/GLM-4.6-REAP-218B-A32B"
OUTPUT_DIR = "/mnt/llm_models/GLM-4.6-REAP-218B-A32B-W4A16-AutoRound"


def main():
    logger.info("=" * 80)
    logger.info("GLM-4.6-REAP-218B W4A16 Quantization (Intel AutoRound)")
    logger.info("=" * 80)
    start = datetime.now()

    from auto_round import AutoRound

    logger.info(f"Model: {MODEL_ID}")
    logger.info(f"Output: {OUTPUT_DIR}")
    logger.info("Scheme: W4A16 (4-bit weights, 16-bit activations)")

    Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

    logger.info("Initializing AutoRound (CPU mode)...")
    autoround = AutoRound(
        MODEL_ID,
        scheme="W4A16",
        device="cpu",
        device_map="cpu",
        trust_remote_code=True,
        batch_size=1,
        seqlen=512,
        nsamples=64,
    )

    logger.info("Starting quantization...")
    autoround.quantize_and_save(OUTPUT_DIR, format="auto_round")

    elapsed = datetime.now() - start
    logger.info("=" * 80)
    logger.info(f"Done in {elapsed}")
    logger.info(f"Output: {OUTPUT_DIR}")
    logger.info("=" * 80)


if __name__ == "__main__":
    main()
```

## About the Base Model

**GLM-4.6-REAP-218B-A32B** is a Mixture of Experts (MoE) model from Cerebras with:

- 218 billion total parameters
- 32 billion active parameters per forward pass
- Strong performance on reasoning and long-context tasks
- Native support for 128k+ context windows

For more details, see the [base model card](https://huggingface.co/cerebras/GLM-4.6-REAP-218B-A32B).

## Limitations

- Quantization may slightly reduce quality compared to BF16
- Requires significant VRAM (~192GB minimum across GPUs)
- Best results with tensor parallelism across 4-8 GPUs

## License

This quantized model inherits the license from the base model. See [cerebras/GLM-4.6-REAP-218B-A32B](https://huggingface.co/cerebras/GLM-4.6-REAP-218B-A32B) for licensing details.

## Acknowledgments

- [Cerebras](https://cerebras.ai/) for the base GLM-4.6-REAP model
- [Intel](https://github.com/intel/auto-round) for the AutoRound quantization toolkit
- [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang) teams for inference support

## Citation

If you use this model, please cite the original:

```bibtex
@misc{glm46reap,
  title={GLM-4.6-REAP-218B-A32B},
  author={Cerebras},
  year={2024},
  url={https://huggingface.co/cerebras/GLM-4.6-REAP-218B-A32B}
}
```