Model Card for qep/qep-1bit-extreme

🚨 This model is a 1-bit quantized version of Cohere Labs Command A produced with QEP. You can find the unquantized version of Cohere Labs Command A here.

Model Summary

An optimized 1-bit quantized version of c4ai/command-a-03-2025 that achieves 6.7× compression while preserving output quality through advanced quantization optimization techniques (Fujitsu QEP and QQA).

Key Features

  • Extreme Compression: 6.7× smaller (207 GB → 30.2 GB, an ~85% reduction), so the model runs on a single GPU (e.g., the 30.2 GB checkpoint fits on an A100 80GB).
  • Enhanced Performance: OneBit quantization, improved with Fujitsu QEP and QQA.
  • Inference Speed-Up: Faster inference via BitLinear computation.

Model Details

  • Base Model: c4ai/command-a-03-2025
  • Quantization Method: OneBit with Fujitsu QEP/QQA optimization
  • Quantization Bits: 1-bit for layers 0-61, FP16 for the last two layers (62-63)
  • Optimization Techniques: Fujitsu QEP, QQA
  • Compatible Hardware: Single GPU (recommended: >= 40GB VRAM)

Developed by: Fujitsu, Cohere and Cohere Labs

For more details on how this model was developed, check out our Press Release (English), Press Release (Japanese), Fujitsu's Tech Report, and Cohere's Tech Report.

Usage

The base architecture of this model is Command-A. To load and use the model, please use the CommandA model class and follow the steps below; a loading sketch is provided after the note.

  1. Load model.safetensors, which contains the quantized weights.
  2. Replace all layers except the last two with BitLinear implementations.
  3. Keep the last two layers with non-quantized (FP16) weights for optimal performance.
  4. The model requires the included onebit_linear.py for the quantized layer implementation. The weights contain the OneBit-specific a, S, and b parameters needed to reconstruct each quantized layer.
  5. Depending on the level of quality you wish to maintain, you may keep additional layers near the output unquantized.

Note: Direct loading support as an extension of the transformers package is planned for future releases.
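The sketch below illustrates the steps above. It is not a drop-in script: the OneBitLinear class name and constructor, the local checkpoint path, and the model.model.layers attribute path are assumptions based on this description and on typical transformers decoder layouts; consult the bundled onebit_linear.py for the actual interface.

```python
# Minimal loading sketch (illustrative only). OneBitLinear and its constructor
# are assumed names -- check the bundled onebit_linear.py for the real interface.
import torch
from transformers import AutoConfig, AutoModelForCausalLM
from safetensors.torch import load_file

from onebit_linear import OneBitLinear  # shipped with this repository (assumed import)

MODEL_DIR = "./qep-1bit-extreme"  # local download of this repository (assumed path)

# Steps 1-2: build the Command-A architecture, then swap nn.Linear modules in
# layers 0-61 for BitLinear implementations; layers 62-63 stay FP16 (step 3).
config = AutoConfig.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

layers = model.model.layers  # decoder block list (attribute path assumed)
for idx, layer in enumerate(layers):
    if idx >= len(layers) - 2:
        continue  # keep the last two layers unquantized
    linear_names = [n for n, m in layer.named_modules() if isinstance(m, torch.nn.Linear)]
    for name in linear_names:
        parent = layer.get_submodule(name.rsplit(".", 1)[0]) if "." in name else layer
        child = name.rsplit(".", 1)[-1]
        old = getattr(parent, child)
        setattr(parent, child,
                OneBitLinear(old.in_features, old.out_features, bias=old.bias is not None))

# Step 4: load the quantized weights (sign matrices S plus FP16 vectors a and b,
# together with the FP16 weights of the last two layers).
state_dict = load_file(f"{MODEL_DIR}/model.safetensors")
model.load_state_dict(state_dict, strict=False)
model.eval()
```

Once the weights are loaded, generation goes through the standard transformers tokenizer/generate API. The rule in step 5 can be applied by simply skipping more trailing layers in the replacement loop.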

Requirements

torch>=2.0.0
transformers>=4.35.0
safetensors>=0.4.0

Performance

  • Memory Usage: 6.7x reduction overall (207GB → 30.2GB)
  • Inference Speed: Optimized for fast generation on single GPU
  • Quality: Enhanced performance through QEP/QQA optimization
  • Compatibility: Single GPU deployment capable

Technical Specifications

  • Original Model: Command-A (c4ai/command-a-03-2025)
  • Quantized Layers: 62 layers (0-61) with 1-bit precision
  • Preserved Layers: 2 layers (62-63) with FP16 precision
  • Compression Technique: OneBit + Fujitsu QEP/QQA
  • Model Size: 30.2GB (from original 207GB)
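To make the a, S, and b parameters from the Usage section concrete, the sketch below shows a OneBit-style BitLinear layer: each FP16 weight matrix W is approximated as the element-wise product of a ±1 sign matrix S with a rank-one matrix built from two FP16 vectors (a over input channels, b over output channels), so the forward pass reduces to y = ((x ⊙ a) · Sᵀ) ⊙ b. Parameter names, shapes, and orientation here are illustrative assumptions; the exact layout stored in this checkpoint may differ.

```python
# Illustrative OneBit-style BitLinear forward pass; names and shapes are assumed.
import torch
import torch.nn as nn

class BitLinearSketch(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        # S: sign matrix (+1/-1), conceptually storable at 1 bit per weight.
        self.S = nn.Parameter(torch.empty(out_features, in_features), requires_grad=False)
        # a, b: FP16 value vectors kept in full precision; real values would be
        # loaded from the checkpoint rather than initialized here.
        self.a = nn.Parameter(torch.empty(in_features, dtype=torch.float16))
        self.b = nn.Parameter(torch.empty(out_features, dtype=torch.float16))
        self.bias = nn.Parameter(torch.zeros(out_features, dtype=torch.float16)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to x @ (S * outer(b, a)).T, but keeping the scaling in the
        # vectors means the matmul itself only touches the +/-1 matrix:
        # y = ((x * a) @ S.T) * b
        y = ((x * self.a) @ self.S.t().to(x.dtype)) * self.b
        return y + self.bias if self.bias is not None else y
```

Because S holds only ±1 values, the matrix multiply can be implemented with sign-only arithmetic (the BitLinear computation referenced in the Key Features), which is where both the 1-bit memory footprint and the inference speed-up come from.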

Future Plans

  • Global and Block-wise Fine-tuning: Explore fine-tuning strategies, including block-wise methods, to further improve accuracy and robustness.
  • Complete Usage Examples: Provide detailed implementation guides for efficient single-GPU deployment.
  • Optimization Updates: Enhance performance with next-generation quantization techniques and improved reconstruction methods.

Currently, the quantization process keeps the last two layers as non-quantized weights to maintain output quality, while applying aggressive 1-bit quantization to the remaining layers. Future releases will integrate block-wise fine-tuning for additional performance gains.

Ethical Considerations

This model inherits the capabilities and limitations of the base Command A model. Please refer to the original model's documentation for ethical guidelines and potential biases.

Model Card Contact

For errors or additional questions about details in this model card, contact [email protected]

Terms of Use

We hope that the release of this model will make community-based research efforts more accessible by providing the weights of a highly performant model to researchers all over the world. This model is governed by a CC-BY-NC license and also requires adherence to Cohere Labs' Acceptable Use Policy.

Citation

If you use this model, please cite:

@misc{command-a-onebit-hybrid,
  title={Command-A 111B with QEP-Optimized OneBit Extreme Quantization},
  author={Yuma Ichikawa and Yusei Kawakami and Yoshiyuki Ishii and Keiji Kimura and Akira Sakai},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/qep/qep-1bit-extreme}
}

License

This quantized model is released under the same license as the base Command A model (CC-BY-NC-4.0).

