# Devstral-Small-2-24B-Instruct-SINQ-4bit
This is a 4-bit SINQ-quantized version of mistralai/Devstral-Small-2-24B-Instruct-2512.
- Memory Reduction: ~48 GB (BF16) → ~14 GB (4-bit)
- Target Hardware: RTX 3090 / 4090 (24GB VRAM)
## Quantization Details
| Property | Value |
|---|---|
| Method | SINQ (Sinkhorn-Normalized Quantization) |
| Bits | 4 |
| Group Size | 64 |
| Calibration | Calibration-free (sinq method) |
| Tiling Mode | 1D |
| Format | Safetensors (sharded) |
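
The group size of 64 means every contiguous run of 64 weights shares one scale and zero-point. The sketch below is a generic illustration of group-wise 4-bit affine quantization (a toy example, not SINQ's actual kernel) to show what this setting controls:

```python
import torch

def quantize_groupwise_4bit(w: torch.Tensor, group_size: int = 64):
    """Toy group-wise 4-bit affine quantization (illustration only, not SINQ)."""
    out_features, in_features = w.shape
    w_groups = w.reshape(out_features, in_features // group_size, group_size)
    # One scale / zero-point per group of 64 weights.
    w_min = w_groups.amin(dim=-1, keepdim=True)
    w_max = w_groups.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0   # 4 bits -> 16 levels
    zero = (-w_min / scale).round()
    q = ((w_groups / scale) + zero).round().clamp(0, 15).to(torch.uint8)
    # Dequantize to inspect the error introduced by 4-bit storage.
    w_hat = (q.float() - zero) * scale
    return q, w_hat.reshape(out_features, in_features)

w = torch.randn(128, 256)
q, w_hat = quantize_groupwise_4bit(w)
print("mean abs error:", (w - w_hat).abs().mean().item())
```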
## Quantization Command

```bash
hf-quantizer quantize \
  --model-id mistralai/Devstral-Small-2-24B-Instruct-2512 \
  --output-dir ./outputs/devstral-sinq-4bit \
  --nbits 4 \
  --group-size 64 \
  --calibration sinq \
  --quantized-by maxence-bouvier
```
## Why SINQ?
SINQ uses Sinkhorn iterations to normalize weights before quantization, effectively handling outliers without requiring calibration data. This results in robust performance at aggressive compression levels.
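As a rough intuition only (SINQ's published algorithm differs in the details), a Sinkhorn-style balancing step alternately rescales rows and columns so that no single row or column carries disproportionately large values; the removed scales are kept in higher precision and re-applied at dequantization:

```python
import torch

def sinkhorn_balance(w: torch.Tensor, iters: int = 16, eps: float = 1e-8):
    """Toy Sinkhorn-style balancing of a weight matrix.

    Alternately divides rows and columns by their standard deviation so that
    outlier magnitude is spread across the matrix before rounding to 4 bits.
    Illustration of the idea only -- not SINQ's published algorithm.
    """
    row_scale = torch.ones(w.shape[0], 1, dtype=w.dtype)
    col_scale = torch.ones(1, w.shape[1], dtype=w.dtype)
    w_bal = w.clone()
    for _ in range(iters):
        r = w_bal.std(dim=1, keepdim=True).clamp(min=eps)
        w_bal, row_scale = w_bal / r, row_scale * r
        c = w_bal.std(dim=0, keepdim=True).clamp(min=eps)
        w_bal, col_scale = w_bal / c, col_scale * c
    # w ≈ row_scale * w_bal * col_scale; only w_bal would be quantized.
    return w_bal, row_scale, col_scale

w = torch.randn(64, 64)
w[0, 0] = 50.0                       # inject an outlier
w_bal, rs, cs = sinkhorn_balance(w)
print(w.abs().max().item(), w_bal.abs().max().item())  # outlier magnitude shrinks
```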
## Usage

### Installation

```bash
# Install SINQ fork with Mistral3 support
pip install git+https://github.com/MaxenceBouvier/SINQ.git@fix/mistral3-conditional-generation

# Install dependencies (quote the version specifier so the shell doesn't parse ">=")
pip install transformers accelerate "gemlite>=0.5.1.post1"
```

**Note:** This model requires the forked SINQ version with the `Mistral3ForConditionalGeneration` fixes. The original `huawei-csl/SINQ` will not work.
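
Before downloading ~14 GB of weights, a quick sanity check can save time. This sketch only confirms that the imports used below resolve; it does not by itself verify that the Mistral3-specific patches are active:

```python
# Environment sanity check: the imports used in the loading example must resolve.
import torch
import transformers
from transformers import Mistral3ForConditionalGeneration  # available in recent transformers releases
from sinq.patch_model import AutoSINQHFModel

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```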
### Loading the Model

```python
import torch
from transformers import AutoProcessor
from sinq.patch_model import AutoSINQHFModel

model_id = "maxence-bouvier/Devstral-Small-2-24B-Instruct-SINQ-4bit"

# Load quantized model (handles Hub download automatically)
model = AutoSINQHFModel.from_quantized_safetensors(
    model_id,
    compute_dtype=torch.float16,
    device="cuda",
)

# Load processor (handles tokenization and chat templates)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```
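
Optionally, verify right after loading that the resident footprint matches the ~14 GB figure quoted above:

```python
# Optional: confirm the quantized weights fit the expected ~14 GB budget.
print(f"Allocated VRAM: {torch.cuda.memory_allocated() / 1024**3:.1f} GiB")
print(f"Peak VRAM:      {torch.cuda.max_memory_allocated() / 1024**3:.1f} GiB")
```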
### Inference

```python
def _build_unicode_to_bytes_map() -> dict[str, int]:
    """Build the inverse of GPT-2's bytes_to_unicode mapping."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {chr(c): b for b, c in zip(bs, cs)}


_UNICODE_TO_BYTE = _build_unicode_to_bytes_map()


def fix_byte_encoding(text: str) -> str:
    """Fix byte-level BPE encoding for proper emoji/unicode display.

    Example: "ðŁ¤Ĺ" -> "🤗"
    """
    try:
        byte_values = bytes([_UNICODE_TO_BYTE.get(c, ord(c)) for c in text])
        return byte_values.decode("utf-8")
    except (UnicodeDecodeError, KeyError, ValueError):
        return text


messages = [
    {"role": "user", "content": "Write a Python function to check if a number is prime."}
]
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = processor(text=text, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        pad_token_id=processor.tokenizer.eos_token_id,
    )

response = processor.decode(outputs[0], skip_special_tokens=True)
response = fix_byte_encoding(response)  # Fix emoji/unicode display
print(response)
```
**Note:** The `fix_byte_encoding` helper is needed because Mistral uses byte-level BPE tokenization (GPT-2 style): UTF-8 bytes are encoded as individual Unicode characters (e.g., 🤗 becomes `ðŁ¤Ĺ`). This function reverses that mapping for proper display.
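
To see the mapping in action, the round trip below re-encodes a string the way byte-level BPE detokenization emits it, then recovers it with `fix_byte_encoding` (the `byte_encode` helper exists only for this demonstration):

```python
# Round trip: UTF-8 bytes -> GPT-2 byte-to-unicode characters -> fix_byte_encoding
_BYTE_TO_UNICODE = {b: c for c, b in _UNICODE_TO_BYTE.items()}

def byte_encode(text: str) -> str:
    """Re-encode a string the way byte-level BPE detokenization emits it."""
    return "".join(_BYTE_TO_UNICODE[b] for b in text.encode("utf-8"))

assert fix_byte_encoding(byte_encode("🤗 café")) == "🤗 café"
```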
## Quick Test

```python
# Minimal test to verify the model works
messages = [{"role": "user", "content": "Say hello"}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt").to("cuda")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=False,
        pad_token_id=processor.tokenizer.eos_token_id,
    )

response = fix_byte_encoding(processor.decode(out[0], skip_special_tokens=True))
print(response)
```
## Hardware Requirements
| Configuration | Estimated VRAM |
|---|---|
| Inference (4-bit) | ~14 GB |
| + 16k Context | ~19 GB |
Recommended: RTX 3090 / RTX 4090 with 24GB VRAM
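
These numbers are consistent with a back-of-the-envelope estimate from the parameter count alone (a rough sketch that assumes ~24B parameters at 4 bits plus one FP16 scale per 64-weight group, and ignores activations, the KV cache, and non-quantized layers):

```python
# Back-of-the-envelope VRAM estimate (rough; excludes activations and KV cache).
params = 24e9                  # ~24B parameters
bits_per_weight = 4            # packed 4-bit weights
scale_overhead = 16 / 64       # one FP16 scale per 64-weight group (~0.25 bit/weight)
weights_gb = params * (bits_per_weight + scale_overhead) / 8 / 1e9
print(f"~{weights_gb:.1f} GB for quantized weights")   # ≈ 12.8 GB, plus runtime overhead
```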
## Performance
On an RTX 3090 with the GemLite backend:
- Load time: ~25s (one-time GemLite preparation)
- Inference: ~1 tok/s
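
Throughput depends on the backend, context length, and batch size, so it is worth measuring on your own setup; a minimal timing loop reusing the `model` and `processor` from above:

```python
import time

# Measure decode throughput on your own hardware (numbers vary with backend and batch size).
messages = [{"role": "user", "content": "Explain what a mutex is in one paragraph."}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt").to("cuda")

with torch.no_grad():
    model.generate(**inputs, max_new_tokens=8)          # warmup (triggers kernel preparation)
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False,
                         pad_token_id=processor.tokenizer.eos_token_id)
    torch.cuda.synchronize()

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / (time.perf_counter() - start):.2f} tok/s")
```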
## Capabilities
Devstral is optimized for:
- Code Generation: Python, JavaScript, Rust, C++, and more
- Reasoning: Debugging and architecture planning
- Instruction Following: Complex multi-step tasks
## Original Model
This quantized model is derived from mistralai/Devstral-Small-2-24B-Instruct-2512. Please refer to the original model card for full capabilities, limitations, and intended use cases.
## Changelog
- 2025-12-12: Fixed GemLite device mismatch bug that caused inference to hang
## Disclaimer
This is a quantized derivative. While SINQ preserves much of the original performance, some degradation may occur compared to full-precision weights. Use responsibly and verify outputs for critical applications.
Quantized by @maxence-bouvier