Devstral-Small-2-24B-Instruct-SINQ-4bit

This is a 4-bit SINQ-quantized version of mistralai/Devstral-Small-2-24B-Instruct-2512.

  • Memory Reduction: ~48GB (BF16) → ~14GB (4-bit)
  • Target Hardware: RTX 3090 / 4090 (24GB VRAM)

Quantization Details

| Property | Value |
|---|---|
| Method | SINQ (Sinkhorn-Normalized Quantization) |
| Bits | 4 |
| Group Size | 64 |
| Calibration | Calibration-free (sinq method) |
| Tiling Mode | 1D |
| Format | Safetensors (sharded) |

Quantization Command

hf-quantizer quantize \
    --model-id mistralai/Devstral-Small-2-24B-Instruct-2512 \
    --output-dir ./outputs/devstral-sinq-4bit \
    --nbits 4 \
    --group-size 64 \
    --calibration sinq \
    --quantized-by maxence-bouvier

Why SINQ?

SINQ uses Sinkhorn iterations to normalize weights before quantization, effectively handling outliers without requiring calibration data. This results in robust performance at aggressive compression levels.
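
As a rough illustration of the idea only (not the actual SINQ implementation), Sinkhorn-style normalization can be pictured as alternating row and column rescaling of a weight matrix, so that outlier energy is spread across both axes before values are rounded to the 4-bit grid:

import torch

def sinkhorn_style_normalize(w: torch.Tensor, n_iter: int = 16, eps: float = 1e-8):
    """Toy sketch: alternately rescale rows and columns until their spreads
    even out, keeping both scale vectors so the transform stays invertible."""
    w = w.clone().float()
    row_scale = torch.ones(w.shape[0], 1, device=w.device)
    col_scale = torch.ones(1, w.shape[1], device=w.device)
    for _ in range(n_iter):
        r = w.std(dim=1, keepdim=True) + eps   # per-row spread
        w = w / r
        row_scale = row_scale * r
        c = w.std(dim=0, keepdim=True) + eps   # per-column spread
        w = w / c
        col_scale = col_scale * c
    # Exact reconstruction: original = row_scale * w * col_scale (elementwise)
    return w, row_scale, col_scale

The normalized matrix is then what gets rounded to 4 bits (here with group size 64), with the scale factors kept so they can be folded back in at dequantization.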

Usage

Installation

# Install SINQ fork with Mistral3 support
pip install git+https://github.com/MaxenceBouvier/SINQ.git@fix/mistral3-conditional-generation

# Install dependencies
pip install transformers accelerate "gemlite>=0.5.1.post1"

Note: This model requires the forked SINQ version with Mistral3ForConditionalGeneration fixes. The original huawei-csl/SINQ will not work.
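
As an optional sanity check (using the same module path as the loading example below), you can confirm the fork is importable before downloading any weights:

python -c "from sinq.patch_model import AutoSINQHFModel; print('SINQ fork OK')"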

Loading the Model

import torch
from transformers import AutoProcessor
from sinq.patch_model import AutoSINQHFModel

model_id = "maxence-bouvier/Devstral-Small-2-24B-Instruct-SINQ-4bit"

# Load quantized model (handles Hub download automatically)
model = AutoSINQHFModel.from_quantized_safetensors(
    model_id,
    compute_dtype=torch.float16,
    device="cuda",
)

# Load processor (handles tokenization and chat templates)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
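
To sanity-check the expected footprint (~14 GB, see Hardware Requirements below), you can print the VRAM actually allocated after loading; this uses only standard PyTorch APIs:

# Optional: report GPU memory held by the loaded weights
print(f"Allocated VRAM: {torch.cuda.memory_allocated() / 1024**3:.1f} GiB")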

Inference

def _build_unicode_to_bytes_map() -> dict[str, int]:
    """Build inverse of GPT-2's bytes_to_unicode mapping."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {chr(c): b for b, c in zip(bs, cs)}

_UNICODE_TO_BYTE = _build_unicode_to_bytes_map()

def fix_byte_encoding(text: str) -> str:
    """Fix byte-level BPE encoding for proper emoji/unicode display.

    Example: "ðŁ¤Ĺ" -> "🤗"
    """
    try:
        byte_values = bytes([_UNICODE_TO_BYTE.get(c, ord(c)) for c in text])
        return byte_values.decode("utf-8")
    except (UnicodeDecodeError, KeyError, ValueError):
        return text

messages = [
    {"role": "user", "content": "Write a Python function to check if a number is prime."}
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = processor(text=text, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        pad_token_id=processor.tokenizer.eos_token_id,
    )

response = processor.decode(outputs[0], skip_special_tokens=True)
response = fix_byte_encoding(response)  # Fix emoji/unicode display
print(response)

Note: The fix_byte_encoding helper is needed because Mistral uses byte-level BPE tokenization (GPT-2 style). UTF-8 bytes are encoded as individual Unicode characters (e.g., 🤗 becomes ðŁ¤Ĺ). This function reverses that mapping for proper display.
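
For example, feeding the helper the byte-level form of the hugging-face emoji recovers the original character, while plain ASCII passes through untouched:

print(fix_byte_encoding("ðŁ¤Ĺ"))   # -> 🤗
print(fix_byte_encoding("hello"))  # -> hello (unchanged)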

Quick Test

# Minimal test to verify the model works
messages = [{"role": "user", "content": "Say hello"}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt").to("cuda")

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50, do_sample=False,
                         pad_token_id=processor.tokenizer.eos_token_id)
response = fix_byte_encoding(processor.decode(out[0], skip_special_tokens=True))
print(response)

Hardware Requirements

| Configuration | Estimated VRAM |
|---|---|
| Inference (4-bit) | ~14 GB |
| + 16k Context | ~19 GB |

Recommended: RTX 3090 / RTX 4090 with 24GB VRAM
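
As a back-of-the-envelope check (assumed sizes, not measurements): 24B parameters at 4 bits is about 12 GB of packed weights, and per-group scale/zero metadata at group size 64 adds roughly another 1.5 GB, which lands near the ~14 GB figure above; the KV cache and activations account for the rest at long context lengths.

# Rough VRAM estimate; the parameter count and fp16 scale/zero layout are assumptions
params = 24e9                        # base model parameter count
weights_gb = params * 4 / 8 / 1e9    # 4-bit packed weights -> ~12 GB
meta_gb = params / 64 * 2 * 2 / 1e9  # one fp16 scale + zero per 64-weight group -> ~1.5 GB
print(f"~{weights_gb + meta_gb:.1f} GB for weights, plus KV cache and activations")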

Performance

On RTX 3090 with GemLite backend:

  • Load time: ~25s (one-time GemLite preparation)
  • Inference: ~1 tok/s (a timing sketch follows below)
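
To reproduce the throughput number on your own hardware, a simple timing loop around generate is enough (this reuses model, inputs, and processor from the examples above):

import time

with torch.no_grad():
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False,
                         pad_token_id=processor.tokenizer.eos_token_id)
    elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.2f} tok/s")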

Capabilities

Devstral is optimized for:

  • Code Generation: Python, JavaScript, Rust, C++, and more
  • Reasoning: Debugging and architecture planning
  • Instruction Following: Complex multi-step tasks

Original Model

This quantized model is derived from mistralai/Devstral-Small-2-24B-Instruct-2512. Please refer to the original model card for full capabilities, limitations, and intended use cases.

Changelog

  • 2025-12-12: Fixed GemLite device mismatch bug that caused inference to hang

Disclaimer

This is a quantized derivative. While SINQ preserves much of the original performance, some degradation may occur compared to full-precision weights. Use responsibly and verify outputs for critical applications.


Quantized by @maxence-bouvier
