IQuest-Coder-V1-40B-Loop-Instruct-NVFP4

An NVFP4-AWQ quantized version of IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct for efficient inference on NVIDIA Blackwell and Grace Hopper GPUs.

Key Details

  • Base Model: IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
  • Quantization: NVFP4 with AWQ (Activation-aware Weight Quantization)
  • Format: Safetensors
  • Size: 23.5 GB (a ~68% reduction from ~75 GB)
  • Recommended Hardware: NVIDIA DGX Spark, Grace Hopper, and Blackwell GPUs

Quantization Details

This model was quantized using NVIDIA's TensorRT Model Optimizer (modelopt) library with the NVFP4_AWQ_FULL_CFG configuration, which runs in three phases:

  • Phase 1: Activation statistics caching (~8 min)
  • Phase 2: AWQ parameter search (~2 hours)
  • Phase 3: Clip estimation (~2 hours)

AWQ calibration preserves accuracy by identifying and protecting "salient weights" based on activation magnitudes.
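
For reference, the flow looks roughly like the sketch below. It assumes modelopt's documented mtq.quantize() and export_hf_checkpoint() APIs; the calibration texts, their count, and the export path are illustrative placeholders, not the exact recipe used for this checkpoint.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

base_id = "IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical calibration set; a real run uses a few hundred representative samples.
calib_texts = ["def quicksort(arr):", "class LRUCache:"]

def forward_loop(m):
    # modelopt drives this loop to cache activation statistics (Phase 1)
    # and to search AWQ scales and clip values (Phases 2-3).
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        m(**inputs)

model = mtq.quantize(model, mtq.NVFP4_AWQ_FULL_CFG, forward_loop=forward_loop)
export_hf_checkpoint(model, export_dir="IQuest-Coder-V1-40B-Loop-Instruct-NVFP4")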

Usage with vLLM

vllm serve Elias-Schwegler/IQuest-Coder-V1-40B-Loop-Instruct-NVFP4 \
    --quantization modelopt \
    --trust-remote-code \
    --port 8000
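
The checkpoint can also be loaded directly through vLLM's Python API for offline batch inference. A minimal sketch using the documented LLM and SamplingParams classes; the prompt is illustrative.

from vllm import LLM, SamplingParams

llm = LLM(
    model="Elias-Schwegler/IQuest-Coder-V1-40B-Loop-Instruct-NVFP4",
    quantization="modelopt",
    trust_remote_code=True,
)
# Sampling values follow the recommendations listed below.
params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, max_tokens=8192)
outputs = llm.generate(["Write a Python function that merges two sorted lists."], params)
print(outputs[0].outputs[0].text)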

Usage with SGLang (DGX Spark)

docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    lmsysorg/sglang:spark \
    python3 -m sglang.launch_server \
    --model-path Elias-Schwegler/IQuest-Coder-V1-40B-Loop-Instruct-NVFP4 \
    --quantization modelopt_fp4 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

Sampling Parameters

From the base model's recommendations (the client sketch after this list shows how to pass them):

  • Temperature: 0.6
  • Top-P: 0.95
  • Top-K: 20
  • Min-P: 0.0
  • Max Tokens: 8192
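
Temperature, Top-P, and Max Tokens map directly onto the OpenAI-compatible API served by vLLM and SGLang; Top-K and Min-P are not standard OpenAI fields and go through extra_body. A sketch assuming the openai Python client and the server started above; the prompt is illustrative.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Elias-Schwegler/IQuest-Coder-V1-40B-Loop-Instruct-NVFP4",
    messages=[{"role": "user", "content": "Implement binary search in Python."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=8192,
    extra_body={"top_k": 20, "min_p": 0.0},  # server-specific sampling extensions
)
print(response.choices[0].message.content)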

License

Apache 2.0 (same as base model)
