# Phi-4-mini-instruct INT4_SYM for Intel NPU

**First NPU-optimized Phi-4-mini model with correct quantization for Intel NPU!**
## Model Description
This is microsoft/Phi-4-mini-instruct (2.6B parameters) converted to OpenVINO IR format with NPU-specific INT4 symmetric quantization.
### Key Difference from Standard OpenVINO Models

**Critical discovery:** Intel NPU requires INT4_SYM (symmetric, channel-wise) quantization, not the INT4_ASYM (asymmetric, grouped) quantization used by standard OpenVINO pre-converted models.
| Quantization Type | NPU Compatibility |
|---|---|
| INT4_ASYM (group_size=64) | ❌ FAILS (MatMul errors) |
| INT4_SYM (channel-wise) | ✅ WORKS (this model) |
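To make the distinction concrete, here is a small numpy sketch of the two schemes (illustrative only; NNCF's actual implementation handles packing, mixed precision, and edge cases):

```python
import numpy as np

w = np.random.randn(4, 128).astype(np.float32)  # [out_channels, in_features]

# Symmetric, channel-wise (group_size=-1): one scale per output channel,
# zero-point fixed at 0, int4 range [-8, 7]
scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
w_sym = np.clip(np.round(w / scale), -8, 7).astype(np.int8)

# Asymmetric, grouped (group_size=64): a scale AND zero-point per group,
# uint4 range [0, 15] - tighter packing, but the resulting graph pattern
# is what the NPU compiler rejects (see the table above)
g = w.reshape(4, -1, 64)
lo, hi = g.min(axis=2, keepdims=True), g.max(axis=2, keepdims=True)
scale_a = (hi - lo) / 15.0
zero = np.round(-lo / scale_a)
w_asym = np.clip(np.round(g / scale_a) + zero, 0, 15).astype(np.uint8)

print(w_sym.shape, w_asym.shape)  # (4, 128) and (4, 2, 64)
```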
## Quantization Details
- Method: INT4_SYM (symmetric)
- Group size: -1 (channel-wise, not grouped)
- Calibration: AWQ + scale_estimation on wikitext2 dataset
- Distribution: 84% of weights INT4_SYM (128 layers), 16% INT8_ASYM (1 layer)
- Size: 2.13 GB
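The same configuration expressed directly against the NNCF API looks roughly like this (a sketch; the paths are hypothetical and the calibration plumbing is simplified):

```python
import nncf
import openvino as ov

# Hypothetical paths; calibration_samples would be wikitext2 prompts
# prepared as model inputs - see NNCF's weight-compression docs
model = ov.Core().read_model("phi4_mini_fp16/openvino_model.xml")

compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,  # symmetric - required for NPU
    group_size=-1,                           # channel-wise, not grouped
    awq=True,                                # activation-aware weight rounding
    scale_estimation=True,
    dataset=nncf.Dataset(calibration_samples),
)
ov.save_model(compressed, "phi4_mini_int4_sym/openvino_model.xml")
```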
## Performance on Intel NPU
Tested on Intel Core Ultra 7 155H (NPU driver v32.0.100.4297):
- Speed: 6.8 tok/s
- Compilation: 68.5s
- Inference: Stable, production-ready
Comparison with other models on the same hardware (Intel Core Ultra 7 155H):
- Qwen2.5-1.5B-Instruct (INT4_SYM): 10.7 tok/s (0.87 GB) - baseline performance
- Phi-4-mini-instruct (INT4_SYM): 6.8 tok/s (2.13 GB) - 73% more parameters, reasoning capabilities
- Performance ratio: ~64% of Qwen's speed, but a significantly more capable model
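The tok/s figures can be reproduced with a simple wall-clock measurement (a sketch; `model_dir` is assumed to be a local copy of this repository, as in the Usage section below):

```python
import time
from openvino_genai import LLMPipeline

pipe = LLMPipeline(model_dir, device="NPU")  # NPU compilation (~68.5s) happens here

n_tokens = 128
start = time.perf_counter()
pipe.generate("Explain quantum computing:",
              max_new_tokens=n_tokens, min_new_tokens=n_tokens)
elapsed = time.perf_counter() - start

print(f"{n_tokens / elapsed:.1f} tok/s (includes prefill time)")
```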
## Usage

### Requirements

```bash
pip install openvino-genai huggingface-hub
```
### Python API

```python
from huggingface_hub import snapshot_download
from openvino_genai import LLMPipeline

# LLMPipeline expects a local directory, so fetch the repo first
model_dir = snapshot_download("AhtnaGlen/phi-4-mini-instruct-int4-sym-npu-ov")

# Load and run on the Intel NPU
pipe = LLMPipeline(model_dir, device="NPU")

# Generate text
response = pipe.generate("Explain quantum computing:", max_new_tokens=100)
print(response)
```
### Streaming

openvino_genai streams through a callback rather than an iterator:

```python
# The streamer is called with each decoded chunk as it is produced;
# returning None (or False) tells generation to continue
pipe.generate("Write a story:", max_new_tokens=200,
              streamer=lambda subword: print(subword, end="", flush=True))
```
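Generation parameters can also be grouped into a `GenerationConfig` object instead of keyword arguments (a sketch reusing `pipe` from above):

```python
from openvino_genai import GenerationConfig

config = GenerationConfig()
config.max_new_tokens = 200
config.do_sample = True   # sample instead of greedy decoding
config.temperature = 0.7
config.top_p = 0.9

print(pipe.generate("Write a story:", config))
```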
## Why This Matters

Standard OpenVINO Phi-4 models (e.g., OpenVINO/Phi-4-mini-instruct-int4-ov) use INT4_ASYM quantization, which fails NPU compilation with errors like:

```
[ERROR] Channels count of input tensor shape and filter shape must be the same: 0 != 48
```
This model uses the correct NPU-optimized quantization as specified in Intel's NPU documentation:
```bash
# --sym: symmetric quantization (key for NPU!)
# --group-size -1: channel-wise, not grouped
optimum-cli export openvino -m microsoft/Phi-4-mini-instruct \
  --weight-format int4 \
  --sym \
  --group-size -1 \
  --awq --scale-estimation \
  --dataset wikitext2
```
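A successful `compile_model` call is the real smoke test, since NPU compilation is exactly where the wrong quantization fails (a minimal sketch; the output path is hypothetical, whatever `optimum-cli` wrote to):

```python
import openvino as ov

core = ov.Core()
# This is the step that throws the MatMul channel error for INT4_ASYM models
compiled = core.compile_model("phi4_mini_int4_sym/openvino_model.xml", "NPU")
print("Compiled for NPU:", compiled is not None)
```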
## Model Capabilities
- Instruction following: Fine-tuned for chat/instruction tasks
- Reasoning: Enhanced reasoning capabilities (Phi-4 series)
- Context length: 4096 tokens
- NPU acceleration: Full hardware offload to Intel NPU
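Since the model is instruction-tuned, multi-turn use maps onto `openvino_genai`'s chat helpers, which apply the chat template between turns (a sketch reusing `pipe` from the Usage section):

```python
pipe.start_chat()
print(pipe.generate("What is the NPU in a Core Ultra chip?", max_new_tokens=120))
print(pipe.generate("How does it differ from the integrated GPU?", max_new_tokens=120))
pipe.finish_chat()
```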
## Hardware Requirements
- Intel NPU: Core Ultra 7 155H (tested), or other NPU 3720/4000 series
- Driver: v32.0.100.4297 or newer
- OpenVINO: 2025.3.0 or newer
- Memory: ~3 GB for model + inference
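To verify the NPU and driver are visible to OpenVINO before loading the model (a minimal sketch):

```python
import openvino as ov

core = ov.Core()
if "NPU" in core.available_devices:
    # FULL_DEVICE_NAME is a standard OpenVINO device property
    print(core.get_property("NPU", "FULL_DEVICE_NAME"))
else:
    print("No NPU detected - check the driver installation")
```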
## Limitations
- NPU only: This model is quantized specifically for Intel NPU
- Speed trade-off: 6.8 tok/s vs Qwen2.5-1.5B @ 10.7 tok/s on Intel Core Ultra 7 155H
- Size vs capability: Larger model (2.13 GB) but enhanced reasoning and instruction-following
- Hardware specific: Performance validated on Intel Core Ultra 7 155H NPU
## Citation

If you use this model, please cite:

```bibtex
@misc{phi4-mini-npu-optimized,
  title        = {Phi-4-mini-instruct INT4\_SYM for Intel NPU},
  author       = {OpenVINO Community},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/AhtnaGlen/phi-4-mini-instruct-int4-sym-npu-ov}}
}
```
## Acknowledgments
- Base model: Microsoft Phi-4-mini-instruct
- Framework: Intel OpenVINO
- Quantization: NNCF (Neural Network Compression Framework)
- Discovery: Community finding on NPU quantization requirements
## License

MIT (following the base model's license).
## Model Card Contact
For issues or questions about NPU compatibility, please open an issue on the model repository.
**Note:** This model demonstrates the importance of quantization method selection for hardware-specific optimization. Always verify quantization parameters match target hardware requirements!