# ABEJA Qwen 2.5 7B Japanese - QNN Optimized
This repository contains the ABEJA Qwen 2.5 7B Japanese model optimized for Qualcomm Neural Network (QNN) deployment.
## Model Details

- Base Model: abeja/Qwen2.5-7B-Japanese
- Architecture: Qwen2ForCausalLM
- Parameters: ~7.6B
- Language: Japanese (primary), English (secondary)
- Quantization: 4-bit NF4 (see the sketch below)
- Target Hardware: Snapdragon 8cx Gen 2+ (SM8350)
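
For reference, 4-bit NF4 loading of the base model can be reproduced with `bitsandbytes` through `transformers`. This is a minimal sketch; the exact settings used for this export are an assumption:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumed NF4 settings; the actual export configuration may differ.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "abeja/Qwen2.5-7B-Japanese",
    quantization_config=bnb_config,
    device_map="auto",
)
```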
 
## Available Formats

### 1. Quantized PyTorch Model

- Path: `quantized_simple/`
- Format: 4-bit NF4 quantized
- Size: ~4.5GB (reduced from ~15GB)
- Usage: Direct inference with transformers
 
### 2. ONNX Models

- Path: `onnx/`
- Models:
  - `prefill/model.onnx` - context prefill
  - `token_gen/model.onnx` - token generation
- Usage: Cross-platform inference
 
### 3. Quantized ONNX Models

- Path: `quantized_onnx/`
- Format: Dynamic quantization (INT8); see the sketch below
- Usage: Optimized ONNX inference
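
The INT8 models correspond to ONNX Runtime dynamic quantization. A minimal sketch of that step (file names follow the repository layout; the exact options used are an assumption):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamically quantize weights to INT8; activations stay in float
# and are quantized on the fly at inference time.
quantize_dynamic(
    model_input="onnx/prefill/model.onnx",
    model_output="quantized_onnx/prefill/model_quantized.onnx",
    weight_type=QuantType.QInt8,  # assumed; QUInt8 is also common
)
```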
 
### 4. QNN Compiled Models

- Path: `qnn_compiled/`
- Format: Qualcomm Neural Network format
- Target: Snapdragon devices
- Usage: Native ARM64 deployment
 
## Usage

### Quantized PyTorch Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "marcusmi4n/abeja-qwen2.5-7b-japanese-qnn", subfolder="quantized_simple"
)
tokenizer = AutoTokenizer.from_pretrained(
    "marcusmi4n/abeja-qwen2.5-7b-japanese-qnn", subfolder="quantized_simple"
)

# Japanese text generation ("Hello, I am ...")
inputs = tokenizer("こんにちは、私は", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
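
For chat-style prompting, Qwen 2.5 tokenizers ship a chat template. A sketch continuing from the snippet above; the sampling parameters are illustrative:

```python
# Continuing from above (model and tokenizer already loaded).
messages = [
    {"role": "user", "content": "日本の首都はどこですか？"}  # "What is the capital of Japan?"
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```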

### ONNX Inference

```python
import onnxruntime as ort

# Load the prefill ONNX model. InferenceSession expects a local file,
# so download the repository first (e.g. with huggingface_hub's
# snapshot_download).
session = ort.InferenceSession("onnx/prefill/model.onnx")

# Run inference...
```
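
The input names and shapes depend on how the graphs were exported, so it is safer to inspect the session than to hard-code them. A minimal sketch (the dummy feed is an assumption; match it to the printed inputs):

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("onnx/prefill/model.onnx")

# Inspect the graph's expected inputs instead of guessing names.
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)

# Hypothetical feed for a typical causal-LM export; supply every
# required input reported above (e.g. attention_mask, past KV caches).
input_ids = np.array([[1, 2, 3]], dtype=np.int64)
outputs = session.run(None, {session.get_inputs()[0].name: input_ids})
```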

### QNN Deployment

```bash
# Download the repository locally, then deploy the compiled models
# to a Snapdragon device:
adb push qnn_compiled/ /data/local/tmp/qnn_model/

# Run with the QNN runtime on the device (e.g. qnn-net-run from the
# Qualcomm AI Engine Direct SDK; the exact invocation depends on the
# SDK version and the compiled artifact names).
```
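
On Windows on ARM, an alternative route is ONNX Runtime's QNN execution provider (the `onnxruntime-qnn` package), which offloads ONNX models to the Hexagon NPU. A sketch, assuming a default install where `QnnHtp.dll` is on the DLL search path; whether every op in these graphs is supported by the provider is not guaranteed:

```python
import onnxruntime as ort

# QNN execution provider targeting the HTP (NPU) backend.
session = ort.InferenceSession(
    "quantized_onnx/prefill/model_quantized.onnx",
    providers=["QNNExecutionProvider"],
    provider_options=[{"backend_path": "QnnHtp.dll"}],
)
```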

## Performance

- Quantization: ~70% size reduction (~15GB → ~4.5GB)
- Speed: 2-3x faster inference than the unquantized model
- Memory: ~4.5GB RAM usage
- Throughput: 8-15 tokens/sec on Snapdragon 8cx Gen 2+ (see the measurement sketch below)
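
The throughput figure can be sanity-checked with a rough wall-clock measurement. This sketch ignores the prefill/decode split and reuses the model and tokenizer from the usage example above:

```python
import time

inputs = tokenizer("こんにちは、私は", return_tensors="pt")
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
elapsed = time.perf_counter() - start

# Count only newly generated tokens, not the prompt.
new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
```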
 
## Hardware Compatibility

- ✅ Snapdragon 8cx Gen 2+
- ✅ Snapdragon 8cx Gen 3
- ✅ Snapdragon 8 Gen 1+
- ✅ Windows on ARM devices
- ✅ Microsoft Surface Pro X
- ✅ Dell Latitude 7420
 
## Files Structure

```
marcusmi4n/abeja-qwen2.5-7b-japanese-qnn/
├── quantized_simple/          # 4-bit quantized PyTorch model
│   ├── model.safetensors
│   ├── config.json
│   ├── tokenizer.json
│   └── model_info.json
├── onnx/                      # ONNX models
│   ├── prefill/model.onnx
│   └── token_gen/model.onnx
├── quantized_onnx/            # Quantized ONNX models
│   ├── prefill/model_quantized.onnx
│   └── token_gen/model_quantized.onnx
├── qnn_compiled/              # QNN compiled models
│   ├── prefill/
│   ├── token_gen/
│   └── deployment_info.json
└── README.md                  # This file
```

## License

Apache 2.0 - same as the base ABEJA Qwen 2.5 model.

## Citation

```bibtex
@misc{abeja-qwen25-qnn,
  title={ABEJA Qwen 2.5 7B Japanese - QNN Optimized},
  author={QNN Conversion Pipeline},
  year={2025},
  url={https://huggingface.co/marcusmi4n/abeja-qwen2.5-7b-japanese-qnn}
}
```

## Base Model Citation

Please also cite the original ABEJA Qwen 2.5 work:

```bibtex
@article{abeja-qwen2.5,
  title={ABEJA Qwen 2.5: Japanese Language Model},
  author={ABEJA Inc.},
  journal={arXiv preprint},
  year={2024}
}
```