Update README.md
README.md
CHANGED
@@ -19,21 +19,8 @@ library_name: vllm
# 🔥 InternVL3_5-30B-A3B-FP8-Dynamic 🔥
This is an **FP8 dynamic (W8A8)** version of [OpenGVLab/InternVL3_5-30B-A3B](https://huggingface.co/OpenGVLab/InternVL3_5-30B-A3B), optimized for high-performance inference with vLLM.
The model uses **FP8 dynamic (W8A8)** quantization for efficient inference and deployment.
-## 🚀 Key Features
-- **FP8 Dynamic Quantization**: No calibration required, ready to use immediately
-- **Vision-Language Optimized**: Specialized quantization recipe that preserves visual understanding
-- **vLLM Ready**: Seamless integration with vLLM for production deployment
-- **Memory Efficient**: ~50% memory reduction compared to the FP16 original
-- **Performance Boost**: Significantly faster inference on H100/L40S GPUs
-## 📊 Model Details
-- **Original Model**: [OpenGVLab/InternVL3_5-30B-A3B](https://huggingface.co/OpenGVLab/InternVL3_5-30B-A3B)
-- **Source Model**: OpenGVLab/InternVL3_5-30B-A3B
-- **Quantized Model**: InternVL3_5-30B-A3B-FP8-Dynamic
-- **Quantization Method**: FP8 Dynamic (W8A8)
-- **Quantization Library**: [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.7.1
-- **Quantized by**: [brandonbeiler](https://huggingface.co/brandonbeiler)

-##
+## Just Run It (vLLM serve)

You can serve the model using vLLM's OpenAI-compatible API server.

@@ -46,6 +33,24 @@ vllm serve brandonbeiler/InternVL3_5-30B-A3B-FP8-Dynamic \
  --max-model-len 32768 \
  --tensor-parallel-size 1  # Adjust based on your GPU setup
```
+**Notes**
+- 32k max context length
+- Reasoning parser is ready to go; running in thinking mode requires a system prompt (see the example request after these notes)
+- Still investigating tool calling
+
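The served endpoint speaks the standard OpenAI chat-completions protocol, so any OpenAI-compatible client can call it. Below is a minimal sketch using the `openai` Python package; the host/port, the image URL, and the system prompt used to trigger thinking mode are illustrative assumptions, not values taken from this model card.

```python
# Minimal sketch: send a vision request to the vLLM server started above.
# Assumes the server listens on localhost:8000; the system prompt is a
# placeholder for whatever prompt enables thinking mode for this model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="brandonbeiler/InternVL3_5-30B-A3B-FP8-Dynamic",
    messages=[
        # Placeholder thinking-mode system prompt (assumption, adjust as needed)
        {"role": "system", "content": "Think step by step before answering."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
            ],
        },
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```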
+## 🚀 Key Features
+- **FP8 Dynamic Quantization**: No calibration required, ready to use immediately
+- **Vision-Language Optimized**: Specialized quantization recipe that preserves visual understanding
+- **vLLM Ready**: Seamless integration with vLLM for production deployment
+- **Memory Efficient**: ~50% memory reduction compared to the FP16 original (rough numbers after this list)
+- **Performance Boost**: Significantly faster inference on H100/L40S GPUs
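As a rough sanity check on the memory claim, assuming about 30B parameters as the model name suggests: FP16/BF16 weights take ~2 bytes per parameter (~60 GB), while FP8 weights take ~1 byte per parameter (~30 GB), before KV cache and activation overhead and ignoring any layers left unquantized.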
+## 📊 Model Details
+- **Original Model**: [OpenGVLab/InternVL3_5-30B-A3B](https://huggingface.co/OpenGVLab/InternVL3_5-30B-A3B)
+- **Source Model**: OpenGVLab/InternVL3_5-30B-A3B
+- **Quantized Model**: InternVL3_5-30B-A3B-FP8-Dynamic
+- **Quantization Method**: FP8 Dynamic (W8A8)
+- **Quantization Library**: [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.7.1 (see the sketch after this list)
+- **Quantized by**: [brandonbeiler](https://huggingface.co/brandonbeiler)
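For reference, FP8-dynamic checkpoints of this kind are typically produced with LLM Compressor's data-free `FP8_DYNAMIC` scheme. The sketch below only illustrates the general flow under assumptions: it is not the author's actual script, the ignore patterns for `lm_head` and the vision tower are guesses at the "vision-language optimized" recipe, and it assumes a recent llm-compressor that exports `oneshot` at the top level.

```python
# Rough sketch of a data-free FP8 dynamic (W8A8) quantization flow with
# LLM Compressor. Not the exact recipe used for this checkpoint.
from transformers import AutoModel, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "OpenGVLab/InternVL3_5-30B-A3B"

model = AutoModel.from_pretrained(MODEL_ID, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# FP8_DYNAMIC quantizes weights statically and activations dynamically,
# so no calibration dataset is needed.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*vision_model.*"],  # assumed: keep head and vision tower in higher precision
)

oneshot(model=model, recipe=recipe)

save_dir = "InternVL3_5-30B-A3B-FP8-Dynamic"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```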

## 🏗️ Technical Specifications
### Hardware Requirements