brandonbeiler committed on
Commit 71a16d4 · verified · 1 Parent(s): d017b43

Update README.md

Files changed (1)
  1. README.md +19 -14
README.md CHANGED
@@ -19,21 +19,8 @@ library_name: vllm
  # 🔥 InternVL3_5-30B-A3B-FP8-Dynamic 🔥
  This is an **fp8 dynamic (w8a8)** version of [OpenGVLab/InternVL3_5-30B-A3B](https://huggingface.co/OpenGVLab/InternVL3_5-30B-A3B), optimized for high-performance inference with vLLM.
  The model utilizes **fp8 dynamic (w8a8)** for optimal performance and deployment.
- ## 🚀 Key Features
- - **FP8 Dynamic Quantization**: No calibration required, ready to use immediately
- - **Vision-Language Optimized**: Specialized quantization recipe that preserves visual understanding
- - **vLLM Ready**: Seamless integration with vLLM for production deployment
- - **Memory Efficient**: ~50% memory reduction compared to FP16 original
- - **Performance Boost**: Significantly faster inference on H100/L40S GPUs
- ## 📊 Model Details
- - **Original Model**: [OpenGVLab/InternVL3_5-30B-A3B](https://huggingface.co/OpenGVLab/InternVL3_5-30B-A3B)
- - **Source Model**: OpenGVLab/InternVL3_5-30B-A3B
- - **Quantized Model**: InternVL3_5-30B-A3B-FP8-Dynamic
- - **Quantization Method**: FP8 Dynamic (W8A8)
- - **Quantization Library**: [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.7.1
- - **Quantized by**: [brandonbeiler](https://huggingface.co/brandonbeiler)

- ## With vLLM OpenAI-Compatible Server
+ ## Just Run It (vLLM serve)

  You can serve the model using vLLM's OpenAI-compatible API server.

@@ -46,6 +33,24 @@ vllm serve brandonbeiler/InternVL3_5-30B-A3B-FP8-Dynamic \
  --max-model-len 32768 \
  --tensor-parallel-size 1 # Adjust based on your GPU setup
  ```
+ **Notes**
+ - 32k max context length
+ - reasoning parser is ready to go; requires a system prompt to run in thinking mode
+ - still investigating tool calling
+
+ ## 🚀 Key Features
+ - **FP8 Dynamic Quantization**: No calibration required, ready to use immediately
+ - **Vision-Language Optimized**: Specialized quantization recipe that preserves visual understanding
+ - **vLLM Ready**: Seamless integration with vLLM for production deployment
+ - **Memory Efficient**: ~50% memory reduction compared to FP16 original
+ - **Performance Boost**: Significantly faster inference on H100/L40S GPUs
+ ## 📊 Model Details
+ - **Original Model**: [OpenGVLab/InternVL3_5-30B-A3B](https://huggingface.co/OpenGVLab/InternVL3_5-30B-A3B)
+ - **Source Model**: OpenGVLab/InternVL3_5-30B-A3B
+ - **Quantized Model**: InternVL3_5-30B-A3B-FP8-Dynamic
+ - **Quantization Method**: FP8 Dynamic (W8A8)
+ - **Quantization Library**: [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.7.1
+ - **Quantized by**: [brandonbeiler](https://huggingface.co/brandonbeiler)

  ## 🏗️ Technical Specifications
  ### Hardware Requirements
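Once the server from the hunk above is running, any OpenAI-compatible client can talk to it. Below is a minimal Python sketch using the `openai` package, assuming the vLLM default endpoint `http://localhost:8000/v1`; the image URL and the thinking-mode system prompt wording are illustrative placeholders, not values taken from the model card.

```python
# Query the vLLM OpenAI-compatible server (sketch; endpoint and prompt are assumptions).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="brandonbeiler/InternVL3_5-30B-A3B-FP8-Dynamic",
    messages=[
        # Per the notes above, thinking mode is opt-in via a system prompt;
        # the exact wording the chat template expects is an assumption here.
        {"role": "system", "content": "Think step by step before answering."},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text", "text": "Describe this image."},
            ],
        },
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

The `image_url` message part is the standard OpenAI vision schema, which vLLM accepts for multimodal models like this one.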
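The Model Details list names [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.7.1 and FP8 Dynamic (W8A8) as the method, but this commit does not include the recipe itself. The sketch below follows the library's stock `FP8_DYNAMIC` flow; the `ignore` list and quantizing only `Linear` modules are assumptions (a vision-language recipe would typically also exclude the vision tower), not the author's actual recipe.

```python
# Sketch of FP8 dynamic (W8A8) quantization with llm-compressor; see assumptions above.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "OpenGVLab/InternVL3_5-30B-A3B"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# FP8_DYNAMIC = static per-channel FP8 weights + dynamic per-token FP8
# activations, so no calibration dataset is required.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],  # assumption: a real VL recipe would also ignore vision modules
)

oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format so vLLM can load it directly.
SAVE_DIR = "InternVL3_5-30B-A3B-FP8-Dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

Because activations are quantized on the fly at inference time, no calibration pass is needed, which matches the "No calibration required" claim in Key Features.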