Update README.md

README.md (changed)
@@ -94,7 +94,7 @@ print("thinking content:", thinking_content)
 print("content:", content)
 ```
 
-For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.4` to create an OpenAI-compatible API endpoint:
+For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.5` to create an OpenAI-compatible API endpoint:
 - SGLang:
     ```shell
     python -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B-FP8 --reasoning-parser qwen3
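The line changed in the hunk above points the reader at an OpenAI-compatible API endpoint served by SGLang or vLLM. As an illustration only (not part of the README), a client call against such an endpoint might look like the sketch below; the base URL, the port, the placeholder API key, and the served model name are assumptions that depend on how the server above (or the vLLM command in the next hunk) was actually launched.

```python
# Illustrative sketch only: query an OpenAI-compatible endpoint started by
# `vllm serve ...` or `python -m sglang.launch_server ...`.
# The base URL/port and the dummy API key are assumptions; adjust them to
# match the flags you actually passed to the server (SGLang often listens on 30000).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local endpoint
    api_key="EMPTY",                      # local servers usually accept any placeholder key
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-FP8",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
)
print(response.choices[0].message.content)
```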
@@ -104,39 +104,16 @@ For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.4` to create an OpenAI-compatible API endpoint:
     vllm serve Qwen/Qwen3-235B-A22B-FP8 --enable-reasoning --reasoning-parser deepseek_r1
     ```
 
-For local use, applications such as
+For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3.
 
 ## Note on FP8
 
 For convenience and performance, we have provided an `fp8`-quantized model checkpoint for Qwen3, whose name ends with `-FP8`. The quantization method is fine-grained `fp8` quantization with a block size of 128. You can find more details in the `quantization_config` field in `config.json`.
 
-You can use the Qwen3-235B-A22B-FP8 model with serveral inference frameworks, including `transformers`, `
+You can use the Qwen3-235B-A22B-FP8 model with several inference frameworks, including `transformers`, `sglang`, and `vllm`, just as you would the original bfloat16 model.
 However, please pay attention to the following known issues:
 - `transformers`:
   - there are currently issues with the "fine-grained fp8" method in `transformers` for distributed inference. You may need to set the environment variable `CUDA_LAUNCH_BLOCKING=1` if multiple devices are used in inference.
-- vLLM:
-  - there are currently compatibility issues with `vllm`. For a quick fix, you should make the following changes to `vllm/vllm/model_executor/layers/linear.py`:
-    ```python
-    # these changes are in QKVParallelLinear.weight_loader_v2() of vllm/vllm/model_executor/layers/linear.py
-    ...
-    shard_offset = self._get_shard_offset_mapping(loaded_shard_id)
-    shard_size = self._get_shard_size_mapping(loaded_shard_id)
-
-    # add the following code
-    if isinstance(param, BlockQuantScaleParameter):
-        weight_block_size = self.quant_method.quant_config.weight_block_size
-        block_n, _ = weight_block_size[0], weight_block_size[1]
-        shard_offset = (shard_offset + block_n - 1) // block_n
-        shard_size = (shard_size + block_n - 1) // block_n
-    # end of the modification
-
-    param.load_qkv_weight(loaded_weight=loaded_weight,
-                          num_heads=self.num_kv_head_replicas,
-                          shard_id=loaded_shard_id,
-                          shard_offset=shard_offset,
-                          shard_size=shard_size)
-    ...
-    ```
 
 ## Switching Between Thinking and Non-Thinking Mode
 
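The FP8 note in the hunk above says the `-FP8` checkpoint can be used with `transformers` just like the bfloat16 model, with a caveat about distributed inference. Below is a minimal loading sketch; it is not taken from the README, and the prompt and generation length are placeholders. The `CUDA_LAUNCH_BLOCKING=1` line reflects the workaround quoted in the known-issues bullet and only matters when several GPUs are used.

```python
# Minimal sketch (not from the README): load the FP8 checkpoint with transformers
# exactly as one would the bfloat16 model. The fine-grained fp8 setup is picked up
# automatically from `quantization_config` in config.json.
import os
# Workaround from the known-issues list above; only relevant for multi-GPU inference.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-235B-A22B-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",   # keep non-quantized tensors in their checkpoint dtype
    device_map="auto",    # shard across the available GPUs
)

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```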
@@ -310,7 +287,7 @@ YaRN is currently supported by several inference frameworks, e.g., `transformers`
 {
     ...,
     "rope_scaling": {
-        "type": "yarn",
+        "rope_type": "yarn",
         "factor": 4.0,
         "original_max_position_embeddings": 32768
     }
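This hunk renames the YaRN key in the `config.json` snippet to `rope_type`. For readers who prefer not to edit `config.json` on disk, the same override can be applied at load time through the config object. The sketch below is illustrative and not part of the README; the 131072-token limit is an assumption that mirrors the `--max-model-len` value used in the next hunk.

```python
# Illustrative sketch (not from the README): apply the YaRN rope_scaling settings
# shown in this hunk at load time instead of editing config.json by hand.
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "Qwen/Qwen3-235B-A22B-FP8"
config = AutoConfig.from_pretrained(model_name)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
config.max_position_embeddings = 131072  # assumed: 32768 * 4, matching the factor above

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```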
@@ -322,12 +299,12 @@ YaRN is currently supported by several inference frameworks, e.g., `transformers`
 
 For `vllm`, you can use
 ```shell
-vllm serve ... --rope-scaling '{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
+vllm serve ... --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
 ```
 
 For `sglang`, you can use
 ```shell
-python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
+python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
 ```
 
 For `llama-server` from `llama.cpp`, you can use