Remove vLLM FP8 Limitation
#2
by simon-mo - opened
README.md
CHANGED
@@ -115,29 +115,6 @@ You can use the Qwen3-4B-FP8 model with serveral inference frameworks, including
 However, please pay attention to the following known issues:
 - `transformers`:
     - there are currently issues with the "fine-grained fp8" method in `transformers` for distributed inference. You may need to set the environment variable `CUDA_LAUNCH_BLOCKING=1` if multiple devices are used in inference.
-- vLLM:
-    - there are currently compatibility issues with `vllm`. For a quick fix, you should make the following changes to `vllm/vllm/model_executor/layers/linear.py`:
-    ```python
-    # these changes are in QKVParallelLinear.weight_loader_v2() of vllm/vllm/model_executor/layers/linear.py
-    ...
-    shard_offset = self._get_shard_offset_mapping(loaded_shard_id)
-    shard_size = self._get_shard_size_mapping(loaded_shard_id)
-
-    # add the following code
-    if isinstance(param, BlockQuantScaleParameter):
-        weight_block_size = self.quant_method.quant_config.weight_block_size
-        block_n, _ = weight_block_size[0], weight_block_size[1]
-        shard_offset = (shard_offset + block_n - 1) // block_n
-        shard_size = (shard_size + block_n - 1) // block_n
-    # end of the modification
-
-    param.load_qkv_weight(loaded_weight=loaded_weight,
-                          num_heads=self.num_kv_head_replicas,
-                          shard_id=loaded_shard_id,
-                          shard_offset=shard_offset,
-                          shard_size=shard_size)
-    ...
-    ```

 ## Switching Between Thinking and Non-Thinking Mode

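For context on the code being removed above: the workaround patched `QKVParallelLinear.weight_loader_v2()` so that, when the parameter being loaded is a `BlockQuantScaleParameter`, the shard offset and size (computed in weight rows) are converted into rows of the per-block FP8 scale tensor, which holds one scale per `block_n` weight rows along the sharded dimension. Below is a minimal standalone sketch of that conversion, illustrative only and not vLLM code; the helper `to_scale_rows` is a made-up name.

```python
# Standalone illustration of the offset/size conversion performed by the removed
# workaround. `to_scale_rows` is a hypothetical helper, not part of vLLM.

def to_scale_rows(offset_rows: int, size_rows: int, block_n: int) -> tuple[int, int]:
    """Map a shard expressed in weight rows to the corresponding region of the
    per-block scale tensor, which stores one scale row per `block_n` weight rows."""
    scale_offset = (offset_rows + block_n - 1) // block_n  # ceiling division
    scale_size = (size_rows + block_n - 1) // block_n
    return scale_offset, scale_size

# Example with 128x128 quantization blocks: a value-projection shard starting at
# weight row 4096 and spanning 1024 rows maps to scale-row offset 32, size 8.
print(to_scale_rows(4096, 1024, 128))  # (32, 8)
```

This change simply drops that manual-patch instruction from the README; the surrounding caveat about `transformers` is kept unchanged.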
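The `transformers` caveat that survives this change is an environment-variable workaround. A minimal sketch of how it would typically be applied when loading the checkpoint across several GPUs, assuming the usual `transformers` loading API and `Qwen/Qwen3-4B-FP8` as the Hub ID for the model this README describes:

```python
import os

# The variable must be set before CUDA is initialized, i.e. before the model
# (or anything else touching the GPU) is loaded.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-4B-FP8"  # assumed Hub ID for this model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",  # shard the model across the available GPUs
)
```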