zhanghanxiao committed
Commit ffdcd3d · verified · 1 Parent(s): 3dab585

Update README.md

Files changed (1):
  1. README.md +38 -0
README.md CHANGED
@@ -171,6 +171,44 @@ curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
 
 More usage can be found [here](https://docs.sglang.ai/basic_usage/send_request.html)
 
+### vLLM
+
+#### Environment Preparation
+
+```bash
+pip install vllm==0.11.0
+```
+
+#### Run Inference
+
+Here is an example of deploying the model across multiple GPU nodes, where the master node IP is ${MASTER_IP}, the server port is ${PORT}, and the model path is ${MODEL_PATH}:
+
+```bash
+# Step 1. Start Ray on all nodes.
+
+# Step 2. Start the vLLM server on node 0 only:
+vllm serve $MODEL_PATH --port $PORT --served-model-name my_model --trust-remote-code --tensor-parallel-size 8 --pipeline-parallel-size 4 --gpu-memory-utilization 0.85
+
+# This is only an example; adjust the arguments to your actual environment.
+```
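+
+Step 1 above is intentionally generic; as a minimal sketch, assuming Ray's default port 6379 and a 4-node cluster (matching `--pipeline-parallel-size 4`, with 8 GPUs per node for `--tensor-parallel-size 8`), it might look like:
+
+```bash
+# On the master node (node 0): start the Ray head process.
+ray start --head --port=6379
+
+# On each of the other nodes: join the cluster via the master's address.
+ray start --address=${MASTER_IP}:6379
+```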
+
+To handle long contexts in vLLM using YaRN, follow these two steps:
+1. Add a `rope_scaling` field to the model's `config.json` file, for example:
+```json
+{
+  ...,
+  "rope_scaling": {
+    "factor": 4.0,
+    "original_max_position_embeddings": 32768,
+    "type": "yarn"
+  }
+}
+```
+2. Use the additional `--max-model-len` argument to specify the desired maximum context length when starting the vLLM service, as in the sketch below.
+
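+For example, assuming the `rope_scaling` values above (4.0 × 32768 = 131072 tokens), the serve command from step 2 might become:
+
+```bash
+# Same command as before, with an explicit maximum context length:
+vllm serve $MODEL_PATH --port $PORT --served-model-name my_model --trust-remote-code --tensor-parallel-size 8 --pipeline-parallel-size 4 --gpu-memory-utilization 0.85 --max-model-len 131072
+```
+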
+For detailed guidance, please refer to the vLLM [documentation](https://docs.vllm.ai/en/latest/).
+
 ## Finetuning
 
 We recommend you use [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) to [finetune Ring](https://github.com/inclusionAI/Ring-V2/blob/main/docs/llamafactory_finetuning.md).