How do I use this correctly with online serving via the vLLM OpenAI-compatible server?
I'm using the command below, but I'm not sure if it's set up correctly.
vllm serve deepseek-ai/DeepSeek-OCR --no-enable-prefix-caching --mm-processor-cache-gb 0 --logits-processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor
Then I'm calling it like this:
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-OCR",
    messages=message,
    temperature=0.0,
    max_tokens=500,
    # ngram logit processor args
    extra_body={
        "ngram_size": 30,
        "window_size": 90,
        "whitelist_token_ids": [128821, 128822],
        "skip_special_tokens": False,  # whitelist: <td>, </td>
    },
)
I am not sure whether the parameters I'm passing are actually having any effect. Can someone explain why those parameters are required and whether this setup is correct?
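For context, here is a minimal sketch of how client and message are built on my side for the OpenAI-compatible endpoint; the server URL, image path, and prompt text below are placeholders, not part of the setup in question:

import base64
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode the page image as a base64 data URL so it can be sent inline.
with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

message = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Free OCR."},  # prompt text is an assumption
        ],
    }
]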
Corrected serving command:
vllm serve deepseek-ai/DeepSeek-OCR --no-enable-prefix-caching --mm-processor-cache-gb 0 --logits-processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor --enable-log-requests --gpu-memory-utilization 0.4 --chat-template /home/ubuntu/llm-ocr-exp/template_deepseek_ocr.jinja
Inference:
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-OCR",
    messages=message,
    temperature=0.0,
    max_tokens=500,
    # ngram logit processor args
    extra_body={
        "vllm_xargs": {
            "ngram_size": 30,
            "window_size": 90,
            # "whitelist_token_ids": [128821, 128822],
        },
        "skip_special_tokens": False,  # whitelist: <td>, </td>
    },
)
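To spell out why this version works: the OpenAI-compatible server only forwards per-request custom logits-processor arguments when they are nested under "vllm_xargs" (they end up in the request's extra sampling args), which is most likely why the flat extra_body keys in the first attempt had no visible effect. The whitelisted token IDs 128821 and 128822 are <td> and </td>, so the n-gram repetition suppressor does not penalize legitimately repetitive table markup. For comparison, a rough offline equivalent looks like the sketch below; it is based on the vLLM DeepSeek-OCR example as I remember it, so treat the exact keyword names (logits_processors, extra_args) as assumptions to verify against your vLLM version:

from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor

# Offline engine with the same n-gram repetition suppressor registered.
llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    logits_processors=[NGramPerReqLogitsProcessor],
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
)

# Per-request args go through extra_args, mirroring "vllm_xargs" on the server.
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=500,
    skip_special_tokens=False,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},  # <td>, </td>
    ),
)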
@dhruvilHV Can I see your --chat-template /home/ubuntu/llm-ocr-exp/template_deepseek_ocr.jinja?
What about the image sizes? Do we need to pass additional arguments to the API call and if so, how? For example, how to signal you want the gundam level of quality through this API call?
Same question here. I don't know how to choose the mode when using vLLM serve.
For people struggling with vLLM: you have to be on the latest dev version (from GitHub) to get it working correctly.
vLLM already has a good guide on how to do this for DeepSeek-OCR.
My setup is Linux with a CUDA Turing GPU. Their guide didn't work for my setup, but here is what did:
git clone https://github.com/vllm-project/vllm.git
cd vllm
uv venv --python 3.13 --seed
source .venv/bin/activate
python use_existing_torch.py
uv pip install -r requirements/build.txt
uv pip install torch torchvision
uv pip install --no-build-isolation -e . --prerelease=allow
This gave me the latest vLLM version:
vllm --version
0.11.1rc6.dev158+gc3ee80a01.d20251106.cu130
You can then follow the guide from vLLM:
vllm serve models/DeepSeek-OCR --logits_processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor --no-enable-prefix-caching --mm-processor-cache-gb 0
where models/DeepSeek-OCR is the path to the downloaded model.
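Once the server is running, a quick sanity check that it is reachable and serving the model (assuming the default port 8000):

from openai import OpenAI

# List the models exposed by the local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
for model in client.models.list():
    print(model.id)  # should show the served model name, e.g. models/DeepSeek-OCR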
@dhruvilHV which version of vLLM are you using?
I found that version 0.11 (the current latest release) does not work with DeepSeek-OCR. You will need the latest one from their GitHub page; refer to my earlier guide.
I'm confused about how to choose exactly which mode I want (Gundam, Tiny, Large, ...) when calling the API with vLLM.
Same issue here; I can't find a way to set the mode (Large, Gundam, etc.).
I found the issue with setting the mode when serving DeepSeek-OCR via vLLM.
vLLM currently uses Gundam mode (base_size=1024, image_size=640, crop_mode=True). This is hardcoded and cannot be changed via environment variables yet.
If you need a different mode, fork vLLM and modify the constants in:
https://github.com/vllm-project/vllm/blob/main/vllm/transformers_utils/processors/deepseek_ocr.py#L10-L13
vLLM plans to expose this as mm_processor_kwargs in the future:
https://github.com/vllm-project/vllm/blob/main/vllm/transformers_utils/processors/deepseek_ocr.py#L15
(See comment: "TODO(Isotr0py): Expose as mm_kwargs")
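For reference, the resolution modes documented on the DeepSeek-OCR model card map onto those three constants roughly as follows (values quoted from memory, so double-check against the model card before patching vLLM):

# (base_size, image_size, crop_mode) per DeepSeek-OCR resolution mode;
# Gundam is what vLLM currently hardcodes at the lines linked above.
DEEPSEEK_OCR_MODES = {
    "tiny":   (512,  512,  False),
    "small":  (640,  640,  False),
    "base":   (1024, 1024, False),
    "large":  (1280, 1280, False),
    "gundam": (1024, 640,  True),
}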