How to use it correctly with online serving via the vLLM OpenAI-compatible server?

#55
by dhruvil237 - opened

Using the command below, but I'm not sure if it's set up correctly:
vllm serve deepseek-ai/DeepSeek-OCR --no-enable-prefix-caching --mm-processor-cache-gb 0 --logits-processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor

then calling it this way:
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-OCR",
    messages=message,
    temperature=0.0,
    max_tokens=500,
    # ngram logit processor args
    extra_body={
        "ngram_size": 30,
        "window_size": 90,
        "whitelist_token_ids": [128821, 128822],
        "skip_special_tokens": False,  # whitelist: <td>, </td>
    }
)

I am not sure if the parameters passed are affecting anything.
Can someone explain why those parameters are required and whether they are set up correctly?

corrected serving command:
vllm serve deepseek-ai/DeepSeek-OCR --no-enable-prefix-caching --mm-processor-cache-gb 0 --logits-processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor --enable-log-requests --gpu-memory-utilization 0.4 --chat-template /home/ubuntu/llm-ocr-exp/template_deepseek_ocr.jinja

inference:

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-OCR",
    messages=message,
    temperature=0.0,
    max_tokens=500,
    # ngram logit processor args
    extra_body={
        "vllm_xargs": {
            "ngram_size": 30,
            "window_size": 90,
            # "whitelist_token_ids": [128821, 128822],
        },
        "skip_special_tokens": False,  # whitelist: <td>, </td>
    }
)
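
For completeness, here is a minimal sketch of how `client` and `message` could be built for a single page image. The base URL, API key, and file path are placeholders for a default local deployment, and the prompt text follows the model card; it should match whatever your chat template expects.

import base64
from openai import OpenAI

# Assumed default local vLLM endpoint; adjust host/port and API key to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode a local page image as a data URL (the path is a placeholder).
with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

message = [
    {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            # Prompt taken from the DeepSeek-OCR model card; adapt to your chat template.
            {"type": "text", "text": "<|grounding|>Convert the document to markdown."},
        ],
    }
]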

@dhruvilHV Can I see your --chat-template /home/ubuntu/llm-ocr-exp/template_deepseek_ocr.jinja?

What about the image sizes? Do we need to pass additional arguments to the API call, and if so, how? For example, how do you signal that you want the Gundam level of quality through this API call?


Same question here, I don't know how to choose the model when using vLLM serve.

@dhruvilHV which version of vLLM are you using?

For people struggling with vLLM, you have to be on the latest dev version (from GitHub) to get it working correctly.
vLLM already has a good guide on how to do it for DeepSeek-OCR.

My setup is Linux with a CUDA Turing GPU. Their guide didn't work for my setup, but here is what worked:

git clone https://github.com/vllm-project/vllm.git
cd vllm
uv venv --python 3.13 --seed
source .venv/bin/activate
python use_existing_torch.py
uv pip install -r requirements/build.txt
uv pip install torch torchvision
uv pip install --no-build-isolation -e . --prerelease=allow

This gave me the latest vLLM version:

vllm --version
0.11.1rc6.dev158+gc3ee80a01.d20251106.cu130

You can then follow the guide from vLLM:

vllm serve  models/DeepSeek-OCR --logits_processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor --no-enable-prefix-caching --mm-processor-cache-gb 0

Here, models/DeepSeek-OCR is the path to the downloaded model.
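
Once the server is up, a quick sanity check could look like this (the local port and empty API key are assumptions about a default local deployment):

from openai import OpenAI

# Assumed default local vLLM endpoint; adjust the port if you changed it.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The served model name should match the path you passed to `vllm serve`.
print([m.id for m in client.models.list().data])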

@dhruvilHV which version of vLLM are you using?

I found that version 0.11 (the latest current release) does not work with DeepSeek-OCR. You will need the latest one from their GitHub page; refer to my earlier guide.

I'm confused about how to choose exactly which mode I want to use (Gundam, Tiny, Large, ...) when calling the API with vLLM.

Same issue here, can't find a way to set the mode: Large, Gundam, etc.

I found an issue with setting the mode when serving DeepSeek-OCR via vLLM.
vLLM currently uses Gundam mode (base_size=1024, image_size=640, crop_mode=True). This is HARDCODED and cannot be changed via environment variables yet.

If you need to change the mode, fork vLLM and modify the constants in:
https://github.com/vllm-project/vllm/blob/main/vllm/transformers_utils/processors/deepseek_ocr.py#L10-L13

vLLM plans to expose this as mm_processor_kwargs in the future:
https://github.com/vllm-project/vllm/blob/main/vllm/transformers_utils/processors/deepseek_ocr.py#L15
(See comment: "TODO(Isotr0py): Expose as mm_kwargs")
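
For reference, a rough sketch of what that edit could look like; the constant names below are illustrative (check the linked file for the actual ones), and the per-mode values are taken from the DeepSeek-OCR model card.

# vllm/transformers_utils/processors/deepseek_ocr.py (names illustrative; see the link above)
# Mode presets from the DeepSeek-OCR model card:
#   Tiny:   base_size=512,  image_size=512,  crop_mode=False
#   Small:  base_size=640,  image_size=640,  crop_mode=False
#   Base:   base_size=1024, image_size=1024, crop_mode=False
#   Large:  base_size=1280, image_size=1280, crop_mode=False
#   Gundam: base_size=1024, image_size=640,  crop_mode=True  (current hardcoded default)

BASE_SIZE = 1280    # e.g. switch from the Gundam default (1024) to Large
IMAGE_SIZE = 1280   # Gundam default is 640
CROP_MODE = False   # Gundam default is True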
