How do I use this correctly with online serving via the vLLM OpenAI-compatible server?
I'm using the command below, but I'm not sure if it's set up correctly.
vllm serve deepseek-ai/DeepSeek-OCR --no-enable-prefix-caching --mm-processor-cache-gb 0 --logits-processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor
Then I'm calling it like this:
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-OCR",
    messages=message,
    temperature=0.0,
    max_tokens=500,
    # ngram logit processor args
    extra_body={
        "ngram_size": 30,
        "window_size": 90,
        "whitelist_token_ids": [128821, 128822],
        "skip_special_tokens": False,  # whitelist: <td>, </td>
    },
)
I am not sure whether the parameters I'm passing are actually having any effect. Can someone explain why those parameters are required and whether this setup is correct?
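For context, here is a minimal sketch of how client and message are built on my side for the OpenAI-compatible endpoint; the server URL, image path, and prompt text below are placeholders, not part of the setup in question:

import base64
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode the page image as a base64 data URL so it can be sent inline.
with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

message = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Free OCR."},  # prompt text is an assumption
        ],
    }
]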
Corrected serving command:
vllm serve deepseek-ai/DeepSeek-OCR --no-enable-prefix-caching --mm-processor-cache-gb 0 --logits-processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor --enable-log-requests --gpu-memory-utilization 0.4 --chat-template /home/ubuntu/llm-ocr-exp/template_deepseek_ocr.jinja
Inference:
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-OCR",
    messages=message,
    temperature=0.0,
    max_tokens=500,
    # ngram logit processor args
    extra_body={
        "vllm_xargs": {
            "ngram_size": 30,
            "window_size": 90,
            # "whitelist_token_ids": [128821, 128822],
        },
        "skip_special_tokens": False,  # whitelist: <td>, </td>
    },
)
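To spell out why this version works: the OpenAI-compatible server only forwards per-request custom logits-processor arguments when they are nested under "vllm_xargs" (they end up in the request's extra sampling args), which is most likely why the flat extra_body keys in the first attempt had no visible effect. The whitelisted token IDs 128821 and 128822 are <td> and </td>, so the n-gram repetition suppressor does not penalize legitimately repetitive table markup. For comparison, a rough offline equivalent looks like the sketch below; it is based on the vLLM DeepSeek-OCR example as I remember it, so treat the exact keyword names (logits_processors, extra_args) as assumptions to verify against your vLLM version:

from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor

# Offline engine with the same n-gram repetition suppressor registered.
llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    logits_processors=[NGramPerReqLogitsProcessor],
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
)

# Per-request args go through extra_args, mirroring "vllm_xargs" on the server.
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=500,
    skip_special_tokens=False,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},  # <td>, </td>
    ),
)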
@dhruvilHV Can I see your --chat-template /home/ubuntu/llm-ocr-exp/template_deepseek_ocr.jinja?
What about the image sizes? Do we need to pass additional arguments to the API call and if so, how? For example, how to signal you want the gundam level of quality through this API call?
Same question here. I don't know how to choose the mode when using vLLM serve.
For people struggling with vLLM: you have to be on the latest dev version (from GitHub) to get it working correctly.
vLLM already has a good guide on how to do this for DeepSeek-OCR.
My setup is Linux with a CUDA Turing GPU. Their guide didn't work for my setup, but here is what did:
git clone https://github.com/vllm-project/vllm.git
cd vllm
uv venv --python 3.13 --seed
source .venv/bin/activate
python use_existing_torch.py
uv pip install -r requirements/build.txt
uv pip install torch torchvision
uv pip install --no-build-isolation -e . --prerelease=allow
This gave me the latest vLLM version:
vllm --version
0.11.1rc6.dev158+gc3ee80a01.d20251106.cu130
You can then follow the guide from vLLM:
vllm serve models/DeepSeek-OCR --logits_processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor --no-enable-prefix-caching --mm-processor-cache-gb 0
where models/DeepSeek-OCR is the path to the downloaded model.
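Once the server is running, a quick sanity check that it is reachable and serving the model (assuming the default port 8000):

from openai import OpenAI

# List the models exposed by the local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
for model in client.models.list():
    print(model.id)  # should show the served model name, e.g. models/DeepSeek-OCR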
@dhruvilHV which version of vLLM are you using?
I found that version 0.11 (the current latest release) does not work with DeepSeek-OCR. You will need the latest one from their GitHub page; refer to my earlier guide.
I'm confused about how to choose exactly which mode I want (Gundam, Tiny, Large, ...) when calling the API with vLLM.
Same issue here; I can't find a way to set the mode (Large, Gundam, etc.).
I found the issue with setting the mode when serving DeepSeek-OCR via vLLM.
vLLM currently uses Gundam mode (base_size=1024, image_size=640, crop_mode=True). This is hardcoded and cannot be changed via environment variables yet.
If you need a different mode, fork vLLM and modify the constants in:
https://github.com/vllm-project/vllm/blob/main/vllm/transformers_utils/processors/deepseek_ocr.py#L10-L13
vLLM plans to expose this as mm_processor_kwargs in the future:
https://github.com/vllm-project/vllm/blob/main/vllm/transformers_utils/processors/deepseek_ocr.py#L15
(See comment: "TODO(Isotr0py): Expose as mm_kwargs")
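For reference, the resolution modes documented on the DeepSeek-OCR model card map onto those three constants roughly as follows (values quoted from memory, so double-check against the model card before patching vLLM):

# (base_size, image_size, crop_mode) per DeepSeek-OCR resolution mode;
# Gundam is what vLLM currently hardcodes at the lines linked above.
DEEPSEEK_OCR_MODES = {
    "tiny":   (512,  512,  False),
    "small":  (640,  640,  False),
    "base":   (1024, 1024, False),
    "large":  (1280, 1280, False),
    "gundam": (1024, 640,  True),
}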