
## MERaLiON-AudioLLM vLLM Serving

> [!IMPORTANT]
> MERaLiON-AudioLLM is trained on 30-second audio clips. This vLLM integration supports audio inputs of up to 4 minutes.
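
Given these limits, it can help to check a clip's duration before sending it to the model. The following is a minimal sketch that assumes PCM WAV input and uses only the standard library; `wav_duration_seconds` and `classify_duration` are hypothetical helper names, and the labels they return are illustrative rather than part of the integration.

```python
import io
import wave

TRAINED_SECONDS = 30      # training clip length stated in the note above
MAX_SECONDS = 4 * 60      # upper bound stated in the note above


def wav_duration_seconds(source):
    """Return the duration of a PCM WAV file given a path or a binary file object."""
    with wave.open(source, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def classify_duration(seconds):
    """Label a clip as 'ok', 'long' (beyond training length) or 'too_long'."""
    if seconds > MAX_SECONDS:
        return "too_long"
    if seconds > TRAINED_SECONDS:
        return "long"
    return "ok"


# Self-check on a synthetic one-second, 16 kHz, 16-bit mono clip of silence.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(16000)
    wav_file.writeframes(b"\x00\x00" * 16000)
buffer.seek(0)
print(classify_duration(wav_duration_seconds(buffer)))
```

For MP3 or other compressed formats, the duration would need to come from an audio library instead of the `wave` module.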

### Set up Environment

MERaLiON-AudioLLM requires vLLM version `0.6.4.post1` and transformers version `4.46.3`:

```bash
pip install vllm==0.6.4.post1
pip install transformers==4.46.3
```
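
Because the integration depends on exact version pins, a quick check of the active environment can save a confusing startup failure. The sketch below uses only the standard library; `REQUIRED` and `find_mismatches` are hypothetical names, and the dictionary should mirror the versions pinned in this section.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed pins; adjust to the versions required by this section.
REQUIRED = {"vllm": "0.6.4.post1", "transformers": "4.46.3"}


def find_mismatches(required):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for package, expected in required.items():
        try:
            found = version(package)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


for package, (expected, found) in find_mismatches(REQUIRED).items():
    print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

The script prints nothing when every pin matches.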

Following the recommendation in the [vLLM documentation](https://docs.vllm.ai/en/stable/models/adding_model.html#out-of-tree-model-integration),
we register the model through the [vLLM plugin system](https://docs.vllm.ai/en/stable/design/plugin_system.html#plugin-system). Install the plugin package with:


```bash
pip install .
```
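
To confirm that the plugin was picked up after installation, one option is to list the entry points in the `vllm.general_plugins` group, which is the group the vLLM plugin documentation describes for general plugins. The name this package registers is not stated here, so the sketch below (with the hypothetical helper `list_vllm_plugins`) simply prints whatever is installed.

```python
import sys
from importlib.metadata import entry_points

PLUGIN_GROUP = "vllm.general_plugins"  # group name from the vLLM plugin docs


def list_vllm_plugins(group=PLUGIN_GROUP):
    """Return sorted 'name = module:function' strings for installed plugins in the group."""
    if sys.version_info >= (3, 10):
        found = entry_points(group=group)
    else:
        found = entry_points().get(group, [])  # older API returns a dict
    return sorted(f"{ep.name} = {ep.value}" for ep in found)


print(list_vllm_plugins() or "no vLLM plugins installed")
```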


### Offline Inference

Here is an example of offline inference using our custom vLLM class. 

```python
import torch
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

model_name = "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION"

llm = LLM(model=model_name,
          tokenizer=model_name,
          limit_mm_per_prompt={"audio": 1},
          trust_remote_code=True,
          dtype=torch.bfloat16
          )

audio_asset = AudioAsset("mary_had_lamb")

question = "Please transcribe this speech."
audio_in_prompt = "Given the following audio context: <SpeechHere>\n\n"

prompt = ("<start_of_turn>user\n"
          f"{audio_in_prompt}Text instruction: {question}<end_of_turn>\n"
          "<start_of_turn>model\n")

sampling_params = SamplingParams(
    temperature=1,
    top_p=0.9,  # must lie in (0, 1]
    top_k=50,
    repetition_penalty=1.1,
    seed=42,
    max_tokens=1024,
    stop_token_ids=None
)

mm_data = {"audio": [audio_asset.audio_and_sample_rate]}
inputs = {"prompt": prompt, "multi_modal_data": mm_data}

# batch inference
inputs = [inputs] * 2

outputs = llm.generate(inputs, sampling_params=sampling_params)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
```
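
The example above repeats one input twice. A batch can also mix different instructions over the same audio, as long as each prompt follows the same template. The sketch below is illustrative only: `build_prompt` and `build_batch` are hypothetical helpers that reuse the template strings from the example, and the dummy `audio` value stands in for a real `(waveform, sample_rate)` tuple.

```python
AUDIO_PLACEHOLDER = "<SpeechHere>"  # placeholder token used in the example above


def build_prompt(question):
    """Wrap a text instruction in the chat template shown in the offline example."""
    audio_context = f"Given the following audio context: {AUDIO_PLACEHOLDER}\n\n"
    return (
        "<start_of_turn>user\n"
        f"{audio_context}Text instruction: {question}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


def build_batch(questions, audio):
    """Pair each question with the same audio in the input format used above."""
    return [
        {"prompt": build_prompt(q), "multi_modal_data": {"audio": [audio]}}
        for q in questions
    ]


# `audio` stands in for a (waveform, sample_rate) tuple such as
# AudioAsset(...).audio_and_sample_rate; a dummy value keeps the sketch runnable.
batch = build_batch(
    ["Please transcribe this speech.", "Please summarise the content of this speech."],
    audio=([0.0, 0.0], 16000),
)
print(batch[1]["prompt"])
```

The resulting list can be passed to `llm.generate` in place of `inputs`.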

### Serving

Here is an example to start the server via the `vllm serve` command.

```bash
export HF_TOKEN=<your-hf-token>

vllm serve MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION \
    --tokenizer MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION \
    --trust-remote-code \
    --dtype bfloat16 \
    --port 8000
```
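
Loading the model can take a while, so a client script may want to wait until the server responds before sending requests. The following is a minimal sketch that polls the OpenAI-compatible `/v1/models` route with the standard library; `wait_for_server` and its default timeout values are assumptions, not part of vLLM.

```python
import time
import urllib.request


def wait_for_server(base_url="http://localhost:8000/v1", timeout_s=300.0, interval_s=2.0):
    """Poll `<base_url>/models` until it answers with HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=5) as response:
                if response.status == 200:
                    return True
        except OSError:
            pass  # server not accepting connections yet (URLError is an OSError)
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_server()` returns `True` once the server is reachable and `False` if the timeout elapses first.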

To call the server, you can use the [official OpenAI client](https://github.com/openai/openai-python):

```python
import base64

from openai import OpenAI


def get_client(api_key="EMPTY", base_url="http://localhost:8000/v1"):
    client = OpenAI(
        api_key=api_key,
        base_url=base_url,
    )

    models = client.models.list()
    model_name = models.data[0].id
    return client, model_name


def get_response(text_input, base64_audio_input, **params):
    response_obj = client.chat.completions.create(
        messages=[{
            "role":
            "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Text instruction: {text_input}"
                },
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": f"data:audio/ogg;base64,{base64_audio_input}"
                    },
                },
            ],
        }],
        **params
    )
    return response_obj


# Specify the input and generation parameters.
possible_text_inputs = [
    "Please transcribe this speech.",
    "Please summarise the content of this speech.",
    "Please follow the instruction in this speech."
]

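# Read the audio file (WAV or MP3) and base64-encode it for the request body.
with open("/path/to/wav/or/mp3/file", "rb") as audio_file:
    audio_bytes = audio_file.read()
audio_base64 = base64.b64encode(audio_bytes).decode('utf-8')

# Use the host and port of the running vLLM server.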
client, model_name = get_client(base_url="http://localhost:8000/v1")

generation_parameters = dict(
    model=model_name,
    max_completion_tokens=1024,
    temperature=1,
    top_p=0.9,
    extra_body={
        "repetition_penalty": 1.1,
        "top_k": 50,
        "length_penalty": 1.0
    },
    seed=42
)


response_obj = get_response(possible_text_inputs[0], audio_base64, **generation_parameters)
print(response_obj.choices[0].message.content)
```

Alternatively, you can call the server with curl, as in the example below. We recommend passing the generation config in the JSON body to fully reproduce the reported performance.

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION",
        "messages": [
            {"role": "user",
            "content": [
                {"type": "text", "text": "Text instruction: <your-instruction>"},
                {"type": "audio_url", "audio_url": {"url": "data:audio/ogg;base64,<your-audio-base64-string>"}}
            ]
            }
        ],
        "max_completion_tokens": 1024,
        "temperature": 1,
        "top_p": 0.9,
        "seed": 42
    }'
```
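
Base64-encoded audio is usually too long to paste into a shell command by hand. One workaround is to generate the JSON body with a short script and let curl read it from a file. The sketch below assumes a hypothetical helper `build_payload` and uses placeholder bytes in place of a real audio file.

```python
import base64
import json


def build_payload(instruction, audio_bytes, model="MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION"):
    """Build the chat-completions request body used in the curl example."""
    audio_base64 = base64.b64encode(audio_bytes).decode("utf-8")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Text instruction: {instruction}"},
                {"type": "audio_url",
                 "audio_url": {"url": f"data:audio/ogg;base64,{audio_base64}"}},
            ],
        }],
        "max_completion_tokens": 1024,
        "temperature": 1,
        "top_p": 0.9,
        "seed": 42,
    }


# Placeholder bytes stand in for the contents of a real audio file.
payload = build_payload("Please transcribe this speech.", b"placeholder-audio-bytes")
print(json.dumps(payload))
```

Redirecting the script's output to a file such as `payload.json` lets curl send it with `-d @payload.json`.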


### Inference Performance Benchmark

We report the average **Time To First Token** (**TTFT**, in ms) and **Inter-Token Latency** (**ITL**, in ms) for a vLLM instance running on a single H100 GPU and on a single A100 GPU.

Input: 120 speech recognition prompts for each input audio length and concurrency combination.\
Output: The corresponding output length of these prompts.
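
For readers unfamiliar with the two metrics, the sketch below computes them from token arrival timestamps using their common definitions. The benchmark's exact measurement procedure is not described here, so treat this as an illustration; the timestamps are made up and the function names are hypothetical.

```python
def ttft_ms(request_sent_s, token_times_s):
    """Time To First Token: delay between sending the request and the first token."""
    return (token_times_s[0] - request_sent_s) * 1000.0


def mean_itl_ms(token_times_s):
    """Inter-Token Latency: mean gap between consecutive output tokens."""
    gaps = [later - earlier for earlier, later in zip(token_times_s, token_times_s[1:])]
    if not gaps:
        raise ValueError("need at least two tokens to measure ITL")
    return sum(gaps) / len(gaps) * 1000.0


# Made-up timestamps (seconds) for one request: sent at t=0, four tokens received.
sent = 0.0
arrivals = [0.10, 0.11, 0.12, 0.13]
print(round(ttft_ms(sent, arrivals), 1), round(mean_itl_ms(arrivals), 1))
```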


<p style="text-align: center;"><strong>Single NVIDIA H100 GPU (80GiB GPU memory)</strong></p>

<table style="margin: 0px auto;">
  <thead>
    <tr>
      <th>Input Audio Length</th>
      <th style="text-align: center;" colspan="2">30s</th>
      <th style="text-align: center;" colspan="2">1min</th>
      <th style="text-align: center;" colspan="2">2mins</th>
    </tr>
    <tr>
      <th>Concurrent requests</th>
      <th>TTFT (ms)</th>
      <th>ITL (ms)</th>
      <th>TTFT (ms)</th>
      <th>ITL (ms)</th>
      <th>TTFT (ms)</th>
      <th>ITL (ms)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>85.8</td>
      <td>9.9</td>
      <td>126.4</td>
      <td>9.6</td>
      <td>214.5</td>
      <td>9.7</td>
    </tr>
    <tr>
      <td>4</td>
      <td>96.9</td>
      <td>11.4</td>
      <td>159.6</td>
      <td>11.1</td>
      <td>258.1</td>
      <td>11.2</td>
    </tr>
    <tr>
      <td>8</td>
      <td>109.6</td>
      <td>13.0</td>
      <td>206.5</td>
      <td>12.7</td>
      <td>261.9</td>
      <td>13.0</td>
    </tr>
    <tr>
      <td>16</td>
      <td>149.9</td>
      <td>16.3</td>
      <td>236.7</td>
      <td>16.2</td>
      <td>299.0</td>
      <td>16.8</td>
    </tr>
  </tbody>
</table>

<p style="text-align: center;"><strong>Single NVIDIA A100 GPU (40GiB GPU memory)</strong></p>

<table style="margin: 0px auto;">
  <thead>
    <tr>
      <th>Input Audio Length</th>
      <th style="text-align: center;" colspan="2">30s</th>
      <th style="text-align: center;" colspan="2">1min</th>
      <th style="text-align: center;" colspan="2">2mins</th>
    </tr>
    <tr>
      <th>Concurrent requests</th>
      <th>TTFT (ms)</th>
      <th>ITL (ms)</th>
      <th>TTFT (ms)</th>
      <th>ITL (ms)</th>
      <th>TTFT (ms)</th>
      <th>ITL (ms)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>162.6</td>
      <td>18.0</td>
      <td>195.0</td>
      <td>18.3</td>
      <td>309.9</td>
      <td>18.6</td>
    </tr>
    <tr>
      <td>4</td>
      <td>159.1</td>
      <td>21.1</td>
      <td>226.9</td>
      <td>21.2</td>
      <td>329.5</td>
      <td>21.6</td>
    </tr>
    <tr>
      <td>8</td>
      <td>176.5</td>
      <td>25.2</td>
      <td>305.4</td>
      <td>24.8</td>
      <td>352.5</td>
      <td>25.5</td>
    </tr>
    <tr>
      <td>16</td>
      <td>196.0</td>
      <td>32.0</td>
      <td>329.4</td>
      <td>31.9</td>
      <td>414.7</td>
      <td>33.4</td>
    </tr>
  </tbody>
</table>
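
The table values can be turned into rough per-request estimates. The sketch below converts ITL into a decode rate and approximates end-to-end latency as TTFT plus one ITL per remaining token. This is a simplification that ignores network and queueing effects, and the 100-token output length is an arbitrary example rather than a benchmark setting.

```python
def decode_tokens_per_second(itl_ms):
    """Per-request decode rate implied by an inter-token latency."""
    return 1000.0 / itl_ms


def estimated_latency_ms(ttft_ms, itl_ms, output_tokens):
    """Rough end-to-end latency: first token, then one ITL per remaining token."""
    return ttft_ms + (output_tokens - 1) * itl_ms


# (TTFT ms, ITL ms) for 30 s audio at 1 concurrent request, copied from the tables above.
measurements = {"H100": (85.8, 9.9), "A100": (162.6, 18.0)}

for gpu, (ttft, itl) in measurements.items():
    rate = decode_tokens_per_second(itl)
    latency = estimated_latency_ms(ttft, itl, output_tokens=100)  # arbitrary example length
    print(f"{gpu}: {rate:.1f} tokens/s per request, ~{latency:.0f} ms for 100 tokens")
```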