---
license: cc-by-nc-4.0
datasets:
- NandemoGHS/Galgame_Gemini_Captions
language:
- ja
base_model:
- Qwen/Qwen3-Omni-30B-A3B-Captioner
---

# Anime-Speech-Japanese-Captioner

This model is a fine-tuned version of [Qwen/Qwen3-Omni-30B-A3B-Captioner](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Captioner).

This is an audio captioning model specialized for Japanese anime-style or game-style speech. It takes an audio input and generates a detailed description in Japanese, including emotion, speaker profile, mood, speed, prosody, pitch/timbre, style, and an overall caption.

It was fine-tuned using the [NandemoGHS/Galgame_Gemini_Captions](https://huggingface.co/datasets/NandemoGHS/Galgame_Gemini_Captions) dataset.

Training was done with the [ms-swift](https://github.com/modelscope/ms-swift) library using its Megatron backend.

## Intended Use and Limitations

This model is specifically designed for **Japanese game-style or anime-style speech**.

Due to the nature of its training data, it is **not expected to perform well** on:

  * Languages other than Japanese.
  * General conversational speech (e.g., meetings, casual dialogue).

## How to Use (Inference)

We recommend using `vLLM` for inference.

### vLLM Installation Requirements

This model requires building `vLLM` from a recent development commit as it is not yet supported in the latest stable release (v0.11.0 as of this writing).

It has been tested and confirmed to work with commit `18961c5ea62976efc50525b72e40337993c5e4f9`. You must build vLLM from source:

```bash
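git clone https://github.com/vllm-project/vllm.git
cd vllm
# Optional: pin to the commit this model was tested with
git checkout 18961c5ea62976efc50525b72e40337993c5e4f9
uv pip install . --torch-backend=auto -v --prerelease=allow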
```

This requirement will likely be unnecessary after the `v0.11.1` release.

### Inference Example

Here is a simple inference script using `vLLM`:

```python
import os
import torch

from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

if __name__ == '__main__':
    # vLLM engine v1 not supported yet
    os.environ['VLLM_USE_V1'] = '0'

    MODEL_PATH = "NandemoGHS/Anime-Speech-Japanese-Captioner-FP8-DYNAMIC"

    llm = LLM(
            model=MODEL_PATH, trust_remote_code=True, gpu_memory_utilization=0.95,
            tensor_parallel_size=torch.cuda.device_count(),
            limit_mm_per_prompt={'audio': 1},
            max_num_seqs=8,
            max_model_len=8192,
            seed=100,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=4096,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

    # Example audio file
    audio_path = "https://huggingface.co/NandemoGHS/Anime-Speech-Japanese-Captioner/resolve/main/examples/example1.wav"

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": audio_path}
            ],
        }
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    audios, _, _ = process_mm_info(messages, use_audio_in_video=False)

    inputs = {
        'prompt': text,
        'multi_modal_data': {},
    }

    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)

    print(outputs[0].outputs[0].text)
```

#### Example Output

This is the caption generated for [this example](https://huggingface.co/NandemoGHS/Anime-Speech-Japanese-Captioner/resolve/main/examples/example1.wav).

```
emotion: ecstatic
profile: お嬢様風の女性声
mood: 快楽、絶頂
speed: 途切れ途切れ
prosody: 息遣いが荒く、感情の起伏が激しい
pitch_timbre: 高め、息多め、喘ぎ声
style: 喘ぎ
notes: 喘ぎ声と吐息が混じり、性的興奮が非常に高い状態。
caption: お嬢様風の女性が快楽に溺れ、喘ぎながら話す。息遣いが荒く、途切れ途切れに感情を爆発させる。性的興奮が最高潮に達している。
```
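
To caption several clips in one call, the request-building steps from the script above can be factored into a helper. The following is a minimal sketch, not part of the original script: `build_request` is a hypothetical name, and the processor and `process_mm_info` are passed in as arguments so the helper does not depend on how they were loaded.

```python
from typing import Any, Callable


def build_request(
    audio_path: str,
    processor: Any,
    process_mm_info: Callable[..., tuple],
) -> dict:
    """Build one vLLM request dict for a single audio clip.

    Hypothetical helper: it repeats the steps of the script above
    (chat template, audio loading, input dict) for one file.
    """
    messages = [
        {
            "role": "user",
            "content": [{"type": "audio", "audio": audio_path}],
        }
    ]
    # Render the chat prompt as text; vLLM tokenizes it later.
    prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    # Load the audio referenced in the messages.
    audios, _, _ = process_mm_info(messages, use_audio_in_video=False)

    request = {"prompt": prompt, "multi_modal_data": {}}
    if audios is not None:
        request["multi_modal_data"]["audio"] = audios
    return request
```

With `llm`, `processor`, and `sampling_params` set up as in the script, `llm.generate([build_request(p, processor, process_mm_info) for p in paths], sampling_params=sampling_params)` should return one result per clip. Here `paths` is an illustrative list of audio files or URLs, and each request still carries a single audio input, matching `limit_mm_per_prompt={'audio': 1}`.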

### Notebook Example

For a more detailed walkthrough, please see the **[inference_example.ipynb](https://huggingface.co/NandemoGHS/Anime-Speech-Japanese-Captioner/blob/main/inference_example.ipynb)** notebook.

## Output Format

The model outputs a structured description of the audio in Japanese, following this format:

```
emotion: {Emotion of the speech}
profile: {Speaker profile}
mood: {Mood of the speech}
speed: {Speaking speed}
prosody: {Prosody, rhythm}
pitch_timbre: {Pitch, voice quality}
style: {Style of utterance}
notes: {Other relevant notes}
caption: {A comprehensive caption integrating all elements}
```
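
Because each field sits on its own `key: value` line, the output can be turned into a dictionary with a few lines of Python. The sketch below is an illustration rather than part of the model's tooling: `parse_caption` and `FIELDS` are hypothetical names, and it assumes each field value fits on a single line, which the example output suggests but the format does not guarantee.

```python
FIELDS = (
    "emotion", "profile", "mood", "speed", "prosody",
    "pitch_timbre", "style", "notes", "caption",
)


def parse_caption(text: str) -> dict[str, str]:
    """Parse the model's `key: value` lines into a dict.

    Lines whose key is not one of FIELDS are ignored, so stray text
    before or after the structured block does not break parsing.
    """
    result: dict[str, str] = {}
    for line in text.splitlines():
        # Split on the first ASCII colon only; later colons stay in the value.
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in FIELDS:
            result[key] = value.strip()
    return result


# Illustrative input (made up for this example, not real model output).
sample = "emotion: happy\nspeed: 速め\npitch_timbre:高め、明るい"
parsed = parse_caption(sample)
print(parsed["emotion"])  # happy
```

Fields missing from the output are simply absent from the returned dictionary, so callers may prefer `parsed.get("caption")` over direct indexing.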

## License

This model is licensed under the **[CC-BY-NC-4.0 license](https://creativecommons.org/licenses/by-nc/4.0/)**.

In addition, the training data includes outputs generated by **Gemini 2.5 Pro**. Therefore, **any use that competes with Gemini or violates its terms of service is strictly prohibited.**