---
license: cc-by-nc-4.0
datasets:
- NandemoGHS/Galgame_Gemini_Captions
language:
- ja
base_model:
- Qwen/Qwen3-Omni-30B-A3B-Captioner
---

# Anime-Speech-Japanese-Captioner

This model is a fine-tuned version of [Qwen/Qwen3-Omni-30B-A3B-Captioner](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Captioner).

This is an audio captioning model specialized for Japanese anime-style or game-style speech. It takes an audio input and generates a detailed description in Japanese, including emotion, speaker profile, mood, speed, prosody, pitch/timbre, style, and an overall caption.

It was fine-tuned using the [NandemoGHS/Galgame_Gemini_Captions](https://huggingface.co/datasets/NandemoGHS/Galgame_Gemini_Captions) dataset.

Training was conducted with the [ms-swift](https://github.com/modelscope/ms-swift) library using the Megatron backend.

## Intended Use and Limitations

This model is specifically designed for **Japanese game-style or anime-style speech**.

Due to the nature of its training data, it is **not expected to perform well** on:

* Languages other than Japanese.
* General conversational speech (e.g., meetings, casual dialogue).

## How to Use (Inference)

We recommend using `vLLM` for inference.

### vLLM Installation Requirements

This model requires building `vLLM` from a recent development commit, as support is not yet included in the latest stable release (v0.11.0 as of this writing).

It has been tested and confirmed to work with commit `18961c5ea62976efc50525b72e40337993c5e4f9`. Build vLLM from source as follows:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
uv pip install . --torch-backend=auto -v --prerelease=allow
```

This requirement will likely become unnecessary once `v0.11.1` is released.
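
To sanity-check the environment after building, you can print the installed version before loading the model. This is only an illustrative check; the exact development version string depends on the commit you built:

```python
import torch
import vllm

# A source build reports a development version string (for example a
# "0.11.1.dev..." style version), while a stable wheel reports "0.11.0".
print("vLLM version:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())
```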

### Inference Example

Here is a simple inference script using `vLLM`:

```python
import os
import torch

from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

if __name__ == '__main__':
    # vLLM engine v1 not supported yet
    os.environ['VLLM_USE_V1'] = '0'

    MODEL_PATH = "NandemoGHS/Anime-Speech-Japanese-Captioner-FP8-DYNAMIC"

    llm = LLM(
        model=MODEL_PATH, trust_remote_code=True, gpu_memory_utilization=0.95,
        tensor_parallel_size=torch.cuda.device_count(),
        limit_mm_per_prompt={'audio': 1},
        max_num_seqs=8,
        max_model_len=8192,
        seed=100,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=4096,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

    # Example audio file
    audio_path = "https://huggingface.co/NandemoGHS/Anime-Speech-Japanese-Captioner/resolve/main/examples/example1.wav"

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": audio_path}
            ],
        }
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    audios, _, _ = process_mm_info(messages, use_audio_in_video=False)

    inputs = {
        'prompt': text,
        'multi_modal_data': {},
    }

    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)

    print(outputs[0].outputs[0].text)
```
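
The same engine can caption many files in one call by passing a list of inputs; vLLM batches them internally, up to `max_num_seqs` at a time. Below is a minimal sketch reusing the `llm`, `processor`, and `sampling_params` objects from the script above; `audio_paths` is a hypothetical list of local files or URLs:

```python
# Hypothetical input list; replace with your own files or URLs.
audio_paths = ["example1.wav", "example2.wav"]

batch_inputs = []
for path in audio_paths:
    messages = [{"role": "user", "content": [{"type": "audio", "audio": path}]}]
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    audios, _, _ = process_mm_info(messages, use_audio_in_video=False)
    batch_inputs.append({"prompt": prompt, "multi_modal_data": {"audio": audios}})

# One generate() call schedules the whole batch.
outputs = llm.generate(batch_inputs, sampling_params=sampling_params)
for path, output in zip(audio_paths, outputs):
    print(f"--- {path} ---")
    print(output.outputs[0].text)
```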

#### Example Output

This is the caption generated for [this example](https://huggingface.co/NandemoGHS/Anime-Speech-Japanese-Captioner/resolve/main/examples/example1.wav).

```
emotion: ecstatic
profile: お嬢様風の女性声
mood: 快楽、絶頂
speed: 途切れ途切れ
prosody: 息遣いが荒く、感情の起伏が激しい
pitch_timbre: 高め、息多め、喘ぎ声
style: 喘ぎ
notes: 喘ぎ声と吐息が混じり、性的興奮が非常に高い状態。
caption: お嬢様風の女性が快楽に溺れ、喘ぎながら話す。息遣いが荒く、途切れ途切れに感情を爆発させる。性的興奮が最高潮に達している。
```

### Notebook Example

For a more detailed walkthrough, please see the **[inference_example.ipynb](https://huggingface.co/NandemoGHS/Anime-Speech-Japanese-Captioner/blob/main/inference_example.ipynb)** notebook.

## Output Format

The model outputs a structured description of the audio in Japanese, following this format:

```
emotion: {Emotion of the speech}
profile: {Speaker profile}
mood: {Mood of the speech}
speed: {Speaking speed}
prosody: {Prosody, rhythm}
pitch_timbre: {Pitch, voice quality}
style: {Style of utterance}
notes: {Other relevant notes}
caption: {A comprehensive caption integrating all elements}
```
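
Since each field sits on its own `key: value` line, the output is straightforward to post-process. The helper below is a minimal sketch (not part of the model or vLLM API) that splits a generated caption into a dictionary:

```python
def parse_caption(raw: str) -> dict[str, str]:
    """Split the model's `key: value` output lines into a dict."""
    fields = {}
    for line in raw.strip().splitlines():
        # Skip anything that is not a `key: value` line.
        if ":" not in line:
            continue
        # Split only on the first colon so values may contain colons.
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields

# With the generation result from the inference script above:
# caption = parse_caption(outputs[0].outputs[0].text)
# print(caption["emotion"], caption["caption"])
```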

## License

This model is licensed under the **[CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)** license.

Furthermore, the training data was built from outputs of **Gemini 2.5 Pro**; therefore, **any use that competes with Gemini or violates its terms of service is strictly prohibited.**