---
license: cc-by-nc-4.0
datasets:
- NandemoGHS/Galgame_Gemini_Captions
language:
- ja
base_model:
- Qwen/Qwen3-Omni-30B-A3B-Captioner
---

# Anime-Speech-Japanese-Captioner

This model is a fine-tuned version of [Qwen/Qwen3-Omni-30B-A3B-Captioner](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Captioner).

It is an audio captioning model specialized for Japanese anime-style or game-style speech. Given an audio input, it generates a detailed description in Japanese covering emotion, speaker profile, mood, speed, prosody, pitch/timbre, style, and an overall caption.

It was fine-tuned on the [NandemoGHS/Galgame_Gemini_Captions](https://huggingface.co/datasets/NandemoGHS/Galgame_Gemini_Captions) dataset using the [ms-swift](https://github.com/modelscope/ms-swift) library with the Megatron backend.

## Intended Use and Limitations

This model is specifically designed for **Japanese game-style or anime-style speech**. Due to the nature of its training data, it is **not expected to perform well** on:

* Languages other than Japanese.
* General conversational speech (e.g., meetings, casual dialogue).

## How to Use (Inference)

We recommend using `vLLM` for inference.

### vLLM Installation Requirements

This model requires building `vLLM` from a recent development commit, as it is not yet supported in the latest stable release (v0.11.0 as of this writing). It has been tested and confirmed to work with commit `18961c5ea62976efc50525b72e40337993c5e4f9`.

You must build vLLM from source:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
# Optionally pin to the commit this model was tested against:
git checkout 18961c5ea62976efc50525b72e40337993c5e4f9
uv pip install . --torch-backend=auto -v --prerelease=allow
```

This requirement will likely become unnecessary after the `v0.11.1` release.

### Inference Example

Here is a simple inference script using `vLLM`:

```python
import os

import torch
from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

if __name__ == '__main__':
    # vLLM engine v1 is not supported yet
    os.environ['VLLM_USE_V1'] = '0'

    MODEL_PATH = "NandemoGHS/Anime-Speech-Japanese-Captioner-FP8-DYNAMIC"

    llm = LLM(
        model=MODEL_PATH,
        trust_remote_code=True,
        gpu_memory_utilization=0.95,
        tensor_parallel_size=torch.cuda.device_count(),
        limit_mm_per_prompt={'audio': 1},
        max_num_seqs=8,
        max_model_len=8192,
        seed=100,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=4096,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

    # Example audio file
    audio_path = "https://huggingface.co/NandemoGHS/Anime-Speech-Japanese-Captioner/resolve/main/examples/example1.wav"

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": audio_path}
            ],
        }
    ]

    # Build the chat prompt and extract the audio from the messages
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    audios, _, _ = process_mm_info(messages, use_audio_in_video=False)

    inputs = {
        'prompt': text,
        'multi_modal_data': {},
    }
    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)
    print(outputs[0].outputs[0].text)
```
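The generated text is a series of plain `key: value` lines (documented in the Output Format section below). If you want structured access to the individual fields, a minimal parsing sketch along these lines should work; note that `parse_caption` is a hypothetical helper written for this card, not part of the model or the `vLLM` API:

```python
def parse_caption(raw: str) -> dict:
    # Hypothetical helper: split the model's "key: value" output lines
    # into a dict. Assumes each field sits on its own line, as described
    # in the Output Format section of this card.
    fields = {}
    for line in raw.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields

# Example: parse_caption(outputs[0].outputs[0].text)["emotion"]
```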
#### Example Output

This is the caption generated for [this example](https://huggingface.co/NandemoGHS/Anime-Speech-Japanese-Captioner/resolve/main/examples/example1.wav):

```
emotion: ecstatic
profile: お嬢様風の女性声
mood: 快楽、絶頂
speed: 途切れ途切れ
prosody: 息遣いが荒く、感情の起伏が激しい
pitch_timbre: 高め、息多め、喘ぎ声
style: 喘ぎ
notes: 喘ぎ声と吐息が混じり、性的興奮が非常に高い状態。
caption: お嬢様風の女性が快楽に溺れ、喘ぎながら話す。息遣いが荒く、途切れ途切れに感情を爆発させる。性的興奮が最高潮に達している。
```

### Notebook Example

For a more detailed walkthrough, please see the **[inference\_example.ipynb](https://huggingface.co/NandemoGHS/Anime-Speech-Japanese-Captioner/blob/main/inference_example.ipynb)** notebook.

## Output Format

The model outputs a structured description of the audio in Japanese, following this format:

```
emotion: {Emotion of the speech}
profile: {Speaker profile}
mood: {Mood of the speech}
speed: {Speaking speed}
prosody: {Prosody, rhythm}
pitch_timbre: {Pitch, voice quality}
style: {Style of utterance}
notes: {Other relevant notes}
caption: {A comprehensive caption integrating all elements}
```

## License

This model is licensed under the **[CC BY-NC 4.0 License](https://creativecommons.org/licenses/by-nc/4.0/)**.

Furthermore, the training data utilized outputs from **Gemini 2.5 Pro**; therefore, **any use that competes with Gemini or violates its terms of service is strictly prohibited.**