RWKV ASR adds an audio modality to RWKV7 while leaving the RWKV7 base model unaltered. A 0.1B RWKV model is trained to map the whisper-large-v3 encoder's latents into RWKV7's latent space, so the LLM can transcribe the speech according to a text instruction. This design preserves all of the LLM's abilities and makes it easy to add further functions such as speech-to-speech, speech translation, and more. You name it!
The architecture looks like:
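To make the data flow concrete, here is a minimal numpy sketch of the pipeline described above. The dimensions are assumptions for illustration (whisper-large-v3's encoder emits 1280-d frames; the RWKV7 hidden size is taken as 1024 here), and the single linear projection is a stand-in for the trained 0.1B RWKV adapter, not the actual model:

```python
import numpy as np

# Assumed dimensions: whisper-large-v3 encoder frames are 1280-d;
# the target LLM hidden size is taken as 1024 for this sketch.
WHISPER_DIM, LLM_DIM, FRAMES = 1280, 1024, 50

rng = np.random.default_rng(0)

# Stand-in for the whisper encoder output on one utterance.
audio_latents = rng.standard_normal((FRAMES, WHISPER_DIM))

# Stand-in for the trained 0.1B adapter: a single linear projection
# into the LLM's latent space (the real adapter is a full RWKV model).
W_adapter = rng.standard_normal((WHISPER_DIM, LLM_DIM)) * 0.01
llm_latents = audio_latents @ W_adapter

# These projected latents are combined with the text-instruction
# embeddings and fed to the frozen RWKV7 LLM, which decodes the
# transcript autoregressively.
print(llm_latents.shape)  # (50, 1024)
```

The point of the design is visible in the shapes: only the adapter is trained, so the frozen LLM keeps all of its text abilities.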
## Usage
Inference sample code is: https://github.com/yynil/RWKVTTS/blob/respark/model/test/test_asr_whisper.py
- Download the weights from this repo. Please note: the 10k-step checkpoint was trained on around 5k hours of audio, which is a very small amount of data, and training is still ongoing. This also suggests the approach needs relatively little data to reach a usable stage.
- Download the configuration directories in this repo. Assume you store them in a directory YOUR_DIR.
- Run the script like:

```bash
python model/test/test_asr_whisper.py --whisper_path $YOUR_DIR/whisper-large-v3/ --audio_lm_path $YOUR_DIR/rwkv7_0.1b_audio_lm_latents/ --llm_path $YOUR_DIR/rwkv7-0.4B-g1a/ --ckpt_path $YOUR_DIR/rwkvasr_whisper_10k.model.bin --audio_path new.mp3
```
The output looks like:
or, in English mode:

```bash
python model/test/test_asr_whisper.py --whisper_path $YOUR_DIR/whisper-large-v3/ --audio_lm_path $YOUR_DIR/rwkv7_0.1b_audio_lm_latents/ --llm_path $YOUR_DIR/rwkv7-0.4B-g1a/ --ckpt_path $YOUR_DIR/rwkvasr_whisper_10k.model.bin --audio_path eng2.wav --language english
```
The output looks like:
Another way to do inference is to download only the trained parameters stored in rwkv7_0.1b_audio_lm_latents_150k. Then download whisper-large-v3's weights (https://huggingface.co/openai/whisper-large-v3/blob/main/pytorch_model.bin) and put them in the whisper directory, and download rwkv7-0.4B-g1a's weights (https://huggingface.co/fla-hub/rwkv7-0.4B-g1a/blob/main/model.safetensors) and put them in the rwkv-0.4b-g1a directory.
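Fetching the two base checkpoints can be sketched with `huggingface_hub` (assumed installed via `pip install huggingface_hub`; the local directory names mirror the ones used by the inference scripts, and `fetch_base_weights` is a hypothetical helper, not part of this repo):

```python
from huggingface_hub import hf_hub_download


def fetch_base_weights(root="."):
    """Download the two frozen base checkpoints next to their config dirs."""
    whisper_path = hf_hub_download(
        repo_id="openai/whisper-large-v3",
        filename="pytorch_model.bin",
        local_dir=f"{root}/whisper-large-v3",
    )
    llm_path = hf_hub_download(
        repo_id="fla-hub/rwkv7-0.4B-g1a",
        filename="model.safetensors",
        local_dir=f"{root}/rwkv7-0.4B-g1a",
    )
    return whisper_path, llm_path


if __name__ == "__main__":
    # Note: these are multi-GB downloads.
    print(fetch_base_weights())
```

The trained adapter parameters from rwkv7_0.1b_audio_lm_latents_150k still come from this repo; only the frozen base weights are fetched from their original repos.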
Run the script here: https://github.com/yynil/RWKVTTS/blob/respark/model/test/test_asr_whisper_load.py For example:

```bash
python model/test/test_asr_whisper_load.py --whisper_path $YOUR_DIR/whisper-large-v3/ --audio_lm_path $YOUR_DIR/rwkv7_0.1b_audio_lm_latents_150k/ --llm_path $YOUR_DIR/rwkv7-0.4B-g1a/ --audio_path 918.wav
```
You will get the result like below: