|
|
--- |
|
|
datasets: |
|
|
- wenet-e2e/wenetspeech |
|
|
- MLCommons/peoples_speech |
|
|
language: |
|
|
- zh |
|
|
- en |
|
|
base_model: |
|
|
- openai/whisper-large-v3 |
|
|
- fla-hub/rwkv7-0.4B-g1a |
|
|
pipeline_tag: automatic-speech-recognition |
|
|
--- |
|
|
|
|
|
RWKV ASR adds an audio modality to RWKV7 while leaving the RWKV7 base model unaltered. We trained a 0.1B RWKV model to map whisper-large-v3 encoder latents into RWKV7's latent space, so the LLM can convert speech into text according to a text instruction.
|
|
This design preserves all of the LLM's abilities and makes it easy to add more functions to the model, such as speech-to-speech, speech translation, etc. You name it!
|
|
|
|
|
The architecture: the whisper-large-v3 encoder feeds the 0.1B RWKV adapter, whose outputs go into the unmodified RWKV7 LLM.
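To make the data flow concrete, here is a minimal conceptual sketch. Everything below is an assumption made for illustration: a single linear layer stands in for the 0.1B RWKV adapter (the real adapter is itself an RWKV model), and the hidden sizes (1280 for the whisper-large-v3 encoder, 1024 for the RWKV7 LLM) are illustrative, not taken from the repo:

```python
# Conceptual sketch only. A linear layer stands in for the 0.1B RWKV
# adapter; the hidden sizes are illustrative assumptions
# (1280 = whisper-large-v3 encoder width).
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Maps Whisper encoder latents into the RWKV7 embedding space."""
    def __init__(self, d_whisper: int = 1280, d_llm: int = 1024):
        super().__init__()
        self.proj = nn.Linear(d_whisper, d_llm)

    def forward(self, whisper_latents: torch.Tensor) -> torch.Tensor:
        return self.proj(whisper_latents)

# Dummy Whisper encoder output: (batch, frames, d_whisper).
whisper_latents = torch.randn(1, 1500, 1280)
audio_embeds = AudioAdapter()(whisper_latents)

# The audio embeddings are combined with the embeddings of the text
# instruction and fed to the unmodified RWKV7 LLM, which decodes text.
print(audio_embeds.shape)  # torch.Size([1, 1500, 1024])
```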
|
|
|
|
|
# Usage |
|
|
The sample inference code is here:
|
|
https://github.com/yynil/RWKVTTS/blob/respark/model/test/test_asr_whisper.py |
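For orientation, here is a minimal sketch of the first stage, extracting whisper-large-v3 encoder latents with the standard transformers API. This is not the repo's code; it only shows where the latents that the adapter consumes come from:

```python
# Sketch: get whisper-large-v3 encoder latents via transformers.
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
whisper = WhisperModel.from_pretrained("openai/whisper-large-v3")

# One second of 16 kHz silence as a stand-in for real audio.
audio = torch.zeros(16000).numpy()
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    latents = whisper.encoder(inputs.input_features).last_hidden_state

# These latents are what the 0.1B RWKV adapter maps into RWKV7's latent space.
print(latents.shape)  # (1, 1500, 1280) for whisper-large-v3
```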
|
|
1. Download the weights in this repo. Please note: the 10k-step checkpoint was trained on only around 5k hours of audio, which is a very small amount of data, and we are continuing training. This also shows that this approach needs relatively little data to reach a usable stage.
|
|
2. Download the configuration directories in this repo. Assume you store them in a directory YOUR_DIR.
|
|
3. Run the script like this:
|
|
```bash |
|
|
python model/test/test_asr_whisper.py --whisper_path $YOUR_DIR/whisper-large-v3/ --audio_lm_path $YOUR_DIR/rwkv7_0.1b_audio_lm_latents/ --llm_path $YOUR_DIR/rwkv7-0.4B-g1a/ --ckpt_path $YOUR_DIR/rwkvasr_whisper_10k.model.bin --audio_path new.mp3 |
|
|
``` |
|
|
The script prints the recognized transcript of the audio.
|
|
|
|
|
Or, in English mode:
|
|
```bash |
|
|
python model/test/test_asr_whisper.py --whisper_path $YOUR_DIR/whisper-large-v3/ --audio_lm_path $YOUR_DIR/rwkv7_0.1b_audio_lm_latents/ --llm_path $YOUR_DIR/rwkv7-0.4B-g1a/ --ckpt_path $YOUR_DIR/rwkvasr_whisper_10k.model.bin --audio_path eng2.wav --language english
|
|
``` |
|
|
Again, the script prints the recognized transcript.
|
|
|
|
|
|
|
|
Another way to do inference is to download only the trained parameters stored in rwkv7_0.1b_audio_lm_latents_150k. Then download whisper-large-v3's weights from https://huggingface.co/openai/whisper-large-v3/blob/main/pytorch_model.bin and put them in the whisper directory, and download rwkv7-0.4B-g1a's weights from https://huggingface.co/fla-hub/rwkv7-0.4B-g1a/blob/main/model.safetensors and put them in the rwkv7-0.4B-g1a directory.
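If you prefer to fetch these weight files programmatically, below is a minimal sketch using huggingface_hub; the `local_dir` values are assumptions, so point them at your own directories:

```python
# Minimal sketch: download the two upstream weight files.
# The local_dir values are assumptions; adjust them to your layout.
from huggingface_hub import hf_hub_download

whisper_weights = hf_hub_download(
    repo_id="openai/whisper-large-v3",
    filename="pytorch_model.bin",
    local_dir="whisper-large-v3",   # the "whisper directory" above
)
rwkv_weights = hf_hub_download(
    repo_id="fla-hub/rwkv7-0.4B-g1a",
    filename="model.safetensors",
    local_dir="rwkv7-0.4B-g1a",     # the RWKV7 base-model directory
)
print(whisper_weights)
print(rwkv_weights)
```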
|
|
|
|
|
Run the script here: https://github.com/yynil/RWKVTTS/blob/respark/model/test/test_asr_whisper_load.py |
|
|
For example:
|
|
```bash |
|
|
python model/test/test_asr_whisper_load.py --whisper_path $YOUR_DIR/whisper-large-v3/ --audio_lm_path $YOUR_DIR/rwkv7_0.1b_audio_lm_latents_150k/ --llm_path $YOUR_DIR/rwkv7-0.4B-g1a/ --audio_path 918.wav
|
|
``` |
|
|
You will get the recognized transcript as the result.
|
|
|