---
datasets:
- wenet-e2e/wenetspeech
- MLCommons/peoples_speech
language:
- zh
- en
base_model:
- openai/whisper-large-v3
- fla-hub/rwkv7-0.4B-g1a
pipeline_tag: automatic-speech-recognition
---
RWKV ASR adds an audio modality to RWKV7 while leaving the RWKV7 base model unaltered. A 0.1B RWKV model is trained to map whisper-large-v3 encoder latents into RWKV7's latent space, so the frozen RWKV7 LLM can convert speech into text according to a text instruction.
This design keeps all of the LLM's abilities and makes it easy to add more functions to the model, such as speech-to-speech, speech translation, etc. You name it!
The architecture looks like:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/63a00aa29f1f2baab2034cf8/4bM4sOb-0z5bNr1Ng7MhY.png)
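A minimal sketch of the data flow described above, assuming illustrative module names, method names, and hidden sizes (these are not the actual classes in the RWKVTTS repo):
```python
import torch
import torch.nn as nn

class AudioToRWKVAdapter(nn.Module):
    """Stand-in for the trained 0.1B RWKV adapter: maps frozen Whisper
    encoder latents into the RWKV7 embedding space (dims are illustrative)."""
    def __init__(self, whisper_dim=1280, rwkv_dim=1024):
        super().__init__()
        self.proj = nn.Linear(whisper_dim, rwkv_dim)  # the real adapter is an RWKV stack

    def forward(self, whisper_latents):        # (B, T, whisper_dim)
        return self.proj(whisper_latents)      # (B, T, rwkv_dim)

def transcribe(whisper_encoder, adapter, rwkv7_lm, mel, instruction_embeds):
    """Encode audio, project it into RWKV7's latent space, then let the
    frozen RWKV7 base model decode text conditioned on the text instruction.
    `generate_from_embeddings` is a hypothetical helper, not a real API."""
    with torch.no_grad():
        audio_latents = whisper_encoder(mel)   # frozen whisper-large-v3 encoder
    audio_embeds = adapter(audio_latents)      # trained 0.1B RWKV adapter
    prefix = torch.cat([instruction_embeds, audio_embeds], dim=1)
    return rwkv7_lm.generate_from_embeddings(prefix)  # frozen RWKV7 0.4B LLM
```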
# Usage
Sample inference code:
https://github.com/yynil/RWKVTTS/blob/respark/model/test/test_asr_whisper.py
1. Download the weights in this repo. Please note: the 10k-step checkpoint was trained on roughly 5k hours of audio, which is a very small amount of data, and training is continuing. This also shows that the model needs relatively little data to reach a usable state.
2. Download the configuration directories in this repo. Assume you store them in a directory YOUR_DIR. (Both can also be fetched programmatically; see the snapshot_download sketch after the examples below.)
3. Run the script like:
```bash
python model/test/test_asr_whisper.py --whisper_path $YOUR_DIR/whisper-large-v3/ --audio_lm_path $YOUR_DIR/rwkv7_0.1b_audio_lm_latents/ --llm_path $YOUR_DIR/rwkv7-0.4B-g1a/ --ckpt_path $YOUR_DIR/rwkvasr_whisper_10k.model.bin --audio_path new.mp3
```
The output looks like:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/63a00aa29f1f2baab2034cf8/hvo21_B53PnCdybRlIDyi.png)
Or, in English mode:
```bash
python model/test/test_asr_whisper.py --whisper_path $YOUR_DIR/whisper-large-v3/ --audio_lm_path $YOUR_DIR/rwkv7_0.1b_audio_lm_latents/ --llm_path $YOUR_DIR/rwkv7-0.4B-g1a/ --ckpt_path $YOUR_DIR/rwkvasr_whisper_10k.model.bin --audio_path eng2.wav --language english
```
The output looks like:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/63a00aa29f1f2baab2034cf8/3TpFly4KIM7-5C7W3jM0b.png)
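If you prefer to fetch the checkpoint and configuration directories from steps 1 and 2 programmatically, here is a minimal sketch using huggingface_hub (the repo ID below is an assumption based on this model card):
```python
from huggingface_hub import snapshot_download

# Download the whole model repo (weights + configuration directories) into YOUR_DIR.
# The repo ID is assumed; replace it with the actual repo ID if it differs.
your_dir = snapshot_download(repo_id="yueyulin/rwkv_asr", local_dir="YOUR_DIR")
print(f"Files downloaded to: {your_dir}")
```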
Another way to run inference is to download only the trained parameters stored in rwkv7_0.1b_audio_lm_latents_150k. Download whisper-large-v3's weights (https://huggingface.co/openai/whisper-large-v3/blob/main/pytorch_model.bin) and put them in the whisper directory, then download rwkv7-0.4B-g1a's weights (https://huggingface.co/fla-hub/rwkv7-0.4B-g1a/blob/main/model.safetensors) and put them in the rwkv7-0.4B-g1a directory.
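A minimal sketch for fetching those two weight files with huggingface_hub and placing them in local directories (the directory names follow the command-line example below and are assumptions):
```python
from huggingface_hub import hf_hub_download

# Whisper encoder weights -> whisper-large-v3 directory
hf_hub_download(
    repo_id="openai/whisper-large-v3",
    filename="pytorch_model.bin",
    local_dir="whisper-large-v3",
)

# RWKV7 0.4B base model weights -> rwkv7-0.4B-g1a directory
hf_hub_download(
    repo_id="fla-hub/rwkv7-0.4B-g1a",
    filename="model.safetensors",
    local_dir="rwkv7-0.4B-g1a",
)
```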
Run the script here: https://github.com/yynil/RWKVTTS/blob/respark/model/test/test_asr_whisper_load.py
For example:
```bash
python model/test/test_asr_whisper_load.py --whisper_path $YOUR_DIR/whisper-large-v3/ --audio_lm_path $YOUR_DIR/rwkv7_0.1b_audio_lm_latents_150k/ --llm_path $YOUR_DIR/rwkv7-0.4B-g1a/ --audio_path 918.wav
```
You will get a result like the one below:
![image](https://cdn-uploads.huggingface.co/production/uploads/63a00aa29f1f2baab2034cf8/1-fSz-MGokhAj4C6Cjwzr.png)
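Whisper models expect 16 kHz mono audio. The test scripts may already resample internally (this is an assumption); if you want to prepare a file yourself first, here is a minimal sketch with librosa and soundfile:
```python
import librosa
import soundfile as sf

# Load any audio file, downmix to mono, and resample to 16 kHz
audio, sr = librosa.load("input.mp3", sr=16000, mono=True)
sf.write("input_16k.wav", audio, sr)
```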