---
datasets:
- anyspeech/ipapack_plus_train_1
- anyspeech/ipapack_plus_train_2
- anyspeech/ipapack_plus_train_3
- anyspeech/ipapack_plus_train_4
language: multilingual
library_name: espnet
license: cc-by-4.0
metrics:
- pfer
- cer
tags:
- espnet
- audio
- phone-recognition
- automatic-speech-recognition
- grapheme-to-phoneme
- phoneme-to-grapheme
pipeline_tag: automatic-speech-recognition
---

🐁 POWSM is the first phonetic foundation model that can perform four phone-related tasks: Phone Recognition (PR), Automatic Speech Recognition (ASR), audio-guided grapheme-to-phoneme conversion (G2P), and audio-guided phoneme-to-grapheme conversion (P2G). Based on the [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/) and trained with [IPAPack++](https://huggingface.co/anyspeech), POWSM outperforms or matches specialized PR models of similar size while jointly supporting G2P, P2G, and ASR.

To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:

```
torch
espnet
espnet_model_zoo
```

**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1

### Example script for PR/ASR/G2P/P2G

Our models are trained on 16kHz audio with a fixed duration of 20s. When using the pre-trained model, please ensure the input speech is sampled at 16kHz and pad or truncate it to 20s (a minimal preprocessing sketch is included at the end of this card).

To distinguish phone entries from BPE tokens that share the same Unicode characters, we enclose every phone in slashes and treat them as special tokens. For example, /pʰɔsəm/ would be tokenized as /pʰ//ɔ//s//ə//m/.

```python
from espnet2.bin.s2t_inference import Speech2Text
import librosa  # or soundfile, if the audio is already 16kHz

task = '<pr>'
s2t = Speech2Text.from_pretrained(
    "espnet/powsm",
    device="cuda",
    lang_sym='<eng>',  # ISO 639-3 code in angle brackets; for unseen languages, use the unknown-language token from the vocab list
    task_sym=task,     # one of <pr>, <asr>, <g2p>, <p2g>
)

speech, rate = librosa.load("sample.wav", sr=16000)  # resampled to 16kHz
prompt = ""  # G2P: set to the ASR transcript; P2G: set to the phone transcription with slashes

pred = s2t(speech, text_prev=prompt)[0][0]
if task == '<pr>' or task == '<g2p>':
    pred = pred.replace("/", "")  # strip the slashes around each phone token
print(pred)
```

#### Other tasks

See `force_align.py` in the [ESPnet recipe](https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1) to try out CTC forced alignment with POWSM's encoder!

Language identification (LID) is learned implicitly during training, and you may run it with the script below:

```python
from espnet2.bin.s2t_inference_language import Speech2Language
import librosa  # or soundfile, if the audio is already 16kHz

s2l = Speech2Language.from_pretrained(
    "espnet/powsm",
    device="cuda",
    nbest=1,  # number of candidate languages to return
    first_lang_sym="<abk>",  # fixed; first language token defined in the vocab list
    last_lang_sym="<zul>",   # fixed; last language token defined in the vocab list
)

speech, rate = librosa.load("sample.wav", sr=16000)
pred = s2l(speech)[0]  # the top (language, probability) pair
print(pred)
```

### Citations

```BibTeX
@article{powsm
}
```
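### Appendix: audio preprocessing sketch

As noted above, POWSM expects 16kHz input padded or truncated to 20s. The snippet below is a minimal sketch of that preprocessing, assuming `librosa` is installed; `load_for_powsm` is a hypothetical helper name, not part of ESPnet or the POWSM recipe.

```python
import librosa
import numpy as np

TARGET_SR = 16000             # POWSM is trained on 16kHz audio
TARGET_LEN = 20 * TARGET_SR   # fixed 20s input window

def load_for_powsm(path: str) -> np.ndarray:
    """Load audio, resample to 16kHz, and zero-pad or truncate to 20s."""
    speech, _ = librosa.load(path, sr=TARGET_SR)  # librosa resamples on load
    return librosa.util.fix_length(speech, size=TARGET_LEN)

speech = load_for_powsm("sample.wav")  # ready to pass to Speech2Text
```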