cjli committed
Commit 647fd96 · 1 Parent(s): b59aa36

update readme: example script

Files changed (1)
  1. README.md +22 -3
README.md CHANGED
@@ -36,7 +36,7 @@ espnet_model_zoo
 **The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1
 
 
-### Example script
+### Example script for PR/ASR/G2P/P2G
 
 Our models are trained on 16kHz audio with a fixed duration of 20s. When using the pre-trained model, please ensure the input speech is 16kHz and pad or truncate it to 20s.
 
@@ -50,9 +50,8 @@ task = '<pr>'
 s2t = Speech2Text.from_pretrained(
     "espnet/powsm",
     device="cuda",
-    generate_interctc_outputs=False,
     lang_sym='<eng>', # ISO 639-3; set to <unk> for unseen languages
-    task_sym='<pr>', # <pr>, <asr>, <g2p>, <p2g>
+    task_sym=task, # <pr>, <asr>, <g2p>, <p2g>
 )
 
 speech, rate = sf.read("sample.wav") # must already be 16 kHz
@@ -63,8 +62,28 @@ if task == '<pr>' or task == '<g2p>':
     print(pred)
 ```
 
+#### Other tasks
+
 See `force_align.py` in the [ESPnet recipe](https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1) to try out CTC forced alignment with POWSM's encoder!
 
+LID is learned implicitly during training, and you may run it with the script below:
+
+```python
+from espnet2.bin.s2t_inference_language import Speech2Language
+import soundfile as sf # or librosa
+
+s2t = Speech2Language.from_pretrained(
+    "espnet/powsm",
+    device="cuda",
+    nbest=1, # number of possible languages to return
+    first_lang_sym="<afr>", # fixed; defined in vocab list
+    last_lang_sym="<zul>", # fixed; defined in vocab list
+)
+
+speech, rate = sf.read("sample.wav") # must already be 16 kHz
+pred = s2t(speech)[0] # a list of (language, probability) pairs
+print(pred)
+```
 
 ### Citations
 
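Both example scripts above assume the input audio is already 16 kHz and exactly 20 s long. A minimal sketch of that padding/truncation step, assuming numpy and soundfile; the `load_fixed_20s` helper is illustrative and not part of the ESPnet recipe:

```python
import numpy as np
import soundfile as sf

def load_fixed_20s(path, target_sr=16000, duration_s=20):
    """Load audio and pad or truncate it to exactly 20 s of 16 kHz samples."""
    speech, rate = sf.read(path)
    if speech.ndim > 1:
        speech = speech.mean(axis=1)  # downmix to mono
    # soundfile does not resample; the file must already be 16 kHz
    assert rate == target_sr, f"expected {target_sr} Hz, got {rate} Hz"
    target_len = target_sr * duration_s
    if len(speech) < target_len:
        speech = np.pad(speech, (0, target_len - len(speech)))  # pad with trailing silence
    else:
        speech = speech[:target_len]  # keep the first 20 s
    return speech

# e.g. speech = load_fixed_20s("sample.wav"); pred = s2t(speech)[0]
```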