cjli committed
Commit 647fd96 · 1 Parent(s): b59aa36

update readme: example script

Files changed (1)
  1. README.md +22 -3
README.md CHANGED
@@ -36,7 +36,7 @@ espnet_model_zoo
 **The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1
 
 
-### Example script
+### Example script for PR/ASR/G2P/P2G
 
 Our models are trained on 16kHz audio with a fixed duration of 20s. When using the pre-trained model, please ensure the input speech is 16kHz and pad or truncate it to 20s.
 
@@ -50,9 +50,8 @@ task = '<pr>'
 s2t = Speech2Text.from_pretrained(
     "espnet/powsm",
     device="cuda",
-    generate_interctc_outputs=False,
     lang_sym='<eng>', # ISO 639-3; set to <unk> for unseen languages
-    task_sym='<pr>', # <pr>, <asr>, <g2p>, <p2g>
+    task_sym=task, # <pr>, <asr>, <g2p>, <p2g>
 )
 
 speech, rate = sf.read("sample.wav") # must already be 16 kHz
@@ -63,8 +62,28 @@ if task == '<pr>' or task == '<g2p>':
     print(pred)
 ```
 
+#### Other tasks
+
 See `force_align.py` in the [ESPnet recipe](https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1) to try out CTC forced alignment with POWSM's encoder!
 
+LID is learned implicitly during training, and you may run it with the script below:
+
+```python
+from espnet2.bin.s2t_inference_language import Speech2Language
+import soundfile as sf # or librosa
+
+s2t = Speech2Language.from_pretrained(
+    "espnet/powsm",
+    device="cuda",
+    nbest=1, # number of possible languages to return
+    first_lang_sym="<afr>", # fixed; defined in vocab list
+    last_lang_sym="<zul>", # fixed; defined in vocab list
+)
+
+speech, rate = sf.read("sample.wav") # must already be 16 kHz
+pred = s2t(speech)[0] # a list of (language, probability) pairs
+print(pred)
+```
 
 ### Citations
 
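Both example scripts above assume the input audio is already 16 kHz and exactly 20 s long. A minimal sketch of that padding/truncation step, assuming numpy and soundfile; the `load_fixed_20s` helper is illustrative and not part of the ESPnet recipe:

```python
import numpy as np
import soundfile as sf

def load_fixed_20s(path, target_sr=16000, duration_s=20):
    """Load audio and pad or truncate it to exactly 20 s of 16 kHz samples."""
    speech, rate = sf.read(path)
    if speech.ndim > 1:
        speech = speech.mean(axis=1)  # downmix to mono
    # soundfile does not resample; the file must already be 16 kHz
    assert rate == target_sr, f"expected {target_sr} Hz, got {rate} Hz"
    target_len = target_sr * duration_s
    if len(speech) < target_len:
        speech = np.pad(speech, (0, target_len - len(speech)))  # pad with trailing silence
    else:
        speech = speech[:target_len]  # keep the first 20 s
    return speech

# e.g. speech = load_fixed_20s("sample.wav"); pred = s2t(speech)[0]
```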