andtseren committed · Commit 43dcffc · verified · Parent(s): f47bf88

Update README.md

Files changed (1): README.md (+71 -71)
README.md CHANGED
---
license: cc-by-4.0
language:
- tl
metrics:
- wer
library_name: nemo
tags:
- FastConformer
- ASR
- automatic-speech-recognition
---
**FastConformer-Hybrid Large (tl)**
<style>
img {
display: inline;
}
</style>

| [![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transducer_CTC-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-115M-lightgrey#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-tl-lightgrey#model-badge)](#datasets)
---
## Model Architecture
The model is based on the **FastConformer** [architecture](https://arxiv.org/pdf/2305.05084) with a hybrid Transducer (RNN-T) and CTC loss. It uses a Byte Pair Encoding (BPE) tokenizer with a vocabulary size of 1,024 tokens; only characters from the Filipino alphabet and the apostrophe (**'**) are included in the tokenizer vocabulary.
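Because the model is trained with both losses, either head can be used for decoding at inference time. A minimal sketch, assuming the model has been loaded as `asr_model` (see the Python steps below) and a NeMo version whose hybrid models expose `change_decoding_strategy` with a `decoder_type` argument:

```python
# Sketch: selecting which head of the hybrid model performs decoding.
# Assumes `asr_model` is a loaded EncDecHybridRNNTCTCBPEModel (see below).

# Transducer (RNN-T) head: the default, usually the more accurate
asr_model.change_decoding_strategy(decoder_type='rnnt')

# CTC head: typically faster, often at a small cost in accuracy
asr_model.change_decoding_strategy(decoder_type='ctc')
```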
### Datasets
The model is trained on a combination of supervised and semi-supervised transcribed datasets totaling approximately 520 hours of audio.
## Benchmark Results
We evaluate the model using the **Word Error Rate (WER)** metric. To ensure consistency and fairness in comparison, we manually apply **text normalization**, including the handling of numbers, lowercasing, and punctuation.
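The exact normalization rules are not spelled out here, so the snippet below is only an illustrative sketch of this kind of scoring setup: lowercasing, punctuation stripping (number handling omitted), and WER computed with the third-party `jiwer` package. The `normalize` helper is our hypothetical example, not the authors' code.

```python
import re

import jiwer  # pip install jiwer

def normalize(text: str) -> str:
    # Lowercase and drop punctuation, keeping word characters and apostrophes
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())

reference = "Magandang umaga, Pilipinas!"
hypothesis = "magandang umaga pilipinas"
print(jiwer.wer(normalize(reference), normalize(hypothesis)))  # -> 0.0
```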

*To obtain the **Magic Data Tech** dataset, please register and download it [here](https://www.magicdatatech.com/datasets/asr/mdt-asr-e017-filipinotagalog-scripted-speech-corpus-1630305526), and review its Terms of Use, Privacy Policy, and License.*

| # | Model | #Params | FLEURS test (fil_ph) WER | Magic Data Tech WER |
|---|-------|---------|--------------------------|---------------------|
| 1 | **FastConformer-Hybrid Large** (ours) | 115M | 9.34% | **16.10%** |
| 2 | [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) | 809M | 11.60% | 16.43% |
| 3 | [ElevenLabs](https://elevenlabs.io/app/speech-to-text) | --- | 9.19% | 21.08% |
| 4 | [Google Chirp 2](https://cloud.google.com/speech-to-text/v2/docs/chirp_2-model) | --- | **7.42%** | 28.79% |

### Transcribing using Python
1. Install NeMo:
```shell
pip install "nemo_toolkit[asr]"
```
2. Load the pre-trained model:
```python
import nemo.collections.asr as nemo_asr

# Path to the downloaded .nemo checkpoint
nemo_model_path = '<path_to_model>/stt_tl_fastconformer_hybrid_large.nemo'
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from(nemo_model_path)
asr_model.eval()  # switch to inference mode
```
3. Transcribe a single audio file (a timestamped variant is sketched after this list):
```python
path2wav = 'audio.wav'
# transcribe() takes a list of paths and returns one result per file
output = asr_model.transcribe([path2wav])
print(output[0].text)
```
4. Or transcribe multiple audio files:
```python
audio_file_list = ['audio1.wav', 'audio2.wav', 'audio3.wav']
sys_tran = asr_model.transcribe(audio=audio_file_list,
                                batch_size=len(audio_file_list),
                                return_hypotheses=True,
                                num_workers=0)
# Pair each input file with its hypothesis
for s, utt in zip(sys_tran, audio_file_list):
    print(f"{utt}: {s.text}")
```
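Recent NeMo releases can also return word- and segment-level timestamps from `transcribe()`; the exact flag and output layout depend on the NeMo version, so treat this as a sketch:

```python
# Sketch: requesting timestamps (assumes a NeMo release whose
# transcribe() supports the `timestamps` flag).
hyps = asr_model.transcribe(['audio.wav'], timestamps=True)
# Each hypothesis carries timing information per segment
for seg in hyps[0].timestamp['segment']:
    print(f"{seg['start']:.2f}s - {seg['end']:.2f}s : {seg['segment']}")
```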
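NeMo ASR checkpoints generally expect 16 kHz mono WAV input (an assumption worth verifying against `asr_model.cfg.sample_rate` for this model). A minimal preprocessing sketch using the third-party `librosa` and `soundfile` packages:

```python
import librosa
import soundfile as sf

# Downmix to mono and resample to 16 kHz, whatever the source format
audio, sr = librosa.load('input.mp3', sr=16000, mono=True)
sf.write('audio.wav', audio, sr)

print(asr_model.transcribe(['audio.wav'])[0].text)
```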

For more details, please refer to the [NeMo ASR documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/intro.html).