---
license: cc-by-4.0
language:
- tl
metrics:
- wer
library_name: nemo
tags:
- FastConformer
- ASR
- automatic-speech-recognition
---
**FastConformer-Hybrid Large (tl)**
<style>
img {
display: inline;
}
</style>

| [](#model-architecture)
| [](#model-architecture)
| [](#datasets)
---
## Model Architecture
The model is based on the **FastConformer** [architecture](https://arxiv.org/pdf/2305.05084) with a hybrid Transducer and CTC loss function. It uses a Byte Pair Encoding (BPE) tokenizer with a vocabulary of 1,024 tokens; only characters from the Filipino alphabet and the apostrophe (**'**) are included in the tokenizer vocabulary.
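Because the model has both a Transducer (RNNT) and a CTC decoder, either branch can be used at inference time. A minimal sketch, assuming the NeMo hybrid-model API (the exact call signature may vary across NeMo versions):

```python
import nemo.collections.asr as nemo_asr

# Substitute your own checkpoint path.
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from(
    '<path_to_model>/stt_tl_fastconformer_hybrid_large.nemo'
)

# Inspect the BPE tokenizer described above (expected: 1024).
print(asr_model.tokenizer.vocab_size)

# Decode with the (default) Transducer branch or switch to the CTC branch.
asr_model.change_decoding_strategy(decoder_type='rnnt')
asr_model.change_decoding_strategy(decoder_type='ctc')
```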
### Datasets
The model is trained on a combination of supervised and semi-supervised transcribed datasets totaling approximately 520 hours of audio.
## Benchmark Results
We evaluate the model using **Word Error Rate (WER)**. To ensure consistency and fairness in comparison, we manually apply **text normalization** to references and hypotheses, including the handling of numbers, lowercasing, and punctuation.
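The exact normalization rules are not spelled out here, so the following is only an illustrative sketch of this kind of pre-scoring normalization (lowercasing and punctuation stripping; the `jiwer` package and the number-handling step are assumptions, and number expansion would need language-specific rules for Tagalog):

```python
import re
import jiwer  # pip install jiwer

def normalize(text: str) -> str:
    text = text.lower()                     # lowercase
    text = re.sub(r"[^\w\s']", " ", text)   # drop punctuation except apostrophes
    # Numbers would be expanded to words here (language-specific; not shown).
    return re.sub(r"\s+", " ", text).strip()

ref = "Magandang umaga, po!"
hyp = "magandang umaga po"
print(jiwer.wer(normalize(ref), normalize(hyp)))  # 0.0 after normalization
```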

*To obtain the **Magic Data Tech** dataset, please register and download it [here](https://www.magicdatatech.com/datasets/asr/mdt-asr-e017-filipinotagalog-scripted-speech-corpus-1630305526).*
Please review its Terms of Use, Privacy Policy, and License.

| # | Model | #Params | FLEURS test (fil_ph) | Magic Data Tech |
|---|--------------------------------------------------------------------------------|---------|----------------------|-----------------|
| 1 | **FastConformer-Hybrid Large** (ours) | 115M | 9.34% | **16.10%** |
| 2 | [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) | 809M | 11.60% | 16.43% |
| 3 | [ElevenLabs](https://elevenlabs.io/app/speech-to-text) | --- | 9.19% | 21.08% |
| 4 | [Google](https://cloud.google.com/speech-to-text/v2/docs/chirp_2-model) | --- | **7.42%** | 28.79% |

### Transcribing using Python
1. Install NeMo:
```shell
pip install "nemo_toolkit[asr]"
```
2. Load the pre-trained model:
```python
import nemo.collections.asr as nemo_asr

# Path to the downloaded checkpoint.
nemo_model_path = '<path_to_model>/stt_tl_fastconformer_hybrid_large.nemo'
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from(nemo_model_path)
asr_model.eval()  # switch to inference mode
```
3. Transcribe a single audio file:
```python
path2wav = 'audio.wav'
output = asr_model.transcribe([path2wav])
print(output[0].text)  # transcribe() returns one hypothesis per input file
```
4. Or transcribe multiple audio files:
```python
audio_file_list = ['audio1.wav', 'audio2.wav', 'audio3.wav']
sys_tran = asr_model.transcribe(audio=audio_file_list,
                                batch_size=len(audio_file_list),
                                return_hypotheses=True,
                                num_workers=0)
for s, utt in zip(sys_tran, audio_file_list):
    print(f"{utt}: {s.text}")
```
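NeMo ASR models generally expect 16 kHz mono WAV input. If your audio is in another format, a minimal resampling sketch (using `librosa` and `soundfile`, which are assumptions here, not project dependencies) might look like:

```python
import librosa
import soundfile as sf

# Load any audio file, downmix to mono, and resample to 16 kHz.
audio, sr = librosa.load('input.mp3', sr=16000, mono=True)
sf.write('audio.wav', audio, sr)  # write a WAV that transcribe() can consume
```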

For more details, please refer to the [NeMo ASR documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/intro.html).