---
license: cc-by-4.0
library_name: nemo
pipeline_tag: automatic-speech-recognition
language:
- tl
- fil
tags:
- asr
- automatic-speech-recognition
- fastconformer
- tagalog
- filipino
- nemo
- speech
datasets:
- google/fleurs
model-index:
- name: stt_tl_fastconformer_hybrid_large
  results:
  - task:
      type: automatic-speech-recognition
      name: ASR
    dataset:
      name: FLEURS (fil_ph)
      type: google/fleurs
      args: fil_ph
    metrics:
    - type: wer
      name: WER
      value: 9.34
  - task:
      type: automatic-speech-recognition
      name: ASR
    dataset:
      name: Magic Data Tech (Tagalog)
      type: other
      url: >-
        https://www.magicdatatech.com/datasets/asr/mdt-asr-e017-filipinotagalog-scripted-speech-corpus-1630305526
    metrics:
    - type: wer
      name: WER
      value: 16.1
---

# FastConformer-Hybrid Large (Tagalog/Filipino)

[![Arch](https://img.shields.io/badge/Arch-FastConformer_Hybrid-lightgrey)](#model-architecture)
[![Params](https://img.shields.io/badge/Params-115M-lightgrey)](#model-architecture)
[![Language](https://img.shields.io/badge/Language-tl%2Ffil-lightgrey)](#datasets)

Production-oriented ASR model for **Tagalog/Filipino**, built on **FastConformer**.

## Model Architecture

The model is based on the **FastConformer** [architecture](https://arxiv.org/pdf/2305.05084) and is trained with a hybrid Transducer (RNN-T) and CTC loss. It uses a Byte Pair Encoding (BPE) tokenizer with a vocabulary of 1,024 tokens; only characters of the Filipino alphabet and the apostrophe (') are included in the tokenizer vocabulary.

## Datasets

The model is trained on a combination of supervised and semi-supervised transcribed datasets totaling approximately 520 hours of audio. For external benchmarking we use:

- **FLEURS** (`fil_ph` split)
- **Magic Data Tech (Tagalog)**: to obtain this dataset, please register and download it [here](https://www.magicdatatech.com/datasets/asr/mdt-asr-e017-filipinotagalog-scripted-speech-corpus-1630305526), and review its Terms of Use, Privacy Policy, and License.
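Since the tokenizer emits only Filipino-alphabet characters and the apostrophe, reference and hypothesis texts are usually pre-normalized to that character set before scoring. Below is a minimal sketch, assuming the charset is lowercase `a-z` plus `ñ` and the apostrophe ("ng" is a digraph of existing letters); the exact normalizer used for this model is not published.

```python
import re
import unicodedata

# Assumed tokenizer charset: lowercase a-z, ñ, apostrophe, and space.
_ALLOWED = re.compile(r"[^a-zñ' ]+")

def normalize(text: str) -> str:
    """Lowercase and strip characters outside the assumed tokenizer charset."""
    text = unicodedata.normalize("NFC", text).lower()
    text = _ALLOWED.sub(" ", text)   # drop digits, punctuation, other symbols
    return " ".join(text.split())    # collapse repeated whitespace

print(normalize("Kumusta ka, Señor?"))  # → kumusta ka señor
```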
## Benchmark Results

We evaluate the model with **Word Error Rate (WER)**. To ensure a consistent and fair comparison, we apply the same **text normalization** to all systems, covering numbers, lowercasing, and punctuation.

| # | Model | #Params | FLEURS test (fil_ph) | Magic Data Tech |
|---|--------------------------------------------------------------------------------|---------|----------------------|-----------------|
| 1 | **FastConformer-Hybrid Large** (ours) | 115M | 9.34% | **16.10%** |
| 2 | [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) | 809M | 11.60% | 16.43% |
| 3 | [ElevenLabs](https://elevenlabs.io/app/speech-to-text) | --- | 9.19% | 21.08% |
| 4 | [Google](https://cloud.google.com/speech-to-text/v2/docs/chirp_2-model) | --- | **7.42%** | 28.79% |

## Audio & I/O

- Expected input: mono WAV, **16 kHz**, PCM16 (recommended).
- Other formats are supported if your audio loader converts them to 16 kHz mono float PCM.

### Transcribing using Python

1. Install NeMo:

```shell
pip install "nemo_toolkit[asr]"
```

2. Download the model checkpoint:

```python
from huggingface_hub import hf_hub_download

nemo_model_path = hf_hub_download(
    repo_id="NCSpeech/stt_tl_fastconformer_hybrid_large",
    filename="stt_tl_fastconformer_hybrid_large.nemo",
)
print(nemo_model_path)  # local path to the .nemo file
```

3. Load the pre-trained model:

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from(nemo_model_path)
asr_model.eval()
```

4. Transcribe a single audio file:

```python
path2wav = 'audio.wav'
output = asr_model.transcribe([path2wav])
print(output[0].text)
```
5. Or transcribe multiple audio files:

```python
audio_file_list = ['audio1.wav', 'audio2.wav', 'audio3.wav']
sys_tran = asr_model.transcribe(
    audio=audio_file_list,
    batch_size=len(audio_file_list),
    return_hypotheses=True,
    num_workers=0,
)
for hyp, utt in zip(sys_tran, audio_file_list):
    print(f"{utt}: {hyp.text}")
```

For more details, please refer to the [NeMo ASR documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/intro.html).
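The WER figures in the benchmark table can be reproduced with any standard scorer on normalized transcripts. A minimal pure-Python sketch (a hypothetical helper, not part of NeMo) computing word-level edit distance divided by reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # (len(ref)+1) x (len(hyp)+1) edit-distance table
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("kumusta ka na", "kumusta na"))  # → 0.3333333333333333
```

In practice a maintained scorer such as `jiwer` gives the same numbers; the sketch only makes the metric's definition explicit.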