---
license: cc-by-4.0
library_name: nemo
pipeline_tag: automatic-speech-recognition
language:
- tl
- fil
tags:
- asr
- automatic-speech-recognition
- fastconformer
- tagalog
- filipino
- nemo
- speech
datasets:
- google/fleurs
model-index:
- name: stt_tl_fastconformer_hybrid_large
  results:
  - task:
      type: automatic-speech-recognition
      name: ASR
    dataset:
      name: FLEURS (fil_ph)
      type: google/fleurs
      args: fil_ph
    metrics:
    - type: wer
      name: WER
      value: 9.34
  - task:
      type: automatic-speech-recognition
      name: ASR
    dataset:
      name: Magic Data Tech (Tagalog)
      type: other
      url: >-
        https://www.magicdatatech.com/datasets/asr/mdt-asr-e017-filipinotagalog-scripted-speech-corpus-1630305526
    metrics:
    - type: wer
      name: WER
      value: 16.1
---

# FastConformer-Hybrid Large (Tagalog/Filipino)

[![Arch](https://img.shields.io/badge/Arch-FastConformer_Hybrid-lightgrey)](#model-architecture)
[![Params](https://img.shields.io/badge/Params-115M-lightgrey)](#model-architecture)
[![Language](https://img.shields.io/badge/Language-tl%2Ffil-lightgrey)](#datasets)

Production-oriented ASR model for **Tagalog/Filipino**, built on **FastConformer**.

## Model Architecture

The model is based on the **FastConformer** [architecture](https://arxiv.org/pdf/2305.05084) and is trained with a hybrid Transducer (RNN-T) and CTC loss. It uses a Byte Pair Encoding (BPE) tokenizer with a vocabulary of 1,024 tokens; only characters of the Filipino alphabet and the apostrophe (') are included in the tokenizer vocabulary.

## Datasets

The model is trained on a combination of supervised and semi-supervised transcribed datasets totaling approximately 520 hours of audio. For external benchmarking we use:

- **FLEURS** (`fil_ph` split)
- **Magic Data Tech (Tagalog)**: to obtain this dataset, please register and download it [here](https://www.magicdatatech.com/datasets/asr/mdt-asr-e017-filipinotagalog-scripted-speech-corpus-1630305526), and review its Terms of Use, Privacy Policy, and License.
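Since the tokenizer emits only Filipino-alphabet characters and the apostrophe, reference and hypothesis texts are usually pre-normalized to that character set before scoring. Below is a minimal sketch, assuming the charset is lowercase `a-z` plus `ñ` and the apostrophe ("ng" is a digraph of existing letters); the exact normalizer used for this model is not published.

```python
import re
import unicodedata

# Assumed tokenizer charset: lowercase a-z, ñ, apostrophe, and space.
_ALLOWED = re.compile(r"[^a-zñ' ]+")

def normalize(text: str) -> str:
    """Lowercase and strip characters outside the assumed tokenizer charset."""
    text = unicodedata.normalize("NFC", text).lower()
    text = _ALLOWED.sub(" ", text)   # drop digits, punctuation, other symbols
    return " ".join(text.split())    # collapse repeated whitespace

print(normalize("Kumusta ka, Señor?"))  # → kumusta ka señor
```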
## Benchmark Results

We evaluate the model with **Word Error Rate (WER)**. To ensure a consistent and fair comparison, we apply the same **text normalization** to all systems, covering numbers, lowercasing, and punctuation.

| # | Model | #Params | FLEURS test (fil_ph) | Magic Data Tech |
|---|--------------------------------------------------------------------------------|---------|----------------------|-----------------|
| 1 | **FastConformer-Hybrid Large** (ours) | 115M | 9.34% | **16.10%** |
| 2 | [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) | 809M | 11.60% | 16.43% |
| 3 | [ElevenLabs](https://elevenlabs.io/app/speech-to-text) | --- | 9.19% | 21.08% |
| 4 | [Google](https://cloud.google.com/speech-to-text/v2/docs/chirp_2-model) | --- | **7.42%** | 28.79% |

## Audio & I/O

- Expected input: mono WAV, **16 kHz**, PCM16 (recommended).
- Other formats are supported if your audio loader converts them to 16 kHz mono float PCM.

### Transcribing using Python

1. Install NeMo:

```shell
pip install "nemo_toolkit[asr]"
```

2. Download the model checkpoint:

```python
from huggingface_hub import hf_hub_download

nemo_model_path = hf_hub_download(
    repo_id="NCSpeech/stt_tl_fastconformer_hybrid_large",
    filename="stt_tl_fastconformer_hybrid_large.nemo",
)
print(nemo_model_path)  # local path to the .nemo file
```

3. Load the pre-trained model:

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from(nemo_model_path)
asr_model.eval()
```

4. Transcribe a single audio file:

```python
path2wav = 'audio.wav'
output = asr_model.transcribe([path2wav])
print(output[0].text)
```
5. Or transcribe multiple audio files:

```python
audio_file_list = ['audio1.wav', 'audio2.wav', 'audio3.wav']
sys_tran = asr_model.transcribe(
    audio=audio_file_list,
    batch_size=len(audio_file_list),
    return_hypotheses=True,
    num_workers=0,
)
for hyp, utt in zip(sys_tran, audio_file_list):
    print(f"{utt}: {hyp.text}")
```

For more details, please refer to the [NeMo ASR documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/intro.html).
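The WER figures in the benchmark table can be reproduced with any standard scorer on normalized transcripts. A minimal pure-Python sketch (a hypothetical helper, not part of NeMo) computing word-level edit distance divided by reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # (len(ref)+1) x (len(hyp)+1) edit-distance table
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("kumusta ka na", "kumusta na"))  # → 0.3333333333333333
```

In practice a maintained scorer such as `jiwer` gives the same numbers; the sketch only makes the metric's definition explicit.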