# FastConformer-Hybrid Large (Tagalog/Filipino)
Production-oriented ASR model for Tagalog/Filipino, built on FastConformer with a hybrid Transducer + CTC objective and a BPE tokenizer (vocab size 1,024).
The tokenizer includes characters from the Filipino alphabet and the apostrophe (').
## Model Architecture
The model is based on the FastConformer architecture with a hybrid Transducer and CTC loss function. The model uses a Byte Pair Encoding (BPE) tokenizer with a vocabulary size of 1,024 tokens. Only characters from the Filipino alphabet and the apostrophe (') are included in the tokenizer vocabulary.
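To make the CTC side of the hybrid objective concrete, here is an illustrative sketch of the greedy CTC collapse rule (merge consecutive duplicates, then drop blanks). This is not the model's actual decoder, and the blank symbol and token alphabet are placeholders; NeMo's Transducer/CTC decoding operates over the BPE vocabulary described above.

```python
# Illustrative only: greedy CTC decoding collapses repeated frame-level
# tokens and removes blanks. "_" is a placeholder blank symbol, not the
# model's real blank id.
BLANK = "_"

def ctc_collapse(frame_tokens):
    """Merge consecutive duplicates, then drop blank symbols."""
    out = []
    prev = None
    for tok in frame_tokens:
        if tok != prev:       # keep only the first token of each run
            if tok != BLANK:  # blanks are never emitted
                out.append(tok)
        prev = tok
    return "".join(out)

print(ctc_collapse(list("hh_ee_ll_ll_oo")))  # -> hello
```

The blank token lets CTC emit repeated characters (here the double "l") by separating two runs of the same symbol.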
## Datasets
The model is trained on a combination of supervised and semi-supervised transcribed datasets totaling approximately 520 hours of audio. For external benchmarking we use:
- FLEURS (fil_phsplit)
- Magic Data Tech (Tagalog), which requires registration to access.
## Benchmark Results
We evaluate the model using Word Error Rate (WER). To ensure consistent and fair comparison, we manually apply text normalization, which handles numbers, lowercasing, and punctuation.
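As a sketch of the evaluation setup, the snippet below shows a toy normalization step (lowercase, strip punctuation while keeping the apostrophe, which is in the tokenizer vocabulary) and a plain Levenshtein-based WER. The exact normalization used for the reported numbers (e.g. how numbers are expanded) is not reproduced here.

```python
import re

def normalize(text):
    """Toy normalization: lowercase and strip punctuation, keeping the
    apostrophe. Number handling is omitted from this sketch."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", "", text)
    return " ".join(text.split())

def wer(ref, hyp):
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(r)][len(h)] / len(r)

ref = normalize("Kumusta ka, kaibigan?")
hyp = normalize("kumusta ka kaibigan")
print(wer(ref, hyp))  # -> 0.0 (identical after normalization)
```

Without normalization, casing and punctuation differences alone would inflate the WER of every system under comparison.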
To obtain the Magic Data Tech dataset, please register and download it from the provider. Please review the Terms of Use, Privacy Policy, and License.
| # | Model | #Params | FLEURS test (fil_ph) | Magic Data Tech |
|---|---|---|---|---|
| 1 | FastConformer-Hybrid Large (ours) | 115M | 9.34% | 16.10% |
| 2 | whisper-large-v3-turbo | 809M | 11.60% | 16.43% |
| 3 | ElevenLabs | --- | 9.19% | 21.08% |
| 4 | --- | --- | 7.42% | 28.79% |
## Audio & I/O
- Expected input: mono WAV, 16 kHz, PCM16 (recommended).
- Other formats are supported if your audio loader converts to 16 kHz mono float PCM.
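Before transcription, it can be useful to verify that a WAV file already matches the recommended input format. Below is a minimal sketch using Python's standard-library `wave` module; `example.wav` is a throwaway file generated for the self-check, not an asset shipped with the model.

```python
import wave

def is_recommended_format(path):
    """True if the WAV file is mono, 16 kHz, 16-bit PCM, as the model
    expects. Other formats should be converted by your audio loader."""
    with wave.open(path, "rb") as w:
        return (w.getnchannels() == 1
                and w.getframerate() == 16000
                and w.getsampwidth() == 2)

# Self-check with a generated 10 ms silent clip:
with wave.open("example.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 2 bytes per sample = PCM16
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 160)

print(is_recommended_format("example.wav"))  # -> True
```

If the check fails, resample and downmix the audio (e.g. with ffmpeg or librosa) before passing it to the model.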
## Download from Hub
### Python (recommended)

```python
from huggingface_hub import hf_hub_download

nemo_path = hf_hub_download(
    repo_id="NCSpeech/stt_tl_fastconformer_hybrid_large",
    filename="stt_tl_fastconformer_hybrid_large.nemo",
)
print(nemo_path)  # local path to the .nemo checkpoint
```
### Transcribing using Python
1. Install NeMo:

```shell
pip install "nemo_toolkit[asr]"
```

2. Load the pre-trained model:

```python
import nemo.collections.asr as nemo_asr

nemo_model_path = '<path_to_model>/stt_tl_fastconformer_hybrid_large.nemo'
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from(nemo_model_path)
asr_model.eval()
```

3. Transcribe a single audio file:

```python
path2wav = 'audio.wav'
output = asr_model.transcribe([path2wav])
print(output[0].text)
```

4. Or transcribe multiple audio files:

```python
audio_file_list = ['audio1.wav', 'audio2.wav', 'audio3.wav']
sys_tran = asr_model.transcribe(
    audio=audio_file_list,
    batch_size=len(audio_file_list),
    return_hypotheses=True,
    num_workers=0,
)
for s, utt in zip(sys_tran, audio_file_list):
    print(f"{utt}: {s.text}")
```
For more details, please refer to the NeMo ASR documentation.
