# FastConformer-Hybrid Large (Tagalog/Filipino)
Production-oriented ASR model for Tagalog/Filipino, built on FastConformer with a hybrid Transducer + CTC objective and a BPE tokenizer (vocab size 1,024).
The tokenizer includes characters from the Filipino alphabet and the apostrophe (').
## Model Architecture
The model is based on the FastConformer architecture with a hybrid Transducer and CTC loss function. The model uses a Byte Pair Encoding (BPE) tokenizer with a vocabulary size of 1,024 tokens. Only characters from the Filipino alphabet and the apostrophe (') are included in the tokenizer vocabulary.
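To make the CTC side of the hybrid objective concrete, here is an illustrative sketch of the greedy CTC collapse rule (merge consecutive duplicates, then drop blanks). This is not the model's actual decoder, and the blank symbol and token alphabet are placeholders; NeMo's Transducer/CTC decoding operates over the BPE vocabulary described above.

```python
# Illustrative only: greedy CTC decoding collapses repeated frame-level
# tokens and removes blanks. "_" is a placeholder blank symbol, not the
# model's real blank id.
BLANK = "_"

def ctc_collapse(frame_tokens):
    """Merge consecutive duplicates, then drop blank symbols."""
    out = []
    prev = None
    for tok in frame_tokens:
        if tok != prev:       # keep only the first token of each run
            if tok != BLANK:  # blanks are never emitted
                out.append(tok)
        prev = tok
    return "".join(out)

print(ctc_collapse(list("hh_ee_ll_ll_oo")))  # -> hello
```

The blank token lets CTC emit repeated characters (here the double "l") by separating two runs of the same symbol.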
## Datasets
The model is trained on a combination of supervised and semi-supervised transcribed datasets totaling approximately 520 hours of audio. For external benchmarking we use:
- FLEURS (fil_phsplit)
- Magic Data Tech (Tagalog), which requires registration to access.
## Benchmark Results
We evaluate the model using Word Error Rate (WER). To ensure consistent and fair comparison, we manually apply text normalization, which handles numbers, lowercasing, and punctuation.
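As a sketch of the evaluation setup, the snippet below shows a toy normalization step (lowercase, strip punctuation while keeping the apostrophe, which is in the tokenizer vocabulary) and a plain Levenshtein-based WER. The exact normalization used for the reported numbers (e.g. how numbers are expanded) is not reproduced here.

```python
import re

def normalize(text):
    """Toy normalization: lowercase and strip punctuation, keeping the
    apostrophe. Number handling is omitted from this sketch."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", "", text)
    return " ".join(text.split())

def wer(ref, hyp):
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(r)][len(h)] / len(r)

ref = normalize("Kumusta ka, kaibigan?")
hyp = normalize("kumusta ka kaibigan")
print(wer(ref, hyp))  # -> 0.0 (identical after normalization)
```

Without normalization, casing and punctuation differences alone would inflate the WER of every system under comparison.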
To obtain the Magic Data Tech dataset, please register and download it from the provider. Please review the Terms of Use, Privacy Policy, and License.
| # | Model | #Params | FLEURS test (fil_ph) | Magic Data Tech |
|---|---|---|---|---|
| 1 | FastConformer-Hybrid Large (ours) | 115M | 9.34% | 16.10% |
| 2 | whisper-large-v3-turbo | 809M | 11.60% | 16.43% |
| 3 | ElevenLabs | --- | 9.19% | 21.08% |
| 4 | --- | --- | 7.42% | 28.79% |
## Audio & I/O
- Expected input: mono WAV, 16 kHz, PCM16 (recommended).
- Other formats are supported if your audio loader converts to 16 kHz mono float PCM.
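Before transcription, it can be useful to verify that a WAV file already matches the recommended input format. Below is a minimal sketch using Python's standard-library `wave` module; `example.wav` is a throwaway file generated for the self-check, not an asset shipped with the model.

```python
import wave

def is_recommended_format(path):
    """True if the WAV file is mono, 16 kHz, 16-bit PCM, as the model
    expects. Other formats should be converted by your audio loader."""
    with wave.open(path, "rb") as w:
        return (w.getnchannels() == 1
                and w.getframerate() == 16000
                and w.getsampwidth() == 2)

# Self-check with a generated 10 ms silent clip:
with wave.open("example.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 2 bytes per sample = PCM16
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 160)

print(is_recommended_format("example.wav"))  # -> True
```

If the check fails, resample and downmix the audio (e.g. with ffmpeg or librosa) before passing it to the model.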
## Download from Hub
### Python (recommended)

```python
from huggingface_hub import hf_hub_download

nemo_path = hf_hub_download(
    repo_id="NCSpeech/stt_tl_fastconformer_hybrid_large",
    filename="stt_tl_fastconformer_hybrid_large.nemo",
)
print(nemo_path)  # local path to the .nemo checkpoint
```
### Transcribing using Python
1. Install NeMo:

```shell
pip install "nemo_toolkit[asr]"
```

2. Load the pre-trained model:

```python
import nemo.collections.asr as nemo_asr

nemo_model_path = '<path_to_model>/stt_tl_fastconformer_hybrid_large.nemo'
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from(nemo_model_path)
asr_model.eval()
```

3. Transcribe a single audio file:

```python
path2wav = 'audio.wav'
output = asr_model.transcribe([path2wav])
print(output[0].text)
```

4. Or transcribe multiple audio files:

```python
audio_file_list = ['audio1.wav', 'audio2.wav', 'audio3.wav']
sys_tran = asr_model.transcribe(
    audio=audio_file_list,
    batch_size=len(audio_file_list),
    return_hypotheses=True,
    num_workers=0,
)
for s, utt in zip(sys_tran, audio_file_list):
    print(f"{utt}: {s.text}")
```
For more details, please refer to the NeMo ASR documentation.
