---
license: cc-by-4.0
library_name: nemo
pipeline_tag: automatic-speech-recognition
language:
  - tl
  - fil
tags:
  - asr
  - automatic-speech-recognition
  - fastconformer
  - tagalog
  - filipino
  - nemo
  - speech
datasets:
  - google/fleurs
model-index:
  - name: stt_tl_fastconformer_hybrid_large
    results:
      - task:
          type: automatic-speech-recognition
          name: ASR
        dataset:
          name: FLEURS (fil_ph)
          type: google/fleurs
          args: fil_ph
        metrics:
          - type: wer
            name: WER
            value: 9.34
      - task:
          type: automatic-speech-recognition
          name: ASR
        dataset:
          name: Magic Data Tech (Tagalog)
          type: other
          url: >-
            https://www.magicdatatech.com/datasets/asr/mdt-asr-e017-filipinotagalog-scripted-speech-corpus-1630305526
        metrics:
          - type: wer
            name: WER
            value: 16.1
---

FastConformer-Hybrid Large (Tagalog/Filipino)


Production-oriented ASR model for Tagalog/Filipino, built on FastConformer.

Model Architecture

The model is based on the FastConformer architecture and is trained with a hybrid loss that combines Transducer (RNN-T) and CTC objectives. It uses a Byte Pair Encoding (BPE) tokenizer with a vocabulary of 1,024 tokens; only characters from the Filipino alphabet and the apostrophe (') are included in the tokenizer vocabulary.
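The exact text preprocessing used to build the tokenizer vocabulary is not published; as a rough illustration of restricting text to the Filipino alphabet plus the apostrophe, a cleaning step might look like this (the function name and regex are assumptions, not the authors' code):

```python
import re

# Letters of the Filipino alphabet as used here: a-z plus "ñ"; the digraph
# "ng" is covered by "n" and "g". Apostrophe and space are also kept.
_DISALLOWED = re.compile(r"[^a-zñ' ]+")

def clean_for_tokenizer(text: str) -> str:
    """Hypothetical preprocessing: lowercase, drop every character outside
    the Filipino alphabet and the apostrophe, collapse whitespace."""
    text = text.lower()
    text = _DISALLOWED.sub(" ", text)          # disallowed runs become spaces
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

print(clean_for_tokenizer("Kumusta ka, señor? 123!"))  # kumusta ka señor
```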

Datasets

The model is trained on a combination of supervised and semi-supervised transcribed datasets totaling approximately 520 hours of audio. For external benchmarking we use:

  • FLEURS (fil_ph split)
  • Magic Data Tech (Tagalog): to obtain this dataset, register and download it from Magic Data Tech (see the URL in the metadata above), and review its Terms of Use, Privacy Policy, and License before use.

Benchmark Results

We evaluate the model using the Word Error Rate (WER) metric. To ensure consistent and fair comparisons, we apply text normalization before scoring, including number handling, lowercasing, and punctuation removal.
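For reference, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. A minimal sketch follows; the normalization shown is a toy stand-in (the card's exact rules, e.g. for numbers, are not published):

```python
import string

def normalize(text: str) -> str:
    """Toy normalization (assumption): lowercase and strip punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: word-level Levenshtein distance over reference length."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # row of distances for the empty ref prefix
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (rw != hw)))     # substitution
        prev = cur
    return prev[-1] / len(r)

ref = normalize("Kumusta ka na?")
hyp = normalize("Kumusta na")
print(f"WER: {wer(ref, hyp):.2%}")  # WER: 33.33% (one deleted word out of three)
```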

| # | STT Model | #Params | FLEURS test (fil_ph) | Magic Data Tech |
|---|-----------|---------|----------------------|-----------------|
| 1 | FastConformer-Hybrid Large (ours) | 115M | 9.34% | 16.10% |
| 2 | whisper-large-v3-turbo | 809M | 11.60% | 16.43% |
| 3 | ElevenLabs | --- | 9.19% | 21.08% |
| 4 | Google | --- | 7.42% | 28.79% |

Audio & I/O

  • Expected input: mono WAV, 16 kHz, PCM16 (recommended).
  • Other formats are supported if your audio loader converts to 16 kHz mono float PCM.
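If your loader does not already produce 16 kHz mono float audio, a minimal conversion sketch is shown below (the function name is an assumption; in practice a proper resampler such as the ones in torchaudio or librosa is preferable to linear interpolation):

```python
import numpy as np

def to_mono_16k(samples: np.ndarray, sr: int, target_sr: int = 16000) -> np.ndarray:
    """Convert PCM16 samples (shape [n] or [n, channels]) to mono float32
    at target_sr, using naive linear interpolation for resampling."""
    x = samples.astype(np.float32) / 32768.0   # PCM16 -> float in [-1, 1)
    if x.ndim == 2:                            # average channels to mono
        x = x.mean(axis=1)
    if sr != target_sr:                        # crude resample by interpolation
        n_out = int(round(len(x) * target_sr / sr))
        x = np.interp(np.linspace(0, len(x) - 1, n_out),
                      np.arange(len(x)), x).astype(np.float32)
    return x

# Example: 1 s of 8 kHz stereo PCM16 becomes 16,000 mono float32 samples.
stereo = np.zeros((8000, 2), dtype=np.int16)
mono = to_mono_16k(stereo, sr=8000)
print(mono.shape, mono.dtype)  # (16000,) float32
```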

Transcribing using Python

  1. Install NeMo (quote the extra so the brackets survive shells like zsh):
pip install "nemo_toolkit[asr]"
  2. Download the model checkpoint:
from huggingface_hub import hf_hub_download
nemo_model_path = hf_hub_download(
    repo_id="NCSpeech/stt_tl_fastconformer_hybrid_large",
    filename="stt_tl_fastconformer_hybrid_large.nemo",
)
print(nemo_model_path)  # local path to .nemo
  3. Load the pre-trained model:
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from(nemo_model_path)
asr_model.eval()
  4. Transcribe a single audio file:
path2wav = 'audio.wav'
output = asr_model.transcribe([path2wav])
print(output[0].text)
  5. Or transcribe multiple audio files in one batch:
audio_file_list = ['audio1.wav', 'audio2.wav', 'audio3.wav']
sys_tran = asr_model.transcribe(audio=audio_file_list,
                                batch_size=len(audio_file_list),
                                return_hypotheses=True,
                                num_workers=0)
for s, utt in zip(sys_tran, audio_file_list):
    print(f"{utt}: {s.text}")

For more details, please refer to the NeMo ASR documentation.