---
license: cc-by-4.0
library_name: nemo
pipeline_tag: automatic-speech-recognition
language:
- tl
- fil
tags:
- asr
- automatic-speech-recognition
- fastconformer
- tagalog
- filipino
- nemo
- speech
datasets:
- google/fleurs
model-index:
- name: stt_tl_fastconformer_hybrid_large
  results:
  - task:
      type: automatic-speech-recognition
      name: ASR
    dataset:
      name: FLEURS (fil_ph)
      type: google/fleurs
      args: fil_ph
    metrics:
    - type: wer
      name: WER
      value: 9.34
  - task:
      type: automatic-speech-recognition
      name: ASR
    dataset:
      name: Magic Data Tech (Tagalog)
      type: other
      url: >-
        https://www.magicdatatech.com/datasets/asr/mdt-asr-e017-filipinotagalog-scripted-speech-corpus-1630305526
    metrics:
    - type: wer
      name: WER
      value: 16.1
---

# FastConformer-Hybrid Large (Tagalog/Filipino)

[![Arch](https://img.shields.io/badge/Arch-FastConformer_Hybrid-lightgrey)](#model-architecture)
[![Params](https://img.shields.io/badge/Params-115M-lightgrey)](#model-architecture)
[![Language](https://img.shields.io/badge/Language-tl%2Ffil-lightgrey)](#datasets)

Production-oriented ASR model for **Tagalog/Filipino**, built on **FastConformer**.

## Model Architecture
The model is based on the **FastConformer** [architecture](https://arxiv.org/pdf/2305.05084), trained with a hybrid loss that combines Transducer (RNNT) and CTC objectives.
It uses a Byte Pair Encoding (BPE) tokenizer with a vocabulary of 1,024 tokens; only characters from the Filipino alphabet and the apostrophe (') are included in the tokenizer vocabulary.
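Once the model is loaded (see the Python steps below), the tokenizer can be inspected directly. A minimal sketch, assuming NeMo's standard BPE tokenizer API:

```python
# Sketch: inspect the BPE tokenizer of a loaded model (assumes `asr_model`
# was loaded as shown under "Transcribing using Python" below).
print(asr_model.tokenizer.vocab_size)                         # expected: 1024
print(asr_model.tokenizer.text_to_tokens("magandang umaga"))  # BPE pieces
```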

## Datasets
The model is trained on a combination of supervised and semi-supervised transcribed datasets totaling approximately 520 hours of audio.
For external benchmarking, we use:
- **FLEURS** (`fil_ph` split)  
- **Magic Data Tech (Tagalog)** — to obtain this dataset, please register and download it [here](https://www.magicdatatech.com/datasets/asr/mdt-asr-e017-filipinotagalog-scripted-speech-corpus-1630305526), and review its Terms of Use, Privacy Policy, and License before use.

## Benchmark Results
We evaluate the model with the **Word Error Rate (WER)** metric. To ensure consistency and fairness in comparison, we manually apply the same **text normalization** to all systems, including the handling of numbers, lowercasing, and punctuation removal.

| STT | Model                                                                          | #Params | Fleurs test (fil_ph) | Magic data tech |
|-----|--------------------------------------------------------------------------------|---------|----------------------|-----------------|
| 1   | **FastConformer-Hybrid Large** (ours)                                          | 115M    | 9.34%                | **16.10%**      |
| 2   | [whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) | 809M    | 11.60%               | 16.43%          |
| 3   | [ElevenLabs](https://elevenlabs.io/app/speech-to-text)                         | ---     | 9.19%                | 21.08%          |
| 4   | [Google](https://cloud.google.com/speech-to-text/v2/docs/chirp_2-model)        | ---     | **7.42%**            | 28.79%          |
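
For reference, the normalization described above can be approximated as follows. This is a minimal sketch, assuming `jiwer` for WER computation (not necessarily the exact scoring pipeline used here); number handling is omitted for brevity:

```python
import re
import jiwer  # pip install jiwer

def normalize(text: str) -> str:
    text = text.lower()                       # lowercase
    text = re.sub(r"[^\w\s']", " ", text)     # drop punctuation, keep apostrophes
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

reference = "Magandang umaga, Pilipinas!"
hypothesis = "magandang umaga pilipinas"
print(jiwer.wer(normalize(reference), normalize(hypothesis)))  # 0.0
```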

## Audio & I/O
- Expected input: mono WAV, **16 kHz**, PCM16 (recommended).
- Other formats are supported as long as your audio loader converts them to 16 kHz mono float PCM; see the sketch below.
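
A minimal conversion sketch, assuming `librosa` and `soundfile` are installed:

```python
import librosa
import soundfile as sf

# Resample to 16 kHz and downmix to mono, then write 16-bit PCM WAV.
audio, sr = librosa.load("input.mp3", sr=16000, mono=True)
sf.write("audio.wav", audio, sr, subtype="PCM_16")
```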

### Transcribing using Python

1. Install NeMo:
```shell
pip install "nemo_toolkit[asr]"  # quotes protect the brackets in zsh
```
2. Download the model checkpoint:
```python
from huggingface_hub import hf_hub_download
nemo_model_path = hf_hub_download(
    repo_id="NCSpeech/stt_tl_fastconformer_hybrid_large",
    filename="stt_tl_fastconformer_hybrid_large.nemo",
)
print(nemo_model_path)  # local path to .nemo
```
3. Load the pre-trained model:
```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from(nemo_model_path)
asr_model.eval()
```
4. Transcribe a single audio file:
```python
path2wav = 'audio.wav'  # mono 16 kHz WAV (see Audio & I/O above)
output = asr_model.transcribe([path2wav])
print(output[0].text)
```
5. Or transcribe multiple audio files:
```python
audio_file_list = ['audio1.wav', 'audio2.wav', 'audio3.wav']
hypotheses = asr_model.transcribe(
    audio=audio_file_list,
    batch_size=len(audio_file_list),
    return_hypotheses=True,  # return Hypothesis objects rather than plain text
    num_workers=0,
)
for hyp, utt in zip(hypotheses, audio_file_list):
    print(f"{utt}: {hyp.text}")
```
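
Because the model is trained with a hybrid loss, NeMo lets you switch the decoding branch after loading. This is a sketch based on NeMo's hybrid-model API (`change_decoding_strategy`); defaults and behavior may vary across NeMo versions:

```python
# Switch to the (typically faster) CTC decoder...
asr_model.change_decoding_strategy(decoder_type="ctc")
print(asr_model.transcribe(['audio.wav'])[0].text)

# ...and back to the default Transducer (RNNT) decoder.
asr_model.change_decoding_strategy(decoder_type="rnnt")
```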

For more details, please refer to the [NeMo ASR documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/intro.html).