Indonesian Automatic Speech Recognition with XLSR-53

A fine-tuned XLSR-53 model for Automatic Speech Recognition (ASR) in Indonesian. Decoding with a 4-gram KenLM language model lowers the Word Error Rate (WER) on the Common Voice test set from about 20% to about 12%.


This repository contains the official fine-tuned model from the research paper "Indonesian Automatic Speech Recognition with XLSR-53". The study focuses on developing a robust Indonesian ASR system by fine-tuning the pre-trained cross-lingual XLSR-53 (facebook/wav2vec2-large-xlsr-53) model.

The key contribution of this work is demonstrating that a competitive Word Error Rate (WER) can be achieved with a relatively small dataset (24 hours). The model's accuracy is significantly boosted by integrating a 4-gram KenLM language model, which successfully reduces the WER from 20% to 12% on the Common Voice test set.

[Figure: Proposed Methodology]


Model Details

  • Base Model: This model is built upon the wav2vec 2.0 architecture, specifically the XLSR-53 pre-trained model (facebook/wav2vec2-large-xlsr-53) which was trained on 53 languages.
  • Task: Automatic Speech Recognition (ASR).
  • Language: Indonesian (id).
  • Library: Transformers.
  • Framework: The approach involves fine-tuning the pre-trained model using a Connectionist Temporal Classification (CTC) loss function.
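
To make the CTC idea concrete, here is a minimal sketch of how per-frame CTC predictions collapse into text: consecutive repeats are merged, then blank symbols are dropped. The blank symbol `<pad>` and the word delimiter `|` are assumptions based on common wav2vec 2.0 tokenizer conventions, and `ctc_greedy_collapse` is a hypothetical helper, not code from the paper.

```python
from itertools import groupby

BLANK = "<pad>"  # assumed blank symbol (wav2vec 2.0 tokenizers commonly reuse the pad token)


def ctc_greedy_collapse(frame_labels, blank=BLANK, word_delimiter="|"):
    """Collapse per-frame CTC labels into text (hypothetical helper)."""
    # Step 1: merge runs of identical consecutive labels.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: drop blanks; a blank between two equal labels keeps both of them.
    chars = [label for label in merged if label != blank]
    # Step 3: turn the word delimiter into a space.
    return "".join(" " if c == word_delimiter else c for c in chars).strip()


frames = ["s", "s", "<pad>", "a", "y", "y", "a", "|",
          "m", "a", "<pad>", "a", "a", "f"]
print(ctc_greedy_collapse(frames))  # saya maaf
```

The blank between the two `a` frames in the second word is what allows the doubled letter to survive the merge step.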

Authors

  • Panji Arisaputra
  • Amalia Zahra

Computer Science Department, BINUS Graduate Program, Bina Nusantara University, Jakarta, Indonesia.


Datasets Used for Training

A total of three speech datasets were combined to fine-tune the model, and an additional large text corpus was used to build the language model.

Speech Data for Fine-Tuning:

The total combined duration of speech data is 24 hours, 18 minutes, and 1 second.

  • TITML-IDN: A clean speech corpus containing 14.5 hours of audio from 20 speakers reading phonetically balanced sentences.
  • Magic Data: A 3.5-hour corpus of scripted daily-use sentences from 10 speakers, recorded in various environments.
  • Common Voice (Indonesian): A crowdsourced dataset containing ~6.2 hours of speech from 170 speakers in diverse, non-clean environments.

Text Data for Language Model:

  • In addition to the transcripts from the speech datasets, the OSCAR corpus (unshuffled_deduplicated_id subset) was used to build the KenLM language model. To ensure a balanced vocabulary, only 6% of its 2.3 billion Indonesian words were included.
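
As a rough illustration of the scale involved, 6% of 2.3 billion words is about 138 million words. The sketch below shows one possible way to subsample a corpus line by line. The sampling method is an assumption for illustration only (the card does not state how the 6% was selected), and `iter_sampled_lines` is a hypothetical helper.

```python
import random

TOTAL_WORDS = 2_300_000_000  # size of the OSCAR Indonesian subset quoted above
KEEP_PERCENT = 6

# Integer arithmetic avoids floating-point rounding noise.
word_budget = TOTAL_WORDS * KEEP_PERCENT // 100
print(word_budget)  # 138000000


def iter_sampled_lines(lines, fraction, seed=0):
    """Yield each line with probability `fraction` (assumed method, not the paper's)."""
    rng = random.Random(seed)  # seeded so the subset is reproducible
    for line in lines:
        if rng.random() < fraction:
            yield line
```

Because the function is a generator, it can stream over a large text file without loading it into memory, for example `iter_sampled_lines(open(path, encoding="utf-8"), 0.06)` where `path` is a placeholder.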

Data Preprocessing

The datasets underwent a standardized preprocessing pipeline:

  1. Data Splitting: Datasets were split into training (90%) and validation (10%) sets.
  2. Audio Standardization: All audio files were converted to WAV format with a single channel and resampled to a 16 kHz sampling rate to match the pre-trained model's requirements.
  3. Text Normalization: Transcriptions were cleaned by removing special characters and converting all text to lowercase to create a unified vocabulary.
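
The splitting and text-normalization steps above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the kept character set (`a-z`, apostrophe, space), the shuffling seed, and both function names are illustrative choices, not details taken from the paper. Audio conversion is not covered here.

```python
import random
import re


def normalize_transcript(text):
    """Lowercase and strip special characters (assumed character set)."""
    text = text.lower()
    # Replace anything outside a-z, apostrophe, and space with a space.
    text = re.sub(r"[^a-z' ]+", " ", text)
    # Collapse the whitespace runs left behind by the replacement.
    return re.sub(r"\s+", " ", text).strip()


def train_validation_split(items, validation_fraction=0.1, seed=42):
    """Shuffle reproducibly, then hold out a fraction for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


print(normalize_transcript("Selamat pagi, Dunia!"))  # selamat pagi dunia
train, val = train_validation_split(range(100))
print(len(train), len(val))  # 90 10
```

Note that replacing punctuation with a space turns hyphenated reduplications such as "anak-anak" into two tokens; whether that matches the original pipeline is not stated in the card.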

Evaluation and Results

The model was evaluated against a similar model from a previous study by Syahputra & Zahra (2021), using the Word Error Rate (WER) metric. The evaluation on the Common Voice test split serves as the primary benchmark.

The results show that this XLSR-53 model outperforms the previous wav2vec 2.0-based model. Integrating a 4-gram KenLM language model was crucial: it reduced WER by about 8 percentage points absolute (from 20.3% to 12.2%).

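| Model | Training & Validation Data | Language Model | Test Set | WER (%) |
| --- | --- | --- | --- | --- |
| This Study (XLSR-53) | TITML-IDN + Magic Data + Common Voice (24h 18m) | None | Common Voice | 20.306 |
| This Study (XLSR-53) | TITML-IDN + Magic Data + Common Voice (24h 18m) | 4-gram KenLM | Common Voice | 12.213 |
| Benchmark (Syahputra & Zahra, 2021) | BahasaKita batch 10 – 12 (75h) | None | Common Voice | 21.000 |
| Benchmark (Syahputra & Zahra, 2021) | BahasaKita batch 10 – 12 (75h) | 3-gram KenLM | Common Voice | 41.000 |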

*WER results extracted from Table 3 of the research paper. The benchmark model's high WER with LM is noted in the paper.
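
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. Below is a minimal single-utterance sketch; `word_error_rate` is a hypothetical helper, and the paper's own evaluation tooling may differ. The last lines work out the absolute and relative improvement from the two WER values reported for this study.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("saya pergi ke pasar", "saya pergi pasar"))  # 0.25

wer_without_lm = 20.306  # values reported in the table above
wer_with_lm = 12.213
absolute_reduction = wer_without_lm - wer_with_lm        # about 8.1 points
relative_reduction = absolute_reduction / wer_without_lm  # roughly 40% relative
```

A corpus-level WER is normally computed from total errors over total reference words, not by averaging per-utterance scores.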


How to Use

You can use this model with the transformers library pipeline. For optimal performance, as demonstrated in the research paper, we strongly recommend integrating the provided 4-gram KenLM language model.

pip install transformers torch torchaudio librosa
# For decoding with the language model:
pip install pyctcdecode==0.4.0 kenlm

Without Language Model

from transformers import AutoProcessor, AutoModelForCTC, pipeline
import torch
import librosa

# Load processor and model
processor = AutoProcessor.from_pretrained("panjiariputra/indonesian-xlsr_53-LARGE-4gram")
model = AutoModelForCTC.from_pretrained("panjiariputra/indonesian-xlsr_53-LARGE-4gram")

# Initialize ASR pipeline (GPU if available, otherwise CPU)
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=0 if torch.cuda.is_available() else -1,
)

# Load an audio file; librosa resamples to 16 kHz and downmixes to mono
audio_path = "path/to/your/audio.wav"
speech_array, sampling_rate = librosa.load(audio_path, sr=16000)

# Run transcription
transcription = asr_pipeline(speech_array)
print(transcription)
# Output: {'text': '...transcribed text...'}

With Language Model Integration (Recommended)

For the best accuracy and lowest Word Error Rate (WER), use pyctcdecode with the 4-gram KenLM model (e.g., 4gram.arpa) created during the research.

from transformers import AutoProcessor, AutoModelForCTC
from pyctcdecode import build_ctcdecoder
import torch
import librosa

# Load processor and model
processor = AutoProcessor.from_pretrained("panjiariputra/indonesian-xlsr_53-LARGE-4gram")
model = AutoModelForCTC.from_pretrained("panjiariputra/indonesian-xlsr_53-LARGE-4gram")

# Get vocabulary and build the decoder with the language model
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab_dict = {k: v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}

decoder = build_ctcdecoder(
    labels=list(sorted_vocab_dict.keys()),
    kenlm_model_path="path/to/your/4gram.arpa"  # Path to your KenLM model
)

# Load audio (16kHz, mono)
audio_path = "path/to/your/audio.wav"
speech_array, _ = librosa.load(audio_path, sr=16000)

# Get model logits
with torch.no_grad():
    inputs = processor(speech_array, sampling_rate=16000, return_tensors="pt", padding=True)
    logits = model(**inputs).logits.cpu().numpy()[0]

# Decode using KenLM
lm_transcription = decoder.decode(logits)
print({"text": lm_transcription})
# Output: {'text': '...more accurate transcribed text...'}

Publication and Citation

This work was published in Ingénierie des Systèmes d'Information, Vol. 27, No. 6, December 2022 (DOI: 10.18280/isi.270614). If you use this model or the findings from the paper in your research, please cite:

@article{Arisaputra2022XLSR53,
  author    = {Panji Arisaputra and Amalia Zahra},
  title     = {Indonesian Automatic Speech Recognition with XLSR-53},
  journal   = {Ingénierie des Systèmes d'Information},
  volume    = {27},
  number    = {6},
  pages     = {973--982},
  year      = {2022},
  doi       = {10.18280/isi.270614}
}