Acoustic and language models
The acoustic model is built on the QuartzNet15x5 architecture and trained with the NeMo toolkit.
Three n-gram language models were created with the KenLM Language Model Toolkit:
- LM built on Common Crawl Russian dataset
- LM built on Golos train set
- LM built on Common Crawl and Golos datasets together (50/50)
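
To illustrate what an n-gram LM contributes, here is a minimal sketch that scores candidate sentences with the `kenlm` Python module and ranks them. It is an assumption-laden illustration, not the decoder used for the results in this README: the helper names `load_lm` and `rank_by_lm_score` are hypothetical, and the model file name in the usage comment is a placeholder, since the layout of `KenLMs.tar` is not described here.

```python
from typing import Iterable, List, Tuple


def load_lm(path: str):
    """Load a KenLM model (ARPA or binary file) from disk."""
    # Imported lazily so rank_by_lm_score can be used without kenlm installed.
    import kenlm

    return kenlm.Model(path)


def rank_by_lm_score(lm, sentences: Iterable[str]) -> List[Tuple[str, float]]:
    """Return (sentence, log10 probability) pairs, most probable first.

    `lm` is any object exposing kenlm.Model's `score(sentence, bos, eos)` method.
    """
    scored = [(s, lm.score(s, bos=True, eos=True)) for s in sentences]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Usage (hypothetical file name; check the contents of KenLMs.tar):
# lm = load_lm("path/to/lm.binary")
# for sentence, score in rank_by_lm_score(lm, ["привет мир", "привет мир мир"]):
#     print(f"{score:.2f}\t{sentence}")
```

A beam search decoder combines scores like these with the acoustic model's output during the search itself; the ranking above only shows the LM side in isolation.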
| Archives | Size | Links | 
|---|---|---|
| QuartzNet15x5_golos.nemo | 68 MB | https://sc.link/ZMv | 
| KenLMs.tar | 4.8 GB | https://sc.link/YL0 | 
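
After downloading `QuartzNet15x5_golos.nemo`, the checkpoint can be restored with NeMo roughly as sketched below. This assumes NeMo's ASR collection is installed and that the file sits in the working directory; the helper names `load_acoustic_model` and `transcribe_files` and the audio path in the usage comment are hypothetical.

```python
from typing import Iterable


def load_acoustic_model(nemo_path: str):
    """Restore a CTC acoustic model such as QuartzNet from a .nemo archive."""
    # Imported lazily: NeMo is a heavy dependency and is only needed here.
    import nemo.collections.asr as nemo_asr

    return nemo_asr.models.EncDecCTCModel.restore_from(restore_path=nemo_path)


def transcribe_files(model, wav_paths: Iterable[str]):
    """Run greedy transcription on a list of audio files."""
    # The list is passed positionally because the keyword name of this
    # argument differs between NeMo releases.
    return model.transcribe([str(p) for p in wav_paths])


# Usage (hypothetical audio path):
# model = load_acoustic_model("QuartzNet15x5_golos.nemo")
# print(transcribe_files(model, ["sample.wav"]))
```

The return type of `transcribe` depends on the installed NeMo version, so inspect the result before post-processing it.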
Golos data and models are also available in DataHub ML Space, a hub of pre-trained models, datasets, and containers. You can train the model and deploy it on the high-performance SberCloud infrastructure in ML Space, a full-cycle machine learning development platform for data science team collaboration, built on the Christofari supercomputer.
Evaluation
Word Error Rate (WER, %) on different test sets
| Decoder \ Test set | Crowd test | Farfield test | MCV¹ dev | MCV¹ test | 
|---|---|---|---|---|
| Greedy decoder | 4.389 % | 14.949 % | 9.314 % | 11.278 % | 
| Beam Search with Common Crawl LM | 4.709 % | 12.503 % | 6.341 % | 7.976 % | 
| Beam Search with Golos train set LM | 3.548 % | 12.384 % | - | - | 
| Beam Search with Common Crawl and Golos LM | 3.318 % | 11.488 % | 6.4 % | 8.06 % | 
¹ Common Voice (MCV): Mozilla's initiative to help teach machines how real people speak.
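
For readers who want to reproduce this kind of measurement, the sketch below computes WER as the word-level edit distance divided by the number of reference words, plus the relative reduction between two WER values. It assumes plain whitespace tokenization with no text normalization, which may differ from the evaluation setup behind the table; the function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer


# One substitution and one deletion against four reference words.
print(word_error_rate("a b c d", "a x c"))
# Crowd test: greedy decoder vs. beam search with the combined LM.
print(round(relative_reduction(4.389, 3.318), 1))
```

With the table's Crowd test figures, going from 4.389 % to 3.318 % corresponds to a relative WER reduction of about 24.4 %.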
Resources
[arxiv.org] Golos: Russian Dataset for Speech Research
[habr.com] How to improve Russian speech recognition to 3% WER using open data
