arxiv:2508.15418

LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model

Published on Aug 21 · Submitted by Yirong Sun on Aug 22

Abstract

LLaSO is an open framework for large-scale speech-language modeling that provides alignment data, instruction-tuning datasets, and evaluation benchmarks to enhance reproducibility and performance.

AI-generated summary

The development of Large Speech-Language Models (LSLMs) has been slowed by fragmented architectures and a lack of transparency, hindering the systematic comparison and reproducibility of research. Unlike in the vision-language domain, the LSLM field suffers from the common practice of releasing model weights without their corresponding training data and configurations. To address these critical gaps, we introduce LLaSO, the first fully open, end-to-end framework for large-scale speech-language modeling. LLaSO provides the community with three essential resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. To validate our framework, we build and release LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public data. It achieves a normalized score of 0.72, establishing a strong, reproducible baseline that surpasses comparable models. Our analysis reveals that while broader training coverage enhances performance, significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. By releasing the complete stack of data, benchmarks, and models, LLaSO establishes a foundational open standard to unify research efforts and accelerate community-driven progress in LSLMs. We release the code, dataset, pretrained models, and results at https://github.com/EIT-NLP/LLaSO.

Community


We introduce LLaSO, the first fully open, end-to-end stack for large-scale speech–language modeling.
It unifies corpus, benchmark, and reference models in one framework:

  • LLaSO-Instruct (13.5M): multi-task instruction-tuning dataset
  • LLaSO-Align (12M): speech–text alignment dataset
  • LLaSO-Eval (15K): stratified benchmark
  • LLaSO-Base (3.8B): two-stage trained reference model

👉 Code: https://github.com/EIT-NLP/LLaSO
👉 Datasets: https://huggingface.co/datasets?search=LLaSO
👉 Model: https://huggingface.co/YirongSun/LLaSO-Base-3.8B-Instruct

We are currently uploading LLaSO-Instruct and will soon release LLaSO-Align.
Feedback and contributions are very welcome!

