🧠 SQaLe: Enabling new Text-to-SQL models with our massive dataset

Community Article Published November 19, 2025

TL;DR

SQaLe is a large-scale text-to-SQL dataset built from more than 135,000 database schemas and over 500,000 validated triples of schema, question, and query. It was created to address the limits of existing resources in scale, diversity, and realism, providing a foundation for training and evaluating models that translate natural language into SQL. The dataset reflects real schema complexity and can be loaded directly from the Hugging Face Hub for research or fine-tuning:

from datasets import load_dataset

dataset = load_dataset("trl-lab/SQaLe-text-to-SQL-dataset", split="train")
example = dataset[0]
print(example["schema"], example["question"], example["query"])

Link to the paper: OpenReview

Link to dataset: trl-lab/SQaLe-text-to-SQL-dataset


Why we built SQaLe

Large language models have made remarkable progress at translating natural language into SQL. However, most benchmarks contain only a few thousand examples, which limits the ability to train or test models that must generalize to new databases. Many also rely on small academic schemas with few tables and standardized naming conventions, while production databases are far more complex and diverse.

SQaLe was developed to close this gap. It offers a resource that is large enough to support the training of LLMs, realistic enough to reflect real schema variability, and validated to ensure that each SQL query is executable and aligned with its natural-language question. The goal is to move text-to-SQL research closer to real-world performance and enable more reliable training and evaluation of new text-to-SQL models.


How it was created

Overview of the SQaLe generation pipeline.

  1. Schema gathering and extension: The process begins with 22,989 schemas sourced from SchemaPile, a large collection of real relational database schemas. Each schema is extended with a large language model while preserving realistic naming, normalization, and foreign-key structure, yielding 135,875 schemas in total.

  2. Question synthesis: For every schema, diverse natural-language questions are generated, guided by examples from Spider 2.0 and BIRD. The questions vary in style and difficulty and are designed to elicit queries with different numbers of joins and operators.

  3. SQL generation and validation: Candidate SQL statements are generated and then validated by executing them against their corresponding schemas. Only queries that run successfully and align semantically with their question are kept.

This pipeline, executed at scale on up to one hundred GPUs, produced 517,676 verified triples that combine schema, question, and query information.
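The execution-validation step can be sketched with SQLite as a stand-in execution engine (the paper's actual SQL dialect and validation harness are not specified in this post): build the schema in an in-memory database, dry-run the candidate query, and keep the triple only if it executes.

```python
import sqlite3

def validate_triple(schema_ddl: str, query: str) -> bool:
    """Return True if the candidate query executes against the schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)  # build the (empty) schema
        conn.execute(query)             # dry-run the candidate SQL
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

schema = "CREATE TABLE employees (id INT, name TEXT, dept TEXT, salary INT);"
print(validate_triple(schema, "SELECT dept, SUM(salary) FROM employees GROUP BY dept;"))  # True
print(validate_triple(schema, "SELECT missing_col FROM employees;"))                      # False
```

Note that executing against an empty schema only checks syntactic and referential validity; the semantic alignment between question and query still needs a separate check, as described in step 3.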


Dataset at a glance

Statistic                  SQaLe        Spider 2.0   BIRD     SynSQL
Schemas                    135,875      236          80       16,575
Median tables per schema   91           7            5        10
Median columns per schema  435          89           39       72
Foreign keys               13,201,052   0            526      159,547
Triples                    517,676      250          10,962   2,544,390
Queries with JOIN          76%          72%          76%      89%
Figures: distribution of column counts and of involved table counts.

Key characteristics

  • Realistic schema complexity covering databases from small single-domain schemas to large enterprise systems.
  • Diverse query composition including aggregation, nested subqueries, set operations, and comparisons.
  • Natural variation in phrasing and intent with questions that capture everyday analytical language.
  • Execution-validated SQL that ensures consistency between question, schema, and query.

Example sample

Schema (DDL): CREATE TABLE employees (id INT, name TEXT, dept TEXT, salary INT, … );
Question: "Find total salary by department."
Query: SELECT dept, SUM(salary) FROM employees GROUP BY dept;
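The sample triple above runs end to end. In the sketch below the rows are hypothetical, and the columns elided by "…" in the DDL are simply omitted:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INT, name TEXT, dept TEXT, salary INT)")
# Hypothetical rows for illustration only
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [(1, "Ada", "Eng", 100), (2, "Bo", "Eng", 90), (3, "Cy", "Sales", 80)],
)
rows = conn.execute(
    "SELECT dept, SUM(salary) FROM employees GROUP BY dept"
).fetchall()
print(sorted(rows))  # [('Eng', 190), ('Sales', 80)]
```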

How to use the dataset

You can load SQaLe directly from the Hugging Face Datasets library:

from datasets import load_dataset

dataset = load_dataset("trl-lab/SQaLe-text-to-SQL-dataset", split="train")

# Peek at a sample triple
example = dataset[0]
print(example["schema"], example["question"], example["query"])

Each entry contains the full database schema, a natural-language question, the corresponding SQL query, and metadata such as join counts and token lengths.

You can use the dataset to pretrain or fine-tune sequence-to-sequence models for text-to-SQL generation, to benchmark schema reasoning, or to design curriculum learning experiments based on query complexity.
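A curriculum experiment like the one mentioned above can be sketched by bucketing examples on the join-count metadata. The exact field name in SQaLe is not stated in this post, so `num_joins` below is an assumption, and small in-memory stand-in records are used instead of the full download:

```python
from collections import defaultdict

# Stand-in records; real SQaLe rows carry similar metadata
# (the field name "num_joins" is an assumption).
records = [
    {"query": "SELECT * FROM t", "num_joins": 0},
    {"query": "SELECT * FROM a JOIN b ON a.id = b.id", "num_joins": 1},
    {"query": "SELECT * FROM a JOIN b ON a.id = b.id JOIN c ON b.id = c.id",
     "num_joins": 2},
]

def curriculum_buckets(rows, key="num_joins"):
    """Group examples by join count for easy-to-hard training stages."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key]].append(row)
    return dict(buckets)

stages = curriculum_buckets(records)
print(sorted(stages))  # [0, 1, 2]
```

With the real dataset loaded via `load_dataset`, the same split can be done with `dataset.filter(lambda r: r["num_joins"] == n)`, assuming that metadata field name.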


Intended uses

  • Training and evaluation of text-to-SQL and semantic parsing models.
  • Research on schema understanding, compositional generalization, and join reasoning.
  • Benchmarking LLMs on realistic database contexts.
  • Creating sub-datasets for focused experiments on query type or schema scale.

Citation

If you use SQaLe in your research, please cite:

@inproceedings{
  wolff2025sqale,
  title={{SQ}aLe: A large text-to-{SQL} corpus grounded in real schemas},
  author={Cornelius Wolff and Daniel Gomm and Madelon Hulsebos},
  booktitle={EurIPS 2025 Workshop: AI for Tabular Data},
  year={2025},
  url={https://openreview.net/forum?id=6PsKDjgoEy}
}

Closing thoughts

SQaLe is a significant step toward realistic, large-scale text-to-SQL research, yet it is not the final answer. While its scale and schema diversity far exceed existing benchmarks, it still falls short of the data requirements of the next generation of very large models. Building datasets that combine the realism of production environments with the scale demanded by modern architectures remains an open challenge, and there is much left to explore in generating broader, more varied, and context-rich text-to-SQL corpora that reflect how databases are actually used. SQaLe is a foundation to build upon: a step toward the comprehensive, high-fidelity resources that will power the next wave of natural-language interfaces for structured data.
