🧠 SQaLe: Enabling new Text-to-SQL models with our massive dataset
TL;DR
SQaLe is a large-scale text-to-SQL dataset built from over 135,000 database schemas and more than 500,000 validated triples of schema, question, and query. It was created to address the limits of existing resources in scale, diversity, and realism, providing a foundation for training and evaluating models that translate natural language into SQL. The dataset reflects real schema complexity and can be loaded directly from the Hugging Face Hub for research or fine-tuning:
```python
from datasets import load_dataset

dataset = load_dataset("trl-lab/SQaLe-text-to-SQL-dataset", split="train")
example = dataset[0]
print(example["schema"], example["question"], example["query"])
```
Link to the paper: OpenReview
Link to dataset: trl-lab/SQaLe-text-to-SQL-dataset
Why we built SQaLe
Large language models have recently made remarkable progress in translating natural language into SQL. However, most benchmarks contain only a few thousand examples, which limits the ability to train or test models that must generalize to new databases. Many also rely on small academic schemas with few tables and standardized naming conventions, while production databases are far more complex and diverse.
SQaLe was developed to close this gap. It offers a resource that is large enough to support the training of LLMs, realistic enough to reflect real schema variability, and validated to ensure that each SQL query is executable and aligned with its natural-language question. The goal is to move text-to-SQL research closer to real-world performance and enable more reliable training and evaluation of new text-to-SQL models.
How it was created
**Schema gathering and extension.** The process begins with 22,989 schemas sourced from SchemaPile, a large collection of real relational database schemas. Each schema is extended using a large language model while maintaining realistic naming, normalization, and foreign-key structures. This results in a total of 135,875 individual schemas.
**Question synthesis.** For every schema, diverse natural-language questions are generated based on examples from Spider 2.0 and BIRD. The questions vary in style and difficulty and are designed to elicit queries with different numbers of joins and operators.
**SQL generation and validation.** Candidate SQL statements are created and then validated through execution against their corresponding schemas. Only queries that run successfully and align semantically with their question are kept.
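The execution-validation step can be sketched with SQLite: build each schema in a fresh in-memory database and keep only candidate queries that execute without error. This is an illustrative sketch, not the paper's exact pipeline (the engine and the semantic-alignment check are not shown here); `validate_query` is a hypothetical helper.

```python
import sqlite3


def validate_query(schema_ddl: str, query: str) -> bool:
    """Return True if `query` executes against a fresh in-memory
    database built from `schema_ddl`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)  # create the tables
        conn.execute(query)             # attempt the candidate query
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


ddl = "CREATE TABLE employees (id INT, name TEXT, dept TEXT, salary INT);"
print(validate_query(ddl, "SELECT dept, SUM(salary) FROM employees GROUP BY dept;"))  # True
print(validate_query(ddl, "SELECT * FROM missing_table;"))                            # False
```

A real pipeline would additionally check that the query's result matches the question's intent; execution success alone only filters out syntactically or structurally invalid SQL.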
This pipeline, executed at scale on up to one hundred GPUs, produced 517,676 verified triples that combine schema, question, and query information.
Dataset at a glance
| Statistic | SQaLe | Spider 2.0 | BIRD | SynSQL |
|---|---|---|---|---|
| Schemas | 135,875 | 236 | 80 | 16,575 |
| Median tables per schema | 91 | 7 | 5 | 10 |
| Median columns per schema | 435 | 89 | 39 | 72 |
| Foreign keys | 13,201,052 | 0 | 526 | 159,547 |
| Triples | 517,676 | 250 | 10,962 | 2,544,390 |
| Queries with JOIN | 76% | 72% | 76% | 89% |
Key characteristics
- Realistic schema complexity covering databases from small single-domain schemas to large enterprise systems.
- Diverse query composition including aggregation, nested subqueries, set operations, and comparisons.
- Natural variation in phrasing and intent with questions that capture everyday analytical language.
- Execution-validated SQL that ensures consistency between question, schema, and query.
Example sample
Schema (DDL): CREATE TABLE employees (id INT, name TEXT, dept TEXT, salary INT, … );
Question: "Find total salary by department."
Query: SELECT dept, SUM(salary) FROM employees GROUP BY dept;
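To make the sample concrete, the triple's query can be run end to end against a tiny toy table (the inserted rows below are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INT, name TEXT, dept TEXT, salary INT);")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [(1, "Ada", "eng", 100), (2, "Bob", "eng", 90), (3, "Cat", "ops", 80)],
)
rows = conn.execute(
    "SELECT dept, SUM(salary) FROM employees GROUP BY dept;"
).fetchall()
print(sorted(rows))  # [('eng', 190), ('ops', 80)]
```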
How to use the dataset
You can load SQaLe directly from the Hugging Face Datasets library:
```python
from datasets import load_dataset

dataset = load_dataset("trl-lab/SQaLe-text-to-SQL-dataset", split="train")

# Peek at a sample triple
example = dataset[0]
print(example["schema"], example["question"], example["query"])
```
Each entry contains the full database schema, a natural-language question, the corresponding SQL query, and metadata such as join counts and token lengths.
You can use the dataset to pretrain or fine-tune sequence-to-sequence models for text-to-SQL generation, to benchmark schema reasoning, or to design curriculum learning experiments based on query complexity.
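A curriculum split on query complexity can be sketched in plain Python; the `num_joins` field name below is an assumption standing in for the dataset's join-count metadata (with the Hub dataset, you would apply the same predicate via `Dataset.filter`):

```python
# Tiny stand-in for SQaLe rows; the metadata field name `num_joins`
# is hypothetical, used here only to illustrate the idea.
rows = [
    {"query": "SELECT * FROM t1;", "num_joins": 0},
    {"query": "SELECT * FROM t1 JOIN t2 ON t1.id = t2.id;", "num_joins": 1},
    {"query": "SELECT ... (three-way join) ...", "num_joins": 2},
]

# Curriculum buckets: present simple queries first, complex ones later.
easy = [r for r in rows if r["num_joins"] == 0]
hard = [r for r in rows if r["num_joins"] >= 1]
curriculum = easy + hard
print([r["num_joins"] for r in curriculum])  # [0, 1, 2]
```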
Intended uses
- Training and evaluation of text-to-SQL and semantic parsing models.
- Research on schema understanding, compositional generalization, and join reasoning.
- Benchmarking LLMs on realistic database contexts.
- Creating sub-datasets for focused experiments on query type or schema scale.
Citation
If you use SQaLe in your research, please cite:
```bibtex
@inproceedings{wolff2025sqale,
  title={{SQ}aLe: A large text-to-{SQL} corpus grounded in real schemas},
  author={Cornelius Wolff and Daniel Gomm and Madelon Hulsebos},
  booktitle={EurIPS 2025 Workshop: AI for Tabular Data},
  year={2025},
  url={https://openreview.net/forum?id=6PsKDjgoEy}
}
```
Closing thoughts
SQaLe is a significant step toward realistic, large-scale text-to-SQL research, yet it is not the final answer. While its scale and schema diversity far exceed existing benchmarks, it still falls short of the data requirements for training and evaluating the next generation of very large models. Building datasets that combine the realism of production environments with the massive scale demanded by modern architectures remains an open challenge. There is still much to explore in generating broader, more varied, and context-rich text-to-SQL corpora that reflect how databases are actually used in practice. SQaLe is a foundation to build upon: a step toward the comprehensive, high-fidelity resources that will ultimately power the next wave of natural-language interfaces for structured data.


