Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization
Abstract
A reinforcement learning framework for text-to-visualization generation that improves chart quality and code execution by optimizing multiple objectives using post-execution feedback.
Text-to-Visualization (Text2Vis) systems translate natural language queries over tabular data into concise answers and executable visualizations. While closed-source LLMs generate functional code, the resulting charts often lack semantic alignment and clarity, qualities that can only be assessed post-execution. Open-source models struggle even more, frequently producing non-executable or visually poor outputs. Although supervised fine-tuning can improve code executability, it fails to enhance overall visualization quality, as the traditional SFT loss cannot capture post-execution feedback. To address this gap, we propose RL-Text2Vis, the first reinforcement learning framework for Text2Vis generation. Built on Group Relative Policy Optimization (GRPO), our method employs a novel multi-objective reward that jointly optimizes textual accuracy, code validity, and visualization quality based on post-execution feedback. Applied to Qwen2.5 models (7B and 14B), RL-Text2Vis achieves a 22% relative improvement in chart quality over GPT-4o on the Text2Vis benchmark and boosts code execution success from 78% to 97% relative to its zero-shot baseline. Our models significantly outperform strong zero-shot and supervised baselines and demonstrate robust generalization to out-of-domain datasets such as VIS-Eval and NVBench. These results establish GRPO as an effective strategy for structured, multimodal reasoning in visualization generation. We release our code at https://github.com/vis-nlp/RL-Text2Vis.
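To make the multi-objective reward concrete, below is a minimal sketch of how post-execution feedback could be folded into a single scalar reward. The weights and the two scorer functions are illustrative placeholders, not the paper's actual implementation (the paper's visualization-quality judgment happens on the rendered chart after execution, which is stubbed out here).

```python
import contextlib
import io

# Illustrative weights only; the paper's actual reward weighting is not specified here.
W_TEXT, W_EXEC, W_VIS = 0.3, 0.3, 0.4

def execute_chart_code(code: str) -> bool:
    """Run generated plotting code and report whether it executed without error."""
    try:
        with contextlib.redirect_stdout(io.StringIO()):
            exec(code, {"__name__": "__main__"})  # a sandboxed subprocess is safer in practice
        return True
    except Exception:
        return False

def score_answer(answer: str, gold: str) -> float:
    """Placeholder textual-accuracy scorer: normalized exact match."""
    return float(answer.strip().lower() == gold.strip().lower())

def score_chart_quality(code: str) -> float:
    """Placeholder visualization-quality scorer; stands in for a post-execution judge
    of the rendered chart, which this sketch does not implement."""
    return 1.0 if "plt." in code or "matplotlib" in code else 0.0

def multi_objective_reward(answer: str, code: str, gold_answer: str) -> float:
    """Combine textual accuracy, code executability, and chart quality into one scalar."""
    executed = execute_chart_code(code)
    r_exec = 1.0 if executed else 0.0
    r_vis = score_chart_quality(code) if executed else 0.0  # quality only counts if the code ran
    r_text = score_answer(answer, gold_answer)
    return W_TEXT * r_text + W_EXEC * r_exec + W_VIS * r_vis

# Toy usage: a generated answer plus a small chart snippet.
demo_code = "import matplotlib.pyplot as plt\nplt.plot([1, 2, 3]); plt.title('Sales')"
print(multi_objective_reward(answer="2021", code=demo_code, gold_answer="2021"))
```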
Community
Generating working visualization code is not enough. Charts must be semantically correct and visually meaningful.
We introduce RL-Text2Vis, the first reinforcement-learning framework for Text-to-Visualization, using post-execution feedback to jointly optimize:
✔️ textual accuracy
✔️ code executability
✔️ visualization quality
📈 Results:
• +22% relative improvement in chart quality over GPT-4o
• Code execution success boosted from 78% → 97%
• Strong generalization to out-of-domain benchmarks
This work demonstrates the power of multi-objective RL for structured, multimodal reasoning.
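For context on the GRPO objective mentioned in the abstract, here is a minimal sketch of the group-relative advantage computation it relies on; the group size and reward values below are illustrative, not the paper's training configuration.

```python
import numpy as np

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """GRPO normalizes each sampled response's reward against its own group:
    A_i = (r_i - mean(r)) / (std(r) + eps), so no separate value network is needed."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: scalar rewards for a group of 4 candidate visualizations sampled for one query.
print(group_relative_advantages([0.9, 0.4, 0.4, 0.1]))
```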