arxiv:2512.00319

RL-Struct: A Lightweight Reinforcement Learning Framework for Reliable Structured Output in LLMs

Published on Nov 29

Abstract

A reinforcement learning framework with a multi-dimensional reward function improves structured output accuracy in large language models by addressing the structure gap without significant computational overhead.

AI-generated summary

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language generation and reasoning. However, their integration into automated software ecosystems is often hindered by the "Structure Gap": the inherent tension between the probabilistic nature of token generation and the deterministic requirements of structured data formats (e.g., JSON, XML). Traditional Supervised Fine-Tuning (SFT) often fails to enforce strict syntactic constraints, leading to "hallucinated" keys or malformed structures, while constrained decoding methods impose significant inference latency. In this paper, we propose a lightweight, efficient Reinforcement Learning (RL) framework to bridge this gap. We introduce a novel Multi-dimensional Reward Function that decomposes the structured output task into a hierarchy of constraints: structural integrity, format correctness, content accuracy, and validity. Leveraging Group Relative Policy Optimization (GRPO), we enable the model to internalize these constraints without the need for a separate critic network, reducing peak VRAM usage by 40% compared to PPO. We validate our approach on multiple tasks, including complex recipe generation and structured math reasoning (GSM8K-JSON). Experimental results demonstrate that our method achieves 89.7% structural accuracy and 92.1% JSON validity, significantly outperforming both zero-shot baselines (e.g., GPT-3.5) and SFT on larger models such as LLaMA-3-8B. Furthermore, we provide a detailed analysis of the training dynamics, revealing a distinct self-paced curriculum in which the model acquires syntactic proficiency before semantic accuracy. Our model is publicly available at https://huggingface.co/Freakz3z/Qwen-JSON.
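
The hierarchical reward described in the abstract can be illustrated with a short sketch. The tier weights, field names, and checks below are illustrative assumptions, not the paper's actual implementation; they show how structural, format, and content terms can be composed hierarchically for a JSON task in the style of GSM8K-JSON.

```python
import json

def structured_reward(output: str, schema_keys: set[str], answer: str) -> float:
    """Illustrative multi-dimensional reward: each tier must pass before the
    next one contributes, mirroring a structural -> semantic hierarchy.
    All weights and checks are assumptions, not the paper's exact design."""
    # Tier 1: structural integrity -- does the output parse as JSON at all?
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return -1.0  # malformed structure: strongest penalty
    if not isinstance(obj, dict):
        return -0.5

    reward = 0.25  # valid JSON object

    # Tier 2: format correctness -- exactly the expected keys, none hallucinated
    keys = set(obj.keys())
    if keys != schema_keys:
        return reward - 0.1 * len(keys - schema_keys)  # penalize hallucinated keys
    reward += 0.25

    # Tier 3: content accuracy -- e.g., the final answer field matches the label
    # ("answer" is a hypothetical field name for a GSM8K-JSON-style task)
    if str(obj.get("answer", "")).strip() == answer.strip():
        reward += 0.5

    return reward
```

Gating the content term behind the structural and format checks is one plausible way to produce the self-paced curriculum the authors report: early in training only the syntactic tiers yield reward signal, so the model learns to emit valid, well-formed JSON before it optimizes answer accuracy.

A minimal sketch of the critic-free ingredient follows: GRPO scores a group of sampled completions per prompt and uses within-group reward statistics as the baseline, so no value network is trained or held in memory. The normalization below follows the commonly published GRPO formulation and is an assumption about this paper's variant; the sampling loop and the clipped policy update are omitted.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages for G completions of the same prompt.

    Rewards are standardized within the group, replacing the learned
    value baseline that PPO would require. rewards: shape (G,).
    """
    std = rewards.std(unbiased=False).clamp_min(1e-6)  # guard zero-variance groups
    return (rewards - rewards.mean()) / std

# Example: four completions of one prompt, scored by a reward like the one above.
adv = grpo_advantages(torch.tensor([0.50, -1.00, 1.00, 0.25]))
# Each advantage then weights a PPO-style clipped importance ratio; dropping the
# critic network is what accounts for the reported peak-VRAM reduction vs. PPO.
```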
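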
