Personalized Text-to-Image Generation with Auto-Regressive Models
Abstract
Auto-regressive models achieve comparable performance to diffusion models in personalized image synthesis through a two-stage training strategy that optimizes text embeddings and fine-tunes transformer layers.
Personalized image synthesis has emerged as a pivotal application of text-to-image generation, enabling the creation of images that feature specific subjects in diverse contexts. While diffusion models have dominated this domain, auto-regressive models, with their unified architecture for text and image modeling, remain underexplored for personalized image generation. This paper investigates the potential of optimizing auto-regressive models for personalized image synthesis, leveraging their inherent multimodal capabilities. We propose a two-stage training strategy that combines optimization of text embeddings with fine-tuning of transformer layers. Our experiments on an auto-regressive model demonstrate that this method achieves subject fidelity and prompt following comparable to the leading diffusion-based personalization methods. The results highlight the effectiveness of auto-regressive models in personalized image generation and offer a new direction for future research in this area.
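The two-stage idea can be illustrated with a toy PyTorch sketch. This is not the authors' implementation: the tiny model, the placeholder token id `NEW_ID`, the next-token loss, the stage order (embeddings first, then layers), and all hyperparameters are assumptions chosen only to show how one might train a single text embedding with everything else frozen and then fine-tune the transformer layers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes; NEW_ID stands in for a placeholder token naming the subject.
VOCAB, DIM, NEW_ID = 16, 8, 15


class TinyAR(nn.Module):
    """Toy causal transformer over one shared text+image token vocabulary."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(
            DIM, nhead=2, dim_feedforward=16, dropout=0.0, batch_first=True
        )
        self.layers = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        length = tokens.size(1)
        # Upper-triangular -inf mask so each position only attends to the past.
        causal = torch.triu(torch.full((length, length), float("-inf")), diagonal=1)
        return self.head(self.layers(self.embed(tokens), mask=causal))


def next_token_loss(model, seq):
    logits = model(seq[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1)
    )


def train(model, params, seq, steps, lr, grad_mask=None):
    """Freeze everything, then unfreeze and optimize only `params`."""
    for p in model.parameters():
        p.requires_grad_(False)
    for p in params:
        p.requires_grad_(True)
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = next_token_loss(model, seq)
        loss.backward()
        if grad_mask is not None:
            # Keep gradients only for the selected embedding rows.
            model.embed.weight.grad *= grad_mask
        opt.step()
    return loss.item()


model = TinyAR()
# Made-up sequence: caption tokens (including NEW_ID) followed by "image" tokens.
seq = torch.tensor([[1, NEW_ID, 2, 7, 8, 9, 7, 8]])

emb_before = model.embed.weight.detach().clone()
layers_before = [p.detach().clone() for p in model.layers.parameters()]

# Stage 1 (assumed): optimize only the embedding row of the new token.
row_mask = torch.zeros(VOCAB, 1)
row_mask[NEW_ID] = 1.0
train(model, [model.embed.weight], seq, steps=20, lr=0.1, grad_mask=row_mask)
emb_after_stage1 = model.embed.weight.detach().clone()

# Stage 2 (assumed): keep embeddings fixed and fine-tune the transformer layers.
train(model, list(model.layers.parameters()), seq, steps=20, lr=0.01)
```

In this sketch, stage 1 moves only the `NEW_ID` row of the embedding table, and stage 2 changes only the transformer layers; which layers to unfreeze and for how long are design choices the abstract does not specify.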
Community
Github: https://github.com/KaiyueSun98/T2I-Personalization-with-AR