Implemented a custom multimodal GRPO trainer that scales to small VLMs and supports both CPU and GPU, with vLLM + Flash Attention. It uses SmolVLM-256M-Instruct as the reference & reward model. It wasn't trained for long, but it still showed some sparks of “thinking” :)
Code: https://github.com/Jaykef/ai-algorithms/blob/main/grpo_multimodal_reasoner.ipynb
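
For readers new to GRPO, here is a minimal sketch of the group-relative advantage step that gives the method its name. This is not taken from the notebook: the function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are all assumptions for illustration. The idea is that each sampled completion's reward is normalized against the other completions for the same prompt, so no separate value network is needed.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for a single prompt.

    Hypothetical helper for illustration only; not the notebook's code.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four completions sampled for one image + question prompt
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Completions scoring above the group mean get a positive advantage and those below get a negative one; if every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.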
	
		
 
							