MOSS-TTSD: Text to Spoken Dialogue Generation
Transcribe audio or YouTube videos into text
Generate images that preserve face identity