Habr | Project Page | Technical Report (soon) | Original GitHub | 🤗 Diffusers

Kandinsky 5.0 T2V Lite - Diffusers

This repository provides the 🤗 Diffusers integration for Kandinsky 5.0 T2V Lite, a lightweight video generation model (2B parameters) that ranks #1 among open-source models in its class.

Project Updates

  • πŸ”₯ 2025/09/29: We have open-sourced Kandinsky 5.0 T2V Lite a lite (2B parameters) version of Kandinsky 5.0 Video text-to-video generation model.
  • πŸš€ Diffusers Integration: Now available with easy-to-use πŸ€— Diffusers pipeline!

Kandinsky 5.0 T2V Lite

Kandinsky 5.0 T2V Lite is a lightweight video generation model (2B parameters) that ranks #1 among open-source models in its class. It outperforms larger Wan models (5B and 14B) and offers the best understanding of Russian concepts in the open-source ecosystem.

We provide 8 model variants: four configurations, each available as a 5-second and a 10-second checkpoint (all eight checkpoint IDs are collected in the sketch after this list). Each is optimized for a different use case:

  • SFT model: delivers the highest generation quality
  • CFG-distilled: runs 2× faster
  • Diffusion-distilled: enables low-latency generation with minimal quality loss (6× faster)
  • Pretrain model: designed for fine-tuning by researchers and enthusiasts

Basic Usage

import torch
from diffusers import Kandinsky5T2VPipeline
from diffusers.utils import export_to_video

# Load the pipeline
pipe = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-pretrain-10s-Diffusers", 
    torch_dtype=torch.bfloat16
)
pipe = pipe.to("cuda")

# Optional speedups: switch the transformer to the Flex Attention backend
# and compile it (the first call is slower while compilation runs)
pipe.transformer.set_attention_backend("flex")
pipe.transformer.compile(mode="max-autotune-no-cudagraphs", dynamic=True)

# Generate video
prompt = "A cat and a dog baking a cake together in a kitchen."
negative_prompt = "Static, 2D cartoon, cartoon, 2d animation, paintings, images, worst quality, low quality, ugly, deformed, walking backwards"

output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=512,
    width=768,
    num_frames=241,  # ~10 seconds at 24 fps, matching the 10s checkpoint
    num_inference_steps=50,
    guidance_scale=5.0,
).frames[0]

# Save the video
export_to_video(output, "output.mp4", fps=24, quality=9)
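
If the full pipeline does not fit in GPU memory, the standard Diffusers offloading hooks are worth trying; a minimal sketch, assuming the generic DiffusionPipeline offloading API applies to this pipeline:

# Instead of pipe.to("cuda"): move submodules to the GPU only while they run
# (lower peak VRAM at the cost of speed; assumes the generic Diffusers offload API)
pipe.enable_model_cpu_offload()

# Even lower memory, even slower: offload layer by layer
# pipe.enable_sequential_cpu_offload()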

Using Different Model Variants

import torch
from diffusers import Kandinsky5T2VPipeline

# 5s SFT model (highest quality)
pipe_sft = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers",
    torch_dtype=torch.bfloat16
)

# 5s Distilled 16-step model (fastest)
pipe_distill = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-distilled16steps-5s-Diffusers", 
    torch_dtype=torch.bfloat16
)

# 5s No-CFG model (balanced speed/quality)
pipe_nocfg = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-nocfg-5s-Diffusers",
    torch_dtype=torch.bfloat16
)

# 5s Pretrain model (most diverse)
pipe_pretrain = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-pretrain-5s-Diffusers",
    torch_dtype=torch.bfloat16
)

# 10s SFT model (highest quality)
pipe_sft = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-sft-10s-Diffusers",
    torch_dtype=torch.bfloat16
)

# 10s Distilled 16-step model (fastest)
pipe_distill = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-distilled16steps-10s-Diffusers", 
    torch_dtype=torch.bfloat16
)

# 10s No-CFG model (balanced speed/quality)
pipe_nocfg = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-nocfg-10s-Diffusers",
    torch_dtype=torch.bfloat16
)

# 10s Pretrain model (most diverse)
pipe_pretrain = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-pretrain-10s-Diffusers",
    torch_dtype=torch.bfloat16
)
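
The distilled checkpoints are trained for few-step sampling, so they are typically run with far fewer inference steps and without classifier-free guidance. This card does not list exact settings, so treat the values below as an assumption based on the "16 steps" in the checkpoint name:

# Assumed few-step settings for the 16-step distilled 5s checkpoint:
# num_inference_steps=16 and guidance_scale=1.0 (CFG disabled) are inferred
# from the checkpoint name, not documented values; num_frames=121 assumes
# ~5 seconds at 24 fps.
pipe_distill = pipe_distill.to("cuda")
output = pipe_distill(
    prompt="A cat and a dog baking a cake together in a kitchen.",
    height=512,
    width=768,
    num_frames=121,
    num_inference_steps=16,
    guidance_scale=1.0,
).frames[0]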

Architecture

Kandinsky 5.0 T2V Lite is a latent diffusion pipeline built on Flow Matching:

  • Diffusion Transformer (DiT) serves as the main generative backbone, conditioned on text via cross-attention
  • Qwen2.5-VL and CLIP provide the text embeddings
  • HunyuanVideo 3D VAE encodes and decodes video to and from the latent space

(Figures: Pipeline Architecture, Model Architecture)
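
These components map directly onto attributes of the loaded pipeline. A quick way to see what a given checkpoint ships with; pipe.components is the standard Diffusers component registry, and pipe.transformer is the DiT used in the Basic Usage example above:

# List the modules the pipeline was assembled from
# (text encoders and the VAE appear under their registered names)
for name, component in pipe.components.items():
    print(name, type(component).__name__)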

Examples

Kandinsky 5.0 T2V Lite SFT

Kandinsky 5.0 T2V Lite Distill
Results

Side-by-Side Evaluation

The evaluation is based on the expanded prompts from the Movie Gen benchmark.
(Figure: side-by-side evaluation results)

Distill Side-by-Side Evaluation
(Figure: distill side-by-side evaluation results)

VBench Results
(Figure: VBench scores)

Beta Testing

You can apply to participate in the beta testing of Kandinsky Video Lite via the Telegram bot.

Citation
@misc{kandinsky2025,
    author = {Alexey Letunovskiy and Maria Kovaleva and Ivan Kirillov and Lev Novitskiy and
              Denis Koposov and Dmitrii Mikhailov and Anna Averchenkova and Andrey Shutkin and
              Julia Agafonova and Olga Kim and Anastasiia Kargapoltseva and Nikita Kiselev and
              Vladimir Arkhipkin and Vladimir Korviakov and Nikolai Gerasimenko and
              Denis Parkhomenko and Anna Dmitrienko and Anastasia Maltseva and
              Kirill Chernyshev and Ilia Vasiliev and Viacheslav Vasilev and
              Vladimir Polovnikov and Yury Kolabushin and Alexander Belykh and
              Mikhail Mamaev and Anastasia Aliaskina and Tatiana Nikulina and
              Polina Gavrilova and Denis Dimitrov},
    title = {Kandinsky 5.0: A family of diffusion models for Video \& Image generation},
    howpublished = {\url{https://github.com/ai-forever/Kandinsky-5}},
    year = 2025
}

@misc{mikhailov2025nablanablaneighborhoodadaptiveblocklevel,
      title={$\nabla$NABLA: Neighborhood Adaptive Block-Level Attention}, 
      author={Dmitrii Mikhailov and Aleksey Letunovskiy and Maria Kovaleva and Vladimir Arkhipkin
              and Vladimir Korviakov and Vladimir Polovnikov and Viacheslav Vasilev
              and Evelina Sidorova and Denis Dimitrov},
      year={2025},
      eprint={2507.13546},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.13546}, 
}