NVIDIA GeForce RTX 5060 Ti and Wan2.2 model

Hello,
With 4 NVIDIA GeForce RTX 5060 Ti graphics cards with 8GB VRAM and Wan2.2 model, how many seconds of video can I produce from a photo?

Thank you.


It’s best to assume that the length of a video generated in a single run will be within 3 to 5 seconds.


1. Direct answer (what you can realistically expect)

On a PC with 4 × NVIDIA GeForce RTX 5060 Ti (8 GB VRAM each) and Wan2.2 series models, running locally (ComfyUI / Diffusers):

  • For a single image→video clip at “normal” quality:

    • Plan on about 3–5 seconds of video per clip.
    • This holds for the main Wan2.2 video models (TI2V-5B, T2V-A14B, I2V-A14B) at reasonable resolutions (480p–720p).
  • Adding more GPUs (4 cards instead of 1) does not increase the length of a single clip.

    • Instead, you can render several 3–5 second clips in parallel (up to 4 at once).
    • Longer final videos (30–60+ seconds) are made by stitching many short clips in an editor.

So, per one generation “from a photo”, your realistic target is:

One Wan2.2 clip = about 3–5 seconds of video.
Four GPUs = more of these 3–5s clips at once, not one huge 20–30s clip.


2. Why the limit is 3–5 seconds per clip (model-level constraints)

Wan2.2 is designed around short clips. The key references all line up:

  • The official Wan2.2 TI2V-5B card states that this model generates 720p video at 24 fps, and benchmarks it as producing a 5-second 720p clip on a single consumer GPU. (modelscope.cn)

  • Wiro and other frontends describe the same 5B model as supporting both text-to-video and image-to-video at 720p/24fps. (wiro.ai)

  • Scenario’s official Wan2.2 help page says plainly:

    • “Models work best with clips under 5 seconds in length, with optimal results using 120 frames or fewer” at 480p or 720p. (Scenario)
  • InstaSD’s Wan2.2 guide reinforces this:

    • “Wan2.2 works best with clips no longer than 5 seconds. Frame count ≤ 120 works well; 24 fps for cinematic, 16 fps for tests.” (instasd.com)
  • Fal’s Wan 2.2 API guide benchmarks specifically 5-second, 720p @ 24 fps clips (TI2V-5B ≈ 9 minutes per 5 s on an RTX 4090; 14B models on 8-GPU clusters). (blog.fal.ai)

In other words:

  • The architecture and training regime of Wan2.2 are tuned around:

    • Resolution: 480p–720p
    • Frame rate: ~24 fps (16 fps for cheaper tests)
    • Frame count: up to ~120 frames
  • Duration is simply:

seconds = frame count ÷ fps

Examples:

  • 81 frames @ 24 fps ≈ 3.4 seconds
  • 120 frames @ 24 fps = 5.0 seconds
  • 80 frames @ 16 fps = 5.0 seconds
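The arithmetic can be sketched as a tiny helper (plain math, no Wan2.2 dependency; the function name is mine):

```python
def clip_seconds(frame_count: int, fps: int) -> float:
    """Clip duration in seconds: frame count divided by frames per second."""
    return frame_count / fps

# The three examples above:
assert round(clip_seconds(81, 24), 1) == 3.4   # about 3.4 s
assert clip_seconds(120, 24) == 5.0
assert clip_seconds(80, 16) == 5.0
```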

Going much beyond this (more frames) is possible but leaves the “recommended” zone and tends to degrade quality or stability.


3. What your GPUs actually change (and what they don’t)

You have 4 × 5060 Ti 8 GB. Important facts:

  1. Each GPU has its own 8 GB VRAM
    VRAM does not simply add up to “32 GB” for a single Wan2.2 job. Out of the box, ComfyUI / Diffusers run each workflow on one GPU at a time.

  2. Wan2.2 TI2V-5B is already optimized for 8 GB

    • The official ComfyUI Wan2.2 tutorial states:

      • “The Wan2.2 5B version should fit well on 8 GB VRAM with the ComfyUI native offloading.” (Comfy Docs)
    • The ModelScope card notes that TI2V-5B can generate a 5-second 720p video on a consumer GPU without special optimization, implying that 8–12 GB cards can run it with offload. (modelscope.cn)

  3. Chimolog’s Wan2.2 GPU benchmarks focus on 5-second clips and show:

    • Tests in ComfyUI using Wan2.2 360p / 480p / 720p with real workflows (including the popular Kijai/EasyWan22 pipeline). (Chimolog)

    • For 720p 5-second clips with that Kijai workflow, cards with ≤12 GB VRAM generally failed (OOM), and they conclude:

      • To stably generate 5 seconds at 720p in that specific workflow, you realistically need at least an RTX 5060 Ti 16 GB or similar. (Chimolog)

    This tells you:

    • 8 GB cards can absolutely run Wan2.2, especially at 480p or lighter settings.

    • For full 5s @ 720p using heavier Kijai/EasyWan22 workflows, 8 GB is borderline; you may need to:

      • Lower resolution, or
      • Shorten clips, or
      • Use the heavier Comfy “native” offload mode that prioritizes fitting over speed.
  4. Four GPUs = four lanes
    In practice, on your PC:

    • Per clip: Wan2.2 still behaves like it’s on an 8 GB card → 3–5 seconds.
    • Per machine: you can run four such clips in parallel (one per GPU), or run them sequentially to build longer videos.
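One simple way to treat the four cards as independent lanes is to pin each worker process to one card via `CUDA_VISIBLE_DEVICES` and deal clip jobs out round-robin; a minimal sketch (the helper name is mine, not part of any framework):

```python
import itertools

def assign_clips_to_gpus(num_clips: int, num_gpus: int = 4) -> dict[int, list[int]]:
    """Deal clip jobs round-robin across GPUs; each card is one independent lane."""
    lanes: dict[int, list[int]] = {gpu: [] for gpu in range(num_gpus)}
    for clip, gpu in zip(range(num_clips), itertools.cycle(range(num_gpus))):
        lanes[gpu].append(clip)
    return lanes

# 12 five-second clips (enough for a ~60 s stitched video) across 4 cards.
# Each worker process would then be launched with CUDA_VISIBLE_DEVICES=<gpu id>,
# so every ComfyUI / Diffusers instance sees exactly one 8 GB card.
lanes = assign_clips_to_gpus(12)
```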

Advanced multi-GPU sharding (FSDP / DeepSpeed, or ComfyUI-MultiGPU / DisTorch) can spread the Wan2.2 14B models across multiple GPUs + RAM, but this mainly lets you run bigger models (14B) or higher resolutions; it does not stretch a single clip much beyond the ~5-second temporal window.

4. How this breaks down by Wan2.2 variant on your 4× 5060 Ti

4.1 Wan2.2 TI2V-5B (the main “from a photo” model)

  • Specs: 5B dense model, 720p @ 24 fps, unified text+image→video. (filtrix.ai)
  • Designed to generate up to ~5-second clips at that resolution on consumer GPUs. (modelscope.cn)

On your 8 GB 5060 Ti:

  • Safe, everyday settings:

    • 480p, 16–24 fps, 49–81 frames → about 3–5 seconds.
    • This matches Chimolog’s analysis, which shows VRAM usage staying reasonable between 49 and 81 frames at “HD-ish” resolution and explicitly calls 3–5 seconds (49–81 frames) the recommended length. (Chimolog)
  • At 720p, with careful offload (Comfy native):

    • Still target ~3–5 seconds (e.g., 81–120 frames @ 24 fps), but:

      • Expect slower render times than on a 4090, and
      • Use aggressive VRAM-saving options (split VAE, model offload, etc.).

Per clip, you do not exceed ~5 seconds comfortably; for longer content you chain clips.

4.2 Wan2.2 T2V-A14B / I2V-A14B (MoE 14B series)

  • Specs from the 14B model cards:

    • 14B active parameters, MoE.
    • Video at 480p and 720p, also used as 5-second benchmark in official docs and Fal’s guide. (Hugging Face)

On your hardware:

  • Running 14B naively on a single 8 GB card is not realistic; it wants much more VRAM or multi-GPU. (Hugging Face)
  • With quantization (GGUF) + multi-GPU sharding (ComfyUI-MultiGPU, FSDP/Ulysses), you can make it fit and run at 480p.

The per-clip duration is still ~3–5 seconds:

  • The 14B series is benchmarked on 5-second 720p/480p clips, just like 5B. (blog.fal.ai)
  • The extra capacity (14B vs 5B, or multi-GPU) mainly buys quality, detail, or resolution, not longer per-clip duration.

4.3 Special hosted variants (e.g. Wan2.2-Fun-Control)

  • Some cloud-hosted Wan2.2 variants like Wan2.2-Fun-Control advertise up to 120s at 720p because they run on large multi-GPU servers and use specialized pipelines. (wavespeed.ai)
  • If you call those APIs from your PC, your local GPUs don’t limit clip length—the provider does.
  • For local ComfyUI / Diffusers on your 4× 5060 Ti, you should still think in terms of 3–5 seconds per clip.

5. Putting it all together for your question

Question:
“With 4 × RTX 5060 Ti (8 GB VRAM) and Wan2.2 series, how many seconds of video can I produce from a photo?”

Answer, in practical terms:

  1. Per single Wan2.2 clip (local, from one photo):

    • Realistic, recommended range: about 3–5 seconds of video

      • 480p or 720p,
      • 16–24 fps,
      • ~49–81 (or up to ~120) frames depending on fps and workflow.
  2. Per GPU:

    • Each 5060 Ti 8 GB behaves like one 3–5-second Wan2.2 lane.
    • The clip length is set by model design and frame count, not the number of GPUs.
  3. With 4 GPUs together:

    • You still get 3–5 seconds per clip, but you can generate:

      • 4 clips in parallel, or
      • Many clips one after another.
    • To make a longer video (for example, 60 seconds), you:

      • Generate ~12 clips of 5 seconds each (using all 4 GPUs to speed this up),
      • Then stitch those clips in an editor.
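The stitching step itself is ordinary video editing. As a sketch, you can count how many clips you need and generate the list file that ffmpeg’s concat demuxer consumes (helper names are mine; the command in the docstring is the standard concat-demuxer invocation):

```python
import math

def clips_needed(target_seconds: float, clip_len: float = 5.0) -> int:
    """Number of short clips required to cover the target duration."""
    return math.ceil(target_seconds / clip_len)

def ffmpeg_concat_list(clip_paths: list[str]) -> str:
    """Contents of the list file for ffmpeg's concat demuxer, used as:
    ffmpeg -f concat -safe 0 -i list.txt -c copy out.mp4"""
    return "".join(f"file '{p}'\n" for p in clip_paths)

n = clips_needed(60)                                    # 12 clips of 5 s
listing = ffmpeg_concat_list([f"clip_{i:02d}.mp4" for i in range(n)])
```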

So the clear, easy planning rule for your setup is:

Think: 3–5 seconds of video per Wan2.2 clip from a photo, per GPU.
Your 4 GPUs multiply how many of those 3–5s clips you can produce,
but they don’t extend a single clip beyond the model’s short-clip design.

Hello,
Thank you so much for your reply.
So how do services like Runway ML produce long videos?


Hm.


Runway gets “long videos” by chaining short generations, not by generating one giant continuous clip in a single pass.

How Runway makes longer videos in practice

1) Generate a short clip first

You generate an initial Gen-3 clip. Then Runway exposes an Extensions panel for that clip. (academy.runwayml.com)

2) Use “Extend” to add more time in chunks

Runway’s own docs say you can extend a completed generation up to three times. (Runway)
Runway’s changelog also states extensions add an additional 5 or 10 seconds and can reach up to 40 seconds total. (Runway)

Concrete caps Runway documents:

  • Gen-3 Alpha: extend up to 40 seconds total (via up to 3 extensions). (Runway)
  • Gen-3 Alpha Turbo: extend up to 34 seconds total, when the original clip is 10 seconds. (Runway)

So the product behavior is:

short clip → extend → extend → extend → one longer clip (up to the cap)

3) For longer than ~34–40 seconds, they stitch multiple clips

Runway’s own guidance on “longer videos and films” is essentially: generate multiple clips and assemble them. (ACM Digital Library)
That is how you get “minutes”: you make many shots, then edit them together.

4) “Expand Video” is not “Extend”

Runway’s Expand Video feature is about expanding the frame boundaries / aspect ratio, not making the timeline longer. (Runway)

What’s going on under the hood (simple explanation)

Runway does not fully publish the internal algorithm, but the behavior matches a widely used pattern in video generation:

A) Continuation conditioning

An “extension” typically works like:

  • Take the existing clip (often the last frames or a compressed representation).
  • Generate the next chunk so motion continues smoothly.
  • Optionally add a prompt for steering the continuation (Runway explicitly supports adding text for more control during extension). (academy.runwayml.com)
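In code terms, a continuation step of this kind reduces to seeding the next generation with the tail of the previous one. A minimal sketch with a stand-in generator (nothing here is Runway’s actual API; `generate_chunk` is a hypothetical callable for whatever image-to-video entry point your pipeline exposes):

```python
def continue_from(prev_clip_frames: list, generate_chunk, prompt: str) -> list:
    """Seed the next chunk with the final frame of the previous clip so
    motion continues; generate_chunk stands in for an image-to-video call."""
    seed_frame = prev_clip_frames[-1]
    return generate_chunk(image=seed_frame, prompt=prompt)

# Usage with a stub generator that just echoes its inputs:
frames = ["frame0", "frame1", "frame2"]
result = continue_from(frames, lambda image, prompt: [image, prompt], "keep panning")
```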

B) Sliding-window generation (common research approach)

In long-video diffusion research, a common approach is to split a long target video into overlapping short clips (a “sliding window”), generate them, then blend overlaps to reduce seams. (CVF Open Access)

Important nuance: this research description does not prove Runway uses the exact same technique. It explains why “extend in chunks” is the standard way to get longer outputs from models that are naturally short-clip generators.
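As an illustration of that sliding-window idea (window and overlap sizes are illustrative defaults, not any product’s actual parameters):

```python
def sliding_windows(total_frames: int, window: int = 120,
                    overlap: int = 16) -> list[tuple[int, int]]:
    """Split a long frame range into overlapping short windows.
    Each window is generated as its own short clip; the overlapping frames
    are then blended (e.g. a linear cross-fade) to hide the seams."""
    step = window - overlap
    starts = range(0, max(total_frames - overlap, 1), step)
    return [(s, min(s + window, total_frames)) for s in starts]

# A ~20 s target at 24 fps (480 frames) as 120-frame windows, 16-frame overlap:
wins = sliding_windows(480)
```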

Why cloud services feel more capable than local models

Even if the strategy is “short clips + extension,” a service like Runway can make it feel smooth because it can afford:

  • Big GPUs and lots of VRAM.
  • Multiple attempts behind the scenes.
  • Extra refinement passes and better post-processing.

But there are still explicit product caps on how long a single extended clip can be (34–40 seconds for Gen-3, per their docs). (Runway)

Summary

  • Runway makes longer videos by generating short clips and extending them up to a documented cap. (Runway)
  • For videos longer than that, it’s multiple clips (shots) + editing. (ACM Digital Library)
  • Technically, this matches common long-video methods like sliding-window / overlapping segment generation used in the research literature. (CVF Open Access)



Hello,
Thanks again.
I would be grateful if you could clarify something for me. How many seconds of video can I produce with an 8, 12, 16, 24, and 32 gigabyte graphics card, respectively? For example, if I want to produce a 10-second video, how much VRAM should my graphics card have?


VRAM is important, but it primarily affects the resolution and sophistication of the videos that can be generated. The length of the generated video is largely influenced by the architecture. Currently, the practical option is to stitch together multiple short output videos.


Wan2.2 does not scale “seconds per clip” linearly with VRAM. Wan2.2 is tuned for short clips, and VRAM mainly decides what resolution and workflow you can run reliably.

Wan2.2 guidance from multiple sources is consistent: best results under ~5 seconds, and ≤120 frames (with 24 fps default, 16 fps for faster testing). (Instasd)
Wan2.2 TI2V-5B specifically targets 720p @ 24 fps and is described as producing up to 5 seconds. (fal.ai)


What “seconds of video” means (two different meanings)

  1. One continuous generation (one run)
    This is what the model outputs in a single shot. Wan2.2 is usually 3–5 seconds per run for best stability. (Instasd)

  2. Final video length (after editing)
    A “10-second video” is usually made by stitching two 5-second clips (or three ~3–4 second clips). This is how most people work with short-clip generators.


Table: What each VRAM tier buys you with Wan2.2 (local generation)

Assume Wan2.2 TI2V-5B unless noted, using common ComfyUI-style workflows.

| GPU VRAM | What’s realistic and comfortable | Typical “good” seconds per single run | Clean 10 seconds in ONE run? | Practical way to make a 10-second video |
|---|---|---|---|---|
| 8 GB | 480p is the practical baseline. 720p may work only with very aggressive offload and specific workflows. Official ComfyUI docs say 5B can fit on 8 GB with native offloading. (Comfy Docs) | 3–5 s | Not recommended | Make 2 × 5 s at 480p, stitch them. |
| 12 GB | 480p is comfortable in most setups. 720p can still be fragile depending on workflow; Chimolog’s heavy 720p benchmark shows 12 GB and below failing there. (Chimolog) | 3–5 s | Not recommended | Make 2 × 5 s at 480p (or light 720p if it fits), stitch. |
| 16 GB | First tier that is reliably “720p-friendly” across heavier workflows; Chimolog’s 720p benchmark implies you need more than 12 GB to survive 720p in that setup. (Chimolog) | 3–5 s | Still not ideal (quality often degrades past 5 s) | Make 2 × 5 s at 720p, stitch. |
| 24 GB | 720p is easy: more headroom for heavier graphs, fewer compromises, more stable runs. The model’s sweet spot is still short clips. (Scenario) | 3–5 s | Sometimes possible but often worse quality | Still best: 2 × 5 s, stitch. |
| 32 GB | Same as 24 GB with even more breathing room; helps with larger model variants and complex pipelines, but does not turn Wan2.2 into a long-clip model. (Scenario) | 3–5 s | Possible to attempt but not guaranteed clean | Best: 2 × 5 s, stitch. |

Why the “seconds per run” column barely changes

Because Wan2.2 itself is optimized for short clips. Sources explicitly say it performs best under ~5 seconds and around ≤120 frames. (Instasd)
VRAM mostly decides whether you can do that at 480p vs 720p, and whether you must rely on offload. (Chimolog)


If you want a 10-second video, how much VRAM should you target?

Best-practice answer (recommended)

If “10 seconds” means a final edited video:

  • 10 seconds at 480p: 12 GB is a comfortable target. 8 GB can work with the right workflow and offload. (Chimolog)
  • 10 seconds at 720p: target 16 GB for reliable local work. This matches independent 720p benchmark behavior where ≤12 GB fails in a heavy 720p workflow. (Chimolog)

In both cases, you usually generate two 5-second clips and stitch.

“One continuous 10-second clip” answer (hard mode)

Wan2.2 guidance says it works best under 5 seconds, and community experiments suggest quality often falls off when pushing longer (even if it runs). (Instasd)
If you insist on trying a single-run 10 seconds, 24–32 GB gives you the best chance to fit the extra frames at decent resolution, but it still may not look good because the limitation becomes temporal coherence, not VRAM. (Reddit)


Short summary

  • Per Wan2.2 run: plan 3–5 seconds almost regardless of VRAM. (Instasd)

  • VRAM decides resolution and workflow stability, not clip length.

  • For a 10-second final video: best approach is 2 × 5 seconds stitched.


To generate longer videos than Wan2.2’s usual ~3–5 seconds, you have to change the generation approach or model, not just add VRAM.

  • Stop doing “one-shot” generation. Wan2.2 is typically used with ≤120 frames and “works best” under ~5 seconds. Pushing past that often causes repetition or drift. (Scenario)
  • Generate in chunks and continue. Make multiple short clips (for example 5s + 5s) and “continue” each next clip from the previous clip’s last frames, then stitch in editing. This is the standard practical way to get 10–60+ seconds from short-clip models. (Scenario)
  • Use overlap instead of hard cuts. Generate overlapping segments (a sliding window) and blend the overlap to reduce visible seams. FreeNoise describes this explicitly as dividing clips into overlapped windows. (OpenReview)
  • Use long-video inference methods. Techniques like FreeNoise (noise rescheduling + windowed temporal attention) and LongDiff (training-free components to address long-video failure modes) are specifically designed to extend short-video diffusion models to longer videos. (arXiv)
  • Or change the model or training. Many video diffusion models are trained on a limited number of frames, which is why they struggle to stay consistent for long durations. Fixing that at the root means training (or re-training) for longer temporal windows. (arXiv)

Hello,
Thanks again.
1- If I want to produce an 8-10 second video in 720p or 1080p quality, then what model is suitable and how much VRAM should my graphics card have?

2- If I want to produce an 8-10 second video in 720p or 1080p with a graphics card with 12 or 16 GB of VRAM, then what model is suitable?

3- How many seconds of video can a model produce?


Hmm, generating long videos locally using consumer GPUs we can normally buy isn’t really that easy yet.
I think it’s cheaper to just switch between commercial services as needed…
Unless you have specific requirements, of course.


1) 8–10 seconds at 720p or 1080p: what OSS model, what VRAM?

If you mean “one clip, 8–10 seconds, looks like 720p/1080p”:

Best practical OSS pick today: CogVideoX1.5-5B-I2V

  • It explicitly supports 5 or 10 seconds output. (Hugging Face)
  • It runs up to 1360×768 (often called “768p-class”; it is close to 720p). (Hugging Face)
  • VRAM guidance: the model card lists ~9GB minimum for single-GPU BF16 inference with optimizations. So 12GB works, 16GB is more comfortable (less swapping, fewer OOM surprises). (Hugging Face)
  • If you turn off optimizations, VRAM needs can jump a lot (the model card warns VRAM can increase heavily and mentions optimizations like CPU offload, VAE slicing, VAE tiling). (Hugging Face)

What about “true 1080p generation”?
Most OSS video models still do not natively generate 1920×1080 8–10 second clips on consumer VRAM in a clean, repeatable way. The common OSS approach is:

  1. generate at the model’s native resolution (often ~768p-class), then
  2. upscale to 1080p (spatial upscaling) and optionally smooth motion (temporal upscaling / interpolation).
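A small sketch of the geometry in step 2: to go from a 768p-class native frame to 1080p without distortion, you scale uniformly to cover the target and center-crop the few leftover pixels. The helper is illustrative arithmetic only; the actual upscaling is done by a spatial upscaler model or ffmpeg, not by this function:

```python
def cover_and_crop(src_w: int, src_h: int,
                   dst_w: int = 1920, dst_h: int = 1080) -> tuple[int, int]:
    """Smallest uniform scaling of the source frame that fully covers the
    target; the excess pixels are center-cropped afterwards."""
    scale = max(dst_w / src_w, dst_h / src_h)
    return round(src_w * scale), round(src_h * scale)

# CogVideoX1.5's 1360x768 native frame scaled to cover 1920x1080:
scaled = cover_and_crop(1360, 768)   # (1920, 1084): crop 4 rows to reach 1080
```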

If you want an OSS ecosystem that explicitly supports this “generate then upscale” workflow, LTX-Video is relevant because it has official upscalers (spatial and temporal) and also supports video extension workflows. (GitHub)

  • LTX-Video also has multiple variants (13B, 2B, distilled, FP8) listed on its model card. (Hugging Face)
  • It has community and official notes that it can run on very low VRAM only at small settings (example: 512×512, 50 frames with tricks). (Hugging Face)
  • For 720p/1080p-looking results, you generally step up to larger variants and rely on upscalers. That usually means more VRAM is better, but LTX does not give a single universal “X GB required” number in the primary docs.

A “researchy but heavy” OSS option for 10 seconds: Pyramid Flow

  • It explicitly targets up to 10 seconds at 768p and 24 FPS, and supports image-to-video. (GitHub)
  • But the authors state large VRAM needs for the 768p version (around 40GB). (Hugging Face)
  • They do provide CPU offloading modes to run under <12GB or even <8GB, but it will be much slower. (GitHub)

So, for “8–10 seconds at 720p/1080p-quality” on a normal desktop GPU, CogVideoX1.5-5B-I2V + upscaling is currently the cleanest OSS answer. (Hugging Face)


2) 8–10 seconds at 720p/1080p on 12GB or 16GB VRAM: what OSS model?

Best fit: CogVideoX1.5-5B-I2V

  • Designed for 5 or 10 seconds. (Hugging Face)
  • Single GPU BF16 minimum ~9GB with optimizations. That places it squarely in 12GB and 16GB territory. (Hugging Face)
  • If you are tight on VRAM, use the memory-saving options the model card calls out (sequential CPU offload, VAE slicing, VAE tiling) and consider INT8 quantization. Expect slower speed if you lean hard on offload or INT8. (Hugging Face)

Also plausible (but more “workflow-dependent”): LTX-Video (2B / distilled / FP8)

  • LTX-Video publishes multiple lighter variants (2B, distilled, FP8) and also points to quantized and caching acceleration projects (example: TeaCache, 8-bit model integrations) which can reduce memory or speed up inference. (GitHub)
  • For 12–16GB, you typically use the smaller or quantized variants and rely on its upscalers for “1080p-looking” output. (Hugging Face)

Not a good fit for 12–16GB if you want native 10s at 768p: Pyramid Flow 768p (unless you accept heavy offload and slow runs). (Hugging Face)


3) “How many seconds of video can a model produce?”

There is no single number. It depends on:

  • the model’s trained context (how many frames it was built to handle),
  • the FPS it outputs at,
  • and the pipeline limits (what the implementation supports without falling apart).

Concrete examples from OSS model docs:

  • CogVideoX-5B: 6 seconds, 8 FPS, 720×480. (Hugging Face)
  • CogVideoX1.5-5B-I2V: 5 or 10 seconds, 16 FPS, up to 1360×768. (Hugging Face)
  • Pyramid Flow (768p checkpoint): up to 10 seconds at 24 FPS and 768p. (GitHub)
  • LTX-Video: the project announces support for long shot generation up to 60 seconds (model/pipeline dependent). (GitHub)
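Those documented caps can be kept as a small lookup when shortlisting a model for a target duration (the numbers are copied from the model docs cited above, not independent measurements; the helper name is mine):

```python
# Documented per-clip caps in seconds, as quoted above from each project's docs.
DOCUMENTED_MAX_SECONDS = {
    "CogVideoX-5B": 6,
    "CogVideoX1.5-5B-I2V": 10,
    "Pyramid Flow (768p)": 10,
    "LTX-Video (long shot)": 60,
}

def models_for(target_seconds: float) -> list[str]:
    """Models whose documented cap covers the target duration in one run."""
    return sorted(m for m, s in DOCUMENTED_MAX_SECONDS.items()
                  if s >= target_seconds)
```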

Important practical point: even if a repo claims “supports longer,” quality and consistency often degrade as you push duration. The standard “pro” approach is still: generate shorter chunks, then extend or stitch.


Two quick “gotchas” people hit

Multi-GPU is not the same as “one big VRAM pool.”

  • Example: CogVideoX1.5’s model card lists multi-GPU inference memory as ~24GB per GPU (diffusers). That means 4×8GB does not magically become “one 32GB card” for that model. (Hugging Face)

1080p usually means upscaling in OSS today.

  • “Native 1080p, 10 seconds, one shot” is still mostly a closed-model feature. In OSS, you typically generate at 768p-class then upscale.



Summary

  • 12–16GB VRAM + 8–10 seconds: pick CogVideoX1.5-5B-I2V, then upscale to 1080p if needed. (Hugging Face)
  • “True” 1080p in OSS: usually means generate lower-res then upscale, not native 1920×1080. (GitHub)
  • Max seconds is model-specific: examples range from 6s (CogVideoX-5B) to 10s (CogVideoX1.5, Pyramid Flow) and even longer in some pipelines (LTX-Video long shot). (Hugging Face)