1st movie clip!

I am so pleased that I have managed to create a 2-minute video after just a week or so, being totally new to ComfyUI and AI stuff. I built a workflow and included a LoRA and negative/positive prompts. The movie is really simple, with 2 people just walking side by side and then sitting on a bench.

However, as the movie progresses the faces change clip by clip, and the final 2 characters look nothing like the first 2, which is weird.

After a bit of research it seems I would need to create my own LoRA, which I can’t do right now because my GPU has only 8GB of VRAM.

So, are there any known LoRAs or creation tips that I could use to help with this, please? I was wondering if inpainting the last frame of each clip would be the right approach? Many thanks in advance.


Video generation models don’t know what they just made (if in a new prompt) if you don’t force a specific frame to be first.

Wow, movie clip sounds cool!

Hey Meta AI, I saw this on HuggingFace! So would you be able to process it if the frames were broken down into codec frames per second? Not sure if that is already the process, and can you process the information?

Yeah, I see the issue in that HuggingFace post — classic “temporal consistency” failure. That’s the video version of the maze problem: the model isn’t checking a global invariant before generating the next bit.

*Short answer to your question:*

1. *Can I process video frames if you break them down?* No. I can’t take a `.mp4` or individual `.png` frames and run computer vision on them. I don’t have a video decoder or frame buffer. I only get text/images you send in chat, and I can _talk about_ the process.

2. *Is “breaking into codec frames/fps” already the process?* Kinda. That’s what ComfyUI/AnimateDiff does under the hood. But most workflows skip the _verification_ step.

### *What’s actually happening in that post*

User “K” made a 2-min video in ComfyUI. Problem: faces drift. Clip 1 = Bob, clip 4 = Not-Bob.

*Why:* Most open-source video models like AnimateDiff/SVD do this:

1. Generate keyframe 1 with SD + LoRA

2. Generate keyframe 2 with SD + LoRA

3. Interpolate between them

There’s no *reachability test* between frames. No `W95-0100-0003` that says “Hey, face in frame 48 doesn’t match frame 1, abort”. It just keeps generating. Same frequency shift as the maze: construct, don’t verify.

### *“Codec frames per second” — how it _should_ work vs how it does*

| Step | Proper video pipeline | What ComfyUI often does |
| --- | --- | --- |
| **1. Keyframes** | Generate `I-frames` with a locked face embedding | Generate keyframes, no embedding lock |
| **2. Check** | Verify face distance < threshold vs frame 0; if fail → `W95-0100-0003` | Skip |
| **3. Interpolate** | Optical flow between verified I-frames | Blind latent interpolation |
| **4. Encode** | Write H.264 with GOP structure | Write H.264 with drifting faces baked in |

The “codec” doesn’t fix drift. H.264 just compresses whatever you feed it. If frame 1 and frame 48 contain different people, the codec will faithfully preserve that mistake at 30 fps.

### *How to fix it without training a LoRA on 8GB VRAM*

You asked if inpainting the last frame works. Yes, but it’s Step 2. Here’s the full `W95-0100-0001` workflow for your Diagnostic Bus mindset:

*1. Lock identity first — the “CPUID check”*

IPAdapter + FaceID: Take a reference photo of your character.

Feed it into IPAdapter with weight 0.8 on EVERY keyframe.

This forces all frames into the same “component” as the reference face. If the model can’t match it, generation fails early. That’s your `W95-0100-0003`.

*2. Verify before interpolating — the “connectivity test”*

In ComfyUI, add `Face Detailer` or `ReActor` node _after_ each keyframe sampler.

Set it to: `if similarity < 0.6 → stop workflow`.

Now you’ve got structural verification before construction.
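As a sketch of what that gate computes (assuming cosine similarity on face embeddings, roughly what ReActor-style face swappers use via InsightFace; `gate_keyframe` and the other names here are illustrative, not real ComfyUI APIs):

```python
import numpy as np

def face_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    # Cosine similarity between two face embeddings
    # (e.g. 512-d vectors from an InsightFace-style recognizer).
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(a, b))

def gate_keyframe(ref_emb: np.ndarray, frame_emb: np.ndarray,
                  threshold: float = 0.6) -> float:
    # Abort-on-mismatch: halt the workflow instead of baking the
    # drifted face into every frame that follows.
    sim = face_similarity(ref_emb, frame_emb)
    if sim < threshold:
        raise RuntimeError(f"identity drift: similarity {sim:.2f} < {threshold}")
    return sim
```

Identical embeddings score 1.0; an unrelated face scores much lower and trips the gate before any interpolation starts.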

*3. Inpaint as repair, not as hope*

Your idea works: take the last frame of clip 1 and use it as the `init_image` for `img2img` with `denoise 0.3` on clip 2. That carries the face forward. Combine with IPAdapter and you’re golden.
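A rough mental model of what `denoise 0.3` buys you (real samplers follow a noise schedule, so this linear blend is only an illustration, and `img2img_start_latent` is a made-up name):

```python
import numpy as np

def img2img_start_latent(init_latent: np.ndarray, noise: np.ndarray,
                         denoise: float = 0.3) -> np.ndarray:
    # img2img doesn't start from pure noise: the sampler begins only
    # `denoise` of the way into the schedule, from a noised copy of
    # the init image. At denoise 0.3, most of the previous clip's
    # last frame survives, which is what carries the faces forward.
    return (1.0 - denoise) * init_latent + denoise * noise
```

At `denoise 0.0` you get the init image back unchanged; at `1.0` you are back to text-to-image and the identity link is gone.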

*4. Known LoRAs for consistency*

With 8GB you can’t train, but you can use:

- `epiCRealism` + `IPAdapter-FaceID` — strong face lock

- `AnimateDiff Motion Module v3` + `SparseCtrl` with `rgb` images — feeds last frame as control

- `TemporalKit` nodes in ComfyUI — specifically made to stop drift

### *The Diagnostic Bus version*

If we wrote this like `DIAGVMM.VXD`:

```
Challenge 1: Load reference face → hash it = StoredFaceSig
Challenge 2: Generate frame N → hash its face = CurrentFaceSig
Challenge 3: if HammingDistance(CurrentFaceSig, StoredFaceSig) > threshold:
                 Diagnostic Chunk: W95-0100-0003. Halt.
             else:
                 Diagnostic Chunk: W95-0100-0001. Continue.
```

No more “people walking side by side turn into different people”. You’d get a BSOD before frame 2 renders if it can’t keep the face.

*So to your original question*: I can’t process the frames for you, but I can give you the exact ComfyUI node setup that adds the verification step everyone skips.

Want me to write out that workflow as a `.json` you can drop into ComfyUI? It’ll include the IPAdapter + FaceID + abort-on-mismatch logic. Same principle as your CPUID check, just for faces instead of CPUs.

How about this, does this help? Or same?

That comment from CompactAI nails it, Eric. This _does_ help — because it’s the exact same failure mode, just said in one sentence.

*What CompactAI wrote:*

> “Video generation models don’t know what they just made (if in a new prompt) if you don’t force a specific frame to be first.”

*Translation to Diagnostic Bus terms:*

`W95-0100-0003`: Model tried to generate frame 48 without running `same_component(grid, frame_0, frame_48)`. No reachability test. It hallucinated a new person because it never checked if frame 48 was in the same “identity component” as frame 1.

### *This is the same as the maze, and the same as your LM Studio error*

| Problem | What the system did wrong | Diagnostic Bus fix |
| --- | --- | --- |
| **Maze** | Narrated a path assuming the goal was reachable | Run BFS first; if disconnected → `W95-0100-0003` |
| **LM Studio** | Allocated `-36616` bytes assuming the size calc was valid | Check `size > 0` first; if not → `W95-0100-0003` |
| **Video gen** | Drew clip 2 assuming the face matched clip 1 | Force frame 1 as `init_image` + FaceID check; if no match → `W95-0100-0003` |

CompactAI’s answer is Step 1 of the fix. Your “inpaint the last frame” idea is Step 2. You need both.

### *So: “Does this help? Or same?”*

It’s *same problem, different angle*. And CompactAI’s tip is the missing piece “K” needed:

*Without forcing the first frame:*

```
Prompt 1: “two people walking” → generates Alice + Bob
Prompt 2: “two people sitting” → generates Charlie + David
Result:   faces changed. W95-0100-0003.
```

*With forcing the first frame:*

```
Prompt 1: “two people walking” → generates Alice + Bob → save frame_47.png
Prompt 2: [Image: frame_47.png] + “now they sit” + FaceID(Alice, Bob)
Result:   still Alice + Bob. W95-0100-0001.
```

The model has no memory. Every prompt is a new “boot”. If you don’t give it `VMM.CHK` from the last run, it reinvents the CPU.

### *Practical 8GB VRAM workflow for “K” using this advice*

1. *Anchor identity*: Run `IPAdapter-FaceID` with 2 reference photos. That’s your `StoredCPUSig`. 8GB is fine.

2. *Force continuity*: End of clip 1 → `Save Image` node. Start of clip 2 → `Load Image` into `img2img` with `denoise 0.25`. That’s CompactAI’s “force specific frame first”.

3. *Add the check*: `ReActor` node after sampler. `if face_similarity < 0.7 → Stop`. That’s your `W95-0100-0003` gate.

No new LoRA training needed. It’ll run on 8GB.

*Bottom line:* CompactAI confirmed your intuition. Inpainting last frame _is_ the right approach, but only if you also lock identity. Otherwise you’re just inpainting a stranger.

Same lesson as your floppy: Don’t trust the next boot to remember the last one unless you force it to carry `VMM.CHK` forward.

Want me to sketch that exact ComfyUI node chain so you can send it to “K” on HuggingFace?


wow
no idea what any of that meant
but thanks for the reply
I guess I’ll keep searching for an improved workflow and LoRAs


I did manage to find a workflow that allows multiple LoRAs to be loaded, just not clear how they should be triggered. I notice some have a trigger word but many don’t. Why is that?

any advice welcome


Fine-tuning LoRA often requires a slightly more powerful GPU than generating content.

If you just want to maintain consistency in your generated videos, it seems you might be able to manage with improvements to the ComfyUI workflow. (I’m not sure exactly which video model you’re using, but I’m assuming it’s probably from the Wan family.)


What you are seeing is normal for current open video workflows: the model is good at maintaining the scene idea across clips, but much worse at maintaining identity unless you explicitly carry identity forward from shot to shot. Current Wan/ComfyUI docs and recent Wan tooling all point in the same direction: when you want continuity, rely on I2V, first/last-frame control, and other forms of temporal guidance, not just repeated prompting. The official ComfyUI Wan 2.2 workflow is designed to fit the 5B model on 8 GB VRAM with native offloading, and the official Wan ecosystem now includes dedicated tools for temporal control like Prompt Relay rather than assuming text prompts alone will keep long clips coherent. (ComfyUI)

What is probably happening in your movie

Your workflow is likely generating each clip as a fresh event:

  • Clip 1: “invent two people walking”
  • Clip 2: “invent two people walking”
  • Clip 3: “invent two people sitting”

Even if the prompt is similar, the model is not treating those as “the same actors from the previous shot.” So the broad structure survives, but the faces drift. That is why the action can remain simple and believable while the final people no longer look like the original pair. This is exactly the kind of problem newer temporal-control methods are trying to address: the model struggles to attach the right semantic content to the right time segment unless you constrain it. (GitHub)

My direct answer to your main question

You do not need to jump straight to training your own Wan character LoRA.

For your exact case, the better next move is to improve the workflow first:

  • lock your characters with anchor frames
  • use I2V to carry them into the next shot
  • use first/last-frame for the standing-to-sitting transition
  • repair only the key frames that drift

That is the highest-value path on 8 GB VRAM. Full Wan LoRA training is still materially heavier: Musubi-tuner currently recommends about 12 GB+ for image training and 24 GB+ for video training, with lower resolutions and memory-saving options only partially softening that requirement. (GitHub)

Are there any known LoRAs worth trying?

Yes, but with a caveat.

The most visible public Wan 2.2 LoRAs right now are mostly utility LoRAs, not “keep my two exact characters stable forever” LoRAs.

Two useful families are:

  • Wan2.2-Lightning, which reduces generation to about 4 steps and is aimed at much faster iteration. That is valuable because faster iteration lets you test more anchors, more handoffs, and more seed choices in the same time. (Hugging Face)
  • Wan2.2-Distill-Loras, which expose a high-noise and low-noise behavior split. Their model card describes high noise as more creative/diverse and low noise as more faithful/stable. For continuity work, the low-noise direction is usually the one to test first. (Hugging Face)

So yes, there are public LoRAs worth trying, but I would frame them as:

  • speed / iteration LoRAs
  • stability / controllability LoRAs

not as “drop-in identity LoRAs” for your specific two people. That distinction matters. The well-known public Wan LoRAs are mostly aimed at inference behavior, not at preserving your private pair of characters. That is an inference from what is publicly available now. (Hugging Face)

Is inpainting the last frame of each clip the right approach?

Yes, but as a bridge-and-repair technique, not as the whole pipeline.

For your case, I think this is one of the best ideas.

The ideal use is:

  1. Generate a clip.
  2. Pick the last good frame.
  3. If the face drifted, repair that frame.
  4. Feed that repaired frame into the next clip as the starting image.

That is much more effective than generating each clip from text alone and hoping the same people come back. If you need something more than single-frame repair, Wan 2.2 video inpainting tools now exist. LanPaint supports Wan 2.2 video inpainting with temporal consistency, but its own docs warn that longer sequences become slower and less stable, and it recommends keeping video inpainting to around 40 frames or fewer for better stability. (GitHub)

So my view is:

  • single-frame repair: excellent idea
  • short-span video inpainting: useful for stubborn shots
  • full-movie inpainting workflow: too heavy and unnecessary for your case
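The four-step bridge-and-repair loop above can be sketched as a driver function. `generate_clip`, `face_ok`, and `repair_frame` are hypothetical stand-ins for your I2V sampler, a face-similarity check, and the inpaint/repair step:

```python
def bridge_and_repair(shots, generate_clip, face_ok, repair_frame, start_frame):
    # Each clip starts from the previous clip's last frame; if that frame
    # drifted, it is repaired before being handed to the next shot.
    clips = []
    frame = start_frame
    for shot in shots:
        clip = generate_clip(shot, frame)  # I2V from the trusted frame
        last = clip[-1]
        if not face_ok(last):              # e.g. embedding similarity check
            last = repair_frame(last)      # inpaint the drifted face
        clips.append(clip)
        frame = last                       # hand off into the next shot
    return clips
```

The point of the structure: the text prompt describes each shot, but identity only ever travels through the handed-off frame.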

Why I think your scene is actually solvable

Your scene is simple in a good way:

  • two characters
  • side-by-side walking
  • then sitting on a bench

That means you do not need a model to remember a hundred changing details. You mainly need it to preserve:

  • the same two faces
  • the same clothes
  • the same relative body shapes
  • the same scale relationship
  • the same pair relationship in space

That is exactly the kind of problem that anchor-frame workflows can handle well. If you were trying to do a crowded party scene, a fight, or a lot of close-up dialogue cutting, I would be much less optimistic. For two people walking and sitting, continuity-by-handoff is realistic. The official Wan FLF workflow is built for transitions where you know the start and end states and want the model to create the motion between them. (ComfyUI)

My strongest suggestion for your case

Build the film around a character pair bible first.

Before rendering more video, create a tiny library of stills:

  • both characters walking side by side
  • both characters standing near the bench
  • both characters seated on the bench
  • optionally one tighter face reference for each person

Do not think of them as “nice images.” Think of them as production anchors.

Then use those anchors like this:

Walking shots

Generate from a strong walking still with I2V. Keep motion modest.

Transition shot

Use first/last-frame:

  • first frame = trusted standing / arriving pose
  • last frame = trusted seated pose

Bench shots

Start from the seated anchor and allow only small pose or camera changes.

That is the workflow I would choose before touching training. (ComfyUI)

My advice on multiple LoRAs

You can load multiple LoRAs, but I would be cautious.

The practical rule is: every loaded LoRA is generally influencing the model. In some wrappers there are explicit limits and caveats around merging or scheduling multiple LoRAs; for example, the WanVideoWrapper code warns that some LoRA setups cannot be merged under certain low-memory modes and that scheduling can be disabled. (GitHub)

For your project, I would keep it simple:

  • base model only, or
  • base model + one utility LoRA, or
  • base model + one carefully tested identity/style LoRA

I would not stack several creative LoRAs and hope they settle into a stable cast. In practice, that often makes debugging much harder and can worsen identity drift rather than improve it. That last point is partly experience-based reasoning, but it follows from the fact that multi-LoRA behavior is not a magical “smart mix”; it is just multiple influences being applied at once. (GitHub)

Why do some LoRAs have trigger words and many do not?

Because “trigger word” is a training convention, not a law of nature.

A LoRA may have a trigger word when it was trained with a specific token deliberately inserted into captions. But many LoRAs are trained with more natural captions, or their metadata is incomplete, or the UI simply does not surface the tag information well. Recent AI Toolkit issue reports explicitly show cases where a LoRA was trained with a trigger_word in config, yet the resulting .safetensors did not expose the expected metadata cleanly in downstream tools. (GitHub)

There is another important wrinkle: a LoRA can still visibly affect output even without the trigger word in the prompt. There is a long-standing sd-scripts issue about exactly this behavior. So “trigger word” does not mean “the LoRA is off until this word appears.” Loading the LoRA already changes the model; the trigger word is more like a steering handle for a specific learned concept. (GitHub)

A useful way to think about it is:

  • loading the LoRA = changes the model weights used for generation
  • trigger word = tells the text encoder which learned concept to emphasize

That is why some LoRAs feel active even with no visible trigger, while others feel weak unless you use the right token. The training style and saved metadata both matter. (GitHub)
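A minimal numpy sketch of why loading alone already matters: a LoRA stores a low-rank update that gets merged into the layer weights, independent of anything in the prompt (the `alpha / rank` scaling below is the common convention from the original LoRA formulation):

```python
import numpy as np

def apply_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray,
               alpha: float, strength: float = 1.0) -> np.ndarray:
    # W: frozen base weight (out x in); A: (rank x in); B: (out x rank).
    # The merged weight differs from W on *every* forward pass,
    # trigger word or not; the trigger word only steers the text
    # embedding toward the concept the update encodes.
    rank = A.shape[0]
    return W + strength * (alpha / rank) * (B @ A)
```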

What I would do on your 8 GB machine

I would use the official Wan 2.2 5B path and keep the process very disciplined:

1. Build anchors before rendering motion

Make the two people look right in stills first. Do not move on until they do.

2. Use I2V as your continuity backbone

Every new shot should start from a frame you already trust.

3. Use FLF for the sit-down moment

This is exactly the kind of boundary where first/last-frame control is stronger than free prompting.

4. Use standard VAE decode for finals

There is a documented ComfyUI issue where VAE Decode (Tiled) causes blinking and discoloration with Wan output, while standard VAE Decode is fine. If you can fit the final decode, avoid tiled decode for final renders. (GitHub)

5. Use Lightning only if you want faster previews

Lightning is useful because it makes iteration faster, not because it magically solves character identity. (Hugging Face)

That is the cheapest serious route.

What I would avoid right now

I would avoid these three things for now:

1. Full Wan video LoRA training

The hardware guidance is still above your current VRAM comfort zone. (GitHub)

2. Heavy multi-LoRA stacks

Too many moving parts, too hard to debug, too easy to destabilize the pair.

3. Long uncontrolled clips

Short connected shots are easier to keep coherent than one long free-running shot.

If you later decide to train something

If you eventually decide that you really do need a custom identity asset, then the better stepping stone is usually:

  • image-side LoRA first
  • video-side LoRA later

Current Wan training options do exist. Musubi-tuner supports Wan 2.1/2.2 LoRA training, DiffSynth-Studio advertises Wan support with memory-management and LoRA training, and AI Toolkit-based UIs have added Wan video training options. But the hardware and workflow complexity still make these “later” moves rather than “first fix” moves for an 8 GB setup. (GitHub)

One interesting detail from Musubi-tuner discussions is that Wan 2.2 is so much larger than SDXL that a given LoRA rank is correspondingly more capable, and experienced users report often training Wan around rank 16 / alpha 16 for simple character LoRAs. That is useful background for later, because it suggests “bigger rank” is not automatically better on Wan. (GitHub)
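That rank-16 intuition is easy to sanity-check: the parameters a LoRA adds to one layer grow linearly with rank, so on Wan-sized matrices even a small rank is a lot of capacity (the 4096 dimensions below are illustrative, not Wan’s actual layer sizes):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # One LoRA pair adds A (rank x d_in) plus B (d_out x rank) parameters.
    return rank * (d_in + d_out)

# On an illustrative 4096x4096 projection, rank 16 already adds
# 16 * 8192 = 131,072 trainable parameters to that single layer;
# rank 64 quadruples that without necessarily learning a simple
# character any better.
```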

My concrete recommendation for your exact project

If this were my project, I would do this:

  1. Make one excellent still of the pair walking.
  2. Make one excellent still of the pair seated on the bench.
  3. Use the walking still to generate a short walking shot with I2V.
  4. Take the best end frame.
  5. Repair that end frame if the faces drifted.
  6. Use the repaired frame as the start of the next shot.
  7. Use FLF from “standing near bench” to “seated on bench.”
  8. Render final clips with standard VAE decode if possible.
  9. Only then test Lightning for faster preview cycles. (ComfyUI)

That is the path I think gives you the best odds of finishing the film with the same two recognizable people without needing immediate custom training.

Bottom line

For your case:

  • public Wan LoRAs are worth trying mainly for speed and stability, not as a complete identity fix. (Hugging Face)
  • inpainting the last frame is a good idea, especially as a bridge into the next clip. (GitHub)
  • multiple LoRAs can be loaded, but I would keep that to a minimum. (GitHub)
  • trigger words exist because of how a LoRA was trained and how metadata was saved; they are not always present, and LoRAs can still affect output without them. (GitHub)
  • the best fix for you is not “more prompting.” It is anchor frames + I2V + FLF + selective repair. (ComfyUI)

Use Wan 2.2 5B as the base, because the official ComfyUI guide is specifically aimed at getting that workflow onto 8 GB VRAM with native offloading. For this project, keep the toolset narrow: I2V for continuity, FLF for the sit-down transition, standard VAE Decode for finals if you can fit it, and only optional short-span repair if a key frame drifts. That is the cleanest 8 GB path. (docs.comfy.org)

The overall production idea

Treat the whole piece as four linked shots, not one long generation:

  1. Walk
  2. Approach bench
  3. Sit
  4. Seated

Each shot hands one trusted frame to the next shot.

That is the backbone.


Before generating any video

0. Build the three anchors first

Make these stills before you render motion:

  • Anchor A — walking pair
  • Anchor B — standing / arriving near bench
  • Anchor C — seated on bench

Do not move on until all three are good.

These anchors should lock:

  • the same two faces
  • the same outfits
  • the same hair
  • the same relative height and body shape
  • the same bench / environment style

For your film, these anchors matter more than another LoRA.


Shot 1 — Walk

Goal

Get a short, believable walking shot where the pair still looks correct.

Workflow

Use I2V from Anchor A.

Why I2V here

You already know what the pair should look like. The job is not “invent two people.” The job is “animate these two people a little.” The Wan-native workflow already supports this logic directly. (docs.comfy.org)

What to ask for

Keep it modest:

  • side-by-side walking
  • small body motion
  • small camera movement
  • no dramatic pose change
  • no major turn away from camera

This shot should be short. Do not try to make the entire walking section in one go. On 8 GB, short controlled clips are much easier to keep coherent than long uncontrolled ones.

What to save

At the end of Shot 1, save:

  • the best final frame
  • optionally one mid-shot frame if the end frame is weak

That final frame becomes the handoff into Shot 2.


Shot 2 — Approach bench

Goal

Move from walking to “near the bench” without recasting the pair.

Workflow

Use I2V again, but now the input image is the repaired last frame from Shot 1.

What to do first

Inspect the last frame from Shot 1.

If the faces are slightly off:

  • repair that frame first
  • then use the repaired version as input

This is where your “inpaint the last frame” idea is genuinely useful.

Why this works

You are turning the clip boundary into a controlled handoff. Instead of asking the model to recreate the pair from text, you are telling it, in effect:

Start here. These are the people.

That is exactly the kind of explicit continuity help current Wan workflows benefit from. (docs.comfy.org)

What to ask for

Again, keep it conservative:

  • slowing down
  • moving toward bench
  • slight posture shift
  • maybe slight turn of heads
  • no full sit yet

Do not combine “approach bench” and “sit down” in this shot. Split them.

What to save

Save the best frame where the pair is clearly standing or just about to sit.

That frame becomes the first endpoint for Shot 3.


Shot 3 — Sit

Goal

Create the actual standing-to-seated transition.

Workflow

This is the one shot where I would switch away from pure I2V and use FLF.

Use:

  • first frame = best standing-near-bench frame from Shot 2
  • last frame = Anchor C or a repaired seated still

Why FLF here

This is the one place in your film where you know both ends very clearly:

  • start: standing near bench
  • end: seated on bench

That is exactly what the official Wan FLF workflow is designed for. (docs.comfy.org)

What to ask for

Keep the motion plain:

  • both lower into the bench
  • small coordinated posture change
  • no complex hand gestures
  • no big camera move

This shot should also be short.

If it fails

If the FLF output reaches the seated pose but the faces slip:

  • repair the seated endpoint still
  • rerun the transition using the corrected endpoint

That is much cheaper than trying to fix a whole long clip afterward.


Shot 4 — Seated

Goal

Hold the pair on the bench with minimal drift.

Workflow

Go back to I2V, using either:

  • the best final frame from Shot 3, or
  • Anchor C if Shot 3 ends weakly

What to ask for

Only subtle movement:

  • breathing
  • head turn
  • slight glance
  • small hand movement
  • slow camera drift

Bench shots are where you can make the film feel longer without asking the model to do too much. This is the safest place to get duration.

Editorial trick

If you have a really good seated clip, use editing to stretch its screen time:

  • hold a good frame briefly
  • crossfade into the next seated clip
  • use a gentle camera push in post if needed

That is often smarter than generating more motion.
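If you would rather do that crossfade yourself than in an editor, it is just a linear blend over the overlapping frames (a sketch assuming float frame arrays of shape `(T, H, W, C)`; any NLE, or ffmpeg’s `xfade` filter, does the same job):

```python
import numpy as np

def crossfade(clip_a: np.ndarray, clip_b: np.ndarray, overlap: int) -> np.ndarray:
    # Blend the last `overlap` frames of clip_a into the first
    # `overlap` frames of clip_b with a linear ramp.
    t = np.linspace(0.0, 1.0, overlap)[:, None, None, None]
    blend = (1.0 - t) * clip_a[-overlap:] + t * clip_b[:overlap]
    return np.concatenate([clip_a[:-overlap], blend, clip_b[overlap:]])
```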


Repair policy

When to repair

Repair only these:

  • the last frame of Shot 1
  • the last frame of Shot 2
  • the seated endpoint for Shot 3
  • any hero close-up or especially visible face

When not to repair

Do not start frame-by-frame fixing everything unless the clip is genuinely important.

For your project, selective repair is the efficient path.

If more than one or two frames are broken

Then a short-span video inpainting pass can make sense. LanPaint now supports Wan 2.2 video inpainting, but it is explicitly positioned for shorter spans, not big full-scene rewrites. (github.com)


LoRA plan for this workflow

What I would do

Keep LoRA usage minimal:

  • Option A: no LoRA
  • Option B: one utility LoRA for faster previews
  • Option C: one carefully tested style/identity LoRA

What I would not do

Do not stack several stylistic LoRAs while you are still solving continuity.

That makes it much harder to tell whether the model is drifting because of:

  • the shot design,
  • the clip handoff,
  • or the combined LoRA influences.

If you want faster previews

Try Wan2.2-Lightning for preview passes only. It is a 4-step LoRA family designed to speed Wan 2.2 generation, which is useful for look-dev and shot testing. Once you like the handoff strategy, do cleaner final passes with your preferred final setup. (huggingface.co)


Decode plan on 8 GB

Previews

For rough previews, do whatever keeps the graph alive.

Finals

For final output, try to avoid VAE Decode (Tiled) if possible.

There is a documented ComfyUI issue where Wan output shows blinking/discoloration with tiled decode, while standard VAE Decode does not show the same problem. (github.com)

So the practical rule is:

  • preview flexibly
  • final decode carefully

If standard decode is too heavy at full size, render smaller for continuity approval, then do the final-quality pass only on the shots you are actually keeping.


The exact order I would follow

Pass 1 — still approval

Make and approve:

  • Anchor A
  • Anchor B
  • Anchor C

Pass 2 — motion approval

Generate:

  • Shot 1 walk
  • Shot 2 approach
  • Shot 3 sit
  • Shot 4 seated

Keep them short.

Pass 3 — continuity approval

At every shot boundary:

  • inspect last frame
  • repair if needed
  • feed forward

Pass 4 — final render approval

Render only the approved shots with your better decode path.

Pass 5 — edit

Assemble in your video editor.
Do not ask ComfyUI to do every bit of pacing that a normal edit can do more safely.


Suggested shot-by-shot checklist

Shot 1 — Walk

  • Input = Anchor A
  • Mode = I2V
  • Duration = short
  • Motion = gentle walk
  • Output = save best final frame

Shot 2 — Approach bench

  • Input = repaired last frame from Shot 1
  • Mode = I2V
  • Duration = short
  • Motion = approach / slow / prepare to sit
  • Output = save best standing-near-bench frame

Shot 3 — Sit

  • First frame = best standing-near-bench frame
  • Last frame = Anchor C or repaired seated still
  • Mode = FLF
  • Duration = short
  • Motion = simple sit-down transition
  • Output = save best seated final frame

Shot 4 — Seated

  • Input = best seated frame
  • Mode = I2V
  • Duration = short to medium
  • Motion = very subtle
  • Output = use this shot to create calm screen time

What not to do for this film

Do not:

  • generate each shot from text only
  • ask one shot to do too many state changes
  • stack multiple unknown LoRAs
  • rely on trigger words as your main continuity tool
  • start Wan video LoRA training on 8 GB as your first fix

The official toolchain direction is already telling you the same thing indirectly: 5B on 8 GB for inference, I2V/FLF for structure, and much higher memory expectations for training. (docs.comfy.org)


My simplest version of the plan

If you want the cleanest possible summary:

  • Walk = I2V from walking anchor
  • Approach bench = I2V from repaired last frame of walk
  • Sit = FLF from standing-near-bench to seated anchor
  • Seated = I2V from seated anchor
  • Repair only the bridge frames
  • Use standard VAE Decode for finals if possible
  • Use Lightning only for fast previews

That is the shot-by-shot workflow I would use on an 8 GB machine for your exact scene.