Best approach in 2026 to reproduce a specific cartoon style?

Hi everyone 👋
I’m looking for advice and I’m a bit stuck.

My goal is to reproduce a very specific cartoon graphic style from a small dataset (~35 images). I’m open to any workflow or architecture.

What I’ve tried so far:

  • LoRA on Replicate → disappointing
  • LoRA on Flux2 (Fal.ai) → best results so far, but characters are quite bad
  • PRUNA AI p-image → not great
  • QWEN2512 on Fal.ai with LoRA → not great

I’ve heard about DreamBooth and SD-based approaches, but I’m honestly not sure if they’re still the right tools in 2026 for learning a precise style (especially cartoon/illustrative).

I’ll attach a reference image of the style.
Given this kind of dataset and goal, what would you recommend today (models, training methods, workflows, etc.)?

For context: this style was easy to generate in 2023–2024, but it feels almost impossible to get from current models.

Thanks a lot 🙏


Training a Style LoRA and a Character LoRA and using them together is the most stable, reliable method for any image generation model, but it’s difficult to teach a style effectively with only 35 samples…

If it were me, for famous anime or cartoons, I’d think there’s a high chance someone has already created a Style LoRA (and maybe even a Character LoRA), so I’d look for that first.
Also, regarding base models, you might find ones that have been fine-tuned after pre-merging such Style LoRAs (or even just ones specialized for cartoons).

However, if you have an environment where FLUX.2 runs, there seem to be other methods available as well.


What changed, and why “exact cartoon style” is harder now

Modern foundation image models are generally optimized for broad coverage, strong photoreal priors, and “nice-looking” detail. A TV-cel cartoon look is closer to a rendering pipeline than a texture/style overlay:

  • stable outlines / line weight
  • simplified proportions that fight realism priors
  • flat/cel fills + limited shadow shapes
  • controlled palettes and low texture

With ~35 images, a single generic “style LoRA” often learns background vibe but fails at character construction (faces, eyes, mouths, hands).


Best approach in 2026 (given ~35 images)

Recommendation 1 (highest success rate): FLUX.2 multi-reference as the main generator

FLUX.2 is explicitly positioned for multi-reference consistency with up to 10 reference images (depending on the variant/provider), specifically calling out better consistency for character/product/style. (Microsoft Tech Community)

Why it fits your case

  • Your dataset is small, but reference-conditioning turns the problem from “learn this style forever” into “stay close to these anchors every time.”
  • This is the most direct path to “it looks like that show” without fighting a base model’s priors.

How to run it so characters stop collapsing

Curate your 35 images into anchor packs (you’ll reuse these repeatedly):

  1. Face pack (3–4 images)
    Close-ups, neutral lighting + one expressive face.
  2. Body/pose pack (3–4 images)
    Mid/full body, clear silhouette.
  3. Background/lighting pack (2–3 images)
    Interior/exterior, day/night if you have it.

Then for each generation:

  • feed 4–8 refs matched to the shot you want (closeups for closeups, etc.)
  • prompt mostly for staging/content (refs carry the style)
  • if a shot fails, swap in a closer anchor pack rather than “turning knobs”
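The anchor-pack selection above can be sketched as a small helper. Everything here is illustrative (pack contents, filenames, and shot types are placeholders, not part of any provider API); the point is that refs are picked closest-pack-first and capped at the provider’s limit:

```python
# Sketch of the anchor-pack workflow above. Pack contents and shot types
# are illustrative placeholders, not part of any provider API.

ANCHOR_PACKS = {
    "face": ["face_01.png", "face_02.png", "face_03.png", "face_expressive.png"],
    "body": ["body_01.png", "body_02.png", "body_03.png"],
    "background": ["bg_interior_day.png", "bg_exterior_night.png"],
}

def refs_for_shot(shot_type, max_refs=8):
    """Pick up to max_refs references matched to the shot, closest pack first."""
    priority = {
        "closeup": ["face", "body", "background"],
        "full_body": ["body", "face", "background"],
        "establishing": ["background", "body", "face"],
    }[shot_type]
    refs = []
    for pack in priority:
        for img in ANCHOR_PACKS[pack]:
            if len(refs) < max_refs:
                refs.append(img)
    return refs
```

For a closeup, face refs lead the list, so the model anchors on face-scale detail first; set `max_refs` to whatever cap your variant/provider enforces.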

Providers and docs (for concrete limits/behavior):

  • Microsoft’s FLUX.2 overview highlights “reference up to 10 images” for best style/character consistency. (Microsoft Tech Community)
  • Together’s FLUX.2 multi-reference notes variant-specific caps (e.g., 8 vs 10 refs). (Together AI)
  • ComfyUI’s FLUX.2 guide also mentions multi-reference up to 10 images. (ComfyUI)

If your goal is production (many images in the same look), this is usually the best “2026 answer.”


Recommendation 2 (for “portable style token”): SDXL illustration-first base + a split-adapter pipeline

If you want “type a prompt, no refs, get the look,” you’ll usually need open tooling and a base model whose priors already match 2D illustration.

Base model choice (important)

  • Illustrious-XL (SDXL) supports native 1536×1536 and is explicitly positioned as an illustration-oriented SDXL model—helpful for stable lines/flat shading at higher resolutions. (Hugging Face)
  • SDXL itself is significantly larger than SD 1.5 and uses a second text encoder and other conditioning changes; it’s the most mature ecosystem for control + adapters. (arXiv)

The key idea: don’t force one LoRA to do everything

Use two adapters instead of one:

  1. Style LoRA (rendering grammar: outlines, cel shading, palette discipline)
  2. Character LoRA (DreamBooth-LoRA) only if you need recurring named characters

Diffusers documents SDXL DreamBooth-LoRA training directly (script-based, reproducible). (Hugging Face)

Add the missing piece that fixes “bad characters”: structure + reference conditioning

Even with a good style LoRA, cartoons often need geometry control.

  • ControlNet adds spatial conditioning (edges/pose/depth/seg) to keep structure stable. (arXiv)
  • IP-Adapter adds image-prompt conditioning and is designed to remain compatible with text prompts and structural controls. (arXiv)

Practical SDXL stack (typical for TV-cel stability):

  • Base: Illustrious-XL (or another illustration-first SDXL checkpoint) (Hugging Face)
  • Style LoRA (yours)
  • IP-Adapter: 1–2 anchor frames for “style lock” (arXiv)
  • ControlNet lineart/edges/pose when characters drift (arXiv)

This combination is usually what stops the “background good / characters bad” outcome.
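A minimal diffusers sketch of the split-adapter part of this stack, using diffusers’ PEFT multi-adapter API (`load_lora_weights` with `adapter_name`, then `set_adapters`). The base repo id and LoRA paths are placeholders/assumptions you’d swap for your own:

```python
# Sketch of the split-adapter stack above (style LoRA + character LoRA).
# LoRA paths are placeholders; the base repo id is an assumption -- swap in
# whichever illustration-first SDXL checkpoint you actually use.

def adapter_mix(style_weight=1.0, character_weight=0.8):
    """Weights for set_adapters(): style dominates, character nudges identity."""
    return ["style", "character"], [style_weight, character_weight]

def build_pipeline(base="OnomaAIResearch/Illustrious-xl-early-release-v0"):
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(base, torch_dtype=torch.float16)
    # diffusers' PEFT integration: each LoRA gets its own adapter_name,
    # then both are activated with per-adapter weights.
    pipe.load_lora_weights("path/to/style_lora.safetensors", adapter_name="style")
    pipe.load_lora_weights("path/to/character_lora.safetensors", adapter_name="character")
    names, weights = adapter_mix()
    pipe.set_adapters(names, adapter_weights=weights)
    return pipe
```

IP-Adapter and ControlNet then layer on top of this pipeline; when the two adapters fight, dial `character_weight` down before touching anything else.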


Recommendation 3 (when you can accept a 2-step pipeline): generate composition → enforce style with an edit model

If you can tolerate an edit/refine pass, it often produces the most faithful “cel pipeline” look.

Option: FLUX.1 Kontext for editing

Kontext is explicitly an in-context image generation/editing family: prompt with both text and images, extract and modify visual concepts. (Black Forest Labs)

Option: Train an edit LoRA (paired transformation)

fal’s FLUX.2 LoRA guide describes multi image-to-image LoRAs as sets (start image(s) → end image) and recommends at least ~20 sets for meaningful performance. (fal.ai Blog)

In practice, you can create training pairs by:

  • generating “neutral/unstyled” versions of each frame (or lineart/base-color versions),
  • using the original frame as the target,
  • training the transformation.

This pushes the model to learn the renderer, not the world knowledge.
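The pairing step can be sketched as matching each neutral/lineart render to its original styled frame by filename stem. The directory layout and naming here are assumptions; check your trainer’s expected upload format (e.g., fal’s multi image-to-image LoRA docs) before relying on it:

```python
# Sketch of building (start -> target) training sets for an edit LoRA.
# Directory layout and naming are assumptions; check your trainer's
# expected format before use.

from pathlib import Path

def build_pairs(neutral_dir, original_dir, suffix=".png"):
    """Match each neutral/lineart render to its styled original by stem."""
    neutral = {p.stem: p for p in Path(neutral_dir).glob(f"*{suffix}")}
    original = {p.stem: p for p in Path(original_dir).glob(f"*{suffix}")}
    shared = sorted(neutral.keys() & original.keys())
    # start image -> end image, matching the "sets" framing in the guide above
    return [{"start": str(neutral[s]), "end": str(original[s])} for s in shared]
```

Unmatched files on either side are silently dropped, which is usually what you want while the neutral renders are still being generated.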


How I’d choose between these (decision rule)

If you need the exact look consistently, fastest

FLUX.2 multi-reference (no training) (Microsoft Tech Community)

If you need a reusable “style token” without references

SDXL illustration-first base + Style LoRA + IP-Adapter + ControlNet (Hugging Face)

If you need maximum style faithfulness and can do two passes

Composition model → Kontext/edit pass (Black Forest Labs)


What to do with your ~35 images (high-impact prep)

1) Rebalance the dataset by cropping for scale

Cartoon faces fail when the model never sees enough face-scale detail.

Create additional training samples by cropping:

  • face closeups
  • torso shots (hands/arms)
  • full body
  • background-only crops

(These are still “from your 35,” not new data.)
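The cropping pass can be sketched as deriving labeled crop boxes from each frame’s dimensions. The box fractions below are illustrative guesses, not a standard; tune them to where faces and characters actually sit in your show’s framing:

```python
# Sketch of the rebalancing idea: derive extra training crops from one frame.
# Box fractions are illustrative guesses; tune them per show/framing.

def scale_crops(width, height):
    """Return labeled crop boxes (left, top, right, bottom) for one frame."""
    return {
        # upper-middle region where faces usually sit in a mid shot
        "face":       (int(width * 0.30), 0, int(width * 0.70), int(height * 0.40)),
        # torso band: hands/arms detail
        "torso":      (int(width * 0.20), int(height * 0.25), int(width * 0.80), int(height * 0.75)),
        # full frame, kept as-is
        "full_body":  (0, 0, width, height),
        # side strip that is often character-free
        "background": (0, 0, int(width * 0.25), height),
    }
```

Each box can then be fed to an image library’s crop call (e.g., Pillow’s `Image.crop`) to write the extra samples to disk.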

2) Remove compression artifacts aggressively

fal’s FLUX trainer page explicitly emphasizes no compression artifacts/noise and says 9–50 images can be sufficient for style training if consistent. (Fal.ai)
With cartoon linework, JPEG artifacts poison the “edge language” quickly.

3) Captioning (if you train anything)

Even for style-only:

  • use a single trigger phrase consistently
  • have captions describe the scene content, not the style

This is repeatedly emphasized in provider training guidance (and is where many style LoRAs silently fail). (fal.ai Blog)
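The captioning convention can be sketched in a few lines. The trigger token `myshowstyle` and the scene descriptions are placeholders; the one-caption-`.txt`-per-image layout is a common trainer convention, not universal, so confirm what your trainer expects:

```python
# Sketch of caption-file generation for style-LoRA training.
# The trigger token and scene descriptions are placeholders.

import pathlib

TRIGGER = "myshowstyle"

def make_caption(scene_description):
    """One consistent trigger phrase + a content-only scene description."""
    return f"{TRIGGER}, {scene_description.strip()}"

def write_captions(scenes, out_dir="."):
    """Write <stem>.txt caption files next to each training image."""
    for stem, desc in scenes.items():
        pathlib.Path(out_dir, f"{stem}.txt").write_text(
            make_caption(desc), encoding="utf-8"
        )
```

Note the caption describes only what is in the scene; the style binds to the trigger because it is the one constant across every caption.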

“Advanced” methods that matter specifically for style vs character separation

If you split style and subject adapters, merging/mixing becomes the next failure mode (one adapter cancels the other).

  • ZipLoRA is explicitly about merging independently trained style and subject LoRAs with better fidelity. (arXiv)
  • B-LoRA targets implicit style-content separation in SDXL+LoRA, aiming to reduce the overfitting/entanglement you typically see in small datasets. (arXiv)
  • If you need series consistency across many images, StyleAligned is an inference-time approach using shared attention for consistent style across a set. (arXiv)

These aren’t always “plug-and-play,” but they explain why “style vs character” is hard and what current best research focuses on.


My concrete “2026 best workflow” for your case

Phase 1 — Prove the look (no training)

  1. Use FLUX.2 multi-reference with curated anchor packs. (Microsoft Tech Community)
  2. Iterate until you can produce 20–30 varied shots with the look locked.

If this succeeds, you may not need training at all.

Phase 2 — Make it portable (open SDXL stack)

  1. Pick an illustration-first SDXL base (Illustrious-XL is a strong starting point). (Hugging Face)
  2. Train a style LoRA (only style; no character tokens).
  3. If you have a recurring character, train DreamBooth-LoRA SDXL for that character separately. (Hugging Face)
  4. Inference: style LoRA + (optional) character LoRA + IP-Adapter anchor + ControlNet when poses drift. (arXiv)

Phase 3 — If you still can’t hit the exact renderer

Add an edit step:

  • Kontext to push outputs into the final cel renderer consistently. (Black Forest Labs)

Common reasons your earlier results underperformed (mapped to fixes)

  • LoRA “sort of” changes style but characters get worse → base priors conflict; fix by multi-reference (FLUX.2) or structure control (ControlNet) + anchor conditioning (IP-Adapter). (Microsoft Tech Community)
  • Style training “should work with 35 images” but doesn’t → dataset consistency/quality and caption binding dominate at this scale; fal explicitly calls out quality + a 9–50 style range, but only if images are clean/consistent. (Fal.ai)
  • DreamBooth uncertainty → it’s still best used for identity/subject, not pure style; keep style and subject as separate adapters and merge intelligently (ZipLoRA/B-LoRA concepts). (Hugging Face)

Hi, thanks a lot for this, genuinely appreciate how detailed and thoughtful your answer is.

I’m going to try FLUX.2 multi-reference first, as you suggest.

One thing I’m curious about: if you personally had to look for an existing Style LoRA for a well-known cartoon / TV-cel style, where would you look first?
I’ve already tried searching on Replicate, but honestly the results I found there weren’t very convincing.

I’m still relatively new to this whole ecosystem, so I don’t yet have a very clear sense of where the best places are to look for this kind of pretrained work.


Oh… In my case, a lot of this information isn’t crawled by Google, but I still start with a Google search to gauge whether it seems likely to exist at all (using search terms like ‘LoRA target content name’…).
If we’re lucky, we find it at this stage.

For sites, I first look at Civitai and Hugging Face, then a few others around them.
These sites each have powerful search functions, so it’s easier to search within the site itself. They also have APIs.

If this doesn’t work, you could consider creating a synthetic dataset with previously successful models and using it to augment your training data. With enough data, you might be able to train a style LoRA on a newer model.