Peft 0.18.1 crashing when fine-tuning - Part 2

Hi @John6666 @BenjaminB,

Thank you so much for your amazing feedback! I appreciate it! The previous topic was closed and I couldn’t add to it, so I created this new one.

By trial and error, I’m now using the following filter, and it stopped the crash. I’m not sure whether it’s working 100% for the fine-tuning; I’m still checking. The training loss is still high, and I don’t know whether that’s caused by the filter or by other factors, e.g. the dtype=torch.bfloat16 conversion …

target_modules=r"model\.language_model\..*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)",


There may still be a bug on the framework side when using Gemma 4:


My read is that the online cases point to one primary conclusion: your run has probably moved past the original PEFT crash, but you are now in the harder phase where a run can be alive but still wrong in quieter ways. The public reports do not suggest “Gemma 4 fine-tuning is fundamentally broken.” They suggest a stack of early-support rough edges: unsupported wrapped modules in PEFT, silent partial target matching, and separate Gemma 4 training-input quirks in Transformers. (GitHub)

What these similar cases really mean

The exact Gemma 4 PEFT issue says QLoRA fails because PEFT does not recognize Gemma4ClippableLinear as a supported LoRA target type. The current Gemma 4 model code shows why: Gemma4ClippableLinear is an nn.Module wrapper that contains an inner self.linear = nn.Linear(...), not a subclass of nn.Linear. That explains why broad targeting can crash even though the model is “linear-heavy” in practice. (GitHub)

That matters because it changes the diagnosis. The original failure was not mainly about your dataset, BF16, or optimizer. It was first an architecture-to-adapter mismatch. When people online switched to targeting the inner leaf layers like q_proj.linear, k_proj.linear, and the rest of the usual attention/MLP set, the crash stopped. You can see that pattern in public Gemma 4 adapters that explicitly list q_proj.linear, k_proj.linear, v_proj.linear, o_proj.linear, gate_proj.linear, up_proj.linear, and down_proj.linear in their adapter config. (Hugging Face)
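One way to reproduce that explicit list programmatically, with the usual caveat that you should confirm these names actually exist in your own build before using them. PEFT matches list-style target_modules entries against the ends of module names, so suffixes like q_proj.linear should reach the wrapped leaves:

```python
# Explicit leaf-level targets following the ".linear" pattern seen in
# public Gemma 4 adapter configs. Verify against your own model first.
PROJ_NAMES = (
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
)
target_modules = [f"{name}.linear" for name in PROJ_NAMES]
print(target_modules)
```

An explicit list like this is easier to audit than a regex: you can diff it against `peft_model.targeted_module_names` after injection and see exactly what was and was not matched.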

Why I do not think the regex is the whole story anymore

The most important similar case after the Gemma 4 PEFT bug is PEFT issue #1959. That issue says PEFT may only raise an error when none of the requested target modules are found. In practice that means a target rule can be partly wrong and still let training proceed. This is the most relevant background for your current situation, because it explains how you can go from a hard crash to a training run with unexpectedly high loss without seeing a new exception. (GitHub)

So my interpretation is:

  • your regex probably did something useful,
  • but “the crash is gone” is not enough evidence that LoRA is attached to the full intended layer set,
  • and high loss can absolutely be caused by partial target coverage rather than a broken regex in the narrow sense. (GitHub)

The bigger pattern behind the similar reports

This is not unique to Gemma 4. Similar PEFT failures have already appeared on other models when target selection lands on a wrapper, container, or custom module instead of a supported leaf layer. That is why the Gemma 4 issue is best understood as part of a broader PEFT pattern, not as a mysterious one-off regression. The practical lesson from those cases is the same every time: name-based target selection is fragile when the model uses nonstandard modules. (GitHub)

That is also why I think the “correctly define target_modules” advice is directionally right but incomplete. It solves the first problem only if it actually lands on the right leaf modules and covers enough of them. It does not solve the second-wave problems around batch fields, masking, and prompt formatting. (GitHub)

The second-wave problem is real

The strongest adjacent case is the open Transformers issue saying Gemma 4 currently requires mm_token_type_ids during text-only fine-tuning and should default them to zeros but does not yet do so. The PEFT Gemma 4 issue itself explicitly mentions this as a separate training-side problem that people will hit right after they bypass the LoRA injection failure. That is a strong signal that the ecosystem roughness is layered: first PEFT target injection, then training batch semantics. (GitHub)

This is why I would not put BF16 at the top of the suspect list. The public Gemma text fine-tuning guide uses BF16-capable setups as a normal path, and the mm_token_type_ids reproduction itself uses BF16 while isolating a different cause. That makes BF16 a secondary suspect unless you did something unusual with manual casting. (Google AI for Developers)

What I think is most likely in your case now

My ranking would be:

1. Partial or imperfect adapter coverage

Your regex is probably acting as a containment rule, which is good. But the remaining risk is that it matches fewer layers than you think, or misses the wrapped leaf modules that public Gemma 4 adapters explicitly target with .linear names. The public adapter configs matter here because they show what at least some successful Gemma 4 LoRA runs converged on. (Hugging Face)
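One quick way to audit this offline: when target_modules is a single string, PEFT treats it as a regex and full-matches it against each module's dotted path, so you can dry-run the filter with re.fullmatch. The module paths below are illustrative stand-ins, not a real Gemma 4 dump:

```python
import re

# The filter from the post above, matched the way PEFT matches a
# string target_modules: full-match against each module's dotted path.
pattern = r"model\.language_model\..*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)"

# Illustrative module paths, not taken from a real model dump.
candidates = [
    "model.language_model.layers.0.self_attn.q_proj",         # matches
    "model.language_model.layers.0.mlp.gate_proj",            # matches
    "model.language_model.layers.0.self_attn.q_proj.linear",  # wrapped leaf: no match
    "model.vision_tower.blocks.0.attn.q_proj",                # vision branch: no match
]

for name in candidates:
    print(name, "->", bool(re.fullmatch(pattern, name)))
```

Note the third case: if the build wraps its linears so the real nn.Linear sits at a `.linear` leaf, this pattern stops at the wrapper and never reaches it. That is exactly the kind of silent coverage gap issue #1959 allows.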

2. Batch-field or collator mismatch

The mm_token_type_ids issue is not hypothetical. It is already public, current, and directly tied to Gemma 4 text-only fine-tuning. A run can look “basically correct” while still carrying the wrong batch structure. (GitHub)

3. Prompt formatting and loss masking

Hugging Face’s Gemma 4 launch post says the built-in chat template should be used to avoid subtle formatting mistakes. TRL’s docs also make clear that assistant_only_loss and completion_only_loss materially change what tokens contribute to the objective. A wrong masking choice can keep loss high even when the adapter wiring is fine. (Hugging Face)

4. Hyperparameters drifting away from the known-good baseline

Google’s current Gemma text QLoRA guide gives a concrete baseline, including r=16, lora_alpha=16, lora_dropout=0.05, learning_rate=5e-5, and saving lm_head plus embed_tokens. When a run differs a lot from that recipe, it becomes harder to know whether loss behavior comes from the model, the data, or the training recipe. (Google AI for Developers)

My practical view of your current regex

I think your regex is a reasonable temporary filter if your task is text-only and your goal is to stay out of Gemma 4’s multimodal wrappers. That lines up with current Gemma 4 guidance from Unsloth, which recommends starting with only the language, attention, and MLP layers and keeping vision or audio layers off unless the task truly needs them. (Unsloth - Train and Run Models Locally)

But I do not think a regex is the best final resting place. The public evidence points toward a better end state: inspect the actual model, discover the exact leaf names that exist in your build, then switch to an explicit target list. That is more auditable, and it matches what successful public Gemma 4 adapters appear to be doing with .linear targets. (Hugging Face)
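One way to do that discovery, sketched as a pure filter over the module-name list you would get from dict(model.named_modules()).keys(). The helper and the sample paths are illustrative; the point is the shape of the audit, not these exact names:

```python
import re

def find_language_leaf_targets(
    module_names,
    projs=("q_proj", "k_proj", "v_proj", "o_proj",
           "gate_proj", "up_proj", "down_proj"),
):
    """Pick explicit LoRA targets from full dotted module paths.

    module_names: e.g. dict(model.named_modules()).keys().
    Prefers a wrapped '<proj>.linear' leaf when one exists, falls back
    to the bare '<proj>' name, and skips non-language-model branches.
    """
    alt = "|".join(projs)
    wrapped = re.compile(rf".*\.language_model\..*\.({alt})\.linear")
    bare = re.compile(rf".*\.language_model\..*\.({alt})")
    targets = set()
    for name in module_names:
        m = wrapped.fullmatch(name)
        if m:
            targets.add(f"{m.group(1)}.linear")
    for name in module_names:
        m = bare.fullmatch(name)
        if m and f"{m.group(1)}.linear" not in targets:
            targets.add(m.group(1))
    return sorted(targets)

# Illustrative paths, not a real Gemma 4 dump:
names = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.self_attn.q_proj.linear",
    "model.vision_tower.blocks.0.attn.q_proj",
]
print(find_language_leaf_targets(names))  # ['q_proj.linear']
```

The resulting suffix list can go straight into target_modules, and because it was derived from the actual model, it is trivially auditable later.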

What I would do next

First, verify what PEFT actually adapted. PEFT’s own docs recommend checking trainable parameters and the targeted module names after get_peft_model. Given issue #1959, this is not optional for a case like yours. (Hugging Face)

Second, check the batch keys on one real training batch and confirm that token_type_ids and mm_token_type_ids exist in the shapes you expect. Gemma 4’s current Transformers issue makes that a top-priority sanity check. (GitHub)

Third, confirm that your data pipeline uses the built-in Gemma 4 chat template and that your loss masking matches your dataset type. For conversational SFT, TRL’s masking settings can change the effective learning problem a lot. (Hugging Face)

Fourth, compare your hyperparameters against the official text QLoRA guide before blaming precision. The guide is the best current baseline for “boring, known-good” Gemma training. (Google AI for Developers)

Bottom line

My overall view is this:

  • The online cases confirm that your original crash was real and upstream, not just user error. (GitHub)
  • The online cases also warn that once the crash is gone, the next failures are often silent: partial target matching, missing mm_token_type_ids, wrong prompt formatting, or wrong loss masking. (GitHub)
  • So my main thought is that you should now treat this as a verification problem, not a crash-debugging problem. The question is no longer “can PEFT start,” but “is the model actually being trained on the right layers, with the right batch semantics, under the right objective.” (Hugging Face)

Use a 5-step isolation test, and work through the steps in this order.

Step 1. Check where LoRA actually attached

PEFT itself recommends two checks for this exact situation: print_trainable_parameters() and targeted_module_names. That is the fastest way to tell whether your regex matched the layers you wanted, too few layers, or the wrong branch entirely. (Hugging Face)

peft_model.print_trainable_parameters()
print(len(peft_model.targeted_module_names))
print(peft_model.targeted_module_names[:100])

How to read it:

  • If you see vision or audio module names, your targeting is still too broad. The Gemma 4 PEFT issue says the original unsupported Gemma4ClippableLinear problem comes from the vision/audio encoder. (GitHub)
  • If you see only a small number of language-model targets, your regex is probably too narrow. PEFT docs say verifying the adapted layers is necessary when the trainable fraction looks lower or higher than expected. (Hugging Face)
  • If you see the expected language-model q/k/v/o/gate/up/down layers, move to Step 2.

Step 2. Check one real training batch

Gemma 4 currently has a separate training issue: mm_token_type_ids may be required even for text-only fine-tuning. The public Gemma 4 PEFT issue explicitly points this out as the next thing people hit after the LoRA crash. (GitHub)

batch = next(iter(trainer.get_train_dataloader()))
print(batch.keys())
for k, v in batch.items():
    if hasattr(v, "shape"):
        print(k, v.shape, getattr(v, "dtype", None))

What you want to see:

  • input_ids
  • attention_mask
  • labels
  • token_type_ids
  • mm_token_type_ids

If mm_token_type_ids is missing, that is a strong suspect. The problem is then in your collator or preprocessing, not in the regex. (GitHub)
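If it is missing, one stopgap while the upstream fix lands is a thin wrapper that defaults mm_token_type_ids to zeros in the shape of input_ids, which is what the issue above says text-only batches should get. This is a shape-only sketch over dicts of nested lists; with real tensors you would build the zeros with torch.zeros_like(batch["input_ids"]) instead:

```python
def with_default_mm_token_type_ids(collate_fn):
    """Wrap a collator so text-only batches get zero mm_token_type_ids.

    Sketch only: batches here are dicts of nested lists. With tensors,
    use torch.zeros_like(batch["input_ids"]) instead.
    """
    def wrapped(features):
        batch = collate_fn(features)
        if "mm_token_type_ids" not in batch:
            batch["mm_token_type_ids"] = [
                [0] * len(row) for row in batch["input_ids"]
            ]
        return batch
    return wrapped

# Hypothetical base collator, for illustration only:
def toy_collate(features):
    return {"input_ids": [f["input_ids"] for f in features]}

collate = with_default_mm_token_type_ids(toy_collate)
batch = collate([{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5, 6]}])
print(batch["mm_token_type_ids"])  # [[0, 0, 0], [0, 0, 0]]
```

Treat this as a diagnostic workaround, not a fix: once the Transformers issue is resolved, the wrapper should be removed so you are not masking a real multimodal batch.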

Step 3. Check that loss is computed on the right tokens

TRL documents that assistant_only_loss=True computes loss only on assistant responses, while prompt-completion datasets default to completion-only loss. If this is set wrong, training can run but the loss can stay misleadingly high. (Hugging Face)

labels = batch["labels"]
mask = (labels != -100)
print("supervised ratio:", mask.float().mean().item())

How to read it:

  • Very low ratio: you may be masking out almost everything.
  • Very high ratio on chat data: you may be training on user/system/template tokens too.
  • If this is conversational SFT, check whether assistant_only_loss=True is actually what you want. TRL says that option is specifically for conversational datasets. (Hugging Face)

Step 4. Check the prompt format

Hugging Face’s Gemma 4 post says to use the built-in chat template because manual formatting can introduce subtle mistakes. That is especially important on Gemma 4. (Hugging Face)

sample = dataset["train"][0]
formatted = tokenizer.apply_chat_template(
    sample["messages"],
    tokenize=False,
    add_generation_prompt=False,
)
print(formatted[:3000])

What to look for:

  • correct role order
  • no duplicated BOS/EOS
  • no manual wrapper around the built-in template
  • no older prompt format copied into Gemma 4

If the formatted text looks odd, fix that before tuning anything else. (Hugging Face)
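The duplicated-BOS case in particular is easy to check mechanically: tokenizers often prepend BOS themselves, so a template that also embeds the BOS string doubles it. Assuming Gemma's <bos> marker string (swap in whatever your tokenizer actually uses):

```python
def count_marker(formatted_text, marker="<bos>"):
    """Count occurrences of a special-token string in templated text."""
    return formatted_text.count(marker)

# Illustrative templated strings:
ok = "<bos><start_of_turn>user\nhi<end_of_turn>\n"
doubled = "<bos><bos><start_of_turn>user\nhi<end_of_turn>\n"

print(count_marker(ok))       # 1
print(count_marker(doubled))  # 2
```

Anything other than exactly one BOS per sample (before the tokenizer's own add_special_tokens behavior is accounted for) is worth investigating.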

Step 5. Compare with the official Gemma baseline

Google’s current text QLoRA guide uses a concrete starting point: r=16, lora_alpha=16, lora_dropout=0.05, learning_rate=5e-5, max_grad_norm=0.3, and modules_to_save=["lm_head", "embed_tokens"] with ensure_weight_tying=True. The same guide enables bf16=True when the model dtype is torch.bfloat16. (Google AI for Developers)

So if your setup differs a lot from that baseline, change only one thing at a time. BF16 alone is not the first thing I would blame, because the official guide uses BF16 when the model dtype is BF16. (Google AI for Developers)
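Pulled together, the guide's reported values look roughly like the config fragment below. This is a sketch, not a verified recipe: parameter availability (ensure_weight_tying in particular) depends on your PEFT version, and output_dir is a hypothetical path, so check every value against the guide and your installed versions:

```python
from peft import LoraConfig
from trl import SFTConfig

# Values as reported by the Gemma text QLoRA guide; verify before use.
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    modules_to_save=["lm_head", "embed_tokens"],
    ensure_weight_tying=True,  # availability depends on PEFT version
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="gemma-sft",  # hypothetical path
    learning_rate=5e-5,
    max_grad_norm=0.3,
    bf16=True,               # when the model dtype is torch.bfloat16
)
```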

Fast diagnosis table

If this happens, the likely issue is:

  • Wrong or too few targeted_module_names → LoRA target problem. (Hugging Face)
  • Missing mm_token_type_ids → collator or preprocessing problem. (GitHub)
  • Very odd supervised-token ratio → masking or trainer config problem. (Hugging Face)
  • Prompt looks duplicated or malformed → chat-template problem. (Hugging Face)
  • All of the above look fine, but loss is still bad → hyperparameters or dataset quality.

The shortest possible plan

Run these four prints first:

print(peft_model.targeted_module_names[:80])
peft_model.print_trainable_parameters()

batch = next(iter(trainer.get_train_dataloader()))
print(batch.keys())
print((batch["labels"] != -100).float().mean().item())

That usually tells you which bucket the problem is in:

  • target selection
  • missing batch fields
  • bad masking
  • or something later like hyperparameters

Thank you so much @John6666 for your amazing feedback and guidance. I’m truly grateful to you! I will apply the steps you listed and report back with my findings shortly :folded_hands: :heart:
