“The doom lies in yourself, not in your name.”

#15
by jukofyork - opened

Continuation of Wur doomed!.

For longer text chunks or stories, https://pastebin.com works great and helps prevent the thread from slowing down!

🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧
🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛🟧
🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧🟧
⬜🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧⬛🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜⬜🟧🟧⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜⬜🟧🟧🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛⬛🟧⬜
⬜🟧⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛🟧⬜
⬜🟧⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛🟧⬜
⬜🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧⬜

jukofyork pinned discussion

The doom is still buried within Command-A for sure.

The doom is still buried within Command-A for sure.

Only another 38 days to go:

image.png


It's actually going really well and I'm pretty sure it will be mostly converged within another couple of days:

image.png

🤞

A step 601 preview - all with temperature = 0:

https://pastebin.com/GASKaHTk

https://pastebin.com/CRT81QLb

  • It's still messing up some line endings, but I can live with that if it works... It can likely be fixed later using the new class 0 random data if it turns out to be a problem.
  • The Grimdark story was noticeably (much!) better compared to the inverse.
  • The Battlestar Galactica story showed that even though Q8_0, F16 and BF16 all diverge slightly from F32, it's not clearly making them any worse (I actually liked the Q8_0 story best!).
Size Name
287M command-a-03-2025-lora-Q8_0.gguf
541M command-a-03-2025-lora-F16.gguf
541M command-a-03-2025-lora-BF16.gguf
1.1G command-a-03-2025-lora-F32.gguf
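As a sanity check, those sizes line up with the bytes-per-weight of each GGUF type (F32 = 4 bytes, F16/BF16 = 2 bytes, Q8_0 = 34 bytes per 32-weight block, so ~1.06 bytes), ignoring the small GGUF header/metadata. A rough sketch, with the parameter count back-calculated from the F16 file rather than taken from any config:

```python
# Rough sanity check of LoRA GGUF sizes from bytes-per-weight.
# The parameter count is back-calculated from the reported F16 file size,
# not read from any real config, so treat it as an estimate.
F16_BYTES = 541 * 1024**2          # reported F16 file size (541M)
n_params = F16_BYTES / 2           # F16 stores 2 bytes per weight

bytes_per_weight = {
    "F32":  4.0,
    "F16":  2.0,
    "BF16": 2.0,
    "Q8_0": 34 / 32,               # 32 int8 weights + one fp16 scale per block
}

for name, bpw in bytes_per_weight.items():
    size_mib = n_params * bpw / 1024**2
    print(f"{name:>4}: ~{size_mib:,.0f} MiB")
# F32 comes out around ~1.06 GiB and Q8_0 around ~287 MiB, matching the listing above.
```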

It still has a way to go before it starts to converge, but I would think by step 1000 it will be pretty close:

image.png

566 responses in the previous thread! At this rate, we may be the reason hf staff implement a multi-page view for discussions.

This was posted on Hacker News today:

https://outsidetext.substack.com/p/how-does-a-blind-model-see-the-earth?selection=5413dcae-b9f4-4adb-8826-d48e3908de2a#:~:text=Wow%2C%20best%20rendition%20of%20the%20Global%20West%20so%20far

Absolutely fascinating!

That was really cool. Thanks for sharing!

Yeah, and llama-3.1:405b doing so well was quite a surprise too (and it makes you a bit sad that everything seems to be moving away from large dense models).

PSA in case anyone here missed it:

https://old.reddit.com/r/LocalLLaMA/comments/1s2c1w4/litellm_1827_and_1828_on_pypi_are_compromised_do/

We have just been compromised, and thousands of people likely have been as well; more details are being updated here: https://futuresearch.ai/blog/litellm-pypi-supply-chain-attack/

Maybe avoiding the original Command R as it's a nightmare for context.

If you are using ik_llama.cpp then this might be worth a read:

https://github.com/ikawrakow/ik_llama.cpp/pull/1033

image

IIRC, the original Command R model (i.e. the 35B-parameter one that doesn't use GQA) was particularly amenable to KV-cache quantisation anyway, so using this could make Q4_0 quite viable!
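For a rough sense of why quantised KV-cache matters so much for the non-GQA Command R: with no GQA, K and V each store the full hidden size per layer per token. A back-of-the-envelope sketch, assuming (from memory) 40 layers and a hidden size of 8192 for the 35B model; the Q8_0 and Q4_0 figures use the usual 34- and 18-byte blocks per 32 values:

```python
# Back-of-the-envelope KV-cache size for a non-GQA model.
# Architecture numbers are assumed from memory for the original 35B Command R
# (40 layers, hidden size 8192, no GQA so K/V are full hidden size).
n_layers, d_model = 40, 8192

bytes_per_value = {
    "F16":  2.0,
    "Q8_0": 34 / 32,   # 32 int8 values + one fp16 scale
    "Q4_0": 18 / 32,   # 32 4-bit values + one fp16 scale
}

for ctx in (32_768, 131_072):
    for name, bpv in bytes_per_value.items():
        size_gib = 2 * n_layers * d_model * bpv * ctx / 1024**3   # 2 = K and V
        print(f"ctx={ctx:>6}  {name:>4}: ~{size_gib:5.1f} GiB")
# At 128k context this is roughly 160 GiB at F16 vs ~45 GiB at Q4_0,
# which is why the full-attention Command R is such a nightmare for context.
```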

@jukofyork out of your medium creative writing models which would you say is the best in terms of dialogue? Maybe avoiding the original Command R as it's a nightmare for context.

Sadly, I think they are all a bit buggy, and I had mostly given up on trying to get it working until today ☹️

The problem is that I really want to do this without lots of complex data preparation pipelines, as I fear that could end up causing even more problems and bias... I really want to just prepare some contrasting data from a pair of datasets and get that working, but alas everything I have tried almost works but ends up adding subtle bugs like scrambling the end of paragraphs or similar.
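Roughly what I mean by "just prepare some contrasting data" is something like the sketch below (file names, the JSONL layout, and the paragraph-level chunking are all made up for illustration; the real run obviously tokenises into fixed-length sequences):

```python
# Illustrative sketch only: pair a "positive" (target prose) dataset with a
# "negative" (synthetic slop) dataset into labelled examples, with no other
# preprocessing. File names and the JSONL layout are hypothetical.
import json
import random

def load_paragraphs(path):
    """Split a plain-text file into non-empty paragraphs."""
    with open(path, encoding="utf-8") as f:
        return [p.strip() for p in f.read().split("\n\n") if p.strip()]

positive = load_paragraphs("human_fiction.txt")    # class 1: target style
negative = load_paragraphs("synthetic_slop.txt")   # class 0: contrast style

examples = [{"text": t, "class": 1} for t in positive] + \
           [{"text": t, "class": 0} for t in negative]
random.shuffle(examples)

with open("contrastive_mix.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```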

I'm also not all that convinced my slop-fiction-paragraphs dataset is actually of much use for modern models, as they have a completely different "slop profile" (e.g. "NOT X, BUT Y" and fewer "tapestries" and "shivers").

If I do succeed in getting anywhere then I'll happily retry on some medium/small models though! I'm using 100 tokens per tunable parameter (rank-16 for 8 layers = 3.146M), so it will take nearly a week to train on a 300M token sample:

SERVER-1: [2026-03-24 15:56:11.790] [INFO] [qlora-pipe-lite] initial evaluation loss: 2.9341
SERVER-1: [2026-03-24 15:56:12.324] [INFO] [qlora-pipe-lite] before GAS splitting, batch size: 20, total tokens: 81920
SERVER-1: [2026-03-24 16:02:11.153] [INFO] [qlora-pipe-lite] before GAS splitting, batch size: 20, total tokens: 81920
SERVER-1: [2026-03-24 16:02:50.076] [INFO] [qlora-pipe-lite] step: 1 / 1194, loss: 2.9329, lr: 1.418e-05, throughput: 0.151 sequences/s, elapsed: 6m38s, eta: 5d11h59m
SERVER-1: [2026-03-24 16:08:49.264] [INFO] [qlora-pipe-lite] before GAS splitting, batch size: 20, total tokens: 81920
SERVER-1: [2026-03-24 16:09:28.036] [INFO] [qlora-pipe-lite] step: 2 / 1194, loss: 2.9284, lr: 2.005e-05, throughput: 0.151 sequences/s, elapsed: 13m16s, eta: 5d11h49m
SERVER-1: [2026-03-24 16:15:27.233] [INFO] [qlora-pipe-lite] before GAS splitting, batch size: 20, total tokens: 81920
SERVER-1: [2026-03-24 16:16:06.056] [INFO] [qlora-pipe-lite] step: 3 / 1194, loss: 2.9223, lr: 2.456e-05, throughput: 0.151 sequences/s, elapsed: 19m54s, eta: 5d11h42m
SERVER-1: [2026-03-24 16:22:05.211] [INFO] [qlora-pipe-lite] before GAS splitting, batch size: 20, total tokens: 81920
SERVER-1: [2026-03-24 16:22:43.960] [INFO] [qlora-pipe-lite] step: 4 / 1194, loss: 2.9232, lr: 2.835e-05, throughput: 0.151 sequences/s, elapsed: 26m32s, eta: 5d11h34m
SERVER-1: [2026-03-24 16:28:43.097] [INFO] [qlora-pipe-lite] before GAS splitting, batch size: 20, total tokens: 81920
SERVER-1: [2026-03-24 16:29:21.861] [INFO] [qlora-pipe-lite] step: 5 / 1194, loss: 2.9294, lr: 3.170e-05, throughput: 0.151 sequences/s, elapsed: 33m10s, eta: 5d11h27m
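The "100 tokens per tunable parameter" budget above works out roughly like this (a sketch only: the hidden size of 12288 is an assumption for Command-A, and the adapter is taken to be a single rank-16 A/B pair per layer on 8 layers, which reproduces the ~3.146M figure):

```python
# Rough token-budget arithmetic behind "100 tokens per tunable parameter".
# Hidden size of 12288 is assumed for Command-A; the adapter is taken to be
# one rank-16 A/B pair per layer on 8 layers (which matches ~3.146M).
d_model, rank, n_layers = 12288, 16, 8

trainable = 2 * rank * d_model * n_layers     # A (r x d) + B (d x r) per layer
tokens_needed = 100 * trainable               # 100 tokens per trainable param

print(f"trainable params: {trainable / 1e6:.3f}M")      # ~3.146M
print(f"token budget:     {tokens_needed / 1e6:.1f}M")  # ~314.6M, i.e. the ~300M sample
```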

That's a crazy amount of effort so I'm not surprised you want to make it easier!

There was a project I saw recently for making synthetic datasets easily; maybe you could modify it to make things easier?

https://github.com/Ready-Art/Synthetic-Dataset-Generator

There was also this, which looked quite cool, but again I'm not sure how useful it is to you.
https://github.com/Schneewolf-Labs/Merlina

Again, I'm no expert at this and have never tuned a model in my life!

(eta: 5d11h27m)

Hey, remember to set up checkpoints every n steps, as I remember your training crashed last time.
I tend to create a private hf repo during training and push them as they're created.

but alas everything I have tried almost works but ends up adding subtle bugs like scrambling the end of paragraphs or similar.

I think things like this could be problematic

image

(eta: 5d11h27m)

Hey, remember to set up checkpoints every n steps, as I remember your training crashed last time.
I tend to create a private hf repo during training and push them as they're created.

It's set up to checkpoint every hour, so hopefully it'll be OK.
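For anyone who wants to do the "push each checkpoint to a private repo" thing, a minimal sketch with huggingface_hub might look like this (the repo id and checkpoint path are placeholders, not the actual training setup):

```python
# Minimal sketch: push each new checkpoint folder to a private Hugging Face repo.
# Repo id and checkpoint path are placeholders.
from huggingface_hub import HfApi

api = HfApi()  # assumes you are already logged in via `huggingface-cli login`
repo_id = "your-username/command-a-lora-checkpoints"

api.create_repo(repo_id, private=True, exist_ok=True)

def push_checkpoint(local_dir: str, step: int) -> None:
    """Upload one checkpoint directory under its own subfolder in the repo."""
    api.upload_folder(
        folder_path=local_dir,
        repo_id=repo_id,
        path_in_repo=f"step-{step}",
        commit_message=f"checkpoint at step {step}",
    )

# e.g. call after each hourly checkpoint is written:
# push_checkpoint("output/step-601", 601)
```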

but alas everything I have tried almost works but ends up adding subtle bugs like scrambling the end of paragraphs or similar.

I think things like this could be problematic

image

Yeah, but hopefully this will get drowned out: 100 tokens per trainable parameter and using a pair of positive/negative datasets should be massive overkill, but I've been hopeful before 🤞

100 tokens per trainable parameter and using a pair of positive/negative datasets should be massive overkill

Since the synth-slop dataset (just from reading random samples) has perfect formatting, spelling, and punctuation with no OCR issues, all the formatting issues are in the positive mix.
I suppose in a week, if it learns some of these artifacts, it should be easy to identify which ones need to be filtered out.

For audio / TTS, I find the models love to associate "truncated sentence" artifacts with specific voices, or with a particular source dataset that had issues like that (which makes it easy to filter out for the next iteration).

100 tokens per trainable parameter and using a pair of positive/negative datasets should be massive overkill

Since the synth-slop dataset (just from reading random samples) has perfect formatting, spelling, and punctuation with no OCR issues, all the formatting issues are in the positive mix.
I suppose in a week, if it learns some of these artifacts, it should be easy to identify which ones need to be filtered out.

I'm actually using a 50/50 mix for the two classes:

They all have slightly different oddball quirks with the formatting, so this should hopefully make it harder for the training to latch onto any single weird formatting problem like this.

Do you also find that with CR+, every single token in the system prompt has a strong impact on the logits throughout the context?

And it's the only model I've found that answers trick riddles like this correctly every time:
"A woman and her son are in a car accident. The woman is sadly killed. The boy is rushed to hospital. When the doctor sees the boy, he says "I can't operate on this child, he is my son." How is this possible?"
The "v1.5", Command-A, Claude 4.6, etc all say "the mother" or "two mothers".

I'm not sure if it's the unique full-attention mechanism / the other models ignoring the "he says", or if it's alignment tax, since other older models like Mixtral say "mother"?

Opus-4.6's reply (non-thinking) was interesting because it seemed so conflicted:

## The Answer

The doctor is **the boy's mother** — or more precisely, **his other parent who is a woman/mother**.

Wait, let me be precise: The doctor is simply **the boy's mother**. Since the boy has two mothers (or the doctor is a woman), there is no contradiction.

The more straightforward answer: **The doctor is the boy's mother** — a woman. The riddle relies on the outdated assumption that doctors are male, causing people to puzzle over it unnecessarily.

This is a classic riddle designed to highlight **unconscious gender bias** — the tendency to default to imagining a male doctor, even when there's no reason to.
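If anyone wants to run the same comparison locally, a quick sketch against an OpenAI-compatible endpoint (e.g. a local llama-server) at temperature 0; the base URL and model names are placeholders:

```python
# Quick sketch: send the same trick riddle to an OpenAI-compatible endpoint
# at temperature 0. Base URL and model names are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

RIDDLE = (
    'A woman and her son are in a car accident. The woman is sadly killed. '
    'The boy is rushed to hospital. When the doctor sees the boy, he says '
    '"I can\'t operate on this child, he is my son." How is this possible?'
)

for model in ("command-r-plus", "command-a"):
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": RIDDLE}],
        temperature=0,
    )
    print(f"--- {model} ---")
    print(reply.choices[0].message.content)
```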

Sign up or log in to comment