“The doom lies in yourself, not in your name.”
Continuation of Wur doomed!.
For longer text chunks or stories, https://pastebin.com works great and helps prevent the thread from slowing down!
🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧
🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛🟧
🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧🟧
⬜🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧⬛🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜⬜🟧🟧⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜⬜🟧🟧🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛⬛🟧⬜
⬜🟧⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛🟧⬜
⬜🟧⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛🟧⬜
⬜🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧⬜
The doom is still buried within Command-A for sure.
A step 601 preview - all with temperature = 0:
- It's still messing up some end-of-lines, but I can live with that if it works... Likely can be fixed later using the new class-0 random data if it turns out to be a problem.
- The Grimdark story was noticeably (much!) better compared to the inverse.
- The Battlestar Galactica story showed that even though `Q8_0`, `F16` and `BF16` all diverge slightly from `F32`, it's not clearly making them any worse (I actually liked the `Q8_0` story best!).
| Size | Name |
|---|---|
| 287M | command-a-03-2025-lora-Q8_0.gguf |
| 541M | command-a-03-2025-lora-F16.gguf |
| 541M | command-a-03-2025-lora-BF16.gguf |
| 1.1G | command-a-03-2025-lora-F32.gguf |
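As a rough sanity check, those file sizes line up with the bits-per-weight of each format. This is just a back-of-the-envelope sketch, assuming the standard GGUF `Q8_0` block layout (32 int8 weights plus one fp16 scale per block):

```python
# Estimate the GGUF LoRA sizes from the F16 file alone.
f16_bytes = 541e6
n_weights = f16_bytes / 2            # F16 = 2 bytes per weight -> ~270M weights

# Q8_0: each block of 32 weights costs 32 int8 bytes + 2 bytes fp16 scale.
q8_0_mb = n_weights * 34 / 32 / 1e6  # ~287 MB, matching the table
f32_gb = n_weights * 4 / 1e9         # ~1.08 GB, matching the table

print(round(q8_0_mb), round(f32_gb, 2))
```

Which is why `Q8_0` comes in at just over half the `F16` size rather than exactly half.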
It still has a way to go before it starts to converge, but I would think by step 1000 it will be pretty close:
566 responses in the previous thread! In the future we may be the reason for hf staff to implement a multi-page view of discussions.
This was posted on Hacker News today:
Absolutely fascinating!
> This was posted on Hacker News today:
>
> Absolutely fascinating!
That was really cool. Thanks for sharing!
> This was posted on Hacker News today:
>
> Absolutely fascinating!
>
> That was really cool. Thanks for sharing!
Yeah, and llama-3.1:405b doing so well was quite a surprise too (and makes you a bit sad that everything seems to be moving away from large dense models).
PSA in case anyone here missed it:
We have just been compromised, and thousands of people likely are as well. More details here: https://futuresearch.ai/blog/litellm-pypi-supply-chain-attack/
> Maybe avoiding the original Command R as it's a nightmare for context.
If you are using ik_llama.cpp then this might be worth a read:
https://github.com/ikawrakow/ik_llama.cpp/pull/1033
IIRC, the original Command R model (ie: the 35B-parameter version that doesn't use GQA) was particularly amenable to KV-cache quantisation anyway, so using this could make Q4_0 quite viable!
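To put rough numbers on the "nightmare for context" point: without GQA, every layer stores full-width K and V per token. A back-of-the-envelope sketch, where the 40-layer / 8192-hidden shape for the 35B Command R and the standard `Q4_0` packing (32 values in 18 bytes) are assumptions from memory, not checked figures:

```python
# Back-of-the-envelope KV-cache cost for a non-GQA model.
layers, hidden = 40, 8192                  # assumed 35B Command R shape
f16_per_token = 2 * layers * hidden * 2    # K + V, fp16 = 2 bytes each
ctx = 32768

f16_total_gib = f16_per_token * ctx / 2**30
print(f"F16 KV at {ctx} ctx: {f16_total_gib:.1f} GiB")   # 40.0 GiB

# Q4_0 packs 32 values into 18 bytes (16 bytes of nibbles + fp16 scale).
q4_total_gib = f16_total_gib * 18 / (32 * 2)
print(f"Q4_0 KV at {ctx} ctx: {q4_total_gib:.2f} GiB")   # 11.25 GiB
```

So even under these assumed numbers, a quantised KV-cache is the difference between "fits on a workstation" and "doesn't".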
@jukofyork out of your medium creative writing models which would you say is the best in terms of dialogue? Maybe avoiding the original Command R as it's a nightmare for context.
I think sadly they are all a bit buggy, and I had mostly given up trying to get it working until today ☹️
The problem is that I really want to do this without lots of complex data preparation pipelines, as I fear that could end up causing even more problems and bias... I really want to just prepare some contrasting data from a pair of datasets and get that working, but alas everything I have tried almost works but ends up adding subtle bugs like scrambling the end of paragraphs or similar.
I'm also not all that convinced my slop-fiction-paragraphs dataset is actually much use for modern models, as they have a completely different "slop profile" (eg: `NOT X, BUT Y` and fewer "tapestries" and "shivers").

If I do succeed in getting anywhere then I'll happily retry on some medium/small models though! I'm using 100 tokens per tunable parameter (`rank-16` for `8 layers` = `3.146M`), so it will take nearly a week to train on a `300M` token sample:

```
SERVER-1: [2026-03-24 15:56:11.790] [INFO] [qlora-pipe-lite] initial evaluation loss: 2.9341
SERVER-1: [2026-03-24 15:56:12.324] [INFO] [qlora-pipe-lite] before GAS splitting, batch size: 20, total tokens: 81920
SERVER-1: [2026-03-24 16:02:11.153] [INFO] [qlora-pipe-lite] before GAS splitting, batch size: 20, total tokens: 81920
SERVER-1: [2026-03-24 16:02:50.076] [INFO] [qlora-pipe-lite] step: 1 / 1194, loss: 2.9329, lr: 1.418e-05, throughput: 0.151 sequences/s, elapsed: 6m38s, eta: 5d11h59m
SERVER-1: [2026-03-24 16:08:49.264] [INFO] [qlora-pipe-lite] before GAS splitting, batch size: 20, total tokens: 81920
SERVER-1: [2026-03-24 16:09:28.036] [INFO] [qlora-pipe-lite] step: 2 / 1194, loss: 2.9284, lr: 2.005e-05, throughput: 0.151 sequences/s, elapsed: 13m16s, eta: 5d11h49m
SERVER-1: [2026-03-24 16:15:27.233] [INFO] [qlora-pipe-lite] before GAS splitting, batch size: 20, total tokens: 81920
SERVER-1: [2026-03-24 16:16:06.056] [INFO] [qlora-pipe-lite] step: 3 / 1194, loss: 2.9223, lr: 2.456e-05, throughput: 0.151 sequences/s, elapsed: 19m54s, eta: 5d11h42m
SERVER-1: [2026-03-24 16:22:05.211] [INFO] [qlora-pipe-lite] before GAS splitting, batch size: 20, total tokens: 81920
SERVER-1: [2026-03-24 16:22:43.960] [INFO] [qlora-pipe-lite] step: 4 / 1194, loss: 2.9232, lr: 2.835e-05, throughput: 0.151 sequences/s, elapsed: 26m32s, eta: 5d11h34m
SERVER-1: [2026-03-24 16:28:43.097] [INFO] [qlora-pipe-lite] before GAS splitting, batch size: 20, total tokens: 81920
SERVER-1: [2026-03-24 16:29:21.861] [INFO] [qlora-pipe-lite] step: 5 / 1194, loss: 2.9294, lr: 3.170e-05, throughput: 0.151 sequences/s, elapsed: 33m10s, eta: 5d11h27m
```
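The "100 tokens per tunable parameter" budget works out as follows. A sketch, assuming a hidden size of 12288 (the Command-A class of models) and one LoRA (A, B) pair per targeted layer, both of which are assumptions for illustration:

```python
# Sketch of the quoted training budget: rank-16 LoRA over 8 layers.
rank = 16
hidden = 12288   # assumed hidden size for a Command-A class model
layers = 8

# Each adapter contributes A (rank x hidden) plus B (hidden x rank).
params = 2 * rank * hidden * layers
print(f"trainable parameters: {params / 1e6:.3f}M")   # 3.146M

token_budget = 100 * params
print(f"token budget: {token_budget / 1e6:.1f}M")     # ~314.6M
```

Under those assumptions the arithmetic lands right on the quoted `3.146M` parameters and roughly the `300M` token sample.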
That's a crazy amount of effort so I'm not surprised you want to make it easier!
There was a project I saw recently that was for easily making a synthetic dataset; maybe you could modify it to make things easier?
https://github.com/Ready-Art/Synthetic-Dataset-Generator
There was also this that looked quite cool, but again I'm not sure how useful it is to you.
https://github.com/Schneewolf-Labs/Merlina
Again, I'm no expert at this and have never tuned a model in my life!
> (eta: 5d11h27m)

Hey, remember to set up checkpoints every n steps, as I remember your training crashed last time. I tend to create a private hf repo during training and push them as they're created.

> but alas everything I have tried almost works but ends up adding subtle bugs like scrambling the end of paragraphs or similar.

I think things like this could be problematic.
> (eta: 5d11h27m)
>
> Hey, remember to set up checkpoints every n steps, as I remember your training crashed last time. I tend to create a private hf repo during training and push them as they're created.

It's set up to checkpoint every hour, so hopefully OK.

> but alas everything I have tried almost works but ends up adding subtle bugs like scrambling the end of paragraphs or similar.
>
> I think things like this could be problematic.

Yeah, but hopefully this will get drowned out: 100 tokens per trainable parameter and using a pair of positive/negative datasets should be massive overkill, but I've been hopeful before 🤞
> 100 tokens per trainable parameter and using a pair of positive/negative datasets should be massive overkill

Since the synth-slop dataset (from just reading random samples) has perfect formatting, spelling, and punctuation with no OCR issues, all the formatting issues are in the positive mix.

I suppose in a week, if it learns some of these artifacts, it should be easy to identify which ones need to be filtered out.

For audio / TTS, I find the models love to associate artifacts like truncated sentences with specific voices when a particular source dataset had issues like that (making it easy to filter out for the next iteration).
> 100 tokens per trainable parameter and using a pair of positive/negative datasets should be massive overkill
>
> Since the synth-slop dataset (from just reading random samples) has perfect formatting, spelling, and punctuation with no OCR issues, all the formatting issues are in the positive mix.
>
> I suppose in a week, if it learns some of these artifacts, it should be easy to identify which ones need to be filtered out.
I'm actually using a 50/50 mix for the two classes:
- `+ve` class: `75M` tokens from my own personal "books" dataset, with paragraphs taken from around 10-11k books.
- `+ve` class: `75M` tokens from my gutenberg-fiction-paragraphs dataset.
- `-ve` class: `75M` tokens from my slop-fiction-paragraphs dataset.
- `-ve` class: `75M` tokens from my literotica-fiction-paragraphs dataset.
They all have slightly different oddball quirks with the formatting, so this should hopefully make it harder for the training to latch onto any single weird formatting problem like this.
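The mixing itself can be sketched in a few lines. This is a hypothetical illustration of the 50/50 class mix described above; the source names are placeholders, not the actual dataset files:

```python
import random

# Placeholder sources: (class label, token budget). Two +ve and two -ve
# sources at 75M tokens each gives the 300M-token 50/50 mix described above.
sources = {
    "books":              ("+ve", 75_000_000),
    "gutenberg-fiction":  ("+ve", 75_000_000),
    "slop-fiction":       ("-ve", 75_000_000),
    "literotica-fiction": ("-ve", 75_000_000),
}

def take_tokens(paragraphs, budget):
    """Greedily sample (text, n_tokens) paragraphs until the budget is spent."""
    random.shuffle(paragraphs)
    out, used = [], 0
    for text, n_tokens in paragraphs:
        if used + n_tokens > budget:
            break
        out.append(text)
        used += n_tokens
    return out

# e.g.: mix = {name: take_tokens(load(name), budget)
#              for name, (_, budget) in sources.items()}
```

Pulling an equal token budget from each source is what dilutes any one dataset's formatting quirks, since no single quirk appears in more than a quarter of either class.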
Do you also find that with CR+, every single token in the system prompt has a strong impact on the logits throughout the context?
And it's the only model I've found that answers those trick riddles like this correctly every time:
"A woman and her son are in a car accident. The woman is sadly killed. The boy is rushed to hospital. When the doctor sees the boy, he says "I can't operate on this child, he is my son." How is this possible?"
The "v1.5", Command-A, Claude 4.6, etc. all say "the mother" or "two mothers".
I'm not sure if it's the unique full-attention mechanism / the other models ignoring the "he says", or if it's alignment tax, since other older models like Mixtral also say "mother"?
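One hypothetical way to quantify that system-prompt sensitivity, independent of any particular inference API, is the KL divergence between next-token distributions before and after editing a single prompt token. The logit vectors below are made-up toy values:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """KL divergence D(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

base    = softmax([2.0, 1.0, 0.5])  # toy logits without the edited token
shifted = softmax([1.0, 2.0, 0.5])  # toy logits after the system-prompt edit
print(kl(base, shifted))            # > 0: the edit moved the distribution
```

Averaging this over positions in the context would show whether a model like CR+ really does let every system-prompt token reach the logits more strongly than other models do.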
Opus-4.6's reply (non-thinking) was interesting because it seemed so conflicted:
## The Answer
The doctor is **the boy's mother** — or more precisely, **his other parent who is a woman/mother**.
Wait, let me be precise: The doctor is simply **the boy's mother**. Since the boy has two mothers (or the doctor is a woman), there is no contradiction.
The more straightforward answer: **The doctor is the boy's mother** — a woman. The riddle relies on the outdated assumption that doctors are male, causing people to puzzle over it unnecessarily.
This is a classic riddle designed to highlight **unconscious gender bias** — the tendency to default to imagining a male doctor, even when there's no reason to.