“The doom lies in yourself, not in your name.”

#15
by jukofyork - opened

Continuation of Wur doomed!.

For longer text chunks or stories, https://pastebin.com works great and helps prevent the thread from slowing down!

🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧
🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛🟧
🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧🟧
⬜🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧⬛🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜⬜🟧🟧⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜⬜🟧🟧🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛⬛🟧⬜
⬜🟧⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛🟧⬜
⬜🟧⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛🟧⬜
⬜🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧⬜

jukofyork pinned discussion

The doom is still buried within Command-A for sure.

The doom is still buried within Command-A for sure.

Only another 38 days to go:

image.png


It's actually going really well and I'm pretty sure it will be mostly converged within another couple of days:

image.png

🤞

A step 601 preview - all with temperature = 0:

https://pastebin.com/GASKaHTk

https://pastebin.com/CRT81QLb

  • It's still messing up some line endings, but I can live with that if it works... It can likely be fixed later using the new class 0 random data if it turns out to be a problem.
  • The Grimdark story was noticeably (much!) better compared to the inverse.
  • The Battlestar Galactica story showed that even though Q8_0, F16 and BF16 all diverge slightly from F32, it's not clearly making them any worse (I actually liked the Q8_0 story best!).
Size Name
287M command-a-03-2025-lora-Q8_0.gguf
541M command-a-03-2025-lora-F16.gguf
541M command-a-03-2025-lora-BF16.gguf
1.1G command-a-03-2025-lora-F32.gguf
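As a sanity check, those sizes line up with the bytes-per-weight of each GGUF type (F32 = 4 bytes, F16/BF16 = 2 bytes, Q8_0 = 34 bytes per 32-weight block, so ~1.06 bytes), ignoring the small GGUF header/metadata. A rough sketch, with the parameter count back-calculated from the F16 file rather than taken from any config:

```python
# Rough sanity check of LoRA GGUF sizes from bytes-per-weight.
# The parameter count is back-calculated from the reported F16 file size,
# not read from any real config, so treat it as an estimate.
F16_BYTES = 541 * 1024**2          # reported F16 file size (541M)
n_params = F16_BYTES / 2           # F16 stores 2 bytes per weight

bytes_per_weight = {
    "F32":  4.0,
    "F16":  2.0,
    "BF16": 2.0,
    "Q8_0": 34 / 32,               # 32 int8 weights + one fp16 scale per block
}

for name, bpw in bytes_per_weight.items():
    size_mib = n_params * bpw / 1024**2
    print(f"{name:>4}: ~{size_mib:,.0f} MiB")
# F32 comes out around ~1.06 GiB and Q8_0 around ~287 MiB, matching the listing above.
```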

It still has a way to go before it starts to converge, but I would think by step 1000 it will be pretty close:

image.png

566 responses in the previous thread! At this rate, we may be the reason hf staff implement a multi-page view for discussions.

This was posted on Hacker News today:

https://outsidetext.substack.com/p/how-does-a-blind-model-see-the-earth?selection=5413dcae-b9f4-4adb-8826-d48e3908de2a#:~:text=Wow%2C%20best%20rendition%20of%20the%20Global%20West%20so%20far

Absolutely fascinating!

That was really cool. Thanks for sharing!

Yeah, and llama-3.1:405b doing so well was quite a surprise too (and it makes you a bit sad that everything seems to be moving away from large dense models).

PSA in case anyone here missed it:

https://old.reddit.com/r/LocalLLaMA/comments/1s2c1w4/litellm_1827_and_1828_on_pypi_are_compromised_do/

We have just been compromised, and thousands of people likely have been as well; more details are being updated here: https://futuresearch.ai/blog/litellm-pypi-supply-chain-attack/

Maybe avoiding the original Command R as it's a nightmare for context.

If you are using ik_llama.cpp then this might be worth a read:

https://github.com/ikawrakow/ik_llama.cpp/pull/1033

image

IIRC, the original Command R model (i.e. the 35B-parameter one that doesn't use GQA) was particularly amenable to KV-cache quantisation anyway, so using this could make Q4_0 quite viable!
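For a rough sense of why quantised KV-cache matters so much for the non-GQA Command R: with no GQA, K and V each store the full hidden size per layer per token. A back-of-the-envelope sketch, assuming (from memory) 40 layers and a hidden size of 8192 for the 35B model; the Q8_0 and Q4_0 figures use the usual 34- and 18-byte blocks per 32 values:

```python
# Back-of-the-envelope KV-cache size for a non-GQA model.
# Architecture numbers are assumed from memory for the original 35B Command R
# (40 layers, hidden size 8192, no GQA so K/V are full hidden size).
n_layers, d_model = 40, 8192

bytes_per_value = {
    "F16":  2.0,
    "Q8_0": 34 / 32,   # 32 int8 values + one fp16 scale
    "Q4_0": 18 / 32,   # 32 4-bit values + one fp16 scale
}

for ctx in (32_768, 131_072):
    for name, bpv in bytes_per_value.items():
        size_gib = 2 * n_layers * d_model * bpv * ctx / 1024**3   # 2 = K and V
        print(f"ctx={ctx:>6}  {name:>4}: ~{size_gib:5.1f} GiB")
# At 128k context this is roughly 160 GiB at F16 vs ~45 GiB at Q4_0,
# which is why the full-attention Command R is such a nightmare for context.
```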

@jukofyork out of your medium creative writing models which would you say is the best in terms of dialogue? Maybe avoiding the original Command R as it's a nightmare for context.

Sadly, I think they are all a bit buggy, and I had mostly given up on trying to get it working until today ☹️

The problem is that I really want to do this without lots of complex data preparation pipelines, as I fear that could end up causing even more problems and bias... I really want to just prepare some contrasting data from a pair of datasets and get that working, but alas everything I have tried almost works but ends up adding subtle bugs like scrambling the end of paragraphs or similar.
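Roughly what I mean by "just prepare some contrasting data" is something like the sketch below (file names, the JSONL layout, and the paragraph-level chunking are all made up for illustration; the real run obviously tokenises into fixed-length sequences):

```python
# Illustrative sketch only: pair a "positive" (target prose) dataset with a
# "negative" (synthetic slop) dataset into labelled examples, with no other
# preprocessing. File names and the JSONL layout are hypothetical.
import json
import random

def load_paragraphs(path):
    """Split a plain-text file into non-empty paragraphs."""
    with open(path, encoding="utf-8") as f:
        return [p.strip() for p in f.read().split("\n\n") if p.strip()]

positive = load_paragraphs("human_fiction.txt")    # class 1: target style
negative = load_paragraphs("synthetic_slop.txt")   # class 0: contrast style

examples = [{"text": t, "class": 1} for t in positive] + \
           [{"text": t, "class": 0} for t in negative]
random.shuffle(examples)

with open("contrastive_mix.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```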

I'm also not all that convinced my slop-fiction-paragraphs dataset is actually of much use for modern models, as they have a completely different "slop profile" (e.g. "NOT X, BUT Y" and fewer "tapestries" and "shivers").

If I do succeed in getting anywhere then I'll happily retry on some medium/small models though! I'm using 100 tokens per tunable parameter (rank-16 for 8 layers = 3.146M), so it will take nearly a week to train on a 300M token sample:

SERVER-1: [2026-03-24 15:56:11.790] [INFO] [qlora-pipe-lite] initial evaluation loss: 2.9341
SERVER-1: [2026-03-24 15:56:12.324] [INFO] [qlora-pipe-lite] before GAS splitting, batch size: 20, total tokens: 81920
SERVER-1: [2026-03-24 16:02:11.153] [INFO] [qlora-pipe-lite] before GAS splitting, batch size: 20, total tokens: 81920
SERVER-1: [2026-03-24 16:02:50.076] [INFO] [qlora-pipe-lite] step: 1 / 1194, loss: 2.9329, lr: 1.418e-05, throughput: 0.151 sequences/s, elapsed: 6m38s, eta: 5d11h59m
SERVER-1: [2026-03-24 16:08:49.264] [INFO] [qlora-pipe-lite] before GAS splitting, batch size: 20, total tokens: 81920
SERVER-1: [2026-03-24 16:09:28.036] [INFO] [qlora-pipe-lite] step: 2 / 1194, loss: 2.9284, lr: 2.005e-05, throughput: 0.151 sequences/s, elapsed: 13m16s, eta: 5d11h49m
SERVER-1: [2026-03-24 16:15:27.233] [INFO] [qlora-pipe-lite] before GAS splitting, batch size: 20, total tokens: 81920
SERVER-1: [2026-03-24 16:16:06.056] [INFO] [qlora-pipe-lite] step: 3 / 1194, loss: 2.9223, lr: 2.456e-05, throughput: 0.151 sequences/s, elapsed: 19m54s, eta: 5d11h42m
SERVER-1: [2026-03-24 16:22:05.211] [INFO] [qlora-pipe-lite] before GAS splitting, batch size: 20, total tokens: 81920
SERVER-1: [2026-03-24 16:22:43.960] [INFO] [qlora-pipe-lite] step: 4 / 1194, loss: 2.9232, lr: 2.835e-05, throughput: 0.151 sequences/s, elapsed: 26m32s, eta: 5d11h34m
SERVER-1: [2026-03-24 16:28:43.097] [INFO] [qlora-pipe-lite] before GAS splitting, batch size: 20, total tokens: 81920
SERVER-1: [2026-03-24 16:29:21.861] [INFO] [qlora-pipe-lite] step: 5 / 1194, loss: 2.9294, lr: 3.170e-05, throughput: 0.151 sequences/s, elapsed: 33m10s, eta: 5d11h27m
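The "100 tokens per tunable parameter" budget above works out roughly like this (a sketch only: the hidden size of 12288 is an assumption for Command-A, and the adapter is taken to be a single rank-16 A/B pair per layer on 8 layers, which reproduces the ~3.146M figure):

```python
# Rough token-budget arithmetic behind "100 tokens per tunable parameter".
# Hidden size of 12288 is assumed for Command-A; the adapter is taken to be
# one rank-16 A/B pair per layer on 8 layers (which matches ~3.146M).
d_model, rank, n_layers = 12288, 16, 8

trainable = 2 * rank * d_model * n_layers     # A (r x d) + B (d x r) per layer
tokens_needed = 100 * trainable               # 100 tokens per trainable param

print(f"trainable params: {trainable / 1e6:.3f}M")      # ~3.146M
print(f"token budget:     {tokens_needed / 1e6:.1f}M")  # ~314.6M, i.e. the ~300M sample
```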

That's a crazy amount of effort so I'm not surprised you want to make it easier!

There was a project I saw recently for making synthetic datasets easily; maybe you could modify it to make things easier?

https://github.com/Ready-Art/Synthetic-Dataset-Generator

There was also this, which looked quite cool, but again I'm not sure how useful it is to you.
https://github.com/Schneewolf-Labs/Merlina

Again, I'm no expert at this and have never tuned a model in my life!

(eta: 5d11h27m)

Hey, remember to set up checkpoints every n steps, as I remember your training crashed last time.
I tend to create a private hf repo during training and push them as they're created.

but alas everything I have tried almost works but ends up adding subtle bugs like scrambling the end of paragraphs or similar.

I think things like this could be problematic

image

(eta: 5d11h27m)

Hey, remember to set up checkpoints every n steps, as I remember your training crashed last time.
I tend to create a private hf repo during training and push them as they're created.

It's set up to checkpoint every hour, so hopefully it'll be OK.
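For anyone who wants to do the "push each checkpoint to a private repo" thing, a minimal sketch with huggingface_hub might look like this (the repo id and checkpoint path are placeholders, not the actual training setup):

```python
# Minimal sketch: push each new checkpoint folder to a private Hugging Face repo.
# Repo id and checkpoint path are placeholders.
from huggingface_hub import HfApi

api = HfApi()  # assumes you are already logged in via `huggingface-cli login`
repo_id = "your-username/command-a-lora-checkpoints"

api.create_repo(repo_id, private=True, exist_ok=True)

def push_checkpoint(local_dir: str, step: int) -> None:
    """Upload one checkpoint directory under its own subfolder in the repo."""
    api.upload_folder(
        folder_path=local_dir,
        repo_id=repo_id,
        path_in_repo=f"step-{step}",
        commit_message=f"checkpoint at step {step}",
    )

# e.g. call after each hourly checkpoint is written:
# push_checkpoint("output/step-601", 601)
```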

but alas everything I have tried almost works but ends up adding subtle bugs like scrambling the end of paragraphs or similar.

I think things like this could be problematic

image

Yeah, but hopefully this will get drowned out: 100 tokens per trainable parameter and using a pair of positive/negative datasets should be massive overkill, but I've been hopeful before 🤞

100 tokens per trainable parameter and using a pair of positive/negative datasets should be massive overkill

Since the synth-slop dataset (just from reading random samples) has perfect formatting, spelling, and punctuation with no OCR issues, all the formatting issues are in the positive mix.
I suppose in a week, if it learns some of these artifacts, it should be easy to identify which ones need to be filtered out.

For audio / TTS, I find the models love to associate "truncated sentence" artifacts with specific voices, or with a particular source dataset that had issues like that (which makes it easy to filter out for the next iteration).

100 tokens per trainable parameter and using a pair of positive/negative datasets should be massive overkill

Since the synth-slop dataset (just from reading random samples) has perfect formatting, spelling, and punctuation with no OCR issues, all the formatting issues are in the positive mix.
I suppose in a week, if it learns some of these artifacts, it should be easy to identify which ones need to be filtered out.

I'm actually using a 50/50 mix for the two classes:

They all have slightly different oddball quirks with the formatting, so this should hopefully make it harder for the training to latch onto any single weird formatting problem like this.

Do you also find that with CR+, every single token in the system prompt has a strong impact on the logits throughout the context?

And it's the only model I've found that answers trick riddles like this correctly every time:
"A woman and her son are in a car accident. The woman is sadly killed. The boy is rushed to hospital. When the doctor sees the boy, he says "I can't operate on this child, he is my son." How is this possible?"
The "v1.5", Command-A, Claude 4.6, etc all say "the mother" or "two mothers".

I'm not sure if it's the unique full-attention mechanism / the other models ignoring the "he says", or if it's alignment tax, since other older models like Mixtral say "mother"?

Opus-4.6's reply (non-thinking) was interesting because it seemed so conflicted:

## The Answer

The doctor is **the boy's mother** — or more precisely, **his other parent who is a woman/mother**.

Wait, let me be precise: The doctor is simply **the boy's mother**. Since the boy has two mothers (or the doctor is a woman), there is no contradiction.

The more straightforward answer: **The doctor is the boy's mother** — a woman. The riddle relies on the outdated assumption that doctors are male, causing people to puzzle over it unnecessarily.

This is a classic riddle designed to highlight **unconscious gender bias** — the tendency to default to imagining a male doctor, even when there's no reason to.
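If anyone wants to run the same comparison locally, a quick sketch against an OpenAI-compatible endpoint (e.g. a local llama-server) at temperature 0; the base URL and model names are placeholders:

```python
# Quick sketch: send the same trick riddle to an OpenAI-compatible endpoint
# at temperature 0. Base URL and model names are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

RIDDLE = (
    'A woman and her son are in a car accident. The woman is sadly killed. '
    'The boy is rushed to hospital. When the doctor sees the boy, he says '
    '"I can\'t operate on this child, he is my son." How is this possible?'
)

for model in ("command-r-plus", "command-a"):
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": RIDDLE}],
        temperature=0,
    )
    print(f"--- {model} ---")
    print(reply.choices[0].message.content)
```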

Sign up or log in to comment