anthracite-org (Anthracite)

grimjim

posted an update 16 days ago

Post

2895

I wanted to call attention to Arli AI's success in applying my recent modifications to refusal ablation to a MoE model. Nice work, @OwenArli !
ArliAI/GLM-4.5-Air-Derestricted
Ablation on a MoE model is no small thing; I expect preserving norms/magnitudes during intervention better respects routing compared to naive refusal ablation.

(I would have tagged their org earlier, but that feature seemed to be broken via "@")

ArliAI

4 replies

·

grimjim

posted an update 22 days ago

Post

3270

Going forward, I will be adopting the term Magnitude-Preserving Orthogonal Ablation (MPOA) for my recent work in mitigating model damage from abliteration. The technique potentially unlocks reasoning capacity previously occupied with safety refusal processing.

For details, start here: https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration

Showcase results: grimjim/gemma-3-12b-it-norm-preserved-biprojected-abliterated (outperforms base instruct on UGI Leaderboard NatInt)

(The existing name, while technically accurate, was a bit of a mouthful.)
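
A minimal NumPy sketch of the general idea behind a magnitude-preserving orthogonal ablation, not the author's implementation (the linked blog post is the reference for that). It assumes a weight matrix `W` whose columns write into the residual stream and a hypothetical refusal direction `r`; the function name and the choice to preserve per-column L2 norms are illustrative assumptions.

```python
import numpy as np

def ablate_preserving_norms(W, r, eps=1e-12):
    """Project r out of every column of W, then restore each column's original L2 norm."""
    r_hat = r / np.linalg.norm(r)                        # unit refusal direction (assumed given)
    orig_norms = np.linalg.norm(W, axis=0, keepdims=True)
    W_abl = W - np.outer(r_hat, r_hat @ W)               # remove the component along r_hat
    new_norms = np.linalg.norm(W_abl, axis=0, keepdims=True)
    # Rescale each column back to its original magnitude; leave near-zero columns unscaled.
    scale = np.where(new_norms > eps, orig_norms / np.maximum(new_norms, eps), 1.0)
    return W_abl * scale
```

Because each column is only multiplied by a scalar after the projection, the output stays orthogonal to `r` while the column norms match those of the original matrix.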

2 replies

·

grimjim

posted an update 24 days ago

Post

5016

Implemented a proof of concept sampler in pure PyTorch and transformers.

Max P consists of a dynamic token filter which applies Winsorization to cap the probabilities of top tokens. Specifically, a base probability in the range of [0,1] is used to cap individual token probability; the sampler then redistributes excess proportionally.

https://github.com/jim-plus/maxp-sampler-poc

Combined with Temperature and Min P, this could represent a more intuitive way of reducing repetition in text generation.
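
For readers who want to see the capping step concretely, here is a rough pure-Python sketch of one way to cap and redistribute. It is not taken from the linked repository; the function name `max_p_filter`, the repeated passes, and the error handling for caps that are too small are all assumptions of this sketch.

```python
def max_p_filter(probs, max_p):
    """Cap each probability at max_p and hand the excess to the uncapped
    tokens in proportion to their current probability."""
    n = len(probs)
    if not 0.0 < max_p <= 1.0:
        raise ValueError("this sketch expects max_p in (0, 1]")
    if max_p * n < 1.0:
        raise ValueError("max_p is too small to hold a distribution over n tokens")
    p = list(probs)
    capped = [False] * n
    for _ in range(n):  # every pass either caps a new token or stops
        excess = 0.0
        for i in range(n):
            if not capped[i] and p[i] > max_p:
                excess += p[i] - max_p
                p[i] = max_p
                capped[i] = True
        if excess == 0.0:
            break
        free = [i for i in range(n) if not capped[i]]
        free_mass = sum(p[i] for i in free)
        for i in free:
            # Proportional share; even split if the uncapped tokens carry no mass.
            share = p[i] / free_mass if free_mass > 0.0 else 1.0 / len(free)
            p[i] += excess * share
    return p
```

For example, `max_p_filter([0.7, 0.2, 0.1], 0.5)` returns approximately `[0.5, 0.333, 0.167]`. The extra passes handle the case where redistribution pushes a second token over the cap.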

2 replies

·

grimjim

posted an update 2 months ago

Post

786

I've uploaded abliteration code with support for sparsification of the refusal vector. It's poorly documented, but the code should be straightforward.
https://github.com/jim-plus/llm-abliteration
The code is built atop a fork that enabled abliteration to be performed on models loaded in 4-bit or 8-bit bitsandbytes quantization. TransformerLens is not required, just plain Transformers. For those previously unaware, this opens up abliteration experimentation to more people with local VRAM limitations.

Since performing abliteration on a quant involves precision and perplexity loss, it stands to reason that a small amount of magnitude sparsification could filter out some noise and possibly even reduce the damage inflicted on latent space via ablation of the refusal vector.
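
As an illustration of magnitude sparsification on a direction vector, here is a small NumPy sketch. The helper name and the `keep_fraction` parameter are hypothetical and may not match what the repository does; the point is only that the small-magnitude components are zeroed first and the vector is normalized afterwards.

```python
import numpy as np

def sparsify_then_normalize(v, keep_fraction):
    """Keep only the largest-magnitude entries of v, zero the rest, then rescale to unit norm."""
    k = max(1, int(round(keep_fraction * v.size)))   # number of entries to keep
    keep = np.argpartition(np.abs(v), -k)[-k:]       # indices of the k largest |v_i|
    out = np.zeros_like(v)
    out[keep] = v[keep]
    norm = np.linalg.norm(out)
    return out / norm if norm > 0 else out
```

For example, `sparsify_then_normalize(np.array([0.1, -3.0, 0.2, 4.0]), 0.5)` keeps the two largest entries and returns `[0.0, -0.6, 0.0, 0.8]`.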

There's a small but real acceleration of ablation of the refusal vector by reducing outer product operations from O(d²×n) to O(d×n), and then by pushing said computation layerwise to GPU. The code is hardcoded for CUDA acceleration currently. Normalization of the refusal vector was deferred in order to allow sparsification. In principle other behavior vector interventions could also be added and explored.
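
The complexity claim can be seen in a short NumPy sketch (function names are illustrative, and this is not the repository's code). For a weight matrix `W` of shape d×n and a unit direction `r_hat`, building the d×d projector first costs O(d²×n), while computing `r_hat @ W` first keeps everything at O(d×n). Both give the same result.

```python
import numpy as np

def ablate_naive(W, r_hat):
    """Build the d x d projector explicitly, then apply it: O(d^2 * n)."""
    P = np.outer(r_hat, r_hat)
    return W - P @ W

def ablate_fast(W, r_hat):
    """Compute the n projection coefficients first, then one rank-1 update: O(d * n)."""
    coeffs = r_hat @ W                      # shape (n,)
    return W - np.outer(r_hat, coeffs)      # shape (d, n)
```

The fast version also avoids materializing a d×d matrix, which matters when d is the hidden size of a large model.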

4 replies

·

Delta-Vector

in anthracite-org/magnum-v4-12b 4 months ago

Recommended Parameters?

4

#10 opened 4 months ago by

Maelle23

grimjim

in anthracite-org/magnum-v4-12b 4 months ago

Recommended Parameters?

4

#10 opened 4 months ago by

Maelle23

lucyknada

in anthracite-org/c2_logs_32k_llama3_qwen2_v1.3 8 months ago

Question

2

#2 opened 8 months ago by

mrfakename

lucyknada

updated a dataset 8 months ago

anthracite-org/c2_logs_32k_llama3_qwen2_v1.3

Viewer • Updated Apr 12 • 11k • 127 • 5

Delta-Vector

in anthracite-org/c2_logs_32k_llama3_qwen2_v1.3 8 months ago

Question

2

#2 opened 8 months ago by

mrfakename

grimjim

posted an update 8 months ago

Post

2353

I have recently been looking at a paper titled "Why Warmup the Learning Rate? Underlying Mechanisms and Improvements", by Dayal Singh Kalra and Maissam Barkeshli, and was struck by "warmup" being analogous to simulated annealing.
https://arxiv.org/abs/2406.09405
Taking the physical analogy further, the "warmup" is a stochastic process to knock the system out of current local minima, allowing easier transition toward newer minima. It works because it reduces "fit" and therefore "friction".
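
For anyone unfamiliar with what "warmup" means mechanically, a linear ramp is one common form. The sketch below is a generic illustration with made-up parameter names, not something taken from the paper.

```python
def linear_warmup_lr(step, peak_lr, warmup_steps):
    """Ramp the learning rate linearly up to peak_lr over warmup_steps, then hold it."""
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, (step + 1) / warmup_steps)
```

With `peak_lr=1e-3` and `warmup_steps=10`, step 0 uses one tenth of the peak rate and step 9 onward uses the full rate.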

lucyknada

in anthracite-org/README 9 months ago

knowledge and terminology

1

#1 opened 9 months ago by

Markobes

lucyknada

in anthracite-org/magnum-v2-72b-exl2 9 months ago

8.0bpw?

1

#3 opened 9 months ago by

svippixel

Delta-Vector

in anthracite-org/magnum-v2-72b-exl2 9 months ago

8.0bpw?

1

#3 opened 9 months ago by

svippixel

Undi95

posted an update 9 months ago

Post

12766

Hi there!

If you want to create your own thinking model or do a better MistralThinker, I just uploaded my entire dataset made with DeepSeek R1 and the Axolotl config (well, I made them public).

Axolotl config : Undi95/MistralThinker-v1.1

The dataset : Undi95/R1-RP-ShareGPT3

You can also read about everything I did in those two Discord screenshots from two days ago; I'm a little too lazy to rewrite it all, kek.

Hope you will use them!

6 replies

·

lucyknada

in anthracite-org/stheno-filtered-v1.1 10 months ago

License

4

#2 opened 10 months ago by

mrfakename

lucyknada

updated a dataset 10 months ago

anthracite-org/stheno-filtered-v1.1

Viewer • Updated Feb 15 • 26.8k • 124 • 12

grimjim

posted an update 10 months ago

Post

2439

This recent paper points to an explanation for the unreasonable effectiveness of Frankenmerges: Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (2502.05171)

Specifically, the duplication of layers in Frankenmerges serves a purpose similar to what occurs in their recurrent-depth architecture. Successful frankenmerges that operate without additional fine-tuning are able to recover or "heal" from any damage due to abrupt transitions between layer blocks. Operational replicated layer blocks can provide functional benefits grounded in latent reasoning. Frankenmerges can also result in hybrid reasoning, by splicing together the latent reasoning of different models.

Back in April 2024, I was able to duplicate a few layers in the Llama 3 8B model, turning it into a 9B model, without harming benchmarks significantly, despite any transition damage.
grimjim/llama-3-experiment-v1-9B
My informal experimentation suggested that latent reasoning circuits could occupy contiguous stacks of 2-4 layers, though the result was highly sensitive to the choice of transition location between layers.
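
To make the layer duplication concrete, here is a tiny sketch of how a passthrough-style layer order could be built. The helper name and the block indices in the example are hypothetical; they are not the layers actually duplicated in the model above.

```python
def duplicate_block(num_layers, start, end):
    """Return a layer order in which layers[start:end] run twice back to back."""
    order = list(range(num_layers))
    return order[:end] + order[start:end] + order[end:]
```

For a 32-layer model, `duplicate_block(32, 12, 16)` yields a 36-layer order in which layers 12-15 are followed by a second copy of layers 12-15. The seam where layer 15 feeds back into layer 12 is the kind of abrupt transition mentioned above.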

1 reply

·

Delta-Vector

in anthracite-org/magnum-v4-72b 10 months ago

You should finetune original R1 671B

5

#6 opened 10 months ago by

Ainonake

grimjim

posted an update 10 months ago

Post

2630

I've made yet another merge of reasoning models with incremental gains on the current Open LLM leaderboard.
open-llm-leaderboard/open_llm_leaderboard

Merging the DeepSeek R1 distillation of Llama 3.1 8B (at 10% task arithmetic weight, using the Llama 3.1 8B base model rather than the instruct model as the base) with a prior best merge resulted in a slightly lower IFEval, but a higher result in every other benchmark save for MMLU-PRO, which went down only marginally. MATH Lvl5 and GPQA went up palpably.
grimjim/DeepSauerHuatuoSkywork-R1-o1-Llama-3.1-8B
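
For context, task arithmetic adds weighted "task vectors" (fine-tuned weights minus base weights) back onto the base. The sketch below shows the arithmetic on toy NumPy arrays; the function name and the dict-of-arrays layout are assumptions for illustration, not the actual merge recipe used for this model.

```python
import numpy as np

def task_arithmetic(base, finetuned_models, weights):
    """merged = base + sum_i w_i * (finetuned_i - base), applied per tensor."""
    merged = {}
    for name, base_w in base.items():
        delta = np.zeros_like(base_w)
        for model, w in zip(finetuned_models, weights):
            delta += w * (model[name] - base_w)   # weighted task vector
        merged[name] = base_w + delta
    return merged
```

With a single fine-tuned model at weight 0.1, each merged tensor moves 10% of the way from the base toward that model.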

This result is currently my best Llama 3.1 8B merge result to date. The actual R1 distillation itself scored quite badly, so this would seem to be another case of unexpected formatting (reflected in IFEval) hurting the evaluation results, obscuring the strength of a model.

It is also possible to use the text generation feature of this model to generate roleplay completions. Based on informal testing, this model's bias toward problem-solving will subtly impact narration.

grimjim

posted an update 11 months ago

Post

1967

A recent merge has provided another interesting result on the current Open LLM leaderboard.
open-llm-leaderboard/open_llm_leaderboard

Combining an o1 reasoning merge with VAGOsolutions's Llama-3.1 SauerkrautLM 8B Instruct model resulted in a lower IFEval, but a higher result in every other benchmark. This result is currently my best Llama 3.1 8B merge result to date.
grimjim/SauerHuatuoSkywork-o1-Llama-3.1-8B
The results suggest that defects in output format and/or output parsing may be limiting benchmark performance of various o1 models.
