Prometheus open source?

#2
by Trilogix1 - opened

Hi, I tried to check your Prometheus link to GitHub, but it 404s. I would like to try your method.
Does the repo exist? If so, can I have the link?

Hello, as far as I can see, Prometheus looks like Heretic. But I have a question: you wrote Qwen3.5-9B 2/200 (1%) 0.0105 50 (50 trials), but I ran 2000 trials and got nothing like your numbers :\ I mean this »
[Trial 167] Refusals: 59/100, KL divergence: 0.2764
[Trial 63] Refusals: 60/100, KL divergence: 0.2433
[Trial 133] Refusals: 61/100, KL divergence: 0.1853
[Trial 144] Refusals: 65/100, KL divergence: 0.1836
[Trial 135] Refusals: 67/100, KL divergence: 0.1631
[Trial 145] Refusals: 69/100, KL divergence: 0.1431
[Trial 142] Refusals: 73/100, KL divergence: 0.1249
[Trial 80] Refusals: 74/100, KL divergence: 0.0940
[Trial 138] Refusals: 75/100, KL divergence: 0.0868
[Trial 68] Refusals: 78/100, KL divergence: 0.0838
[Trial 114] Refusals: 79/100, KL divergence: 0.0565
[Trial 115] Refusals: 81/100, KL divergence: 0.0403
[Trial 177] Refusals: 89/100, KL divergence: 0.0313
[Trial 187] Refusals: 92/100, KL divergence: 0.0255
[Trial 189] Refusals: 96/100, KL divergence: 0.0165
[Trial 13] Refusals: 100/100, KL divergence: 0.0005

Hey! There are two things that could explain why your results look so different.

First, make sure you start Prometheus with the LLM judge enabled. Without it, refusal detection falls back to keyword matching, which is unreliable: it can flag a response as a refusal just because it contains words like "sorry" or "cannot", even when the model actually answered the question. With an LLM judge, each response is evaluated semantically, so the refusal rate and KL divergence numbers become meaningful.
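To illustrate the failure mode, here is a minimal sketch of keyword-based refusal detection. The keyword list and function are hypothetical stand-ins for illustration, not Prometheus's actual detector:

```python
# Hypothetical keyword list; real detectors use a longer one, but the
# failure mode is the same.
REFUSAL_KEYWORDS = ["i cannot", "i can't", "sorry", "i'm unable", "as an ai"]

def looks_like_refusal(response: str) -> bool:
    """Flag a response as a refusal if any keyword appears anywhere in it."""
    text = response.lower()
    return any(kw in text for kw in REFUSAL_KEYWORDS)

# A genuine refusal is caught...
print(looks_like_refusal("I cannot help with that request."))  # True
# ...but so is a helpful answer that merely apologizes first:
print(looks_like_refusal("Sorry for the delay! Here is the recipe: ..."))  # True (false positive)
```

An LLM judge avoids the false positive because it reads the whole response and asks "was the question actually answered?" instead of scanning for surface strings.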

Second, dataset quality matters a lot. The eval dataset needs to contain genuinely adversarial or sensitive prompts that would actually trigger refusals in an unabliterated model. If your dataset is too mild or generic, the model barely refuses anything to begin with, so you won't see a clean Pareto front and the optimization signal stays very noisy. Make sure you're using a well-curated refusal benchmark dataset, not a general-purpose one.

Re-run with both of these in place and your numbers should be much closer to what's in the writeup.
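For reference, the KL divergence column in your trial log presumably measures how far the modified model's output distribution has drifted from the original's, so lower means less collateral damage. A minimal sketch of the metric itself, with made-up next-token distributions purely for illustration:

```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(P || Q) for discrete distributions: sum of p_i * ln(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions: identical distributions give 0;
# any drift pushes the score above 0.
base     = [0.7, 0.2, 0.1]
modified = [0.6, 0.25, 0.15]
print(round(kl_divergence(base, base), 4))  # 0.0
print(kl_divergence(base, modified) > 0)    # True
```

That is why trials like `Refusals: 100/100, KL divergence: 0.0005` are uninteresting: the model is nearly unchanged, refusals included, while good trials trade a small KL increase for a large drop in refusals.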

I just uploaded the dataset that I used: https://huggingface.co/datasets/wangzhang/prometheus-datasets.

Great, thanks. I will try it; I hope this one will not affect accuracy as much as the others.
Edit: It works great :)

Thank you again. You are already cited; use a DOI for a better citation.

Trilogix1 changed discussion status to closed
