LLaDA-Prometheus

Model Description

This model is a fine-tuned version of the LLaDA 8B Base model, produced with a specialized Supervised Fine-Tuning (SFT) recipe. It discards the complex attention-mask design usually associated with block diffusion and keeps standard full attention, yet still supports block diffusion-style inference: generation proceeds block by block with a KV cache, and the model emits an EOS token when the response is complete so decoding can stop cleanly.

Key innovations:

  • Full Attention Preservation: Maintains standard full attention without the overhead of intricate masking.
  • Block Diffusion Inference: Enables iterative block-wise generation via KV cache management, ensuring coherent and controlled outputs.
  • EOS Handling: Trained to naturally emit EOS tokens at response boundaries.

This approach balances computational efficiency with high-quality generation, making it suitable for tasks requiring structured, multi-step reasoning.
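To make the block diffusion-style decoding concrete, below is a minimal conceptual sketch, not this model's actual implementation: each new block is appended as mask tokens, the model predicts all masked positions in parallel, and predictions above a confidence threshold are committed, repeating until the block is filled. The names forward_fn, mask_id, and max_steps_per_block are illustrative assumptions, and the toy forward function only exercises the control flow.

import torch

def block_diffusion_decode(forward_fn, prompt_ids, mask_id, eos_id,
                           max_gen_length=1024, block_length=64, threshold=0.9,
                           max_steps_per_block=16):
    """Illustrative block-wise decoding: fill a block with mask tokens, predict all
    masked positions in parallel, and commit predictions above `threshold`."""
    seq = prompt_ids.clone()
    for _ in range(max_gen_length // block_length):
        # Append a fresh block of mask tokens after the current sequence.
        block = torch.full((1, block_length), mask_id, dtype=seq.dtype)
        seq = torch.cat([seq, block], dim=-1)
        for _ in range(max_steps_per_block):
            masked = seq[0] == mask_id
            if not masked.any():
                break
            logits = forward_fn(seq)                      # (1, seq_len, vocab_size)
            conf, pred = logits.softmax(dim=-1).max(dim=-1)
            commit = masked & (conf[0] >= threshold)
            if not commit.any():
                # Always commit the single most confident masked position
                # so the loop makes progress.
                commit[conf[0].masked_fill(~masked, -1.0).argmax()] = True
            seq[0, commit] = pred[0, commit]
        # In the real model, the finished block's keys/values stay in the KV
        # cache so later blocks attend to it without recomputation.
        if (seq[0, prompt_ids.shape[1]:] == eos_id).any():
            break
    return seq

# Toy run with random logits, purely to exercise the control flow.
toy_forward = lambda ids: torch.randn(ids.shape[0], ids.shape[1], 128)
print(block_diffusion_decode(toy_forward, torch.tensor([[1, 2, 3]]),
                             mask_id=126, eos_id=2,
                             max_gen_length=64, block_length=16))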

Usage

To load and use this model with Hugging Face Transformers:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "maomaocun/LLaDA-Prometheus-no-template"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True).to("cuda")

prompt = "Can you tell me an engaging short story about a brave young astronaut who discovers an ancient alien civilization on a distant planet? Make it adventurous and heartwarming, with a twist at the end."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
input_ids = inputs['input_ids']
attention_mask = inputs.get('attention_mask', torch.ones_like(input_ids))
# Stream generation with the model's custom block diffusion keyword arguments.
for chunk in model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_gen_length=1024,
    block_length=64,
    threshold=0.9,
    streaming=True,
    eos_token_id=tokenizer.eos_token_id,  # pass the token id, not the token string
):
    # Decode the prompt plus the generated ids and truncate at the first EOS token.
    all_generated_ids = torch.cat([input_ids, chunk], dim=-1)
    text = tokenizer.batch_decode(all_generated_ids, skip_special_tokens=False)[0].split(tokenizer.eos_token)[0]
    print(text, end='', flush=True)

For finer control over block diffusion-style inference, the generation loop can be customized to manage the KV cache and per-block outputs as needed.
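As a small usage variation, the same streaming interface can be called with a larger block and a lower threshold, keeping only the final chunk. This assumes (it is not documented here) that threshold trades per-step confidence against speed and that each streamed chunk contains the cumulative generated ids, as the loop above suggests.

# Assumed semantics: larger block_length and lower threshold commit more tokens per
# step; `chunk` is taken to be the cumulative generated ids, as in the loop above.
output_ids = None
for chunk in model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_gen_length=512,
    block_length=128,
    threshold=0.7,
    streaming=True,
    eos_token_id=tokenizer.eos_token_id,
):
    output_ids = chunk  # keep only the latest chunk
if output_ids is not None:
    full_ids = torch.cat([input_ids, output_ids], dim=-1)
    print(tokenizer.batch_decode(full_ids, skip_special_tokens=True)[0])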

Benchmarks

The following table compares performance across key evaluation benchmarks. Results are reported as accuracy percentages where applicable.

| Model | GSM8K | GPQA | BBH | MATH | HumanEval | MBPP | MMLU-Pro | MMLU-Generate |
|---|---|---|---|---|---|---|---|---|
| LLaDA 8B Base in Pure Diffusion | 69.06 | 31.91 | 44.77 | 30.84 | 32.92 | 40.8 | 24.26 | 65.9 |
| LLaDA 8B Instruct in Pure Diffusion | 77.48 | 29.01 | 51.49 | 22.32 | 38.71 | 39.2 | 36.41 | 65.5 |
| LLaDA-Prometheus in Block Diffusion | 77.4 | 33.03 | 48.74 | 31.94 | 40.24 | 42 | 33.45 | 65.53 |

These results demonstrate competitive performance, particularly in code generation (HumanEval, MBPP) and reasoning tasks (BBH, MATH), with gains over both the Base and Instruct pure-diffusion baselines on several benchmarks.
