---
license: apache-2.0
language:
  - en
  - zh
base_model: Qwen/Qwen3-8B
pipeline_tag: text-generation
tags:
  - language model
  - parallel-decoding
---

# WeDLM-8B

WeDLM-8B is a diffusion language model that performs parallel decoding under standard causal attention, initialized from Qwen3-8B.

This is the base (pretrained) version. For the instruction-tuned version, see WeDLM-8B-Instruct.

📄 Paper (Coming Soon) | 🌐 Project Page | 💻 GitHub
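To make the decoding style above concrete, here is a deliberately toy sketch of confidence-ordered parallel unmasking. Everything in it (`toy_predict`, `VOCAB`, `MASK`, the fixed commit size `k`) is illustrative, not WeDLM's actual algorithm; it only shows how several masked positions can be committed per step instead of one token at a time.

```python
import random

# Toy illustration only: a fully masked sequence is denoised over several
# steps, committing the k most confident positions per step in parallel.
VOCAB = ["the", "theory", "of", "relativity", "states", "that"]
MASK = "<mask>"

def toy_predict(seq):
    """Stand-in predictor: (token, confidence) for each masked slot."""
    return {i: (VOCAB[i % len(VOCAB)], random.random())
            for i, t in enumerate(seq) if t == MASK}

seq = [MASK] * 8
k = 3  # number of tokens committed in parallel per step
step = 0
while MASK in seq:
    preds = toy_predict(seq)
    # Commit the k most confident predictions at once.
    for i, (tok, _) in sorted(preds.items(), key=lambda kv: -kv[1][1])[:k]:
        seq[i] = tok
    step += 1
    print(f"step {step}: {' '.join(seq)}")
```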

## Model Details

| Attribute | Value |
|---|---|
| Initialized From | Qwen3-8B |
| Parameters | 8B |
| Context Length | 32,768 |

## Quick Start (Recommended)

For fast inference, use the `wedlm` engine:

```bash
pip install git+https://github.com/tencent/WeDLM.git
```

```python
from wedlm import LLM, SamplingParams

llm = LLM(model="tencent/WeDLM-8B")

prompt = "The theory of relativity states that"
outputs = llm.generate([prompt], SamplingParams(max_tokens=256))

print(outputs[0]["text"])
```
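The engine call above generalizes to batches by passing several prompts at once. This sketch assumes only the `LLM.generate` / `SamplingParams` interface already shown; sampling parameters beyond `max_tokens` are not documented by this card.

```python
# Batched generation with the same API as in the Quick Start.
prompts = [
    "The theory of relativity states that",
    "In number theory, a prime is",
]
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
for out in outputs:
    print(out["text"])
```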

## HuggingFace Transformers

For training or simple forward passes, you can load the model via Transformers:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tencent/WeDLM-8B",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

inputs = tokenizer("The theory of relativity", return_tensors="pt").to(model.device)
outputs = model(**inputs)
```
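For a training-style step, standard Transformers causal-LM models accept `labels` and expose `loss` and `logits` on the output. Whether the remote-code WeDLM implementation follows this convention exactly is an assumption here; check the repository's modeling code before relying on it.

```python
# Sketch of a training-style step, assuming the standard HF causal-LM
# interface (.loss when labels are passed, .logits always). This is an
# assumption for this remote-code model, not a documented guarantee.
batch = tokenizer("The theory of relativity", return_tensors="pt").to(model.device)
out = model(**batch, labels=batch["input_ids"])
print(out.loss)           # scalar language-modeling loss
print(out.logits.shape)   # (batch, seq_len, vocab_size)

# Greedy next-token pick from the last position's logits.
next_id = out.logits[0, -1].argmax().item()
print(tokenizer.decode([next_id]))
```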

โš ๏ธ Note: The HuggingFace interface is for training/forward pass convenience. For optimized inference throughput, use the wedlm engine above.

## Performance

| Benchmark | Qwen3-8B | WeDLM-8B |
|---|---|---|
| ARC-C (0-shot) | 92.66 | 92.92 |
| GSM8K (3-shot) | 85.97 | 90.20 |
| MATH (4-shot) | 50.80 | 53.60 |
| HumanEval (4-shot) | 68.90 | 75.00 |
| MMLU (5-shot) | 74.03 | 75.46 |
| **Average** | 72.61 | 74.72 |

## Citation

Coming soon.

## License

Apache 2.0