---
library_name: transformers
language:
- hsb
- dsb
datasets:
- HuggingFaceFW/fineweb-2
- CohereLabs/aya_dataset
- Magpie-Align/Magpie-Llama-3.1-Pro-MT-300K-Filtered
- OpenAssistant/oasst2
- ai2-adapt-dev/flan_v2_converted
- utter-project/EuroBlocks-SFT-Synthetic-1124
base_model:
- Qwen/Qwen2.5-3B-Instruct
---

# Qwen2.5-3B-Instruct-hsb-dsb

This model is the TartuNLP submission to the **WMT25 Shared Task on Limited Resource Slavic Languages**, covering **Upper Sorbian** (hsb) and **Lower Sorbian** (dsb). It is based on **Qwen2.5-3B-Instruct** and was adapted through continued pretraining on Sorbian monolingual and parallel data, followed by tuning on general instruction datasets. The model jointly supports machine translation (MT) and question answering (QA) for both Sorbian languages and achieved the top rank in the shared task.

⚠️ **Note:** This model is research-focused and has not been tested for general usage. Use at your own risk.

## Example usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tartuNLP/Qwen2.5-3B-Instruct-hsb-dsb"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The task is specified in the system prompt; the user message carries the input text.
messages = [
    {"role": "system", "content": "Translate the following text from German to Upper Sorbian."},
    {"role": "user", "content": "Wie lange willst du noch bleiben?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Strip the prompt tokens so that only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

For question answering, only the prompt changes; a sketch is provided at the end of this card.

## Shared task results

Results as shared by the organizers ([source](https://github.com/TUM-NLP/llms-limited-resources2025/blob/main/results.md)). The DE-HSB and DE-DSB columns report German→Sorbian translation scores; the HSB-QA and DSB-QA columns report question-answering scores.

**Upper Sorbian:**

|              | DE-HSB    | points | HSB-QA    | points | final points |
|--------------|-----------|--------|-----------|--------|--------------|
| **TartuNLP** | 86.33     | 4      | **58.10** | 4      | 8            |
| NRC          | **87.20** | 4      | 29.05     | 1      | 5            |
| SDKM         | 75.73     | 2      | 55.24     | 3      | 5            |
| baseline     | 13.88     | 1      | 42.86     | 2      | 3            |

**Lower Sorbian:**

|              | DE-DSB    | points | DSB-QA    | points | final points |
|--------------|-----------|--------|-----------|--------|--------------|
| **TartuNLP** | 78.20     | 4      | **57.56** | 4      | 8            |
| NRC          | **78.24** | 4      | 32.20     | 1      | 5            |
| SDKM         | 64.34     | 2      | 51.71     | 3      | 5            |
| baseline     | 12.21     | 1      | 45.85     | 2      | 3            |

## Training details

- Total training tokens: ~1.2B
- Sequence length: 4096
- Training hardware: LUMI supercomputer (AMD MI250x GPUs)
- Training time: ~139 GPU-hours

## Citation info

To be announced.
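
## Question answering example

The QA track uses the same chat interface as translation; only the prompt differs. The snippet below is a minimal sketch that reuses the `model` and `tokenizer` loaded in the example above. The system prompt wording and the placeholder question are illustrative assumptions, not necessarily the exact QA prompt format used in the shared task.

```python
# Minimal QA sketch: reuses `model` and `tokenizer` from "Example usage" above.
# NOTE: the system prompt is an assumption for illustration; the exact QA prompt
# used during training/evaluation is not documented on this card.
qa_messages = [
    {"role": "system", "content": "Answer the following question in Upper Sorbian."},
    {"role": "user", "content": "<your Upper Sorbian question here>"},  # placeholder input
]
text = tokenizer.apply_chat_template(
    qa_messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
# Decode only the newly generated tokens, as in the translation example.
answer = tokenizer.batch_decode(
    [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)],
    skip_special_tokens=True
)[0]
print(answer)
```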