---
license: apache-2.0
datasets:
- HuggingFaceH4/ultrachat_200k
- BAAI/Infinity-Instruct
- HuggingFaceH4/ultrafeedback_binarized
- Intel/orca_dpo_pairs
- argilla/OpenHermesPreferences
- BramVanroy/dolly-15k-dutch
base_model:
- Zyphra/Zamba2-1.2B-instruct
library_name: transformers
---

# Model Card for Zamba2-1.2B-instruct-Dutch

Zamba2-1.2B-instruct-Dutch is a Dutch-language instruction-following model obtained through a two-stage fine-tuning process:

1. First stage (base instruction model by Zyphra):
   - Zyphra fine-tuned Zamba2-1.2B to create Zamba2-1.2B-instruct through:
     - SFT training on [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) and [Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)
     - DPO training on [ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized), [orca_dpo_pairs](https://huggingface.co/datasets/Intel/orca_dpo_pairs), and [OpenHermesPreferences](https://huggingface.co/datasets/argilla/OpenHermesPreferences)
2. Second stage (Dutch language adaptation):
   - Further fine-tuning of Zyphra's Zamba2-1.2B-instruct on the training split of the [dolly-15k-dutch](https://huggingface.co/datasets/BramVanroy/dolly-15k-dutch) dataset

The model maintains the core hybrid architecture of Zamba2 while being optimized for Dutch language understanding and generation.

## Quick start

### Prerequisites

To run Zamba2-1.2B-instruct-Dutch, clone Zyphra's fork of transformers:

1. `git clone https://github.com/Zyphra/transformers_zamba2.git`
2. `cd transformers_zamba2`
3. Install the repository: `pip install -e .`
4. `pip install accelerate`

### Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Instantiate model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-1.2B-instruct-Dutch")
model = AutoModelForCausalLM.from_pretrained("Zyphra/Zamba2-1.2B-instruct-Dutch", device_map="cuda", torch_dtype=torch.bfloat16)

# Format the input as a chat template
prompt = "Wat zijn de belangrijkste oorzaken van de val van het Romeinse Rijk?"
sample = [{'role': 'user', 'content': prompt}]
chat_sample = tokenizer.apply_chat_template(sample, tokenize=False)

# Tokenize input and generate output
input_ids = tokenizer(chat_sample, return_tensors='pt', add_special_tokens=False).to("cuda")
outputs = model.generate(**input_ids, max_new_tokens=150, return_dict_in_generate=False, output_scores=False, use_cache=True, num_beams=1, do_sample=False)
print(tokenizer.decode(outputs[0]))
```

## Training Details

The model was fine-tuned using the following approach (a configuration sketch follows the list):

1. Started with the base Zamba2-1.2B-instruct model
2. Fine-tuned on the dolly-15k-dutch dataset using optimized learning rates
3. Implemented memory optimization through gradient checkpointing
4. Utilized mixed precision training (bf16)
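The snippet below is a minimal sketch of how such a run could be set up with the Hugging Face `Trainer`. The hyperparameter values, the dataset column names, the `tokenize_fn` helper, and the output path are illustrative assumptions, not the exact configuration used to produce this model.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "Zyphra/Zamba2-1.2B-instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Training split of the Dutch Dolly dataset
dataset = load_dataset("BramVanroy/dolly-15k-dutch", split="train")

def tokenize_fn(example):
    # Column names are assumptions; adjust them to the actual dataset schema.
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    return tokenizer(text, truncation=True, max_length=2048)

tokenized = dataset.map(tokenize_fn, remove_columns=dataset.column_names)

training_args = TrainingArguments(
    output_dir="zamba2-1.2b-instruct-dutch",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-5,            # placeholder; see the learning-rate optimization below
    bf16=True,                     # mixed precision training
    gradient_checkpointing=True,   # memory optimization
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    # mlm=False makes the collator copy input_ids into labels (causal LM objective)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```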
### Fine-tuning Configuration

Fine-tuning uses a learning-rate optimization routine implemented by the custom `LROptimizerCallback` class, which can be found in _lr_optimizer.py_:

```python
from transformers import Trainer
from lr_optimizer import setup_training, LROptimizerCallback

# `model` and `training_args` are assumed to be defined beforehand
callback = LROptimizerCallback(
    num_trials=10,           # number of learning-rate candidates to evaluate
    lr_range=(1e-6, 1e-4)    # interval searched for the learning rate
)

trainer = Trainer(
    model=model,
    args=training_args,
    callbacks=[callback]
)

trainer.train()
```

## Model Architecture

Zamba2-1.2B-instruct-Dutch maintains the hybrid SSM-attention architecture of the base model:

- Backbone of Mamba2 layers interleaved with shared attention layers
- LoRA projection matrices for the shared transformer blocks
- Rotary position embeddings in the shared attention layer
- Original model embeddings concatenated into the shared attention block for improved information retention
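The listing below is a toy, illustrative sketch of that wiring — a shared attention block reused at several depths, a per-depth LoRA projection, and the original embeddings concatenated back in. It is not the actual Zamba2 implementation (which lives in Zyphra's transformers fork): `nn.Identity` stands in for the Mamba2 state-space layers, rotary position embeddings are omitted, and all class names are invented for illustration.

```python
import torch
import torch.nn as nn


class LoRAProjection(nn.Module):
    """Base projection plus a small low-rank update (LoRA)."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(dim, dim, bias=False)
        self.lora_a = nn.Linear(dim, rank, bias=False)
        self.lora_b = nn.Linear(rank, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.lora_b(self.lora_a(x))


class SharedAttentionBlock(nn.Module):
    """One attention block whose weights are reused at several depths.

    The original token embeddings are concatenated to the hidden state and
    projected back down before attention; each depth supplies its own
    LoRAProjection so the shared weights can still specialize per position
    in the stack. (Rotary position embeddings are omitted for brevity.)
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.in_proj = nn.Linear(2 * dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, hidden, embeddings, lora):
        x = self.in_proj(torch.cat([hidden, embeddings], dim=-1))
        x = lora(x)
        out, _ = self.attn(x, x, x, need_weights=False)
        return hidden + out


class ToyHybridBackbone(nn.Module):
    """Mamba2-style layers (stand-ins here) interleaved with the shared block."""

    def __init__(self, dim: int = 64, depth: int = 6, share_every: int = 3):
        super().__init__()
        # nn.Identity stands in for a real Mamba2 state-space layer.
        self.ssm_layers = nn.ModuleList([nn.Identity() for _ in range(depth)])
        self.shared_attention = SharedAttentionBlock(dim)
        self.loras = nn.ModuleList(
            [LoRAProjection(dim) for _ in range(depth // share_every)]
        )
        self.share_every = share_every

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        hidden = embeddings
        for i, ssm in enumerate(self.ssm_layers):
            hidden = ssm(hidden)
            if (i + 1) % self.share_every == 0:
                lora = self.loras[(i + 1) // self.share_every - 1]
                hidden = self.shared_attention(hidden, embeddings, lora)
        return hidden


backbone = ToyHybridBackbone()
print(backbone(torch.randn(1, 16, 64)).shape)  # torch.Size([1, 16, 64])
```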