|
|
--- |
|
|
license: mit |
|
|
datasets: |
|
|
- wmt/wmt19 |
|
|
language: |
|
|
- en |
|
|
- de |
|
|
pipeline_tag: translation |
|
|
--- |
|
|
|
|
|
# Seq2Seq German-English Translation Model |
|
|
|
|
|
A sequence-to-sequence neural machine translation model that translates German text into English, built in PyTorch around an LSTM encoder-decoder architecture.
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model implements the classic seq2seq architecture from [Sutskever et al. (2014)](https://arxiv.org/abs/1409.3215) for German-English translation: |
|
|
|
|
|
- **Encoder**: 2-layer LSTM that processes German input sequences |
|
|
- **Decoder**: 2-layer LSTM that generates English output sequences |
|
|
- **Training Strategy**: Teacher forcing during training, autoregressive generation during inference (see the sketch under Model Architecture)
|
|
- **Vocabulary**: 30k German words, 25k English words |
|
|
- **Dataset**: Trained on 2M sentence pairs from WMT19 (a subset of the full ~35M-pair dataset)
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
``` |
|
|
German Input → Embedding → LSTM Encoder → Context Vector → LSTM Decoder → Linear Projection → English Output
|
|
``` |
|
|
|
|
|
**Hyperparameters:** |
|
|
- Embedding size: 256 |
|
|
- Hidden size: 512 |
|
|
- LSTM layers: 2 (in both encoder and decoder)
|
|
- Dropout: 0.3 |
|
|
- Batch size: 64 |
|
|
- Learning rate: 0.0003 |
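
To make the diagram and hyperparameters concrete, here is a minimal PyTorch sketch of such a model, including the teacher-forced training forward pass. Class and argument names are illustrative; the repository's actual implementation lives in `src/models/` and may differ in detail.

```python
import random

import torch
import torch.nn as nn


class Seq2Seq(nn.Module):
    """Illustrative LSTM encoder-decoder; not the repository's exact classes."""

    def __init__(self, src_vocab=30_000, tgt_vocab=25_000,
                 emb=256, hid=512, layers=2, dropout=0.3):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hid, layers, dropout=dropout, batch_first=True)
        self.decoder = nn.LSTM(emb, hid, layers, dropout=dropout, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)  # hidden state -> English logits

    def forward(self, src, tgt, teacher_forcing_ratio=1.0):
        # Encode the German sentence; the final (h, c) is the context vector.
        _, (h, c) = self.encoder(self.src_emb(src))
        logits = []
        token = tgt[:, :1]  # <START>
        for t in range(1, tgt.size(1)):
            step, (h, c) = self.decoder(self.tgt_emb(token), (h, c))
            step_logits = self.out(step)           # (batch, 1, tgt_vocab)
            logits.append(step_logits)
            if random.random() < teacher_forcing_ratio:
                token = tgt[:, t:t + 1]            # teacher forcing: ground truth
            else:
                token = step_logits.argmax(-1)     # use the model's own prediction
        return torch.cat(logits, dim=1)            # (batch, tgt_len - 1, tgt_vocab)
```

At inference time the same loop runs with `teacher_forcing_ratio=0.0`, so the decoder consumes its own predictions step by step.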
|
|
|
|
|
## Training Data |
|
|
|
|
|
- **Dataset**: WMT19 German-English Translation Task |
|
|
- **Size**: 2M sentence pairs (filtered subset) |
|
|
- **Preprocessing**: Sentences filtered by length (5-50 tokens) |
|
|
- **Tokenization**: Custom word-level tokenizer with special tokens (`<PAD>`, `<UNK>`, `<START>`, `<END>`); both preprocessing steps are sketched below
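
A hedged sketch of what this preprocessing amounts to; the repository's actual logic lives in `src/data/tokenization.py` and `scripts/data_preparation.py`, and the function names here are illustrative:

```python
from collections import Counter

PAD, UNK, START, END = "<PAD>", "<UNK>", "<START>", "<END>"


def length_filter(pairs, min_len=5, max_len=50):
    """Keep pairs where both sides fall within the 5-50 token range."""
    return [(de, en) for de, en in pairs
            if min_len <= len(de.split()) <= max_len
            and min_len <= len(en.split()) <= max_len]


def build_vocab(sentences, max_size):
    """Word-level vocabulary: special tokens first, then most frequent words."""
    counts = Counter(word for s in sentences for word in s.split())
    specials = [PAD, UNK, START, END]
    words = [w for w, _ in counts.most_common(max_size - len(specials))]
    return {w: i for i, w in enumerate(specials + words)}


def encode(sentence, vocab):
    """Map words to ids, falling back to <UNK>, and wrap with <START>/<END>."""
    ids = [vocab.get(w, vocab[UNK]) for w in sentence.split()]
    return [vocab[START]] + ids + [vocab[END]]
```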
|
|
|
|
|
## Performance |
|
|
|
|
|
**Training Results (5 epochs):** |
|
|
- Training loss: 4.0949 → 3.1843 (≈22% relative reduction)


- Validation loss: 4.1918 → 3.8537 (≈8% relative reduction)
|
|
- Training Device: Apple Silicon (MPS) |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Quick Start |
|
|
|
|
|
```python |
|
|
# This is a custom PyTorch model, not a Transformers model |
|
|
# Download the files and use with the provided inference script |
|
|
|
|
|
import requests |
|
|
from pathlib import Path |
|
|
|
|
|
# Download model files |
|
|
base_url = "https://huggingface.co/sumitdotml/seq2seq-de-en/resolve/main" |
|
|
files = ["best_model.pt", "german_tokenizer.pkl", "english_tokenizer.pkl"] |
|
|
|
|
|
for file in files:
    response = requests.get(f"{base_url}/{file}")
    response.raise_for_status()  # fail loudly on a bad download
    Path(file).write_bytes(response.content)
    print(f"Downloaded {file}")
|
|
``` |
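
Once the files are downloaded, loading them would look roughly like the sketch below. The checkpoint layout is repository-specific (`scripts/inference.py` handles it for you), so treat this as a way to inspect the files rather than the project's actual loading code. Note that unpickling the tokenizers requires the repository's tokenizer classes to be importable.

```python
import pickle

import torch

# The tokenizers are plain pickled Python objects.
with open("german_tokenizer.pkl", "rb") as f:
    german_tokenizer = pickle.load(f)
with open("english_tokenizer.pkl", "rb") as f:
    english_tokenizer = pickle.load(f)

# Checkpoint contents are repository-specific; inspect the keys first.
checkpoint = torch.load("best_model.pt", map_location="cpu")
print(checkpoint.keys() if isinstance(checkpoint, dict) else type(checkpoint))
```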
|
|
|
|
|
### Translation Examples |
|
|
|
|
|
```bash |
|
|
# Interactive mode |
|
|
python scripts/inference.py --interactive
|
|
|
|
|
# Single translation |
|
|
python scripts/inference.py --sentence "Hallo, wie geht es dir?" --verbose
|
|
|
|
|
# Demo mode |
|
|
python scripts/inference.py
|
|
``` |
|
|
|
|
|
**Example Translations:** |
|
|
- `"Das ist ein gutes Buch."` β `"this is a good idea."` |
|
|
- `"Wo ist der Bahnhof?"` β `"where is the <UNK>"` |
|
|
- `"Ich liebe Deutschland."` β `"i share."` |
|
|
|
|
|
## Files Included |
|
|
|
|
|
- `best_model.pt`: PyTorch model checkpoint (trained weights + architecture) |
|
|
- `german_tokenizer.pkl`: German vocabulary and tokenization logic |
|
|
- `english_tokenizer.pkl`: English vocabulary and tokenization logic |
|
|
|
|
|
## Installation & Setup |
|
|
|
|
|
1. **Clone the repository:** |
|
|
```bash |
|
|
git clone https://github.com/sumitdotml/seq2seq |
|
|
cd seq2seq |
|
|
``` |
|
|
|
|
|
2. **Set up environment:** |
|
|
```bash |
|
|
uv venv && source .venv/bin/activate # or python -m venv .venv |
|
|
uv pip install torch requests tqdm # or pip install torch requests tqdm |
|
|
``` |
|
|
|
|
|
3. **Download model:** |
|
|
```bash |
|
|
python scripts/download_pretrained.py |
|
|
``` |
|
|
|
|
|
4. **Start translating:** |
|
|
```bash |
|
|
python scripts/inference.py --interactive |
|
|
``` |
|
|
|
|
|
## Model Architecture Details |
|
|
|
|
|
The model uses a custom implementation with these components: |
|
|
|
|
|
- **Encoder** (`src/models/encoder.py`): LSTM-based encoder with embedding layer |
|
|
- **Decoder** (`src/models/decoder.py`): LSTM-based decoder without an attention mechanism
|
|
- **Seq2Seq** (`src/models/seq2seq.py`): Main model combining encoder and decoder, plus the generation logic (sketched below)
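
The generation logic is autoregressive: feed `<START>`, then repeatedly feed back the decoder's own prediction until `<END>` or a length cap. A minimal greedy-decoding sketch, assuming batch-first LSTM modules as in the architecture sketch above (function and argument names are illustrative):

```python
import torch


@torch.no_grad()
def greedy_decode(tgt_emb, decoder, out, h, c, start_id, end_id, max_len=50):
    """Greedy decoding from an encoder context (h, c), batch size 1."""
    token = torch.tensor([[start_id]])       # shape (1, 1)
    output_ids = []
    for _ in range(max_len):
        step, (h, c) = decoder(tgt_emb(token), (h, c))
        token = out(step).argmax(-1)         # most likely next word
        if token.item() == end_id:
            break
        output_ids.append(token.item())
    return output_ids
```

Beam search would be a natural extension, but greedy decoding is the simplest instance of this loop.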
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Vocabulary constraints**: Limited to 30k German / 25k English words |
|
|
- **Training data**: Only 2M sentence pairs (vs 35M in full WMT19) |
|
|
- **No attention mechanism**: Basic encoder-decoder without attention |
|
|
- **Simple tokenization**: Word-level tokenization without subword units |
|
|
- **Translation quality**: Suitable for basic phrases, struggles with complex sentences |
|
|
|
|
|
## Training Details |
|
|
|
|
|
**Environment:** |
|
|
- Framework: PyTorch 2.0+ |
|
|
- Device: Apple Silicon (MPS acceleration) |
|
|
- Training length: 5 epochs
|
|
- Validation strategy: Hold-out validation set |
|
|
|
|
|
**Optimization:** |
|
|
- Optimizer: Adam (lr=0.0003) |
|
|
- Loss function: CrossEntropyLoss (ignoring padding) |
|
|
- Gradient clipping: max norm 1.0
|
|
- Scheduler: StepLR (step_size=3, gamma=0.5) |
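
Put together, one training step under these settings would look roughly like this. A sketch assuming a `model` like the architecture sketch above, a `train_loader` yielding `(src, tgt)` batches of token ids, and `<PAD>` at index 0:

```python
import torch
import torch.nn as nn

PAD_ID = 0  # assumes <PAD> is id 0, as in the vocabulary sketch above

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)  # padding adds no loss

for epoch in range(5):
    for src, tgt in train_loader:
        optimizer.zero_grad()
        logits = model(src, tgt)                 # (batch, tgt_len - 1, vocab)
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         tgt[:, 1:].reshape(-1))
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    scheduler.step()                             # halve the LR every 3 epochs
```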
|
|
|
|
|
## Reproduce Training |
|
|
|
|
|
```bash |
|
|
# Full training pipeline |
|
|
python scripts/data_preparation.py # Download WMT19 data |
|
|
python src/data/tokenization.py # Build vocabularies |
|
|
python scripts/train.py # Train model |
|
|
|
|
|
# For full dataset training, modify data_preparation.py: |
|
|
# use_full_dataset = True # Line 133-134 |
|
|
``` |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{seq2seq-de-en, |
|
|
author = {sumitdotml}, |
|
|
title = {German-English Seq2Seq Translation Model}, |
|
|
year = {2025}, |
|
|
url = {https://huggingface.co/sumitdotml/seq2seq-de-en}, |
|
|
note = {PyTorch implementation of sequence-to-sequence translation} |
|
|
} |
|
|
``` |
|
|
|
|
|
## References |
|
|
|
|
|
- Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. NeurIPS. |
|
|
- WMT19 Translation Task: https://huggingface.co/datasets/wmt/wmt19 |
|
|
|
|
|
## License |
|
|
|
|
|
MIT License. See the repository for the full license text.
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions about this model or training code, please open an issue in the [GitHub repository](https://github.com/sumitdotml/seq2seq). |