---
license: mit
datasets:
- wmt/wmt19
language:
- en
- de
pipeline_tag: translation
---

# Seq2Seq German-English Translation Model

A sequence-to-sequence neural machine translation model that translates German text to English, built in PyTorch with an LSTM encoder-decoder architecture.

## Model Description

This model implements the classic seq2seq architecture from [Sutskever et al. (2014)](https://arxiv.org/abs/1409.3215) for German-English translation:

- **Encoder**: 2-layer LSTM that processes German input sequences
- **Decoder**: 2-layer LSTM that generates English output sequences
- **Training Strategy**: Teacher forcing during training, autoregressive generation during inference
- **Vocabulary**: 30k German words, 25k English words
- **Dataset**: Trained on 2M sentence pairs from WMT19 (a subset of the full 35M-pair dataset)

## Model Architecture

```
German Input → Embedding → LSTM Encoder → Context Vector → LSTM Decoder → Embedding → English Output
```

**Hyperparameters:**

- Embedding size: 256
- Hidden size: 512
- LSTM layers: 2 (both encoder and decoder)
- Dropout: 0.3
- Batch size: 64
- Learning rate: 0.0003

## Training Data

- **Dataset**: WMT19 German-English Translation Task
- **Size**: 2M sentence pairs (filtered subset)
- **Preprocessing**: Sentences filtered by length (5-50 tokens)
- **Tokenization**: Custom word-level tokenizer with special tokens (`<pad>`, `<sos>`, `<eos>`, `<unk>`)

## Performance

**Training Results (5 epochs):**

- Training loss: 4.0949 (initial) → 3.1843 (final), a 22% reduction
- Validation loss: 4.1918 (initial) → 3.8537 (final), an 8% reduction
- Training device: Apple Silicon (MPS)

## Usage

### Quick Start

```python
# This is a custom PyTorch model, not a Transformers model.
# Download the files and use them with the provided inference script.
import requests
from pathlib import Path

# Download model files
base_url = "https://huggingface.co/sumitdotml/seq2seq-de-en/resolve/main"
files = ["best_model.pt", "german_tokenizer.pkl", "english_tokenizer.pkl"]

for file in files:
    response = requests.get(f"{base_url}/{file}")
    response.raise_for_status()  # fail early on a bad download
    Path(file).write_bytes(response.content)
    print(f"Downloaded {file}")
```

### Translation Examples

```bash
# Interactive mode
python scripts/inference.py --interactive

# Single translation
python scripts/inference.py --sentence "Hallo, wie geht es dir?" --verbose

# Demo mode
python scripts/inference.py
```

**Example Translations:**

- `"Das ist ein gutes Buch."` → `"this is a good idea."`
- `"Wo ist der Bahnhof?"` → `"where is the <unk>"`
- `"Ich liebe Deutschland."` → `"i share."`

## Files Included

- `best_model.pt`: PyTorch model checkpoint (trained weights + architecture)
- `german_tokenizer.pkl`: German vocabulary and tokenization logic
- `english_tokenizer.pkl`: English vocabulary and tokenization logic

## Installation & Setup

1. **Clone the repository:**

   ```bash
   git clone https://github.com/sumitdotml/seq2seq
   cd seq2seq
   ```

2. **Set up the environment:**

   ```bash
   uv venv && source .venv/bin/activate  # or: python -m venv .venv
   uv pip install torch requests tqdm    # or: pip install torch requests tqdm
   ```

3. **Download the pretrained model:**

   ```bash
   python scripts/download_pretrained.py
   ```

4. **Start translating:**

   ```bash
   python scripts/inference.py --interactive
   ```

## Model Architecture Details

The model uses a custom implementation with these components (a minimal sketch of how they fit together follows the Training Details section below):

- **Encoder** (`src/models/encoder.py`): LSTM-based encoder with an embedding layer
- **Decoder** (`src/models/decoder.py`): LSTM-based decoder without attention
- **Seq2Seq** (`src/models/seq2seq.py`): Main model combining encoder and decoder with the generation logic

## Limitations

- **Vocabulary constraints**: Limited to 30k German / 25k English words
- **Training data**: Only 2M sentence pairs (vs. 35M in the full WMT19 set)
- **No attention mechanism**: Basic encoder-decoder without attention
- **Simple tokenization**: Word-level tokenization without subword units
- **Translation quality**: Suitable for basic phrases; struggles with complex sentences

## Training Details

**Environment:**

- Framework: PyTorch 2.0+
- Device: Apple Silicon (MPS acceleration)
- Training length: 5 epochs
- Validation strategy: Hold-out validation set

**Optimization:**

- Optimizer: Adam (lr=0.0003)
- Loss function: CrossEntropyLoss (ignoring padding)
- Gradient clipping: 1.0
- Scheduler: StepLR (step_size=3, gamma=0.5)
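To make the structure concrete, here is a minimal, self-contained PyTorch sketch of how the three components listed under Model Architecture Details fit together. This is an illustration under assumptions, not the repository's actual code: the class names mirror the files in `src/models/`, but the real signatures may differ, and the final linear projection, the `batch_first` layout, and the 0.5 teacher-forcing ratio are all assumptions.

```python
# Minimal sketch of an attention-free LSTM encoder-decoder using the
# hyperparameters from this card; signatures are assumed, not the repo's.
import random
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=30_000, emb_size=256, hidden_size=512,
                 num_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers,
                            dropout=dropout, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        embedded = self.embedding(src)           # (batch, src_len, emb)
        _, (hidden, cell) = self.lstm(embedded)  # final states act as the context vector
        return hidden, cell

class Decoder(nn.Module):
    def __init__(self, vocab_size=25_000, emb_size=256, hidden_size=512,
                 num_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers,
                            dropout=dropout, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)  # assumed output projection

    def forward(self, token, hidden, cell):      # token: (batch, 1)
        embedded = self.embedding(token)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        return self.out(output.squeeze(1)), hidden, cell

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        hidden, cell = self.encoder(src)
        outputs = []
        token = trg[:, :1]                       # first decoder input is <sos>
        for t in range(1, trg.size(1)):
            logits, hidden, cell = self.decoder(token, hidden, cell)
            outputs.append(logits)
            # Teacher forcing: sometimes feed the gold token, sometimes our own guess
            use_gold = random.random() < teacher_forcing_ratio
            token = trg[:, t:t+1] if use_gold else logits.argmax(1, keepdim=True)
        return torch.stack(outputs, dim=1)       # (batch, trg_len - 1, vocab)
```

At inference time the same decoder loop runs autoregressively, always feeding back its own argmax token until `<eos>` is produced.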
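And here is a sketch of one training epoch wiring together the optimization settings listed above (Adam at lr 3e-4, CrossEntropyLoss that ignores padding, gradient clipping at 1.0, and StepLR halving the rate every 3 epochs). `pad_idx` and `loader` are hypothetical placeholders; the real padding index lives in the pickled tokenizers.

```python
# One teacher-forced training epoch under the settings above; placeholders
# (pad_idx, loader) are hypothetical, not the repo's actual names.
import torch
import torch.nn as nn

pad_idx = 0  # hypothetical: the real index comes from english_tokenizer.pkl
model = Seq2Seq(Encoder(), Decoder())                   # classes from the sketch above
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)   # skip padded positions
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)

def train_epoch(model, loader, device="mps"):
    """One epoch of teacher-forced training; returns the mean batch loss."""
    model.train()
    model.to(device)
    total_loss = 0.0
    for src, trg in loader:                  # (batch, src_len), (batch, trg_len)
        src, trg = src.to(device), trg.to(device)
        optimizer.zero_grad()
        logits = model(src, trg)             # (batch, trg_len - 1, vocab)
        # Shift targets by one step: predict trg[t] from everything before it
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         trg[:, 1:].reshape(-1))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        total_loss += loss.item()
    scheduler.step()                         # step once per epoch; LR halves every 3
    return total_loss / len(loader)
```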
## Reproduce Training

```bash
# Full training pipeline
python scripts/data_preparation.py   # Download WMT19 data
python src/data/tokenization.py      # Build vocabularies
python scripts/train.py              # Train model

# For full-dataset training, modify data_preparation.py:
# use_full_dataset = True  # Line 133-134
```

## Citation

If you use this model, please cite:

```bibtex
@misc{seq2seq-de-en,
  author = {sumitdotml},
  title = {German-English Seq2Seq Translation Model},
  year = {2025},
  url = {https://huggingface.co/sumitdotml/seq2seq-de-en},
  note = {PyTorch implementation of sequence-to-sequence translation}
}
```

## References

- Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. NeurIPS.
- WMT19 Translation Task: https://huggingface.co/datasets/wmt/wmt19

## License

MIT License. See the repository for the full license text.

## Contact

For questions about this model or the training code, please open an issue in the [GitHub repository](https://github.com/sumitdotml/seq2seq).