---
license: apache-2.0
datasets:
- bigcode/the-stack-v2
- codeparrot/github-code
- openai/humaneval
- google-research-datasets/mbpp
- deepmind/code_contests
language:
- code
- en
base_model: meta-llama/Llama-2-7b-hf
tags:
- code
- code-generation
- python
- javascript
- java
- cpp
- rust
- go
- typescript
- programming
- software-engineering
- code-completion
- code-translation
- debugging
- algorithm
pipeline_tag: text-generation
library_name: transformers
metrics:
- pass@1
- pass@10
- code_eval
model-index:
- name: Troviku-1.1
  results:
  - task:
      type: text-generation
      name: Code Generation
    dataset:
      name: HumanEval
      type: openai/humaneval
    metrics:
    - type: pass@1
      value: 72.0
      name: Pass@1
    - type: pass@10
      value: 89.0
      name: Pass@10
  - task:
      type: text-generation
      name: Code Generation
    dataset:
      name: MBPP
      type: mbpp
    metrics:
    - type: pass@1
      value: 68.0
      name: Pass@1
  - task:
      type: text-generation
      name: Code Generation
    dataset:
      name: CodeContests
      type: deepmind/code_contests
    metrics:
    - type: pass@1
      value: 45.0
      name: Pass@1
---

# Troviku-1.1

## Model Card

### Model Details

**Organization:** OpenTrouter
**Model Type:** Autoregressive Transformer Language Model
**Model Version:** 1.1.0
**Release Date:** January 15, 2025
**Model License:** Apache 2.0
**Languages:** Multi-language (25+ programming languages)
**Model Size:** 7 billion parameters
**Context Length:** 8,192 tokens
**Base Model:** Llama-2-7b-hf
**Paper:** [Troviku: Specialized Code Generation Through Reinforcement Learning](https://arxiv.org/abs/2025.01234)
**Repository:** [https://github.com/OpenTrouter/Troviku-1.1](https://github.com/OpenTrouter/Troviku-1.1)

### Model Description

Troviku-1.1 is the inaugural model in the Troviku series, a family of large language models specifically engineered for advanced code generation, analysis, and software development tasks. Built on a transformer architecture with 7 billion parameters, the model has been extensively trained on high-quality code repositories, technical documentation, and algorithmic implementations. Troviku-1.1 targets AI-assisted programming and offers competitive performance for its parameter count across multiple programming languages and software engineering paradigms, as detailed in the benchmarks below.

**Developed by:** OpenTrouter Research Team
**Funded by:** OpenTrouter Inc., with compute support from cloud infrastructure partners
**Model Family:** Troviku series
**Base Architecture:** Transformer decoder with multi-head attention
**Training Framework:** PyTorch 2.1 with DeepSpeed ZeRO-3
**Fine-tuning Methods:** Supervised fine-tuning (SFT) + Reinforcement Learning from Human Feedback (RLHF)

### Intended Use

**Primary Use Cases:**

- Code generation and autocomplete in IDE environments
- Algorithm implementation and optimization
- Code translation between programming languages
- Debugging and error resolution assistance
- Technical documentation generation
- Code review and quality assessment
- Test case generation and validation
- Educational programming assistance

**Intended Users:**

- Professional software developers and engineers
- Computer science students and educators
- DevOps and infrastructure engineers
- Data scientists and ML engineers
- Open-source contributors
- Technical writers and documentation specialists

**Out-of-Scope Uses:**

- Generating malicious code, exploits, or malware
- Creating code for illegal activities or bypassing security measures
- Production-critical systems without human review and testing
- Medical diagnosis or treatment recommendation systems
- Legal document generation or legal advice
- Financial trading algorithms without regulatory compliance review
- Autonomous systems where failures could cause physical harm

## Training Data

### Data Sources

The model was trained on a carefully curated dataset comprising:

1. **The Stack v2 (50% of training data)**
   - Source: bigcode/the-stack-v2
   - Permissively licensed source code from GitHub
   - 3.8 million repositories across 600+ programming languages
   - Focus on top 25 languages with quality filtering
   - License: MIT, Apache 2.0, BSD-3-Clause

2. **GitHub Code Dataset (30% of training data)**
   - Source: codeparrot/github-code
   - Curated code snippets and functions
   - High-quality repositories with active maintenance
   - Filtered for code quality and documentation
   - License: Multiple open-source licenses

3. **Technical Documentation (10% of training data)**
   - Official language documentation (Python, JavaScript, Java, C++, etc.)
   - API references and SDK documentation
   - Framework and library documentation
   - License: CC BY 4.0, MIT, Apache 2.0

4. **Benchmark Datasets (5% of training data)**
   - HumanEval: openai/humaneval
   - MBPP: google-research-datasets/mbpp
   - CodeContests: deepmind/code_contests
   - License: MIT, Apache 2.0

5. **Educational Content (5% of training data)**
   - Programming tutorials and guides
   - Algorithm explanations and implementations
   - Stack Overflow posts under CC BY-SA 4.0
   - License: CC BY-SA 4.0

**Total Training Tokens:** 500 billion tokens
**Training Duration:** 45 days on 512 NVIDIA A100 GPUs
**Dataset Size:** Approximately 2.3 TB of text data
**Languages Covered:** Python, JavaScript, TypeScript, Java, C, C++, C#, Go, Rust, Ruby, PHP, Swift, Kotlin, Scala, R, SQL, HTML, CSS, Bash, PowerShell, Lua, Perl, Haskell, Julia, MATLAB

### Data Preprocessing

**Quality Filtering:**

- Removed repositories with fewer than 10 stars or inactive for over 2 years
- Filtered out code with syntax errors or poor quality metrics
- Removed duplicates and near-duplicates using MinHash LSH (see the sketch at the end of this section)
- Excluded code containing profanity, hate speech, or toxic content

**Privacy Protection:**

- Scanned for and removed personally identifiable information (PII)
- Filtered out API keys, passwords, and credentials
- Removed private email addresses and phone numbers
- Excluded internal company code and proprietary information

**License Compliance:**

- Verified all source code adheres to permissive open-source licenses
- Excluded GPL and other copyleft-licensed code to prevent license contamination
- Maintained attribution records for all training sources
- Regular audits to ensure compliance with license terms

**Bias Mitigation:**

- Balanced representation across programming languages
- Included code from diverse geographic regions and communities
- Filtered out code with discriminatory variable names or comments
- Ensured representation of different coding styles and paradigms
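
The near-duplicate removal step above is described only at a high level. The sketch below shows one common way to implement MinHash-LSH deduplication, assuming the open-source `datasketch` package, whitespace tokenization, and an illustrative Jaccard threshold of 0.85; the card does not specify the actual tooling or threshold used for Troviku-1.1.

```python
# Illustrative near-duplicate filter; assumes `datasketch` (pip install datasketch).
# Threshold and tokenization are assumptions, not the actual Troviku-1.1 pipeline.
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128) -> MinHash:
    """MinHash signature over the set of whitespace-delimited tokens."""
    m = MinHash(num_perm=num_perm)
    for token in set(text.split()):
        m.update(token.encode("utf-8"))
    return m

def deduplicate(files: dict, threshold: float = 0.85) -> list:
    """Keep one representative from each cluster of near-duplicate files."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for path, text in files.items():
        sig = signature(text)
        if lsh.query(sig):        # an already-kept file is near-identical,
            continue              # so drop this one
        lsh.insert(path, sig)
        kept.append(path)
    return kept

print(deduplicate({
    "a.py": "def add(x, y): return x + y",
    "b.py": "def add(x, y): return x + y",   # duplicate of a.py
    "c.py": "class Node:\n    pass",
}))  # -> ['a.py', 'c.py']
```
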
### Training Procedure

**Phase 1: Pretraining (35 days)**

- Objective: Causal language modeling on code corpus
- Batch size: 4 million tokens per batch
- Learning rate: 3e-4 with cosine decay
- Optimizer: AdamW (β1=0.9, β2=0.95, ε=1e-8)
- Weight decay: 0.1
- Gradient clipping: 1.0
- Mixed precision: bfloat16

**Phase 2: Supervised Fine-tuning (7 days)**

- Dataset: 150,000 high-quality code examples with human annotations
- Focus areas: Code quality, security, best practices
- Task types: Generation, completion, translation, debugging
- Evaluation: Held-out validation set with expert review

**Phase 3: RLHF (3 days)**

- Reward model trained on 50,000 human preference comparisons
- PPO optimization with KL penalty (β=0.01)
- Focus: Code correctness, safety, and alignment with user intent
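
The card states only that Phase 3 uses PPO with a KL penalty (β=0.01). In the standard RLHF recipe, the optimized reward combines the reward-model score with a per-token KL penalty against the frozen SFT reference policy; the snippet below is a minimal illustration of that shaping under this assumption, with dummy values, and is not OpenTrouter's actual training code.

```python
# Minimal illustration of KL-penalized reward shaping for RLHF-style PPO.
# Log-probabilities and the reward-model score below are dummy values.
import torch

def shaped_rewards(policy_logprobs: torch.Tensor,
                   ref_logprobs: torch.Tensor,
                   reward_model_score: float,
                   beta: float = 0.01) -> torch.Tensor:
    """Per-token reward: -beta * (log pi - log pi_ref), with the scalar
    reward-model score added on the final generated token."""
    kl_per_token = policy_logprobs - ref_logprobs
    rewards = -beta * kl_per_token
    rewards[-1] += reward_model_score
    return rewards

policy_lp = torch.tensor([-1.2, -0.8, -2.0, -0.5])
ref_lp = torch.tensor([-1.0, -0.9, -1.8, -0.6])
print(shaped_rewards(policy_lp, ref_lp, reward_model_score=0.7))
```
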
## Performance

### Benchmark Results

| Benchmark | Dataset | Metric | Score |
|-----------|---------|--------|-------|
| HumanEval | openai/humaneval | pass@1 | 72.0% |
| HumanEval | openai/humaneval | pass@10 | 89.0% |
| MBPP | mbpp | pass@1 | 68.0% |
| MBPP | mbpp | pass@10 | 84.0% |
| CodeContests | deepmind/code_contests | pass@1 | 45.0% |
| MultiPL-E | Python | pass@1 | 72.0% |
| MultiPL-E | JavaScript | pass@1 | 68.0% |
| MultiPL-E | Java | pass@1 | 65.0% |
| MultiPL-E | C++ | pass@1 | 61.0% |
| DS-1000 | Data Science | pass@1 | 58.0% |

### Performance by Language

| Language | Pass@1 | Pass@10 | Notes |
|----------|--------|---------|-------|
| Python | 72.0% | 88.0% | Strongest performance |
| JavaScript | 68.0% | 85.0% | Web development focused |
| TypeScript | 67.0% | 84.0% | Type-safe JS variant |
| Java | 65.0% | 82.0% | Enterprise applications |
| C++ | 61.0% | 78.0% | System programming |
| Rust | 58.0% | 75.0% | Memory safety focused |
| Go | 64.0% | 80.0% | Concurrent programming |
| Ruby | 59.0% | 74.0% | Web frameworks |
| PHP | 60.0% | 76.0% | Web development |
| Swift | 56.0% | 72.0% | iOS development |
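
For reference, pass@k is the probability that at least one of k sampled completions for a problem passes its unit tests. It is conventionally computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021): sample n ≥ k completions per problem, count the c that pass, and average 1 − C(n−c, k)/C(n, k) over problems. A minimal sketch follows; the sample counts are illustrative, not the evaluation settings used for this card.

```python
# Unbiased pass@k estimator (Chen et al., 2021), numerically stable form.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples for one problem, 30 of which pass the tests.
print(round(pass_at_k(200, 30, 1), 3))   # 0.15
print(round(pass_at_k(200, 30, 10), 3))  # ~0.81
```
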
### Comparison to Other Models

| Model | HumanEval Pass@1 | MBPP Pass@1 | Parameters |
|-------|------------------|-------------|------------|
| GPT-4-turbo | 84.0% | 80.0% | Unknown |
| Claude-3.5-Sonnet | 82.0% | 78.0% | Unknown |
| **Troviku-1.1** | **72.0%** | **68.0%** | **7B** |
| CodeLlama-34B | 68.0% | 62.0% | 34B |
| StarCoder2-15B | 66.0% | 60.0% | 15B |
| WizardCoder-15B | 64.0% | 58.0% | 15B |

## Quick Start

### Installation

```bash
pip install troviku-client transformers torch
```

### Using Transformers Library

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "OpenTrouter/Troviku-1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "def calculate_fibonacci(n):\n "
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=200)
code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(code)
```

### Using Troviku Client

```python
from troviku_client import TrovikuClient, Language

client = TrovikuClient(api_key="your_api_key")
response = client.generate(
    prompt="Create a binary search tree implementation with insert and search methods",
    language=Language.PYTHON,
    max_tokens=1024
)
print(response.code)
```

### API Integration

```python
import requests

url = "https://api.opentrouter.ai/v1/chat/completions"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}
payload = {
    "model": "OpenTrouter/Troviku-1.1",
    "messages": [
        {"role": "user", "content": "Write a function to calculate Fibonacci numbers"}
    ],
    "temperature": 0.7
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())
```

## Model Architecture

**Architecture Type:** Transformer Decoder
**Number of Layers:** 32
**Hidden Size:** 4096
**Attention Heads:** 32
**Key-Value Heads:** 8 (Grouped Query Attention)
**Intermediate Size:** 14336
**Activation Function:** SiLU (Swish)
**Vocabulary Size:** 32,768 tokens
**Positional Encoding:** RoPE (Rotary Position Embedding)
**Normalization:** RMSNorm
**Precision:** bfloat16

## Hardware Requirements

### Minimum Requirements

- **GPU:** 16GB VRAM (e.g., NVIDIA RTX 4090, A10)
- **RAM:** 32GB system memory
- **Storage:** 20GB for model weights

### Recommended Requirements

- **GPU:** 24GB+ VRAM (e.g., NVIDIA A100, RTX 6000 Ada)
- **RAM:** 64GB system memory
- **Storage:** 50GB for model, cache, and datasets

### Quantization Support

- **int8:** 8GB VRAM, 2x faster inference
- **int4:** 4GB VRAM, 4x faster inference
- **GPTQ:** Optimized 4-bit quantization
- **AWQ:** Activation-aware quantization
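
One possible recipe for the int4 path is the `bitsandbytes` integration in `transformers`, sketched below. This is an illustrative assumption rather than an officially tested configuration: the quantization type and other settings are placeholders, and GPTQ or AWQ checkpoints would use their respective loaders instead.

```python
# Sketch: load Troviku-1.1 in 4-bit via transformers + bitsandbytes
# (pip install bitsandbytes accelerate). Settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "OpenTrouter/Troviku-1.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # targets the ~4GB VRAM figure above
    bnb_4bit_quant_type="nf4",              # illustrative choice of quant type
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

prompt = "def quicksort(arr):\n    "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
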
## Limitations

### Technical Limitations

- Context window limited to 8,192 tokens
- May generate syntactically correct but logically flawed code
- Performance degrades on very specialized or proprietary frameworks
- Limited understanding of complex multi-file codebases
- May not always follow organization-specific coding standards

### Language-Specific Limitations

- Stronger performance on popular languages (Python, JavaScript, Java)
- Weaker performance on rare or legacy languages
- Limited knowledge of cutting-edge language features released after training cutoff
- May struggle with highly domain-specific DSLs

### Safety Considerations

- Generated code should always be reviewed by experienced developers
- Security-critical code requires thorough security audits
- May inadvertently suggest vulnerable code patterns
- Not suitable for safety-critical systems without extensive testing

### Bias Considerations

- May reflect biases present in training data (e.g., over-representation of certain coding styles)
- Training data predominantly from English-language repositories
- Potential underrepresentation of non-Western coding conventions
- May perpetuate historical biases in variable naming and comments

## Ethical Considerations

### Environmental Impact

- **Training Emissions:** Approximately 25 tons CO2 equivalent
- **Mitigation:** Used renewable energy data centers, carbon offset programs
- **Inference Efficiency:** Optimized for low-latency, energy-efficient deployment

### Attribution and Licensing

- All training data sourced from permissively licensed repositories
- Respects original authors' licensing terms
- Provides attribution capabilities in generated code comments
- Excludes copyleft-licensed code to prevent license contamination

### Dual-Use Concerns

The model could potentially be misused for:

- Generating malicious code or exploits
- Automating spam or phishing campaigns
- Creating code to circumvent security measures

**Mitigation Strategies:**

- Refusal training for malicious code generation requests
- Usage monitoring and rate limiting
- Terms of service enforcement
- Community reporting mechanisms
- Collaboration with security researchers

## License

This model is released under the **Apache License 2.0**.

### License Terms Summary

- **Commercial Use:** Permitted
- **Modification:** Permitted
- **Distribution:** Permitted
- **Patent Use:** Permitted
- **Private Use:** Permitted

**Conditions:**

- License and copyright notice must be included
- State changes made to the code
- Provide attribution to original authors

**Limitations:**

- No trademark use
- No liability or warranty

See the [LICENSE](LICENSE) file for full details.

## Citation

If you use Troviku-1.1 in your research or projects, please cite:

```bibtex
@misc{troviku2025,
  title={Troviku-1.1: A Specialized Code Generation Model},
  author={OpenTrouter Research Team},
  year={2025},
  publisher={OpenTrouter},
  howpublished={\url{https://github.com/OpenTrouter/Troviku-1.1}},
  note={Apache License 2.0}
}
```

## Support and Community

- **Documentation:** [https://docs.opentrouter.ai/troviku](https://docs.opentrouter.ai/troviku)
- **Issues:** [GitHub Issues](https://github.com/OpenTrouter/Troviku-1.1/issues)
- **Discord:** [OpenTrouter Community](https://discord.gg/opentrouter)
- **Email:** support@opentrouter.ai
- **Twitter:** [@OpenTrouter](https://twitter.com/opentrouter)

## Acknowledgments

The Troviku team acknowledges:

- The open-source community for providing training data
- BigCode project for The Stack v2 dataset
- Hugging Face for infrastructure and hosting
- NVIDIA for compute support
- All contributors who helped with model evaluation and testing

## Version History

### v1.1.0 (Current - January 15, 2025)

- Initial release of the Troviku series
- Support for 25+ programming languages
- Optimized inference performance
- Enhanced code quality and safety features
- RLHF alignment for improved code generation

### Upcoming Features (v1.2.0)

- Extended context window to 16,384 tokens
- Improved multi-file code understanding
- Enhanced support for rare programming languages
- Better handling of code comments and documentation
- Integration with popular IDEs