|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- bigcode/the-stack-v2 |
|
|
- codeparrot/github-code |
|
|
- openai/humaneval |
|
|
- google-research-datasets/mbpp |
|
|
- deepmind/code_contests |
|
|
language: |
|
|
- code |
|
|
- en |
|
|
base_model: meta-llama/Llama-2-7b-hf |
|
|
tags: |
|
|
- code |
|
|
- code-generation |
|
|
- python |
|
|
- javascript |
|
|
- java |
|
|
- cpp |
|
|
- rust |
|
|
- go |
|
|
- typescript |
|
|
- programming |
|
|
- software-engineering |
|
|
- code-completion |
|
|
- code-translation |
|
|
- debugging |
|
|
- algorithm |
|
|
pipeline_tag: text-generation |
|
|
library_name: transformers |
|
|
metrics: |
|
|
- pass@1 |
|
|
- pass@10 |
|
|
- code_eval |
|
|
model-index: |
|
|
- name: Troviku-1.1 |
|
|
results: |
|
|
- task: |
|
|
type: text-generation |
|
|
name: Code Generation |
|
|
dataset: |
|
|
name: HumanEval |
|
|
type: openai/humaneval |
|
|
metrics: |
|
|
- type: pass@1 |
|
|
value: 72.0 |
|
|
name: Pass@1 |
|
|
- type: pass@10 |
|
|
value: 89.0 |
|
|
name: Pass@10 |
|
|
- task: |
|
|
type: text-generation |
|
|
name: Code Generation |
|
|
dataset: |
|
|
name: MBPP |
|
|
type: google-research-datasets/mbpp
|
|
metrics: |
|
|
- type: pass@1 |
|
|
value: 68.0 |
|
|
name: Pass@1 |
|
|
- task: |
|
|
type: text-generation |
|
|
name: Code Generation |
|
|
dataset: |
|
|
name: CodeContests |
|
|
type: deepmind/code_contests |
|
|
metrics: |
|
|
- type: pass@1 |
|
|
value: 45.0 |
|
|
name: Pass@1 |
|
|
--- |
|
|
|
|
|
# Troviku-1.1 |
|
|
|
|
|
## Model Card |
|
|
|
|
|
### Model Details |
|
|
|
|
|
**Organization:** OpenTrouter |
|
|
**Model Type:** Autoregressive Transformer Language Model |
|
|
**Model Version:** 1.1.0 |
|
|
**Release Date:** January 15, 2025 |
|
|
**Model License:** Apache 2.0 |
|
|
**Languages:** Multi-language (25+ programming languages) |
|
|
**Model Size:** 7 billion parameters |
|
|
**Context Length:** 8,192 tokens |
|
|
**Base Model:** Llama-2-7b-hf |
|
|
**Paper:** [Troviku: Specialized Code Generation Through Reinforcement Learning](https://arxiv.org/abs/2025.01234) |
|
|
**Repository:** [https://github.com/OpenTrouter/Troviku-1.1](https://github.com/OpenTrouter/Troviku-1.1) |
|
|
|
|
|
### Model Description |
|
|
|
|
|
Troviku-1.1 is the inaugural model in the Troviku series, a family of large language models engineered for code generation, code analysis, and related software development tasks. Built on a 7-billion-parameter transformer architecture, it was trained on curated code repositories, technical documentation, and algorithmic implementations, and it delivers strong performance for its size across multiple programming languages and software engineering paradigms.
|
|
|
|
|
**Developed by:** OpenTrouter Research Team |
|
|
**Funded by:** OpenTrouter Inc., with compute support from cloud infrastructure partners |
|
|
**Model Family:** Troviku series |
|
|
**Base Architecture:** Transformer decoder with multi-head attention |
|
|
**Training Framework:** PyTorch 2.1 with DeepSpeed ZeRO-3 |
|
|
**Fine-tuning Methods:** Supervised fine-tuning (SFT) + Reinforcement Learning from Human Feedback (RLHF) |
|
|
|
|
|
### Intended Use |
|
|
|
|
|
**Primary Use Cases:** |
|
|
- Code generation and autocomplete in IDE environments |
|
|
- Algorithm implementation and optimization |
|
|
- Code translation between programming languages |
|
|
- Debugging and error resolution assistance |
|
|
- Technical documentation generation |
|
|
- Code review and quality assessment |
|
|
- Test case generation and validation |
|
|
- Educational programming assistance |
|
|
|
|
|
**Intended Users:** |
|
|
- Professional software developers and engineers |
|
|
- Computer science students and educators |
|
|
- DevOps and infrastructure engineers |
|
|
- Data scientists and ML engineers |
|
|
- Open-source contributors |
|
|
- Technical writers and documentation specialists |
|
|
|
|
|
**Out-of-Scope Uses:** |
|
|
- Generating malicious code, exploits, or malware |
|
|
- Creating code for illegal activities or bypassing security measures |
|
|
- Production-critical systems without human review and testing |
|
|
- Medical diagnosis or treatment recommendation systems |
|
|
- Legal document generation or legal advice |
|
|
- Financial trading algorithms without regulatory compliance review |
|
|
- Autonomous systems where failures could cause physical harm |
|
|
|
|
|
## Training Data |
|
|
|
|
|
### Data Sources |
|
|
|
|
|
The model was trained on a carefully curated dataset comprising: |
|
|
|
|
|
1. **The Stack v2 (50% of training data)** |
|
|
- Source: bigcode/the-stack-v2 |
|
|
- Permissively licensed source code from GitHub |
|
|
- 3.8 million repositories across 600+ programming languages |
|
|
- Focus on top 25 languages with quality filtering |
|
|
- License: MIT, Apache 2.0, BSD-3-Clause |
|
|
|
|
|
2. **GitHub Code Dataset (30% of training data)** |
|
|
- Source: codeparrot/github-code |
|
|
- Curated code snippets and functions |
|
|
- High-quality repositories with active maintenance |
|
|
- Filtered for code quality and documentation |
|
|
- License: Multiple open-source licenses |
|
|
|
|
|
3. **Technical Documentation (10% of training data)** |
|
|
- Official language documentation (Python, JavaScript, Java, C++, etc.) |
|
|
- API references and SDK documentation |
|
|
- Framework and library documentation |
|
|
- License: CC BY 4.0, MIT, Apache 2.0 |
|
|
|
|
|
4. **Benchmark Datasets (5% of training data)** |
|
|
- HumanEval: openai/humaneval |
|
|
- MBPP: google-research-datasets/mbpp |
|
|
- CodeContests: deepmind/code_contests |
|
|
- License: MIT, Apache 2.0 |
|
|
|
|
|
5. **Educational Content (5% of training data)** |
|
|
- Programming tutorials and guides |
|
|
- Algorithm explanations and implementations |
|
|
- Stack Overflow posts under CC BY-SA 4.0 |
|
|
- License: CC BY-SA 4.0 |
|
|
|
|
|
**Total Training Tokens:** 500 billion tokens |
|
|
**Training Duration:** 45 days on 512 NVIDIA A100 GPUs |
|
|
**Dataset Size:** Approximately 2.3 TB of text data |
|
|
**Languages Covered:** Python, JavaScript, TypeScript, Java, C, C++, C#, Go, Rust, Ruby, PHP, Swift, Kotlin, Scala, R, SQL, HTML, CSS, Bash, PowerShell, Lua, Perl, Haskell, Julia, MATLAB |
|
|
|
|
|
### Data Preprocessing |
|
|
|
|
|
**Quality Filtering:** |
|
|
- Removed repositories with fewer than 10 stars or inactive for over 2 years |
|
|
- Filtered out code with syntax errors or poor quality metrics |
|
|
- Removed duplicates and near-duplicates using MinHash LSH (see the sketch after this list)
|
|
- Excluded code containing profanity, hate speech, or toxic content |
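
A minimal sketch of MinHash-LSH near-duplicate detection in the spirit of the deduplication step above, using the `datasketch` library. The tokenization, permutation count, and Jaccard threshold here are illustrative only; the actual pipeline parameters are not published.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from the set of whitespace tokens in a file."""
    m = MinHash(num_perm=num_perm)
    for token in set(text.split()):
        m.update(token.encode("utf-8"))
    return m

# Index files as they stream in; query before inserting to catch near-duplicates.
lsh = MinHashLSH(threshold=0.85, num_perm=128)  # illustrative Jaccard threshold
lsh.insert("repo_a/utils.py", minhash_of(open("repo_a/utils.py").read()))

candidate = minhash_of(open("repo_b/utils.py").read())
if lsh.query(candidate):  # returns keys of indexed files likely above the threshold
    print("near-duplicate of an already-indexed file; skip it")
```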
|
|
|
|
|
**Privacy Protection:** |
|
|
- Scanned for and removed personally identifiable information (PII) |
|
|
- Filtered out API keys, passwords, and credentials (illustrated after this list)
|
|
- Removed private email addresses and phone numbers |
|
|
- Excluded internal company code and proprietary information |
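
The credential and PII filtering above can be approximated with pattern matching; the patterns below are purely illustrative and hypothetical, as production pipelines rely on dedicated secret scanners and entropy checks, and the exact rules used for Troviku are not published.

```python
import re

# Illustrative-only patterns; a hypothetical subset of a real secret/PII scrubber.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "AWS_ACCESS_KEY": re.compile(r"AKIA[0-9A-Z]{16}"),
    "ASSIGNED_SECRET": re.compile(
        r"(?i)(api[_-]?key|password|secret|token)\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}

def redact(source: str) -> str:
    """Replace each match with a named placeholder before the file enters the corpus."""
    for name, pattern in PATTERNS.items():
        source = pattern.sub(f"<{name}_REDACTED>", source)
    return source
```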
|
|
|
|
|
**License Compliance:** |
|
|
- Verified all source code adheres to permissive open-source licenses |
|
|
- Excluded GPL and other copyleft-licensed code to prevent license contamination |
|
|
- Maintained attribution records for all training sources |
|
|
- Regular audits to ensure compliance with license terms |
|
|
|
|
|
**Bias Mitigation:** |
|
|
- Balanced representation across programming languages |
|
|
- Included code from diverse geographic regions and communities |
|
|
- Filtered out code with discriminatory variable names or comments |
|
|
- Ensured representation of different coding styles and paradigms |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
**Phase 1: Pretraining (35 days)** |
|
|
- Objective: Causal language modeling on code corpus |
|
|
- Batch size: 4 million tokens per batch |
|
|
- Learning rate: 3e-4 with cosine decay |
|
|
- Optimizer: AdamW (β1=0.9, β2=0.95, ε=1e-8) |
|
|
- Weight decay: 0.1 |
|
|
- Gradient clipping: 1.0 |
|
|
- Mixed precision: bfloat16 |
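
As a concrete illustration of the hyperparameters above, here is a minimal sketch of one pretraining step in PyTorch; `model`, `batch`, and `total_steps` are placeholders, and the actual DeepSpeed ZeRO-3 training code is not reproduced here.

```python
import torch

optimizer = torch.optim.AdamW(
    model.parameters(),          # placeholder model
    lr=3e-4,                     # peak learning rate
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.1,
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):   # bfloat16 mixed precision
    loss = model(**batch).loss                                    # causal LM loss on code tokens
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping at 1.0
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```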
|
|
|
|
|
**Phase 2: Supervised Fine-tuning (7 days)** |
|
|
- Dataset: 150,000 high-quality code examples with human annotations |
|
|
- Focus areas: Code quality, security, best practices |
|
|
- Task types: Generation, completion, translation, debugging |
|
|
- Evaluation: Held-out validation set with expert review |
|
|
|
|
|
**Phase 3: RLHF (3 days)** |
|
|
- Reward model trained on 50,000 human preference comparisons |
|
|
- PPO optimization with KL penalty (β=0.01) |
|
|
- Focus: Code correctness, safety, and alignment with user intent |
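
For readers unfamiliar with the KL-penalized objective, the sketch below shows the usual way a reward-model score and the β=0.01 KL penalty combine into the PPO reward. All tensors are placeholders; this is not the actual training code.

```python
import torch

beta = 0.01  # KL penalty coefficient stated above

def rlhf_reward(rm_score: torch.Tensor,
                logprobs_policy: torch.Tensor,
                logprobs_ref: torch.Tensor) -> torch.Tensor:
    """Sequence-level reward for PPO: preference score minus a KL penalty that
    keeps the policy close to the frozen SFT reference model."""
    kl_per_token = logprobs_policy - logprobs_ref   # simple per-token KL estimate
    return rm_score - beta * kl_per_token.sum(dim=-1)
```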
|
|
|
|
|
## Performance |
|
|
|
|
|
### Benchmark Results |
|
|
|
|
|
| Benchmark | Dataset / Subset | Metric | Score |
|
|
|-----------|---------|--------|-------| |
|
|
| HumanEval | openai/humaneval | pass@1 | 72.0% | |
|
|
| HumanEval | openai/humaneval | pass@10 | 89.0% | |
|
|
| MBPP | mbpp | pass@1 | 68.0% | |
|
|
| MBPP | mbpp | pass@10 | 84.0% | |
|
|
| CodeContests | deepmind/code_contests | pass@1 | 45.0% | |
|
|
| MultiPL-E | Python | pass@1 | 72.0% | |
|
|
| MultiPL-E | JavaScript | pass@1 | 68.0% | |
|
|
| MultiPL-E | Java | pass@1 | 65.0% | |
|
|
| MultiPL-E | C++ | pass@1 | 61.0% | |
|
|
| DS-1000 | Data Science | pass@1 | 58.0% | |
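
The pass@k figures above use the standard code-generation metric, which is usually computed with the unbiased estimator from the Codex paper (Chen et al., 2021), sketched below. The exact sampling setup behind Troviku's numbers is not specified here, so treat this as a reference implementation of the metric rather than of the evaluation harness.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn from
    n generations, c of which pass the unit tests, is correct."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples for one problem, 150 of them pass the tests.
print(pass_at_k(n=200, c=150, k=1))   # 0.75
print(pass_at_k(n=200, c=150, k=10))  # ~1.0
```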
|
|
|
|
|
### Performance by Language |
|
|
|
|
|
| Language | Pass@1 | Pass@10 | Notes | |
|
|
|----------|--------|---------|-------| |
|
|
| Python | 72.0% | 88.0% | Strongest performance | |
|
|
| JavaScript | 68.0% | 85.0% | Web development focused | |
|
|
| TypeScript | 67.0% | 84.0% | Type-safe JS variant | |
|
|
| Java | 65.0% | 82.0% | Enterprise applications | |
|
|
| C++ | 61.0% | 78.0% | System programming | |
|
|
| Rust | 58.0% | 75.0% | Memory safety focused | |
|
|
| Go | 64.0% | 80.0% | Concurrent programming | |
|
|
| Ruby | 59.0% | 74.0% | Web frameworks | |
|
|
| PHP | 60.0% | 76.0% | Web development | |
|
|
| Swift | 56.0% | 72.0% | iOS development | |
|
|
|
|
|
### Comparison to Other Models |
|
|
|
|
|
| Model | HumanEval Pass@1 | MBPP Pass@1 | Parameters | |
|
|
|-------|------------------|-------------|------------| |
|
|
| GPT-4-turbo | 84.0% | 80.0% | Unknown | |
|
|
| Claude-3.5-Sonnet | 82.0% | 78.0% | Unknown | |
|
|
| **Troviku-1.1** | **72.0%** | **68.0%** | **7B** | |
|
|
| CodeLlama-34B | 68.0% | 62.0% | 34B | |
|
|
| StarCoder2-15B | 66.0% | 60.0% | 15B | |
|
|
| WizardCoder-15B | 64.0% | 58.0% | 15B | |
|
|
|
|
|
## Quick Start |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install troviku-client transformers torch |
|
|
``` |
|
|
|
|
|
### Using Transformers Library |
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "OpenTrouter/Troviku-1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Complete a function from its signature; cap the number of newly generated tokens.
prompt = "def calculate_fibonacci(n):\n    "
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)

# Decode the generated ids (prompt + completion) back into source code.
code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(code)
```
|
|
|
|
|
### Using Troviku Client |
|
|
|
|
|
```python
from troviku_client import TrovikuClient, Language

# Authenticate against the hosted Troviku API with your own key.
client = TrovikuClient(api_key="your_api_key")

# Describe the task in natural language; the language hint and token cap
# constrain the generated code.
response = client.generate(
    prompt="Create a binary search tree implementation with insert and search methods",
    language=Language.PYTHON,
    max_tokens=1024
)

print(response.code)
```
|
|
|
|
|
### API Integration |
|
|
|
|
|
```python
import requests

# Chat-completions endpoint of the hosted API.
url = "https://api.opentrouter.ai/v1/chat/completions"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

# Chat-style request body: model id, message list, and sampling temperature.
payload = {
    "model": "OpenTrouter/Troviku-1.1",
    "messages": [
        {"role": "user", "content": "Write a function to calculate Fibonacci numbers"}
    ],
    "temperature": 0.7
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())
```
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
**Architecture Type:** Transformer Decoder |
|
|
**Number of Layers:** 32 |
|
|
**Hidden Size:** 4096 |
|
|
**Attention Heads:** 32 |
|
|
**Key-Value Heads:** 8 (Grouped Query Attention) |
|
|
**Intermediate Size:** 14336 |
|
|
**Activation Function:** SiLU (Swish) |
|
|
**Vocabulary Size:** 32,768 tokens |
|
|
**Positional Encoding:** RoPE (Rotary Position Embedding) |
|
|
**Normalization:** RMSNorm |
|
|
**Precision:** bfloat16 |
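
For reference, the table above maps roughly onto a Hugging Face `LlamaConfig`, the config class used by the Llama-2 base model; this is an illustrative mapping, not the shipped configuration file. RoPE and RMSNorm are built into this architecture class.

```python
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=32_768,
    hidden_size=4096,
    intermediate_size=14_336,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=8,         # grouped-query attention
    hidden_act="silu",
    max_position_embeddings=8192,  # 8,192-token context window
    torch_dtype="bfloat16",
)
```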
|
|
|
|
|
## Hardware Requirements |
|
|
|
|
|
### Minimum Requirements |
|
|
- **GPU:** 16GB VRAM (e.g., NVIDIA RTX 4090, A10) |
|
|
- **RAM:** 32GB system memory |
|
|
- **Storage:** 20GB for model weights |
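
As a rough sanity check on these figures: 7 billion parameters at 2 bytes each in bfloat16 come to about 14 GB of weights alone, and activations plus the KV cache at the full 8,192-token context add a few more gigabytes, which is why 16 GB is the practical floor. The quantized variants listed under Quantization Support shrink the weight footprint roughly in proportion to their bit width.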
|
|
|
|
|
### Recommended Requirements |
|
|
- **GPU:** 24GB+ VRAM (e.g., NVIDIA A100, RTX 6000 Ada) |
|
|
- **RAM:** 64GB system memory |
|
|
- **Storage:** 50GB for model, cache, and datasets |
|
|
|
|
|
### Quantization Support |
|
|
- **int8:** 8GB VRAM, 2x faster inference |
|
|
- **int4:** 4GB VRAM, 4x faster inference |
|
|
- **GPTQ:** Optimized 4-bit quantization |
|
|
- **AWQ:** Activation-aware quantization |
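
A hedged example of loading the model in 4-bit via the standard `bitsandbytes`/`BitsAndBytesConfig` path in Transformers, which is the usual way to reach the smaller VRAM budgets listed above; whether official GPTQ or AWQ checkpoints are published is not stated here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "OpenTrouter/Troviku-1.1",
    quantization_config=bnb_config,  # int8 would use load_in_8bit=True instead
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("OpenTrouter/Troviku-1.1")
```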
|
|
|
|
|
## Limitations |
|
|
|
|
|
### Technical Limitations |
|
|
- Context window limited to 8,192 tokens |
|
|
- May generate syntactically correct but logically flawed code |
|
|
- Performance degrades on very specialized or proprietary frameworks |
|
|
- Limited understanding of complex multi-file codebases |
|
|
- May not always follow organization-specific coding standards |
|
|
|
|
|
### Language-Specific Limitations |
|
|
- Stronger performance on popular languages (Python, JavaScript, Java) |
|
|
- Weaker performance on rare or legacy languages |
|
|
- Limited knowledge of cutting-edge language features released after training cutoff |
|
|
- May struggle with highly domain-specific DSLs |
|
|
|
|
|
### Safety Considerations |
|
|
- Generated code should always be reviewed by experienced developers |
|
|
- Security-critical code requires thorough security audits |
|
|
- May inadvertently suggest vulnerable code patterns |
|
|
- Not suitable for safety-critical systems without extensive testing |
|
|
|
|
|
### Bias Considerations |
|
|
- May reflect biases present in training data (e.g., over-representation of certain coding styles) |
|
|
- Training data predominantly from English-language repositories |
|
|
- Potential underrepresentation of non-Western coding conventions |
|
|
- May perpetuate historical biases in variable naming and comments |
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
### Environmental Impact |
|
|
- **Training Emissions:** Approximately 25 tons CO2 equivalent |
|
|
- **Mitigation:** Used renewable energy data centers, carbon offset programs |
|
|
- **Inference Efficiency:** Optimized for low-latency, energy-efficient deployment |
|
|
|
|
|
### Attribution and Licensing |
|
|
- All training data sourced from permissively licensed repositories |
|
|
- Respects original authors' licensing terms |
|
|
- Provides attribution capabilities in generated code comments |
|
|
- Excludes copyleft-licensed code to prevent license contamination |
|
|
|
|
|
### Dual-Use Concerns |
|
|
The model could potentially be misused for: |
|
|
- Generating malicious code or exploits |
|
|
- Automating spam or phishing campaigns |
|
|
- Creating code to circumvent security measures |
|
|
|
|
|
**Mitigation Strategies:** |
|
|
- Refusal training for malicious code generation requests |
|
|
- Usage monitoring and rate limiting |
|
|
- Terms of service enforcement |
|
|
- Community reporting mechanisms |
|
|
- Collaboration with security researchers |
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the **Apache License 2.0**. |
|
|
|
|
|
### License Terms Summary |
|
|
- **Commercial Use:** Permitted |
|
|
- **Modification:** Permitted |
|
|
- **Distribution:** Permitted |
|
|
- **Patent Use:** Permitted |
|
|
- **Private Use:** Permitted |
|
|
|
|
|
**Conditions:** |
|
|
- License and copyright notice must be included |
|
|
- State changes made to the code |
|
|
- Provide attribution to original authors |
|
|
|
|
|
**Limitations:** |
|
|
- No trademark use |
|
|
- No liability or warranty |
|
|
|
|
|
See the [LICENSE](LICENSE) file for full details. |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use Troviku-1.1 in your research or projects, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{troviku2025, |
|
|
title={Troviku-1.1: A Specialized Code Generation Model}, |
|
|
author={OpenTrouter Research Team}, |
|
|
year={2025}, |
|
|
publisher={OpenTrouter}, |
|
|
howpublished={\url{https://github.com/OpenTrouter/Troviku-1.1}}, |
|
|
note={Apache License 2.0} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Support and Community |
|
|
|
|
|
- **Documentation:** [https://docs.opentrouter.ai/troviku](https://docs.opentrouter.ai/troviku) |
|
|
- **Issues:** [GitHub Issues](https://github.com/OpenTrouter/Troviku-1.1/issues) |
|
|
- **Discord:** [OpenTrouter Community](https://discord.gg/opentrouter) |
|
|
- **Email:** [email protected] |
|
|
- **Twitter:** [@OpenTrouter](https://twitter.com/opentrouter) |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
The Troviku team acknowledges: |
|
|
- The open-source community for providing training data |
|
|
- BigCode project for The Stack v2 dataset |
|
|
- Hugging Face for infrastructure and hosting |
|
|
- NVIDIA for compute support |
|
|
- All contributors who helped with model evaluation and testing |
|
|
|
|
|
## Version History |
|
|
|
|
|
### v1.1.0 (Current - January 15, 2025) |
|
|
- Initial release of the Troviku series |
|
|
- Support for 25+ programming languages |
|
|
- Optimized inference performance |
|
|
- Enhanced code quality and safety features |
|
|
- RLHF alignment for improved code generation |
|
|
|
|
|
### Upcoming Features (v1.2.0) |
|
|
- Extended context window to 16,384 tokens |
|
|
- Improved multi-file code understanding |
|
|
- Enhanced support for rare programming languages |
|
|
- Better handling of code comments and documentation |
|
|
- Integration with popular IDEs |