---
language:
- es
- pt
- en
license: apache-2.0
library_name: transformers
tags:
- perplexity-estimation
- tensorrt
- data-quality-assessment
- dataset-contamination-detection
- a100-optimized
- curriculum-learning
- mlops
pipeline_tag: text-classification
---

# latam-gpt/Wayra-Perplexity-Estimator-55M

**A100-optimized TensorRT version** of WayraPPL for high-throughput perplexity estimation.

![WayraPPL Architecture](./architecture.png)

## Use Cases

- **High-throughput data quality assessment**: Evaluate the quality of massive datasets by measuring text perplexity at 50,000+ samples/sec
- **Real-time perplexity estimation**: Sub-millisecond perplexity computation for live content filtering and moderation
- **Large-scale dataset cleaning**: Process millions of documents to remove low-quality samples before model training
- **Curriculum learning**: Rank training examples by difficulty using perplexity for progressive learning (see the sketch at the end of this card)
- **Semantic filtering**: Select semantically relevant content using perplexity thresholds
- **Production MLOps pipelines**: Automated data quality gates in production ML workflows

## Hardware Requirements

**This model requires an NVIDIA A100 GPU with:**

- GPU architecture: sm_80 (A100-80GB)
- CUDA: 12.8+
- TensorRT: 10.13.x
- Driver: 570.124.06+

## Performance

![TensorRT Performance](./benchmarks.png)

- **Throughput**: ~50,000+ samples/sec (A100)
- **Latency**: <1 ms per sample
- **Batch size**: up to 2048
- **Memory**: ~2 GB GPU memory

## Model Versions

| Version | Throughput | Latency | Memory | Use Case |
|---------|------------|---------|--------|----------|
| **TensorRT (A100)** | **~50,000/sec** | **<1 ms** | **2 GB** | **Production inference** |
| PyTorch Standard | ~1,000/sec | 10 ms | 4 GB | Research & development |

## Installation

```bash
# Install requirements (A100 + CUDA 12.8+ required)
pip install -r tensorrt_requirements.txt

# Verify TensorRT installation
python -c "import tensorrt; print(tensorrt.__version__)"  # Should print 10.13.x
```

## Usage

### TensorRT Engine (High Performance) - RECOMMENDED

```python
from tensorrt_inference import WayraPPLTensorRT
from transformers import AutoTokenizer

# Load TensorRT model (A100 required)
model = WayraPPLTensorRT("wayrappl_fp16_bs2048.engine")
tokenizer = AutoTokenizer.from_pretrained("latam-gpt/Wayra-Perplexity-Estimator-55M")

# Multilingual examples
texts = [
    # Spanish
    "La inteligencia artificial está transformando el mundo.",
    # Portuguese
    "A tecnologia blockchain promete revolucionar sistemas financeiros.",
    # English
    "Natural language processing enables human-computer communication.",
]

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.infer(inputs['input_ids'].numpy(), inputs['attention_mask'].numpy())

for i, text in enumerate(texts):
    print(f"Text: {text}")
    print(f"Perplexity: {outputs['ppl'][i]:.2f}\n")
```
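For dataset-scale runs, the same `infer` call can be driven in fixed-size batches up to the engine's maximum of 2048. Below is a minimal sketch of perplexity-based filtering built on the interface shown above; the `PPL_THRESHOLD` value is illustrative, not a recommendation, and should be tuned per corpus.

```python
from tensorrt_inference import WayraPPLTensorRT
from transformers import AutoTokenizer

BATCH_SIZE = 2048      # engine maximum (see the Model Versions table)
PPL_THRESHOLD = 100.0  # illustrative cutoff, tune for your data

model = WayraPPLTensorRT("wayrappl_fp16_bs2048.engine")
tokenizer = AutoTokenizer.from_pretrained("latam-gpt/Wayra-Perplexity-Estimator-55M")

def filter_low_quality(texts):
    """Keep only texts whose estimated perplexity is below the threshold."""
    kept = []
    for start in range(0, len(texts), BATCH_SIZE):
        batch = texts[start:start + BATCH_SIZE]
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=512)
        outputs = model.infer(inputs["input_ids"].numpy(),
                              inputs["attention_mask"].numpy())
        kept.extend(t for t, ppl in zip(batch, outputs["ppl"])
                    if ppl < PPL_THRESHOLD)
    return kept
```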
### PyTorch Model (Standard)

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("latam-gpt/Wayra-Perplexity-Estimator-55M")
model = AutoModel.from_pretrained("latam-gpt/Wayra-Perplexity-Estimator-55M")

texts = ["Your text here"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs)

print(f"PPL: {outputs['ppl']}")
```

### Performance Comparison: 100K Examples

- TensorRT: ~2 hours for 100,000 examples
- PyTorch: ~28 hours for 100,000 examples
- Speedup: 14x faster with TensorRT

## Files Included

### TensorRT Engine (A100-optimized) - PRIMARY

- **`wayrappl_fp16_bs2048.engine`** - TensorRT engine (A100 only)
- **`tensorrt_config.json`** - Engine configuration
- **`tensorrt_inference.py`** - Inference code with multilingual examples
- **`tensorrt_requirements.txt`** - Dependencies

### PyTorch Model (Standard HuggingFace format)

- `pytorch_model.bin` - Model weights
- `config.json` - Model configuration
- `tokenizer.json` - Tokenizer

## TensorRT Optimizations

![TensorRT Optimizations](./tensorrt_optimization.png)

The A100-optimized engine includes the following optimizations:

**Layer fusion:**

- Embedding + positional encoding → single kernel
- LayerNorm + Linear → combined operation
- Attention QKV projections → single matrix multiplication
- Multi-head attention → fused attention kernel

**Memory optimizations:**

- Intermediate attention matrices eliminated
- Key/value cache optimized for batch processing
- Activation recomputation removed (activations stored in an optimized layout)

**Graph optimizations:**

- Constant folding on positional embeddings
- Dead-code elimination of unused heads
- Operator fusion for the perplexity computation

## Benchmarks (A100)

| Model Type | Throughput | Latency | Memory | GPU Util | 100K Examples |
|------------------|------------|---------|--------|----------|---------------|
| **Wayra TensorRT** | **~50,000/sec** | **<1 ms** | **2 GB** | **95%** | **~2 hours** |
| Wayra PyTorch | ~1,000/sec | 10 ms | 4 GB | 60% | ~28 hours |
| Llama 3 1B | ~200/sec | 50 ms | 8 GB | 40% | ~139 hours |

## Model Details

- **Base**: Knowledge distillation from meta-llama/Llama-3.2-1B
- **Architecture**: GPT-2-based Transformer blocks with perplexity heads
- **Languages**: Spanish, Portuguese, English
- **Max length**: 512 tokens
- **Precision**: **FP16 (TensorRT)**, FP32 (PyTorch)
- **Parameters**: 55M

## Troubleshooting

**"TensorRT engine not compatible"**

- Ensure you are using an A100-SXM4-80GB GPU (sm_80 architecture)
- Check the CUDA version: `nvidia-smi` (should be 12.8+)
- Verify TensorRT: `python -c "import tensorrt; print(tensorrt.__version__)"` (should be 10.13.x)
- Confirm the driver version: `nvidia-smi` (should be 570.124.06+)

**"CUDA out of memory"**

- Reduce the inference batch size
- Use shorter sequence lengths
- Monitor GPU memory: `nvidia-smi -l 1`

**"Import tensorrt failed"**

- Reinstall TensorRT: `pip uninstall tensorrt && pip install tensorrt==10.13.0`
- Check CUDA compatibility
- Verify that `LD_LIBRARY_PATH` includes the TensorRT libraries

**Performance not as expected**

- Ensure the GPU is not throttling: `nvidia-smi -q -d PERFORMANCE`
- Use a dedicated GPU (not shared)
- Enable persistence mode: `nvidia-smi -pm 1`
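Before debugging the engine itself, it can help to confirm that the environment matches the hardware requirements above. A minimal sanity check along these lines (a sketch; the expected values mirror the Hardware Requirements section):

```python
import tensorrt
import torch

# An A100 reports compute capability 8.0 (sm_80)
major, minor = torch.cuda.get_device_capability(0)
assert (major, minor) == (8, 0), f"Expected sm_80 (A100), got sm_{major}{minor}"

# Expect CUDA 12.8+ and TensorRT 10.13.x per the hardware requirements
print("GPU:     ", torch.cuda.get_device_name(0))
print("CUDA:    ", torch.version.cuda)
print("TensorRT:", tensorrt.__version__)
```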
## Citation

```bibtex
@software{WayraPPL,
  title={WayraPPL: High-Performance Perplexity Estimation of Data Novelty},
  author={Omar U. Florez and LatamGPT Team},
  year={2025},
  url={https://huggingface.co/latam-gpt/Wayra-Perplexity-Estimator-55M}
}
```

## References

- Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (2022)
- Narang et al., "Do Transformer Modifications Transfer Across Implementations and Applications?" (2021)
- Rabe & Staats, "Self-attention Does Not Need O(n²) Memory" (2021)
- Pope et al., "Efficiently Scaling Transformer Inference" (2022)

## License

Apache 2.0 - see the LICENSE file.

---

**Note**: This model is optimized for A100 GPUs. For other GPUs, use the PyTorch version or rebuild the TensorRT engine for your specific hardware.
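As the note above suggests, the PyTorch checkpoint is the portable option. Below is a minimal, GPU-agnostic sketch of the curriculum-learning use case listed earlier, assuming the `AutoModel` interface from the usage example (a per-example `ppl` output); sorting by ascending perplexity puts the easiest examples first.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("latam-gpt/Wayra-Perplexity-Estimator-55M")
model = AutoModel.from_pretrained("latam-gpt/Wayra-Perplexity-Estimator-55M")

def curriculum_order(texts):
    """Order training examples from lowest (easiest) to highest perplexity."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    ppls = outputs["ppl"].tolist()  # assumes a 1-D tensor of per-example perplexities
    return [text for _, text in sorted(zip(ppls, texts))]
```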