---
language:
- es
- pt
- en
license: apache-2.0
library_name: transformers
tags:
- perplexity-estimation
- tensorrt
- data-quality-assessment
- dataset-contamination-detection
- a100-optimized
- curriculum-learning
- mlops
pipeline_tag: text-classification
---

# latam-gpt/Wayra-Perplexity-Estimator-55M

**A100-optimized TensorRT version** of WayraPPL for high-throughput perplexity estimation.

![WayraPPL Architecture](./architecture.png)

## Use Cases

- **High-throughput data quality assessment**: Evaluate the quality of massive datasets by measuring text perplexity at 50,000+ samples/sec
- **Real-time perplexity estimation**: Sub-millisecond perplexity computation for live content filtering and moderation
- **Large-scale dataset cleaning**: Process millions of documents to remove low-quality samples before model training
- **Curriculum learning**: Rank training examples by difficulty using perplexity for progressive learning (see the sketch at the end of this card)
- **Semantic filtering**: Select semantically relevant content using perplexity thresholds
- **Production MLOps pipelines**: Automated data quality gates in production ML workflows

## Hardware Requirements

**This model requires an NVIDIA A100 GPU with:**

- GPU architecture: sm_80 (A100-80GB)
- CUDA: 12.8+
- TensorRT: 10.13.x
- Driver: 570.124.06+

## Performance

![TensorRT Performance](./benchmarks.png)

- **Throughput**: ~50,000+ samples/sec (A100)
- **Latency**: <1 ms per sample
- **Batch size**: up to 2048
- **Memory**: ~2 GB GPU memory

## Model Versions

| Version | Throughput | Latency | Memory | Use Case |
|---------|------------|---------|--------|----------|
| **TensorRT (A100)** | **~50,000/sec** | **<1 ms** | **2 GB** | **Production inference** |
| PyTorch Standard | ~1,000/sec | 10 ms | 4 GB | Research & development |

## Installation

```bash
# Install requirements (A100 + CUDA 12.8+ required)
pip install -r tensorrt_requirements.txt

# Verify TensorRT installation
python -c "import tensorrt; print(tensorrt.__version__)"  # Should print 10.13.x
```

## Usage

### TensorRT Engine (High Performance) - RECOMMENDED

```python
from tensorrt_inference import WayraPPLTensorRT
from transformers import AutoTokenizer

# Load TensorRT model (A100 required)
model = WayraPPLTensorRT("wayrappl_fp16_bs2048.engine")
tokenizer = AutoTokenizer.from_pretrained("latam-gpt/Wayra-Perplexity-Estimator-55M")

# Multilingual examples
texts = [
    # Spanish
    "La inteligencia artificial está transformando el mundo.",
    # Portuguese
    "A tecnologia blockchain promete revolucionar sistemas financeiros.",
    # English
    "Natural language processing enables human-computer communication.",
]

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.infer(inputs['input_ids'].numpy(), inputs['attention_mask'].numpy())

for i, text in enumerate(texts):
    print(f"Text: {text}")
    print(f"Perplexity: {outputs['ppl'][i]:.2f}\n")
```
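For dataset-scale runs, the same `infer` call can be driven in fixed-size batches up to the engine's maximum of 2048. Below is a minimal sketch of perplexity-based filtering built on the interface shown above; the `PPL_THRESHOLD` value is illustrative, not a recommendation, and should be tuned per corpus.

```python
from tensorrt_inference import WayraPPLTensorRT
from transformers import AutoTokenizer

BATCH_SIZE = 2048      # engine maximum (see the Model Versions table)
PPL_THRESHOLD = 100.0  # illustrative cutoff, tune for your data

model = WayraPPLTensorRT("wayrappl_fp16_bs2048.engine")
tokenizer = AutoTokenizer.from_pretrained("latam-gpt/Wayra-Perplexity-Estimator-55M")

def filter_low_quality(texts):
    """Keep only texts whose estimated perplexity is below the threshold."""
    kept = []
    for start in range(0, len(texts), BATCH_SIZE):
        batch = texts[start:start + BATCH_SIZE]
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=512)
        outputs = model.infer(inputs["input_ids"].numpy(),
                              inputs["attention_mask"].numpy())
        kept.extend(t for t, ppl in zip(batch, outputs["ppl"])
                    if ppl < PPL_THRESHOLD)
    return kept
```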
### PyTorch Model (Standard)

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("latam-gpt/Wayra-Perplexity-Estimator-55M")
model = AutoModel.from_pretrained("latam-gpt/Wayra-Perplexity-Estimator-55M")

texts = ["Your text here"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs)

print(f"PPL: {outputs['ppl']}")
```

### Performance Comparison: 100K Examples

- TensorRT: ~2 hours for 100,000 examples
- PyTorch: ~28 hours for 100,000 examples
- Speedup: 14x faster with TensorRT

## Files Included

### TensorRT Engine (A100-optimized) - PRIMARY

- **`wayrappl_fp16_bs2048.engine`** - TensorRT engine (A100 only)
- **`tensorrt_config.json`** - Engine configuration
- **`tensorrt_inference.py`** - Inference code with multilingual examples
- **`tensorrt_requirements.txt`** - Dependencies

### PyTorch Model (Standard HuggingFace format)

- `pytorch_model.bin` - Model weights
- `config.json` - Model configuration
- `tokenizer.json` - Tokenizer

## TensorRT Optimizations

![TensorRT Optimizations](./tensorrt_optimization.png)

The A100-optimized engine includes the following optimizations:

**Layer fusion:**

- Embedding + positional encoding → single kernel
- LayerNorm + Linear → combined operation
- Attention QKV projections → single matrix multiplication
- Multi-head attention → fused attention kernel

**Memory optimizations:**

- Intermediate attention matrices eliminated
- Key/value cache optimized for batch processing
- Activation recomputation removed (activations stored in an optimized layout)

**Graph optimizations:**

- Constant folding on positional embeddings
- Dead-code elimination of unused heads
- Operator fusion for the perplexity computation

## Benchmarks (A100)

| Model Type | Throughput | Latency | Memory | GPU Util | 100K Examples |
|------------------|------------|---------|--------|----------|---------------|
| **Wayra TensorRT** | **~50,000/sec** | **<1 ms** | **2 GB** | **95%** | **~2 hours** |
| Wayra PyTorch | ~1,000/sec | 10 ms | 4 GB | 60% | ~28 hours |
| Llama 3 1B | ~200/sec | 50 ms | 8 GB | 40% | ~139 hours |

## Model Details

- **Base**: Knowledge distillation from meta-llama/Llama-3.2-1B
- **Architecture**: GPT-2-based Transformer blocks with perplexity heads
- **Languages**: Spanish, Portuguese, English
- **Max length**: 512 tokens
- **Precision**: **FP16 (TensorRT)**, FP32 (PyTorch)
- **Parameters**: 55M

## Troubleshooting

**"TensorRT engine not compatible"**

- Ensure you are using an A100-SXM4-80GB GPU (sm_80 architecture)
- Check the CUDA version: `nvidia-smi` (should be 12.8+)
- Verify TensorRT: `python -c "import tensorrt; print(tensorrt.__version__)"` (should be 10.13.x)
- Confirm the driver version: `nvidia-smi` (should be 570.124.06+)

**"CUDA out of memory"**

- Reduce the inference batch size
- Use shorter sequence lengths
- Monitor GPU memory: `nvidia-smi -l 1`

**"Import tensorrt failed"**

- Reinstall TensorRT: `pip uninstall tensorrt && pip install tensorrt==10.13.0`
- Check CUDA compatibility
- Verify that `LD_LIBRARY_PATH` includes the TensorRT libraries

**Performance not as expected**

- Ensure the GPU is not throttling: `nvidia-smi -q -d PERFORMANCE`
- Use a dedicated GPU (not shared)
- Enable persistence mode: `nvidia-smi -pm 1`
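Before debugging the engine itself, it can help to confirm that the environment matches the hardware requirements above. A minimal sanity check along these lines (a sketch; the expected values mirror the Hardware Requirements section):

```python
import tensorrt
import torch

# An A100 reports compute capability 8.0 (sm_80)
major, minor = torch.cuda.get_device_capability(0)
assert (major, minor) == (8, 0), f"Expected sm_80 (A100), got sm_{major}{minor}"

# Expect CUDA 12.8+ and TensorRT 10.13.x per the hardware requirements
print("GPU:     ", torch.cuda.get_device_name(0))
print("CUDA:    ", torch.version.cuda)
print("TensorRT:", tensorrt.__version__)
```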
## Citation

```bibtex
@software{WayraPPL,
  title={WayraPPL: High-Performance Perplexity Estimation of Data Novelty},
  author={Omar U. Florez and LatamGPT Team},
  year={2025},
  url={https://huggingface.co/latam-gpt/Wayra-Perplexity-Estimator-55M}
}
```

## References

- Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (2022)
- Narang et al., "Do Transformer Modifications Transfer Across Implementations and Applications?" (2021)
- Rabe & Staats, "Self-attention Does Not Need O(n²) Memory" (2021)
- Pope et al., "Efficiently Scaling Transformer Inference" (2022)

## License

Apache 2.0 - see the LICENSE file.

---

**Note**: This model is optimized for A100 GPUs. For other GPUs, use the PyTorch version or rebuild the TensorRT engine for your specific hardware.
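As the note above suggests, the PyTorch checkpoint is the portable option. Below is a minimal, GPU-agnostic sketch of the curriculum-learning use case listed earlier, assuming the `AutoModel` interface from the usage example (a per-example `ppl` output); sorting by ascending perplexity puts the easiest examples first.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("latam-gpt/Wayra-Perplexity-Estimator-55M")
model = AutoModel.from_pretrained("latam-gpt/Wayra-Perplexity-Estimator-55M")

def curriculum_order(texts):
    """Order training examples from lowest (easiest) to highest perplexity."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    ppls = outputs["ppl"].tolist()  # assumes a 1-D tensor of per-example perplexities
    return [text for _, text in sorted(zip(ppls, texts))]
```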