---
license: mit
language:
- en
library_name: transformers
base_model:
- FacebookAI/roberta-base
tags:
- token-classification
- ner
- plants
- botany
- roberta
- biology
- horticulture
datasets:
- custom
widget:
- text: "I have a Rosa damascena and some Quercus alba trees in my garden."
  example_title: "Scientific plant names"
- text: "My hibiscus and pachypodium plants need watering."
  example_title: "Common plant names"
- text: "The beautiful roses are blooming next to the oak tree."
  example_title: "Mixed plant references"
pipeline_tag: token-classification
model-index:
- name: roberta-plant-ner
  results:
  - task:
      type: token-classification
      name: Token Classification
    dataset:
      type: custom
      name: Plant NER Dataset
    metrics:
    - type: f1
      value: 0.92
      name: F1 Score
    - type: precision
      value: 0.90
      name: Precision
    - type: recall
      value: 0.94
      name: Recall
---

# RoBERTa Plant Named Entity Recognition

## Model Description

This model is a fine-tuned version of [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base) for **plant named entity recognition**. It identifies and classifies plant names in text into two categories:

- **PLANT_COMMON**: Common names for plants (e.g., "rose", "hibiscus", "oak tree")
- **PLANT_SCI**: Scientific/botanical names (e.g., "Rosa damascena", "Quercus alba")

## Intended Uses & Limitations

### Intended Uses

- **Botanical text analysis**: Extract plant mentions from research papers, articles, and documentation
- **Gardening applications**: Identify plants mentioned in gardening guides, forums, and care instructions
- **Agricultural text processing**: Parse agricultural documents and reports
- **Educational tools**: Assist in botany and horticulture education
- **Content management**: Automatically tag and categorize plant-related content

### Limitations

- Trained primarily on English text
- May have lower accuracy on rare or highly specialized plant species
- Performance may vary on informal text, social media, or heavily abbreviated content
- Does not distinguish between live plants and plant products (e.g., "rose oil")

## Training Data

The model was trained on a custom dataset containing:

- Botanical literature and research papers
- Gardening guides and plant care instructions
- Agricultural documents
- Horticultural databases
- Plant identification guides

**Data Format**: CoNLL-style IOB2 tagging with whole-word tokenization

**Training Examples**: Thousands of annotated sentences containing plant references

## Training Procedure

### Training Hyperparameters

- **Base Model**: FacebookAI/roberta-base
- **Training Framework**: Hugging Face Transformers
- **Tokenization**: RoBERTa tokenizer with whole-word label alignment
- **Label Encoding**: IOB2 (Inside-Outside-Beginning) format
- **Sequence Length**: 512 tokens maximum
- **Batch Size**: Chosen for training efficiency
- **Learning Rate**: Adaptive with warmup
- **Training Epochs**: Multiple epochs with early stopping

### Label Schema

```
O                # Outside any plant entity
B-PLANT_COMMON   # Beginning of common plant name
I-PLANT_COMMON   # Inside/continuation of common plant name
B-PLANT_SCI      # Beginning of scientific plant name
I-PLANT_SCI      # Inside/continuation of scientific plant name
```

### Training Features

- **Whole-word label alignment**: Word-level IOB2 tags are aligned to RoBERTa sub-word tokens so multi-token plant names keep consistent labels (see the sketch after this list)
- **B-I-O validation**: Automatic correction of invalid tag sequences
- **Class balancing**: Weighted sampling for entity type balance
- **Data augmentation**: Synthetic examples for robustness
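The training script itself is not published here, so the following is only a minimal sketch of how word-level IOB2 tags are typically aligned to RoBERTa sub-word tokens with the Hugging Face fast tokenizer. The function name, example sentence, and the choice to tag continuation sub-words with `I-` labels are illustrative assumptions, not taken from the actual training code.

```python
from transformers import AutoTokenizer

# Label set matching the schema above
LABELS = ["O", "B-PLANT_COMMON", "I-PLANT_COMMON", "B-PLANT_SCI", "I-PLANT_SCI"]
label2id = {label: i for i, label in enumerate(LABELS)}

# RoBERTa needs add_prefix_space=True when the input is pre-split into words
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base", add_prefix_space=True)

def align_labels(words, word_labels):
    """Tokenize pre-split words and expand word-level IOB2 tags to sub-word tokens."""
    encoding = tokenizer(words, is_split_into_words=True, truncation=True, max_length=512)
    aligned = []
    previous_word = None
    for word_idx in encoding.word_ids():
        if word_idx is None:                # special tokens (<s>, </s>)
            aligned.append(-100)            # ignored by the loss
        elif word_idx != previous_word:     # first sub-word of a word keeps its tag
            aligned.append(label2id[word_labels[word_idx]])
        else:                               # continuation sub-word: B- becomes I-
            aligned.append(label2id[word_labels[word_idx].replace("B-", "I-")])
        previous_word = word_idx
    return encoding, aligned

words = ["My", "Rosa", "damascena", "is", "blooming"]
tags = ["O", "B-PLANT_SCI", "I-PLANT_SCI", "O", "O"]
_, label_ids = align_labels(words, tags)
print(label_ids)
```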
## Evaluation

The model achieves strong performance on plant entity recognition:

| Metric | Overall | PLANT_COMMON | PLANT_SCI |
|--------|---------|--------------|-----------|
| **Precision** | 0.90 | 0.88 | 0.92 |
| **Recall** | 0.94 | 0.96 | 0.91 |
| **F1-Score** | 0.92 | 0.92 | 0.91 |

### Performance Notes

- Excellent recall for common plant names (0.96)
- Strong precision for scientific names (0.92)
- Robust performance across different text types

## Usage

### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
model_name = "Dudeman523/roberta-plant-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create pipeline
ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

# Extract plant entities
text = "I love my Rosa damascena roses and the old oak tree in my garden."
entities = ner_pipeline(text)

for entity in entities:
    print(f"Plant: {entity['word']} | Type: {entity['entity_group']} | Confidence: {entity['score']:.2f}")
```

### Advanced Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained("Dudeman523/roberta-plant-ner")
model = AutoModelForTokenClassification.from_pretrained("Dudeman523/roberta-plant-ner")

# Tokenize input
text = "The Pachypodium lamerei succulent needs minimal watering."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Process results
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predicted_labels = torch.argmax(predictions, dim=-1)[0]

for token, label_id in zip(tokens, predicted_labels):
    label = model.config.id2label[label_id.item()]
    if label != "O":
        print(f"Token: {token} | Label: {label}")
```
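The advanced example prints raw RoBERTa sub-word tokens. If you want word-level entity spans without the pipeline's `aggregation_strategy="simple"`, the token predictions can be regrouped by character offsets. The sketch below is illustrative only; the variable names and the simple B-/I- merging rule are assumptions, not part of the released code.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "Dudeman523/roberta-plant-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "The Pachypodium lamerei succulent needs minimal watering."

# Offsets map each token back to a character span in the original text
encoding = tokenizer(text, return_tensors="pt", return_offsets_mapping=True, truncation=True)
offsets = encoding.pop("offset_mapping")[0].tolist()

with torch.no_grad():
    logits = model(**encoding).logits[0]
labels = [model.config.id2label[i] for i in logits.argmax(dim=-1).tolist()]

# Merge consecutive B-/I- tokens of the same entity type into character spans
entities, current = [], None
for (start, end), label in zip(offsets, labels):
    if start == end:                 # special tokens have empty spans
        current = None
        continue
    if label == "O":
        current = None
    elif label.startswith("B-") or current is None or current["type"] != label[2:]:
        current = {"type": label[2:], "start": start, "end": end}
        entities.append(current)
    else:                            # I- tag continuing the open entity
        current["end"] = end

for ent in entities:
    print(ent["type"], "->", text[ent["start"]:ent["end"]])
```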
### Batch Processing

```python
# Process multiple texts efficiently, reusing the ner_pipeline from Quick Start
texts = [
    "My hibiscus is blooming beautifully this spring.",
    "Quercus alba and Acer saccharum are common in this forest.",
    "I need care instructions for my Rosa damascena plant."
]

# Batch prediction
results = ner_pipeline(texts)

for i, (text, entities) in enumerate(zip(texts, results)):
    print(f"\nText {i+1}: {text}")
    for entity in entities:
        print(f"  🌱 {entity['word']} ({entity['entity_group']}) - {entity['score']:.2f}")
```

## Model Architecture

- **Base Architecture**: RoBERTa (Robustly Optimized BERT Pretraining Approach)
- **Parameters**: ~125M
- **Layers**: 12 transformer layers
- **Hidden Size**: 768
- **Attention Heads**: 12
- **Vocabulary**: 50,265 tokens
- **Classification Head**: Linear layer for 5-class token classification

## Ethical Considerations

### Bias and Fairness

- Model may reflect geographical and cultural biases present in training data
- Potential underrepresentation of plants from certain regions or cultures
- May perform better on commonly cultivated plants than on wild or rare species

### Environmental Impact

- Training computational cost: Moderate (fine-tuning only)
- Inference efficiency: Optimized for production use
- Carbon footprint: Minimal incremental impact over base model

## Technical Specifications

- **Input**: Text sequences up to 512 tokens
- **Output**: Token-level classifications with confidence scores
- **Inference Speed**: ~100-500 texts/second (depending on hardware)
- **Memory Requirements**: ~500MB RAM for inference
- **Supported Formats**: Raw text, tokenized input

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{roberta-plant-ner,
  title={RoBERTa Plant Named Entity Recognition Model},
  author={Dudeman523},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/Dudeman523/roberta-plant-ner}
}
```

## Contact

For questions, issues, or collaboration opportunities, please open an issue on the model repository or contact the model author.

---

**Model Version**: 1.0
**Last Updated**: December 2024
**Framework Compatibility**: transformers >= 4.21.0