Dudeman523/RoBERTa_ner_plant_names_onnx

+---
+language: en
+license: mit
+library_name: transformers
+base_model:
+- FacebookAI/roberta-base
+tags:
+- token-classification
+- ner
+- plants
+- botany
+- roberta
+- biology
+- horticulture
+datasets:
+- custom
+widget:
+- text: "I have a Rosa damascena and some Quercus alba trees in my garden."
+  example_title: "Scientific plant names"
+- text: "My hibiscus and pachypodium plants need watering."
+  example_title: "Common plant names"
+- text: "The beautiful roses are blooming next to the oak tree."
+  example_title: "Mixed plant references"
+pipeline_tag: token-classification
+model-index:
+- name: roberta-plant-ner
+  results:
+  - task:
+      type: token-classification
+      name: Token Classification
+    dataset:
+      type: custom
+      name: Plant NER Dataset
+    metrics:
+    - type: f1
+      value: 0.92
+      name: F1 Score
+    - type: precision
+      value: 0.90
+      name: Precision
+    - type: recall
+      value: 0.94
+      name: Recall
+---
+# RoBERTa Plant Named Entity Recognition
+## Model Description
+This model is a fine-tuned version of [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base) for **plant named entity recognition**. It identifies and classifies plant names in text into two categories:
+- **PLANT_COMMON**: Common names for plants (e.g., "rose", "hibiscus", "oak tree")
+- **PLANT_SCI**: Scientific/botanical names (e.g., "Rosa damascena", "Quercus alba")
+## Intended Uses & Limitations
+### Intended Uses
+- **Botanical text analysis**: Extract plant mentions from research papers, articles, and documentation
+- **Gardening applications**: Identify plants mentioned in gardening guides, forums, and care instructions
+- **Agricultural text processing**: Parse agricultural documents and reports
+- **Educational tools**: Assist in botany and horticulture education
+- **Content management**: Automatically tag and categorize plant-related content
+### Limitations
+- Trained primarily on English text
+- May have lower accuracy on rare or highly specialized plant species
+- Performance may vary on informal text, social media, or heavily abbreviated content
+- Does not distinguish between live plants and plant products (e.g., "rose oil")
+## Training Data
+The model was trained on a custom dataset containing:
+- Botanical literature and research papers
+- Gardening guides and plant care instructions
+- Agricultural documents
+- Horticultural databases
+- Plant identification guides
+
+**Data Format**: CoNLL-style IOB2 tagging with whole-word tokenization
+
+**Training Examples**: Thousands of annotated sentences containing plant references
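
To illustrate the CoNLL-style IOB2 format, here is a minimal sketch of a reader. It assumes one whitespace-separated `token label` pair per line and a blank line between sentences; the actual column layout of the training files is not documented in this card, and both `parse_conll` and the sample text are hypothetical.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]

def parse_conll(text: str) -> List[Sentence]:
    """Split CoNLL-style text into sentences of (token, label) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumed layout: token in the first column, label in the last.
        parts = line.split()
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences

# Invented sample, not taken from the training data.
sample = """\
Quercus B-PLANT_SCI
alba I-PLANT_SCI
grows O

My O
hibiscus B-PLANT_COMMON
"""

parsed = parse_conll(sample)
print(parsed[0])
```
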
+## Training Procedure
+### Training Hyperparameters
+- **Base Model**: FacebookAI/roberta-base
+- **Training Framework**: Hugging Face Transformers
+- **Tokenization**: RoBERTa tokenizer with whole-word alignment
+- **Label Encoding**: IOB2 (Inside-Outside-Beginning) format
+- **Sequence Length**: 512 tokens maximum
+- **Batch Size**: Exact value not reported; chosen for training efficiency
+- **Learning Rate**: Exact value not reported; adaptive schedule with warmup
+- **Training Epochs**: Exact count not reported; multiple epochs with early stopping
+### Label Schema
+```
+O              # Outside any plant entity
+B-PLANT_COMMON # Beginning of common plant name
+I-PLANT_COMMON # Inside/continuation of common plant name
+B-PLANT_SCI    # Beginning of scientific plant name
+I-PLANT_SCI    # Inside/continuation of scientific plant name
+```
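
Transformers token-classification configs store a schema like this as `id2label` and `label2id` mappings. The sketch below builds both from the five labels listed above. The integer order is an assumption made for illustration; the authoritative mapping is whatever `model.config.id2label` returns for the published checkpoint.

```python
LABELS = ["O", "B-PLANT_COMMON", "I-PLANT_COMMON", "B-PLANT_SCI", "I-PLANT_SCI"]

# Assumed ordering: ids follow the order of the schema listing.
id2label = dict(enumerate(LABELS))
label2id = {label: idx for idx, label in id2label.items()}

print(id2label)
print(label2id["B-PLANT_SCI"])
```
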
+### Training Features
+- **Whole-word tokenization**: Ensures proper handling of plant names
+- **B-I-O validation**: Automatic correction of invalid tag sequences
+- **Class balancing**: Weighted sampling for entity type balance
+- **Data augmentation**: Synthetic examples for robustness
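
The card does not say how invalid tag sequences are corrected. One common convention, shown here as a sketch and not as the author's implementation, promotes an `I-` tag to `B-` whenever it does not continue an entity of the same type. `repair_iob2` is a hypothetical helper name.

```python
from typing import List

def repair_iob2(tags: List[str]) -> List[str]:
    """Return a copy of `tags` in which every I- tag continues a same-type entity."""
    repaired: List[str] = []
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity_type = tag[2:]
            # An I- tag is valid only directly after B-<type> or I-<type>.
            if previous not in (f"B-{entity_type}", f"I-{entity_type}"):
                tag = f"B-{entity_type}"
        repaired.append(tag)
        previous = tag
    return repaired

print(repair_iob2(["O", "I-PLANT_SCI", "I-PLANT_SCI", "I-PLANT_COMMON", "O"]))
```

Sequences that are already valid IOB2 pass through unchanged, so the function can be applied to every example without checking validity first.
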
+## Evaluation
+Reported scores on the custom Plant NER dataset, overall and per entity type:
+| Metric | Overall | PLANT_COMMON | PLANT_SCI |
+|--------|---------|--------------|-----------|
+| **Precision** | 0.90 | 0.88 | 0.92 |
+| **Recall** | 0.94 | 0.96 | 0.91 |
+| **F1-Score** | 0.92 | 0.92 | 0.91 |
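
F1 is the harmonic mean of precision and recall, `F1 = 2 * P * R / (P + R)`. As a quick consistency check, the sketch below recomputes the F1 row from the precision and recall rows. The numbers are copied from the table; the helper name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the table above.
reported = {
    "Overall": (0.90, 0.94),
    "PLANT_COMMON": (0.88, 0.96),
    "PLANT_SCI": (0.92, 0.91),
}

for name, (precision, recall) in reported.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.2f}")
```

Rounded to two decimals, the recomputed values agree with the reported F1 row.
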
+### Performance Notes
+- Recall is highest for common plant names (0.96), with lower precision (0.88)
+- Precision is highest for scientific names (0.92)
+- No per-text-type breakdown is reported; see Limitations for text types where performance may vary
+## Usage
+### Quick Start
+```python
+from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
+# Load model and tokenizer
+model_name = "Dudeman523/roberta-plant-ner"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForTokenClassification.from_pretrained(model_name)
+# Create pipeline
+ner_pipeline = pipeline(
+    "token-classification",
+    model=model,
+    tokenizer=tokenizer,
+    aggregation_strategy="simple"
+)
+# Extract plant entities
+text = "I love my Rosa damascena roses and the old oak tree in my garden."
+entities = ner_pipeline(text)
+for entity in entities:
+    print(f"Plant: {entity['word']} | Type: {entity['entity_group']} | Confidence: {entity['score']:.2f}")
+```
+### Advanced Usage
+```python
+from transformers import AutoTokenizer, AutoModelForTokenClassification
+import torch
+# Load model
+tokenizer = AutoTokenizer.from_pretrained("Dudeman523/roberta-plant-ner")
+model = AutoModelForTokenClassification.from_pretrained("Dudeman523/roberta-plant-ner")
+# Tokenize input
+text = "The Pachypodium lamerei succulent needs minimal watering."
+inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
+# Get predictions
+with torch.no_grad():
+    outputs = model(**inputs)
+    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
+# Process results
+tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
+predicted_labels = torch.argmax(predictions, dim=-1)[0]
+for token, label_id in zip(tokens, predicted_labels):
+    label = model.config.id2label[label_id.item()]
+    if label != "O":
+        print(f"Token: {token} | Label: {label}")
+```
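
The loop above prints one line per subword token. To recover whole entity strings, the token-level labels can be merged into spans. The sketch below is one way to do that, assuming the labels are consistent across the subwords of a word. RoBERTa's byte-level BPE marks a token that follows a space with a leading `Ġ`; the token split and labels in the example are invented for illustration and may differ from what the real tokenizer and model produce. `group_entities` is a hypothetical helper.

```python
from typing import List, Optional, Tuple

SPECIAL_TOKENS = {"<s>", "</s>", "<pad>"}

def group_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level IOB2 labels into (text, entity_type) spans."""
    spans: List[Tuple[str, str]] = []
    pieces: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        nonlocal pieces, current_type
        if pieces and current_type is not None:
            # "Ġ" stands for the space that preceded the token.
            text = "".join(pieces).replace("Ġ", " ").strip()
            spans.append((text, current_type))
        pieces, current_type = [], None

    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS or label == "O":
            flush()
            continue
        prefix, entity_type = label.split("-", 1)
        # A B- tag or a change of type starts a new span.
        if prefix == "B" or entity_type != current_type:
            flush()
            current_type = entity_type
        pieces.append(token)
    flush()
    return spans

# Invented tokenization and labels for illustration only.
tokens = ["<s>", "The", "ĠP", "achy", "podium", "Ġlam", "ere", "i", "Ġsucculent", "</s>"]
labels = ["O", "O", "B-PLANT_SCI"] + ["I-PLANT_SCI"] * 5 + ["O", "O"]
print(group_entities(tokens, labels))
```

In practice the `pipeline` call from Quick Start with `aggregation_strategy="simple"` performs a similar merge; a manual version like this is mainly useful when working with raw logits.
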
+### Batch Processing
+```python
+# Process multiple texts in one call (reuses ner_pipeline from Quick Start)
+texts = [
+    "My hibiscus is blooming beautifully this spring.",
+    "Quercus alba and Acer saccharum are common in this forest.",
+    "I need care instructions for my Rosa damascena plant."
+]
+# Batch prediction
+results = ner_pipeline(texts)
+for i, (text, entities) in enumerate(zip(texts, results)):
+    print(f"\nText {i+1}: {text}")
+    for entity in entities:
+        print(f"  🌱 {entity['word']} ({entity['entity_group']}) - {entity['score']:.2f}")
+```
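
Pipeline results are plain dictionaries, so they can be post-processed without the model. The sketch below groups entity strings by type and drops low-confidence spans. The `0.5` threshold is an arbitrary assumption, the sample scores are invented, and `names_by_type` is a hypothetical helper; the `entity_group`, `score`, and `word` keys are the ones used in the examples above.

```python
from collections import defaultdict
from typing import Dict, List

def names_by_type(entities: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Group entity texts by entity_group, dropping low-confidence spans."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            # Aggregated words can carry a leading space, so strip it.
            grouped[entity["entity_group"]].append(entity["word"].strip())
    return dict(grouped)

# Hand-written stand-in for one pipeline result; scores are invented.
sample = [
    {"entity_group": "PLANT_SCI", "score": 0.98, "word": " Quercus alba"},
    {"entity_group": "PLANT_SCI", "score": 0.41, "word": " Acer"},
    {"entity_group": "PLANT_COMMON", "score": 0.87, "word": " hibiscus"},
]
print(names_by_type(sample))
```
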
+## Model Architecture
+- **Base Architecture**: RoBERTa (Robustly Optimized BERT Pretraining Approach)
+- **Parameters**: ~125M parameters
+- **Layers**: 12 transformer layers
+- **Hidden Size**: 768
+- **Attention Heads**: 12
+- **Vocabulary**: 50,265 tokens
+- **Classification Head**: Linear layer for 5-class token classification
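
The classification head is small next to the encoder. As a worked example, the parameter count of a single linear layer follows directly from the hidden size and the number of labels listed above:

```python
hidden_size = 768  # from the list above
num_labels = 5     # O plus B-/I- tags for the two entity types

# A linear layer has one weight per (input, output) pair plus one bias per output.
head_parameters = hidden_size * num_labels + num_labels
print(head_parameters)
```

That is a few thousand parameters on top of the roughly 125M in the encoder.
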
+## Ethical Considerations
+### Bias and Fairness
+- Model may reflect geographical and cultural biases present in training data
+- Potential underrepresentation of plants from certain regions or cultures
+- May perform better on commonly cultivated plants versus wild or rare species
+### Environmental Impact
+- Training computational cost: Moderate (fine-tuning only)
+- Inference efficiency: Optimized for production use
+- Carbon footprint: Minimal incremental impact over base model
+## Technical Specifications
+- **Input**: Text sequences up to 512 tokens
+- **Output**: Token-level classifications with confidence scores
+- **Inference Speed**: ~100-500 texts/second (depending on hardware)
+- **Memory Requirements**: ~500MB RAM for inference
+- **Supported Formats**: Raw text, tokenized input
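
The ~500MB figure is consistent with holding about 125M parameters as 32-bit floats. The sketch below shows that arithmetic; it assumes the figure refers to weights only, so activations and runtime overhead would come on top and are not estimated here.

```python
PARAMETERS = 125_000_000  # approximate count from Model Architecture

def weight_megabytes(bytes_per_parameter: int) -> float:
    """Size of the weights alone, in megabytes (10**6 bytes)."""
    return PARAMETERS * bytes_per_parameter / 1_000_000

print(weight_megabytes(4))  # float32 weights
print(weight_megabytes(2))  # float16 weights
```
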
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@misc{roberta-plant-ner,
+  title={RoBERTa Plant Named Entity Recognition Model},
+  author={Dudeman523},
+  year={2024},
+  publisher={Hugging Face},
+  url={https://huggingface.co/Dudeman523/roberta-plant-ner}
+}
+```
+## Contact
+For questions, issues, or collaboration opportunities, please open an issue on the model repository or contact the model author.
+---
+**Model Version**: 1.0
+**Last Updated**: December 2024
+**Framework Compatibility**: transformers >= 4.21.0