---
license: mit
language:
- en
library_name: transformers
base_model:
- FacebookAI/roberta-base
tags:
- token-classification
- ner
- plants
- botany
- roberta
- biology
- horticulture
datasets:
- custom
widget:
- text: "I have a Rosa damascena and some Quercus alba trees in my garden."
  example_title: "Scientific plant names"
- text: "My hibiscus and pachypodium plants need watering."
  example_title: "Common plant names"
- text: "The beautiful roses are blooming next to the oak tree."
  example_title: "Mixed plant references"
pipeline_tag: token-classification
model-index:
- name: roberta-plant-ner
  results:
  - task:
      type: token-classification
      name: Token Classification
    dataset:
      type: custom
      name: Plant NER Dataset
    metrics:
    - type: f1
      value: 0.92
      name: F1 Score
    - type: precision
      value: 0.90
      name: Precision
    - type: recall
      value: 0.94
      name: Recall
---

# RoBERTa Plant Named Entity Recognition

## Model Description

This model is a fine-tuned version of [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base) for **plant named entity recognition**. It identifies and classifies plant names in text into two categories:

- **PLANT_COMMON**: Common names for plants (e.g., "rose", "hibiscus", "oak tree")
- **PLANT_SCI**: Scientific/botanical names (e.g., "Rosa damascena", "Quercus alba")

## Intended Uses & Limitations

### Intended Uses

- **Botanical text analysis**: Extract plant mentions from research papers, articles, and documentation
- **Gardening applications**: Identify plants mentioned in gardening guides, forums, and care instructions
- **Agricultural text processing**: Parse agricultural documents and reports
- **Educational tools**: Assist in botany and horticulture education
- **Content management**: Automatically tag and categorize plant-related content

### Limitations

- Trained primarily on English text
- May have lower accuracy on rare or highly specialized plant species
- Performance may vary on informal text, social media, or heavily abbreviated content
- Does not distinguish between live plants and plant products (e.g., "rose oil")

## Training Data

The model was trained on a custom dataset containing:

- Botanical literature and research papers
- Gardening guides and plant care instructions
- Agricultural documents
- Horticultural databases
- Plant identification guides

**Data Format**: CoNLL-style IOB2 tagging with whole-word tokenization

**Training Examples**: Thousands of annotated sentences containing plant references

## Training Procedure

### Training Hyperparameters

- **Base Model**: FacebookAI/roberta-base
- **Training Framework**: Hugging Face Transformers
- **Tokenization**: RoBERTa tokenizer with whole-word label alignment
- **Label Encoding**: IOB2 (Inside-Outside-Beginning) format
- **Sequence Length**: 512 tokens maximum
- **Batch Size**: Chosen for training efficiency
- **Learning Rate**: Adaptive with warmup
- **Training Epochs**: Multiple epochs with early stopping

### Label Schema

```
O                # Outside any plant entity
B-PLANT_COMMON   # Beginning of common plant name
I-PLANT_COMMON   # Inside/continuation of common plant name
B-PLANT_SCI      # Beginning of scientific plant name
I-PLANT_SCI      # Inside/continuation of scientific plant name
```

### Training Features

- **Whole-word label alignment**: Word-level IOB2 tags are aligned to RoBERTa sub-word tokens so multi-token plant names keep consistent labels (see the sketch after this list)
- **B-I-O validation**: Automatic correction of invalid tag sequences
- **Class balancing**: Weighted sampling for entity type balance
- **Data augmentation**: Synthetic examples for robustness
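The training script itself is not published here, so the following is only a minimal sketch of how word-level IOB2 tags are typically aligned to RoBERTa sub-word tokens with the Hugging Face fast tokenizer. The function name, example sentence, and the choice to tag continuation sub-words with `I-` labels are illustrative assumptions, not taken from the actual training code.

```python
from transformers import AutoTokenizer

# Label set matching the schema above
LABELS = ["O", "B-PLANT_COMMON", "I-PLANT_COMMON", "B-PLANT_SCI", "I-PLANT_SCI"]
label2id = {label: i for i, label in enumerate(LABELS)}

# RoBERTa needs add_prefix_space=True when the input is pre-split into words
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base", add_prefix_space=True)

def align_labels(words, word_labels):
    """Tokenize pre-split words and expand word-level IOB2 tags to sub-word tokens."""
    encoding = tokenizer(words, is_split_into_words=True, truncation=True, max_length=512)
    aligned = []
    previous_word = None
    for word_idx in encoding.word_ids():
        if word_idx is None:                # special tokens (<s>, </s>)
            aligned.append(-100)            # ignored by the loss
        elif word_idx != previous_word:     # first sub-word of a word keeps its tag
            aligned.append(label2id[word_labels[word_idx]])
        else:                               # continuation sub-word: B- becomes I-
            aligned.append(label2id[word_labels[word_idx].replace("B-", "I-")])
        previous_word = word_idx
    return encoding, aligned

words = ["My", "Rosa", "damascena", "is", "blooming"]
tags = ["O", "B-PLANT_SCI", "I-PLANT_SCI", "O", "O"]
_, label_ids = align_labels(words, tags)
print(label_ids)
```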
## Evaluation

The model achieves strong performance on plant entity recognition:

| Metric | Overall | PLANT_COMMON | PLANT_SCI |
|--------|---------|--------------|-----------|
| **Precision** | 0.90 | 0.88 | 0.92 |
| **Recall** | 0.94 | 0.96 | 0.91 |
| **F1-Score** | 0.92 | 0.92 | 0.91 |

### Performance Notes

- Excellent recall for common plant names (0.96)
- Strong precision for scientific names (0.92)
- Robust performance across different text types

## Usage

### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
model_name = "Dudeman523/roberta-plant-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create pipeline
ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

# Extract plant entities
text = "I love my Rosa damascena roses and the old oak tree in my garden."
entities = ner_pipeline(text)

for entity in entities:
    print(f"Plant: {entity['word']} | Type: {entity['entity_group']} | Confidence: {entity['score']:.2f}")
```

### Advanced Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained("Dudeman523/roberta-plant-ner")
model = AutoModelForTokenClassification.from_pretrained("Dudeman523/roberta-plant-ner")

# Tokenize input
text = "The Pachypodium lamerei succulent needs minimal watering."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Process results
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predicted_labels = torch.argmax(predictions, dim=-1)[0]

for token, label_id in zip(tokens, predicted_labels):
    label = model.config.id2label[label_id.item()]
    if label != "O":
        print(f"Token: {token} | Label: {label}")
```
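The advanced example prints raw RoBERTa sub-word tokens. If you want word-level entity spans without the pipeline's `aggregation_strategy="simple"`, the token predictions can be regrouped by character offsets. The sketch below is illustrative only; the variable names and the simple B-/I- merging rule are assumptions, not part of the released code.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "Dudeman523/roberta-plant-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "The Pachypodium lamerei succulent needs minimal watering."

# Offsets map each token back to a character span in the original text
encoding = tokenizer(text, return_tensors="pt", return_offsets_mapping=True, truncation=True)
offsets = encoding.pop("offset_mapping")[0].tolist()

with torch.no_grad():
    logits = model(**encoding).logits[0]
labels = [model.config.id2label[i] for i in logits.argmax(dim=-1).tolist()]

# Merge consecutive B-/I- tokens of the same entity type into character spans
entities, current = [], None
for (start, end), label in zip(offsets, labels):
    if start == end:                 # special tokens have empty spans
        current = None
        continue
    if label == "O":
        current = None
    elif label.startswith("B-") or current is None or current["type"] != label[2:]:
        current = {"type": label[2:], "start": start, "end": end}
        entities.append(current)
    else:                            # I- tag continuing the open entity
        current["end"] = end

for ent in entities:
    print(ent["type"], "->", text[ent["start"]:ent["end"]])
```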
### Batch Processing

```python
# Process multiple texts efficiently, reusing the ner_pipeline from Quick Start
texts = [
    "My hibiscus is blooming beautifully this spring.",
    "Quercus alba and Acer saccharum are common in this forest.",
    "I need care instructions for my Rosa damascena plant."
]

# Batch prediction
results = ner_pipeline(texts)

for i, (text, entities) in enumerate(zip(texts, results)):
    print(f"\nText {i+1}: {text}")
    for entity in entities:
        print(f"  🌱 {entity['word']} ({entity['entity_group']}) - {entity['score']:.2f}")
```

## Model Architecture

- **Base Architecture**: RoBERTa (Robustly Optimized BERT Pretraining Approach)
- **Parameters**: ~125M
- **Layers**: 12 transformer layers
- **Hidden Size**: 768
- **Attention Heads**: 12
- **Vocabulary**: 50,265 tokens
- **Classification Head**: Linear layer for 5-class token classification

## Ethical Considerations

### Bias and Fairness

- Model may reflect geographical and cultural biases present in training data
- Potential underrepresentation of plants from certain regions or cultures
- May perform better on commonly cultivated plants than on wild or rare species

### Environmental Impact

- Training computational cost: Moderate (fine-tuning only)
- Inference efficiency: Optimized for production use
- Carbon footprint: Minimal incremental impact over base model

## Technical Specifications

- **Input**: Text sequences up to 512 tokens
- **Output**: Token-level classifications with confidence scores
- **Inference Speed**: ~100-500 texts/second (depending on hardware)
- **Memory Requirements**: ~500MB RAM for inference
- **Supported Formats**: Raw text, tokenized input

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{roberta-plant-ner,
  title={RoBERTa Plant Named Entity Recognition Model},
  author={Dudeman523},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/Dudeman523/roberta-plant-ner}
}
```

## Contact

For questions, issues, or collaboration opportunities, please open an issue on the model repository or contact the model author.

---

**Model Version**: 1.0
**Last Updated**: December 2024
**Framework Compatibility**: transformers >= 4.21.0