---
license: mit
datasets:
- IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment
language:
- en
- de
metrics:
- seqeval
base_model:
- FacebookAI/xlm-roberta-large
pipeline_tag: token-classification
tags:
- agriculture
- ner
- information-extraction
- llm
- roberta
- encoder
- crops
- soil
- location
- time-statement
---

# Model Card

This model is a product of the use case "Increasing FAIRness of FAIRagro data through AI supported metadata enrichment", which is part of the [FAIRagro consortium](https://fairagro.net/en/).
The model was fine-tuned on the [annotated dataset](https://huggingface.co/datasets/IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment) to extract entities related to crops, soil, locations, and time statements from agriculture research datasets.
Within the use case, it extracts this information from legacy research data and publications.
Its purpose is to enrich existing metadata by extracting agricultural metadata from the unstructured parts of that metadata (titles and abstracts).

## Model Details

### Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

- **Developed by:** ZB MED - Informationszentrum Lebenswissenschaften
- **Funded by:** DFG - Deutsche Forschungsgemeinschaft
- **Model type:** Token-classification model
- **Language(s) (NLP):** English, German
- **License:** MIT
- **Finetuned from model:** FacebookAI/xlm-roberta-large

## Uses

This model is intended to be used as an NER model for agriculture research.
The entities it can extract are:

```json
[
  "soilReferenceGroup",
  "soilOrganicCarbon",
  "soilTexture",
  "startTime",
  "endTime",
  "city",
  "duration",
  "cropSpecies",
  "soilAvailableNitrogen",
  "soilDepth",
  "region",
  "country",
  "longitude",
  "latitude",
  "cropVariety",
  "soilPH",
  "soilBulkDensity"
]
```

### Out-of-Scope Use

This model is not intended for use in domains other than agriculture research or in languages other than English and German.

## Bias, Risks, and Limitations

This model is limited by its training dataset of entity-annotated agriculture titles and abstracts from a specific span of time, so it may not generalize well to use cases in other domains.
Furthermore, the model occasionally tags subword tokens as entities around special characters (e.g., `-`, `,`, `/`), and post-processing of the results may be necessary to handle those cases.

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. It is recommended to post-process the raw outputs of the model.

## How to Get Started with the Model

### Code sample

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hub.
roberta_fairagro = AutoModelForTokenClassification.from_pretrained("IT-ZBMED/Agriculture_NER_Model_for_FAIR_Metadata_Enrichment")
tokenizer = AutoTokenizer.from_pretrained("IT-ZBMED/Agriculture_NER_Model_for_FAIR_Metadata_Enrichment")

# aggregation_strategy="simple" merges subword tokens into whole entity spans.
nlp = pipeline("ner", model=roberta_fairagro, tokenizer=tokenizer, aggregation_strategy="simple")

example = (
    "In early spring 2025, maize and soybean seedlings established quickly in the loamy sand soil as warmer temperatures "
    "accelerated germination, while by late autumn, the clay loam field supported a robust barley crop that matured well despite the soil’s slower drainage."
)

# Each result is a dict with the entity group, score, text, and character offsets.
ner_results = nlp(example)
print(ner_results)
```

### Output

```bash
[{'entity_group': 'startTime', 'score': 0.9885543, 'word': 'spring 2025', 'start': 9, 'end': 20},
 {'entity_group': 'cropSpecies', 'score': 0.9997772, 'word': 'maize', 'start': 22, 'end': 27},
 {'entity_group': 'cropSpecies', 'score': 0.98714954, 'word': 'soybean', 'start': 32, 'end': 39},
 {'entity_group': 'soilTexture', 'score': 0.99048805, 'word': 'loamy sand', 'start': 77, 'end': 87},
 {'entity_group': 'soilTexture', 'score': 0.97245836, 'word': 'clay loam', 'start': 167, 'end': 176},
 {'entity_group': 'cropSpecies', 'score': 0.9997045, 'word': 'barley', 'start': 202, 'end': 208}]
```

## Training Details

### Training Data

[IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment](https://huggingface.co/datasets/IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment)

### Training Procedure

The model was fine-tuned on the entire training split of the dataset's sentence version.

#### Training Hyperparameters

| Parameter             | Value                   |
|-----------------------|-------------------------|
| batch_size            | `4`                     |
| learning_rate         | `2.657488681466831e-05` |
| warmup_ratio          | `0.09938204231729805`   |
| num_train_epochs      | `10`                    |
| weight_decay          | `0.010599758492599783`  |
| adam_beta1            | `0.9`                   |
| adam_beta2            | `0.999`                 |
| adam_epsilon          | `1e-08`                 |
| metric_for_best_model | `f1`                    |
| lr_scheduler_type     | `linear`                |

## Evaluation

The evaluation was performed with the seqeval library, reporting precision, recall, and F1 scores.
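
As a rough illustration of what entity-level scoring means, the sketch below computes precision, recall, and F1 over exact-match entity spans. It is a dependency-free approximation written for illustration, not seqeval's implementation. It assumes BIO-style tags, which this card does not confirm; the helper names `extract_spans` and `entity_prf` and the toy tag sequences are hypothetical.

```python
from typing import List, Set, Tuple

# (sentence index, label, start token, end token exclusive)
Span = Tuple[int, str, int, int]


def extract_spans(tags: List[str], sent_idx: int = 0) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence (hypothetical helper)."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing entity
        prefix, _, name = tag.partition("-")
        # An open span is continued only by an I- tag with the same label.
        continues = prefix == "I" and name == label
        if label is not None and not continues:
            spans.add((sent_idx, label, start, i))
            start, label = None, None
        # A B- tag always opens a span; a stray I- tag is treated leniently as a start.
        if prefix == "B" or (prefix == "I" and label is None):
            start, label = i, name
    return spans


def entity_prf(y_true: List[List[str]], y_pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over exact-match entity spans."""
    gold: Set[Span] = set()
    pred: Set[Span] = set()
    for idx, (t, p) in enumerate(zip(y_true, y_pred)):
        gold |= extract_spans(t, idx)
        pred |= extract_spans(p, idx)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy data: the second soilTexture token is missed by the prediction.
y_true = [
    ["B-cropSpecies", "O", "B-soilTexture", "I-soilTexture"],
    ["B-startTime", "I-startTime", "O"],
]
y_pred = [
    ["B-cropSpecies", "O", "B-soilTexture", "O"],
    ["B-startTime", "I-startTime", "O"],
]

precision, recall, f1 = entity_prf(y_true, y_pred)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# precision=0.6667 recall=0.6667 f1=0.6667
```

Because a span counts only when both its boundaries and its label match, the truncated `soilTexture` prediction is scored as one false positive and one false negative, even though three of its four tokens in that sentence are tagged correctly. This is one reason entity-level F1 can sit well below token-level accuracy.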

### Testing Data, Factors & Metrics

#### Testing Data

The test split of the sentence version of [IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment](https://huggingface.co/datasets/IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment).

#### Metrics

F1 score

### Results

#### Overall results

| Metric    | Value  |
|-----------|--------|
| Precision | 0.7745 |
| Recall    | 0.7524 |
| F1 Score  | 0.7633 |
| Macro F1  | 0.6189 |
| Accuracy  | 0.9789 |
| Loss      | 0.1476 |

#### Label-based results

| Label                 | Precision | Recall | F1 Score | Support |
|-----------------------|-----------|--------|----------|---------|
| city                  | 0.8571    | 0.6667 | 0.7500   | 18      |
| country               | 0.9200    | 0.9200 | 0.9200   | 25      |
| cropSpecies           | 0.7742    | 0.8372 | 0.8045   | 86      |
| cropVariety           | 0.0000    | 0.0000 | 0.0000   | 3       |
| duration              | 0.6364    | 0.5833 | 0.6087   | 24      |
| endTime               | 0.7647    | 0.8125 | 0.7879   | 32      |
| latitude              | 0.0000    | 0.0000 | 0.0000   | 2       |
| longitude             | 0.6667    | 1.0000 | 0.8000   | 2       |
| region                | 0.6429    | 0.5294 | 0.5806   | 17      |
| soilAvailableNitrogen | 1.0000    | 1.0000 | 1.0000   | 3       |
| soilBulkDensity       | 0.0000    | 0.0000 | 0.0000   | 1       |
| soilDepth             | 0.6250    | 0.8333 | 0.7143   | 6       |
| soilOrganicCarbon     | 0.7857    | 0.5500 | 0.6471   | 20      |
| soilPH                | 0.6000    | 0.7500 | 0.6667   | 4       |
| soilReferenceGroup    | 1.0000    | 1.0000 | 1.0000   | 1       |
| soilTexture           | 0.7500    | 0.2727 | 0.4000   | 11      |
| startTime             | 0.8030    | 0.8833 | 0.8413   | 60      |

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** 1 A40 GPU
- **Hours used:** less than one hour
- **Cloud Provider:** High-Performance Computing (HPC) - University of Bonn

## Model Card Contact

[Abanoub Abdelmalak](https://github.com/AbanoubAbdelmalak)

Email: abdelmalak@zbmed.de

## Contributors

[Murtuza Husain](https://github.com/murtuza10)