Model Card
This model is a product of the use case "Increasing FAIRness of FAIRagro data through AI supported metadata enrichment", which is part of the FAIRagro consortium. It was fine-tuned on an annotated dataset to extract entities related to crops, soil, locations, and time statements from agricultural research datasets. Within the use case, it extracts this information from legacy research data and publications, enriching existing metadata with agricultural entities found in its unstructured parts (titles and abstracts).
Model Details
Model Description
- Developed by: ZB MED - Informationszentrum Lebenswissenschaften
- Funded by: DFG - Deutsche Forschungsgemeinschaft
- Model type: Token-classification Model
- Language(s) (NLP): English, German
- License: MIT
- Finetuned from model: FacebookAI/xlm-roberta-large
Uses
This model is intended for named entity recognition (NER) in agricultural research texts. It can extract the following entity types:
[
"soilReferenceGroup",
"soilOrganicCarbon",
"soilTexture",
"startTime",
"endTime",
"city",
"duration",
"cropSpecies",
"soilAvailableNitrogen",
"soilDepth",
"region",
"country",
"longitude",
"latitude",
"cropVariety",
"soilPH",
"soilBulkDensity"
]
Out-of-Scope Use
This model is not intended for use outside agricultural research or for languages other than English and German.
Bias, Risks, and Limitations
This model is limited by its training dataset, which consists of entity-annotated agricultural titles and abstracts from a specific span of time, so it may not generalize well to other domains or use cases. Furthermore, the model occasionally tags subword tokens as entities around special characters (e.g., "-", ",", "/"), and post-processing of the results may be necessary to handle those cases.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. It is recommended to post-process the raw outputs of the model.
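One possible form of such post-processing is sketched below. It is not the authors' procedure: the function name `postprocess_entities`, the set of special characters, and the merge rule (join same-label spans that are separated only by special characters) are assumptions chosen for illustration. It operates on the list of dictionaries returned by the `ner` pipeline with an aggregation strategy, plus the original input text.

```python
def postprocess_entities(entities, text, special="-,/"):
    """Merge same-label fragments split at special characters, then trim edges.

    `special` is an assumed character set; adjust it to the data at hand.
    """
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        ent = dict(ent)  # copy so the pipeline output is not mutated
        prev = merged[-1] if merged else None
        if prev is not None and prev["entity_group"] == ent["entity_group"]:
            gap = text[prev["end"]:ent["start"]]
            # Merge only if nothing but special characters (or nothing at all)
            # lies between the two spans; whitespace keeps them separate.
            if all(ch in special for ch in gap):
                prev["end"] = max(prev["end"], ent["end"])
                prev["score"] = min(prev["score"], ent["score"])  # conservative
                continue
        merged.append(ent)

    cleaned = []
    strip_chars = special + " "
    for ent in merged:
        start, end = ent["start"], ent["end"]
        # Trim leading and trailing special characters and spaces.
        while start < end and text[start] in strip_chars:
            start += 1
        while end > start and text[end - 1] in strip_chars:
            end -= 1
        if start == end:
            continue  # the span consisted only of special characters
        ent.update(start=start, end=end, word=text[start:end])
        cleaned.append(ent)
    return cleaned
```

Note that this rule would also join two distinct entities of the same type written without a space (for example "maize/barley"), so the character set should be reviewed against real outputs before use.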
How to Get Started with the Model
Code sample
Use the code below to get started with the model.
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hub
model_id = "IT-ZBMED/Agriculture_NER_Model_for_FAIR_Metadata_Enrichment"
roberta_fairagro = AutoModelForTokenClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# aggregation_strategy="simple" groups tagged tokens into entity spans
nlp = pipeline("ner", model=roberta_fairagro, tokenizer=tokenizer, aggregation_strategy="simple")

example = (
    "In early spring 2025, maize and soybean seedlings established quickly in the loamy sand soil as warmer temperatures "
    "accelerated germination, while by late autumn, the clay loam field supported a robust barley crop that matured well despite the soil’s slower drainage."
)
ner_results = nlp(example)
print(ner_results)
Output
[{'entity_group': 'startTime', 'score': 0.9885543, 'word': 'spring 2025', 'start': 9, 'end': 20},
{'entity_group': 'cropSpecies', 'score': 0.9997772, 'word': 'maize', 'start': 22, 'end': 27},
{'entity_group': 'cropSpecies', 'score': 0.98714954, 'word': 'soybean', 'start': 32, 'end': 39},
{'entity_group': 'soilTexture', 'score': 0.99048805, 'word': 'loamy sand', 'start': 77, 'end': 87},
{'entity_group': 'soilTexture', 'score': 0.97245836, 'word': 'clay loam', 'start': 167, 'end': 176},
{'entity_group': 'cropSpecies', 'score': 0.9997045, 'word': 'barley', 'start': 202, 'end': 208}]
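For metadata enrichment, the span list above typically needs to be collapsed into one record per document. The following is a minimal sketch of that step, not part of the published pipeline: the helper name `entities_to_metadata` and the `min_score` threshold of 0.9 are illustrative assumptions.

```python
from collections import defaultdict


def entities_to_metadata(ner_results, min_score=0.9):
    """Group pipeline output by entity type, dropping low-confidence spans.

    `min_score` is a hypothetical cut-off, not a value recommended by the authors.
    """
    metadata = defaultdict(list)
    for ent in ner_results:
        if ent["score"] < min_score:
            continue
        value = ent["word"].strip()
        # Keep each value once per entity type, preserving first-seen order.
        if value not in metadata[ent["entity_group"]]:
            metadata[ent["entity_group"]].append(value)
    return dict(metadata)


# Output copied from the example above
ner_results = [
    {"entity_group": "startTime", "score": 0.9885543, "word": "spring 2025", "start": 9, "end": 20},
    {"entity_group": "cropSpecies", "score": 0.9997772, "word": "maize", "start": 22, "end": 27},
    {"entity_group": "cropSpecies", "score": 0.98714954, "word": "soybean", "start": 32, "end": 39},
    {"entity_group": "soilTexture", "score": 0.99048805, "word": "loamy sand", "start": 77, "end": 87},
    {"entity_group": "soilTexture", "score": 0.97245836, "word": "clay loam", "start": 167, "end": 176},
    {"entity_group": "cropSpecies", "score": 0.9997045, "word": "barley", "start": 202, "end": 208},
]
print(entities_to_metadata(ner_results))
```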
Training Details
Training Data
IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment
Training Procedure
The model was fine-tuned on the full training split of the sentence-level version of the dataset.
Training Hyperparameters
| Parameter | Value |
|---|---|
| batch_size | 4 |
| learning_rate | 2.657488681466831e-05 |
| warmup_ratio | 0.09938204231729805 |
| num_train_epochs | 10 |
| weight_decay | 0.010599758492599783 |
| adam_beta1 | 0.9 |
| adam_beta2 | 0.999 |
| adam_epsilon | 1e-08 |
| metric_for_best_model | f1 |
| lr_scheduler_type | linear |
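For readers who want to reproduce a similar setup, the table can be expressed as keyword arguments for `transformers.TrainingArguments`. This is a sketch under assumptions: the card does not show the training script, so mapping `batch_size` to `per_device_train_batch_size` and the `output_dir` value are guesses, and evaluation or checkpointing settings are not covered.

```python
# Values copied from the hyperparameter table above.
TRAINING_KWARGS = {
    "per_device_train_batch_size": 4,  # assumed mapping for "batch_size"
    "learning_rate": 2.657488681466831e-05,
    "warmup_ratio": 0.09938204231729805,
    "num_train_epochs": 10,
    "weight_decay": 0.010599758492599783,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-08,
    "metric_for_best_model": "f1",
    "lr_scheduler_type": "linear",
}


def build_training_arguments(output_dir="agri-ner-finetune"):
    """Create TrainingArguments from the table; `output_dir` is a placeholder."""
    # Imported lazily so the dictionary can be inspected without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```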
Evaluation
Evaluation was performed with the seqeval library, reporting precision, recall, and F1 scores.
Testing Data, Factors & Metrics
Testing Data
The test split of the sentence-level version of the following dataset: IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment
Metrics
F1 score
Results
Overall Results
| Metric | Value |
|---|---|
| Precision | 0.7745 |
| Recall | 0.7524 |
| F1 Score | 0.7633 |
| Macro F1 | 0.6189 |
| Accuracy | 0.9789 |
| Loss | 0.1476 |
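As a quick consistency check, the F1 score in the table is the harmonic mean of the reported precision and recall. The snippet below simply redoes that arithmetic from the rounded table values.

```python
precision = 0.7745
recall = 0.7524

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7633
```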
Label-based Results
| Label | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| city | 0.8571 | 0.6667 | 0.7500 | 18 |
| country | 0.9200 | 0.9200 | 0.9200 | 25 |
| cropSpecies | 0.7742 | 0.8372 | 0.8045 | 86 |
| cropVariety | 0.0000 | 0.0000 | 0.0000 | 3 |
| duration | 0.6364 | 0.5833 | 0.6087 | 24 |
| endTime | 0.7647 | 0.8125 | 0.7879 | 32 |
| latitude | 0.0000 | 0.0000 | 0.0000 | 2 |
| longitude | 0.6667 | 1.0000 | 0.8000 | 2 |
| region | 0.6429 | 0.5294 | 0.5806 | 17 |
| soilAvailableNitrogen | 1.0000 | 1.0000 | 1.0000 | 3 |
| soilBulkDensity | 0.0000 | 0.0000 | 0.0000 | 1 |
| soilDepth | 0.6250 | 0.8333 | 0.7143 | 6 |
| soilOrganicCarbon | 0.7857 | 0.5500 | 0.6471 | 20 |
| soilPH | 0.6000 | 0.7500 | 0.6667 | 4 |
| soilReferenceGroup | 1.0000 | 1.0000 | 1.0000 | 1 |
| soilTexture | 0.7500 | 0.2727 | 0.4000 | 11 |
| startTime | 0.8030 | 0.8833 | 0.8413 | 60 |
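The gap between the overall F1 (0.7633) and the macro F1 (0.6189) follows from how the two are aggregated. The sketch below recomputes two of the overall figures from the rounded per-label rows: macro F1 is the unweighted mean of the per-label F1 scores, so the three labels with an F1 of zero (each with at most three test examples) pull it down, while the overall recall weights each label by its support. Small deviations from rounding are possible.

```python
# (label, precision, recall, f1, support) copied from the table above
rows = [
    ("city", 0.8571, 0.6667, 0.7500, 18),
    ("country", 0.9200, 0.9200, 0.9200, 25),
    ("cropSpecies", 0.7742, 0.8372, 0.8045, 86),
    ("cropVariety", 0.0000, 0.0000, 0.0000, 3),
    ("duration", 0.6364, 0.5833, 0.6087, 24),
    ("endTime", 0.7647, 0.8125, 0.7879, 32),
    ("latitude", 0.0000, 0.0000, 0.0000, 2),
    ("longitude", 0.6667, 1.0000, 0.8000, 2),
    ("region", 0.6429, 0.5294, 0.5806, 17),
    ("soilAvailableNitrogen", 1.0000, 1.0000, 1.0000, 3),
    ("soilBulkDensity", 0.0000, 0.0000, 0.0000, 1),
    ("soilDepth", 0.6250, 0.8333, 0.7143, 6),
    ("soilOrganicCarbon", 0.7857, 0.5500, 0.6471, 20),
    ("soilPH", 0.6000, 0.7500, 0.6667, 4),
    ("soilReferenceGroup", 1.0000, 1.0000, 1.0000, 1),
    ("soilTexture", 0.7500, 0.2727, 0.4000, 11),
    ("startTime", 0.8030, 0.8833, 0.8413, 60),
]

# Macro F1: every label counts equally, regardless of how often it occurs.
macro_f1 = sum(f1 for _, _, _, f1, _ in rows) / len(rows)

# Overall (micro) recall: correctly found entities divided by all gold entities.
total_support = sum(support for *_, support in rows)
micro_recall = sum(r * support for _, _, r, _, support in rows) / total_support

print(round(macro_f1, 4))      # 0.6189
print(round(micro_recall, 4))  # 0.7524
print(total_support)           # 315
```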
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: 1 A40 GPU
- Hours used: less than one hour
- Cloud Provider: High-Performance Computing (HPC) - University of Bonn
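A rough upper bound on the emissions can be sketched as energy used times the carbon intensity of the electricity supply, which is the general idea behind such calculators. The numbers below are assumptions, not measurements from this training run: 300 W is taken as the nominal maximum board power of an A40, one hour is the upper limit implied by "less than one hour", and the grid intensity of 0.4 kg CO2eq/kWh is a placeholder that should be replaced with the value for the actual site. Data-centre overhead is ignored.

```python
def estimate_emissions_kg(power_watts, hours, kg_co2_per_kwh):
    """Return estimated emissions in kg CO2eq for a single device."""
    energy_kwh = power_watts / 1000 * hours
    return energy_kwh * kg_co2_per_kwh


# All three arguments are assumptions; see the text above.
upper_bound = estimate_emissions_kg(power_watts=300, hours=1.0, kg_co2_per_kwh=0.4)
print(f"At most about {upper_bound:.2f} kg CO2eq under these assumptions")
```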
Model Card Contact
Abanoub Abdelmalak Email: [email protected]
Contributors