Model Card

This model is a product of the use case "Increasing FAIRness of FAIRagro data through AI supported metadata enrichment", which is part of the FAIRagro consortium. It was fine-tuned on an annotated dataset to extract entities related to crops, soil, locations, and time statements from agricultural research datasets. Within the use case, it extracts this information from legacy research data and publications in order to enrich existing metadata with agricultural metadata taken from its unstructured parts (titles and abstracts).

Model Details

Model Description

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

Developed by: ZB MED - Informationszentrum Lebenswissenschaften
Funded by: DFG - Deutsche Forschungsgemeinschaft
Model type: Token-classification Model
Language(s) (NLP): English, German
License: MIT
Finetuned from model: FacebookAI/xlm-roberta-large

Uses

This model is intended to be used as an NER model for agriculture research. The entities it can extract are:

[
  "soilReferenceGroup",
  "soilOrganicCarbon",
  "soilTexture",
  "startTime",
  "endTime",
  "city",
  "duration",
  "cropSpecies",
  "soilAvailableNitrogen",
  "soilDepth",
  "region",
  "country",
  "longitude",
  "latitude",
  "cropVariety",
  "soilPH",
  "soilBulkDensity"
]

Out-of-Scope Use

This model is not intended for use in domains other than agriculture research or for languages other than English and German.

Bias, Risks, and Limitations

This model is limited by its training dataset, which consists of entity-annotated agriculture titles and abstracts from a specific span of time. It may therefore not generalize well to other domains or use cases. Furthermore, the model occasionally tags subword tokens as entities around special signs (e.g., "-", ",", "/"), so post-processing of the results may be necessary to handle those cases.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. It is recommended to post-process the raw outputs of the model.
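One possible post-processing step is sketched below. It is a heuristic under stated assumptions, not the authors' method: adjacent spans with the same label are merged when only a hyphen or slash (or nothing) separates them in the source text, and spans consisting solely of punctuation are dropped. The helper name `postprocess`, the `JOINERS` set, and the sample input are hypothetical.

```python
import string

# Assumed joiner characters; commas are left out on purpose so that
# enumerations such as "maize, soybean" stay separate entities.
JOINERS = {"", "-", "/"}


def postprocess(text, entities):
    """Merge fragmented same-label spans and drop punctuation-only spans."""
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        span = text[ent["start"]:ent["end"]]
        # Skip spans that contain nothing but punctuation or whitespace.
        if not span.strip(string.punctuation + string.whitespace):
            continue
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["entity_group"] == ent["entity_group"]
            and text[prev["end"]:ent["start"]] in JOINERS
        ):
            # Extend the previous span; keep the lower score to stay cautious.
            prev["end"] = max(prev["end"], ent["end"])
            prev["score"] = min(prev["score"], ent["score"])
        else:
            merged.append(dict(ent))
    # Rebuild the surface form from the character offsets.
    for ent in merged:
        ent["word"] = text[ent["start"]:ent["end"]]
    return merged


# Made-up pipeline output illustrating a fragmented entity and a stray sign.
text = "Winter-wheat was grown on silty/clay soil."
raw = [
    {"entity_group": "cropSpecies", "score": 0.98, "word": "Winter", "start": 0, "end": 6},
    {"entity_group": "cropSpecies", "score": 0.91, "word": "wheat", "start": 7, "end": 12},
    {"entity_group": "soilTexture", "score": 0.70, "word": "/", "start": 31, "end": 32},
]
cleaned = postprocess(text, raw)
print(cleaned)
```

Because the merge relies on the `start` and `end` character offsets, it applies to the output of the pipeline when an aggregation strategy is set, as in the code sample of this card.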

How to Get Started with the Model

Code sample

Use the code below to get started with the model.

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

roberta_fairagro = AutoModelForTokenClassification.from_pretrained("IT-ZBMED/Agriculture_NER_Model_for_FAIR_Metadata_Enrichment")
tokenizer = AutoTokenizer.from_pretrained("IT-ZBMED/Agriculture_NER_Model_for_FAIR_Metadata_Enrichment")

nlp = pipeline("ner", model=roberta_fairagro, tokenizer=tokenizer, aggregation_strategy="simple")

example = (
    "In early spring 2025, maize and soybean seedlings established quickly in the loamy sand soil as warmer temperatures "
    "accelerated germination, while by late autumn, the clay loam field supported a robust barley crop that matured well despite the soil’s slower drainage."
)

ner_results = nlp(example)
print(ner_results)

Output

[{'entity_group': 'startTime', 'score': 0.9885543, 'word': 'spring 2025', 'start': 9, 'end': 20},
 {'entity_group': 'cropSpecies', 'score': 0.9997772, 'word': 'maize', 'start': 22, 'end': 27},
 {'entity_group': 'cropSpecies', 'score': 0.98714954, 'word': 'soybean', 'start': 32, 'end': 39},
 {'entity_group': 'soilTexture', 'score': 0.99048805, 'word': 'loamy sand', 'start': 77, 'end': 87},
 {'entity_group': 'soilTexture', 'score': 0.97245836, 'word': 'clay loam', 'start': 167, 'end': 176},
 {'entity_group': 'cropSpecies', 'score': 0.9997045, 'word': 'barley', 'start': 202, 'end': 208}]
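For metadata enrichment, the flat list above can be grouped by entity type. The following is a minimal sketch that reuses the output shown above; the helper name `to_metadata` and the `min_score` threshold of 0.5 are assumptions made for illustration, not part of the model or the pipeline.

```python
from collections import defaultdict

# Pipeline output copied from the example above.
ner_results = [
    {"entity_group": "startTime", "score": 0.9885543, "word": "spring 2025", "start": 9, "end": 20},
    {"entity_group": "cropSpecies", "score": 0.9997772, "word": "maize", "start": 22, "end": 27},
    {"entity_group": "cropSpecies", "score": 0.98714954, "word": "soybean", "start": 32, "end": 39},
    {"entity_group": "soilTexture", "score": 0.99048805, "word": "loamy sand", "start": 77, "end": 87},
    {"entity_group": "soilTexture", "score": 0.97245836, "word": "clay loam", "start": 167, "end": 176},
    {"entity_group": "cropSpecies", "score": 0.9997045, "word": "barley", "start": 202, "end": 208},
]


def to_metadata(entities, min_score=0.5):
    """Group entity strings by label, skipping low scores and duplicates."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] < min_score:
            continue
        value = ent["word"].strip()
        if value not in grouped[ent["entity_group"]]:
            grouped[ent["entity_group"]].append(value)
    return dict(grouped)


metadata = to_metadata(ner_results)
print(metadata)
```

The resulting dictionary maps each label (for example `cropSpecies`) to the list of distinct strings found in the text, which is closer to the shape of a metadata record than the raw span list.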

Training Details

Training Data

IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment

Training Procedure

The model was fine-tuned on the full training split of the sentence-level version of the dataset.

Training Hyperparameters

Parameter	Value
batch_size	`4`
learning_rate	`2.657488681466831e-05`
warmup_ratio	`0.09938204231729805`
num_train_epochs	`10`
weight_decay	`0.010599758492599783`
adam_beta1	`0.9`
adam_beta2	`0.999`
adam_epsilon	`1e-08`
metric_for_best_model	`f1`
lr_scheduler_type	`linear`
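As a rough guide to reproducing this setup, the values above can be written as keyword arguments named after the fields of `transformers.TrainingArguments`. This is a sketch, not the authors' training script: mapping `batch_size` to `per_device_train_batch_size` is an assumption, and the output directory in the comment is a placeholder.

```python
# Hyperparameters from the table above, keyed by TrainingArguments field names.
training_kwargs = {
    "per_device_train_batch_size": 4,  # assumed mapping of "batch_size"
    "learning_rate": 2.657488681466831e-05,
    "warmup_ratio": 0.09938204231729805,
    "num_train_epochs": 10,
    "weight_decay": 0.010599758492599783,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-08,
    "metric_for_best_model": "f1",
    "lr_scheduler_type": "linear",
}

# Possible usage (placeholder output directory, requires transformers):
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="out", **training_kwargs)
print(training_kwargs)
```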

Evaluation

The evaluation was performed with the seqeval library and reports precision, recall, and F1 scores.

Testing Data, Factors & Metrics

Testing Data

The test split of the sentence-level version of the following dataset: IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment

Metrics

F1 score

Results

Overall results

Metric	Value
Precision	0.7745
Recall	0.7524
F1 Score	0.7633
Macro F1	0.6189
Accuracy	0.9789
Loss	0.1476

Label-based results

Label	Precision	Recall	F1 Score	Support
city	0.8571	0.6667	0.7500	18
country	0.9200	0.9200	0.9200	25
cropSpecies	0.7742	0.8372	0.8045	86
cropVariety	0.0000	0.0000	0.0000	3
duration	0.6364	0.5833	0.6087	24
endTime	0.7647	0.8125	0.7879	32
latitude	0.0000	0.0000	0.0000	2
longitude	0.6667	1.0000	0.8000	2
region	0.6429	0.5294	0.5806	17
soilAvailableNitrogen	1.0000	1.0000	1.0000	3
soilBulkDensity	0.0000	0.0000	0.0000	1
soilDepth	0.6250	0.8333	0.7143	6
soilOrganicCarbon	0.7857	0.5500	0.6471	20
soilPH	0.6000	0.7500	0.6667	4
soilReferenceGroup	1.0000	1.0000	1.0000	1
soilTexture	0.7500	0.2727	0.4000	11
startTime	0.8030	0.8833	0.8413	60
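The overall and label-based figures can be related with a short calculation. The sketch below recomputes the overall F1 score as the harmonic mean of the reported precision and recall, and the macro F1 as the unweighted mean of the per-label F1 scores from the table above. Variable names are illustrative.

```python
# Overall precision and recall as reported above.
precision, recall = 0.7745, 0.7524

# Per-label (F1, support) pairs copied from the label-based results.
per_label = {
    "city": (0.7500, 18),
    "country": (0.9200, 25),
    "cropSpecies": (0.8045, 86),
    "cropVariety": (0.0000, 3),
    "duration": (0.6087, 24),
    "endTime": (0.7879, 32),
    "latitude": (0.0000, 2),
    "longitude": (0.8000, 2),
    "region": (0.5806, 17),
    "soilAvailableNitrogen": (1.0000, 3),
    "soilBulkDensity": (0.0000, 1),
    "soilDepth": (0.7143, 6),
    "soilOrganicCarbon": (0.6471, 20),
    "soilPH": (0.6667, 4),
    "soilReferenceGroup": (1.0000, 1),
    "soilTexture": (0.4000, 11),
    "startTime": (0.8413, 60),
}

# Overall F1: harmonic mean of overall precision and recall.
overall_f1 = 2 * precision * recall / (precision + recall)

# Macro F1: every label counts equally, regardless of its support.
macro_f1 = sum(f1 for f1, _ in per_label.values()) / len(per_label)

# Labels that the model never predicted correctly on the test split.
zero_f1 = [label for label, (f1, _) in per_label.items() if f1 == 0.0]

print(round(overall_f1, 4), round(macro_f1, 4), zero_f1)
```

The gap between the two scores comes largely from labels with very small support: cropVariety, latitude, and soilBulkDensity have at most three test instances each and an F1 of zero, which pulls the macro average down while barely affecting the overall score.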

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: 1 A40 GPU
Hours used: less than one hour
Cloud Provider: High-Performance Computing (HPC) - University of Bonn

Model Card Contact

Abanoub Abdelmalak Email: [email protected]

Contributors

Murtuza Husain

Safetensors
Model size: 0.6B params
Tensor type: F32
