Model Card

This model was developed in the use case "Increasing FAIRness of FAIRagro data through AI supported metadata enrichment", which is part of the FAIRagro consortium. It is fine-tuned on an annotated dataset to extract entities related to crops, soil, locations, and time statements from agricultural research datasets. Within the use case, it extracts this information from legacy research data and publications. Its purpose is to enrich existing metadata with agricultural entities extracted from the unstructured parts of that metadata (titles and abstracts).

Model Details

Model Description

  • Developed by: ZB MED - Informationszentrum Lebenswissenschaften
  • Funded by: DFG - Deutsche Forschungsgemeinschaft
  • Model type: Token-classification Model
  • Language(s) (NLP): English, German
  • License: MIT
  • Finetuned from model: FacebookAI/xlm-roberta-large

Uses

This model is intended to be used as an NER model for agriculture research. The entities it can extract are:

[
  "soilReferenceGroup",
  "soilOrganicCarbon",
  "soilTexture",
  "startTime",
  "endTime",
  "city",
  "duration",
  "cropSpecies",
  "soilAvailableNitrogen",
  "soilDepth",
  "region",
  "country",
  "longitude",
  "latitude",
  "cropVariety",
  "soilPH",
  "soilBulkDensity"
]

Out-of-Scope Use

This model is not intended for use outside agricultural research or for languages other than English and German.

Bias, Risks, and Limitations

This model is limited by its training dataset, which consists of entity-annotated agricultural titles and abstracts from a specific span of time, so it may not generalize well to other domains or use cases. Furthermore, the model occasionally tags subword tokens as separate entities around special characters (e.g., "-,/"), so post-processing of the results may be necessary to handle those cases.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. It is recommended to post-process the raw outputs of the model.
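The post-processing recommended above could look roughly like the following. This is a minimal sketch, not the authors' method: the function name merge_split_entities, the connector set "-/", and the sample fragments are all assumptions made for illustration. It merges neighbouring spans of the same entity group when only connector characters (or nothing) separate them in the source text.

```python
def merge_split_entities(entities, text, connectors="-/"):
    """Merge same-label spans separated only by connector characters.

    `entities` is a list of dicts shaped like the pipeline output
    (entity_group, score, word, start, end). `connectors` is an assumed
    default; adjust it to the special characters seen in your data.
    """
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["entity_group"] == ent["entity_group"]
            and all(ch in connectors for ch in text[prev["end"]:ent["start"]])
        ):
            # Extend the previous span and keep the more cautious score.
            prev["end"] = max(prev["end"], ent["end"])
            prev["word"] = text[prev["start"]:prev["end"]]
            prev["score"] = min(prev["score"], ent["score"])
        else:
            merged.append(dict(ent))  # copy so the input is not mutated
    return merged


# Hypothetical fragments, written by hand to mimic a split prediction.
text = "Field trial with winter-wheat on silt/loam soil."


def fragment(word, label, score):
    start = text.index(word)
    return {"entity_group": label, "score": score, "word": word,
            "start": start, "end": start + len(word)}


raw = [
    fragment("winter", "cropSpecies", 0.98),
    fragment("wheat", "cropSpecies", 0.91),
    fragment("silt", "soilTexture", 0.95),
    fragment("loam", "soilTexture", 0.97),
]
cleaned = merge_split_entities(raw, text)
print([e["word"] for e in cleaned])
```

Whitespace is deliberately left out of the connector set so that separate mentions such as "maize soybean" are not fused. Taking the minimum score for a merged span is one conservative choice among several.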

How to Get Started with the Model

Code sample

Use the code below to get started with the model.

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model_id = "IT-ZBMED/Agriculture_NER_Model_for_FAIR_Metadata_Enrichment"
roberta_fairagro = AutoModelForTokenClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# aggregation_strategy="simple" groups subword tokens into entity spans.
nlp = pipeline("ner", model=roberta_fairagro, tokenizer=tokenizer, aggregation_strategy="simple")

example = (
    "In early spring 2025, maize and soybean seedlings established quickly in the loamy sand soil as warmer temperatures "
    "accelerated germination, while by late autumn, the clay loam field supported a robust barley crop that matured well despite the soil’s slower drainage."
)

# Each result holds the entity group, score, matched text, and character offsets.
ner_results = nlp(example)
print(ner_results)

Output

[{'entity_group': 'startTime', 'score': 0.9885543, 'word': 'spring 2025', 'start': 9, 'end': 20},
 {'entity_group': 'cropSpecies', 'score': 0.9997772, 'word': 'maize', 'start': 22, 'end': 27},
 {'entity_group': 'cropSpecies', 'score': 0.98714954, 'word': 'soybean', 'start': 32, 'end': 39},
 {'entity_group': 'soilTexture', 'score': 0.99048805, 'word': 'loamy sand', 'start': 77, 'end': 87},
 {'entity_group': 'soilTexture', 'score': 0.97245836, 'word': 'clay loam', 'start': 167, 'end': 176},
 {'entity_group': 'cropSpecies', 'score': 0.9997045, 'word': 'barley', 'start': 202, 'end': 208}]
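To turn output like the above into metadata fields, one option is to group the predicted spans by entity type. The sketch below is only an illustration: the helper name to_metadata and the 0.9 score threshold are assumptions, not part of the model or the FAIRagro pipeline. The sample list is copied from the output shown above.

```python
from collections import defaultdict


def to_metadata(entities, min_score=0.9):
    """Group entity texts by label, dropping low-score and duplicate values.

    `min_score` is a hypothetical cut-off; tune it for your own data.
    """
    metadata = defaultdict(list)
    for ent in entities:
        if ent["score"] < min_score:
            continue
        value = ent["word"].strip()
        if value not in metadata[ent["entity_group"]]:
            metadata[ent["entity_group"]].append(value)
    return dict(metadata)


# Copied from the example output above.
ner_results = [
    {'entity_group': 'startTime', 'score': 0.9885543, 'word': 'spring 2025', 'start': 9, 'end': 20},
    {'entity_group': 'cropSpecies', 'score': 0.9997772, 'word': 'maize', 'start': 22, 'end': 27},
    {'entity_group': 'cropSpecies', 'score': 0.98714954, 'word': 'soybean', 'start': 32, 'end': 39},
    {'entity_group': 'soilTexture', 'score': 0.99048805, 'word': 'loamy sand', 'start': 77, 'end': 87},
    {'entity_group': 'soilTexture', 'score': 0.97245836, 'word': 'clay loam', 'start': 167, 'end': 176},
    {'entity_group': 'cropSpecies', 'score': 0.9997045, 'word': 'barley', 'start': 202, 'end': 208},
]

metadata = to_metadata(ner_results)
print(metadata)
# {'startTime': ['spring 2025'], 'cropSpecies': ['maize', 'soybean', 'barley'],
#  'soilTexture': ['loamy sand', 'clay loam']}
```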

Training Details

Training Data

IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment

Training Procedure

The model was fine-tuned on the full training split of the sentence-split version of the dataset.

Training Hyperparameters

Parameter               Value
batch_size              4
learning_rate           2.657488681466831e-05
warmup_ratio            0.09938204231729805
num_train_epochs        10
weight_decay            0.010599758492599783
adam_beta1              0.9
adam_beta2              0.999
adam_epsilon            1e-08
metric_for_best_model   f1
lr_scheduler_type       linear
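To illustrate how learning_rate, warmup_ratio, and the linear scheduler interact, the sketch below computes the learning rate at a given optimizer step. It assumes the common convention of a linear ramp from zero to the peak rate during warmup followed by a linear decay to zero, with the warmup length taken as the rounded-up product of the ratio and the total step count. The total of 1000 steps is hypothetical, since the card does not state the number of training steps.

```python
import math

PEAK_LR = 2.657488681466831e-05      # learning_rate from the table above
WARMUP_RATIO = 0.09938204231729805   # warmup_ratio from the table above


def linear_lr(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up from 0 to the peak rate.
        return peak_lr * step / max(1, warmup_steps)
    # Decay from the peak rate down to 0 at the final step.
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


TOTAL_STEPS = 1000  # hypothetical; depends on dataset size and batch size
for step in (0, 50, 100, 550, 1000):
    print(step, linear_lr(step, TOTAL_STEPS))
```

With these assumed numbers the warmup lasts 100 steps, the peak rate is reached at step 100, and the rate is back at half the peak by step 550.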

Evaluation

Evaluation was performed with the seqeval library, reporting precision, recall, and F1 score.
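Entity-level evaluation of this kind counts a prediction as correct only when both the label and the exact span boundaries match. The sketch below is a simplified pure-Python illustration of that idea, not seqeval's own code. It assumes BIO-style tags (the card does not state the tagging scheme), and the helper names bio_spans and span_prf are made up for this example.

```python
def bio_spans(tags):
    """Return the set of (label, start, end) spans in one BIO-tagged sentence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag.startswith(("B-", "I-")):
            # A stray I- tag without a matching B- starts a new span here.
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def span_prf(y_true, y_pred):
    """Exact-match precision, recall, and F1 over all sentences."""
    true = {(i,) + s for i, sent in enumerate(y_true) for s in bio_spans(sent)}
    pred = {(i,) + s for i, sent in enumerate(y_pred) for s in bio_spans(sent)}
    tp = len(true & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical tags: the second entity is predicted with a truncated span.
y_true = [["B-cropSpecies", "O", "B-soilTexture", "I-soilTexture"]]
y_pred = [["B-cropSpecies", "O", "B-soilTexture", "O"]]
print(span_prf(y_true, y_pred))  # (0.5, 0.5, 0.5)
```

The truncated soilTexture span counts as both a false positive and a false negative, which is why partial matches lower precision and recall at the same time.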

Testing Data, Factors & Metrics

Testing Data

The test split of the sentence-split version of IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment.

Metrics

F1 score

Results

Overall results

Metric      Value
Precision   0.7745
Recall      0.7524
F1 Score    0.7633
Macro F1    0.6189
Accuracy    0.9789
Loss        0.1476

Label-based results

Label                   Precision   Recall   F1 Score   Support
city                    0.8571      0.6667   0.7500     18
country                 0.9200      0.9200   0.9200     25
cropSpecies             0.7742      0.8372   0.8045     86
cropVariety             0.0000      0.0000   0.0000     3
duration                0.6364      0.5833   0.6087     24
endTime                 0.7647      0.8125   0.7879     32
latitude                0.0000      0.0000   0.0000     2
longitude               0.6667      1.0000   0.8000     2
region                  0.6429      0.5294   0.5806     17
soilAvailableNitrogen   1.0000      1.0000   1.0000     3
soilBulkDensity         0.0000      0.0000   0.0000     1
soilDepth               0.6250      0.8333   0.7143     6
soilOrganicCarbon       0.7857      0.5500   0.6471     20
soilPH                  0.6000      0.7500   0.6667     4
soilReferenceGroup      1.0000      1.0000   1.0000     1
soilTexture             0.7500      0.2727   0.4000     11
startTime               0.8030      0.8833   0.8413     60
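The gap between the overall F1 score and the Macro F1 follows from how the two are averaged. The sketch below recomputes both kinds of aggregate from the label-based results, assuming that Macro F1 is the unweighted mean of the per-label F1 scores and that the overall Recall is micro-averaged (total true positives over total support). True positives are recovered by rounding recall times support, which is an approximation.

```python
# (label, precision, recall, f1, support) copied from the label-based results.
ROWS = [
    ("city", 0.8571, 0.6667, 0.7500, 18),
    ("country", 0.9200, 0.9200, 0.9200, 25),
    ("cropSpecies", 0.7742, 0.8372, 0.8045, 86),
    ("cropVariety", 0.0000, 0.0000, 0.0000, 3),
    ("duration", 0.6364, 0.5833, 0.6087, 24),
    ("endTime", 0.7647, 0.8125, 0.7879, 32),
    ("latitude", 0.0000, 0.0000, 0.0000, 2),
    ("longitude", 0.6667, 1.0000, 0.8000, 2),
    ("region", 0.6429, 0.5294, 0.5806, 17),
    ("soilAvailableNitrogen", 1.0000, 1.0000, 1.0000, 3),
    ("soilBulkDensity", 0.0000, 0.0000, 0.0000, 1),
    ("soilDepth", 0.6250, 0.8333, 0.7143, 6),
    ("soilOrganicCarbon", 0.7857, 0.5500, 0.6471, 20),
    ("soilPH", 0.6000, 0.7500, 0.6667, 4),
    ("soilReferenceGroup", 1.0000, 1.0000, 1.0000, 1),
    ("soilTexture", 0.7500, 0.2727, 0.4000, 11),
    ("startTime", 0.8030, 0.8833, 0.8413, 60),
]

# Macro F1: every label counts equally, regardless of how often it occurs.
macro_f1 = sum(f1 for _label, _p, _r, f1, _n in ROWS) / len(ROWS)

# Micro recall: every gold entity counts equally.
true_positives = sum(round(r * n) for _label, _p, r, _f1, n in ROWS)
total_support = sum(n for _label, _p, _r, _f1, n in ROWS)
micro_recall = true_positives / total_support

print(round(macro_f1, 4))      # 0.6189
print(round(micro_recall, 4))  # 0.7524
```

Both values agree with the overall results. Three labels with an F1 of 0 (cropVariety, latitude, soilBulkDensity) have a combined support of only 6 out of 315 test entities, so they pull the macro average down far more than the micro-averaged scores.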

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: 1 A40 GPU
  • Hours used: less than one hour
  • Cloud Provider: High-Performance Computing (HPC) - University of Bonn

Model Card Contact

Abanoub Abdelmalak Email: [email protected]

Contributors

Murtuza Husain
