---
license: mit
datasets:
- IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment
language:
- en
- de
metrics:
- seqeval
base_model:
- FacebookAI/xlm-roberta-large
pipeline_tag: token-classification
tags:
- agriculture
- ner
- information-extraction
- llm
- roberta
- encoder
- crops
- soil
- location
- time-statement
---

# Model Card

<!-- Provide a quick summary of what the model is/does. -->
This model was developed in the use case "Increasing FAIRness of FAIRagro data through AI supported metadata enrichment", which is part of the 
[FAIRagro consortium](https://fairagro.net/en/). It is fine-tuned on the [annotated dataset](https://huggingface.co/datasets/IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment)
to extract entities related to crops, soil, locations, and time statements from agricultural research texts. Within the use case, it is applied to
legacy research data and publications: it enriches existing metadata by extracting agricultural metadata from the
unstructured parts of that metadata (titles and abstracts).


## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->


- **Developed by:** ZB MED - Informationszentrum Lebenswissenschaften
- **Funded by:** DFG - Deutsche Forschungsgemeinschaft
- **Model type:** Token-classification Model
- **Language(s) (NLP):** English, German
- **License:** MIT
- **Finetuned from model:** FacebookAI/xlm-roberta-large


## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
This model is intended for named entity recognition (NER) in agricultural research texts. It can extract the following entity types:

```json
[
  "soilReferenceGroup",
  "soilOrganicCarbon",
  "soilTexture",
  "startTime",
  "endTime",
  "city",
  "duration",
  "cropSpecies",
  "soilAvailableNitrogen",
  "soilDepth",
  "region",
  "country",
  "longitude",
  "latitude",
  "cropVariety",
  "soilPH",
  "soilBulkDensity"
]
```


### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses for the model will not work well for. -->
This model is not intended for use outside agricultural research or for languages other than English and German.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

This model is limited by its training dataset, which consists of entity-annotated agricultural titles and abstracts from a specific span of time. It may therefore not generalize well to other domains or use cases. Furthermore, the model occasionally tags subword tokens as entities around special characters (e.g., "-,/"), so post-processing of the results may be necessary to handle those cases.

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
It is recommended to post-process the raw outputs of the model.
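
One possible form of such post-processing is sketched below. It is not the authors' method: the function name `clean_entities`, the `min_score` threshold, and the rule for re-joining fragments across a hyphen are all assumptions made for illustration. The sketch trims special characters from the edges of each span, drops spans that contain nothing else, and merges adjacent fragments that carry the same label.

```python
# Characters that may be wrongly attached to entity spans (taken from the
# limitation described above; extend as needed).
SPECIAL_CHARS = "-,/"
# Gaps across which two fragments of the same label are re-joined.
# Restricting this to "" and "-" is an assumption of this sketch.
MERGE_GAPS = {"", "-"}


def clean_entities(text, entities, min_score=0.5):
    """Trim special characters from entity spans and merge split fragments.

    `entities` uses the format of the aggregated token-classification
    pipeline output: dicts with `entity_group`, `score`, `start`, `end`.
    `min_score` is a hypothetical confidence threshold, not a tuned value.
    """
    strip_chars = SPECIAL_CHARS + " "
    cleaned = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["score"] < min_score:
            continue
        start, end = ent["start"], ent["end"]
        # Shrink the span until it neither starts nor ends with a special
        # character or a space.
        while start < end and text[start] in strip_chars:
            start += 1
        while end > start and text[end - 1] in strip_chars:
            end -= 1
        if start == end:
            continue  # the span held nothing but special characters

        prev = cleaned[-1] if cleaned else None
        if (
            prev is not None
            and prev["entity_group"] == ent["entity_group"]
            and prev["end"] <= start
            and text[prev["end"]:start] in MERGE_GAPS
        ):
            # Same label on both sides of the gap: treat as one entity and
            # keep the lower score as a conservative estimate.
            prev["end"] = end
            prev["score"] = min(prev["score"], ent["score"])
            prev["word"] = text[prev["start"]:end]
        else:
            cleaned.append({
                "entity_group": ent["entity_group"],
                "score": ent["score"],
                "word": text[start:end],
                "start": start,
                "end": end,
            })
    return cleaned


# Hand-made input that imitates the fragmentation issue (not real model output).
text = "clay-loam soil with maize,"
raw = [
    {"entity_group": "soilTexture", "score": 0.9, "word": "clay", "start": 0, "end": 4},
    {"entity_group": "soilTexture", "score": 0.8, "word": "-loam", "start": 4, "end": 9},
    {"entity_group": "cropSpecies", "score": 0.95, "word": "maize,", "start": 20, "end": 26},
]
print([e["word"] for e in clean_entities(text, raw)])  # ['clay-loam', 'maize']
```

Whether fragments should also be merged across other characters, such as "/", depends on the data; two crops separated by a slash are usually separate entities, which is why the sketch only merges across a hyphen.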

## How to Get Started with the Model

### Code sample
Use the code below to get started with the model.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

roberta_fairagro = AutoModelForTokenClassification.from_pretrained("IT-ZBMED/Agriculture_NER_Model_for_FAIR_Metadata_Enrichment")
tokenizer = AutoTokenizer.from_pretrained("IT-ZBMED/Agriculture_NER_Model_for_FAIR_Metadata_Enrichment")

nlp = pipeline("ner", model=roberta_fairagro, tokenizer=tokenizer, aggregation_strategy="simple")

example = (
    "In early spring 2025, maize and soybean seedlings established quickly in the loamy sand soil as warmer temperatures "
    "accelerated germination, while by late autumn, the clay loam field supported a robust barley crop that matured well despite the soil’s slower drainage."
)

ner_results = nlp(example)
print(ner_results)
```
### Output

```python
[{'entity_group': 'startTime', 'score': 0.9885543, 'word': 'spring 2025', 'start': 9, 'end': 20},
 {'entity_group': 'cropSpecies', 'score': 0.9997772, 'word': 'maize', 'start': 22, 'end': 27},
 {'entity_group': 'cropSpecies', 'score': 0.98714954, 'word': 'soybean', 'start': 32, 'end': 39},
 {'entity_group': 'soilTexture', 'score': 0.99048805, 'word': 'loamy sand', 'start': 77, 'end': 87},
 {'entity_group': 'soilTexture', 'score': 0.97245836, 'word': 'clay loam', 'start': 167, 'end': 176},
 {'entity_group': 'cropSpecies', 'score': 0.9997045, 'word': 'barley', 'start': 202, 'end': 208}]
```
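
For metadata enrichment, the flat list of entities usually has to be turned into fields. The following is a minimal sketch of that step, assuming one list of unique values per entity type is wanted; the helper name `entities_to_metadata` and its `min_score` parameter are hypothetical and not part of the model or the transformers library.

```python
from collections import defaultdict


def entities_to_metadata(entities, min_score=0.0):
    """Group pipeline entities by label, keeping unique values in order."""
    metadata = defaultdict(list)
    for ent in entities:
        if ent["score"] < min_score:
            continue  # skip low-confidence predictions
        value = ent["word"].strip()
        if value and value not in metadata[ent["entity_group"]]:
            metadata[ent["entity_group"]].append(value)
    return dict(metadata)


# The pipeline output shown above, copied so the example runs on its own.
ner_results = [
    {"entity_group": "startTime", "score": 0.9885543, "word": "spring 2025", "start": 9, "end": 20},
    {"entity_group": "cropSpecies", "score": 0.9997772, "word": "maize", "start": 22, "end": 27},
    {"entity_group": "cropSpecies", "score": 0.98714954, "word": "soybean", "start": 32, "end": 39},
    {"entity_group": "soilTexture", "score": 0.99048805, "word": "loamy sand", "start": 77, "end": 87},
    {"entity_group": "soilTexture", "score": 0.97245836, "word": "clay loam", "start": 167, "end": 176},
    {"entity_group": "cropSpecies", "score": 0.9997045, "word": "barley", "start": 202, "end": 208},
]

print(entities_to_metadata(ner_results))
# {'startTime': ['spring 2025'], 'cropSpecies': ['maize', 'soybean', 'barley'],
#  'soilTexture': ['loamy sand', 'clay loam']}
```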



## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

[IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment](https://huggingface.co/datasets/IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment)

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
The model was fine-tuned on the full training split of the sentence-level version of the dataset.


#### Training Hyperparameters

| Parameter                    | Value                               |
|-----------------------------|--------------------------------------|
| batch_size                  | `4`                                  |
| learning_rate               | `2.657488681466831e-05`              |
| warmup_ratio                | `0.09938204231729805`                |
| num_train_epochs            | `10`                                 |
| weight_decay                | `0.010599758492599783`               |
| adam_beta1                  | `0.9`                                |
| adam_beta2                  | `0.999`                              |
| adam_epsilon                | `1e-08`                              |
| metric_for_best_model       | `f1`                                 |
| lr_scheduler_type           | `linear`                             |
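
As a rough guide for reproducing the setup, the table values could be passed to the `TrainingArguments` class of the transformers library as sketched below. The original training script is not part of this card, so the mapping is an assumption: in particular, `batch_size` is taken to mean `per_device_train_batch_size`, the keyword names follow transformers 4.x, and the `output_dir` value is hypothetical.

```python
# Values copied from the hyperparameter table; the keyword names are the
# TrainingArguments fields they most plausibly correspond to (assumption).
TRAINING_KWARGS = {
    "per_device_train_batch_size": 4,  # "batch_size" in the table
    "learning_rate": 2.657488681466831e-05,
    "warmup_ratio": 0.09938204231729805,
    "num_train_epochs": 10,
    "weight_decay": 0.010599758492599783,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-08,
    "metric_for_best_model": "f1",
    "lr_scheduler_type": "linear",
}


def build_training_arguments(output_dir="fairagro-ner"):
    """Create TrainingArguments from the table values.

    `output_dir` is a hypothetical path chosen for this sketch.
    """
    # Imported lazily so the dictionary above can be inspected without
    # transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

Evaluation and checkpointing settings are not listed in the table and are therefore left at their defaults here.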


## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->
Evaluation was performed with the seqeval library, which computes entity-level precision, recall, and F1 scores.

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

The test split of the sentence-level version of the following dataset:
[IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment](https://huggingface.co/datasets/IT-ZBMED/Agriculture_NER_Dataset_for_FAIR_Metadata_Enrichment)


#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

F1 score, reported alongside precision and recall.

### Results

#### Overall Results
| Metric        | Value                |
|---------------|----------------------|
| Precision     | 0.7745              |
| Recall        | 0.7524              |
| F1 Score      | 0.7633              |
| Macro F1      | 0.6189              |
| Accuracy      | 0.9789              |
| Loss          | 0.1476              |

#### Label-Based Results
| Label                    | Precision | Recall | F1 Score | Support |
|--------------------------|-----------|--------|----------|---------|
| city                     | 0.8571    | 0.6667 | 0.7500   | 18      |
| country                  | 0.9200    | 0.9200 | 0.9200   | 25      |
| cropSpecies              | 0.7742    | 0.8372 | 0.8045   | 86      |
| cropVariety              | 0.0000    | 0.0000 | 0.0000   | 3       |
| duration                 | 0.6364    | 0.5833 | 0.6087   | 24      |
| endTime                  | 0.7647    | 0.8125 | 0.7879   | 32      |
| latitude                 | 0.0000    | 0.0000 | 0.0000   | 2       |
| longitude                | 0.6667    | 1.0000 | 0.8000   | 2       |
| region                   | 0.6429    | 0.5294 | 0.5806   | 17      |
| soilAvailableNitrogen    | 1.0000    | 1.0000 | 1.0000   | 3       |
| soilBulkDensity          | 0.0000    | 0.0000 | 0.0000   | 1       |
| soilDepth                | 0.6250    | 0.8333 | 0.7143   | 6       |
| soilOrganicCarbon        | 0.7857    | 0.5500 | 0.6471   | 20      |
| soilPH                   | 0.6000    | 0.7500 | 0.6667   | 4       |
| soilReferenceGroup       | 1.0000    | 1.0000 | 1.0000   | 1       |
| soilTexture              | 0.7500    | 0.2727 | 0.4000   | 11      |
| startTime                | 0.8030    | 0.8833 | 0.8413   | 60      |
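
The gap between the overall F1 score (0.7633) and the macro F1 (0.6189) follows from how the two are averaged. The sketch below recomputes both from the reported numbers: the macro F1 is the unweighted mean of the per-label F1 scores, so the three labels with an F1 of 0 and very low support pull it down, whereas the overall recall and F1 are consistent with micro-averaging over all 315 test entities. Recovering true-positive counts by rounding `recall * support` is an assumption of this sketch, since the raw counts are not part of this card.

```python
# (label, recall, F1, support) copied from the label-based results table.
PER_LABEL = [
    ("city", 0.6667, 0.7500, 18),
    ("country", 0.9200, 0.9200, 25),
    ("cropSpecies", 0.8372, 0.8045, 86),
    ("cropVariety", 0.0000, 0.0000, 3),
    ("duration", 0.5833, 0.6087, 24),
    ("endTime", 0.8125, 0.7879, 32),
    ("latitude", 0.0000, 0.0000, 2),
    ("longitude", 1.0000, 0.8000, 2),
    ("region", 0.5294, 0.5806, 17),
    ("soilAvailableNitrogen", 1.0000, 1.0000, 3),
    ("soilBulkDensity", 0.0000, 0.0000, 1),
    ("soilDepth", 0.8333, 0.7143, 6),
    ("soilOrganicCarbon", 0.5500, 0.6471, 20),
    ("soilPH", 0.7500, 0.6667, 4),
    ("soilReferenceGroup", 1.0000, 1.0000, 1),
    ("soilTexture", 0.2727, 0.4000, 11),
    ("startTime", 0.8833, 0.8413, 60),
]

# Macro F1: unweighted mean of per-label F1, so a label with 1 test entity
# counts as much as one with 86.
macro_f1 = sum(f1 for _, _, f1, _ in PER_LABEL) / len(PER_LABEL)

# Micro recall: correctly found entities over all gold entities.
true_positives = sum(round(recall * support) for _, recall, _, support in PER_LABEL)
total_support = sum(support for *_, support in PER_LABEL)
micro_recall = true_positives / total_support

# Micro F1 from the overall precision and recall in the first table.
precision, recall = 0.7745, 0.7524
micro_f1 = 2 * precision * recall / (precision + recall)

print(f"macro F1     = {macro_f1:.4f}")      # 0.6189
print(f"micro recall = {micro_recall:.4f}")  # 0.7524
print(f"micro F1     = {micro_f1:.4f}")      # 0.7633
```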


## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** 1 A40 GPU
- **Hours used:** less than one hour
- **Cloud Provider:** High-Performance Computing (HPC) - University of Bonn 


## Model Card Contact

[Abanoub Abdelmalak](https://github.com/AbanoubAbdelmalak)
Email: abdelmalak@zbmed.de

## Contributors
[Murtuza Husain](https://github.com/murtuza10)