---
datasets:
- psytechlab/rus_rudeft_wcl-wiki
language:
- ru
base_model:
- DeepPavlov/rubert-base-cased
---
# RuBERT base fine-tuned on ruDEFT and WCL Wiki Ru datasets for NER
The model extracts terms and their definitions from text as an NER task.
Labels:
- Term - a word or phrase that names a concept.
- Definition - the span of text that defines a term.
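For a quick start, the model can be queried through the `transformers` pipeline API (a minimal sketch; `aggregation_strategy="simple"` is an assumed setting that merges subword pieces into word-level entities):
```python
from transformers import pipeline

# Token-classification pipeline; "simple" aggregation merges subword
# pieces and groups contiguous B-/I- labels into single entities.
extractor = pipeline(
    "token-classification",
    model="psytechlab/wcl-wiki_rudeft__ner-model",
    aggregation_strategy="simple",
)

print(extractor("оромо — это африканская этническая группа."))
# Illustrative output shape: [{'entity_group': 'Term', 'word': 'оромо', ...}, ...]
```
For finer control over token-to-word alignment, run the model manually: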
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("psytechlab/wcl-wiki_rudeft__ner-model")
model = AutoModelForTokenClassification.from_pretrained("psytechlab/wcl-wiki_rudeft__ner-model")
model.eval()

# "Oromo is an African ethnic group living in Ethiopia and, to a lesser extent, in Kenya."
inputs = tokenizer("оромо — это африканская этническая группа, проживающая в эфиопии и в меньшей степени в кении.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label for every subword token.
predictions = torch.argmax(outputs.logits, dim=-1)[0].tolist()

# Map subword tokens back to words; each word takes the label of its first subword.
word_ids = inputs.word_ids(batch_index=0)
word_to_labels = {}
for word_id, label_id in zip(word_ids, predictions):
    if word_id is None:  # skip special tokens ([CLS], [SEP])
        continue
    word_to_labels.setdefault(word_id, []).append(label_id)

word_level_predictions = [model.config.id2label[labels[0]] for labels in word_to_labels.values()]
print(word_level_predictions)
# ['B-Term', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
```
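To turn the word-level BIO labels into term/definition spans, contiguous `B-`/`I-` runs can be grouped. A minimal self-contained sketch (the `bio_to_spans` helper and the toy labels are illustrative, not part of the model):
```python
def bio_to_spans(words, labels):
    """Group word-level BIO labels into (entity_type, text) spans."""
    spans, current_words, current_type = [], [], None
    for word, label in zip(words, labels):
        if label.startswith("B-"):
            if current_words:  # close the previous span
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = label[2:], [word]
        elif label.startswith("I-") and current_type == label[2:]:
            current_words.append(word)  # continue the current span
        else:
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans

# Toy input for illustration only:
words = ["оромо", "—", "это", "африканская", "этническая", "группа"]
labels = ["B-Term", "O", "B-Definition", "I-Definition", "I-Definition", "I-Definition"]
print(bio_to_spans(words, labels))
# [('Term', 'оромо'), ('Definition', 'это африканская этническая группа')]
```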
## Training procedure
### Training
Training used the Hugging Face `Trainer` class with the following arguments:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=7,
    weight_decay=0.01,
)
```
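The full `Trainer` setup is not published; a plausible sketch for token classification (the split names, label count, and collator choice are assumptions, and tokenization with label alignment is omitted):
```python
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
)

dataset = load_dataset("psytechlab/rus_rudeft_wcl-wiki")
tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "DeepPavlov/rubert-base-cased",
    num_labels=5,  # O, B-Term, I-Term, B-Definition, I-Definition
)

# Pads inputs and labels to the longest sequence in each batch.
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,              # the TrainingArguments shown above
    train_dataset=dataset["train"],  # assumed split name
    eval_dataset=dataset["test"],    # assumed split name
    data_collator=data_collator,
)
trainer.train()
```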
### Metrics
Metrics on the combined set (ruDEFT + WCL Wiki Ru), `psytechlab/rus_rudeft_wcl-wiki`:
```
              precision    recall  f1-score   support

I-Definition       0.75      0.90      0.82      3344
B-Definition       0.62      0.73      0.67       230
      I-Term       0.80      0.85      0.82       524
           O       0.97      0.91      0.94     11359
      B-Term       0.96      0.93      0.94      2977

    accuracy                           0.91     18434
   macro avg       0.82      0.87      0.84     18434
weighted avg       0.92      0.91      0.91     18434
```
Metrics only on `astromis/ruDEFT`:
```
              precision    recall  f1-score   support

I-Definition       0.90      0.90      0.90      3344
B-Definition       0.74      0.73      0.74       230
      I-Term       0.83      0.87      0.85       389
           O       0.86      0.86      0.86      2222
      B-Term       0.87      0.85      0.86       638

    accuracy                           0.87      6823
   macro avg       0.84      0.84      0.84      6823
weighted avg       0.87      0.87      0.87      6823
```
Metrics only on `astromis/WCL_Wiki_Ru`:
```
              precision    recall  f1-score   support

I-Definition       0.00      0.00      0.00         0
B-Definition       0.00      0.00      0.00         0
      I-Term       0.72      0.78      0.75       135
           O       1.00      0.93      0.96      9137
      B-Term       0.99      0.95      0.97      2339

    accuracy                           0.93     11611
   macro avg       0.54      0.53      0.54     11611
weighted avg       0.99      0.93      0.96     11611
```
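The reports above follow scikit-learn's token-level `classification_report` format. A minimal sketch of how such a report is produced (the toy label lists are for illustration; real ones come from running the model over a test split and aligning labels as in the usage example):
```python
from sklearn.metrics import classification_report

# Flattened gold and predicted word-level labels (toy values).
y_true = ["B-Term", "O", "B-Definition", "I-Definition", "O"]
y_pred = ["B-Term", "O", "B-Definition", "O", "O"]

print(classification_report(y_true, y_pred, zero_division=0))
```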
## Citation
```bibtex
@article{Popov2025TransferringNL,
title={Transferring Natural Language Datasets Between Languages Using Large Language Models for Modern Decision Support and Sci-Tech Analytical Systems},
author={Dmitrii Popov and Egor Terentev and Danil Serenko and Ilya Sochenkov and Igor Buyanov},
journal={Big Data and Cognitive Computing},
year={2025},
url={https://api.semanticscholar.org/CorpusID:278179500}
}
``` |