---
datasets:
- psytechlab/rus_rudeft_wcl-wiki
language:
- ru
base_model:
- DeepPavlov/rubert-base-cased
---

# RuBERT base fine-tuned on ruDEFT and WCL Wiki Ru datasets for NER
The model extracts terms and their definitions from text.
Labels:
- Term - a word or phrase naming a concept.
- Definition - the span that defines a term.

```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("psytechlab/wcl-wiki_rudeft__ner-model")
model = AutoModelForTokenClassification.from_pretrained("psytechlab/wcl-wiki_rudeft__ner-model")
model.eval()

inputs = tokenizer('оромо — это африканская этническая группа, проживающая в эфиопии и в меньшей степени в кении.', return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)[0].tolist() 

tokens = inputs["input_ids"][0]
word_ids = inputs.word_ids(batch_index=0)

word_to_labels = {}
for token_id, word_id, label_id in zip(tokens, word_ids, predictions):
    if word_id is None:
        continue
    if word_id not in word_to_labels:
        word_to_labels[word_id] = []
    word_to_labels[word_id].append(label_id)

# Each word takes the label predicted for its first subtoken.
word_level_predictions = [model.config.id2label[labels[0]] for labels in word_to_labels.values()]

print(word_level_predictions)
# ['B-Term', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
```
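The snippet above stops at word-level BIO labels. To get the actual term and definition spans, the labels can be grouped back into contiguous entities. A minimal sketch (the `bio_to_spans` helper is hypothetical, not part of the model's API):

```python
# Group parallel word/label lists with BIO tags into (entity_type, text) spans.
# Hypothetical helper for post-processing the model's word-level predictions.
def bio_to_spans(words, labels):
    spans = []
    current_type, current_words = None, []
    for word, label in zip(words, labels):
        if label.startswith("B-"):
            # A "B-" tag always starts a new span, closing any open one.
            if current_type is not None:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = label[2:], [word]
        elif label.startswith("I-") and current_type == label[2:]:
            # An "I-" tag continues the open span of the same type.
            current_words.append(word)
        else:
            # "O" (or a mismatched "I-") closes the open span.
            if current_type is not None:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_words)))
    return spans

words = "оромо — это африканская этническая группа".split()
labels = ["B-Term", "O", "O", "B-Definition", "I-Definition", "I-Definition"]
print(bio_to_spans(words, labels))
# [('Term', 'оромо'), ('Definition', 'африканская этническая группа')]
```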

## Training procedure

### Training
The training was done with the Hugging Face `Trainer` class using the following parameters:
```python
training_args = TrainingArguments(
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=7,
    weight_decay=0.01,
)
```
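The card does not show the preprocessing step. A common step when fine-tuning BERT-style models for token classification (an assumption here, not the card's original pipeline) is aligning word-level label ids to subword tokens, so the loss ignores special tokens and continuation pieces via the sentinel `-100`:

```python
# Align word-level label ids to subword tokens for token classification.
# `word_ids` mimics a fast tokenizer's BatchEncoding.word_ids() output:
# None for special tokens, otherwise the index of the source word.
def align_labels(word_ids, word_labels, label_all_tokens=False):
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)  # special token: ignored by the loss
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subtoken keeps the label
        else:
            # continuation subtoken: ignored unless label_all_tokens is set
            aligned.append(word_labels[word_id] if label_all_tokens else -100)
        previous = word_id
    return aligned

# [CLS], "оромо" split into 2 subtokens, "—", [SEP]
print(align_labels([None, 0, 0, 1, None], [1, 0]))
# [-100, 1, -100, 0, -100]
```

With `label_all_tokens=True` every subtoken would carry its word's label instead; the inference snippet above correspondingly reads only the first subtoken's label per word.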

### Metrics
Metrics on the combined set (ruDEFT + WCL Wiki Ru), `psytechlab/rus_rudeft_wcl-wiki`:
```text
              precision    recall  f1-score   support

I-Definition       0.75      0.90      0.82      3344
B-Definition       0.62      0.73      0.67       230
      I-Term       0.80      0.85      0.82       524
           O       0.97      0.91      0.94     11359
      B-Term       0.96      0.93      0.94      2977

    accuracy                           0.91     18434
   macro avg       0.82      0.87      0.84     18434
weighted avg       0.92      0.91      0.91     18434
```

Metrics only on `astromis/ruDEFT`:
```text
              precision    recall  f1-score   support

I-Definition       0.90      0.90      0.90      3344
B-Definition       0.74      0.73      0.74       230
      I-Term       0.83      0.87      0.85       389
           O       0.86      0.86      0.86      2222
      B-Term       0.87      0.85      0.86       638

    accuracy                           0.87      6823
   macro avg       0.84      0.84      0.84      6823
weighted avg       0.87      0.87      0.87      6823
```

Metrics only on `astromis/WCL_Wiki_Ru`:
```text
              precision    recall  f1-score   support

I-Definition       0.00      0.00      0.00         0
B-Definition       0.00      0.00      0.00         0
      I-Term       0.72      0.78      0.75       135
           O       1.00      0.93      0.96      9137
      B-Term       0.99      0.95      0.97      2339

    accuracy                           0.93     11611
   macro avg       0.54      0.53      0.54     11611
weighted avg       0.99      0.93      0.96     11611
```

# Citation

```bibtex
@article{Popov2025TransferringNL,
  title={Transferring Natural Language Datasets Between Languages Using Large Language Models for Modern Decision Support and Sci-Tech Analytical Systems},
  author={Dmitrii Popov and Egor Terentev and Danil Serenko and Ilya Sochenkov and Igor Buyanov},
  journal={Big Data and Cognitive Computing},
  year={2025},
  url={https://api.semanticscholar.org/CorpusID:278179500}
}
```