	---
	library_name: transformers
	pipeline_tag: text-classification
	base_model: EuroBERT/EuroBERT-210m
	base_model_relation: finetune
	tags:
	- eurobert
	- fine-tuned
	- transformers
	- pytorch
	- sequence-classification
	- multiclass
	- geopolitics
	- multilingual
	language:
	- en
	- de
	- fr
	- es
	- it
	---


	# EuroBERT Geopolitical Classifier (Multiclass)

	A fine-tuned version of `EuroBERT/EuroBERT-210m` that detects and categorizes geopolitical themes in news text, with a focus on European news.

	- Task: Sequence classification (single-label multiclass)
	- Labels: 11 geopolitical topics
	- Intended use: Topic categorization of news on geopolitical tensions (best performance on full article-level text)
	- Languages: English, German, French, Spanish, Italian
	- Framework: 🤗 Transformers (PyTorch)

	---

	## Quick start

	### Inference with `transformers`

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	model_id = "durrani95/eurobert-geopolitical-multiclass"

	# Load the tokenizer and the fine-tuned classification model.
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForSequenceClassification.from_pretrained(model_id)

	texts = [
	    "Russia cut off gas supplies to Europe amid rising tensions.",
	    "Terrorist activity has increased along the southern border.",
	    "New sanctions were imposed on financial institutions.",
	    "Talks at the UN Security Council failed to reach consensus.",
	    "Tariffs on soybeans are applied to pressure China into a deal with the US.",
	    "Tom and Jerry have a fight! The mouse finally had enough.",
	]

	# Tokenize as one padded batch; inputs longer than 512 tokens are truncated.
	inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

	# Forward pass without gradient tracking, then convert logits to probabilities.
	with torch.no_grad():
	    logits = model(**inputs).logits
	probs = torch.softmax(logits, dim=1)

	# Print the top label and its probability for each text.
	for text, p in zip(texts, probs):
	    label_id = int(p.argmax())
	    label = model.config.id2label[label_id]
	    confidence = float(p[label_id])
	    print(f"{label:>28} {confidence:6.2%} | {text}")
	```
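
The loop above reports only the top label. Since the model emits one logit per class, the same softmax step can also rank the runner-up labels, which may help when a text touches several themes. Below is a minimal sketch in plain Python, assuming made-up logits and a shortened label mapping (the real mapping lives in `model.config.id2label`). `top_k_labels` is an illustrative helper, not part of `transformers`.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_labels(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

# Hypothetical values for illustration only; not real model output.
id2label = {0: "energy_resource_conflicts", 1: "financial_sanctions", 2: "non_geopol"}
logits = [2.0, 1.0, -1.0]

for label, prob in top_k_labels(logits, id2label, k=2):
    print(f"{label:>28} {prob:6.2%}")
```

With the real model, one row of the `logits` tensor can be passed as `logits[i].tolist()` together with `model.config.id2label`.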


	## Category Definitions

	| Category | Description | Example |
	|----------|-------------|---------|
	| war_military_conflict | Armed conflicts, military operations, or war-related issues involving states or armed groups. | Russia’s invasion of Ukraine |
	| terrorism_insurgency | Terrorist attacks, counter-terrorism operations, or insurgent activity. | 9/11 attacks |
	| cyber_warfare | Cyberattacks or hacking by foreign states or international actors with strategic motives. | North Korea’s Sony hack |
	| trade_disputes | Tensions between states over trade policy, tariffs, or retaliation. | U.S.–China trade wars |
	| financial_sanctions | Economic penalties imposed by countries against targeted states, entities, or individuals. | U.S. sanctions on Iran’s banking sector |
	| regional_disintegration | Political developments that threaten the cohesion of existing regional entities. | Brexit |
	| energy_resource_conflicts | Disputes over energy access, distribution, or natural resource control. | OPEC oil disputes |
	| global_governance | Tensions involving international institutions or multilateral diplomacy. | NATO expansion |
	| nuclear_proliferation | Issues concerning the spread or control of nuclear weapons. | Iran nuclear deal |
	| territorial_disputes | Conflicts over land or maritime boundaries. | South China Sea tensions |
	| non_geopol | Texts without geopolitical relevance. | Domestic politics or economic updates |
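
Because `non_geopol` is one of the eleven classes, the softmax output can also be read as a binary detector: the probability that a text is geopolitical at all is one minus the probability assigned to `non_geopol`. The sketch below assumes a dictionary of label probabilities. The helper names, the example probabilities, and the 0.5 threshold are hypothetical choices for illustration, not values recommended by the model authors.

```python
def geopolitical_score(probs_by_label):
    """Probability mass on any geopolitical topic: 1 - P(non_geopol)."""
    return 1.0 - probs_by_label["non_geopol"]

def is_geopolitical(probs_by_label, threshold=0.5):
    """Flag a text as geopolitical when the score reaches the threshold."""
    return geopolitical_score(probs_by_label) >= threshold

# Made-up probabilities; the eight omitted labels are treated as zero for brevity.
example = {
    "trade_disputes": 0.70,
    "financial_sanctions": 0.22,
    "non_geopol": 0.08,
}

print(f"score={geopolitical_score(example):.2f} flagged={is_geopolitical(example)}")
```

A stricter threshold trades recall for precision, so it is worth tuning on labeled data from the target domain.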

	---

	## Training & Configuration

	- Base model: `EuroBERT/EuroBERT-210m`
	- Objective: Cross-entropy (single-label multiclass)
	- Number of labels: 11
	- Data: European news text labeled across geopolitical topics
	- Hardware: A100 GPU
	- Epochs: 1
	- Optimizer: AdamW with linear scheduler
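
The cross-entropy objective listed above penalizes the negative log-probability that the model assigns to the correct label. A small worked sketch in plain Python, using illustrative logits rather than values from training, shows the scale of the loss: with 11 classes, uniform logits give ln(11) ≈ 2.398, the loss of an uninformed classifier. `cross_entropy` here is a hand-written helper for illustration, not the training code used for this model.

```python
import math

NUM_LABELS = 11

def cross_entropy(logits, target):
    """Negative log softmax probability of the target class."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

# Uniform logits: every class gets probability 1/11.
uniform = [0.0] * NUM_LABELS
print(f"uninformed loss: {cross_entropy(uniform, target=3):.3f}")  # 2.398

# A confident, correct prediction drives the loss toward zero.
confident = [0.0] * NUM_LABELS
confident[3] = 8.0
print(f"confident loss:  {cross_entropy(confident, target=3):.4f}")
```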

	### Training setup

	| Parameter | Value |
	|-----------|-------|
	| Learning rate | 3e-5 |
	| Desired (effective) batch size | 64 |
	| Actual GPU batch size | 16 |
	| Gradient accumulation | 4 steps |
	| Weight decay | 1e-5 |
	| Betas | (0.9, 0.95) |
	| Epsilon | 1e-8 |
	| Max epochs | 1 |
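
The batch-size rows in the table fit together as 16 × 4 = 64: four micro-batches of 16 are accumulated before each optimizer step. The sketch below works through that arithmetic and a linear learning-rate decay in plain Python. It assumes a single GPU, no warmup, and a hypothetical dataset size of 64,000 examples, none of which are stated in this card. The function names are illustrative.

```python
import math

LEARNING_RATE = 3e-5      # from the table
PER_GPU_BATCH = 16        # from the table
GRAD_ACCUM_STEPS = 4      # from the table

def effective_batch_size(per_gpu=PER_GPU_BATCH, accum=GRAD_ACCUM_STEPS, n_gpus=1):
    """Examples consumed per optimizer step (n_gpus=1 is an assumption)."""
    return per_gpu * accum * n_gpus

def optimizer_steps(num_examples, epochs=1):
    """Optimizer updates needed to see num_examples for the given epochs."""
    return math.ceil(num_examples / effective_batch_size()) * epochs

def linear_lr(step, total_steps, base_lr=LEARNING_RATE, warmup_steps=0):
    """Linear warmup (optional) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total = optimizer_steps(64_000)  # hypothetical dataset size
print(effective_batch_size(), total)
for step in (0, total // 2, total):
    print(step, linear_lr(step, total))
```

With a single epoch, the learning rate reaches zero exactly when the one pass over the data ends.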


	---

	## Limitations & Risks

	- Performance may degrade under domain shift, for example on non-news or social media text.
	- The model predicts one dominant label per text; it is not multi-label.
	- Multilingual performance can vary across languages and registers.
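
One way to check the language caveat on your own data is to compute accuracy separately per language. The sketch below assumes you have a small labeled sample with a language code, a gold label, and the model's prediction for each text. The records shown are invented placeholders, not evaluation results for this model, and `accuracy_by_language` is a hypothetical helper.

```python
from collections import defaultdict

def accuracy_by_language(records):
    """records: iterable of (language, gold_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, gold, pred in records:
        total[lang] += 1
        correct[lang] += int(gold == pred)
    return {lang: correct[lang] / total[lang] for lang in total}

# Placeholder records for illustration only.
records = [
    ("en", "trade_disputes", "trade_disputes"),
    ("en", "non_geopol", "non_geopol"),
    ("de", "financial_sanctions", "trade_disputes"),
    ("de", "cyber_warfare", "cyber_warfare"),
]

for lang, acc in sorted(accuracy_by_language(records).items()):
    print(f"{lang}: {acc:.0%}")
```

A large gap between languages on a representative sample suggests adding in-domain examples for the weaker language before relying on the predictions.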

	---


	## How to cite

	If you use this model, please cite this repository and the EuroBERT base model.