# LegalBERT Fine-Tuned on LEDGAR Dataset
This model is a fine-tuned version of LegalBERT on the LEDGAR dataset for legal clause classification.
It classifies legal clauses into one of 100 clause types (e.g., confidentiality, termination, liability).
## Model Overview
- Base Model: nlpaueb/legal-bert-base-uncased
- Task: Multi-class clause classification
- Dataset: LEDGAR
- Language: English
- Number of labels: 100
- Fine-tuning epochs: 4
- Batch size: 32
- Optimizer: AdamW
- Mixed Precision (FP16): Enabled when CUDA is available (see the training sketch below)
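
The training script itself is not published in this card; the sketch below reconstructs a comparable setup from the hyperparameters above using the standard `Trainer` API. The output directory, `max_length`, and the LexGLUE dataset path are assumptions for illustration, not values taken from this model.

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base_model = "nlpaueb/legal-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(
    base_model, num_labels=100)  # 100 LEDGAR clause types

# LEDGAR as packaged in LexGLUE on the Hugging Face Hub (assumed source).
dataset = load_dataset("coastalcph/lex_glue", "ledgar")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True)

args = TrainingArguments(
    output_dir="legal-bert-ledgar",        # illustrative path
    num_train_epochs=4,                    # fine-tuning epochs
    per_device_train_batch_size=32,        # batch size
    fp16=torch.cuda.is_available(),        # mixed precision when CUDA is available
)

# Trainer optimizes with AdamW by default, matching the optimizer listed above.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```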
## Dataset Details
| Split | Samples | Description |
|---|---|---|
| Train | 60,000 | Used for model fine-tuning |
| Eval | 10,000 | Used for validation during training |
| Test | 10,000 | Held-out test set for final evaluation |
- Total samples: 80,000
- Number of labels: 100
- Text column: `text` (contains the clause text)
- Label column: `label`
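
LEDGAR is distributed as part of the LexGLUE benchmark on the Hugging Face Hub; assuming that source (its split sizes match the table above), the splits and label names can be inspected like this:

```python
from datasets import load_dataset

# LexGLUE packaging of LEDGAR (assumed source; the dataset path is not
# confirmed by this model card).
dataset = load_dataset("coastalcph/lex_glue", "ledgar")

print(dataset)                      # train (60,000) / validation (10,000) / test (10,000)
print(dataset["train"][0]["text"])  # the clause text column
label_names = dataset["train"].features["label"].names
print(len(label_names), label_names[:5])  # 100 clause-type names
```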
## Evaluation Results (on Test Set)
| Metric | Score |
|---|---|
| Accuracy | 0.8678 |
| Macro F1 | 0.7779 |
| Macro Precision | 0.7917 |
| Macro Recall | 0.7763 |
| Evaluation Time | 38.37 sec |
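
For reference, the macro-averaged metrics in this table correspond to the following computation. This is a hedged sketch using scikit-learn, not the evaluation script that produced the numbers above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def macro_metrics(labels: np.ndarray, preds: np.ndarray) -> dict:
    """Accuracy plus macro precision/recall/F1 from gold and predicted label IDs."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0)
    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_precision": precision,
        "macro_recall": recall,
        "macro_f1": f1,
    }
```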
## How to Use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
model_name = "FENTECH/Legal-BERT-Clause-Classification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Example inference
text = "The contractor shall maintain confidentiality of all client information."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
predicted_label = outputs.logits.argmax(dim=-1).item()
print("Predicted label ID:", predicted_label)
```