BERT Research Paper Classifier
Model Description
`bert_text_classifier` is a fine-tuned BERT model for classifying research papers into scientific disciplines. It reaches 95.39% accuracy on a held-out evaluation set drawn from a corpus of 140,000+ research papers spanning 9 major scientific categories.
- Model type: BERT for sequence classification
- Language(s): English
- License: MIT
- Finetuned from: bert-base-uncased
Intended Uses & Limitations
Primary Use
This model is intended for:
- Automatic categorization of research papers and academic publications
- Building academic recommendation systems
- Organizing digital libraries and research databases
- Educational applications in scientific literature analysis
Limitations
- Trained primarily on Mendeley research catalog data
- Papers from disciplines outside the 9 trained categories will still be assigned to one of them, since the classifier has no out-of-scope option
- Performs best on formal academic writing; accuracy may drop on informal or very short text
Categories
The model classifies research papers into 9 scientific disciplines:
| Category | Key Subfields |
|---|---|
| Biology | Genetics, Ecology, Biochemistry, Physiology |
| Business | Marketing, Finance, Management, Entrepreneurship |
| Chemistry | Organic Chemistry, Analytical Chemistry, Biochemistry |
| Computer Science | AI, Cloud Computing, Cybersecurity, Software Engineering |
| Environmental Science | Climate Change, Conservation, Sustainability |
| Mathematics | Algebra, Calculus, Statistics, Optimization |
| Medicine | Cardiology, Surgery, Neurology, Pediatrics |
| Physics | Quantum Mechanics, Astrophysics, Particle Physics |
| Psychology | Clinical, Cognitive, Social, Neuropsychology |
Training Data
Dataset Statistics
- Source: Mendeley Research Catalog
- Total Papers: 140,004 (after cleaning)
- Evaluation Samples: 27,953 (held-out split used for the metrics below)
- Retained after cleaning: 89.81% (140,004 of the original 155,882 records)
Data Distribution
- Psychology: 16,821 papers (12.0%)
- Chemistry: 16,675 papers (11.9%)
- Physics: 15,941 papers (11.4%)
- Business: 15,929 papers (11.4%)
- Mathematics: 15,464 papers (11.0%)
- Medicine: 15,361 papers (11.0%)
- Computer Science: 14,776 papers (10.6%)
- Biology: 14,729 papers (10.5%)
- Environmental Science: 14,308 papers (10.2%)
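
The exact cleaning and splitting code is not reproduced here; as a rough sketch, a stratified split like the one implied by these counts could be produced as follows (the `papers.csv` filename and column names are assumptions, not the actual Mendeley export):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical export of the Mendeley catalog: one row per paper
df = pd.read_csv("papers.csv")  # assumed columns: "abstract", "category"
df = df.dropna(subset=["abstract", "category"]).drop_duplicates(subset=["abstract"])

# Stratified split so each of the 9 categories keeps its share;
# 27,953 evaluation samples is roughly 20% of the 140,004 cleaned papers.
train_df, eval_df = train_test_split(
    df, test_size=0.2, stratify=df["category"], random_state=42
)
print(len(train_df), len(eval_df))
```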
Performance
Evaluation Results
```python
{
  'eval_loss': 0.184,
  'eval_accuracy': 0.9539,
  'eval_runtime': 428.03,
  'eval_samples_per_second': 65.306
}
```
Detailed Metrics
| Category | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Biology | 0.94 | 0.93 | 0.94 | 3,177 |
| Business | 0.96 | 0.97 | 0.97 | 3,179 |
| Chemistry | 0.94 | 0.96 | 0.95 | 3,073 |
| Computer Science | 0.96 | 0.93 | 0.95 | 2,987 |
| Environmental Science | 0.95 | 0.94 | 0.95 | 2,850 |
| Mathematics | 0.93 | 0.96 | 0.95 | 3,091 |
| Medicine | 0.97 | 0.96 | 0.96 | 3,067 |
| Physics | 0.97 | 0.95 | 0.96 | 3,181 |
| Psychology | 0.97 | 0.97 | 0.97 | 3,348 |
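
The per-category numbers follow the standard precision/recall/F1 definitions; a minimal sketch of how such a table can be generated with scikit-learn (the `y_true`/`y_pred` arrays here are toy stand-ins for the real 27,953 evaluation labels and model predictions):

```python
from sklearn.metrics import classification_report

categories = ['biology', 'business', 'chemistry', 'computerscience',
              'environmentalscience', 'mathematics', 'medicine', 'physics', 'psychology']

# Toy stand-ins: integer class indices for true labels and model predictions
y_true = [0, 0, 1, 2, 6, 8]
y_pred = [0, 2, 1, 2, 6, 8]

print(classification_report(y_true, y_pred,
                            labels=list(range(len(categories))),
                            target_names=categories,
                            zero_division=0))
```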
Usage
Direct Inference
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Emran025/bert_text_classifier")
model = AutoModelForSequenceClassification.from_pretrained("Emran025/bert_text_classifier")

# Example research paper abstract
text = """
This study explores novel deep learning architectures for protein structure
prediction using transformer-based models and attention mechanisms.
"""

# Preprocess and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=1).item()

# Map to category
categories = ['biology', 'business', 'chemistry', 'computerscience',
              'environmentalscience', 'mathematics', 'medicine', 'physics', 'psychology']
print(f"Predicted category: {categories[predicted_class]}")
```
Using Pipeline
```python
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="Emran025/bert_text_classifier",
                      tokenizer="Emran025/bert_text_classifier")

result = classifier("Advanced quantum computing algorithms for molecular simulation")
print(result)
```
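
If the pipeline returns a generic `LABEL_N` identifier rather than a category name (this depends on whether `id2label` is stored in the hosted config), you can attach readable names yourself. The alphabetical ordering below mirrors the `categories` list from the direct-inference example and is an assumption about the model's label indices:

```python
categories = ['biology', 'business', 'chemistry', 'computerscience',
              'environmentalscience', 'mathematics', 'medicine', 'physics', 'psychology']

# Assumed index-to-name mapping; adjust if the hosted config already defines it
classifier.model.config.id2label = {i: name for i, name in enumerate(categories)}
classifier.model.config.label2id = {name: i for i, name in enumerate(categories)}

result = classifier("Advanced quantum computing algorithms for molecular simulation")
print(result)  # labels now come back as category names
```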
Training Details
Hyperparameters
- Learning Rate: 2e-5
- Batch Size: 16
- Epochs: 3
- Max Sequence Length: 512 tokens
- Optimizer: AdamW
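
The full training script lives in the GitHub repository linked below; as a rough, minimal sketch of a `Trainer` setup reflecting the hyperparameters above (the two-example dataset is only a stand-in for the real 140k-paper corpus):

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

categories = ['biology', 'business', 'chemistry', 'computerscience',
              'environmentalscience', 'mathematics', 'medicine', 'physics', 'psychology']

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(categories),
    id2label=dict(enumerate(categories)),
    label2id={c: i for i, c in enumerate(categories)},
)

def tokenize(batch):
    # Max sequence length of 512 tokens, as listed above
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

# Toy stand-in for the cleaned dataset: abstract text plus integer category index
raw = Dataset.from_dict({"text": ["Protein folding in model organisms",
                                  "Market entry strategies for startups"],
                         "label": [0, 1]})
train_dataset = raw.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="bert_text_classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,   # Trainer uses AdamW by default
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```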
Training Environment
- Framework: PyTorch with Transformers
- Hardware: Google Colab GPU
- Training Time: ~6 hours
Citation
If you use this model in your research, please cite:
```bibtex
@misc{bert_research_classifier_2024,
  title        = {BERT Research Paper Classification Model},
  author       = {Emran Nasser and Mohammed Alyafrosy and Ryadh Alizi},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Emran025/bert_text_classifier}}
}
```
Contributors
- Emran Nasser (Emran025)
- Mohammed Alyafrosy
- Ryadh Alizi
License
MIT License - see LICENSE file for details.
Repository
https://github.com/Emran025/Research_Paper_Classification_model