BERT Research Paper Classifier
Model Description
`bert_text_classifier` is a fine-tuned BERT model for classifying research papers into scientific disciplines. It reaches 95.39% accuracy on a held-out evaluation set drawn from a corpus of 140,000+ research papers spanning 9 major scientific categories.
- Model type: BERT for sequence classification
- Language(s): English
- License: MIT
- Finetuned from: bert-base-uncased
Intended Uses & Limitations
Primary Use
This model is intended for:
- Automatic categorization of research papers and academic publications
- Building academic recommendation systems
- Organizing digital libraries and research databases
- Educational applications in scientific literature analysis
Limitations
- Trained primarily on Mendeley research catalog data
- Papers from disciplines outside the 9 trained categories will still be assigned to one of them, since the classifier has no out-of-scope option
- Performs best on formal academic writing; accuracy may drop on informal or very short text
Categories
The model classifies research papers into 9 scientific disciplines:
| Category | Key Subfields |
|---|---|
| Biology | Genetics, Ecology, Biochemistry, Physiology |
| Business | Marketing, Finance, Management, Entrepreneurship |
| Chemistry | Organic Chemistry, Analytical Chemistry, Biochemistry |
| Computer Science | AI, Cloud Computing, Cybersecurity, Software Engineering |
| Environmental Science | Climate Change, Conservation, Sustainability |
| Mathematics | Algebra, Calculus, Statistics, Optimization |
| Medicine | Cardiology, Surgery, Neurology, Pediatrics |
| Physics | Quantum Mechanics, Astrophysics, Particle Physics |
| Psychology | Clinical, Cognitive, Social, Neuropsychology |
Training Data
Dataset Statistics
- Source: Mendeley Research Catalog
- Total Papers: 140,004 (after cleaning)
- Evaluation Samples: 27,953 (held-out split used for the metrics below)
- Retained after cleaning: 89.81% (140,004 of the original 155,882 records)
Data Distribution
- Psychology: 16,821 papers (12.0%)
- Chemistry: 16,675 papers (11.9%)
- Physics: 15,941 papers (11.4%)
- Business: 15,929 papers (11.4%)
- Mathematics: 15,464 papers (11.0%)
- Medicine: 15,361 papers (11.0%)
- Computer Science: 14,776 papers (10.6%)
- Biology: 14,729 papers (10.5%)
- Environmental Science: 14,308 papers (10.2%)
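
The exact cleaning and splitting code is not reproduced here; as a rough sketch, a stratified split like the one implied by these counts could be produced as follows (the `papers.csv` filename and column names are assumptions, not the actual Mendeley export):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical export of the Mendeley catalog: one row per paper
df = pd.read_csv("papers.csv")  # assumed columns: "abstract", "category"
df = df.dropna(subset=["abstract", "category"]).drop_duplicates(subset=["abstract"])

# Stratified split so each of the 9 categories keeps its share;
# 27,953 evaluation samples is roughly 20% of the 140,004 cleaned papers.
train_df, eval_df = train_test_split(
    df, test_size=0.2, stratify=df["category"], random_state=42
)
print(len(train_df), len(eval_df))
```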
Performance
Evaluation Results
```python
{
  'eval_loss': 0.184,
  'eval_accuracy': 0.9539,
  'eval_runtime': 428.03,
  'eval_samples_per_second': 65.306
}
```
Detailed Metrics
| Category | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Biology | 0.94 | 0.93 | 0.94 | 3,177 |
| Business | 0.96 | 0.97 | 0.97 | 3,179 |
| Chemistry | 0.94 | 0.96 | 0.95 | 3,073 |
| Computer Science | 0.96 | 0.93 | 0.95 | 2,987 |
| Environmental Science | 0.95 | 0.94 | 0.95 | 2,850 |
| Mathematics | 0.93 | 0.96 | 0.95 | 3,091 |
| Medicine | 0.97 | 0.96 | 0.96 | 3,067 |
| Physics | 0.97 | 0.95 | 0.96 | 3,181 |
| Psychology | 0.97 | 0.97 | 0.97 | 3,348 |
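
The per-category numbers follow the standard precision/recall/F1 definitions; a minimal sketch of how such a table can be generated with scikit-learn (the `y_true`/`y_pred` arrays here are toy stand-ins for the real 27,953 evaluation labels and model predictions):

```python
from sklearn.metrics import classification_report

categories = ['biology', 'business', 'chemistry', 'computerscience',
              'environmentalscience', 'mathematics', 'medicine', 'physics', 'psychology']

# Toy stand-ins: integer class indices for true labels and model predictions
y_true = [0, 0, 1, 2, 6, 8]
y_pred = [0, 2, 1, 2, 6, 8]

print(classification_report(y_true, y_pred,
                            labels=list(range(len(categories))),
                            target_names=categories,
                            zero_division=0))
```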
Usage
Direct Inference
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Emran025/bert_text_classifier")
model = AutoModelForSequenceClassification.from_pretrained("Emran025/bert_text_classifier")

# Example research paper abstract
text = """
This study explores novel deep learning architectures for protein structure
prediction using transformer-based models and attention mechanisms.
"""

# Preprocess and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=1).item()

# Map to category
categories = ['biology', 'business', 'chemistry', 'computerscience',
              'environmentalscience', 'mathematics', 'medicine', 'physics', 'psychology']
print(f"Predicted category: {categories[predicted_class]}")
```
Using Pipeline
```python
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="Emran025/bert_text_classifier",
                      tokenizer="Emran025/bert_text_classifier")

result = classifier("Advanced quantum computing algorithms for molecular simulation")
print(result)
```
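
If the pipeline returns a generic `LABEL_N` identifier rather than a category name (this depends on whether `id2label` is stored in the hosted config), you can attach readable names yourself. The alphabetical ordering below mirrors the `categories` list from the direct-inference example and is an assumption about the model's label indices:

```python
categories = ['biology', 'business', 'chemistry', 'computerscience',
              'environmentalscience', 'mathematics', 'medicine', 'physics', 'psychology']

# Assumed index-to-name mapping; adjust if the hosted config already defines it
classifier.model.config.id2label = {i: name for i, name in enumerate(categories)}
classifier.model.config.label2id = {name: i for i, name in enumerate(categories)}

result = classifier("Advanced quantum computing algorithms for molecular simulation")
print(result)  # labels now come back as category names
```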
Training Details
Hyperparameters
- Learning Rate: 2e-5
- Batch Size: 16
- Epochs: 3
- Max Sequence Length: 512 tokens
- Optimizer: AdamW
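
The full training script lives in the GitHub repository linked below; as a rough, minimal sketch of a `Trainer` setup reflecting the hyperparameters above (the two-example dataset is only a stand-in for the real 140k-paper corpus):

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

categories = ['biology', 'business', 'chemistry', 'computerscience',
              'environmentalscience', 'mathematics', 'medicine', 'physics', 'psychology']

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(categories),
    id2label=dict(enumerate(categories)),
    label2id={c: i for i, c in enumerate(categories)},
)

def tokenize(batch):
    # Max sequence length of 512 tokens, as listed above
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

# Toy stand-in for the cleaned dataset: abstract text plus integer category index
raw = Dataset.from_dict({"text": ["Protein folding in model organisms",
                                  "Market entry strategies for startups"],
                         "label": [0, 1]})
train_dataset = raw.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="bert_text_classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,   # Trainer uses AdamW by default
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```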
Training Environment
- Framework: PyTorch with Transformers
- Hardware: Google Colab GPU
- Training Time: ~6 hours
Citation
If you use this model in your research, please cite:
```bibtex
@misc{bert_research_classifier_2024,
  title        = {BERT Research Paper Classification Model},
  author       = {Emran Nasser and Mohammed Alyafrosy and Ryadh Alizi},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Emran025/bert_text_classifier}}
}
```
Contributors
- Emran Nasser (Emran025)
- Mohammed Alyafrosy
- Ryadh Alizi
License
MIT License - see LICENSE file for details.
Repository
https://github.com/Emran025/Research_Paper_Classification_model