BERT Research Paper Classifier

Model Description

bert_text_classifier is a fine-tuned BERT model for classifying research papers into scientific disciplines. It achieves 95.39% accuracy on a held-out evaluation set drawn from a dataset of 140,000+ research papers spanning 9 major scientific categories.

  • Model type: BERT for sequence classification
  • Language(s): English
  • License: MIT
  • Finetuned from: bert-base-uncased

Intended Uses & Limitations

Primary Use

This model is intended for:

  • Automatic categorization of research papers and academic publications
  • Building academic recommendation systems
  • Organizing digital libraries and research databases
  • Educational applications in scientific literature analysis

Limitations

  • Trained primarily on Mendeley research catalog data
  • Performance may vary on papers outside the 9 trained categories
  • Best performance on formal academic writing style

Categories

The model classifies research papers into 9 scientific disciplines:

| Category | Key Subfields |
| --- | --- |
| Biology | Genetics, Ecology, Biochemistry, Physiology |
| Business | Marketing, Finance, Management, Entrepreneurship |
| Chemistry | Organic Chemistry, Analytical Chemistry, Biochemistry |
| Computer Science | AI, Cloud Computing, Cybersecurity, Software Engineering |
| Environmental Science | Climate Change, Conservation, Sustainability |
| Mathematics | Algebra, Calculus, Statistics, Optimization |
| Medicine | Cardiology, Surgery, Neurology, Pediatrics |
| Physics | Quantum Mechanics, Astrophysics, Particle Physics |
| Psychology | Clinical, Cognitive, Social, Neuropsychology |

Training Data

Dataset Statistics

  • Source: Mendeley Research Catalog
  • Total Papers: 140,004 (after cleaning)
  • Evaluation Samples: 27,953 (held-out set)
  • Cleaning Retention: 89.81% (140,004 of the original 155,882 records kept)
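
A rough sketch of the kind of cleaning and splitting that produces these counts is shown below; the file name, column names, and filtering rules are assumptions for illustration, not the authors' released preprocessing code.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical preprocessing; the file and column names are assumed.
df = pd.read_csv("mendeley_catalog.csv")            # ~155,882 raw records
df = df.dropna(subset=["abstract", "category"])     # drop rows missing text or label
df = df.drop_duplicates(subset=["abstract"])        # remove duplicate abstracts
# After cleaning, roughly 140,004 rows remain (~89.81% of the original).

# Hold out 27,953 papers for evaluation, stratified by category.
train_df, eval_df = train_test_split(
    df, test_size=27953, stratify=df["category"], random_state=42
)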

Data Distribution

  • Psychology: 16,821 papers (12.0%)
  • Chemistry: 16,675 papers (11.9%)
  • Physics: 15,941 papers (11.4%)
  • Business: 15,929 papers (11.4%)
  • Mathematics: 15,464 papers (11.0%)
  • Medicine: 15,361 papers (11.0%)
  • Computer Science: 14,776 papers (10.6%)
  • Biology: 14,729 papers (10.5%)
  • Environmental Science: 14,308 papers (10.2%)

Performance

Evaluation Results


{
'eval_loss': 0.184,
'eval_accuracy': 0.9539,
'eval_runtime': 428.03,
'eval_samples_per_second': 65.306
}

Detailed Metrics

| Category | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| Biology | 0.94 | 0.93 | 0.94 | 3,177 |
| Business | 0.96 | 0.97 | 0.97 | 3,179 |
| Chemistry | 0.94 | 0.96 | 0.95 | 3,073 |
| Computer Science | 0.96 | 0.93 | 0.95 | 2,987 |
| Environmental Science | 0.95 | 0.94 | 0.95 | 2,850 |
| Mathematics | 0.93 | 0.96 | 0.95 | 3,091 |
| Medicine | 0.97 | 0.96 | 0.96 | 3,067 |
| Physics | 0.97 | 0.95 | 0.96 | 3,181 |
| Psychology | 0.97 | 0.97 | 0.97 | 3,348 |
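
These per-category numbers are standard precision/recall/F1 scores over the 27,953-paper evaluation split. A minimal sketch of how the same report can be produced with scikit-learn follows; the short label lists are placeholders standing in for the real evaluation labels and model predictions.

from sklearn.metrics import classification_report

# Placeholder labels/predictions; in practice these come from running the
# model over the full 27,953-paper evaluation split.
y_true = ["physics", "biology", "medicine", "psychology"]
y_pred = ["physics", "biology", "psychology", "psychology"]

print(classification_report(y_true, y_pred, digits=2))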

Usage

Direct Inference

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Emran025/bert_text_classifier")
model = AutoModelForSequenceClassification.from_pretrained("Emran025/bert_text_classifier")

# Example research paper abstract
text = """
This study explores novel deep learning architectures for protein structure 
prediction using transformer-based models and attention mechanisms.
"""

# Preprocess and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=1).item()

# Map to category
categories = ['biology', 'business', 'chemistry', 'computerscience', 
              'environmentalscience', 'mathematics', 'medicine', 'physics', 'psychology']
print(f"Predicted category: {categories[predicted_class]}")
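
Continuing from the snippet above, several abstracts can also be scored in one batch; the example abstracts here are made up for illustration.

# Batch inference, reusing the tokenizer, model, and categories defined above.
abstracts = [
    "A randomized trial of statin therapy in patients with coronary artery disease.",
    "Spectral methods for solving nonlinear partial differential equations.",
]
inputs = tokenizer(abstracts, return_tensors="pt", truncation=True, padding=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
for abstract, idx in zip(abstracts, logits.argmax(dim=-1).tolist()):
    print(f"{categories[idx]}: {abstract[:50]}...")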

Using Pipeline

from transformers import pipeline

classifier = pipeline("text-classification", 
                     model="Emran025/bert_text_classifier",
                     tokenizer="Emran025/bert_text_classifier")

result = classifier("Advanced quantum computing algorithms for molecular simulation")
print(result)
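
Recent versions of the transformers text-classification pipeline also accept a top_k argument at call time, returning scores for several categories instead of only the best one:

# Return the three highest-scoring categories with their probabilities.
print(classifier("Advanced quantum computing algorithms for molecular simulation", top_k=3))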

Training Details

Hyperparameters

  • Learning Rate: 2e-5
  • Batch Size: 16
  • Epochs: 3
  • Max Sequence Length: 512 tokens
  • Optimizer: AdamW
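
A minimal fine-tuning sketch wiring these hyperparameters into the Hugging Face Trainer is shown below; it is an illustration, not the authors' training script, and train_dataset / eval_dataset are assumed to be already-tokenized datasets with a label column.

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=9)

args = TrainingArguments(
    output_dir="bert_text_classifier",
    learning_rate=2e-5,                  # reported learning rate
    per_device_train_batch_size=16,      # reported batch size
    num_train_epochs=3,                  # reported epoch count
)                                        # AdamW is the Trainer's default optimizer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,         # assumed: tokenized training split (max length 512)
    eval_dataset=eval_dataset,           # assumed: tokenized evaluation split
)
trainer.train()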

Training Environment

  • Framework: PyTorch with Transformers
  • Hardware: Google Colab GPU
  • Training Time: ~6 hours

Citation

If you use this model in your research, please cite:

@misc{bert_research_classifier_2024,
  title = {BERT Research Paper Classification Model},
  author = {Emran Nasser and Mohammed Alyafrosy and Ryadh Alizi},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Emran025/bert_text_classifier}}
}

Contributors

  • Emran Nasser (Emran025)
  • Mohammed Alyafrosy
  • Ryadh Alizi

License

MIT License - see LICENSE file for details.

Repository

https://github.com/Emran025/Research_Paper_Classification_model
