SciTopicSentimentClassifier

🔬 Overview

SciTopicSentimentClassifier is a multi-label classification model fine-tuned to predict, from a research paper's abstract, both its scientific topic(s) and its underlying sentiment (high-positive or low-negative). It is intended for automated paper categorization, literature review triage, and scientific trend analysis.

The model was trained on the SciTopicSentimentDataset (a proprietary dataset similar to the generated Dataset 1), which links abstract text to predefined scientific topics and a binarized sentiment score derived from the original continuous value.
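
As a rough illustration of what such a training target could look like, the sketch below encodes one example as a multi-hot vector with 10 topic slots followed by 2 sentiment slots, matching the output layer described under Model Architecture. The slot order, the 0.5 cut-off, and the helper name `encode_target` are assumptions made for illustration; the dataset's actual binarization rule is not documented here.

```python
from typing import List

NUM_TOPICS = 10            # number of predefined topics stated in this card
NUM_SENTIMENTS = 2         # high-positive and low-negative
SENTIMENT_THRESHOLD = 0.5  # assumption: the real cut-off is not documented


def encode_target(topic_ids: List[int], sentiment_score: float) -> List[float]:
    """Build a multi-hot target: 10 topic slots followed by 2 sentiment slots."""
    target = [0.0] * (NUM_TOPICS + NUM_SENTIMENTS)
    for topic_id in topic_ids:
        if not 0 <= topic_id < NUM_TOPICS:
            raise ValueError(f"topic id {topic_id} is out of range")
        target[topic_id] = 1.0
    # Assumed layout: slot 10 = high-positive, slot 11 = low-negative.
    if sentiment_score >= SENTIMENT_THRESHOLD:
        target[NUM_TOPICS] = 1.0
    else:
        target[NUM_TOPICS + 1] = 1.0
    return target


# Hypothetical example: an abstract tagged with topics 2 and 7 and a score of 0.83.
example = encode_target(topic_ids=[2, 7], sentiment_score=0.83)
print(example)
```

Exactly one sentiment slot is set per example, while any number of topic slots may be set.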

🧠 Model Architecture

This model is an adaptation of DistilBERT, a smaller, faster, and lighter version of BERT.

Base Model: distilbert-base-uncased
Modification: A custom classification head is added on top of the DistilBERT pooled output (the final hidden state of the first, [CLS], token; DistilBERT has no separate pooler layer).
Output Layer: The final layer is a dense layer with 12 outputs (10 for scientific topics + 2 for sentiment classes), followed by a Sigmoid activation function to allow for multi-label prediction (an abstract can belong to multiple topics/sentiments).
Input: Tokenized abstract text (up to 512 tokens).
Task: Multi-Label Text Classification.
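
The head described above can be sketched in PyTorch roughly as follows. This is a minimal sketch, not the released implementation: the class name `MultiLabelHead`, the dropout rate, and the random stand-in for the encoder output are assumptions. The hidden size of 768 is that of distilbert-base-uncased.

```python
import torch
from torch import nn


class MultiLabelHead(nn.Module):
    """Hypothetical head: [CLS] vector -> dropout -> dense layer with 12 logits."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 12, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)  # assumed rate, not documented in this card
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        cls_vector = last_hidden_state[:, 0]  # first token of each sequence
        return self.classifier(self.dropout(cls_vector))  # raw logits


head = MultiLabelHead().eval()
# Random stand-in for the encoder output, shaped (batch, tokens, hidden).
fake_hidden = torch.randn(2, 16, 768)
with torch.no_grad():
    logits = head(fake_hidden)
    probs = torch.sigmoid(logits)  # one independent probability per label

# For training, BCEWithLogitsLoss applies the sigmoid internally.
targets = torch.zeros(2, 12)
targets[0, 3] = 1.0   # hypothetical topic slot
targets[0, 10] = 1.0  # hypothetical sentiment slot
loss = nn.BCEWithLogitsLoss()(logits, targets)
print(probs.shape, loss.item())
```

Because each output passes through its own sigmoid rather than a shared softmax, the 12 probabilities do not sum to 1 and several labels can be active at once.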

🚀 Intended Use

Automated Labeling: Automatically assign relevant topic tags to new scientific publication abstracts.
Research Triage: Quickly filter papers based on subject matter and the perceived 'success' or 'novelty' indicated by the abstract's sentiment.
Scientific Landscape Mapping: Analyze large corpora of papers to track emerging positive/negative trends in specific research areas.
Indexing Systems: Integration into library or repository indexing services.

⚠️ Limitations

Topic Granularity: The model is limited to the 10 predefined topics in its training set. It may perform poorly on highly niche or interdisciplinary topics outside this scope.
Sentiment Scope: The sentiment is coarse-grained (high vs. low) based on a metric derived from the abstract's language (e.g., using words like "novel," "significant," "limitations," "challenges"). It does not capture nuanced human-level emotional sentiment.
Language: Trained exclusively on English abstracts.
Max Length: Input texts longer than 512 tokens are truncated.
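
One possible workaround for the length limit, which this card does not describe or endorse, is to split a long token sequence into overlapping windows, score each window, and average the per-label probabilities. The sketch below shows only the windowing and averaging logic on plain lists; the function names, the 510-token window (leaving room for the two special tokens), and the 128-token overlap are assumptions.

```python
from typing import List


def split_into_windows(token_ids: List[int], max_len: int = 510,
                       stride: int = 128) -> List[List[int]]:
    """Split token ids into overlapping windows of at most max_len tokens."""
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    if len(token_ids) <= max_len:
        return [token_ids]
    step = max_len - stride  # consecutive windows share `stride` tokens
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the text
    return windows


def average_scores(per_window_probs: List[List[float]]) -> List[float]:
    """Average per-label probabilities across windows."""
    count = len(per_window_probs)
    return [sum(column) / count for column in zip(*per_window_probs)]


# Stand-in for a 1000-token abstract; real ids would come from the tokenizer.
windows = split_into_windows(list(range(1000)))
print([len(w) for w in windows])

# Hypothetical per-window scores for two labels.
print(average_scores([[0.2, 0.8], [0.4, 0.6]]))
```

Averaging is only one aggregation choice; taking the per-label maximum would instead favor any window that strongly signals a topic.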

💻 Example Code

To use the model for prediction:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "your-username/SciTopicSentimentClassifier" # Replace with actual HuggingFace path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Sample Abstract
abstract = "We propose a novel architecture combining convolutional and recurrent neural networks for multi-modal data fusion, demonstrating significant performance gains in complex classification tasks, overcoming prior limitations."

# Preprocess the input (abstracts longer than 512 tokens are truncated)
inputs = tokenizer(abstract, return_tensors="pt", truncation=True, max_length=512)

# Run inference
with torch.no_grad():
    logits = model(**inputs).logits

# Apply sigmoid for multi-label scores
probs = torch.sigmoid(logits)

# Get predicted labels (e.g., probability > 0.5)
labels = model.config.id2label
predictions = []
for i, prob in enumerate(probs[0]):
    if prob > 0.5:
        predictions.append(labels[i])

print(f"Abstract: {abstract[:80]}...")
print(f"Predicted Labels: {predictions}")
# Expected Output: ['Deep Learning/AI', 'High-Positive-Sentiment']
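
Because every output has its own sigmoid, a flat 0.5 threshold can flag both sentiment classes or neither. A hedged alternative is to threshold the topic scores but take the higher of the two sentiment scores. The sketch below assumes the first 10 outputs are topics and the last 2 are sentiment classes, which should be checked against `model.config.id2label`; the function name `decode` and the placeholder label names are hypothetical.

```python
from typing import Dict, List


def decode(probs: List[float], id2label: Dict[int, str],
           num_topics: int = 10, threshold: float = 0.5) -> Dict[str, object]:
    """Threshold the topic scores and pick the single best sentiment class."""
    topic_probs = probs[:num_topics]
    sentiment_probs = probs[num_topics:]
    topics = [id2label[i] for i, p in enumerate(topic_probs) if p > threshold]
    if not topics:
        # Fall back to the highest-scoring topic when none clears the threshold.
        best_topic = max(range(num_topics), key=lambda i: topic_probs[i])
        topics = [id2label[best_topic]]
    best_sentiment = max(range(len(sentiment_probs)), key=lambda i: sentiment_probs[i])
    return {"topics": topics, "sentiment": id2label[num_topics + best_sentiment]}


# Placeholder label map and scores; with the real model, use
# model.config.id2label and probs[0].tolist() instead.
id2label = {i: f"topic_{i}" for i in range(10)}
id2label[10] = "High-Positive-Sentiment"
id2label[11] = "Low-Negative-Sentiment"  # hypothetical name
scores = [0.1, 0.2, 0.05, 0.9, 0.3, 0.1, 0.2, 0.4, 0.1, 0.05, 0.7, 0.6]

result = decode(scores, id2label)
print(result)
```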
