title: Arxiv Paper Classifier
emoji: π
colorFrom: purple
colorTo: indigo
sdk: docker
pinned: false
arXiv Paper Classification
A machine learning application that predicts arXiv categories for academic papers based on their title and abstract. The tool uses a fine-tuned SciBERT model to classify papers into arXiv subject categories. It was completed as homework for the YSDA ML 2 course.
I personally hate Jupyter notebooks, so as proof that I conducted the experiments, I made the Comet ML logger project public.
The latest training logs, configs, and other details can be found here: https://www.comet.com/adenshulga/arxiv-papers-classification/ef1256f1d4eb4b588da881366eb27578?compareXAxis=step&experiment-tab=panels&showOutliers=true&smoothing=0&xAxis=step
Installation
There are two closely related Dockerfile configurations. The container_setup folder contains the scripts and Dockerfile for setting up an interactive development environment. The Dockerfile in the repository root is for deploying the Streamlit app.
Streamlit App Setup
Clone the repository:
git clone https://github.com/adenshulga/arxiv-paper-classification.git
cd arxiv-paper-classification
Give permissions for executable scripts:
chmod +x scripts/pipeline.sh scripts/launch_app.sh
Build and launch docker container:
docker build -t arxiv-paper-clf .
docker run -p 9001:9001 arxiv-paper-clf
Configuration
You can modify the inference settings in config/inference_config.py:
- model_name: Base model name from Hugging Face
- checkpoint_path: Path to fine-tuned model checkpoint
- top_percent: Cumulative score threshold for showing predictions
- minimal_score: Minimum confidence score to display
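To illustrate how top_percent and minimal_score could interact, here is a minimal sketch. It is not the code from config/inference_config.py: the InferenceConfig dataclass, its default values, the select_predictions helper, and the example scores are all assumptions made for illustration. It keeps the highest-scoring categories until their cumulative score reaches top_percent and never shows a category scoring below minimal_score.

```python
from dataclasses import dataclass


@dataclass
class InferenceConfig:
    # All defaults are illustrative assumptions, not the repository's values.
    model_name: str = "allenai/scibert_scivocab_uncased"  # assumed SciBERT model ID
    checkpoint_path: str = "checkpoints/model.pt"  # hypothetical path
    top_percent: float = 0.95
    minimal_score: float = 0.05


def select_predictions(scores, config):
    """Return (category, score) pairs to display, highest score first.

    Categories are added until their cumulative score reaches
    config.top_percent; categories below config.minimal_score are skipped.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    selected = []
    cumulative = 0.0
    for category, score in ranked:
        if score < config.minimal_score:
            break  # ranked is sorted, so all remaining scores are lower too
        selected.append((category, score))
        cumulative += score
        if cumulative >= config.top_percent:
            break
    return selected


# Made-up probabilities for a single paper.
example_scores = {"cs.LG": 0.60, "stat.ML": 0.25, "cs.CL": 0.12, "math.OC": 0.03}
shown = select_predictions(example_scores, InferenceConfig())
print(shown)
```

With these made-up scores, the first three categories are shown: the third one pushes the cumulative score past 0.95, and the fourth would fall below minimal_score anyway. Raising minimal_score shortens the list even when the cumulative threshold has not been reached.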
Development and Model Training
To enter the development environment:
Fill in the container_setup/credentials file.
Give executable permissions to the build and launch scripts:
chmod +x container_setup/build.sh container_setup/launch_container.sh
Specify resource constraints in ./container_setup/launch_container.sh.
Build and launch the Docker container:
./container_setup/build.sh
./container_setup/launch_container.sh
Attach to the running container:
docker attach <container-id>
Install the dependencies:
uv venv
uv sync
To train the model:
- Download and unzip the arXiv dataset (https://www.kaggle.com/datasets/neelshah18/arxivdataset) into the data folder
- Configure the process in config/pipeline_config.py
Run the training script:
scripts/pipeline.sh
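Before running the pipeline, it can help to check how a dataset record maps to a model input and its labels. The sketch below is not part of the repository. It assumes each record has title, summary, and tag fields, and that tag is a string containing a Python-style list of dicts with a term key; verify this against the file you downloaded. The sample records and the helper names extract_categories and to_example are hypothetical.

```python
import ast
from collections import Counter

# Hypothetical records mimicking the assumed layout of the dataset:
# "tag" is a string that holds a Python-style list of dicts.
sample_records = [
    {
        "title": "An example title",
        "summary": "An example abstract.",
        "tag": "[{'term': 'cs.LG', 'label': None}, {'term': 'stat.ML', 'label': None}]",
    },
    {
        "title": "Another example title",
        "summary": "Another example abstract.",
        "tag": "[{'term': 'cs.LG', 'label': None}]",
    },
]


def extract_categories(record):
    """Parse the stringified tag list and return the category codes."""
    tags = ast.literal_eval(record["tag"])  # safe parsing of Python literals
    return [tag["term"] for tag in tags]


def to_example(record):
    """Join title and abstract into one text and pair it with its labels."""
    text = f"{record['title'].strip()}\n{record['summary'].strip()}"
    return text, extract_categories(record)


# Count how often each category occurs, e.g. to spot rare labels.
label_counts = Counter(
    category for record in sample_records for category in extract_categories(record)
)
print(to_example(sample_records[0]))
print(label_counts.most_common())
```

A paper can carry several categories, so each example is paired with a list of labels rather than a single one.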