
title: Arxiv Paper Classifier
emoji: 🌍
colorFrom: purple
colorTo: indigo
sdk: docker
pinned: false

arXiv Paper Classification

A machine learning application that predicts arXiv categories for academic papers from their title and abstract. It uses a fine-tuned SciBERT model to classify papers into arXiv subject categories. The project was completed as homework for the YSDA ML 2 course.

I personally hate Jupyter notebooks, so as proof that I conducted the experiments, I made the Comet ML project public.

The latest training logs, configs, and other details can be found here: https://www.comet.com/adenshulga/arxiv-papers-classification/ef1256f1d4eb4b588da881366eb27578?compareXAxis=step&experiment-tab=panels&showOutliers=true&smoothing=0&xAxis=step

Installation

There are two closely related Docker configurations. The container_setup folder contains the scripts and Dockerfile for setting up an interactive development environment. The Dockerfile in the repository root is for deploying the Streamlit app.

Streamlit App Setup

Clone the repository:

```
git clone https://github.com/adenshulga/arxiv-paper-classification.git
cd arxiv-paper-classification
```

Make the scripts executable:

```
chmod +x scripts/pipeline.sh scripts/launch_app.sh
```

Build and launch the Docker container:

```
docker build -t arxiv-paper-clf .
docker run -p 9001:9001 arxiv-paper-clf
```
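
The `-p 9001:9001` mapping publishes the container's port 9001 on the host, so the app should become reachable on host port 9001 once the container has started. A minimal sketch for checking this from Python, assuming the container runs on the local machine; `wait_for_port` is a hypothetical helper and not part of the repository:

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Example usage (assumes the container from the step above is running locally):
#   if wait_for_port("localhost", 9001):
#       print("App is reachable at http://localhost:9001")
```

This only confirms that something accepts TCP connections on the port; it does not verify that the model has finished loading.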

Configuration

You can modify the inference settings in config/inference_config.py:

model_name: Base model name from Hugging Face
checkpoint_path: Path to fine-tuned model checkpoint
top_percent: Cumulative score threshold for showing predictions
minimal_score: Minimum confidence score to display
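
To illustrate how `top_percent` and `minimal_score` could interact, here is a minimal sketch. It assumes the model outputs one probability per category and that predictions are shown in descending order until their cumulative score reaches `top_percent`, skipping anything below `minimal_score`. The field names come from the list above; the default values, the `InferenceConfig` class, and the `select_predictions` function are assumptions made for illustration and may differ from the actual code in config/inference_config.py.

```python
from dataclasses import dataclass


@dataclass
class InferenceConfig:
    # Field names follow the README; all default values are placeholders
    # chosen for this sketch, not the project's real settings.
    model_name: str = "allenai/scibert_scivocab_uncased"  # assumed SciBERT variant
    checkpoint_path: str = "checkpoints/model.pt"  # hypothetical path
    top_percent: float = 0.95
    minimal_score: float = 0.01


def select_predictions(scores: dict[str, float], config: InferenceConfig) -> list[tuple[str, float]]:
    """Pick the categories to display from a category -> probability mapping."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    selected = []
    cumulative = 0.0
    for category, score in ranked:
        if score < config.minimal_score:
            break  # every remaining score is lower still
        selected.append((category, score))
        cumulative += score
        if cumulative >= config.top_percent:
            break  # enough probability mass is covered
    return selected
```

Under these assumptions, raising `minimal_score` hides low-confidence categories, while lowering `top_percent` shortens the list for papers whose probability mass is spread over many categories.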

Development and Model Training

To enter the development environment:

Fill in the container_setup/credentials file.

Make the build and launch scripts executable:

```
chmod +x container_setup/build.sh container_setup/launch_container.sh
```

Specify resource constraints in ./container_setup/launch_container.sh.

Build and launch the Docker container:

```
./container_setup/build.sh
./container_setup/launch_container.sh
```

Attach to the running container:
```
docker attach <container-id>
```
Install the dependencies:
```
uv venv
uv sync
```

To train the model:

Download the arXiv dataset (https://www.kaggle.com/datasets/neelshah18/arxivdataset) and unzip it into the data folder.
Configure the process in config/pipeline_config.py

Run the training script:

```
scripts/pipeline.sh
```
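
Before launching the pipeline, it can help to sanity-check the unzipped dataset. The sketch below is not taken from the repository. It assumes the dataset is a single JSON file containing a list of records with `title`, `summary`, and `tag` fields, where `tag` is a stringified Python list of dicts that each carry a `term` key holding the category. Check these assumptions against the file you downloaded; the path in the usage comment is also hypothetical.

```python
import ast
import json
from collections import Counter


def extract_categories(record: dict) -> list[str]:
    # Assumption: "tag" is a string such as "[{'term': 'cs.AI', ...}, ...]".
    tags = ast.literal_eval(record["tag"])
    return [tag["term"] for tag in tags]


def load_examples(path: str) -> list[dict]:
    """Read the JSON file and pair each paper's text with its category labels."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    return [
        {
            # Title and abstract are joined into one input text (an assumption).
            "text": f"{record['title']}\n{record['summary']}",
            "labels": extract_categories(record),
        }
        for record in records
    ]


def count_categories(examples: list[dict]) -> Counter:
    """Count how often each category appears, to spot rare or missing classes."""
    return Counter(label for example in examples for label in example["labels"])


# Example usage (hypothetical file name inside the data folder):
#   examples = load_examples("data/arxivData.json")
#   print(count_categories(examples).most_common(10))
```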