--- title: Arxiv Paper Classifier emoji: 🌍 colorFrom: purple colorTo: indigo sdk: docker pinned: false --- # arXiv Paper Classification A machine learning application that predicts arXiv categories for academic papers based on their title and abstract. This tool uses a fine-tuned SciBERT model to classify papers into arXiv subject categories. This task is completed as homework for YSDA ML 2 course I personally hate jupyter-notebooks, so as a proof that i conducted experiments i made Comet ML logger project public. Latest training logs, configs and other details can be found here https://www.comet.com/adenshulga/arxiv-papers-classification/ef1256f1d4eb4b588da881366eb27578?compareXAxis=step&experiment-tab=panels&showOutliers=true&smoothing=0&xAxis=step ## Installation There are two relatively close dockerfile configurations. container_setup folder contains scripts and dockerfile to setup interactive developmpent environment. Dockerfile in the root is for deploying a StreamlitApp. ### Streamlit App Setup 1. Clone the repository: ```bash git clone https://github.com/adenshulga/arxiv-paper-classification.git cd arxiv-paper-classification ``` 2. Give permissions for executable scripts: ``` chmod +x scripts/pipeline.sh scripts/launch_app.sh ``` 3. Build and launch docker container: ``` docker build -t arxiv-paper-clf . docker run -p 9001:9001 arxiv-paper-clf ``` ### Configuration You can modify the inference settings in `config/inference_config.py`: - `model_name`: Base model name from Hugging Face - `checkpoint_path`: Path to fine-tuned model checkpoint - `top_percent`: Cumulative score threshold for showing predictions - `minimal_score`: Minimum confidence score to display ## Development and model Training To enter development environment 1. Fill container_setup/credentials file 2. Give executable permissions to build and launch scripts: ``` chmod +x container_setup/build.sh container_setup/launch_script.sh ``` 3. Specify resources constrains in ./container_setup/launch_container.sh 4. Build and launch docker container ``` ./container_setup/build.sh ./container_setup/launch_container.sh ``` 5. Attach to running container ``` docker attach ``` 6. Install the dependencies ``` uv venv uv sync ``` To train the model: 1. Load and unzip the arxiv dataset in the `data` folder(https://www.kaggle.com/datasets/neelshah18/arxivdataset) 2. Configure the process in config/pipeline_config.py Run the training script: ``` scripts/pipeline.sh ```