Update README.md
README.md
CHANGED
@@ -7,4 +7,83 @@ sdk: docker
pinned: false
---

# arXiv Paper Classification

A machine learning application that predicts arXiv subject categories for academic papers from their title and abstract, using a fine-tuned SciBERT model. It was completed as homework for the YSDA ML 2 course.

I personally hate Jupyter notebooks, so as proof that I conducted the experiments, I made the Comet ML logger project public.

The latest training logs, configs, and other details can be found here: https://www.comet.com/adenshulga/arxiv-papers-classification/ef1256f1d4eb4b588da881366eb27578?compareXAxis=step&experiment-tab=panels&showOutliers=true&smoothing=0&xAxis=step

## Installation

There are two similar Dockerfile configurations. The `container_setup` folder contains the scripts and Dockerfile for setting up an interactive development environment. The Dockerfile in the repository root is for deploying the Streamlit app.

### Streamlit App Setup

1. Clone the repository:
```bash
git clone https://github.com/adenshulga/arxiv-paper-classification.git
cd arxiv-paper-classification
```

2. Make the scripts executable:
```bash
chmod +x scripts/pipeline.sh scripts/launch_app.sh
```

3. Build and launch the Docker container:
```bash
docker build -t arxiv-paper-clf .
docker run -p 9001:9001 arxiv-paper-clf
```

### Configuration

You can modify the inference settings in `config/inference_config.py`:

- `model_name`: Name of the base model on Hugging Face
- `checkpoint_path`: Path to the fine-tuned model checkpoint
- `top_percent`: Cumulative score threshold; predictions are shown until their combined score reaches this value
- `minimal_score`: Minimum confidence score a prediction needs to be displayed

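As an illustration of how `top_percent` and `minimal_score` could interact, here is a minimal sketch. It is not the project's actual code: the `InferenceConfig` dataclass, the `select_predictions` helper, the default values, and the checkpoint path are all assumptions made for this example. Only the four field names come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceConfig:
    # Field names follow the README; every value here is a placeholder,
    # not the project's real default.
    model_name: str = "allenai/scibert_scivocab_uncased"
    checkpoint_path: str = "checkpoints/model"  # hypothetical path
    top_percent: float = 0.95
    minimal_score: float = 0.05


def select_predictions(
    scores: dict[str, float], config: InferenceConfig
) -> list[tuple[str, float]]:
    """Keep the highest-scoring categories until their cumulative score
    reaches `top_percent`, skipping anything below `minimal_score`."""
    selected: list[tuple[str, float]] = []
    cumulative = 0.0
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for label, score in ranked:
        if score < config.minimal_score:
            break  # scores are sorted, so all remaining ones are lower too
        selected.append((label, score))
        cumulative += score
        if cumulative >= config.top_percent:
            break
    return selected


# Hypothetical per-category probabilities for one paper.
example_scores = {"cs.LG": 0.60, "stat.ML": 0.25, "cs.AI": 0.11, "math.OC": 0.04}
for label, score in select_predictions(example_scores, InferenceConfig()):
    print(f"{label}: {score:.2f}")
```

With these placeholder values, the first three categories are kept because together they cross the cumulative threshold, and the last one is never reached. Raising `minimal_score` shortens the list even when the cumulative threshold has not been met.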
## Development and Model Training

To enter the development environment:

1. Fill in the `container_setup/credentials` file.

2. Make the build and launch scripts executable:
```bash
chmod +x container_setup/build.sh container_setup/launch_container.sh
```

3. Specify resource constraints in `./container_setup/launch_container.sh`.

4. Build and launch the Docker container:
```bash
./container_setup/build.sh
./container_setup/launch_container.sh
```

5. Attach to the running container:
```bash
docker attach <container-id>
```

6. Install the dependencies:
```bash
uv venv
uv sync
```

To train the model:

1. Download the arXiv dataset (https://www.kaggle.com/datasets/neelshah18/arxivdataset) and unzip it into the `data` folder.
2. Configure the pipeline in `config/pipeline_config.py`.
3. Run the training script:
```bash
scripts/pipeline.sh
```
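
Before training, it can help to check what one dataset record looks like once it is turned into a model input. The sketch below is not part of the repository. It assumes each record has `title`, `summary`, and `tag` fields, and that `tag` is a stringified list of dictionaries whose `term` key holds the arXiv category code. The helper names (`load_records`, `extract_categories`, `build_text`) are hypothetical, so verify the field names against the downloaded file.

```python
import ast
import json
from pathlib import Path


def load_records(path: Path) -> list[dict]:
    """Read the unzipped dataset file (assumed to be a JSON list of records)."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def extract_categories(record: dict) -> list[str]:
    """Parse the `tag` field, assumed to be a stringified list of dicts."""
    tags = ast.literal_eval(record["tag"])
    return [tag["term"] for tag in tags]


def build_text(record: dict) -> str:
    """Join title and abstract into one input string, collapsing whitespace."""
    title = " ".join(record["title"].split())
    abstract = " ".join(record["summary"].split())
    return f"{title}\n{abstract}"


# Made-up record in the assumed format, used here instead of the real file.
sample_record = {
    "title": "A Toy Paper on\n Gradient Descent",
    "summary": "We study a toy problem.\nResults are illustrative only.",
    "tag": "[{'term': 'cs.LG', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None},"
           " {'term': 'stat.ML', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}]",
}

print(extract_categories(sample_record))
print(build_text(sample_record))
```

For the real data, pass the path of the unzipped file inside `data` to `load_records` and apply the same two helpers to each record.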