adenshulga committed on
Commit
448cc45
·
verified ·
1 Parent(s): c106aec

Update README.md

  1. README.md +80 -1
README.md CHANGED
@@ -7,4 +7,83 @@ sdk: docker
  pinned: false
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # arXiv Paper Classification
+
+ A machine learning application that predicts arXiv categories for academic papers from their title and abstract. It uses a fine-tuned SciBERT model to classify papers into arXiv subject categories. The project was completed as homework for the YSDA ML 2 course.
+
+ I personally hate Jupyter notebooks, so as proof that I ran the experiments, I made the Comet ML logger project public.
+
+ The latest training logs, configs, and other details can be found here: https://www.comet.com/adenshulga/arxiv-papers-classification/ef1256f1d4eb4b588da881366eb27578?compareXAxis=step&experiment-tab=panels&showOutliers=true&smoothing=0&xAxis=step
+
+ ## Installation
+
+ There are two closely related Dockerfile configurations. The `container_setup` folder contains the scripts and Dockerfile for setting up an interactive development environment. The Dockerfile in the repository root is for deploying the Streamlit app.
+
+ ### Streamlit App Setup
+
+ 1. Clone the repository:
+ ```bash
+ git clone https://github.com/adenshulga/arxiv-paper-classification.git
+ cd arxiv-paper-classification
+ ```
+
+ 2. Give execute permissions to the scripts:
+ ```bash
+ chmod +x scripts/pipeline.sh scripts/launch_app.sh
+ ```
+
+ 3. Build and launch the Docker container:
+ ```bash
+ docker build -t arxiv-paper-clf .
+ docker run -p 9001:9001 arxiv-paper-clf
+ ```
+
+ ### Configuration
+
+ You can modify the inference settings in `config/inference_config.py`:
+
+ - `model_name`: Base model name from Hugging Face
+ - `checkpoint_path`: Path to fine-tuned model checkpoint
+ - `top_percent`: Cumulative score threshold for showing predictions
+ - `minimal_score`: Minimum confidence score to display
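The README does not spell out how `top_percent` and `minimal_score` interact. One plausible reading, sketched here as an assumption rather than the repository's actual code, is that categories are shown in descending score order until their cumulative score reaches `top_percent`, and any category scoring below `minimal_score` is hidden. The `InferenceConfig` dataclass, its default values (including the SciBERT model ID and the checkpoint path), and the `select_predictions` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class InferenceConfig:
    # All defaults below are illustrative assumptions, not the project's values.
    model_name: str = "allenai/scibert_scivocab_uncased"  # assumed base model ID
    checkpoint_path: str = "checkpoints/model.pt"  # hypothetical path
    top_percent: float = 0.95  # stop once cumulative score reaches this value
    minimal_score: float = 0.01  # hide categories scoring below this value


def select_predictions(scores, config):
    """Return (category, score) pairs to display, highest score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    selected = []
    cumulative = 0.0
    for category, score in ranked:
        if score < config.minimal_score:
            break  # every remaining category scores even lower
        selected.append((category, score))
        cumulative += score
        if cumulative >= config.top_percent:
            break  # enough of the total score is already covered
    return selected


# Made-up scores for a single paper, used only to exercise the helper.
example_scores = {"cs.LG": 0.6, "stat.ML": 0.25, "cs.AI": 0.1, "cs.CV": 0.05}
config = InferenceConfig(top_percent=0.8, minimal_score=0.1)
for category, score in select_predictions(example_scores, config):
    print(f"{category}: {score:.2f}")
```

Under this reading, raising `top_percent` shows more low-ranked categories, while `minimal_score` acts as a hard floor regardless of the cumulative total.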
+
+ ## Development and Model Training
+
+ To enter the development environment:
+ 1. Fill in the `container_setup/credentials` file.
+
+ 2. Give execute permissions to the build and launch scripts:
+ ```bash
+ chmod +x container_setup/build.sh container_setup/launch_container.sh
+ ```
+
+ 3. Specify resource constraints in `./container_setup/launch_container.sh`.
+
+ 4. Build and launch the Docker container:
+ ```bash
+ ./container_setup/build.sh
+ ./container_setup/launch_container.sh
+ ```
+
+ 5. Attach to the running container:
+ ```bash
+ docker attach <container-id>
+ ```
+
+ 6. Install the dependencies:
+ ```bash
+ uv venv
+ uv sync
+ ```
+
+ To train the model:
+
+ 1. Download and unzip the arXiv dataset into the `data` folder (https://www.kaggle.com/datasets/neelshah18/arxivdataset).
+ 2. Configure the pipeline in `config/pipeline_config.py`.
+
+ Run the training script:
+ ```bash
+ scripts/pipeline.sh
+ ```