Minibase committed on
Commit 2adb798 · verified · 1 Parent(s): cdea8ff

Upload USAGE.md with huggingface_hub

  1. USAGE.md +148 -0
USAGE.md ADDED
@@ -0,0 +1,148 @@
+ # Usage Examples - Detoxify-Small
+
+ ## Basic Usage
+
+ ### 1. Start the Server
+ ```bash
+ ./run_server.sh
+ ```
+
+ ### 2. Check Server Health
+ ```bash
+ curl http://127.0.0.1:8000/health
+ ```
+
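The health check above can also be scripted so that a client waits for the server to finish loading before sending requests. The following is a minimal sketch, not part of the committed file: the helper names `wait_until_healthy` and `http_health_check`, the retry count, and the delay are all illustrative assumptions, and it assumes the endpoint answers with HTTP 200 once the server is ready.

```python
import time
import urllib.request
from typing import Callable


def wait_until_healthy(check: Callable[[], bool], attempts: int = 10, delay: float = 0.5) -> bool:
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        try:
            if check():
                return True
        except OSError:
            # Connection refused or an HTTP error status: the server is
            # probably still starting, so fall through and retry.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


def http_health_check(url: str = "http://127.0.0.1:8000/health") -> bool:
    """Hypothetical probe: True when the health endpoint answers with HTTP 200."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.status == 200
```

Calling `wait_until_healthy(http_health_check)` right after `./run_server.sh` would block until the endpoint responds or the attempts run out.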
+ ### 3. Simple Completion
+ ```bash
+ curl -X POST http://127.0.0.1:8000/completion \
+   -H "Content-Type: application/json" \
+   -d '{
+     "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is terrible!\n\nResponse: ",
+     "max_tokens": 100,
+     "temperature": 0.7
+   }'
+ ```
+
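Every request in this file uses the same Instruction / Input / Response prompt layout, so it can help to build it in one place. This is a small sketch under that assumption; `PROMPT_TEMPLATE`, `build_prompt`, and `build_payload` are hypothetical names, and the default values simply mirror the curl example above.

```python
PROMPT_TEMPLATE = (
    "Instruction: Rewrite the provided text to remove the toxicity.\n\n"
    "Input: {text}\n\n"
    "Response: "
)


def build_prompt(text: str) -> str:
    """Wrap the user's text in the prompt layout used throughout this file."""
    return PROMPT_TEMPLATE.format(text=text.strip())


def build_payload(text: str, max_tokens: int = 100, temperature: float = 0.7) -> dict:
    """Assemble the JSON body sent to the /completion endpoint."""
    return {
        "prompt": build_prompt(text),
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
```

Passing the result of `build_payload("This is terrible!")` through `json.dumps` should give a body equivalent to the one in the curl command.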
+ ### 4. Streaming Response
+ ```bash
+ curl -X POST http://127.0.0.1:8000/completion \
+   -H "Content-Type: application/json" \
+   -d '{
+     "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This sucks so bad!\n\nResponse: ",
+     "max_tokens": 500,
+     "temperature": 0.8,
+     "stream": true
+   }'
+ ```
+
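With `"stream": true` the response arrives in pieces rather than as one JSON object. The sketch below shows one way a client might reassemble the text. It assumes the server emits server-sent-event lines of the form `data: {...}`, where each JSON object carries a `content` fragment and a `stop` flag; that wire format is an assumption, not something this file states, and `iter_stream_content` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_stream_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from an iterable of already-decoded stream lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other event fields
        data = line[len("data:"):].strip()
        if not data or data == "[DONE]":
            continue  # defensive: some servers send a terminal marker
        event = json.loads(data)
        piece = event.get("content", "")
        if piece:
            yield piece
        if event.get("stop"):
            break  # assumed end-of-generation flag
```

With the `requests` library, the lines could come from `requests.post(url, json=payload, stream=True).iter_lines(decode_unicode=True)`, and each fragment can be printed as it arrives.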
+ ## Advanced Configuration
+
+ ### Custom Server Settings
+ ```bash
+ llama-server \
+   -m model.gguf \
+   --host 127.0.0.1 \
+   --port 8000 \
+   --n-gpu-layers 35 \
+   --ctx-size 4096 \
+   --threads 8 \
+   --chat-template "" \
+   --log-disable
+ ```
+
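When the server is launched from a script, building the argument list in code makes it easier to vary the port or the number of GPU layers. The following is a rough sketch: `build_server_command` is a hypothetical helper, it uses only flags that appear in the command above, and it leaves out `--chat-template` and `--log-disable` for brevity.

```python
import shlex
from typing import List


def build_server_command(
    model_path: str = "model.gguf",
    host: str = "127.0.0.1",
    port: int = 8000,
    n_gpu_layers: int = 35,
    ctx_size: int = 4096,
    threads: int = 8,
) -> List[str]:
    """Return an argument list suitable for subprocess.Popen or subprocess.run."""
    return [
        "llama-server",
        "-m", model_path,
        "--host", host,
        "--port", str(port),
        "--n-gpu-layers", str(n_gpu_layers),
        "--ctx-size", str(ctx_size),
        "--threads", str(threads),
    ]


# Example: a CPU-only instance on an alternative port.
command = build_server_command(port=8001, n_gpu_layers=0)
print(shlex.join(command))
```

Passing the list (rather than a single string) to `subprocess` avoids shell quoting issues with model paths that contain spaces.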
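+ ### GPU Acceleration (macOS with Metal)
+ ```bash
+ # Requires a llama-server build with Metal support. The GPU backend is
+ # selected at build time, so no runtime flag is passed; layers are
+ # offloaded with --n-gpu-layers.
+ llama-server \
+   -m model.gguf \
+   --host 127.0.0.1 \
+   --port 8000 \
+   --n-gpu-layers 50
+ ```
+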
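+ ### GPU Acceleration (Linux/Windows with CUDA)
+ ```bash
+ # Requires a llama-server build with CUDA support. The GPU backend is
+ # selected at build time, so no runtime flag is passed; layers are
+ # offloaded with --n-gpu-layers.
+ llama-server \
+   -m model.gguf \
+   --host 127.0.0.1 \
+   --port 8000 \
+   --n-gpu-layers 50
+ ```
+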
+ ## Python Client Example
+
+ ```python
+ import requests
+
+ def complete_with_model(prompt, max_tokens=200, temperature=0.7):
+     """Send a completion request to the local server and return the generated text."""
+     url = "http://127.0.0.1:8000/completion"
+
+     payload = {
+         "prompt": prompt,
+         "max_tokens": max_tokens,
+         "temperature": temperature
+     }
+
+     # json= serializes the payload and sets the Content-Type header
+     response = requests.post(url, json=payload, timeout=60)
+
+     if response.status_code == 200:
+         result = response.json()
+         return result['content']
+     else:
+         return f"Error: {response.status_code}"
+
+ # Example usage
+ prompt = "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is awful!\n\nResponse: "
+ response = complete_with_model(prompt)
+ print(response)
+ ```
+
+ ## Troubleshooting
+
+ ### Common Issues
+
+ 1. **Memory Errors**
+    ```
+    Error: not enough memory
+    ```
+    **Solution**: Reduce `--n-gpu-layers` to a smaller value, or set it to 0 to run entirely on the CPU
+
+ 2. **Context Window Too Large**
+    ```
+    Error: context size exceeded
+    ```
+    **Solution**: Reduce `--ctx-size` (e.g., `--ctx-size 2048`)
+
+ 3. **CUDA Not Available**
+    ```
+    Error: CUDA not found
+    ```
+    **Solution**: Install the CUDA drivers and use a CUDA-enabled build of `llama-server`, or set `--n-gpu-layers 0` to run on the CPU
+
+ 4. **Port Already in Use**
+    ```
+    Error: bind failed
+    ```
+    **Solution**: Use a different port, e.g. `--port 8001`
+
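For the "Port Already in Use" case, a launcher script can probe for a free port before starting the server. This is a minimal sketch using only the standard library; `port_is_free` and `first_free_port` are hypothetical helpers, and the result is only a hint, since another process could claim the port between the check and the server start.

```python
import socket
from typing import Optional


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False  # typically "address already in use"
        return True


def first_free_port(start: int = 8000, tries: int = 20, host: str = "127.0.0.1") -> Optional[int]:
    """Scan upward from `start` and return the first bindable port, or None."""
    for port in range(start, start + tries):
        if port_is_free(port, host):
            return port
    return None
```

The value returned by `first_free_port()` can then be passed to `--port` when launching the server.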
+ ### Performance Tuning
+
+ - **For faster inference**: Increase `--n-gpu-layers`
+ - **For lower latency**: Reduce `--ctx-size`
+ - **For better quality**: Lower `--temp` and increase `--top-p`
+ - **For creativity**: Increase `--temp` and adjust `--top-k`
+
+ ### System Requirements
+
+ - **RAM**: 8 GB minimum, 16 GB or more recommended
+ - **GPU**: Optional but recommended for better performance
+ - **Storage**: Model file size + 2x for temporary files
+
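The storage guideline can be turned into a quick pre-flight check. The sketch below reads "model file size + 2x for temporary files" as the model size plus twice that size again, which is an interpretation rather than something the file spells out; the function names and the `temp_factor` parameter are hypothetical.

```python
import shutil


def required_storage_bytes(model_size_bytes: int, temp_factor: float = 2.0) -> int:
    """Model size plus `temp_factor` times the model size for temporary files."""
    return int(model_size_bytes * (1 + temp_factor))


def has_enough_storage(model_size_bytes: int, path: str = ".") -> bool:
    """Compare the estimate against the free space on the volume holding `path`."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_storage_bytes(model_size_bytes)
```

Under this reading, a 1 GB model file would call for roughly 3 GB of free space.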
+ ---
+ Generated on 2025-09-17 20:07:11