LiangJiang committed · Commit 8c325ad · verified · 1 Parent(s): 1e92e83

Update README.md

Files changed (1): README.md (+211 -3)

---
license: mit
---

## Model Downloads

You can download Ring-1T from the following table. If you are located in mainland China, we also provide the model on ModelScope to speed up the download process.

<center>

| **Model** | **Context Length** | **Download** |
| :-------: | :----------------: | :-------------------------------------------------------------------------------------------------------------------------------------------: |
| Ring-1T | 64K -> 128K (YaRN) | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ring-1T) &nbsp;&nbsp; [🤖 ModelScope](https://www.modelscope.cn/models/inclusionAI/Ring-1T) |
| Ring-1T-FP8 | 64K -> 128K (YaRN) | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ring-1T-FP8) &nbsp;&nbsp; [🤖 ModelScope](https://www.modelscope.cn/models/inclusionAI/Ring-1T-FP8) |

</center>

Note: If you are interested in previous versions, please visit the past model collections on [Hugging Face](https://huggingface.co/inclusionAI) or [ModelScope](https://modelscope.cn/organization/inclusionAI).
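
If you prefer to fetch the weights programmatically rather than through the links above, a minimal sketch using `huggingface_hub` (an assumption on our part; any download method works, and the `local_dir` path is just an example) looks like this:

```python
from huggingface_hub import snapshot_download

# Download the full Ring-1T repository to a local directory.
# For the FP8 variant, use repo_id="inclusionAI/Ring-1T-FP8".
local_path = snapshot_download(
    repo_id="inclusionAI/Ring-1T",
    local_dir="./Ring-1T",  # example target directory
)
print(f"Model downloaded to: {local_path}")
```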


## Quickstart

### 🚀 Try Online

**TODO**
You can experience Ring-1T online at: [ZenMux](https://zenmux.ai/inclusionai/ring-1t?utm_source=hf_inclusionAI)

### 🔌 API Usage

You can also use Ring-1T through API calls:

```python
from openai import OpenAI

# 1. Initialize the OpenAI client
client = OpenAI(
    # 2. Point the base URL to the ZenMux endpoint
    base_url="https://zenmux.ai/api/v1",
    # 3. Replace with the API Key from your ZenMux user console
    api_key="<your ZENMUX_API_KEY>",
)

# 4. Make a request
completion = client.chat.completions.create(
    # 5. Specify the model to use in the format "provider/model-name"
    model="inclusionai/ring-1t",
    messages=[
        {
            "role": "user",
            "content": "What is the meaning of life?"
        }
    ]
)

print(completion.choices[0].message.content)
```

### 🤗 Hugging Face Transformers

Here is a code snippet to show you how to use the chat model with `transformers`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ring-1T"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt", return_token_type_ids=False).to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

### 🤖 ModelScope

If you're in mainland China, we strongly recommend using our model from 🤖 <a href="https://modelscope.cn/models/inclusionAI/Ring-1T">ModelScope</a>.
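
As a rough sketch (assuming the `modelscope` Python package is installed; the cache directory is just an example), the weights can be fetched from ModelScope like this:

```python
from modelscope import snapshot_download

# Download the Ring-1T repository from ModelScope into a local cache directory.
model_dir = snapshot_download("inclusionAI/Ring-1T", cache_dir="./modelscope_cache")
print(f"Model downloaded to: {model_dir}")
```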

## Deployment

### vLLM

vLLM supports both offline batched inference and launching an OpenAI-compatible API server for online inference.

#### Environment Preparation

```bash
pip install vllm==0.11.0
```

#### Offline Inference:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ring-1T")

sampling_params = SamplingParams(temperature=1.2, top_p=0.8, repetition_penalty=1.0, max_tokens=65536)

llm = LLM(model="inclusionAI/Ring-1T", dtype='bfloat16', trust_remote_code=True)
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
outputs = llm.generate([text], sampling_params)

# Print the generated completion for the single prompt.
print(outputs[0].outputs[0].text)
```

#### Online Inference:

```bash
vllm serve inclusionAI/Ring-1T \
    --tensor-parallel-size 32 \
    --pipeline-parallel-size 1 \
    --trust-remote-code \
    --gpu-memory-utilization 0.90

# This is only an example, please adjust arguments according to your actual environment.
```
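
Once the server is up, you can send requests to its OpenAI-compatible endpoint. A minimal sketch (assuming the server is listening on the default port 8000):

```bash
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "inclusionAI/Ring-1T",
        "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]
    }'
```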

To handle long context in vLLM using YaRN, you need to follow these two steps:
1. Add a `rope_scaling` field to the model's `config.json` file, for example:
```json
{
    ...,
    "rope_scaling": {
        "factor": 2.0,
        "original_max_position_embeddings": 65536,
        "type": "yarn"
    }
}
```
2. Use an additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service, as sketched below.
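
For instance, after adding the `rope_scaling` field, a 128K-context launch might look like the following sketch (the parallelism settings are illustrative; `--max-model-len` is the only addition relevant to this step):

```bash
vllm serve inclusionAI/Ring-1T \
    --tensor-parallel-size 32 \
    --trust-remote-code \
    --max-model-len 131072
```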

For detailed guidance, please refer to the [vLLM documentation](https://docs.vllm.ai/en/latest/).


### SGLang

#### Environment Preparation

We will submit our model to the official SGLang release later. For now, you can prepare the environment with the following steps:
```shell
pip3 install sglang==0.5.2rc0 sgl-kernel==0.3.7.post1
```
You can use the Docker image as well:
```shell
docker pull lmsysorg/sglang:v0.5.2rc0-cu126
```
Then you should apply our patch to the sglang installation:
```bash
# The `patch` utility is required; run `yum install -y patch` if it is missing.
patch -d `python -c 'import sglang;import os; print(os.path.dirname(sglang.__file__))'` -p3 < inference/sglang/bailing_moe_v2.patch
```

#### Run Inference

Both BF16 and FP8 models are supported by SGLang; which one is used depends on the dtype of the model in ${MODEL_PATH}. Both share the same launch command:

- Start server:
```bash
python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 0.0.0.0 --port $PORT \
    --trust-remote-code \
    --attention-backend fa3

# This is only an example, please adjust arguments according to your actual environment.
```
MTP is supported for the base model, but not yet for the chat model. To enable it, add the parameter `--speculative-algorithm NEXTN` to the start command, as in the sketch below.
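
A minimal sketch of the MTP-enabled launch (base model only; all other arguments are the same illustrative values used above):

```bash
python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 0.0.0.0 --port $PORT \
    --trust-remote-code \
    --attention-backend fa3 \
    --speculative-algorithm NEXTN
```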

- Client:

```shell
curl -s http://localhost:${PORT}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
```
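
Since the server exposes an OpenAI-compatible endpoint, you can also query it with the `openai` Python client. A minimal sketch (assuming the same host and `$PORT` as above; the API key is a placeholder, since a default local SGLang deployment does not enforce one):

```python
import os
from openai import OpenAI

# Point the client at the local SGLang server's OpenAI-compatible endpoint.
client = OpenAI(
    base_url=f"http://localhost:{os.environ['PORT']}/v1",
    api_key="EMPTY",  # placeholder; not checked by a default local deployment
)

completion = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(completion.choices[0].message.content)
```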

More usage examples can be found [here](https://docs.sglang.ai/basic_usage/send_request.html).